Where to find Perl, Python, etc.
  • Mac OS X - Perl and Python are built in
  • www.activestate.com for Windows versions or multi-platform IDE (including OS X)
  • Very fast, highly extensible scripting languages
Case Study: Extract data from HTML course
  • Challenges:
  • Thousands of screens of course data written in buggy pre-millennial HTML
  • Fails on modern browsers, slow on others
  • Important data hidden in invisible <DIV>s
Case Study: Extract data from HTML course - Solution Part 1
  • Manually analyze sample pages, find patterns
  • Have Perl analyze patterns coursewide
  • Very useful for identifying extraneous information that is generally the same throughout the course but with subtle differences
  • (i.e., if all pages have similar but not identical header/footer/sidebar information, Perl is a great tool for identifying it and getting rid of it)
  • Manually analyze pages that don’t fit pattern
  • EXAMPLE: defineCourse.pl (course-specific demo at MIT talk)
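
The pattern analysis above can be sketched roughly as follows. This is not defineCourse.pl (which is not shown here) and it is written in Python rather than Perl; the function names, the digit-masking rule, and the 0.9 threshold are all illustrative assumptions. The idea is to normalize each line so that near-identical header/footer lines compare equal, count how many pages each line appears on, and flag the pages that do not fit.

```python
import re
from collections import Counter

def normalize(line):
    """Collapse whitespace and mask digits so near-identical lines compare equal."""
    line = re.sub(r"\s+", " ", line.strip())
    return re.sub(r"\d+", "#", line)

def page_lines(text):
    """Set of normalized, non-blank lines on one page."""
    return {normalize(l) for l in text.splitlines() if l.strip()}

def common_lines(pages, threshold=0.9):
    """Normalized lines found on at least `threshold` of all pages (assumed cutoff)."""
    counts = Counter()
    for text in pages:
        # A set per page, so a line repeated within one page counts once.
        counts.update(page_lines(text))
    cutoff = threshold * len(pages)
    return {line for line, n in counts.items() if n >= cutoff}

def pages_to_review(pages, boilerplate):
    """Indices of pages missing some boilerplate line: candidates for manual analysis."""
    return [i for i, text in enumerate(pages)
            if not boilerplate <= page_lines(text)]
```

Lines returned by `common_lines` are candidates for the extraneous header/footer/sidebar material; pages returned by `pages_to_review` are the ones to analyze by hand.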
Case Study: Extract data from HTML course - Solution Part 2
  • Use Perl to extract all <DIV>s (or <TABLE>s, depending on how the HTML is organized) for hundreds of files at a time
  • Have Perl remove extraneous HTML (comments, etc.)
  • Have Perl remove extraneous items (such as headers/footers/image frames/sidebars/etc.) found in solution part 1
  • Save all <DIV>s to a single HTML file, which you can then load into a browser: one large file that is easy to scroll through quickly for a coursewide overview
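
As a rough illustration of the extraction step, here is a minimal sketch in Python (not the Perl used in the talk) built on the standard library's `html.parser`. The class and function names are hypothetical. It collects the source of each top-level <DIV>, keeps nested tags, and drops HTML comments simply by not handling them. It assumes every <DIV> is eventually closed, which badly broken pages may violate.

```python
from html.parser import HTMLParser

class DivExtractor(HTMLParser):
    """Collect the source text of each top-level <div>, nested tags included."""

    def __init__(self):
        # Keep entities such as &amp; as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.depth = 0    # how many <div>s we are currently inside
        self.parts = []   # pieces of the <div> being collected
        self.divs = []    # finished top-level <div>s

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.depth += 1
        if self.depth:
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through unchanged.
        if self.depth:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if not self.depth:
            return
        self.parts.append(f"</{tag}>")
        if tag == "div":
            self.depth -= 1
            if self.depth == 0:
                self.divs.append("".join(self.parts))
                self.parts = []

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)

    def handle_entityref(self, name):
        if self.depth:
            self.parts.append(f"&{name};")

    def handle_charref(self, name):
        if self.depth:
            self.parts.append(f"&#{name};")

    # handle_comment is deliberately not overridden, so comments are discarded.

def extract_divs(html):
    """Return a list of top-level <div> blocks found in `html`."""
    parser = DivExtractor()
    parser.feed(html)
    parser.close()
    return parser.divs
```

Calling `extract_divs` on each file in turn and writing the joined results to one output file gives the coursewide overview page described above. For pages organized around <TABLE>s, the same approach applies with "table" in place of "div".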
Case Study: Extract data from HTML course - Download
  • Download a basic Perl script to concatenate multiple HTML files into one file: HTML_concat.pl for Mac or Windows
  • Place the script into the folder of HTML files you want to concatenate into one file
  • In your Windows Command prompt (or OS X terminal), cd to that directory, then simply type perl html_concat.pl
  • All the HTML body data from those files will be concatenated into a single file
  • All headers, external style sheets, external scripts, etc. are stripped out
  • The result is saved to a file named "html_concat [current date and time].htm"
  • Look at the script in a text editor for additional options
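
For readers who want to see the general idea without downloading the script, the behavior described above might look roughly like this. It is a Python sketch, not the contents of html_concat.pl. The function names, the timestamp format, the `<hr>` separator, and the UTF-8 decoding are all assumptions.

```python
import re
import time
from pathlib import Path

BODY = re.compile(r"<body\b[^>]*>(.*?)</body\s*>", re.IGNORECASE | re.DOTALL)
SCRIPT_OR_STYLE = re.compile(r"<(script|style)\b.*?</\1\s*>",
                             re.IGNORECASE | re.DOTALL)

def body_of(html):
    """Return the contents of <body> (whole text if none), minus scripts and styles."""
    match = BODY.search(html)
    inner = match.group(1) if match else html
    return SCRIPT_OR_STYLE.sub("", inner).strip()

def concat_html(folder):
    """Join the bodies of all .htm/.html files in `folder` into one new file."""
    folder = Path(folder)
    sources = sorted(
        p for p in folder.iterdir()
        if p.suffix.lower() in {".htm", ".html"}
        and not p.name.startswith("html_concat ")  # skip earlier output files
    )
    # Assumed timestamp format; it avoids ':' because Windows file names forbid it.
    stamp = time.strftime("%Y-%m-%d %H.%M.%S")
    out = folder / f"html_concat {stamp}.htm"
    # Assumed encoding: undecodable bytes are replaced rather than raising an error.
    bodies = [body_of(p.read_text(encoding="utf-8", errors="replace"))
              for p in sources]
    out.write_text("<html><body>\n" + "\n<hr>\n".join(bodies) + "\n</body></html>\n",
                   encoding="utf-8")
    return out
```

Taking only the <BODY> contents is what discards each file's <HEAD>, and with it the titles, linked style sheets, and external scripts.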
Case Study: Extract data from HTML course - Results
  • Will a conversion project from HTML be hellish?
  • Yes.
  • You will, however, enjoy a significantly higher circle of hell than you otherwise would.