Where to find Perl, Python, etc.
- Mac OS X - Perl and Python are built in
- www.activestate.com for Windows versions or multi-platform IDE (including OS X)
- Very fast, highly extensible scripting languages
Case Study: Extract data from HTML course
- Challenges:
- Thousands of screens of course data written in buggy pre-millennial HTML
- Fails on modern browsers, slow on others
- Important data hidden in invisible <DIV>s
Case Study: Extract data from HTML course - Solution Part 1
- Manually analyze sample pages, find patterns
- Have Perl analyze patterns coursewide
- Very useful for identifying extraneous information that is generally the same throughout the course but with subtle differences
- (i.e., if all pages have similar but not identical header/footer/sidebar information, Perl is a great tool for identifying it and getting rid of it)
- Manually analyze pages that don’t fit pattern
- EXAMPLE: defineCourse.pl (course-specific demo at MIT talk)
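The pattern analysis above can be sketched in a few lines. This is not defineCourse.pl, which was specific to the course in the talk; it is a rough Python illustration of the same idea. The function names, the 80% threshold, and the choice to treat digits and spacing as "subtle differences" are all assumptions made for the example.

```python
import re
from collections import Counter

def normalize(line):
    """Collapse whitespace, case, and digits so near-identical lines compare equal."""
    line = re.sub(r"\s+", " ", line.strip().lower())
    return re.sub(r"\d+", "#", line)

def common_lines(pages, threshold=0.8):
    """Return normalized lines that occur in at least `threshold` of the pages."""
    counts = Counter()
    for text in pages:
        # Use a set so a line repeated within one page is counted only once.
        counts.update({normalize(l) for l in text.splitlines() if l.strip()})
    needed = threshold * len(pages)
    return sorted(line for line, n in counts.items() if n >= needed)

def strip_common(text, boilerplate):
    """Drop every line of `text` whose normalized form is in `boilerplate`."""
    keep = [l for l in text.splitlines() if normalize(l) not in boilerplate]
    return "\n".join(keep)
```

Lines reported by `common_lines` are candidates for header/footer/sidebar material. Pages where few of those lines match are the ones worth inspecting by hand.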
Case Study: Extract data from HTML course - Solution Part 2
- Use Perl to extract all <DIV>s (or <TABLE>s, depending on how the HTML is organized) for hundreds of files at a time
- Have Perl remove extraneous HTML (comments, etc.)
- Have Perl remove extraneous items (such as headers/footers/image frames/sidebars/etc.) found in solution part 1
- Save all <DIV>s to a single large HTML file, which you can load into a browser and scroll through quickly for a coursewide overview
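A minimal sketch of the <DIV> extraction step, written in Python with only the standard library rather than in Perl. The class and function names are hypothetical, and it assumes the interesting content lives in top-level <DIV>s (nested ones are kept inside their parent). HTML comments are dropped as a side effect of not handling them.

```python
from html.parser import HTMLParser

class DivExtractor(HTMLParser):
    """Collect the markup of every top-level <div>, dropping HTML comments."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.depth = 0    # current <div> nesting level
        self.parts = []   # pieces of the div currently being collected
        self.divs = []    # finished top-level divs

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.depth += 1
        if self.depth:
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through unchanged.
        if self.depth:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.depth:
            self.parts.append(f"</{tag}>")
        if tag == "div" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.divs.append("".join(self.parts))
                self.parts = []

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)

    def handle_entityref(self, name):
        if self.depth:
            self.parts.append(f"&{name};")

    def handle_charref(self, name):
        if self.depth:
            self.parts.append(f"&#{name};")

    # handle_comment is not overridden, so comments are silently dropped.

def extract_divs(html):
    """Return a list of strings, one per top-level <div> in `html`."""
    parser = DivExtractor()
    parser.feed(html)
    parser.close()
    return parser.divs
```

Joining the returned strings from hundreds of files and writing them to one file gives the single scrollable overview page described above. Invisible <DIV>s keep their original attributes, so hidden content still has to be un-hidden (for example by removing the style attribute) before it shows up in a browser.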
Case Study: Extract data from HTML course - Download
- Download a basic Perl script that concatenates multiple HTML files into one file: HTML_concat.pl for Mac or Windows
- Place the script into the folder of HTML files you want to concatenate into one file
- In your Windows Command prompt (or OS X terminal), cd to that directory, then simply type
perl html_concat.pl
- All the HTML body data from those files will be concatenated into a single file
- All headers, external style sheets, external scripts, etc. are stripped out
- The result is saved to a file named "html_concat [current date and time].htm"
- Look at the script in a text editor for additional options
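For readers who want to see roughly what such a script does, here is a short Python sketch of the same behavior. It is not the downloadable Perl script and does not reproduce its options. The function names, the regular expressions, the per-file <h2> heading, and the exact timestamp format are assumptions made for illustration.

```python
import re
import time
from pathlib import Path

# Contents of <body>...</body>, matched case-insensitively across lines.
BODY_RE = re.compile(r"<body\b[^>]*>(.*?)</body\s*>", re.IGNORECASE | re.DOTALL)
# Inline scripts/styles and <link> tags (external style sheets).
STRIP_RE = re.compile(
    r"<(script|style)\b.*?</\1\s*>|<link\b[^>]*>", re.IGNORECASE | re.DOTALL
)

def body_of(html):
    """Return the body contents, or the whole text if no <body> tag is found."""
    match = BODY_RE.search(html)
    body = match.group(1) if match else html
    return STRIP_RE.sub("", body).strip()

def concat_folder(folder):
    """Join the bodies of all .htm/.html files in `folder` into one new file."""
    folder = Path(folder)
    sources = sorted(
        p for p in folder.iterdir()
        if p.suffix.lower() in (".htm", ".html")
        and not p.name.startswith("html_concat ")  # skip earlier output files
    )
    sections = []
    for path in sources:
        # errors="replace" tolerates old pages that are not valid UTF-8.
        text = path.read_text(encoding="utf-8", errors="replace")
        sections.append(f"<h2>{path.name}</h2>\n{body_of(text)}")
    stamp = time.strftime("%Y-%m-%d %H-%M-%S")  # no colons: invalid on Windows
    out = folder / f"html_concat {stamp}.htm"
    out.write_text(
        "<html><body>\n" + "\n<hr>\n".join(sections) + "\n</body></html>\n",
        encoding="utf-8",
    )
    return out
```

Calling `concat_folder(".")` from the folder of HTML files writes the combined file there and returns its path.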
Case Study: Extract data from HTML course - Results
- Will a conversion project from HTML be hellish?
- Yes.
- You will, however, enjoy a significantly higher circle of hell than you otherwise would.