For correcting the HTML, have you considered HTML Tidy? It already has a lot of logic for repairing malformed markup, and I think it's available as a Perl module. -- Aaron Swartz: http://www.aaronsw.com/