So, I was trying to improve Commentary's HTML processing. I was parsing the incoming page with tidy, reading it with ElementTree, and then writing it back out. But tidy was leaving in some entities that Expat didn't understand, and ElementTree was outputting XML that doesn't look like HTML. I think I have it fixed (maybe), but it was all much harder than it should have been.
Anyway, I felt okay about the algorithm by that point, worked around the problems, and reimplemented it using ElementTree. This introduced some problems of its own, because ElementTree's model isn't much like the DOM. In the DOM every node knows about its siblings, parent, etc. Elements in ElementTree don't know any of that (which is conventional in Python and most languages: you don't know about your container). But that was inconvenient, so I had to make a wrapper to give me access to that information.
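One cheap way to fake that wrapper (a sketch, not my actual code; `parent_map` is a name I made up, not part of the ElementTree API) is to walk the tree once after parsing and build a child-to-parent dictionary:

```python
from xml.etree import ElementTree as ET

def parent_map(tree):
    # Walk the whole tree once and map every element to its parent.
    # Elements hash by identity, so a plain dict works; siblings can
    # then be recovered by indexing into parents[elem].
    return {child: parent
            for parent in tree.iter()
            for child in parent}

root = ET.fromstring('<div><p>one</p><p>two</p></div>')
parents = parent_map(root)
second = root.findall('p')[1]
assert parents[second] is root
```

The catch, of course, is that the map goes stale as soon as you restructure the tree, so you have to rebuild it after edits.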
Then there's the issue that there's no code I know of that knows how to parse HTML into a tree (HTMLParser parses it, of course, but not in a useful way -- it gives you a stream of events, not a tree). So everyone uses Tidy to normalize their markup to XHTML, which works but feels really sloppy. HTML is parseable; in this case, I really only wanted to parse well-formed HTML anyway. Then, finally, there's no builtin way to serialize an ElementTree to HTML, as far as I can find. There are some hints, but they still leave you with empty elements like <a name="foo" />, which browsers do not like. I had to clone a write method in ElementTree and make edits to it.
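The empty-element problem is easy to demonstrate with today's stdlib ElementTree (a sketch; the exact bytes are what current CPython emits, older releases may differ): the XML serializer collapses any childless, textless element into the self-closing form, which most browsers treat as an unclosed tag when it's an <a> or a <script>.

```python
from xml.etree import ElementTree as ET

# An anchor with no content round-trips into the collapsed XML form.
anchor = ET.fromstring('<a name="foo"></a>')

# Fine as XML, but a browser reading this as HTML sees an <a> tag
# that never closes, and the rest of the page ends up inside it.
assert ET.tostring(anchor) == b'<a name="foo" />'
```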
When you run tidy on the way in, you need to use "-n" (numeric entities) and "-asxml". ElementTree's XML serializer isn't well suited to tag soup output (HTML needs special treatment of many tags), so you need to grab an HTML serializer for ET. There's a nice one in Kid.
Alternatively, you can use tidy on the way out too; feeding the XML through "tidy -xml" should work.
I looked at Kid's, and was a little confused by the iteration over events/tokens that it uses; it wasn't really clear to me what that internal data structure was. I ended up creating an ElementTree subclass, HTMLTree, in dumbpath. It might leave out things that Kid does, but it mostly makes sure that empty elements don't get />, that all other elements get both opening and closing tags, and it strips namespaces.
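The core of such a serializer is small. This is a sketch under my own assumptions, not the actual dumbpath or Kid code (`HTML_EMPTY`, `strip_ns`, and `serialize_html` are names I invented): keep a list of the tags HTML defines as empty, emit those without a closing tag, and force an explicit closing tag on everything else.

```python
from xml.etree import ElementTree as ET

# Tags HTML defines as empty; everything else must get an explicit
# closing tag even when it has no children or text.
HTML_EMPTY = {'area', 'base', 'br', 'col', 'hr', 'img',
              'input', 'link', 'meta', 'param'}

def strip_ns(tag):
    # Drop any {namespace-uri} prefix ElementTree attaches to tags.
    return tag.split('}', 1)[-1]

def serialize_html(elem):
    tag = strip_ns(elem.tag)
    # No attribute/text escaping here -- a real serializer needs it.
    attrs = ''.join(' %s="%s"' % (strip_ns(k), v)
                    for k, v in elem.attrib.items())
    if tag in HTML_EMPTY:
        out = '<%s%s>' % (tag, attrs)          # no closing tag, no />
    else:
        out = '<%s%s>%s' % (tag, attrs, elem.text or '')
        for child in elem:
            out += serialize_html(child)
        out += '</%s>' % tag                    # always close
    return out + (elem.tail or '')

root = ET.fromstring('<p>link: <a name="foo"></a><br/></p>')
assert serialize_html(root) == '<p>link: <a name="foo"></a><br></p>'
```

Note how the empty <a> comes out with a real closing tag while <br> gets neither a closing tag nor a trailing slash, which is exactly the split browsers expect.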
An HTML serializer would be a nice addition to elementtidy, since reading and writing HTML are operations that often go together.
Maybe BeautifulSoup is what you need: http://www.crummy.com/software/BeautifulSoup/
AFAIK, BeautifulSoup structures aren't modifiable, and modification is exactly what I'm doing -- parsing HTML, modifying the parsed form (adding comments), then writing it out again.
You can change BeautifulSoup structures. For example, you can insert raw HTML fragments:

```python
>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup('<html><body><p> text 1 <p> text 2 </html>')
>>> print soup
<html><body><p> text 1 </p><p> text 2 </p></body></html>
>>> par2 = soup('p')[1]
>>> par2.name = 'div'
>>> par2.contents = ['<p>'] + par2.contents + ['</p>']
>>> print soup
<html><body><p> text 1 </p><div><p> text 2 </p></div></body></html>
```

# Alexander Kozlovsky
And yes, if the HTML isn't too obnoxious, the HTMLTreeBuilder module might help (and improvements to that module are welcome).
It looks like Meld3 includes an HTML parser now (and it uses ET internally); probably a good candidate for extraction at some point: http://www.plope.com/software/meld3/
lxml's implementation of ElementTree provides an extension to the ElementTree API that lets you get the parent (at least in svn).
One wishlist item is to expose libxml2's HTML parser to lxml somehow. Volunteers are welcome. :) We already have a patch lying around implementing serialization support. I should find the time to review/integrate it all and prepare another release...# Martijn Faassen
Take a look at this document for some coverage of HTML processing using different XML parsers (including PyXML):# Paul Boddie
getElementById is tricky:

```python
from xml.dom import minidom

# Without a DTD, minidom doesn't know which attributes are IDs,
# so getElementById finds nothing.
s = '<?xml version="1.0"?><foo><bar id="1" /></foo>'
doc = minidom.parseString(s)
assert None == doc.getElementById('1')

# Declare the attribute as an ID in an internal DTD and it works.
s = ('<?xml version="1.0"?>'
     '<!DOCTYPE quote [ <!ATTLIST bar id ID #IMPLIED> ]> '
     '<foo><bar id="1" /></foo>')
doc = minidom.parseString(s)
assert None != doc.getElementById('1')
```
You've basically got to load in the HTML DTD if you expect getElementById to work.# Stephen Thorne
That, um, sucks. Geez... the only reason the DOM seems useful to me is that it's implemented in browsers. I'm sure it's implemented and widely used elsewhere (I guess; I don't actually hear people talking about it), but the primary implementation in my mind has always been the browsers'. getElementById should at least raise an exception when it can't return a meaningful value, or the documentation should say how to make it act like the browser's implementation. But eh... ET is much more predictable and has relatively few intricacies. And it's going to be in the standard library (w00t!), so I'll probably just choose to forget that xml.dom.minidom even exists.
You probably already ran across it, but Fredrik Lundh also has TidyHTMLTreeBuilder, which conveniently wraps calling tidy on some HTML and returns an ElementTree tree.
It'll be even easier once all browsers support E4X :) http://en.wikipedia.org/wiki/E4X