When people think about web scraping in Python, they usually think BeautifulSoup. That’s okay, but I would encourage you to also consider lxml.
First, people think BeautifulSoup is better at parsing broken HTML. This is not correct. lxml parses broken HTML quite nicely. I haven’t done any thorough testing, but at least the BeautifulSoup broken HTML example is parsed better by lxml (which knows that <td> elements should go inside <table> elements).
Second, people feel lxml is harder to install. This is correct. BUT, lxml 2.2alpha1 includes an option to compile static versions of the underlying C libraries, which should improve the installation experience, especially on Macs. To install this new way, try:
$ STATIC_DEPS=true easy_install 'lxml>=2.2alpha1'
Once you have lxml installed, you have a great parser (which happens to be super-fast, and speed isn't a tradeoff here). You get a fairly familiar API based on ElementTree, which, though it feels a little strange at first, offers a compact and canonical representation of a document tree compared to more traditional representations. But there's more…
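To get a rough feel for that API, here is a tiny sketch (the fragment is made up, and the comments show what you would typically see):

from lxml.html import fromstring

p = fromstring('<p>Hello <b>world</b>!</p>')   # parse a fragment into an element
print p.tag                    # 'p'
print p.text                   # 'Hello ' (the text before the first child)
b = p[0]                       # child elements are reached by indexing
print b.tag, b.text, b.tail    # 'b', 'world', '!' (.tail is the text after </b>)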
One of the features that should be appealing to many people doing screen scraping is that you get CSS selectors. You can use XPath as well, but it's usually more verbose. Here's an example I found that gets links from a menu of a page, written with BeautifulSoup:
from BeautifulSoup import BeautifulSoup
import urllib2

soup = BeautifulSoup(urllib2.urlopen('http://java.sun.com').read())
menu = soup.findAll('div', attrs={'class': 'pad'})
for subMenu in menu:
    links = subMenu.findAll('a')
    for link in links:
        print "%s : %s" % (link.string, link['href'])
Here’s the same example in lxml:
from lxml.html import parse

doc = parse('http://java.sun.com').getroot()
for link in doc.cssselect('div.pad a'):
    print '%s: %s' % (link.text_content(), link.get('href'))
lxml generally knows more about HTML than BeautifulSoup. Also I think it does well with the small details; for instance, the lxml example will match elements in <div class="pad menu"> (space-separated classes), which the BeautifulSoup example does not do (obviously there are other ways to search, but the obvious and documented technique doesn’t pay attention to HTML semantics).
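A quick way to see that in action (a small made-up fragment):

from lxml.html import fromstring

snippet = fromstring('<div class="pad menu"><a href="/products">Products</a></div>')
# 'div.pad' still matches, even though the class attribute holds two classes
for link in snippet.cssselect('div.pad a'):
    print link.get('href')     # prints: /products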
One feature that I think is really useful is .make_links_absolute(). This takes the base URL of the page (doc.base) and uses it to make all the links absolute. This makes it possible to relocate snippets of HTML or whole sets of documents (as with this program). This isn’t just <a href> links, but stylesheets, inline CSS with @import statements, background attributes, etc. It doesn’t see quite all links (for instance, links in Javascript) but it sees most of them, and works well for most sites. So if you want to make a local copy of a site:
from lxml.html import parse, open_in_browser
doc = parse('http://wiki.python.org/moin/').getroot()
doc.make_links_absolute()
open_in_browser(doc)
open_in_browser serializes the document to a temporary file and then opens a web browser (using webbrowser).
Here’s an example that compares two pages using lxml.html.diff:
from lxml.html.diff import htmldiff
from lxml.html import parse, tostring, open_in_browser, fromstring

def get_page(url):
    doc = parse(url).getroot()
    doc.make_links_absolute()
    return tostring(doc)

def compare_pages(url1, url2, selector='body div'):
    basis = parse(url1).getroot()
    basis.make_links_absolute()
    other = parse(url2).getroot()
    other.make_links_absolute()
    el1 = basis.cssselect(selector)[0]
    el2 = other.cssselect(selector)[0]
    diff_content = htmldiff(tostring(el1), tostring(el2))
    diff_el = fromstring(diff_content)
    el1.getparent().insert(el1.getparent().index(el1), diff_el)
    el1.getparent().remove(el1)
    return basis

if __name__ == '__main__':
    import sys
    doc = compare_pages(sys.argv[1], sys.argv[2], sys.argv[3])
    open_in_browser(doc)
You can use it like:
$ python lxmldiff.py \
    'http://wiki.python.org/moin/BeginnersGuide?action=recall&rev=70' \
    'http://wiki.python.org/moin/BeginnersGuide?action=recall&rev=81' \
    'div#content'
Another feature lxml has is form handling. All the cool sexy new sites use minimal forms, but searching for "registration forms" I get this nice complex form. Let’s look at it:
>>> from lxml.html import parse, tostring
>>> doc = parse('http://www.actuaryjobs.com/cform.html').getroot()
>>> doc.forms
[<Element form at -48232164>]
>>> form = doc.forms[0]
>>> form.inputs.keys()
['thank_you_title', 'City', 'Zip', ... ]
Now we have a form object. There are two ways to get to the fields: form.inputs, which gives us a dictionary of all the actual <input> elements (plus textarea and select elements), and form.fields, which is a dictionary-like object of the field values. The dictionary-like object is convenient; for instance:
>>> form.fields['cEmail'] = 'me@example.com'
This actually updates the input element itself:
>>> tostring(form.inputs['cEmail'])
'<input type="input" name="cEmail" size="30" value="test2">'
I think it’s actually a nicer API than htmlfill and can serve the same purpose on the server side.
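For instance, here's a rough sketch of filling defaults into a parsed page on the server side (the template file name and field values are hypothetical):

from lxml.html import fromstring, tostring

page = fromstring(open('signup_form.html').read())           # hypothetical template
form = page.forms[0]
defaults = {'cEmail': 'me@example.com', 'City': 'Chicago'}    # hypothetical values
for name, value in defaults.items():
    form.fields[name] = value    # updates the underlying <input> elements in place
print tostring(page)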
But then you can also use the same interface for scraping, by filling fields and getting the submission. That looks like:
>>> import urllib
>>> action = form.action
>>> data = urllib.urlencode(form.form_values())
>>> if form.method == 'GET':
...     if '?' in action:
...         action += '&' + data
...     else:
...         action += '?' + data
...     data = None
>>> resp = urllib.urlopen(action, data)
>>> resp_doc = parse(resp).getroot()
Lastly, there's HTML cleaning. I think all these features work together well and do useful things, and they're based on an actual understanding of HTML instead of just treating tags and attributes as arbitrary. (Also, if you really like jQuery, you might want to look at pyquery, which is a jQuery-like API on top of lxml.)
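The cleaning support lives in lxml.html.clean; a minimal sketch of how it is typically used (the options shown are just a few of those available, and the input is made up):

from lxml.html.clean import Cleaner

cleaner = Cleaner(scripts=True, javascript=True, style=True)
dirty = '<div onclick="evil()"><script>alert(1)</script>Hello <style>p {}</style>world</div>'
# clean_html accepts a string (or an element) and strips script tags,
# javascript attributes like onclick, and style content
print cleaner.clean_html(dirty)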
One major benefit of BeautifulSoup being pure Python is that I can (and do) run it on Google AppEngine.
Yeah, I’m actually using it on GAE as well. I don’t think lxml will be on GAE anytime in the near, middle, or maybe distant future. This is very sad for me.
nice writeup & examples. thanks.
I used to use Beautiful Soup but found it failed too often for bad markup. Then I tried lxml.html and haven’t looked back.
Thanks to this blog post, I now have lxml installed on OS X. I wasn’t able to get it compiled before. Thanks! Looking forward to playing with it. Especially looking forward to using CSS selectors.
Both BeautifulSoup and lxml have their merits. If someone wants to try a (still partial but perfectly working) XPath support for BeautifulSoup, it can be found here: http://www3.itu.edu.tr/~uyar/bsoupxpath/.
It was developed by H. Turgut Uyar to be used in the IMDbPY project, and I think it’s an excellent piece of code (and it would be nice if it could be fully developed and maybe even included in BeautifulSoup).
From the lxml documentation: “BeautifulSoup is a Python package that parses broken HTML. While libxml2 (and thus lxml) can also parse broken HTML, BeautifulSoup is a bit more forgiving and has superiour support for encoding detection. lxml can benefit from the parsing capabilities of BeautifulSoup through the lxml.html.soupparser module”
[ed note: noted, and I've changed the docs here to be a little clearer lxml can often parse better than BeautifulSoup, but the note on encoding detection remains correct]
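[ed note: for completeness, going through BeautifulSoup's parser from lxml looks roughly like this; a sketch, assuming BeautifulSoup is installed alongside lxml:]

from lxml.html import soupparser, tostring

# BeautifulSoup does the parsing, but you get back ordinary lxml elements
root = soupparser.fromstring('<p>Some <b>very broken<p>HTML')
print tostring(root)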
Hmm, while I agree with you, I feel you are being extremely misleading here. Your first example comparing the two libraries has the superfluous menu and links temporary variables in the BeautifulSoup code, whereas you forgo them in the lxml example. Granted, the lxml code is slightly cleaner.

lxml is completely awesome, though. Mirroring other comments, it is too bad there isn't a way to make it run on GAE.
Nice and timely as I need to do a bit of parsing soon. One feature I would kill for is something that could convert all CSS to inline (for HTML emails). Anyone seen any code to do this?
[ed note: it's been discussed on the lxml mailing list, mostly combining cssutils with the CSS selector support, but it's never quite happened]
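[ed note: a very rough sketch of that combination, untested illustration only, using cssutils to read the rules and cssselect to find the matching elements:]

from lxml.html import fromstring, tostring
import cssutils

css = 'p.intro { color: red }'    # stylesheet text pulled out of the page (hypothetical)
doc = fromstring('<div><p class="intro">Hi</p><p>Bye</p></div>')
for rule in cssutils.parseString(css).cssRules:
    if rule.type == rule.STYLE_RULE:
        for el in doc.cssselect(rule.selectorText):
            old = el.get('style')
            # append the rule's declarations to any existing inline style
            el.set('style', old + '; ' + rule.style.cssText if old else rule.style.cssText)
print tostring(doc)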
Nice summarization of lxml features, thanks
@Andy: do you mean something like this? http://mahonata.com/maho/css2inline.pl
not to forget xpath, which is extremely powerful for scraping and easy to use with lxml
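[ed note: for comparison, the XPath version of the CSS selector example above looks roughly like this; notice how much you have to spell out just to match a space-separated class attribute:]

from lxml.html import parse

doc = parse('http://java.sun.com').getroot()
# roughly equivalent to the CSS selector 'div.pad a'
xpath = "//div[contains(concat(' ', normalize-space(@class), ' '), ' pad ')]//a"
for link in doc.xpath(xpath):
    print '%s: %s' % (link.text_content(), link.get('href'))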
@Avinash Vora: “extremely misleading [...] superfluous menu and links temporary variables in the BeautifulSoup code whereas you forgo them in the lxml example”—i do not feel those variables to be misleadingly superfluous, although they could be skipped.
i’ve done a bit of screen-scraping with BeautifulSoup myself, and this is approximately the way i would’ve written the snippet. the main reason is documentation; it is just so much clearer to say ‘ok i get the menu with [incantation a]’, so lets step over its elements, and from each element, i ‘get the relevant links using [incantation b]’, and so on (granted, the expression filtering out the links is a very short and clear one, but it is a fact that BeautifulSoup would greatly profit from an overall nicer, simpler, API, and a jQuery/pyquery-like selector syntax).
i second that installing lxml can be a hassle; i only managed to make it run on windows by manually downloading and installing a binary package. definitely one great advantage of BeautifulSoup: a single fairly short *.py file, and you’re good to go.
I’ve just installed virtualenv. Using the activate script so now I have
Where the heck is src at this point? Better yet, any clue how I can fix this? Or should I try without STATIC_DEPS? Thanks
[ed note: if you don't see it say "Downloading libxml2 into libs/libxml2-2.7.2.tar.gz" then STATIC_DEPS didn't work, and probably you don't have the appropriate libxml2-dev and libxslt-dev packages; probably I should have specified easy_install lxml>=2.2alpha]

I'm also experiencing these errors — Can't seem to figure out how to overcome them.
As is usually the case with these kinds of compile errors, the actual error message is at the top of that long stream of errors. (The error itself is, sadly, probably not that readable either — but it will probably be about some missing file.) It might be, for instance, that you don't have the python-dev package installed.

How do I install that?
On Linux there's usually a package called python-dev or python2.5-dev, or something along those lines. You'd use apt-get or yum or whatever the appropriate package manager is to install that. On Windows there are some pre-built libraries for lxml that you can use. On a Mac you need to install the Mac developer tools.

Hey thanks for all your help — Got everything past the install hurdles.
Of course, now I’m running into some other problems using PyQuery (when it tries to import etree from lxml)
Here are the relevant lines from the traceback:
File "/usr/lib/python2.4/site-packages/pyquery-0.3.1-py2.4.egg/pyquery/cssselectpatch.py", line 6, in ?
    from lxml.cssselect import Pseudo, XPathExpr, XPathExprOr, Function, css_to_xpath
File "/usr/lib/python2.4/site-packages/lxml-2.2alpha1-py2.4-linux-i686.egg/lxml/cssselect.py", line 8, in ?
    from lxml import etree
ImportError: libxslt.so.1: cannot open shared object file: No such file or directory
I’m deep in the rabbit hole now ;)
I may just go back to using BeautifulSoup — it does have the benefit of being 100% python — these libxml2 and libxslt bindings have caused me no end of headaches — I’m working on linux in the cloud — my local Mac OS X box is working fine.
-Luke
@john_aman:
I was just able to run this in a virtualenv without using the activate script:
It did download, compile, and link with the versions of libxml2 and libxslt required by lxml (the box I tried this on is Centos4, whose native versions are too old).
FWIW, I never use the ‘activate’ bit at all when working in a virtualenv; I don’t know whether
I recently found BeautifulSpoon, which provides an XPath-like API for BeautifulSoup:
http://code.google.com/p/beautifulspoon/
We switched from BeautifulSoup to lxml for a parser project about eight months ago and saw a significant performance increase (I can’t remember the numbers, but I think we were looking at an order of magnitude speed-up of our overall parsing code that was almost entirely thanks to lxml). To keep everyone happy I implemented a fallback to BeautifulSoup if lxml balked at any page we were looking at, but I think that code’s rarely been necessary.
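[ed note: that kind of fallback can be quite small; a sketch, assuming lxml's BeautifulSoup bridge (lxml.html.soupparser) is available:]

from lxml import etree
from lxml.html import fromstring, soupparser

def parse_html(content):
    # try the fast libxml2-based parser first, fall back to BeautifulSoup
    try:
        return fromstring(content)
    except etree.LxmlError:
        return soupparser.fromstring(content)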
@Andy @ed.: I tried something like this (combining lxml and cssutils) in a small proof of concept: http://code.google.com/p/cssutils/source/browse/trunk/examples/style.py
I have not developed it any further yet but I guess it may be a starting point and the code is not very complicated I think.
Great article – I can smell Beautiful Soup cooking already.
Hi, great article. The main drawback I see in lxml is that it is not pure Python. I tested some XML frameworks for writing a small implementation of a template language for GAE and BeautifulSoup turned out to be the worst. I then switched to Genshi which I think is the best and most robust pure Python XML parser I know. It also supports XPath and therefore also CSS selectors if you use the cssselector module of lxml for translating to XPath. Just the interface is quite difficult to understand but very powerful. For people interested in this feature have a look at this: http://code.google.com/p/pyxer/source/browse/trunk#trunk/src/pyxer/template
Thanks to all.
@Ian, @Tres: The first real problem? My stupidity. python2.5-dev was not installed. I should have landed here much sooner, but Ubuntu makes installing packages so damn easy. 2nd problem – running at 64 bit is a bit like warp speed to this MSDOS 2.0 old timer. Years of DOS and windoze development may also have corrupted my neural network. Now I get to this:
where error is marked ***********
It seems the linker is failing to link exslt.o into libexslt.a: (exslt.o): relocation R_X86_64_32 against `a local symbol' can not be used when making a shared object; recompile with -fPIC
Oh, did I mention I’m new to 64 bit OS? Perhaps someone can point out how to do all the steps the wget/tar/.configure/make way. Too old to try to figure out something so easyinstall. Not knocking easy_install – this is a first time failure for me with that program.
@Perenzo-
something is wrong at pyxer.appspot.com:
But Ubuntu? sudo apt-get install python-genshi works. Still want to get this working for the love/speed of c.
@john aman: Yeah, thanks for the hint. This seems to be a bug in the Beaker session management for GAE. Don’t know why they do not fix it: http://pylonshq.com/project/pylonshq/ticket/537
BTW, I tried to rewrite the example using Genshi and cssselector and it looks like this:
Well the handling of the streams is not trivial but powerful: http://genshi.edgewall.org/wiki/Documentation/streams.html
Ian, you and your readers rock!
From Stefan Behnel's follow-up message (posted here for the doc value):
And the results:
sorry about the "&gt;"s; those should read lxml>= …
Ian, thanks for posting about lxml. I hadn’t known about it and your post came just as I was starting a small screen-scraping project. Using CSS selectors in lxml is heavenly. It almost makes screen scraping fun. At any rate, it’s helping me make rapid progress.
Is it possible to use make_links_absolute with pyquery? I've messed around with the library source but haven't had any luck!

I tried the simple example:
… and I got this error:
That little tip on installing on OS X saved me a potential night of drama, as I was installing “Zine”.
STATIC_DEPS=true easy_install 'lxml>=2.2alpha1'
Thanks!
Regarding installing lxml … think I found a very easy way if you’re using ubuntu … search for lxml in the synaptic package manager , and the install is just a few clicks away :)
PS: the version available on synaptic may not be the very latest
I’m on a Mac (10.5.6). I tried the suggested line with an added sudo to install in my site-packages:
but it didn’t pull down libxml and libxslt. I ended up running it without sudo and it worked fine up until installation. Then I just re-ran with sudo and it grabbed the egg it just built. That did it, and now I have lxml installed. Thanks, Ian!
PS Ian: your Markdown link is 404ing right now…
Using "STATIC_DEPS=true easy_install lxml" worked great on three Intel Macs using the python.org Python 2.5.4 — two of these Macs running 10.5 and one running 10.4.
Doing the same thing on a PPC Mac (10.4 and python.org 2.5.4) I had unfortunately no success. It fails with:
ar cru .libs/testdso.a testdso.o
ranlib .libs/testdso.a
creating testdso.la
(cd .libs && rm -f testdso.la && ln -s ../testdso.la testdso.la)
gcc -DHAVE_CONFIG_H -I. -I./include -I./include -D_REENTRANT -arch ppc -arch i386 -isysroot /Developer/SDKs/MacOSX10.4u.sdk -O2 -pedantic -W -Wformat -Wunused -Wimplicit -Wreturn-type -Wswitch -Wcomment -Wtrigraphs -Wformat -Wchar-subscripts -Wuninitialized -Wparentheses -Wshadow -Wpointer-arith -Wcast-align -Wwrite-strings -Waggregate-return -Wstrict-prototypes -Wmissing-prototypes -Wnested-externs -Winline -Wredundant-decls -c xmllint.c
/bin/sh ./libtool --tag=CC --mode=link gcc -arch ppc -arch i386 -isysroot /Developer/SDKs/MacOSX10.4u.sdk -O2 -pedantic -W -Wformat -Wunused -Wimplicit -Wreturn-type -Wswitch -Wcomment -Wtrigraphs -Wformat -Wchar-subscripts -Wuninitialized -Wparentheses -Wshadow -Wpointer-arith -Wcast-align -Wwrite-strings -Waggregate-return -Wstrict-prototypes -Wmissing-prototypes -Wnested-externs -Winline -Wredundant-decls -arch ppc -arch i386 -isysroot /Developer/SDKs/MacOSX10.4u.sdk -o xmllint xmllint.o ./libxml2.la -lpthread -lz -liconv -lm
gcc -arch ppc -arch i386 -isysroot /Developer/SDKs/MacOSX10.4u.sdk -O2 -pedantic -W -Wformat -Wunused -Wimplicit -Wreturn-type -Wswitch -Wcomment -Wtrigraphs -Wformat -Wchar-subscripts -Wuninitialized -Wparentheses -Wshadow -Wpointer-arith -Wcast-align -Wwrite-strings -Waggregate-return -Wstrict-prototypes -Wmissing-prototypes -Wnested-externs -Winline -Wredundant-decls -arch ppc -arch i386 -isysroot /Developer/SDKs/MacOSX10.4u.sdk -o xmllint xmllint.o ./.libs/libxml2.a -lpthread -lz /usr/lib/libiconv.dylib -lm
/usr/libexec/gcc/i686-apple-darwin8/4.0.1/ld: for architecture i386
/usr/libexec/gcc/i686-apple-darwin8/4.0.1/ld: warning /usr/lib/libiconv.dylib cputype (18, architecture ppc) does not match cputype (7) for specified -arch flag: i386 (file not loaded)
/usr/libexec/gcc/i686-apple-darwin8/4.0.1/ld: Undefined symbols: _libiconv _libiconv_close _libiconv_open
collect2: ld returned 1 exit status
lipo: can't open input file: /var/tmp//ccm2sHDw.out (No such file or directory)
make[2]: *** [xmllint] Error 1
make[1]: *** [all-recursive] Error 1
make: *** [all] Error 2
Any suggestions would be very much appreciated!
Thanks, Pascal
Thanks for this beautiful post. I hadn't tried lxml till now because I wasn't sure whether it would be as useful as BeautifulSoup or not. But after reading this post I changed my mind and I will surely try lxml.
Thank you for clarifying the reasons to use lxml. Many people have doubts about lxml, and there's a rumor going around that lxml is not as comfortable to use as BeautifulSoup. But your post has given me the urge to rethink the matter.
Ian,
first, thanks: lxml is a hoot. I'll grant that getting lxml installed is a bit of a pain, but not that bad. Using the STATIC_DEPS suggestion was a big help as well.
However I have noticed something working through your web scraping examples.
Working through this –
from lxml.html import parse

doc = parse('http://java.sun.com').getroot()
for link in doc.cssselect('div.pad a'):
    print '%s: %s' % (link.text_content(), link.get('href'))
I receive a failure on the parse().getroot() statement. However if I do the following —
import urllib
from lxml.html import *

content = urllib.urlopen('http://java.sun.com').read()
doc = fromstring(content)
for link in doc.cssselect('div.pad a'):
    print '%s: %s' % (link.text_content(), link.get('href'))
it works. Have you seen this behavior before?
No, not at all — I just tried it and it worked fine.
You must insert the trailing slash:
doc = parse('http://java.sun.com/').getroot()
works :) don't ask me why…
Has anybody looked at http://scrapy.org/.
Lxml is one of those Python libraries that should be really high profile given its usefulness. Sadly it is also one of those odd projects peculiar to open source that doesn’t have a forum and as a consequence receives little attention from anyone not prepared to maintain a database of mailing lists on every workstation they use for every single piece of software or technology subject they are interested in.
I don't know how much of the code goes into the C side, but if the division between the Python code and the C calls to the libxml libraries is well defined, maybe one could write wrappers that simulate the libxml calls using ElementTree, which is part of Python's batteries (not sure about the XPath support). Not a small project, but something that could be useful for, say, Google App Engine users or Jython/IronPython/PyPy users.
Thanks for this: it encouraged me to use lxml. I used lxml to import data from HTML to a database (I should have been generating those pages from a database in the first place, but that's another story).
I used xml.sax for something similar (except it was an XML file that time) a few months ago, and lxml was much easier to work with.
No installation problems on Mandriva Linux — lxml was in the repos, so a tick and a click was all that was needed. As Grease suggests, if you use an OS with a package manager, it's the easiest way to install almost anything.
I am trying to install lxml in an active virtual environment on Ubuntu 10.04 64-bit on a Dell xps, Python 2.6.5.
I used this command from the top of the virtual environment:
STATIC_DEPS=true bin/easy_install lxml
This is the result I got:
make[1]: Leaving directory `/tmp/easy_install-N4_xs4/lxml-2.2.6/build/tmp/libxslt-1.1.26'
NOTE: Trying to build without Cython, pre-generated 'src/lxml/lxml.etree.c' needs to be available.
Using build configuration of libxml2 2.7.7 and libxslt 1.1.26
Building against libxml2/libxslt in the following directory: /tmp/easy_install-N4_xs4/lxml-2.2.6/build/tmp/libxml2/lib
/usr/bin/ld: /tmp/easy_install-N4_xs4/lxml-2.2.6/build/tmp/libxml2/lib/libxslt.a(xslt.o): relocation R_X86_64_32 against `.rodata.str1.8' can not be used when making a shared object; recompile with -fPIC
/tmp/easy_install-N4_xs4/lxml-2.2.6/build/tmp/libxml2/lib/libxslt.a: could not read symbols: Bad value

I guess that something is missing. I installed python-dev on my main system and created the virtual environment with --no-site-packages. If I need to install python-dev in my virtual environment, I don't know how to do it (apt-get tries to put it in the main installation, and complains that I am not root). If I need something else, I would appreciate instructions on how to put it into my virtualenv.
By the way, what I am looking to do right now is to translate a dictionary object into xml. I currently translate it into JSON without a hitch, but I need to be able to support also translating it into xml. The structure consists of a key:value, the value consists of a list of a list of key-value pairs. Any thoughts on converting a dictionary to xml would also be welcome. That’s how I came to lxml.
Thanks so much.
Herb
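[ed note: for the dictionary-to-XML question, a rough sketch with lxml.etree; the tag names and sample data are made up to match the structure described above (a key whose value is a list of lists of key-value pairs):]

from lxml import etree

def dict_to_xml(tag, mapping):
    root = etree.Element(tag)
    for key, value in mapping.items():
        child = etree.SubElement(root, key)
        if isinstance(value, list):
            # each inner list of (name, value) pairs becomes an <item> element
            for pairs in value:
                item = etree.SubElement(child, 'item')
                for name, val in pairs:
                    etree.SubElement(item, name).text = unicode(val)
        else:
            child.text = unicode(value)
    return root

data = {'records': [[('name', 'a'), ('score', 1)], [('name', 'b'), ('score', 2)]]}
print etree.tostring(dict_to_xml('root', data), pretty_print=True)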
Hi all! I have installed lxml 2.2.2 on Windows (I'm using Python 2.6.5). I've tried the code you mentioned:

from lxml.html import parse
p = parse('http://www.google.com').getroot()
but i am getting the following error:
Traceback (most recent call last):
  File "", line 1, in
    p = parse('http://www.google.com').getroot()
  File "C:\Python26\lib\site-packages\lxml-2.2.2-py2.6-win32.egg\lxml\html\__init__.py", line 661, in parse
    return etree.parse(filename_or_url, parser, base_url=base_url, **kw)
  File "lxml.etree.pyx", line 2698, in lxml.etree.parse (src/lxml/lxml.etree.c:49590)
  File "parser.pxi", line 1491, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:71205)
  File "parser.pxi", line 1520, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:71488)
  File "parser.pxi", line 1420, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:70583)
  File "parser.pxi", line 975, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:67736)
  File "parser.pxi", line 539, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:63820)
  File "parser.pxi", line 625, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:64741)
  File "parser.pxi", line 563, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:64056)
IOError: Error reading file 'http://www.google.com': failed to load external entity "http://www.google.com"
i am clueless as to what to do next as i am a newbie to python. please guide me to solve this error. thanks in advance!! :)
Hi Ian, in the first lxml example that you have given, I think instead of

doc = parse('http://java.sun.com').getroot()

it should be

from urllib2 import urlopen
doc = parse(urlopen('http://java.sun.com')).getroot()

as parse does not fetch the website. As I said, the first one is giving an error but the second one is working fine for me.
Thanks a lot for helping me overcome my "fear" of lxml. As a beginner I am quite happy to have managed to pull out all the data I needed, which was spread over more than 400 web pages (with the help of xmlstarlet to manually clean up some pages). No problems whatsoever installing with the Debian packages. Thanks a lot.