Ian Bicking: a blog :: lxml: an underappreciated web scraping library

{ 2008 12 10 }

lxml: an underappreciated web scraping library

When people think about web scraping in Python, they usually think BeautifulSoup. That’s okay, but I would encourage you to also consider lxml.

First, people think BeautifulSoup is better at parsing broken HTML. This is not correct. lxml parses broken HTML quite nicely. I haven’t done any thorough testing, but at least the BeautifulSoup broken HTML example is parsed better by lxml (which knows that <td> elements should go inside <table> elements).

Second, people feel lxml is harder to install. This is correct. BUT, lxml 2.2alpha1 includes an option to compile static versions of the underlying C libraries, which should improve the installation experience, especially on Macs. To install this new way, try:

$ STATIC_DEPS=true easy_install 'lxml>=2.2alpha1'

One you have lxml installed, you have a great parser (which happens to be super-fast and that is not a tradeoff). You get a fairly familiar API based on ElementTree, which though a little strange feeling at first, offers a compact and canonical representation of a document tree, compared to more traditional representations. But there’s more…

One of the features that should be appealing to many people doing screen scraping is that you get CSS selectors. You can use XPath as well, but usually that’s more complicated (for example). Here’s an example I found getting links from a menu in a page in BeautifulSoup:

from BeautifulSoup import BeautifulSoup
import urllib2
soup = BeautifulSoup(urllib2.urlopen('http://java.sun.com').read())
menu = soup.findAll('div',attrs={'class':'pad'})
for subMenu in menu:
links = subMenu.findAll('a')
for link in links:
print "%s : %s" % (link.string, link['href'])

Here’s the same example in lxml:

from lxml.html import parse
doc = parse('http://java.sun.com').getroot()
for link in doc.cssselect('div.pad a'):
print '%s: %s' % (link.text_content(), link.get('href'))

lxml generally knows more about HTML than BeautifulSoup. Also I think it does well with the small details; for instance, the lxml example will match elements in <div class="pad menu"> (space-separated classes), which the BeautifulSoup example does not do (obviously there are other ways to search, but the obvious and documented technique doesn’t pay attention to HTML semantics).

One feature that I think is really useful is .make_links_absolute(). This takes the base URL of the page (doc.base) and uses it to make all the links absolute. This makes it possible to relocate snippets of HTML or whole sets of documents (as with this program). This isn’t just <a href> links, but stylesheets, inline CSS with @import statements, background attributes, etc. It doesn’t see quite all links (for instance, links in Javascript) but it sees most of them, and works well for most sites. So if you want to make a local copy of a site:

from lxml.html import parse, open_in_browser
doc = parse('http://wiki.python.org/moin/').getroot()
doc.make_links_absolute()
open_in_browser(doc)

open_in_browser serializes the document to a temporary file and then opens a web browser (using webbrowser).

Here’s an example that compares two pages using lxml.html.diff:

from lxml.html.diff import htmldiff
from lxml.html import parse, tostring, open_in_browser, fromstring

def get_page(url):
doc = parse(url).getroot()
doc.make_links_absolute()
return tostring(doc)

def compare_pages(url1, url2, selector='body div'):
basis = parse(url1).getroot()
basis.make_links_absolute()
other = parse(url2).getroot()
other.make_links_absolute()
el1 = basis.cssselect(selector)[0]
el2 = other.cssselect(selector)[0]
diff_content = htmldiff(tostring(el1), tostring(el2))
diff_el = fromstring(diff_content)
el1.getparent().insert(el1.getparent().index(el1), diff_el)
el1.getparent().remove(el1)
return basis

if __name__ == '__main__':
import sys
doc = compare_pages(sys.argv[1], sys.argv[2], sys.argv[3])
open_in_browser(doc)

You can use it like:

$ python lxmldiff.py \
'http://wiki.python.org/moin/BeginnersGuide?action=recall&#038;rev=70' \
'http://wiki.python.org/moin/BeginnersGuide?action=recall&#038;rev=81' \
'div#content'

Another feature lxml has is form handling. All the cool sexy new sites use minimal forms, but searching for "registration forms" I get this nice complex form. Let’s look at it:

>>> from lxml.html import parse, tostring
>>> doc = parse('http://www.actuaryjobs.com/cform.html').getroot()
>>> doc.forms
[<Element form at -48232164>]
>>> form = doc.forms[0]
>>> form.inputs.keys()
['thank_you_title', 'City', 'Zip', ... ]

Now we have a form object. There’s two ways to get to the fields: form.inputs, which gives us a dictionary of all the actual <input> elements (and textarea and select). There’s also form.fields, which is a dictionary-like object. The dictionary-like object is convenient, for instance:

>>> form.fields['cEmail'] = 'me@example.com'

This actually updates the input element itself:

>>> tostring(form.inputs['cEmail'])
'<input type="input" name="cEmail" size="30" value="test2">'

I think it’s actually a nicer API than htmlfill and can serve the same purpose on the server side.

But then you can also use the same interface for scraping, by filling fields and getting the submission. That looks like:

>>> import urllib
>>> action = form.action
>>> data = urllib.urlencode(form.form_values())
>>> if form.method == 'GET':
... if '?' in action:
... action += '&#038;' + data
... else:
... action += '?' + data
... data = None
>>> resp = urllib.urlopen(action, data)
>>> resp_doc = parse(resp).getroot()

Lastly, there’s HTML cleaning. I think all these features work together well, do useful things, and it’s based on an actual understanding HTML instead of just treating tags and attributes as arbitrary. (Also if you really like jQuery, you might want to look at pyquery, which is a jQuery-like API on top of lxml).

Automatically generated list of related posts:

lxml.html Over the summer I did quite a bit of work...
Making a proxy with WSGI and lxml You can use WSGI to make rewriting middleware; WebOb specifically...
Python HTML Parser Performance In preparation for my PyCon talk on HTML I thought...
A new way to deploy web applications Deployment is one of the things I like least about...
A Python Web Application Package and Format (we should make one) At PyCon there was an open space about deployment, and...

51 Comments

Fred Blasdel says:

December 10, 2008 at 10:14 pm

One major benefit of BeautifulSoup being pure Python is that I can (and do) run it on Google AppEngine.
Ian Bicking says:

December 10, 2008 at 10:16 pm

Yeah, I’m actually using it on GAE as well. I don’t think lxml will be on GAE anytime in the near, middle, or maybe distant future. This is very sad for me.
Jehiah says:

December 10, 2008 at 10:32 pm

nice writeup & examples. thanks.
Richard says:

December 10, 2008 at 11:51 pm

I used to use Beautiful Soup but found it failed too often for bad markup. Then I tried lxml.html and haven’t looked back.
Donovan Preston says:

December 11, 2008 at 12:08 am

Thanks to this blog post, I now have lxml installed on OS X. I wasn’t able to get it compiled before. Thanks! Looking forward to playing with it. Especially looking forward to using CSS selectors.
Davide Alberani says:

December 11, 2008 at 2:37 am

Both BeautifulSoup and lxml have their merits. If someone wants to try a (still partial but perfectly working) XPath support for BeautifulSoup, it can be found here: http://www3.itu.edu.tr/~uyar/bsoupxpath/.

It was developed by H. Turgut Uyar to be used in the IMDbPY project, and I think it’s an excellent piece of code (and it would be nice if it could be fully developed and maybe even included in BeautifulSoup).
Giacomo says:

December 11, 2008 at 3:21 am

From the lxml documentation: “BeautifulSoup is a Python package that parses broken HTML. While libxml2 (and thus lxml) can also parse broken HTML, BeautifulSoup is a bit more forgiving and has superiour support for encoding detection. lxml can benefit from the parsing capabilities of BeautifulSoup through the lxml.html.soupparser module”

[ed note: noted, and I've changed the docs here to be a little clearer lxml can often parse better than BeautifulSoup, but the note on encoding detection remains correct]
Avinash Vora says:

December 11, 2008 at 4:51 am

Hmm, while I agree with you, I feel you are being extremely misleading here. Your first example comparing the two libraries has the superfluous menu and links temporary variables in the BeautifulSoup code whereas you forgo them in the lxml example. Granted, the lxml code is slightly cleaner.

lxml is completely awesome, though. Mirroring other comments, it is too bad there isn’t a way to make it run on GAE.
Andy Baker says:

December 11, 2008 at 5:09 am

Nice and timely as I need to do a bit of parsing soon. One feature I would kill for is something that could convert all CSS to inline (for HTML emails). Anyone seen any code to do this?

[ed note: it's been discussed on the lxml mailing list, mostly combining cssutils with the CSS selector support, but it's never quite happened]
R.Lacko says:

December 11, 2008 at 5:28 am

Nice summarization of lxml features, thanks
Raphael Ritz says:

December 11, 2008 at 5:33 am

@Andy: do you mean something like this? http://mahonata.com/maho/css2inline.pl
mbo says:

December 11, 2008 at 5:52 am

not to forget xpath, which is extremely powerful for scraping and easy to use with lxml
Love Encounter Flow says:

December 11, 2008 at 7:35 am

@Avinash Vora: “extremely misleading [...] superfluous menu and links temporary variables in the BeautifulSoup code whereas you forgo them in the lxml example”—i do not feel those variables to be misleadingly superfluous, although they could be skipped.

i’ve done a bit of screen-scraping with BeautifulSoup myself, and this is approximately the way i would’ve written the snippet. the main reason is documentation; it is just so much clearer to say ‘ok i get the menu with [incantation a]’, so lets step over its elements, and from each element, i ‘get the relevant links using [incantation b]’, and so on (granted, the expression filtering out the links is a very short and clear one, but it is a fact that BeautifulSoup would greatly profit from an overall nicer, simpler, API, and a jQuery/pyquery-like selector syntax).

i second that installing lxml can be a hassle; i only managed to make it run on windows by manually downloading and installing a binary package. definitely one great advantage of BeautifulSoup: a single fairly short *.py file, and you’re good to go.
john aman says:

December 11, 2008 at 9:27 am

I’ve just installed virtualenv. Using the activate script so now I have
```
(env) /home/john@Ibex64:~$ STATIC_DEPS=true easy_install lxml
... more than 1000 lines later (my xterm scroll bufsize)
...
src/lxml/lxml.etree.c: ~ 1000 errors
...
src/lxml/lxml.etree.c:128124: error: expected ‘)’ before ‘*’ token
src/lxml/lxml.etree.c:128139: error: expected ‘)’ before ‘*’ token
src/lxml/lxml.etree.c:128154: error: expected ‘)’ before ‘*’ token
src/lxml/lxml.etree.c:128169: error: expected ‘)’ before ‘*’ token
src/lxml/lxml.etree.c:128184: error: expected ‘)’ before ‘*’ token
error: Setup script exited with error: command 'gcc' failed with exit status 1
```
Where the heck is src at this point? Better yet, any clue how I can fix this? Or should I try without STATIC_DEPS? Thanks

[ed note: if you don't see it say "Downloading libxml2 into libs/libxml2-2.7.2.tar.gz" then STATIC_DEPS didn't work, and probably you don't have the appropriate libxml2-dev and libxslt-dev packages; probably I should have specified easy_install lxml>=2.2alpha]
- Luke says:
  
  March 11, 2009 at 9:41 am
  
  I’m also experiencing these errors — Can’t seem to figure out how to overcome them.
- Ian Bicking says:
  
  March 11, 2009 at 9:45 am
  
  As is usually the case with these kind of compile errors, the actual error message is at the top of that long stream of errors. (The error itself is, sadly, probably not that readable either — but it will probably be able some missing file.) It might be, for instance, that you don’t have the python-dev package installed.
  - Luke says:
    
    March 11, 2009 at 10:24 am
    
    How do I install that?
    - Ian Bicking says:
      
      March 11, 2009 at 10:46 am
      
      On Linux there’s usually a package called python-dev or python2.5-dev, or something along those lines. You’d use apt-get or yum or whatever the appropriate package manager to install that. On Windows there are some pre-built libraries for lxml that you can use. On a Mac you need to install the Mac developer tools.
      - Luke says:
        
        March 11, 2009 at 12:21 pm
        
        Hey thanks for all your help — Got everything past the install hurdles.
        
        Of course, now I’m running into some other problems using PyQuery (when it tries to import etree from lxml)
        
        Here are the relevant lines from the traceback:
        
        File “/usr/lib/python2.4/site-packages/pyquery-0.3.1-py2.4.egg/pyquery/cssselectpatch.py”, line 6, in ? from lxml.cssselect import Pseudo, XPathExpr, XPathExprOr, Function, csstoxpath File “/usr/lib/python2.4/site-packages/lxml-2.2alpha1-py2.4-linux-i686.egg/lxml/cssselect.py”, line 8, in ? from lxml import etree ImportError: libxslt.so.1: cannot open shared object file: No such file or directory
        
        I’m deep in the rabbit hole now ;)
        
        I may just go back to using BeautifulSoup — it does have the benefit of being 100% python — these libxml2 and libxslt bindings have caused me no end of headaches — I’m working on linux in the cloud — my local Mac OS X box is working fine.
        
        -Luke
Tres Seaver says:

December 11, 2008 at 11:48 am

@john_aman:

I was just able to run this in a virtualenv without using the activate script:
```
$ /path/to/virtualenv --no-site-packages ~/lxmltest
...
$ cd ~/lxmltest
$ STATIC_DEPS=true bin/easy_install lxml
...
NOTE: Trying to build without Cython, pre-generated 'src/lxml/lxml.etree.c' needs to be available.
Using build configuration of libxml2 2.7.2 and libxslt 1.1.24
Building against libxml2/libxslt in the following directory: /tmp/easy_install-_JbKqA/lxml-2.2alpha1/build/tmp/libxml2/lib
Adding lxml 2.2alpha1 to easy-install.pth file

Installed /home/tseaver/lxmltest/lib/python2.4/site-packages/lxml-2.2alpha1-py2.4-linux-i686.egg
Processing dependencies for lxml
Finished processing dependencies for lxml
```
It did download, compile, and link with the versions of libxml2 and libxslt required by lxml (the box I tried this on is Centos4, whose native versions are too old).

FWIW, I never use the ‘activate’ bit at all when working in a virtualenv; I don’t know whether
Chris says:

December 11, 2008 at 12:59 pm

I recently found BeautifulSpoon which provides a xpath-like API to BeautifulSoup:
```
from spoon import Spoon
import urllib2

spoon = Spoon(urllib2.urlopen("http://java.sun.com").read())
for link in spoon._.div(class_="pad")._.a:
    print "%s: %s" % (link.inner_html(), link["@href"])
```
http://code.google.com/p/beautifulspoon/
James Stewart says:

December 11, 2008 at 3:05 pm

We switched from BeautifulSoup to lxml for a parser project about eight months ago and saw a significant performance increase (I can’t remember the numbers, but I think we were looking at an order of magnitude speed-up of our overall parsing code that was almost entirely thanks to lxml). To keep everyone happy I implemented a fallback to BeautifulSoup if lxml balked at any page we were looking at, but I think that code’s rarely been necessary.
Christof says:

December 11, 2008 at 4:19 pm

@Andy @ed.: I tried something like this (combining lxml and cssutils) in a small proove of concept: http://code.google.com/p/cssutils/source/browse/trunk/examples/style.py

I have not developed it any further yet but I guess it may be a starting point and the code is not very complicated I think.
Mighty California says:

December 11, 2008 at 11:04 pm

Great Article – I’m can smell Beautiful Soup cooking already.
Perenzo says:

December 12, 2008 at 6:17 am

Hi, great article. The main drawback I see in lxml is that it is not pure Python. I tested some XML frameworks for writing a small implementation of a template language for GAE and BeautifulSoup turned out to be the worst. I then switched to Genshi which I think is the best and most robust pure Python XML parser I know. It also supports XPath and therefore also CSS selectors if you use the cssselector module of lxml for translating to XPath. Just the interface is quite difficult to understand but very powerful. For people interested in this feature have a look at this: http://code.google.com/p/pyxer/source/browse/trunk#trunk/src/pyxer/template

john aman says:

December 12, 2008 at 7:09 am

Thanks to all.

@Ian, @Tres: The first real problem? My stupidity. python2.5-dev not installed. I should have landed here much sooner but for Ubuntu makes installing packages so damn easy. 2nd problem – running at 64 bit is a bit like warp speed to this MSDOS 2.0 old timer. Years of DOS and windoze development may also have corrupted my neural network. Now I get to this:

(env)john@ibex:~$ STATIC_DEPS=true easy_install 'lxml&gt;=2.2alpha1'
...
Downloading libxslt into libs/libxslt-1.1.24.tar.gz
Unpacking libxslt-1.1.24.tar.gz into build/tmp
Running "./configure --without-python --disable-dependency-tracking --disable-shared --prefix=/tmp/easy_install-Am4yKQ/lxml-2.2alpha1/build/tmp/libxml2" in build/tmp/libxml2-2.7.2
checking build system type... x86_64-unknown-linux-gnu
checking host system type... x86_64-unknown-linux-gnu
...

lxml-2.2alpha1/build/tmp/libxml2/lib/pkgconfig"
 /usr/bin/install -c -m 644 'libxslt.pc' '/tmp/easy_install-6H0Ks9/lxml-2.2alpha1/build/tmp/libxml2/lib/pkgconfig/libxslt.pc'
 /usr/bin/install -c -m 644 'libexslt.pc' '/tmp/easy_install-6H0Ks9/lxml-2.2alpha1/build/tmp/libxml2/lib/pkgconfig/libexslt.pc'
make[2]: Leaving directory `/tmp/easy_install-6H0Ks9/lxml-2.2alpha1/build/tmp/libxslt-1.1.24'
make[1]: Leaving directory `/tmp/easy_install-6H0Ks9/lxml-2.2alpha1/build/tmp/libxslt-1.1.24'
NOTE: Trying to build without Cython, pre-generated 'src/lxml/lxml.etree.c' needs to be available.
Using build configuration of libxml2 2.7.2 and libxslt 1.1.24
Building against libxml2/libxslt in the following directory: /tmp/easy_install-6H0Ks9/lxml-2.2alpha1/build/tmp/libxml2/lib
************************
/usr/bin/ld: /tmp/easy_install-6H0Ks9/lxml-2.2alpha1/build/tmp/libxml2/lib/libexslt.a(exslt.o): relocation R_X86_64_32 against `a local symbol' can not be used when making a shared object; recompile with -fPIC
************************
/tmp/easy_install-6H0Ks9/lxml-2.2alpha1/build/tmp/libxml2/lib/libexslt.a: could not read symbols: Bad value
collect2: ld returned 1 exit status
error: Setup script exited with error: command 'gcc' failed with exit status 1

where error is marked ***********

It seems the linker is failing to link exslt.o into libexslt.a: (exslt.o): relocation RX8664_32 against `a local symbol’ can not be used when making a shared object; recompile with -fPIC

Oh, did I mention I’m new to 64 bit OS? Perhaps someone can point out how to do all the steps the wget/tar/.configure/make way. Too old to try to figure out something so easyinstall. Not knocking easy_install – this is a first time failure for me with that program.

john aman says:

December 12, 2008 at 8:04 am

@Perenzo-

something is wrong at pyxer.appspot.com:

Server Error
URL: <a href="http://pyxer.appspot.com/guestbook2/commit?message=jlklk" rel="nofollow">http://pyxer.appspot.com/guestbook2/commit?message=jlklk</a>
Module paste.exceptions.errormiddleware:144 in __call__
&gt;  app_iter = self.application(environ, sr_checker)
Module paste.exceptions.errormiddleware:138 in __call__
&gt;  return self.application(environ, start_response)
Module paste.registry:350 in __call__
&gt;  app_iter = self.application(environ, start_response)

But Ubuntu? sudo apt-get install python-genshi works. Still want to get this working for the love/speed of c.

Perenzo says:

December 12, 2008 at 8:53 am

@john aman: Yeah, thanks for the hint. This seems to be a bug in the Beaker session management for GAE. Don’t know why they do not fix it: http://pylonshq.com/project/pylonshq/ticket/537

BTW. I tried to rewrite the example using Genshi and cssselector an it looks like this:
```
from genshi import HTML
from lxml import cssselect
import urllib2

stream = HTML(urllib2.urlopen('http://java.sun.com').read())
xpath = cssselect.css_to_xpath('div.pad a')
substream = stream.select(xpath)
print substream
```
Well the handling of the streams is not trivial but powerful: http://genshi.edgewall.org/wiki/Documentation/streams.html

john aman says:

December 12, 2008 at 1:54 pm

Ian, you and your readers rock!

From Stefan Behnel follow-up message (posting here for the doc value):

(env)john@ibex:~$   CFLAGS="$CFLAGS -fPIC" STATIC_DEPS=true easy_install "lxml&gt;=2.2alpha1"

And the results:

...
Building against libxml2/libxslt in the following directory: /tmp/easy_install-Nzp53T/lxml-2.2alpha1/build/tmp/libxml2/lib
Adding lxml 2.2alpha1 to easy-install.pth file

Installed /home/john/env/lib/python2.5/site-packages/lxml-2.2alpha1-py2.5-linux-x86_64.egg
Processing dependencies for lxml&gt;=2.2alpha1
Finished processing dependencies for lxml&gt;=2.2alpha1

sorry about the “& gt;”‘s should be lxml>= …

Kevin Dahlhausen says:

December 12, 2008 at 2:09 pm

Ian, thanks for posting about lxml. I hadn’t known about it and your post came just as I was starting a small screen-scraping project. Using CSS selectors in lxml is heavenly. It almost makes screen scraping fun. At any rate, it’s helping me make rapid progress.
Harry says:

December 15, 2008 at 3:41 pm

Is it possible to use make_links_absolute with pyquery? I’ve messed around with the library source but haven’t had any luck!

Cowmix says:

December 22, 2008 at 11:41 am

I tried the simple example:

from lxml.html import parse
doc = parse('http://java.sun.com').getroot()
for link in doc.cssselect('div.pad a'):
    print '%s: %s' % (link.text_content(), link.get('href'))

… and I got this error:

D:\Data\src\cornfluence&gt;python parsedb.py
Traceback (most recent call last):
  File "parsedb.py", line 14, in
    for link in doc.cssselect('div.pad a'):
AttributeError: 'NoneType' object has no attribute 'cssselect'

Noah Gift says:

January 10, 2009 at 4:10 am

That little tip on installing on OS X saved me a potential night of drama, as I was installing “Zine”.

STATICDEPS=true easyinstall ‘lxml>=2.2alpha1′

Thanks!
Grease says:

January 10, 2009 at 6:50 am

Regarding installing lxml … think I found a very easy way if you’re using ubuntu … search for lxml in the synaptic package manager , and the install is just a few clicks away :)

PS: the version available on synaptic may not be the very latest
Shawn Wheatley says:

March 1, 2009 at 9:25 pm

I’m on a Mac (10.5.6). I tried the suggested line with an added sudo to install in my site-packages:
```
STATIC_DEPS=true sudo easy_install 'lxml&gt;=2.2alpha1'
```
but it didn’t pull down libxml and libxslt. I ended up running it without sudo and it worked fine up until installation. Then I just re-ran with sudo and it grabbed the egg it just built. That did it, and now I have lxml installed. Thanks, Ian!

PS Ian: your Markdown link is 404ing right now…
Pascal says:

March 14, 2009 at 1:07 pm

Using “STATICDEPS=true easyinstall lxml” worked great on three Intel Macs using the python.org Python 2.5.4 — two of these Macs running 10.5 and one running 10.4.

Doing the same thing on a PPC Mac (10.4 and python.org 2.5.4) I had unfortunately no success. It fails with:

“ar cru .libs/testdso.a testdso.o ranlib .libs/testdso.a creating testdso.la (cd .libs && rm -f testdso.la && ln -s ../testdso.la testdso.la) gcc -DHAVECONFIGH -I. -I./include -I./include -DREENTRANT -arch ppc -arch i386 -isysroot /Developer/SDKs/MacOSX10.4u.sdk -O2 -pedantic -W -Wformat -Wunused -Wimplicit -Wreturn-type -Wswitch -Wcomment -Wtrigraphs -Wformat -Wchar-subscripts -Wuninitialized -Wparentheses -Wshadow -Wpointer-arith -Wcast-align -Wwrite-strings -Waggregate-return -Wstrict-prototypes -Wmissing-prototypes -Wnested-externs -Winline -Wredundant-decls -c xmllint.c /bin/sh ./libtool –tag=CC –mode=link gcc -arch ppc -arch i386 -isysroot /Developer/SDKs/MacOSX10.4u.sdk -O2 -pedantic -W -Wformat -Wunused -Wimplicit -Wreturn-type -Wswitch -Wcomment -Wtrigraphs -Wformat -Wchar-subscripts -Wuninitialized -Wparentheses -Wshadow -Wpointer-arith -Wcast-align -Wwrite-strings -Waggregate-return -Wstrict-prototypes -Wmissing-prototypes -Wnested-externs -Winline -Wredundant-decls -arch ppc -arch i386 -isysroot /Developer/SDKs/MacOSX10.4u.sdk -o xmllint xmllint.o ./libxml2.la -lpthread -lz -liconv -lm
gcc -arch ppc -arch i386 -isysroot /Developer/SDKs/MacOSX10.4u.sdk -O2 -pedantic -W -Wformat -Wunused -Wimplicit -Wreturn-type -Wswitch -Wcomment -Wtrigraphs -Wformat -Wchar-subscripts -Wuninitialized -Wparentheses -Wshadow -Wpointer-arith -Wcast-align -Wwrite-strings -Waggregate-return -Wstrict-prototypes -Wmissing-prototypes -Wnested-externs -Winline -Wredundant-decls -arch ppc -arch i386 -isysroot /Developer/SDKs/MacOSX10.4u.sdk -o xmllint xmllint.o ./.libs/libxml2.a -lpthread -lz /usr/lib/libiconv.dylib -lm
/usr/libexec/gcc/i686-apple-darwin8/4.0.1/ld: for architecture i386 /usr/libexec/gcc/i686-apple-darwin8/4.0.1/ld: warning /usr/lib/libiconv.dylib cputype (18, architecture ppc) does not match cputype (7) for specified -arch flag: i386 (file not loaded) /usr/libexec/gcc/i686-apple-darwin8/4.0.1/ld: Undefined symbols: _libiconv _libiconvclose libiconvopen collect2: ld returned 1 exit status lipo: can’t open input file: /var/tmp//ccm2sHDw.out (No such file or directory) make[2]: *** [xmllint] Error 1 make[1]: *** [all-recursive] Error 1 make: *** [all] Error 2
“

Any suggestions would be very much appreciated!

Thanks, Pascal
Web Developer - Shom says:

April 2, 2009 at 5:28 am

Thanks for this beautiful post. I haven’t try lxml till now because I was not sure whether it will be as useful as beautiful soap or not. But after reading this post I changed my mind and I will surely try lxml.
Pamela Scoot says:

April 9, 2009 at 5:52 am

Thank you for clarifying the reasons to use lxml . Many have doubt about lxml and rumor is going on that lxml is not comfortable to use comparing BeautifulSoup. But your post has given a urge to rethink over the matter.
JohnMc says:

June 25, 2009 at 5:26 pm

Ian,

first, thanks lxml is a hoot. I’ll grant that getting lxml install is a bit of a pain but not that bad. Using the STATIC_DEPS suggestion was a big help as well.

However I have noticed something working through your web scraping examples.

Working through this –

from lxml.html import parse doc = parse(‘http://java.sun.com’).getroot() for link in doc.cssselect(‘div.pad a’): print ‘%s: %s’ % (link.text_content(), link.get(‘href’))

I receive a failure on the parse().getroot() statement. However if I do the following —

import urllib from lxml.html import * content = urllib.urlopen(‘http://java.sun.com’).read() doc = fromstring(content) for link in doc.cssselect(‘div.pad a’): print ‘%s: %s’ % (link.text_content(), link.get(‘href’))

it works. Have you seen this behavior before?
- Ian Bicking says:
  
  June 25, 2009 at 6:32 pm
  
  No, not at all — I just tried it and it worked fine.
- mauro.degiorgi@gmail says:
  
  July 3, 2009 at 11:47 am
  
  You must insert the trailing slash:
  
  doc = parse(’http://java.sun.com/’).getroot()
  
  work :) dont ask me why…
jesse says:

August 25, 2009 at 2:58 pm

Has anybody looked at http://scrapy.org/.
Neutrino says:

October 16, 2009 at 1:57 am

Lxml is one of those Python libraries that should be really high profile given its usefulness. Sadly it is also one of those odd projects peculiar to open source that doesn’t have a forum and as a consequence receives little attention from anyone not prepared to maintain a database of mailing lists on every workstation they use for every single piece of software or technology subject they are interested in.
pornoizle says:

December 4, 2009 at 12:26 pm

i second that installing lxml can be a hassle; i only managed to make it run on windows by manually downloading and installing a binary package. definitely one great advantage of BeautifulSoup: a single fairly short *.py file, and you’re good to go.
Benjamin Sergeant says:

February 1, 2010 at 12:17 pm

I don’t know how much code goes in the cpython code but maybe if the division is well defined between C calls to libxml libs one could write some wrappers to simulate libxml calls using ElementTree which is one of python battery (not sure about the xpath support). Not a small project but something that could be usefull for say Google App Engine users or Jython/IronPython/Pypy users.
sikis says:

February 8, 2010 at 3:05 am

Nice summarization of lxml features, thanks.
Graeme Pietersz says:

April 29, 2010 at 6:25 am

Thanks for this: it encouraged me to use lxml. I used lxml to import data from HTML to a database (I should have been generating those pages from a database in the first place, but thats another story).

I used xml.sax for something similar (except it was an XML file that time) a few months ago, and lxml was much easier to work with.

No installation problems on Mandriva Linux — lxml was in the repos so a tick and a click was all that was needed. As Grease suggests, if you use an OS with a package manager, its the easiest way to install almost anything.
Herbert Roitblat says:

May 29, 2010 at 12:24 pm

I am trying to install lxml in an active virtual environment on Ubuntu 10.04 64-bit on a Dell xps, Python 2.6.5.

I used this command from the top of the virtual environment:

STATICDEPS=true bin/easyinstall lxml

This is the result I got:

make[1]: Leaving directory /tmp/easy_install-N4_xs4/lxml-2.2.6/build/tmp/libxslt-1.1.26' NOTE: Trying to build without Cython, pre-generated 'src/lxml/lxml.etree.c' needs to be available. Using build configuration of libxml2 2.7.7 and libxslt 1.1.26 Building against libxml2/libxslt in the following directory: /tmp/easy_install-N4_xs4/lxml-2.2.6/build/tmp/libxml2/lib /usr/bin/ld: /tmp/easy_install-N4_xs4/lxml-2.2.6/build/tmp/libxml2/lib/libxslt.a(xslt.o): relocation R_X86_64_32 against.rodata.str1.8′ can not be used when making a shared object; recompile with -fPIC /tmp/easyinstall-N4xs4/lxml-2.2.6/build/tmp/libxml2/lib/libxslt.a: could not read symbols: Bad value

I guess that something is missing. I installed Python-dev on my main system, created the virtual environment with –no-site-packages. If I need to install Python-dev in my virtual environment, I don’t know how to do it (apt-get tries to put it in the main installation, complains that I am not root). If I need something else, I would appreciate instructions on how to put it into my virtualenv.

By the way, what I am looking to do right now is to translate a dictionary object into xml. I currently translate it into JSON without a hitch, but I need to be able to support also translating it into xml. The structure consists of a key:value, the value consists of a list of a list of key-value pairs. Any thoughts on converting a dictionary to xml would also be welcome. That’s how I came to lxml.

Thanks so much.

Herb
manju says:

June 24, 2010 at 6:11 am

Hi all! i have installed lxml2.2.2 on windows platform(i m using python version 2.6.5).ive tried the code you have mentioned : “from lxml.html import parse p= parse(‘http://www.google.com’).getroot()”

but i am getting the following error:

Traceback (most recent call last): File “”, line 1, in p=parse(‘http://www.google.com’).getroot() File “C:\Python26\lib\site-packages\lxml-2.2.2-py2.6-win32.egg\lxml\html_init_.py”, line 661, in parse return etree.parse(filenameorurl, parser, baseurl=baseurl, **kw) File “lxml.etree.pyx”, line 2698, in lxml.etree.parse (src/lxml/lxml.etree.c:49590) File “parser.pxi”, line 1491, in lxml.etree.parseDocument (src/lxml/lxml.etree.c:71205) File “parser.pxi”, line 1520, in lxml.etree.parseDocumentFromURL (src/lxml/lxml.etree.c:71488) File “parser.pxi”, line 1420, in lxml.etree.parseDocFromFile (src/lxml/lxml.etree.c:70583) File “parser.pxi”, line 975, in lxml.etree.BaseParser.parseDocFromFile (src/lxml/lxml.etree.c:67736) File “parser.pxi”, line 539, in lxml.etree.ParserContext.handleParseResultDoc (src/lxml/lxml.etree.c:63820) File “parser.pxi”, line 625, in lxml.etree.handleParseResult (src/lxml/lxml.etree.c:64741) File “parser.pxi”, line 563, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:64056) IOError: Error reading file ‘http://www.google.com’: failed to load external entity “http://www.google.com”

i am clueless as to what to do next as i am a newbie to python. please guide me to solve this error. thanks in advance!! :)
manju says:

June 25, 2010 at 4:31 am

Hi Ian, in the first lxml example that you have given,I think instead of doc = parse(‘http://java.sun.com’).getroot() it should be

from urllib2 import urlopen doc=parse(urlopen(‘http://java.sun.com’)).getroot()

as parse does not fetch the website. As i have said, the first one is giving an error but the second one is working fine for me.
ddavout says:

October 10, 2010 at 5:49 am

Thanks a lot to help me to overcome my “fear’ in front of lxml. As a beginner I am quite happy to have managed to take all the data I needed that were sparsed on more than 400 web pages.(with the help of xmlstarlet to clean up manually some pages) No problem whatsoever to install with the debian packages. Thanks a lot

Ian Bicking: a blog

lxml: an underappreciated web scraping library

51 Comments

Home

About

Archives

Categories

Recent Posts

Recent Comments