Ian Bicking: a blog :: Making a proxy with WSGI and lxml

{ 2008 07 30 }

Making a proxy with WSGI and lxml

You can use WSGI to make rewriting middleware; WebOb specifically makes it easy to write. And that’s cool, but it’s more satisfying to use your middleware right away without having to think about writing applications that might live behind the middleware.

There’s two libraries I’ll describe here to make that possible: paste.proxy to send WSGI requests out via HTTP, and lxml.html which lets you rewrite the HTML to fix up the links.

To start, we need some kind of middleware that at least is noticeable. How about something to make a word jumble of the page? We’ll use lxml as well:

from lxml import html
from random import shuffle

def jumble_words(doc):
"""Mixes up the words in an HTML document (doesn't touch tags or attributes)"""
doc = html.fromstring(doc)
# .text_content() gives the text without tags or attributes,
# .body is the <body> tag:
words = doc.body.text_content().split()
shuffle(words)
for el in doc.body.iterdescendants():
# The ElementTree model puts all text in .text and .tail on elements, so that's
# what we mix up:
el.text = random_words(el.text, words)
el.tail = random_words(el.tail, words)
return html.tostring(doc)

def random_words(text, words):
"""Pulls some words from the list words, with the same number of words in
the previous `text`"""
# text can be None, so we need this test:
if not text:
return text
word_count = len(text.split())
try:
return ' '.join(words.pop() for i in range(word_count))
except IndexError:
# This shouldn't happen, because we should have exactly
# the right number of words, but just in case...
return text

from webob import Request

class JumbleMiddleware(object):
"""Middleware that jumbles the words of HTML responses
"""
# This __init__ and __call__ are the basic pattern for middleware:
def __init__(self, app):
self.app = app
def __call__(self, environ, start_response):
req = Request(environ)
# We don't want 304 Not Modified responses, because we mix up the response
# differently every time. So we'll make sure all the headers that could call that
# (If-Modified-Since, etc) are removed with .remove_conditional_headers():
req.remove_conditional_headers()
# This calls the application with the request, and then returns a response; this
# is the typical pattern for response-modifying middleware using WebOb:
resp = req.get_response(self.app)
if resp.content_type == 'text/html':
resp.body = jumble_words(resp.body)
return resp(environ, start_response)

Well, you don’t really need to jumble up your own pages, right? Much more fun to jumble other people’s pages. Enter the proxy. Here’s a basic proxy:

from paste.proxy import Proxy
# We use this to make sure we didn't mess up anything with JumbleMiddleware;
# the validator checks for many WSGI requirements:
from wsgiref.validate import validator
import sys

def main():
proxy_url = sys.argv[1]
app = JumbleMiddleware(
Proxy(proxy_url))
app = validator(app)
from paste.httpserver import serve
serve(app, 'localhost', 8080)

if __name__ == '__main__':
main()

If you look at the full source the command-line is a bit fancier, but it’s all obvious stuff.

OK, so this will work, but the links will often be broken unless the server only gives relative links. But you can rewrite the links using lxml…

import urlparse

class LinkRewriterMiddleware(object):
"""Rewrites the response, assuming the HTML was generated as though based at
`dest_href`, and needs to be rewritten for the incoming request"""

# The normal __init__, __call__ pattern:
def __init__(self, app, dest_href):
self.app = app
if dest_href.endswith('/'):
dest_href = dest_href[:-1]
self.dest_href = dest_href

def __call__(self, environ, start_response):
req = Request(environ)
# .path_info (aka environ['PATH_INFO']) is the path of the request
# (URL rewriting doesn't really have to care about query strings)
dest_path = req.path_info
dest_href = self.dest_href + dest_path
# req.application_url is the base URL not including path_info or the query string:
req_href = req.application_url
def link_repl_func(link):
link = urlparse.urljoin(dest_href, link)
if not link.startswith(dest_href):
# Not a local link
return link
new_url = req_href + '/' + link[len(dest_href):]
return new_url
resp = req.get_response(self.app)
# This decodes any possible gzipped content:
resp.decode_content()
if (resp.status_int == 200
and resp.content_type == 'text/html'):
doc = html.fromstring(resp.body, base_url=dest_href)
doc.rewrite_links(link_repl_func)
resp.body = html.tostring(doc)
# Redirects need their redirect locations rewritten:
if resp.location:
resp.location = link_repl_func(resp.location)
return resp(environ, start_response)

Then we rewire the application:

app = JumbleMiddleware(
LinkRewriterMiddleware(Proxy(proxy_url), proxy_url))

Now there’s a fun little proxy for you to play with. You can see the code here.

Automatically generated list of related posts:

lxml: an underappreciated web scraping library When people think about web scraping in Python, they usually...
lxml.html Over the summer I did quite a bit of work...
Inverted Partials I was talking with a coworker some time ago about...
WebOb decorator Lately I’ve been writing a few applications (e.g., PickyWiki and...
Decorators and Descriptors So, decorators are neat (maybe check out a new tutorial...

3 Comments

Martijn Faassen says:

July 30, 2008 at 11:36 am

Cool!

What’s the difference by the way between paste.proxy and WSGIProxy? Which one should people be using?
Ian Bicking says:

July 30, 2008 at 11:41 am

I need to clean WSGIProxy up, I’m afraid. I use wsgiproxy.exactproxy.proxy_exact_request a lot, but it requires more work to use because you have to modify the request in-place before sending it on. So for now I’d recommend paste.proxy.Proxy for this particular use case.
Quảng bá website says:

August 13, 2008 at 1:46 am

Quảng bá website – Thiết kế website – Thiết kế website đẹp!

Ian Bicking: a blog

Making a proxy with WSGI and lxml

3 Comments

Home

About

Archives

Categories

Recent Posts

Recent Comments