Making a proxy with WSGI and lxml

You can use WSGI to make rewriting middleware; WebOb specifically makes it easy to write. And that’s cool, but it’s more satisfying to use your middleware right away without having to think about writing applications that might live behind the middleware.

There’s two libraries I’ll describe here to make that possible: paste.proxy to send WSGI requests out via HTTP, and lxml.html which lets you rewrite the HTML to fix up the links.

To start, we need some kind of middleware that at least is noticeable. How about something to make a word jumble of the page? We’ll use lxml as well:


from lxml import html
from random import shuffle

def jumble_words(doc):
    """Mixes up the words in an HTML document (doesn't touch tags or attributes)"""
    doc = html.fromstring(doc)
    # .text_content() gives the text without tags or attributes,
    # .body is the <body> tag:
    words = doc.body.text_content().split()
    shuffle(words)
    for el in doc.body.iterdescendants():
        # The ElementTree model puts all text in .text and .tail on elements, so that's
        # what we mix up:
        el.text = random_words(el.text, words)
        el.tail = random_words(el.tail, words)
    return html.tostring(doc)

def random_words(text, words):
    """Pulls some words from the list words, with the same number of words in
    the previous `text`"
""
    # text can be None, so we need this test:
    if not text:
        return text
    word_count = len(text.split())
    try:
        return ' '.join(words.pop() for i in range(word_count))
    except IndexError:
        # This shouldn't happen, because we should have exactly
        # the right number of words, but just in case...
        return text

from webob import Request

class JumbleMiddleware(object):
    """Middleware that jumbles the words of HTML responses
    "
""
    # This __init__ and __call__ are the basic pattern for middleware:
    def __init__(self, app):
        self.app = app
    def __call__(self, environ, start_response):
        req = Request(environ)
        # We don't want 304 Not Modified responses, because we mix up the response
        # differently every time.  So we'll make sure all the headers that could call that
        # (If-Modified-Since, etc) are removed with .remove_conditional_headers():
        req.remove_conditional_headers()
        # This calls the application with the request, and then returns a response; this
        # is the typical pattern for response-modifying middleware using WebOb:
        resp = req.get_response(self.app)
        if resp.content_type == 'text/html':
            resp.body = jumble_words(resp.body)
        return resp(environ, start_response)
 

Well, you don’t really need to jumble up your own pages, right? Much more fun to jumble other people’s pages. Enter the proxy. Here’s a basic proxy:


from paste.proxy import Proxy
# We use this to make sure we didn't mess up anything with JumbleMiddleware;
# the validator checks for many WSGI requirements:
from wsgiref.validate import validator
import sys

def main():
    proxy_url = sys.argv[1]
    app = JumbleMiddleware(
        Proxy(proxy_url))
    app = validator(app)
    from paste.httpserver import serve
    serve(app, 'localhost', 8080)

if __name__ == '__main__':
    main()
 

If you look at the full source the command-line is a bit fancier, but it’s all obvious stuff.

OK, so this will work, but the links will often be broken unless the server only gives relative links. But you can rewrite the links using lxml…


import urlparse

class LinkRewriterMiddleware(object):
    """Rewrites the response, assuming the HTML was generated as though based at
    `dest_href`, and needs to be rewritten for the incoming request"
""

    # The normal __init__, __call__ pattern:
    def __init__(self, app, dest_href):
        self.app = app
        if dest_href.endswith('/'):
            dest_href = dest_href[:-1]
        self.dest_href = dest_href

    def __call__(self, environ, start_response):
        req = Request(environ)
        # .path_info (aka environ['PATH_INFO']) is the path of the request
        # (URL rewriting doesn't really have to care about query strings)
        dest_path = req.path_info
        dest_href = self.dest_href + dest_path
        # req.application_url is the base URL not including path_info or the query string:
        req_href = req.application_url
        def link_repl_func(link):
            link = urlparse.urljoin(dest_href, link)
            if not link.startswith(dest_href):
                # Not a local link
                return link
            new_url = req_href + '/' + link[len(dest_href):]
            return new_url
        resp = req.get_response(self.app)
        # This decodes any possible gzipped content:
        resp.decode_content()
        if (resp.status_int == 200
            and resp.content_type == 'text/html'):
            doc = html.fromstring(resp.body, base_url=dest_href)
            doc.rewrite_links(link_repl_func)
            resp.body = html.tostring(doc)
        # Redirects need their redirect locations rewritten:
        if resp.location:
            resp.location = link_repl_func(resp.location)
        return resp(environ, start_response)
 

Then we rewire the application:


app = JumbleMiddleware(
    LinkRewriterMiddleware(Proxy(proxy_url), proxy_url))
 

Now there’s a fun little proxy for you to play with. You can see the code here.

Automatically generated list of related posts:

  1. lxml: an underappreciated web scraping library When people think about web scraping in Python, they usually...
  2. lxml.html Over the summer I did quite a bit of work...
  3. Inverted Partials I was talking with a coworker some time ago about...
  4. WebOb decorator Lately I’ve been writing a few applications (e.g., PickyWiki and...
  5. Decorators and Descriptors So, decorators are neat (maybe check out a new tutorial...