You can use WSGI to make rewriting middleware; WebOb specifically makes it easy to write. And that’s cool, but it’s more satisfying to use your middleware right away without having to think about writing applications that might live behind the middleware.
There’s two libraries I’ll describe here to make that possible: paste.proxy to send WSGI requests out via HTTP, and lxml.html which lets you rewrite the HTML to fix up the links.
To start, we need some kind of middleware that at least is noticeable. How about something to make a word jumble of the page? We’ll use lxml as well:
from lxml import html from random import shuffle def jumble_words(doc): """Mixes up the words in an HTML document (doesn't touch tags or attributes)""" doc = html.fromstring(doc) # .text_content() gives the text without tags or attributes, # .body is the <body> tag: words = doc.body.text_content().split() shuffle(words) for el in doc.body.iterdescendants(): # The ElementTree model puts all text in .text and .tail on elements, so that's # what we mix up: el.text = random_words(el.text, words) el.tail = random_words(el.tail, words) return html.tostring(doc) def random_words(text, words): """Pulls some words from the list words, with the same number of words in the previous `text`""" # text can be None, so we need this test: if not text: return text word_count = len(text.split()) try: return ' '.join(words.pop() for i in range(word_count)) except IndexError: # This shouldn't happen, because we should have exactly # the right number of words, but just in case... return text from webob import Request class JumbleMiddleware(object): """Middleware that jumbles the words of HTML responses """ # This __init__ and __call__ are the basic pattern for middleware: def __init__(self, app): self.app = app def __call__(self, environ, start_response): req = Request(environ) # We don't want 304 Not Modified responses, because we mix up the response # differently every time. So we'll make sure all the headers that could call that # (If-Modified-Since, etc) are removed with .remove_conditional_headers(): req.remove_conditional_headers() # This calls the application with the request, and then returns a response; this # is the typical pattern for response-modifying middleware using WebOb: resp = req.get_response(self.app) if resp.content_type == 'text/html': resp.body = jumble_words(resp.body) return resp(environ, start_response)
Well, you don’t really need to jumble up your own pages, right? Much more fun to jumble other people’s pages. Enter the proxy. Here’s a basic proxy:
from paste.proxy import Proxy # We use this to make sure we didn't mess up anything with JumbleMiddleware; # the validator checks for many WSGI requirements: from wsgiref.validate import validator import sys def main(): proxy_url = sys.argv[1] app = JumbleMiddleware( Proxy(proxy_url)) app = validator(app) from paste.httpserver import serve serve(app, 'localhost', 8080) if __name__ == '__main__': main()
If you look at the full source the command-line is a bit fancier, but it’s all obvious stuff.
OK, so this will work, but the links will often be broken unless the server only gives relative links. But you can rewrite the links using lxml…
import urlparse class LinkRewriterMiddleware(object): """Rewrites the response, assuming the HTML was generated as though based at `dest_href`, and needs to be rewritten for the incoming request""" # The normal __init__, __call__ pattern: def __init__(self, app, dest_href): self.app = app if dest_href.endswith('/'): dest_href = dest_href[:-1] self.dest_href = dest_href def __call__(self, environ, start_response): req = Request(environ) # .path_info (aka environ['PATH_INFO']) is the path of the request # (URL rewriting doesn't really have to care about query strings) dest_path = req.path_info dest_href = self.dest_href + dest_path # req.application_url is the base URL not including path_info or the query string: req_href = req.application_url def link_repl_func(link): link = urlparse.urljoin(dest_href, link) if not link.startswith(dest_href): # Not a local link return link new_url = req_href + '/' + link[len(dest_href):] return new_url resp = req.get_response(self.app) # This decodes any possible gzipped content: resp.decode_content() if (resp.status_int == 200 and resp.content_type == 'text/html'): doc = html.fromstring(resp.body, base_url=dest_href) doc.rewrite_links(link_repl_func) resp.body = html.tostring(doc) # Redirects need their redirect locations rewritten: if resp.location: resp.location = link_repl_func(resp.location) return resp(environ, start_response)
Then we rewire the application:
app = JumbleMiddleware( LinkRewriterMiddleware(Proxy(proxy_url), proxy_url))
Now there’s a fun little proxy for you to play with. You can see the code here.