You can use WSGI to make rewriting
middleware;
WebOb
specifically makes it easy to write.
And that’s cool, but
it’s more satisfying to use
your middleware right away without
having to think about writing
applications that might live behind
the middleware.
There’s two libraries
I’ll describe here to make
that possible:
paste.proxy
to send WSGI requests out via HTTP,
and
lxml.html
which lets you rewrite the HTML to
fix up the links.
To start, we need some kind of
middleware that at least is
noticeable. How about something to
make a word jumble of the page?
We’ll use lxml as well:
from lxml
import
html
from
random
import
shuffle
def
jumble_words(doc):
"""Mixes up the words in an
HTML document (doesn't touch
tags or attributes)"""
doc = html.fromstring(doc)
# .text_content() gives the
text without tags or
attributes,
# .body is the <body>
tag:
words = doc.body.text_content().split()
shuffle(words)
for el
in
doc.body.iterdescendants():
# The ElementTree model puts
all text in .text and .tail on
elements, so that's
# what we mix up:
el.text
= random_words(el.text,
words)
el.tail
= random_words(el.tail,
words)
return
html.tostring(doc)
def
random_words(text, words):
"""Pulls some words from the
list words, with the same number
of words in
the previous
`text`"""
# text can be None, so we need
this test:
if
not
text:
return
text
word_count =
len(text.split())
try:
return
' '.join(words.pop()
for i
in
range(word_count))
except
IndexError:
# This shouldn't happen,
because we should have
exactly
# the right number of words,
but just in case...
return
text
from webob
import
Request
class
JumbleMiddleware(object):
"""Middleware that jumbles
the words of HTML responses
"""
# This __init__ and __call__
are the basic pattern for
middleware:
def
__init__(self,
app):
self.app
= app
def
__call__(self,
environ, start_response):
req =
Request(environ)
# We don't want 304 Not
Modified responses, because we
mix up the response
# differently every time.
So we'll make sure all the
headers that could call
that
# (If-Modified-Since, etc) are
removed with
.remove_conditional_headers():
req.remove_conditional_headers()
# This calls the application
with the request, and then
returns a response; this
# is the typical pattern for
response-modifying middleware
using WebOb:
resp =
req.get_response(self.app)
if
resp.content_type
==
'text/html':
resp.body
= jumble_words(resp.body)
return
resp(environ, start_response)
Well, you don’t really need to
jumble up your own pages,
right? Much more fun to jumble other
people’s pages. Enter the
proxy. Here’s a basic proxy:
from
paste.proxy
import
Proxy
# We use this to make sure we
didn't mess up anything with
JumbleMiddleware;
# the validator checks for many
WSGI requirements:
from
wsgiref.validate
import
validator
import
sys
def
main():
proxy_url =
sys.argv[1]
app =
JumbleMiddleware(
Proxy(proxy_url))
app = validator(app)
from
paste.httpserver
import
serve
serve(app,
'localhost', 8080)
if __name__
==
'__main__':
main()
If you look at the
full source
the command-line is a bit fancier,
but it’s all obvious stuff.
OK, so this will work, but the links
will often be broken unless the
server only gives relative links.
But you can rewrite the links using
lxml…
import
urlparse
class
LinkRewriterMiddleware(object):
"""Rewrites the response,
assuming the HTML was generated
as though based at
`dest_href`, and
needs to be rewritten for the
incoming request"""
# The normal __init__, __call__
pattern:
def
__init__(self, app,
dest_href):
self.app
= app
if
dest_href.endswith('/'):
dest_href = dest_href[:-1]
self.dest_href
= dest_href
def
__call__(self,
environ, start_response):
req =
Request(environ)
# .path_info (aka
environ['PATH_INFO']) is the
path of the request
# (URL rewriting doesn't really
have to care about query
strings)
dest_path = req.path_info
dest_href =
self.dest_href
+ dest_path
# req.application_url is the
base URL not including path_info
or the query string:
req_href
= req.application_url
def
link_repl_func(link):
link =
urlparse.urljoin(dest_href, link)
if
not
link.startswith(dest_href):
# Not a local link
return
link
new_url = req_href +
'/' +
link[len(dest_href):]
return
new_url
resp =
req.get_response(self.app)
# This decodes any possible
gzipped content:
resp.decode_content()
if
(resp.status_int
== 200
and
resp.content_type
==
'text/html'):
doc = html.fromstring(resp.body,
base_url=dest_href)
doc.rewrite_links(link_repl_func)
resp.body
= html.tostring(doc)
# Redirects need their redirect
locations rewritten:
if
resp.location:
resp.location
= link_repl_func(resp.location)
return
resp(environ, start_response)
Then we rewire the application:
app = JumbleMiddleware(
LinkRewriterMiddleware(Proxy(proxy_url), proxy_url))
Now there’s a fun little proxy
for you to play with. You can see
the code
here.