Ian Bicking: a blog :: 2008

A Doctest Wishlist

Lately I’ve been doing most of my testing with doctest, primarily using stand-alone text files. I generally like it (otherwise I wouldn’t be using it), but it does frustrate me sometimes. On my wishlist (roughly in order):

I wish output was always displayed, even when there’s an exception. I see no reason for the current behavior. Really, exceptions could be treated like any other output (if ELLIPSIS was on by default).
I wish you could turn on options like ELLIPSIS from within a doctest, for all expressions. (# doctest: +ELLIPSIS on every line is beyond ugly.)
<BLANKLINE> is terribly ugly.
There’s no way of saying "I don’t care what this prints". You can’t do:

>>> some_function()
...

because the ... is treated like a continuation.
Plugging in an alternate output checker is kind of tedious, and can’t be done from within a doctest (without horrible hacks).
I’d like to be able to easily jump into an interactive state from doctest. Maybe pdb can do this, but I’ve never figured that out exactly.
Getting nose to run .txt files as doctests is really hard, involving a combination of options I always forget.
There’s no way to abort the doctest. Sometimes I’d like to run some environment checks early on, and be able to stop the test if they fail.
I wish it was easier to apply to non-Python code. (I’ve adapted it via subclassing for Logo but I wouldn’t do that often.)
I wish I could copy and paste from doctests to consoles. But I don’t see any solution to this problem.
The integration with unittest is pretty hacky. Not that I’ve used unittest in years. But some other test frameworks build off this integration.
python -m doctest sometest.txt doesn’t do what it should do. Instead it runs doctest’s self-tests.
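The complaint about repeating # doctest: +ELLIPSIS has a partial workaround on the runner side: option flags can be passed once when the tests are run. That doesn’t grant the wish (the setting still lives outside the doctest itself), but it keeps the directives out of the text. A minimal sketch, assuming Python 3; SAMPLE and run_with_ellipsis are made-up names for illustration:

```python
import doctest

# Hypothetical doctest text; in practice this would live in a stand-alone .txt file.
SAMPLE = '''
>>> print(list(range(20)))
[0, 1, ..., 19]
>>> object()
<object object at ...>
'''

def run_with_ellipsis(text, name="sample"):
    """Run the examples in `text` with ELLIPSIS switched on for every example."""
    parser = doctest.DocTestParser()
    test = parser.get_doctest(text, globs={}, name=name, filename=None, lineno=0)
    runner = doctest.DocTestRunner(verbose=False, optionflags=doctest.ELLIPSIS)
    # run() returns a (failed, attempted) result tuple
    return runner.run(test)

results = run_with_ellipsis(SAMPLE)
print(results)
```

For a stand-alone file the same flag can be given as doctest.testfile("sometest.txt", optionflags=doctest.ELLIPSIS).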

Making a proxy with WSGI and lxml

You can use WSGI to make rewriting middleware; WebOb specifically makes it easy to write. And that’s cool, but it’s more satisfying to use your middleware right away without having to think about writing applications that might live behind the middleware.

There are two libraries I’ll describe here to make that possible: paste.proxy, which sends WSGI requests out via HTTP, and lxml.html, which lets you rewrite the HTML to fix up the links.

To start, we need some kind of middleware whose effect is at least noticeable. How about something to make a word jumble of the page? We’ll use lxml for that as well:

from lxml import html
from random import shuffle

def jumble_words(doc):
    """Mixes up the words in an HTML document (doesn't touch tags or attributes)"""
    doc = html.fromstring(doc)
    # .text_content() gives the text without tags or attributes,
    # .body is the <body> tag:
    words = doc.body.text_content().split()
    shuffle(words)
    for el in doc.body.iterdescendants():
        # The ElementTree model puts all text in .text and .tail on elements, so that's
        # what we mix up:
        el.text = random_words(el.text, words)
        el.tail = random_words(el.tail, words)
    return html.tostring(doc)

def random_words(text, words):
    """Pulls some words from the list words, with the same number of words as
    the previous `text`"""
    # text can be None, so we need this test:
    if not text:
        return text
    word_count = len(text.split())
    try:
        return ' '.join(words.pop() for i in range(word_count))
    except IndexError:
        # This shouldn't happen, because we should have exactly
        # the right number of words, but just in case...
        return text
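The .text and .tail comment in jumble_words is easier to follow with a concrete tree in hand. lxml’s elements follow the ElementTree model, so this sketch uses the standard library’s xml.etree.ElementTree as a stand-in (an assumption made so it runs without lxml installed; it parses well-formed XML, not real-world HTML). The fragment itself is made up:

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment, well-formed so the stdlib XML parser accepts it.
p = ET.fromstring("<p>Hello <b>big</b> wide <i>world</i>!</p>")
b, i = p.find("b"), p.find("i")

# .text is the text right after an element's opening tag;
# .tail is the text right after its closing tag.
pieces = [("p.text", p.text), ("b.text", b.text), ("b.tail", b.tail),
          ("i.text", i.text), ("i.tail", i.tail)]
for name, value in pieces:
    print("%s = %r" % (name, value))

# Where there is no text at all the attribute is None, not an empty string.
empty = ET.fromstring("<p><b/></p>")
print(empty.text, empty.find("b").tail)
```

Every run of text in the document hangs off some element as either .text or .tail, which is why replacing just those two attributes touches all the words and none of the tags. The None case is the one random_words tests for.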

from webob import Request

class JumbleMiddleware(object):
    """Middleware that jumbles the words of HTML responses
    """
    # This __init__ and __call__ are the basic pattern for middleware:
    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        req = Request(environ)
        # We don't want 304 Not Modified responses, because we mix up the response
        # differently every time. So we'll make sure all the headers that could cause that
        # (If-Modified-Since, etc) are removed with .remove_conditional_headers():
        req.remove_conditional_headers()
        # This calls the application with the request, and then returns a response; this
        # is the typical pattern for response-modifying middleware using WebOb:
        resp = req.get_response(self.app)
        if resp.content_type == 'text/html':
            resp.body = jumble_words(resp.body)
        return resp(environ, start_response)
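For comparison, here is roughly what the same __init__/__call__ pattern looks like in bare WSGI, without WebOb to capture the response. This is a sketch under stated assumptions, not WebOb’s implementation: hello_app, UppercaseMiddleware and run are made-up names, the body is upper-cased instead of jumbled so the result is predictable, and duplicate headers and the legacy write() callable are ignored for brevity.

```python
from wsgiref.util import setup_testing_defaults

def hello_app(environ, start_response):
    """Stand-in application (hypothetical) that returns a small HTML page."""
    body = b"<html><body><p>hello world</p></body></html>"
    start_response("200 OK", [("Content-Type", "text/html"),
                              ("Content-Length", str(len(body)))])
    return [body]

class UppercaseMiddleware(object):
    """Same shape as JumbleMiddleware, written against plain WSGI."""
    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        captured = {}

        def capture(status, headers, exc_info=None):
            # Hold on to status and headers until the body has been rewritten.
            captured["status"] = status
            captured["headers"] = headers
            return lambda data: None  # data sent through write() is dropped here

        app_iter = self.app(environ, capture)
        try:
            body = b"".join(app_iter)
        finally:
            if hasattr(app_iter, "close"):
                app_iter.close()
        headers = dict(captured["headers"])  # simplification: loses repeated headers
        if headers.get("Content-Type", "").startswith("text/html"):
            body = body.upper()
        headers["Content-Length"] = str(len(body))
        start_response(captured["status"], list(headers.items()))
        return [body]

def run(app):
    """Call a WSGI app once with a synthetic GET request."""
    environ = {}
    setup_testing_defaults(environ)
    seen = {}
    def start_response(status, headers, exc_info=None):
        seen["status"] = status
        seen["headers"] = headers
    body = b"".join(app(environ, start_response))
    return seen["status"], dict(seen["headers"]), body

status, headers, body = run(UppercaseMiddleware(hello_app))
print(status, body)
```

Most of the class is bookkeeping around start_response, and that bookkeeping is the part req.get_response(self.app) takes care of in the WebOb version.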

Well, you don’t really need to jumble up your own pages, right? Much more fun to jumble other people’s pages. Enter the proxy. Here’s a basic proxy:

from paste.proxy import Proxy
# We use this to make sure we didn't mess up anything with JumbleMiddleware;
# the validator checks for many WSGI requirements:
from wsgiref.validate import validator
import sys

def main():
    proxy_url = sys.argv[1]
    app = JumbleMiddleware(
        Proxy(proxy_url))
    app = validator(app)
    from paste.httpserver import serve
    serve(app, 'localhost', 8080)

if __name__ == '__main__':
    main()

If you look at the full source the command-line is a bit fancier, but it’s all obvious stuff.
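To get a feel for what the validator wrapper in main() buys you, here is a small sketch that runs one well-behaved app and one broken app through wsgiref.validate.validator with a synthetic request. It assumes Python 3, where response bodies must be bytes; good_app, bad_app and call are made-up names:

```python
from wsgiref.util import setup_testing_defaults
from wsgiref.validate import validator

def good_app(environ, start_response):
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"ok"]

def bad_app(environ, start_response):
    start_response("200 OK", [("Content-Type", "text/plain")])
    return ["not bytes"]  # violates WSGI: body chunks must be bytes

def call(app):
    """Invoke a WSGI app once with a synthetic GET request and return its body."""
    environ = {}
    setup_testing_defaults(environ)
    environ["QUERY_STRING"] = ""  # the validator warns when this key is missing

    def start_response(status, headers, exc_info=None):
        return lambda data: None

    result = app(environ, start_response)
    try:
        return b"".join(result)
    finally:
        # The validator also checks that the response iterator gets closed.
        result.close()

print(call(validator(good_app)))
try:
    call(validator(bad_app))
except AssertionError as exc:
    print("validator complained:", exc)
```

The validator reports violations by raising AssertionError, so a mistake in middleware shows up as a traceback on the first request.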

OK, so this will work, but the links will often be broken unless the server only gives relative links. But you can rewrite the links using lxml…

import urlparse

class LinkRewriterMiddleware(object):
    """Rewrites the response, assuming the HTML was generated as though based at
    `dest_href`, and needs to be rewritten for the incoming request"""

    # The normal __init__, __call__ pattern:
    def __init__(self, app, dest_href):
        self.app = app
        if dest_href.endswith('/'):
            dest_href = dest_href[:-1]
        self.dest_href = dest_href

    def __call__(self, environ, start_response):
        req = Request(environ)
        # .path_info (aka environ['PATH_INFO']) is the path of the request
        # (URL rewriting doesn't really have to care about query strings)
        dest_path = req.path_info
        dest_href = self.dest_href + dest_path
        # req.application_url is the base URL not including path_info or the query string:
        req_href = req.application_url
        def link_repl_func(link):
            # Resolve the link relative to the page it appeared on:
            link = urlparse.urljoin(dest_href, link)
            # Compare against the site's base URL (not this page's URL), so that
            # links to other pages on the proxied site get rewritten too:
            if not link.startswith(self.dest_href):
                # Not a local link
                return link
            # self.dest_href has its trailing slash stripped, so the path
            # remainder keeps its leading '/':
            new_url = req_href + link[len(self.dest_href):]
            return new_url
        resp = req.get_response(self.app)
        # This decodes any possible gzipped content:
        resp.decode_content()
        if (resp.status_int == 200
            and resp.content_type == 'text/html'):
            doc = html.fromstring(resp.body, base_url=dest_href)
            doc.rewrite_links(link_repl_func)
            resp.body = html.tostring(doc)
        # Redirects need their redirect locations rewritten:
        if resp.location:
            resp.location = link_repl_func(resp.location)
        return resp(environ, start_response)
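The link-rewriting step can be tried on its own, without a proxy or lxml. This is a stand-alone sketch of the idea rather than the middleware’s exact code: it uses Python 3’s urllib.parse, compares each link against the site’s base URL, and every name in it (make_link_rewriter, site_href, page_href, proxy_href) is hypothetical.

```python
from urllib.parse import urljoin

def make_link_rewriter(site_href, page_href, proxy_href):
    """Return a function that maps links found on page_href to proxy URLs.

    site_href is the proxied site's base URL, page_href the page being
    rewritten, and proxy_href the base URL of the local proxy.
    """
    site_href = site_href.rstrip("/")
    proxy_href = proxy_href.rstrip("/")

    def rewrite(link):
        # Resolve relative links against the page they appeared on.
        absolute = urljoin(page_href, link)
        if absolute != site_href and not absolute.startswith(site_href + "/"):
            return absolute  # off-site link, left alone
        # Swap the site's base URL for the proxy's and keep the rest.
        return proxy_href + absolute[len(site_href):]

    return rewrite

rewrite = make_link_rewriter(
    "http://example.com/",
    "http://example.com/docs/index.html",
    "http://localhost:8080")

for link in ["page2.html", "/about", "../img/a.png", "http://other.org/x"]:
    print("%-22s -> %s" % (link, rewrite(link)))
```

Relative and site-absolute links both end up pointing at the proxy, while links to other hosts pass through untouched.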

Then we rewire the application:

app = JumbleMiddleware(
    LinkRewriterMiddleware(Proxy(proxy_url), proxy_url))

Now there’s a fun little proxy for you to play with. You can see the code here.

July 2008

A Doctest Wishlist

2008 07 31

Making a proxy with WSGI and lxml

2008 07 30

Me In Berlin & Amsterdam

2008 07 28
