Since I'm doing some HTTP proxying I've been thinking about how it should work. And I really don't know -- it's all very vague. Much more vague than it should be. HTTP won't be a pipe if we have to rely on ad hoc configuration everywhere.
Some of the issues...
Some information gets covered up. For instance, in many HTTP proxying situations the Host header is lost; it gets changed to localhost or something else that doesn't accurately represent the initial request. I guess there's an informal standard that X-Forwarded-Host contains the original Host value (X-Forwarded-Server, confusingly, holds the proxy's own server name).
Also, the remote IP address is lost, but there's a convention to put that information in X-Forwarded-For. This seems much less interesting than Host to me, but it's much more widely supported. Go figure.
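As a rough sketch of what the backend can do with those headers, here's a minimal WSGI middleware that restores REMOTE_ADDR and HTTP_HOST from X-Forwarded-For and X-Forwarded-Host. The class name is made up, and blindly trusting these headers is only safe if the proxy is the only thing that can reach the backend (more on that below).

    class ProxyFixup:
        """Sketch: restore client info the proxy forwarded in headers."""

        def __init__(self, app):
            self.app = app

        def __call__(self, environ, start_response):
            forwarded_for = environ.get('HTTP_X_FORWARDED_FOR')
            if forwarded_for:
                # The original client is the first address in the list.
                environ['REMOTE_ADDR'] = forwarded_for.split(',')[0].strip()
            forwarded_host = environ.get('HTTP_X_FORWARDED_HOST')
            if forwarded_host:
                environ['HTTP_HOST'] = forwarded_host
            return self.app(environ, start_response)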
I don't know of any standards that exist for remapping request paths. I guess in theory you shouldn't need to do this, but in practice it is pretty common. For instance, let's say you want to map /blog/* to localhost:9999, where your blog app is running in a separate process. Do you preserve the full path? Do you duplicate any configuration for path mappings or virtual host settings in that server on port 9999? All too often the target server isn't very cooperative about this, and various hacks emerge to work around the problems. Ideally I think it would be good to give three pieces of information for the path:

1. The path as the original client requested it.
2. The portion of the path that identifies the application (the prefix the proxy consumed).
3. The remaining portion of the path, for the application itself to interpret.
2 and 3 are similar to SCRIPT_NAME and PATH_INFO. 1 is something new. HTTP always leaves these all squished together.
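There's no standard way to carry that split over plain HTTP, but a proxy and backend could agree on headers for it. A sketch, using made-up header names (X-Script-Name for item 2 and X-Original-Path for item 1), with the backend fixing up its WSGI environment:

    class PathFixup:
        """Sketch: rebuild SCRIPT_NAME/PATH_INFO from hypothetical proxy headers."""

        def __init__(self, app):
            self.app = app

        def __call__(self, environ, start_response):
            script_name = environ.get('HTTP_X_SCRIPT_NAME')
            if script_name:
                environ['SCRIPT_NAME'] = script_name
                path = environ.get('PATH_INFO', '')
                # If the proxy preserved the full path, strip the prefix off.
                if path.startswith(script_name):
                    environ['PATH_INFO'] = path[len(script_name):]
            # Keep the client-visible path around for generating links later.
            original_path = environ.get('HTTP_X_ORIGINAL_PATH')
            if original_path:
                environ['proxy.original_path'] = original_path
            return self.app(environ, start_response)

So for a request to /blog/archive proxied to localhost:9999, the proxy would send X-Script-Name: /blog, and the app would end up with SCRIPT_NAME=/blog and PATH_INFO=/archive.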
Also: potentially the request path has nothing to do with the original path or domain. This can happen when you are aggregating pieces of data from many sources (e.g., using SSIs, which fetch pieces of content as subrequests, or HInclude, which composes content in the client). If the output is HTML, then that HTML needs to be written either with no assumptions about what URL it will be rendered under (i.e., all links are fully qualified), or it needs to be smart about the real context it will be rendered in. How to write relocatable HTML is a separate issue, but there also don't seem to be any conventions for telling the web app about the indirection that is happening.
There's a lot more information that can be passed through. For instance, the upstream server may have authenticated the request already. How does it pass that information through? Maybe there's other ad hoc information. For instance, consider this rewrite rule:
    RewriteCond %{HTTP:Host} ^(.*)\.myblogs\.com$
    RewriteRule (.*) http://localhost:9999/blogapp/$1?username=%1 [P,QSA]
If you aren't familiar with mod_rewrite, this tells Apache to take a request like http://bob.myblogs.com/archive?month=1 and forward it to http://localhost:9999/blogapp/archive?username=bob&month=1.
This kind of works, but is clearly hacky.
Ideally we would pass a header like X-Blog-Username: bob. But that opens up other issues...
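On the backend, reading that header (with the query-string hack as a fallback) might look like the sketch below; X-Blog-Username and the helper name are just illustrative:

    from urllib.parse import parse_qs

    def get_blog_username(environ):
        # Prefer the explicit header set by the proxy...
        username = environ.get('HTTP_X_BLOG_USERNAME')
        if username:
            return username
        # ...and fall back to the username=... query parameter hack.
        query = parse_qs(environ.get('QUERY_STRING', ''))
        return query.get('username', [None])[0]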
We can start adding headers willy-nilly, and that's actually okay, but it opens up security concerns. If you aren't certain that only trusted clients can access your backend server, can you really trust the headers? It's no good if anyone can connect to your server with X-Remote-User: admin and have you trust that information. With no concept of trusted and untrusted headers, we have to rely on ad hoc configuration for security. That is difficult to set up and maintain, and doing it wrong can lead to a very insecure setup.
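One ad hoc way to be explicit about the boundary, sketched below: the proxy drops any incoming header whose name it reserves for itself before adding its own values, so the backend knows those names can only have come from the proxy. The header list here is just an example.

    # Header names the proxy reserves for itself; anything arriving from the
    # outside world under these names is dropped before the request is proxied.
    TRUSTED_HEADERS = {'x-forwarded-for', 'x-forwarded-host',
                       'x-remote-user', 'x-blog-username'}

    def clean_client_headers(headers):
        """headers: list of (name, value) pairs from the incoming client request."""
        return [(name, value) for name, value in headers
                if name.lower() not in TRUSTED_HEADERS]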
The previous issues can all be resolved with conventions about new HTTP headers. This one is much harder.
I'd like to use HTTP like a pipe. Really! None of the issues I've brought up are new, but they also aren't well answered despite their age. In comparison, FastCGI and SCGI actually address most of these problems right now.
If we're going to use HTTP this way -- and there are great reasons to start doing this -- we need to work harder at coming up with good answers for these kinds of issues.
I can only see two viable options for trusting data in headers/URLs: share a secret and sign the header with it (a simple example would be just to set the Authorization header for the backend, but that requires that the backend have some sort of authentication system in place to check the data), or only allow requests from trusted servers/ports.
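A sketch of the shared-secret option, assuming the proxy and backend both know SECRET and the proxy sends the signature in a companion header (the names here are invented):

    import hmac
    import hashlib

    SECRET = b'shared-between-proxy-and-backend'  # example value only

    def sign_header(value):
        # Proxy side: send e.g. X-Remote-User plus X-Remote-User-Signature.
        return hmac.new(SECRET, value.encode('utf-8'), hashlib.sha256).hexdigest()

    def verify_header(value, signature):
        # Backend side: only trust the header if the signature checks out.
        return hmac.compare_digest(sign_header(value), signature)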
The security problem doesn't seem to have much to do with HTTP headers, really. If your back-end server is open to the network, the entire content of each request is suspect, right?
You need authentication, and you need to secure the channel against both snooping and tampering. Each of these three facets of the problem has to be enforced in some protocol layer.
Take authentication. You need to: (a) stack HTTP on top of a network-level protocol that authenticates; OR (b) use some hypothetical flavor of HTTP authentication that's actually secure; OR (c) stack another authentication-capable protocol on top of HTTP--you know, SOAP with digital signatures or whatever they're doing these days.
(b) doesn't exist, and everyone knows (c) is immoral, so take (a). This is the "don't do that" solution. Just do exactly what you said you weren't going to do, and firewall off the back-end server. There are several ways to do it--SSL; SSH (?); checking getpeername(); moving the back-end server to another machine behind a physical firewall; VMware... some of which have performance problems, and some of which are only secure given certain prerequisites. You can pick one way that will work everywhere; or you can have your system choose the best way dynamically; or you can push the burden of configuring all the pieces to talk together onto the user.
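The getpeername() flavor of this can be as crude as the backend refusing anything that doesn't come from a known proxy address; a sketch (the trusted address list is obviously deployment-specific):

    TRUSTED_PROXIES = {'127.0.0.1'}  # deployment-specific

    class TrustedPeerOnly:
        """Sketch: reject requests that don't come from a known proxy."""

        def __init__(self, app):
            self.app = app

        def __call__(self, environ, start_response):
            if environ.get('REMOTE_ADDR') not in TRUSTED_PROXIES:
                start_response('403 Forbidden', [('Content-Type', 'text/plain')])
                return [b'Forbidden']
            return self.app(environ, start_response)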
Or take tampering. You need to: (a) stack HTTP on top of a protocol that provides a no-tampering guarantee; or (b) use some hypothetical HTTP feature that prevents tampering, like adding an HTTP header with a digitally signed hash of the message; or (c) stack another protocol on top of HTTP. Again with the SOAP.
Etc. The world of (a) contains multitudes, with no means of interoperability; the standards for (b) simply don't have the features you need; and the stuff at level (c) is despised by all right-thinking persons.
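For concreteness, the hypothetical (b) for tampering would amount to signing the whole message rather than a single header value; extending the shared-secret sketch above to cover method, path, and body might look like:

    import hmac
    import hashlib

    def sign_message(secret, method, path, body):
        # Hash the body so the signed string has a fixed shape, then sign
        # method, path, and body digest together. The proxy would send the
        # result in something like an X-Signature header for the backend
        # to recompute and compare.
        body_digest = hashlib.sha256(body).hexdigest()
        message = '\n'.join([method, path, body_digest]).encode('utf-8')
        return hmac.new(secret, message, hashlib.sha256).hexdigest()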
-j