We can answer on question what is VPS? and what is cheap dedicated servers?

January 2008

Documents vs. Objects

Let’s imagine we live in a late binding world, a programming world of messages, what should that look like? What does it look like in the large?

The Smalltalk notion of this is many, many independent referencable objects. Objects all the way down. So you might get something like buddies do: [ buddy | buddy sendMessage: myMessage].

At runtime this means something like:

  • Send the buddies object (some reference we obtained elsewhere) the do: message, with the argument [ buddy | buddy   sendMessage: myMessage]. Except remember that that [...] is a block (aka closure), and it’s also an object. Which only makes sense, because buddies doesn’t know what myMessage means. So all buddies knows is that it gets an argument.
  • The buddies object receives this do: message, with an argument we’ll call block (that [...] stuff). By convention do: is sent to sequence-like things, and calls its argument multiple times. So buddies figures out some items — maybe it’s a concrete sequence (like an Array) or maybe it looks it up in a database, or does a call over the net, or whatever.
  • Once buddies gets an item, it calls block value: item. That is, it sends a value: message back to the block object, with the argument item.
  • The block has just received that value: message. The block wouldn’t have to be that [...] syntactic construct, it just usually is. So it receives that value, and locally binds the buddy variable to the argument and evaluates the inside of that block.
  • Our expression then sends the sendMessage: message to that buddy object, with another object as the argument. That argument might be a string, but it also might be markup, or a richer object that itself does all sorts of magical things. Maybe the buddy object sends the object over the net to your remote buddy, then the remote buddy calls message displayOn: myScreen, and the message object reponds to displayOn: by sending screen showText:   someString withColor: #blue. This can go on for a while.

As you can see, there can be a lot of tracing through code in this style. This has been called Ravioli code. Many people lean heavily on concepts like classes and methods to make sense of this kind of code, though if you look through my description you’ll notice they don’t show up; it is apparently somewhat frustrating for Alan Kay that he first used the term "object-oriented" when he would have preferred this message-oriented concept. Classes and methods are just implementation details. Messages and object references are the true fundamental concept.

Doing this in the large means that a lot of these messages go over the network. It means we can get a handle to an object that lives on another computer. It might result in a system that is very chatty (many systems with transparent RPC do end up being unintentionally chatty). But it doesn’t have to. You can make this asynchronous to facilitate these kinds of efficiencies.

That said, there is another distinct technique for programming in the large: document oriented programming. This is what REST is focused on: transmitting state in as complete a form as feasible, with explicit references whenever referring to another resource.

Resources and objects are in many ways similar, but in other ways completely opposite. An object has behavior but no visible state. It responds to messages, passing more objects along. Resources themselves have no behavior. They are just bare state. REST and the web kind of adds behavior, but strict REST avoids even that. PUT is not behavior. PUT is almost like a declaration of fact. When you PUT to a resource you are claiming that the new state of the resource represents the truth of that resource. Any behavior that occurs because of that is hidden. Other resources may change as a result of the PUT, but this is not exposed in any way. To find out what’s changed you simply have to refresh all the state you’ve acquired — re-GET the resources. (POST, except when used simply to create a resource, is more like a verb than a statement of truth; the HTTP model is not pure in this respect.)

For programming on a web scale resources have become the dominant technique. REST was really an acknowledgement of this idea, not the invention of it. It’s also an advocacy effort that our future work should be built on this past success. It’s been focused very clearly on advocating resources over the object/message-based systems of WS-*, but the WS-* design is hardly the first or only object based system of its kind; anything called "RPC" comes from that line of thinking. But now that WS-* isn’t king of the hill anymore maybe we can more be more thoughtful about how REST relates to other programming techniques.

A way to describe the difference in terms of programming language design is that with resources everything we do is call by value. Resources are values, and we pass them around in their complete form. You can pass around a reference value (a URI), but this is like calling a function with a pointer argument. Object/RPC techniques are call by reference (though at some level there are true "values", things like integers and strings, as things would simply be too chatty otherwise).

One interesting place where objects and resources can overlap is in functional programming. Thinking in terms of call by value or call by reference, in a functional programming language these are identical. Because values are not mutable a value and a copy of the value are equivalent. You can also pass an object between systems by reference, in whole, or some lazy compromise where the object is only incrementally communicated between the processes.

Resources are in a sense immutable regardless of your programming language. When you retrieve a resource you cannot modify it in place, as the result is no longer representative of what the resource really is. You can put a new resource in a named location (PUT), and you can copy resources (maybe making an internal copy). Incidentally this is why I think literal document models are best. It would probably be useful to have a native representation of what a "resource" really is in programming languages, though usually this is not done. A resource should really be a document (content type, some bytes, perhaps a cache of parsed representations) plus the name (URI) and the time. Resources aren’t eternal, but given a time and URI there is (in a REST world) a single canonical representation.

Further investigations of functional programming languages would probably be useful here. I believe strict functional languages have interesting metaphors for handling this kind of information; the eternal immutable sense of what a thing is at a point in time, and a way to reconcile the incomplete knowledge of what a resource is right now. I don’t think this is a clear pattern in Erlang, but is in Haskell.

One useful aspect of resources is the ability to spy on the process, to get a snapshot of it, to understand it in a kind of literal way. This is view source in a browser, or just poking around with curl. You can feel a certain confidence that you can get a real sense of what the truth is using these tools. In an object-based system you can only poke around with a process of 20 questions, hoping that there isn’t some hidden truth that you are unaware of. For instance, talking to a proxy that is almost exactly like the original object, but with some small replacement. Of course internally a server can be peculiarly coupled as well, with relationships between the resources that aren’t explicit or obvious. But at least the faces it projects are fully formed.

I still haven’t fully figured out what distinguishes a document model from an object model, but I think the distinction between the two contains some interesting stuff worth exploring.


Comments (12)


What PHP Deployment Gets Right

With the recent talk on the blogosphere about deployment (and for Django, and lots of other posts too), people are thinking about PHP a bit more analytically. I think people mostly get it wrong.

There are several different process models for the web:

  1. CGI, where every request creates a new process, and the process handles only one request. (You could also fork a new process each request with mostly the same result.)
  2. Worker processes, where a process handles one request at a time, but the process is long-lived and will handle another request after finishing the first.
  3. Threaded, where a process handles multiple requests, each in its own thread. Threads may or may not be reused (depending on whether you use a thread pool), and that’s kind of like the difference between CGI and worker processes. But it’s not a big difference because either way the process is long lived, including all your objects.
  4. Asynchronous, where a process handles multiple requests without threads. Typically your code is event driven: a request is an event, but so are things like "database results returned", or "file read from disk". Code takes the form of short snippets between these events.

What people don’t realize is that PHP is effectively a CGI model of execution. People don’t appreciate this because PHP is implemented with mod_php, an Apache module. There are many other modules like mod_perl (the first of these mod_language modules), mod_python, mod_ruby, etc. None of these other modules are like mod_php. This has led many a commentator astray because they don’t get this. This is because the PHP language was written for mod_php. Perl, Python, Ruby — none of them were written to be used as an Apache module. You can’t take one of these existing languages and just retrofit it to be like PHP or like mod_php.

Why is it important that PHP has a CGI like model? Mostly because it lets two groups separate their work: system administration types (including hosting companies) set up the environment, and developer types use the environment, and they don’t have to interact much. The developers are empowered, and the administrators are not bothered.

Some details:

  • PHP processes can leak memory like crazy. It doesn’t matter because they only leak memory for one request.
  • PHP processes can be easily killed if they act badly (sucking up too much memory or if they get caught in an infinite loop).
  • PHP code has no global state. It has lots of global variables, but since they only live for one request they are relatively harmless. (Not entirely harmless, but what’s a major sin in another language is only a minor sin in PHP.) The process cannot become corrupted.
  • There’s no global set of installed libraries (besides what PHP is compiled with). If you want to include a library you include files, typically relative paths. You won’t include someone else’s library accidentally. (This has changed in PHP, but the search path remains largely unused.)
  • Most of the language is implemented in C, in a shared library. In comparison Python has "batteries included", but those batteries are largely written in Python. Python code is not shareable, and can take time to load up. So while a single Python CGI script might be small, it probably imports lots of code which would have to be loaded each request. PHP scripts actually are small. (Stuff like PEAR changes this by adding substantial libraries written in PHP, but also seriously effects PHP performance.)
  • What few system-dependent libraries and C extensions that are required are compiled into PHP, and hosting companies have kind of figured out a consistent set of these libraries (stuff like database drivers, an image processing library, and XML parsing). Individual developers write only PHP (and that’s all they are typically allowed to write).

These features are all contrary to the design of these other languages, and even more contrary to the conventions in other language communities. Python programmers could write all their libraries in C and start to make CGI scripts feasible, but they don’t and they won’t.

Focusing on mod_php is a goose chase. You have to focus on the features it provides.

Here’s the features I think are important:

  • The failure cases are isolated. Processes never get wedged. One user doesn’t take out another user’s application. Even one bad page in an application won’t take out the entire application.
  • File-based deployment. There isn’t a build process for deploying a PHP application. You put the files in the right place and they are deployed. (Configuration in PHP is fuzzy, but PHP is not perfect.)
  • Minimal global dependencies. If you use libraries then you package those libraries with your application. You don’t worry about the administrator upgrading something on the system and breaking your application. (At least not much, there still things like upgrading mod_php that can break your application, or changing settings in php.ini. PHP is not perfect.)
  • It’s pretty easy to do multiple deployments for development. You just drop another copy of the files somewhere else and change your database settings.

And of course the high-level features:

  • Working applications stay working. (Even half-working applications at least stay half-working.)
  • Administrators aren’t being hassled to make little changes and fixes all the time.
  • Developers can do what they need to do without going through the administrators.

People or organizations that are hybrid developer/administrators don’t care about this so much. And that makes sense; this is all about separating those two roles. Though even developer/administrators would benefit from a better story here because it would let them concentrate more on being developers and the administration part will start taking care of itself.

Solving this is all the harder because of how it interacts with the language itself. But it’s not impossible. Process model 2 — worker processes — are feasible for most languages. You just need a really good process manager (including process setup for isolation, and process monitoring to mitigate problems). Apache alone is not this manager. mod_fastcgi could be this manager, but it’s not. Maybe mod_wsgi will become this manager for Python.


Comments (24)