Centralized vs. Decentralized 2

After thinking about the comments on my last post, I'm starting to see some of the technical advantages of distributed version control (even though I'm not any more enamored of the development process it so often presumes).

Really the thing that keeps me from just opening up large swathes of a Subversion repository to anonymous access is the security concern. I just don't trust Subversion in that way, and I doubt the Subversion developers trust it that way either. After all, it's written in C, one of the Least Secure Languages Ever. (PHP is giving it a run for its money with its own take on How To Be Insecure, but C has a much deeper and richer legacy of insecurity.)

But a lot of the problems are hard to really imagine fixing in Subversion. What if someone uploads 10Gig of asdfasdfasdf into the repository? Sure, you can delete anything, but stuff still Lives Forever in the history. Or less maliciously, someone is sure to start uploading core dumps, or giant PDFs, or something. So even though Subversion is much less prone to mistakes than CVS, because operations can generally be "undone", there's still cruft left behind. Not enough to bother me now, but enough that I suspect I'd be bothered if I gave access to the public. (I still plan to give access to more people once I get the permission thing figured out, just not self-signup.)

Also, because lots of the logic lives on the server with a centralized tool like Subversion, there's a lot more to worry about in terms of remote exploits. If most of the logic is in the client then they can only exploit their own machine. Though on reflection this might be worse, since it could mean checking out a repository could itself be a security risk. Well... let's just hope we're working with environments where security is valued and attainable.

Another issue is backend management. One of Subversion's benefits and drawbacks over CVS is that you can't "maintain" the repository, meaning you can't go in and fiddle with files on the server. This means you can't break the repository, but also that you can't fix it (like when someone uploads those core dumps, or when you want to completely eliminate defunct branches). Distributed systems leave room to meaningfully modify the "repository" using file commands, where the "repository" is really a whole set of repositories, which together form something equivalent to the more inclusive repository that Subversion expects.

So maybe a distributed version control system would be a good basis for an open centralized repository, where that open repository is primarily a file share. A usable system would actually handle the file sharing internally, since relying on scp, rsync, or an OS-level webdav client implementation is too error prone at this time.

From this perspective centralizing the files is still very important to me. In the model I would prefer there is a privileged (and presumably somewhat trusted) limited set of branches (like trunk or HEAD, tagged releases, stable branches). And then there are all the other branches, and anyone can edit any of them. That means that the default is open, which I feel strongly is the right default. The current default of distributed systems is private and with author editing only. This feels like an unnecessary restriction imposed mostly because they are avoiding the technical issues of sharing. I think the way these systems rely on email, rsync, ssh, etc., is simply avoidance, an unwillingness to address the whole experience. But that can certainly be resolved.
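To make the model concrete, here is a minimal sketch of the permission check I have in mind. Everything in it is hypothetical: the branch names, the `can_write` function, and the idea of a `trusted` set are illustrations only, not the API of any existing version control system.

```python
# Hypothetical branch layout: a small privileged set, everything else open.
PROTECTED_BRANCHES = {"trunk"}
PROTECTED_PREFIXES = ("tags/", "branches/stable-")


def is_protected(branch):
    """Return True for the limited set of privileged branches."""
    return branch in PROTECTED_BRANCHES or branch.startswith(PROTECTED_PREFIXES)


def can_write(user, branch, trusted):
    """Open by default: only privileged branches require a trusted user."""
    if is_protected(branch):
        return user in trusted
    return True
```

With `trusted = {"ian"}`, `can_write("anonymous", "branches/experiment", trusted)` is allowed while `can_write("anonymous", "trunk", trusted)` is not. The point is that the restriction is the exception you list explicitly, rather than the default.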

Created 09 Aug '05

Comments:

"""Sure, you can delete anything, but stuff still Lives Forever in the history."""

It is possible to make the current (and past) states of the repository whatever you like by doing a dump/filter/reload.
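As a rough sketch of what that looks like when scripted, the following Python wraps the `svnadmin dump`, `svndumpfilter exclude`, `svnadmin create`, and `svnadmin load` commands. The function names and the repository paths passed in are assumptions for illustration; the filtered history goes into a new repository and the old one is left untouched.

```python
import subprocess


def build_commands(old_repo, new_repo, excluded_paths):
    """Return the four command lines for a dump/filter/reload cycle."""
    return {
        "create": ["svnadmin", "create", new_repo],
        "dump": ["svnadmin", "dump", old_repo],
        # svndumpfilter reads a dump on stdin and writes the filtered dump to stdout
        "filter": ["svndumpfilter", "exclude", *excluded_paths],
        "load": ["svnadmin", "load", new_repo],
    }


def rebuild_without(old_repo, new_repo, excluded_paths):
    """Pipe dump -> filter -> load, dropping excluded_paths from history."""
    cmds = build_commands(old_repo, new_repo, excluded_paths)
    subprocess.run(cmds["create"], check=True)
    dump = subprocess.Popen(cmds["dump"], stdout=subprocess.PIPE)
    filt = subprocess.Popen(cmds["filter"], stdin=dump.stdout, stdout=subprocess.PIPE)
    dump.stdout.close()  # let dump see a broken pipe if the filter exits early
    load = subprocess.Popen(cmds["load"], stdin=filt.stdout)
    filt.stdout.close()
    for name, proc in (("load", load), ("filter", filt), ("dump", dump)):
        if proc.wait() != 0:
            raise RuntimeError(name + " step failed")
```

Something like `rebuild_without("/srv/svn/old", "/srv/svn/new", ["trunk/core.12345"])` would then produce a repository whose history never contained the core dump (the paths here are made up).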

# Benji York

I'll be really original, and suggest that what this really means is that you need to check out _my_ favorite system ;-)

In this case, Monotone. The interesting point relative to your post is that monotone, from some points of view, is quite close to what you describe. Most of the distributed VCSes have branches that are not just distributed, but scattered -- each meaningful branch lives in a particular location, these meaningful locations can be on different hosts, you have to know which host each branch lives on and be able to connect to it, etc.

In monotone, OTOH, no repository is more equal than any other one; the basic network operation is just "send them the facts I know but they don't, fetch the facts that they know that I don't" -- so one way to think of it is as a centralized VCS where that center is made a little more diffuse. You can still commit while on an airplane, and it becomes completely trivial to do things like set up load-balancing hot-backup repos, if your server goes down anyone can step in for it, etc., but in the usual case everything is in the same place. Think of it less "no-one knows what's going on", more "no central point of failure, many more 9s of robustness than achievable by any other method".

New developers generally cannot push to a project's main server, because of exactly the sort of issues you mention -- we'd really rather not become the next big warez distribution network -- but they can push to their own server, and if a developer pulls from that server the changes will end up in the central server the next time they sync. Apparently this sort of thing is called a "gossip network", and has some rather nice properties this way. Of course, this is a really high-level hand-wavy description...
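To make the hand-waving slightly more concrete, here is a toy sketch of the idea in Python. It is not monotone's actual protocol or data model: repositories are modelled as plain sets of opaque "facts", and `sync` and the repository names are hypothetical.

```python
def sync(a, b):
    """Exchange facts so both repositories end up holding the union."""
    missing_from_b = a - b  # facts I know but they don't
    missing_from_a = b - a  # facts they know but I don't
    b |= missing_from_b
    a |= missing_from_a


# Three repositories that start from the same history.
central = {"rev1", "rev2"}
developer = {"rev1", "rev2"}
newcomer = {"rev1", "rev2", "rev3-from-newcomer"}  # pushed to their own server

sync(developer, newcomer)  # a developer pulls from the newcomer's server
sync(developer, central)   # ...and later syncs with the central server
```

After the second `sync`, `central` contains `rev3-from-newcomer` even though the newcomer never had push access to it; the change travelled through a developer who did.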

(It also, uh, contrary to comments on the last post, has never used OpenSSL's libraries, and has supported tag and branch based checkout for eons. No local tags, though, hadn't encountered that idea before. Interesting idea.)

I think you were missing the point just a bit in the last post; the advantages of DVCSes aren't so much "oo, distributedness!" as lots and lots of things like this -- increased robustness, better merging, workflow improvements (no update-before-commit requirement, for instance; frees one from the tyranny of the cowboy who keeps breaking mainline), cooperation across organizational and trust boundaries, that sort of thing. Distribution is one nice feature, but hardly defines a VCS by itself; the particular issues you listed are real ones, but not necessarily the relevant ones.

DVCSes _do_ also make it possible and even easy to fork. There are people who want to make it hard to fork. (I don't know if you're one of them.) I can understand where they're coming from, but I do outright disagree with the position; forking is an important freedom, and I honestly think it's wrong for tools to try and enforce otherwise. OTOH, this certainly doesn't mean they should make it easy to accidentally drift out of sync and neglect to inform other contributors of work ongoing...

# Nathaniel Smith

"It also, uh, contrary to comments on the last post, has never used OpenSSL's libraries, and has supported tag and branch based checkout for eons"

My bad. Sorry. :(

# Chad Walstrom

Don't forget that distributed systems make it easier to recover from a fork. It's just a merge, after all.

I don't think they have any effect on how easy it is to fork in the first place; tools like cvsps and tailor make it trivial to pull history out of one repository and into another.

# Bryan O'Sullivan

What's wrong with SSH? If your goal is to support authenticated write access, using SSH makes a lot more sense than trying to hack together your own cryptographically sound transport mechanism. You should also support anonymous read access, of course.

# Aaron Bentley

Working with ssh via TortoiseSVN on Windows has not endeared ssh to me. Of course in this case it's my own fault for not having an https access method to the repository... which is to say, it's my fault for not using the superior server support Subversion provides via HTTP. Anyway, ssh doesn't make for a very good, complete experience. SSH servers and clients are not as easy to abstract or build upon as those for other protocols, so actually building ssh support into the server and client is rather hard, and you end up with hacky (IMHO) command-line solutions.

# Ian Bicking

I think you're talking more about bad ssh use than problems with ssh itself. If you're building your own protocol, layering it on top of SSH takes care of security. If you just want access to files, sftp comes with SSH and is capable -- much nicer than http for filesystem access.

# Aaron Bentley

SSH requires OS level user accounts. That's just not possible in many situations.

# Stephen

Not true. Nothing in the SSH protocol requires OS-level accounts.

Admittedly implementations of it that don't require OS-level accounts are rare, but they exist. For instance, Canonical provides SFTP space for Bazaar archives (and soon Bazaar-NG branches) for anyone with a Launchpad (https://launchpad.net/) account. These are then mirrored to the world via HTTP.

We implement this SFTP server using Twisted's Conch SSH library, and I can assure you we don't create OS-level accounts for every Launchpad user :) ... it was actually surprisingly easy to use Conch for this.

# Andrew

And I've posted it at http://www.serpentine.com/blog/software/distributed-vs-centralised.html to save on re-pasting it here.

# Bryan O'Sullivan

Ian Bicking: the old part of his blog
