Why centralize when you can decentralize?

Submitted by Eric Pierce on Sun, 10/03/2010 - 01:34

Like most experienced software developers, I've accumulated quite a few years of experience with revision control systems. We use revision control for many reasons--to collaborate, to track changes, and to have a contingency plan in case something breaks--but maybe the biggest reason is simply to keep ourselves from going insane from the frenzied cat-herding that is software engineering.

My first experience with a revision control system was, well, Revision Control System (RCS), a rather contrived introduction during my third or fourth year of the Computer Science curriculum at my university. RCS is extremely limited: it works on single files and has no support for projects with multiple files. Aside from a couple of class assignments on RCS, I didn't do much else with it; it seemed like a cool idea, but in retrospect I can clearly see why it was superseded by more capable systems.

Next came CVS, naturally. I used CVS for several years for my open source projects (mostly because that was the only thing SourceForge supported at the time). I remember it as a fairly sturdy system; its client-server approach was easy enough to understand, and although it occasionally spat out some confusing error messages, it worked pretty well as long as you didn't demand too much of it. The projects I worked on typically involved only one or two developers, so I didn't run into problems with merge conflicts very often.

After I'd been using CVS for a few years, Subversion started to gain some popularity. It was hyped as "a better CVS", which sounded good. Its architecture (at least from a user's perspective) was pretty much the same as CVS, but overall things seemed to work better, and the errors were less confusing. It was exactly as promised: a better CVS. It did not aim to be anything more than that, and it wasn't. I used Subversion quite happily for many years; again, I was mostly dealing with small projects, and repositories with only a couple of active committers.

Eventually its limitations started becoming apparent. What if I wanted to do revision control of just a couple files on my local disk? Maybe I don't feel like signing up for a public svn host until my fledgling project starts showing real promise. I don't want to have to think about the whole client-server architecture until there's actually a need for a server, and setting up a Subversion server looked (at a casual glance, at least) to be an exercise in overkill just to revision-control one or two files.

I also found that I wanted the ability to check out from a repository that was hosted on a network disk drive. My team needed to share code, and setting up a whole HTTP server just to collaborate on some code again seemed like overkill. If Subversion supports this, it's still not obvious to me how to do it.

Then, what if I want to have a shared repository, but I want the ability to work locally--make several smaller commits to my local copy instead of committing directly to the main trunk where everyone else will see it right away? I like to make lots of small commits, since it helps me keep track of changes at a fine-grained level, but with each commit there's a chance something will break. I don't want to do full unit and regression testing for each commit--I'd rather wait until I'm done with all my incremental steps, then flush out any problems with a few more patching commits (again, all local) before sharing it with everyone else.

I began to seek out alternatives to Subversion. In fact, at the organization where I worked (which shall remain nameless), they didn't even use Subversion--all the developers were using some proprietary thing called ClearCase. I thought, "OK, let's see if it can do what I want." Boy, let me tell you, I had not yet begun to experience the concept of "overkill". It took hours just to install the monstrously big client software. It took many hours more just to import our medium-sized project. Then the real trouble started. Whether through design or through policy, all code in our group's ClearCase repositories was managed using an exclusive, locked checkout. If someone wanted to edit a file, they were required to check it out. While that person had the file checked out, nobody else was allowed to edit it. If you edited a file that you didn't have checked out, you were considered to have "hijacked" the file, and had to go through a bunch of extra steps to merge your changes into whatever changes the other person made. If someone forgot to check a file back in, or went on vacation for a week with files checked out, it disrupted everyone else's work. This has to be the most idiotic approach to collaborative coding I've ever encountered. I only used ClearCase for a couple of days--enough to realize that it was the complete antithesis of what I was looking for.

This is when I started to read about distributed revision control systems. I tried a couple of them before becoming enamored with Bazaar. It was dead simple to set up a repository with only a few files, I could work on small projects without even thinking about servers, and with minimal effort I could set up a shared repository on a network drive that my team could access, again without even thinking about HTTP, Apache, or anything server-related. There was no "server"; we had no need of a "server". We just wanted to work together on some files that we all had direct access to, without stepping on each other's toes with concurrent editing.

After an initial period of mental adjustment, this worked out pretty well. We ended up using a centralized model with checkout/checkin, since that's what my team was most familiar with, though individual team members like me could always create a local branch to do isolated commits if desired. We also found that Bazaar did very well for revision-control of single files--for example, we had these massive, complicated configuration files on each of our development servers, and keeping track of changes within them was a nightmare without revision control. A quick bzr init and bzr add was sufficient to put a corral around that particular herd of cats; everyone could continue using the file as they always had, and nobody even had to care that it was revision-controlled.

I've continued to use Bazaar for projects both big and small. Most of the time, I'm the only committer or one of only a few committers, so in that respect it's pretty much the same as my CVS and Subversion experience in years past. But even so, I've found the distributed approach useful--I have occasionally created separate branches to work on a new feature or fix a particular bug without disturbing whatever I'm focusing on in the main branch. Merging has nearly always been painless. I never did branching or merging with Subversion, mostly because I saw no need for it most of the time, but also because I've heard some horror stories about how painful it is to merge in Subversion. I guess with centralized revision control, branching is one of those seemingly cheap operations that you pay for later.

During the last year or so, I've become quite familiar with Git, as well as one of the most awesome things I've seen in all my years of software development, the repository-of-repositories, GitHub. In fact, Git would not interest me much were it not for GitHub. Now, the "server" part is also dead simple--a couple clicks, and you have a public repository. Another click, and you can fork it. Branching is truly cheap, and unlike with Subversion, merging is also cheap. Git is fast. Blink and you're done.

Git's terminology differs in a few places. For example, in Git, a "branch" is not a full copy of the repository as it is in Subversion or Bazaar. That's a "clone" in Git, whereas a "branch" is effectively an isolated set of changes, such as you might make when implementing a certain feature, or fixing a bug. Similarly, a "checkout" isn't what you might think; when I check something out, it's basically just a context-switch. "checkout X" just means "I want to work on branch X for a few minutes", then "checkout Y" means "Let me go back to working on branch Y, without my changes to X getting in the way." Each branch has its own chain of commits; as long as all branches came from a common ancestor, merging is virtually painless.
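Here is a small sketch of that context-switch in a throwaway repository. The branch names `x` and `y`, the file `notes.txt`, and the identity set with `git config` are placeholders of mine, not anything from a real project.

```shell
set -e
repo=$(mktemp -d)         # throwaway repository for the demonstration
cd "$repo"
git init -q
git config user.name "Example Dev"         # hypothetical identity, local to this repo
git config user.email "dev@example.com"

echo "base" > notes.txt
git add notes.txt
git commit -q -m "Initial commit"

# Two branches that share the initial commit as their common ancestor
git branch x
git branch y

git checkout -q x         # "I want to work on branch X for a few minutes"
echo "work on X" >> notes.txt
git commit -q -am "Change made on X"

git checkout -q y         # "Let me go back to working on branch Y"
cat notes.txt             # shows only "base"; the change on X is not in the way

git merge -q x            # bring X's commit into Y
cat notes.txt             # now shows both lines
```

In this case the merge is a simple fast-forward, because `y` had no commits of its own since the common ancestor.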

Overall, this strikes me as the right way to do revision control. Context-switching is something that software developers must do very frequently, and Git makes it easy to do. With a centralized system like CVS or Subversion, you may be in the middle of working on feature X, with a bunch of uncommitted and untested changes in your checkout, when you suddenly get interrupted to fix bug Y. What do you do? Hastily commit your untested work on X? Go ahead and fix bug Y in the same batch of commits, even though it might overlap what you're doing for X? Get a brand-new checkout from trunk and fix Y there, and worry about merging it into X later? Get someone else to fix Y? There's no good answer. Centralized revision control breaks down in an environment with frequent context-switching. It only works if everyone is climbing the same trunk, following some mythical plan laid out in advance, with clear segregation of duties and minimal interruptions.
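With Git, one way to handle that interruption might look like the following sketch. It assumes a throwaway repository; the branch names `feature-x` and `bugfix-y`, the file names, and the commit identity are invented for the example, and committing work-in-progress locally is just one option among several.

```shell
set -e
repo=$(mktemp -d)         # throwaway repository for the demonstration
cd "$repo"
git init -q
git config user.name "Example Dev"         # hypothetical identity, local to this repo
git config user.email "dev@example.com"

printf 'release code\n' > app.txt
git add app.txt
git commit -q -m "Initial commit"
main=$(git rev-parse --abbrev-ref HEAD)    # "master" or "main", depending on Git config

# In the middle of feature X, with untested changes
git checkout -q -b feature-x
printf 'half-finished feature X\n' > feature.txt
git add feature.txt
git commit -q -m "WIP: feature X (local only, untested)"

# Interrupted: fix bug Y starting from the clean main line
git checkout -q "$main"
git checkout -q -b bugfix-y
printf 'release code, bug Y fixed\n' > app.txt
git commit -q -am "Fix bug Y"
git checkout -q "$main"
git merge -q bugfix-y     # the fix lands without any of the feature X work

# Back to feature X, exactly where it was left
git checkout -q feature-x
ls
```

The untested work on X stays on its own branch the whole time, so nobody else sees it until it is merged.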

Branching in distributed systems like Bazaar and Git feels organic; it's a natural growth that is still firmly joined to the tree. Branching in centralized systems like CVS and Subversion feels like splitting the tree with an axe. Distributed revision control is a much better fit to the way that most of us develop software in the real world.

Like RCS, Subversion and the other centralized revision control systems are relics of a bygone era. They are holding us back, and I think it's about time we laid them to rest.

References:

Anyone who is still using Subversion, especially those who have never seriously tried using a DVCS, should read "Subversion Re-education". It's oriented towards Mercurial, but most of its arguments apply equally well to the other DVCSes.

See also "Why Mercurial?" For a Bazaar perspective, try "Why Switch to Bazaar?" For Git, see "Why Git is Better than X". For a tongue-in-cheek counter-perspective, see "Why Subversion is Better than Git".
