Intro to Distributed Version Control (Illustrated)

Traditional version control helps you back up, track and synchronize files. Distributed version control makes it easy to share changes. Done right, you can get the best of both worlds: simple merging and centralized releases.

Distributed? What’s wrong with regular version control?

Nothing — read a visual guide to version control if you want a quick refresher. Sure, some people will deride you for using an “ancient” system. But you’re still OK in my book: using any VCS is a positive step forward for a project.

Centralized VCS emerged from the 1970s, when programmers had thin clients and admired “big iron” mainframes (how can you not like a machine with a then-gluttonous 8 bits to a byte?).

Centralized is simple, and what you’d first invent: a single place everyone can check in and check out. It’s like a library where you get to scribble in the books.

This model works for backup, undo and synchronization but isn’t great for merging and branching changes people make. As projects grow, you want to split features into chunks, developing and testing in isolation and slowly merging changes into the main line. In reality, branching is cumbersome, so new features may come as a giant checkin, making changes difficult to manage and untangle if they go awry.

Sure, merging is always “possible” in a centralized system, but it’s not easy: you often need to track the merge yourself to avoid making the same change twice. Distributed systems make branching and merging painless because they rely on it.

A Few Diagrams, Please

Other tutorials have plenty of nitty-gritty text commands; here’s a visual look. To refresh, developers use a central repo in a typical VCS:

Everyone syncs and checks into the main trunk: Sue adds soup, Joe adds juice, and Eve adds eggs.

Sue’s change must go into main before it can be seen by others. Yes, theoretically Sue could make a new branch for other people to try out her changes, but this is a pain in a regular VCS.

Distributed Version Control Systems (DVCS)

In a distributed model, every developer has their own repo. Sue’s changes live in her local repo, which she can share with Joe or Eve:

But will it be a circus with no ringleader? Nope. If desired, everyone can push changes into a common repo, suspiciously like the centralized model above. This franken-repo contains the changes of Sue, Joe and Eve.

I wish distributed version control had a different name, such as “independent”, “federated” or “peer-to-peer.” The term “distributed” evokes thoughts of distributed computing, where work is split among a grid of machines (like searching for signals with SETI@home or doing protein folding).

A DVCS is not like SETI@home: each node is completely independent and sharing is optional (in SETI@home you must phone back your results).

Key Concepts In 5 Minutes

Here are the basics; there’s a book on patch theory if you’re interested.

Core Concepts

  • Centralized version control focuses on synchronizing, tracking, and backing up files.
  • Distributed version control focuses on sharing changes; every change has a guid or unique id.
  • Recording/Downloading and applying a change are separate steps (in a centralized system, they happen together).
  • Distributed systems have no forced structure. You can create “centrally administered” locations or keep everyone as peers.
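
To make the "unique id" and "separate steps" ideas concrete, here is a minimal Python sketch. Everything in it (the Repo class, change_id, the 12-character id) is invented for illustration; real systems such as Mercurial hash more data than this, so treat it as a toy model rather than how any tool actually works.

```python
import hashlib
from dataclasses import dataclass, field


def change_id(parent: str, author: str, description: str) -> str:
    """Derive an id from the change itself, so every repo computes the same id."""
    raw = f"{parent}|{author}|{description}".encode("utf-8")
    return hashlib.sha1(raw).hexdigest()[:12]


@dataclass
class Repo:
    store: dict = field(default_factory=dict)    # id -> description (recorded changes)
    applied: list = field(default_factory=list)  # ids applied to the working files

    def record(self, parent: str, author: str, description: str) -> str:
        """Step 1: record (or download) a change. The working files are untouched."""
        cid = change_id(parent, author, description)
        self.store[cid] = description
        return cid

    def apply(self, cid: str) -> None:
        """Step 2: apply a change that is already in the store."""
        if cid not in self.store:
            raise KeyError(f"unknown change {cid}")
        if cid not in self.applied:
            self.applied.append(cid)


sue = Repo()
soup = sue.record(parent="", author="sue", description="+Soup")
recorded_only = (soup in sue.store, soup in sue.applied)  # recorded, not yet applied
sue.apply(soup)
```

Because the id comes from the change's content, Joe or Eve would compute the same id for the same change, which is what lets repos compare notes later.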

New Terminology

  • push: send a change to another repository (may require permission)
  • pull: grab a change from a repository
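
As a rough sketch, push and pull can be modeled as copying whichever changes the other side is missing, keyed by id. The dictionaries, the ids "a1" and "b2", and the allowed flag are all hypothetical; this is only meant to show the shape of the idea.

```python
def pull(local: dict, remote: dict) -> list:
    """Copy changes the local repo lacks; return the ids that were new."""
    new_ids = [cid for cid in remote if cid not in local]
    for cid in new_ids:
        local[cid] = remote[cid]
    return new_ids


def push(local: dict, remote: dict, allowed: bool = True) -> list:
    """A push is a pull seen from the other side, but the remote may refuse it."""
    if not allowed:
        raise PermissionError("remote repository does not accept your pushes")
    return pull(remote, local)


sue = {"a1": "+Soup"}
joe = {"b2": "+Juice"}

first = pull(joe, sue)    # Joe grabs Sue's change
second = pull(joe, sue)   # pulling again finds nothing new
pushed = push(joe, sue)   # Joe sends his own change to Sue
```

Pulling twice is harmless in this model: the ids show the change is already there, so nothing is duplicated.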

Key Advantages

  • Everyone has a local sandbox. You can make changes and roll back, all on your local machine. No more giant checkins; your incremental history is in your repo.
  • It works offline. You only need to be online to share changes. Otherwise, you can happily stay on your local machine, checking in and undoing, no matter if the “server” is down or you’re on an airplane.
  • It’s fast. Diffs, commits and reverts are all done locally. There’s no shaky network or server to ask for old revisions from a year ago.
  • It handles changes well. Distributed version control systems were built around sharing changes. Every change has a guid which makes it easy to track.
  • Branching and merging is easy. Because every developer “has their own branch”, every shared change is like reverse integration. But the guids make it easy to automatically combine changes and avoid duplicates.
  • Less management. Distributed VCSes are easy to get running; there’s no “always-running” server software to install. Also, DVCSes may not require you to “add” new users; you just pick what URLs to pull from. This can avoid political headaches in large projects.

Key Disadvantages

  • You still need a backup. Some claim your “backup” is the other machines that have your changes. I don’t buy it — what if they didn’t accept them all? What if they’re offline and you have new changes? With a DVCS, you still want a machine to push changes to “just in case”. (In Subversion, you usually dedicate a machine to store the main repo; do the same for a DVCS).
  • There’s not really a “latest version”. If there’s no central location, you don’t immediately know whether to see Sue, Joe or Eve for the latest version. Again, a central location helps clarify what the latest “stable” release is.
  • There aren’t really revision numbers. Every repo has its own revision numbers depending on the changes. Instead, people refer to change numbers: Pardon me, do you have change fa33e7b? (Remember, the id is an ugly guid). Thankfully, you can tag releases with meaningful names.
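
Here is a small sketch of why revision numbers are only meaningful locally. It assumes each repo simply numbers changes in the order they arrived; the id 0c9d1e2 is made up, and the other two are shortened from ids that appear in this article.

```python
def local_log(arrival_order: list) -> list:
    """Number changes in the order this repo received them, as 'rev:id'."""
    return [f"{rev}:{cid}" for rev, cid in enumerate(arrival_order)]


def rev_of(log: list, cid: str) -> int:
    """Look up the local revision number for a global change id."""
    for entry in log:
        rev, entry_id = entry.split(":")
        if entry_id == cid:
            return int(rev)
    raise KeyError(cid)


# Sue and Joe hold the same three changes but received them in different orders.
sue_log = local_log(["55bbcb7", "fa33e7b", "0c9d1e2"])
joe_log = local_log(["55bbcb7", "0c9d1e2", "fa33e7b"])

# A tag gives the ugly id a friendly name that means the same thing everywhere.
tags = {"v1.0": "fa33e7b"}
```

In this toy setup "revision 1" names different changes for Sue and Joe, while fa33e7b (or the tag v1.0) names the same change for both.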

Mercurial Quickstart

Mercurial is a fast, simple DVCS. Its command is hg, after Hg, the chemical symbol for mercury.

cd project
hg init                                # create repo here
hg add list.txt                        # start tracking file
hg commit -m "Added file"              # check file into local repo
hg log                                 # see history; notice guid

changeset:   0:55bbcb7a4c24
user:        Kalid@kazad-laptop
date:        Sun Oct 14 21:36:18 2007 -0400
summary:     Added file

# [edit file]
hg revert list.txt                     # revert to previous version

hg tag v1.0                            # tag this version
# [edit file]
hg update -C v1.0                      # "update" to the older tagged version; -C forces overwrite of local copy

Once Mercurial has initialized a directory, it looks like this:

You have:

  • A working copy. The files you are currently editing.
  • A repository. A directory (.hg in Mercurial) containing all patches and metadata (comments, guids, dates, etc.). There’s no central server so the data stays with you.

In our distributed example, Sue, Joe and Eve have their own repos, with independent revision histories.

Understanding Updates and Merging

There are a few items that confused me when learning about DVCS. First, updates happen in several steps:

  • Getting the change into a repo (pushing or pulling)
  • Applying the change to the files (update or merge)
  • Saving the new version (commit)

Second, depending on the change, you can update or merge:

  • Updates happen when there’s no ambiguity. For example, I pull changes to a file that only you’ve been editing. The file just jumps to the latest revision, since there are no overlapping changes.
  • Merges are needed when we have conflicting changes. If we both edit a file, we end up with two “branches” (i.e. alternate universes). One world has my changes, the other world has yours. In this case we (probably) want to merge the changes together into a single universe.
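
The update-or-merge decision can be sketched as a question about ancestry. In this toy model each change lists its parents, and the names (milk, soup, juice, how_to_apply) are invented for illustration; it is not a description of Mercurial's internals.

```python
def ancestors(parents: dict, node: str) -> set:
    """Every change reachable from node by following parent links, node included."""
    seen, stack = set(), [node]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def how_to_apply(parents: dict, mine: str, theirs: str) -> str:
    """Decide what to do after pulling 'theirs' while my files sit at 'mine'."""
    if theirs in ancestors(parents, mine):
        return "nothing"   # I already have their work
    if mine in ancestors(parents, theirs):
        return "update"    # their change simply continues my history
    return "merge"         # histories diverged into two branches


linear = {"milk": [], "soup": ["milk"]}
diverged = {"milk": [], "soup": ["milk"], "juice": ["milk"]}
```

With the linear history, moving from milk to soup is a plain update. With the diverged history, soup and juice share only the parent milk, so a merge is needed.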

I’m still wrapping my head around how easily branches spring up and collapse in a DVCS:

In this case, a merge is needed because (+Soup) and (+Juice) are changes to a common parent: the list with just “Milk”. After Joe merges the files, Sue can do a regular “pull and update” to get the combined file from Joe. She doesn’t have to merge again on her own.
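
A toy three-way merge of the shopping list might look like the sketch below. It assumes the list holds unique items and that order does not matter much; real merge tools compare line-by-line diffs and flag conflicts, which this sketch does not attempt.

```python
def merge_lists(base: list, mine: list, theirs: list) -> list:
    """Three-way merge for a list of unique items, using the common parent as reference."""
    # An item deleted on either side stays deleted.
    removed = {item for item in base if item not in mine or item not in theirs}
    merged = [item for item in base if item not in removed]
    # Additions from both sides are kept once each.
    for side in (mine, theirs):
        for item in side:
            if item not in base and item not in merged:
                merged.append(item)
    return merged


base = ["Milk"]                 # the common parent
sue = ["Milk", "Soup"]          # +Soup
joe = ["Milk", "Juice"]         # +Juice
combined = merge_lists(base, sue, joe)
```

The common parent is what makes this work: without it, there is no way to tell "Sue added Soup" from "Joe deleted Soup".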

In Mercurial you can run:

hg incoming ../another-dir  # see pending changes
hg pull ../another-dir      # download changes

hg update                   # actually apply changes...
hg merge                    # ... or merge if needed

hg commit                   # check in merged file; unite branches

Yep, the “pull-merge-commit” cycle is long. Luckily, Mercurial has shortcuts that combine commands: for example, hg pull -u pulls and then updates in one step. Though it seems complex, it’s much easier than handling merges manually in Subversion.

Most merges are automatic. When conflicts come up, they are typically resolved quickly. Mercurial keeps track of the parent/child relationship for every change (our merged list has two parents), as well as the “heads” or latest changes in each branch. Before the merge we have two heads; afterwards, one.
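
The "heads" idea fits in a few lines: a head is any change that no other change lists as a parent. The history below is a hypothetical model of the Milk/Soup/Juice example, with "merged" standing in for the merge change.

```python
def heads(parents: dict) -> set:
    """Heads are changes that no other change names as a parent."""
    has_child = {p for plist in parents.values() for p in plist}
    return set(parents) - has_child


before = {"milk": [], "soup": ["milk"], "juice": ["milk"]}
after = dict(before, merged=["soup", "juice"])  # the merge change has two parents

heads_before = heads(before)
heads_after = heads(after)
```

Before the merge, soup and juice are both heads. Afterwards the merge change is the only head, since it is the child of both.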

Organizing a Distributed Project

Here’s one way to organize a distributed project:

Sue, Joe and Eve check changes into a common experimental branch. They can trade patches with each other to do simple “buddy builds”: Hey buddy, can you try out these patches? I need to see if it works before I push to the experimental branch.

Later, a maintainer can review and pull changes from the experimental branch into a stable branch, which has the latest release. A distributed VCS helps isolate changes but still provides the “single source” of a centralized system. There are many models of development, from “pull only” (where maintainers decide what to take, as in Linux development) to “shared push” (which acts like a centralized system). A distributed VCS gives you flexibility in how a project is maintained.

Practice And Scathing Ridicule Makes Perfect

I’m a DVCS newbie, but am happy with what I’ve learned so far. I enjoy SVN, but it’s “fun” seeing how easy a merge can be. My suggestion is to start with Subversion, get a grasp of team collaboration, then experiment with a distributed model. With the proper layout a DVCS can do anything a centralized system can, with the added benefit of easy merging.

Online Resources

Notable Quotes:

  • “How many have done a branch and merged it? How many of you enjoyed it?”
  • “When you do a merge, you plan ahead for a week, then set aside a day to do it.”
  • “Some people have 5, 10, 15 branches”. One branch is experimental. One branch is maintenance, etc.
  • “CVS — you don’t commit. You make changes without committing. You never commit until it passes a giant test suite. People make 1-liner changes, knowing it can’t possibly break.”

So good luck, and watch out for the holy wars. Feel free to share any tips or suggestions below.

Other Posts In This Series

  1. A Visual Guide to Version Control
  2. Intro to Distributed Version Control (Illustrated)
  3. Aha! Moments When Learning Git
Kalid Azad loves those Aha! moments when an idea finally clicks. BetterExplained is dedicated to learning with intuition, not blind memorization, and is honored to serve 250k readers each month.


68 Comments

  1. Your git link goes to darcs.net instead of git.or.cz

    In any case, that was a good article. I’ll give it to my friend who I have thus far been unable to articulate the benefits of dvcs and git to.

  2. Thanks for the guide. I’ve been putting off developing something very similar to this for a couple of months and now I don’t need to.

    What did you use for your diagrams? They look cool :)

  3. But hasn’t this been in ClearCase UCM / MultiSite for years? Each developer has his own developer streams on which he can work independently, and there is the integration stream. Each project has an integration stream and all streams can deliver or mergecopy to each other. On each stream there can be multiple views.

  4. Hi Cogmios, thanks for the info. I’ve never used Clearcase, but from Wikipedia it appears as both client/server and distributed, which is pretty interesting (along with its view model). Nope, these ideas are not brand new — just wanted to explain them :)

  5. Clearcase UCM is not a distributed version control system, and neither is Clearcase multi-site.

    Clearcase UCM is hand holding of branch creation and merging of changes between streams, ticket tracking integration etc. It’s the Higher-Order Perl of Clearcase. Underneath it’s just a centralised VCS with branches.

    Clearcase multi-site is a form of VCS replication. Multiple sites can be defined and a replication schedule defined. Each branch is mastered in one site, i.e. writable in one place, and a read-only copy is replicated to the other sites. Mastering can be changed. And one can merge from a read-only (remote) branch to a local branch with no problems. The merge tracking is good, if slow, but does have some bugs that can surface if you have a slow network.

    So ClearCase multi-site is kind of like a distributed VCS, however only the centralised ‘sites’ are the nodes of the web, unlike the contemporary distributed VCS’s in which every machine or even working copy is a node in the web.

  6. Hi David, thanks for the details! From your description, it seems like ClearCase is somewhere between a regular VCS and a DVCS, allowing merges from multiple sites (even if not every working copy is independent). Appreciate the info.

  7. Awesome. Thanks. You could get a job as a professional communicator.

    Explaining the concepts with those nice visual graphics is great. Pretty hard to explain the details of configuring Eclipse that way, though.

  8. Thanks for the comment — maybe I’ll try getting into that eventually :). Yep, I agree you’ll always need the manual for certain things (but nearly anything can be more clear with a simple diagram!).

  9. Hi Kalid. Thank you for this wonderful write-up. You recommended to start with Subversion, is that because it’s simpler? What do you think of tools like Trac? I just need a good tool for my website projects. Thank you again!

  10. Hi Ildar, thanks for the comment. Yes, I’ve found regular VCS a good way to start. Subversion is pretty popular and works with many tools, like Trac. You can get subversion working first, then use Trac if you need project management. Personally I don’t use Trac for my projects.

    Here’s an article on getting Subversion up and running with TortoiseSVN, a nice GUI client: http://lifehacker.com/software/subversion/hack-attack-using-subversion-with-tortoisesvn-192367.php

    It’s nice to look at a directory and see what’s up to date and what isn’t.

  11. Thanks, Kalid! TortoiseSVN looks very interesting, especially for people who don’t want to use the command line.

  12. Hi Kalid,

    Very well written article. Cleared out the pros and cons about DVCS for me!

    Maybe you can help me on this one. I’m currently looking for a version control tool for our division. We support Oracle software, mostly custom code.

    Sometimes code needs to be changed in order to fix errors or for further development. Therefore it needs version control. We are used to working with no more than one person on the same object at the same time because merging afterwards is very complex.

    So we developed a tool where we can claim objects. Once an object is claimed, another person isn’t allowed to develop that specific object.

    Now we need a more sophisticated tool, like Subversion or Mercurial. But what would be your advice? In a centralized system we are still able to claim a source (by locking) but a distributed one can’t do that, can it?

    My question is, which one should I pick? I really like DVCS (particularly HG with Tortoise) but sources can’t be claimed. I ask myself, how bad is that actually?

    Hope you can give some advice on this. I don’t expect a plug and play answer, of course!

  13. Hi Johan, glad you enjoyed the article! Great question — it’s hard to say, but if just starting out I’d begin with Subversion.

    Centralized systems are easier to understand and get going with, and the locking feature is pretty nice.

    I believe there are ways to export Subversion to a distributed system later if you find you need to do that (article here: http://ww2.samhart.com/node/49).

    Hope this helps!

  14. Nice visual representation and I learned a lot. But I’m still not seeing how merges are easier than Subversion. Can you compare/contrast a little more, maybe with an example of how Svn is different (for the merge process, not the tracking).

  15. Hi taleswapper, great question. The primary difference is that DVCS keeps track of each change individually, so it can tell what changes need to be applied and what have already been applied.

    With SVN, you need to manually track whether a change has been applied. So when merging branches, you aren’t quite sure what revisions to start and end from, and whether you’ve already done it.

    With a DVCS, you can just “pull” from a repository and be confident you are getting just the changes you need.

    The actual merge command is pretty straightforward in SVN (svn merge -r5:6 http://path/to/branch), but again, you need to know what revisions to start and end from, and whether it’s been applied.

  16. A very good read… really gives a _visual_ of the DVCS. You also cleared a few doubts that I had in mind regarding the DVCS implementation in an Enterprise Dev Arena. I have a few queries:

    * How do you make available the latest stable code?

    * Doesn’t having a *private* dev copy with each developer lead to a scenario wherein huge merges may be required due to the divergence?

    * Again, making private changes is, I think, to some extent depriving a community of the changes available.

    Thanks again for the post.

  17. Hi Kishore, those are great questions. I don’t have any experience managing DVCS projects, so I’m making a guess based on what I’ve read.

    1) Latest code. One drawback with a distributed model is that there’s no central location. One alternative is to have a maintainer who pulls code from various contributors on some schedule.

    2) Giant merges. Although giant merges may happen, they are easier than merging in centralized systems because changes are tracked better. But you’re right, unless there is some synchronization the code could diverge.

    3) Yes, it’s hard for the community to see the changes since everyone has their own copy. Having a maintainer put together a “central” release may help this too (everything else considered experimental).

    Again, these are just guesses based on the descriptions I’ve seen. Some projects use a hybrid centralized/distributed model.

  18. Hi Kishore,
    Though I do not manage any large dVCS repository, I am quite actively following related topics on the Emacs development lists.
    Attempting to answer your questions:

    1> There is ONE official repository from where the releases are made. It is up to the individual developers to convince the owner/maintainer to pull in their changes. There are sites that list developer repositories with a brief description. Someone wanting to try code not yet in the official repository can pull changes from the developer repositories if they are published (Mercurial has stable and crew repositories).

    2> Giant merges do not often happen (well it did happen twice recently on the Emacs branch – multi-tty and unicode). But, it would still suffer from the same fate as in a centralized VCS. If some developer had merged some of the changes, those merges are remembered and you do not have to solve them again in dVCS.

    3> Each developer can publish his/her official latest repository in the published list. If they are not disciplined in keeping it up to date, the same can happen in centralized VCS if they just edit and never commit.

    Overall, I would say you can do anything in dVCS that you can do in centralized VCS. dVCS is a super set.

  19. Hmmm…I wonder if a polling system could work instead of a “maintainer” to make decisions for what gets placed in the main trunk? Say have a control panel that lets you set up each user with a weighted vote (maybe you don’t want the new untested guy to have the same weight as the rest of the team who has been battle tested), and for setting the trigger for acceptance of new code into the main. Hmmmm…

  20. Hi Scott, that’s an interesting idea. You’re right, currently DVCS require a maintainer to pull, but there may be cool ways to automate the process.

  21. Just delete this bit, it’s wrong.

    “Centralized VCS emerged from the 1970s, when programmers had thin clients and admired “big iron” mainframes (how can you not like a machine with a then-gluttonous 8 bits to a byte?).”

  22. I don’t see how this can possibly work. Yes, I know it works for the Linux kernel and some other projects — I just do not see how. With a central repository, all of the developers are basing their work on the same code, and everyone’s changes are incorporated, so that everyone has everyone else’s code. You keep things separate by branching and merging. It’s all straightforward. I don’t know why Torvalds has such hate for branching and merging — it’s really no big deal, in a central VCS.

    But in this “distributed” system… if there are ten developers, every developer needs to individually merge the changes from the other ten? So instead of 10 merges, there are 100 merges, all being merged into different branches which may or may not be based on the same trunk code? To me, this looks like a nightmare.

    Again: yes, I know, it works for some projects. I just don’t see how.

  23. … unless, of course, you have one person who acts as the “maintainer”, whose repository becomes the de facto central repository. In which case, what is the point of all this extra added complexity?

    I guess I just don’t get it.

  24. Hi Scartis, great question. Regarding organization, you get a few benefits:

    * Choice of decentralized (default) or ‘centralized’ system (with a maintainer who pulls changes)
    * Very simple change management. Although branching/merging is “possible” in subversion, it’s difficult in practice because the system doesn’t keep track of the changes. So you don’t know if you’re applying the same patch twice.
    * Speed. Because it’s decentralized, you can make checkins/checkouts locally, which is very fast (no need to go over the network). In fact, you only need the network to sync so it works great in disconnected scenarios.

    Personally, I’m happy with subversion for my projects (any VCS is better than nothing) but DVCS is a step above.

  25. Hi Kalid,

    Thank you for such an impressive and informative article.
    I have gone through your other article as well; that is also equally impressive.

    Your articles have helped me understand the basics of SCM as well as DSCM.

    Thanks once again.

  26. If anyone is interested in a DVCS system check out Plastic SCM. We use a replication system that uses branches to synchronize data between individual servers which means users can always check in to a local server (individual or shared at an office) and sync up with other servers and even work concurrently in the same branches.

    Want to know how we handle conflicts and other scenarios? Check us out here:

    http://www.codicesoftware.com/xpproducts/xpcore/xpdistributed.aspx

    We can even share changes through email. Imagine two users working in total isolation from one another with the exception of emailing changes back and forth. We have a blog post here that shows off the email approach.

    http://codicesoftware.blogspot.com/2008/09/check-in-changes-using-email.html

  27. Dear Kalid,

    Thank you for your agreement on translation.
    I have translated this article into Chinese and posted it with a link to your original article URL. The translated article is posted on http://scmteamwork.blogspot.com/
    Since this article relates to the version control introduction, which also helps people understand what version control is, we are translating that article too. Of course, we will also post your original article link with the translated one.
    Thank you for your good illustration of the VCS concept, and we believe more people will enjoy it.

    Best regards,
    Yu-Hsiu Lin
    ESAST CO LTD
    http://www.esast.com

  28. Hi Kalid,

    Don’t forget that you can use a version control system for non-source-code files and personal files too.

    Obviously, any developer will use them for source code control, but time after time my fellow developers and co-workers are astounded that I use a personal repo for my other personal files too. Why keep 17 different backups of old copies of my resume/CV when I can put it into my local repo on my hard drive or my NAS at home? It makes life so much easier – and isn’t that what high tech is all about?

    I finally convinced my boss to keep all our design docs etc. under source control too. Finally, no more meetings where someone doesn’t have the latest copy of the design spec for review, etc.

    Love the site- keep up the good work
    Mike

  29. @Mike: Great point! Yes, source control systems are very nice for keeping any type of file in sync. Unfortunately binary formats don’t get the benefit of change tracking, but for keeping files up-to-date I’m sure it can be a lifesaver! Thanks for the comment.

  30. I tend to work on two or three person microprojects, and I tend to be the maintainer. I definitely find hg to be easier for small fast projects. I don’t know why, precisely, but it just seems simpler and more straightforward, everyone runs their own branch, they commit, combine, and push changes whenever they want, I get all their changes and branches whenever I pull, I weave the branches together into the current stable version. You can do the same things in svn, but it just feels more natural to me in hg.

  31. @J Paul: Thanks for the note — I don’t have experience as a maintainer, but I can see how the distributed model makes sense for multi-person projects. Everyone has their own copy which is combined / weaved together as necessary. Appreciate the insight!

  32. This article is a clear and useful introduction. Thank you.

    As a self-taught web programmer I do not have a background in CS. Version control for the web sites I have been involved with consisted of someone telling me this is my page to update :) Today I am learning the Drupal framework application and it’s more complicated. I have to learn some system.

    The issue of locking was mentioned in comment #21 above. I imagine locking to be important for inter-meshed projects with high degree of file dependencies (I’m thinking OO classes and objects). From what I know of OO it could be possible to over-work your area to cause overlap. In the case of Drupal, this framework is divided at the meta-logic level of its design into modules, themes, etc. Do the conceptual divisions keep programmers on track? And is there a similar conceptual division of labor making git usable for Linux development built in from the get-go by open source contribution?

    Then I would ask the programmers who do not see dVCS’s advantages, are you involved in conceptually complicated projects? Is it more difficult to have a sense of ownership where a centralized locking system or contract enforcement is necessary?

  33. @Xtian: Great question. In general, version control systems can only protect you from so many conflicts. Git is really good about managing changes, but ideally, you isolate components into different modules / files that can be worked on separately. If two people are working on different components, a DVCS like Git is definitely the easiest way to manage these different branches. Rather than locking, you have a set master branch that everyone can pull from (assumed to be the authoritative version) and the maintainer can examine the individual branches and merge them into master as needed. This way the master is always stable.

    Try taking a look at the branch diagram on http://betterexplained.com/articles/aha-moments-when-learning-git/ for more ideas on branch strategies. Hope this helps!

  34. I don’t even know how I ended up here, but I thought this post was great. I don’t know who you are, but you are definitely going to be a well-known blogger if you aren’t already. Cheers!

  35. The second diagram (The one explaining what a distributed version control system is compared to a centralised vcs) is not explained and I don’t understand it

  36. Hi Makar, check out http://betterexplained.com/articles/a-visual-guide-to-version-control/ to see how traditional version control systems work. I’ll try to clarify below.

    The 2nd diagram shows how the changes (which are usually linear, in a regular VCS) happen independently in a distributed model. These changes can be moved between users independently, without need to be sync’d to a central repo. This allows for more flexibility and easier collaboration.
