Aha! Moments When Learning Git

Git is a fast, flexible but challenging distributed version control system. Before jumping in, get comfortable with regular version control and with distributed version control.

Along with a book, tutorial and cheatsheet, here are the insights that helped git click.

There's a staging area!

Git has a staging area. Git has a staging area!!!

Yowza, did this ever confuse me. There's both a repo ("object database") and a staging area (called "index"). Checkins have two steps:

• git add foo.txt
    • Add foo.txt to the index. It's not checked in yet!
• git commit -m "message"
    • Put staged files in the repo; they're now tracked
• You can "git add --update" to stage all tracked, modified files

Why stage? Git's flexible: if a, b and c are changed, you can commit them separately or together.

But now there are two undos:

• git checkout foo.txt
    • Undo local changes (like svn revert)
• git reset HEAD foo.txt
    • Remove from staging area (local copy still modified).
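
To make the three places concrete, here's a toy Python model. It is a sketch for intuition only, not how git is implemented, and the class and method names are made up. It tracks a working copy, an index and the last commit, and shows which of them each command above touches.

```python
class ToyGit:
    """Toy model of git's three areas. Illustrative only, not real git."""

    def __init__(self):
        self.working = {}   # files on disk: name -> contents
        self.index = {}     # staging area: what the next commit will contain
        self.head = {}      # last committed snapshot

    def add(self, name):
        # like "git add foo.txt": copy the working version into the index
        self.index[name] = self.working[name]

    def commit(self):
        # like "git commit": the index becomes the new committed snapshot
        self.head = dict(self.index)

    def checkout_file(self, name):
        # like "git checkout foo.txt": discard local edits, restore from the index
        self.working[name] = self.index[name]

    def reset_file(self, name):
        # like "git reset HEAD foo.txt": unstage, leaving the working copy alone
        if name in self.head:
            self.index[name] = self.head[name]
        else:
            self.index.pop(name, None)


g = ToyGit()
g.working["foo.txt"] = "v1"
g.add("foo.txt")
g.commit()

g.working["foo.txt"] = "v2"   # edit the file
g.add("foo.txt")              # stage the edit
g.reset_file("foo.txt")       # unstage: index back to v1, file on disk still v2
staged, on_disk = g.index["foo.txt"], g.working["foo.txt"]
g.checkout_file("foo.txt")    # throw away the edit: file on disk back to v1
```

The two undos work on different places: reset rewinds the index, checkout rewinds the file on disk.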

Add and commit, add and commit -- Git has a rhythm.

Branching is "Save as..."

Branches are like "Save as..." on a directory. Best of all:

• Easily merge changes with the original (changes tracked and never applied twice)
• No wasted space (common files only stored once)

Why branch? Consider the utility of "Save as..." for regular files: you tinker with multiple possibilities while keeping the original safe. Git enables this for directories, with the power to merge. (In practice, svn is like a single shared drive, where you can only revert to one backup).

Imagine virtual directories

I see branches as "virtual directories" in the .git folder. While inside a physical directory (c:\project or ~/project), you traverse virtual directories with a checkout.

• git checkout master
    • switch to master branch ("cd master")
• git branch dev
    • create new branch from existing ("cp * dev")
    • you still need to "cd" with "git checkout dev"
• git merge dev
    • (when in master) pull in changes from dev ("cp dev/* .")
• git branch
    • list all branches ("ls")

My inner dialogue is "change to dev directory (checkout)... make changes... save changes (add/commit)... change to master directory... copy in changes from dev (merge)".
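
Here's a rough sketch of why this "Save as..." costs no extra space, assuming we model a branch as nothing more than a name pointing at a saved snapshot. It's a toy in Python with invented names and ids, not git's actual storage format.

```python
# Toy model (illustrative only): a branch is a name pointing at a snapshot.
repo = {
    "snapshots": {"s1": ("milk",)},   # snapshot id -> saved contents
    "branches": {"master": "s1"},     # branch name -> snapshot id
    "current": "master",              # the "directory" we are in
}

def branch(name):
    # like "git branch dev": a second name for the same snapshot, nothing copied
    repo["branches"][name] = repo["branches"][repo["current"]]

def checkout(name):
    # like "git checkout dev": the "cd" step
    repo["current"] = name

def commit(snapshot_id, contents):
    # save a new snapshot and move only the current branch to it
    repo["snapshots"][snapshot_id] = contents
    repo["branches"][repo["current"]] = snapshot_id

branch("dev")
shared = repo["branches"]["dev"] == repo["branches"]["master"]   # both point at s1
checkout("dev")
commit("s2", ("milk", "eggs"))
print(repo["branches"])   # {'master': 's1', 'dev': 's2'}
```

Creating dev stored nothing new; only the commit on dev added a snapshot, and master never moved.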

The physical directory is a scratchpad. Virtual directories are affected by git commands:

• rm foo.txt
    • Remove foo.txt from your sandbox (restore it with "git checkout foo.txt")
• git rm foo.txt
    • Remove foo.txt from current virtual directory
    • Gotcha: you need to commit that change!

Know the current branch

Just like seeing your current directory, put the current branch in your prompt!

In my .bash_profile:

parse_git_branch() {
  # keep only the line marked with "*" and wrap the branch name in parentheses
  git branch 2> /dev/null | sed -e '/^[^*]/d' -e 's/* \(.*\)/(\1)/'
}

export PS1="\[\033[00m\]\u@\h\[\033[01;34m\] \W \[\033[31m\]\$(parse_git_branch) \[\033[00m\]$\[\033[00m\] "


Visualize your branch structure

Git leaves branch organization to you. Nvie.com has a great branch strategy:

• Have a mainline (master). Mentally it's on the far right.
• Create branches (master -> dev) and subbranches (dev -> featureX). The further from master, the crazier.
• Only merge with neighbors (master -> dev -> feature X, or featureX -> dev -> master)

Stay sane by choosing a branch layout up front. I have a master tracking a svn project, and dev for my own code. In general, master is clean so I can branch anytime for one-off fixes.

Understand local vs. remote

Git has local and remote commands; seeing both confused me ("When do you checkout vs. pull?"). Work locally, syncing remotely as needed.

Local data

• git init
    • create local repo
• use git add/commit/branch to work locally

Remote data

• git remote add name path-to-repo
    • track a remote repo (usually "origin") from an existing repo
    • remote branches are "origin/master", "origin/dev" etc.
• git branch -a
    • list all branches (remote and local)
• git clone path-to-repo
    • create a new local git repo copied from a remote one
    • local master tracks remote master
• git pull
    • fetch and merge changes from tracked remote branch (if in dev, pull from origin/dev)
• git push
    • send changes to tracked remote branch (if in dev, push to origin/dev)

Why local and remote? Subversion has central checkins, so you avoid committing unfinished work. With git, local commits are frequent and you only push when ready.

GUIDs are GOOD

Git addresses information by a hash (GUID) of its contents. If two branches are the same, they have the same GUID (and vice versa).

Why's this cool? We can create branches independently, merge them, and have a common GUID. No central numbering needed. Usually, we just compare the first few digits: "Are you on a93?".
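
To see "a hash of its contents" in action, here's a small Python sketch. It assumes git's default SHA-1 object format, where a file's contents (a "blob") are hashed together with a short "blob <size>" header. The function name is mine, not git's.

```python
import hashlib

def git_blob_id(content: bytes) -> str:
    """ID git gives a file's contents (a "blob") in its SHA-1 object format."""
    header = b"blob " + str(len(content)).encode() + b"\0"
    return hashlib.sha1(header + content).hexdigest()

a = git_blob_id(b"milk\neggs\n")
b = git_blob_id(b"milk\neggs\n")   # same bytes, created "independently"
c = git_blob_id(b"milk\nsoup\n")

print(a == b, a == c)   # True False
print(a[:7])            # the short form people quote to each other
```

No one had to hand out the number: the same contents give the same id on any machine.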

Tips & Tricks

For your .gitconfig:

[alias]
ci = commit
st = status
co = checkout
oneline = log --pretty=oneline
br = branch
la = log --pretty="format:%ad %h (%an): %s" --date=short


There are some GUI tools for git, but I prefer to learn via the command line. Git is opinionated software (which I like), and analogies help me understand its world view.

Join 450k Monthly Readers

Enjoy the article? There's plenty more to help you build a lasting, intuitive understanding of math. Join the newsletter for bonus content and the latest updates.

Intro to Distributed Version Control (Illustrated)

Traditional version control helps you back up, track and synchronize files. Distributed version control makes it easy to share changes. Done right, you can get the best of both worlds: simple merging and centralized releases.

Distributed? What’s wrong with regular version control?

Nothing — read a visual guide to version control if you want a quick refresher. Sure, some people will deride you for using an “ancient” system. But you’re still OK in my book: using any VCS is a positive step forward for a project.

Centralized VCS emerged from the 1970s, when programmers had thin clients and admired “big iron” mainframes (how can you not like a machine with a then-gluttonous 8 bits to a byte?).

Centralized is simple, and what you’d first invent: a single place everyone can check in and check out. It’s like a library where you get to scribble in the books.

This model works for backup, undo and synchronization but isn’t great for merging and branching changes people make. As projects grow, you want to split features into chunks, developing and testing in isolation and slowly merging changes into the main line. In reality, branching is cumbersome, so new features may come as a giant checkin, making changes difficult to manage and untangle if they go awry.

Sure, merging is always “possible” in a centralized system, but it’s not easy: you often need to track the merge yourself to avoid making the same change twice. Distributed systems make branching and merging painless because they rely on it.

A Few Diagrams, Please

Other tutorials have plenty of nitty-gritty text commands; here’s a visual look. To refresh, developers use a central repo in a typical VCS:

Everyone syncs and checks into the main trunk: Sue adds soup, Joe adds juice, and Eve adds eggs.

Sue’s change must go into main before it can be seen by others. Yes, theoretically Sue could make a new branch for other people to try out her changes, but this is a pain in a regular VCS.

Distributed Version Control Systems (DVCS)

In a distributed model, every developer has their own repo. Sue’s changes live in her local repo, which she can share with Joe or Eve:

But will it be a circus with no ringleader? Nope. If desired, everyone can push changes into a common repo, suspiciously like the centralized model above. This franken-repo contains the changes of Sue, Joe and Eve.

I wish distributed version control had a different name, such as “independent”, “federated” or “peer-to-peer.” The term “distributed” evokes thoughts of distributed computing, where work is split among a grid of machines (like searching for signals with SETI@home or doing protein folding).

A DVCS is not like SETI@home: each node is completely independent and sharing is optional (in SETI@home you must phone back your results).

Key Concepts In 5 Minutes

Here’s the basics; there’s a book on patch theory if you’re interested.

Core Concepts

• Centralized version control focuses on synchronizing, tracking, and backing up files.
• Distributed version control focuses on sharing changes; every change has a guid or unique id.
• Recording/Downloading and applying a change are separate steps (in a centralized system, they happen together).
• Distributed systems have no forced structure. You can create “centrally administered” locations or keep everyone as peers.

New Terminology

• push: send a change to another repository (may require permission)
• pull: grab a change from a repository
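
Here's a minimal sketch of push and pull, assuming each repo is just a dictionary of changes keyed by id. It's a toy Python model; the ids and function names are made up and no real DVCS stores changes this way.

```python
# Toy model (illustrative only): a repo is a dict of change id -> change.
def pull(dest, source):
    """Copy over the changes dest doesn't have yet; return the new ids."""
    new_ids = [cid for cid in source if cid not in dest]
    for cid in new_ids:
        dest[cid] = source[cid]
    return new_ids

def push(source, dest):
    """Pushing is pulling, seen from the other side."""
    return pull(dest, source)

sue = {"a93": "+Soup"}
joe = {"fa3": "+Juice"}

first = pull(joe, sue)    # Joe gets Sue's change
second = pull(joe, sue)   # nothing new: that id is already known
sent = push(joe, sue)     # Sue receives Joe's change

print(first, second, sent)   # ['a93'] [] ['fa3']
```

Because changes are identified by id rather than by where they came from, pulling the same change again is a no-op.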

Key Advantages

• Everyone has a local sandbox. You can make changes and roll back, all on your local machine. No more giant checkins; your incremental history is in your repo.
• It works offline. You only need to be online to share changes. Otherwise, you can happily stay on your local machine, checking in and undoing, no matter if the “server” is down or you’re on an airplane.
• It’s fast. Diffs, commits and reverts are all done locally. There’s no shaky network or server to ask for old revisions from a year ago.
• It handles changes well. Distributed version control systems were built around sharing changes. Every change has a guid which makes it easy to track.
• Branching and merging is easy. Because every developer “has their own branch”, every shared change is like reverse integration. But the guids make it easy to automatically combine changes and avoid duplicates.
• Less management. Distributed VCSes are easy to get running; there’s no “always-running” server software to install. Also, DVCSes may not require you to “add” new users; you just pick what URLs to pull from. This can avoid political headaches in large projects.

Key Disadvantages

• You still need a backup. Some claim your “backup” is the other machines that have your changes. I don’t buy it — what if they didn’t accept them all? What if they’re offline and you have new changes? With a DVCS, you still want a machine to push changes to “just in case”. (In Subversion, you usually dedicate a machine to store the main repo; do the same for a DVCS).
• There’s not really a “latest version”. If there’s no central location, you don’t immediately know whether to see Sue, Joe or Eve for the latest version. Again, a central location helps clarify what the latest “stable” release is.
• There aren’t really revision numbers. Every repo has its own revision numbers depending on the changes. Instead, people refer to change numbers: Pardon me, do you have change fa33e7b? (Remember, the id is an ugly guid). Thankfully, you can tag releases with meaningful names.

Mercurial Quickstart

Mercurial is a fast, simple DVCS. The nickname is hg, like the element Mercury.

cd project
hg init                        # create repo here
hg add list.txt                # start tracking file
hg commit -m "Added file"      # check file into local repo
hg log                         # see history; notice guid

changeset:   0:55bbcb7a4c24
user:        [email protected]
date:        Sun Oct 14 21:36:18 2007 -0400
summary:     Added file

# ...edit file...
hg revert list.txt             # revert to previous version

hg tag v1.0                    # tag this version
# ...edit file...
hg update -C v1.0              # "update" to the older tagged version; -C forces overwrite of local copy


Once Mercurial has initialized a directory, you have:

• A working copy. The files you are currently editing.
• A repository. A directory (.hg in Mercurial) containing all patches and metadata (comments, guids, dates, etc.). There’s no central server so the data stays with you.

In our distributed example, Sue, Joe and Eve have their own repos, with independent revision histories.

Understanding Updates and Merging

There’s a few items that confused me when learning about DVCS. First, updates happen in several steps:

• Getting the change into a repo (pushing or pulling)
• Applying the change to the files (update or merge)
• Saving the new version (commit)

Second, depending on the change, you can update or merge:

• Updates happen when there’s no ambiguity. For example, I pull changes to a file that only you’ve been editing. The file just jumps to the latest revision, since there’s no overlapping changes.
• Merges are needed when we have conflicting changes. If we both edit a file, we end up with two “branches” (i.e. alternate universes). One world has my changes, the other world has yours. In this case we (probably) want to merge the changes together into a single universe.

I’m still wrapping my head around how easily branches spring up and collapse in a DVCS:

In this case, a merge is needed because (+Soup) and (+Juice) are changes to a common parent: the list with just “Milk”. After Joe merges the files, Sue can do a regular “pull and update” to get the combined file from Joe. She doesn’t have to merge again on her own.

In Mercurial you can run:

hg incoming ../another-dir     # see pending changes
hg pull ../another-dir         # download changes

hg update                      # actually apply changes...
hg merge                       # ... or merge if needed

hg commit                      # check in merged file; unite branches


Yep, the “pull-merge-commit” cycle is long. Luckily, Mercurial has shortcuts to combine commands into a single one. Though it seems complex, it’s much easier than handling merges manually in Subversion.

Most merges are automatic. When conflicts come up, they are typically resolved quickly. Mercurial keeps track of the parent/child relationship for every change (our merged list has two parents), as well as the “heads” or latest changes in each branch. Before the merge we have two heads; afterwards, one.
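
The "two heads become one" idea can be sketched in a few lines of Python. This is a toy model, assuming a history is just a mapping from each change to its parents; the ids are made-up labels, not real changeset hashes.

```python
# Toy history (illustrative only): each change lists its parents.
def heads(parents):
    """Changes that no other change names as a parent."""
    has_child = {p for ps in parents.values() for p in ps}
    return sorted(cid for cid in parents if cid not in has_child)

history = {
    "milk": [],          # the common starting list
    "soup": ["milk"],    # Sue's change
    "juice": ["milk"],   # Joe's change
}
before = heads(history)

history["merged"] = ["soup", "juice"]   # the merge has two parents
after = heads(history)

print(before, after)   # ['juice', 'soup'] ['merged']
```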

Organizing a Distributed Project

Here’s one way to organize a distributed project:

Sue, Joe and Eve check changes into a common branch. They can trade patches with each other to do simple “buddy builds”: Hey buddy, can you try out these patches? I need to see if it works before I push to the experimental branch.

Later, a maintainer can review and pull changes from the experimental branch into a stable branch, which has the latest release. A distributed VCS helps isolate changes but still provides the “single source” of a centralized system. There are many models of development, from “pull only” (where maintainers decide what to take, as in Linux development) to “shared push” (which acts like a centralized system). A distributed VCS gives you flexibility in how a project is maintained.

Practice And Scathing Ridicule Makes Perfect

I’m a DVCS newbie, but am happy with what I’ve learned so far. I enjoy SVN, but it’s “fun” seeing how easy a merge can be. My suggestion is to start with Subversion, get a grasp for team collaboration, then experiment with a distributed model. With the proper layout a DVCS can do anything a centralized system can, with the added benefit of easy merging.

Online Resources

Notable Quotes:

• “How many have done a branch and merged it? How many of you enjoyed it?”
• “When you do a merge, you plan ahead for a week, then set aside a day to do it.”
• “Some people have 5, 10, 15 branches”. One branch is experimental. One branch is maintenance, etc.
• “CVS — you don’t commit. You make changes without committing. You never commit until it passes a giant test suite. People make 1-liner changes, knowing it can’t possibly break.”

So good luck, and watch out for the holy wars. Feel free to share any tips or suggestions below.


A Visual Guide to Version Control

Version Control (aka Revision Control aka Source Control) lets you track your files over time. Why do you care? So when you mess up you can easily get back to a previous working version.

You’ve probably cooked up your own version control system without realizing it had such a geeky name. Got any files like this? (Not these exact ones I hope).

• KalidAzadResumeOct2014.doc
• KalidAzadResumeMar2015.doc
• instacalc-logo3.png
• instacalc-logo4.png
• logo-old.png

It’s why we use “Save As”. You want the new file without obliterating the old one. It’s a common problem, and solutions are usually like this:

• Make a single backup copy (Document.old.txt).
• If we’re clever, we add a version number or date: Document_V1.txt, DocumentMarch2015.txt
• We may even use a shared folder so other people can see and edit files without sending them over email. Hopefully they relabel the file after they save it.

So Why Do We Need A Version Control System (VCS)?

Our shared folder/naming system is fine for class projects or one-time papers. But software projects? Not a chance.

Do you think the Windows source code sits in a shared folder like “Windows2007-Latest-UPDATED!!”, for anyone to edit? That every programmer just works in a different subfolder? No way.

Large, fast-changing projects with many authors need a Version Control System (geekspeak for “file database”) to track changes and avoid general chaos. A good VCS does the following:

• Backup and Restore. Files are saved as they are edited, and you can jump to any moment in time. Need that file as it was on Feb 23, 2007? No problem.
• Synchronization. Lets people share files and stay up-to-date with the latest version.
• Short-term undo. Monkeying with a file and messed it up? (That’s just like you, isn’t it?). Throw away your changes and go back to the “last known good” version in the database.
• Long-term undo. Sometimes we mess up bad. Suppose you made a change a year ago, and it had a bug. Jump back to the old version, and see what change was made that day.
• Track Changes. As files are updated, you can leave messages explaining why the change happened (stored in the VCS, not the file). This makes it easy to see how a file is evolving over time, and why.
• Track Ownership. A VCS tags every change with the name of the person who made it. Helpful for blamestorming, er, giving credit.
• Sandboxing, or insurance against yourself. Making a big change? You can make temporary changes in an isolated area, test and work out the kinks before “checking in” your changes.
• Branching and merging. A larger sandbox. You can branch a copy of your code into a separate area and modify it in isolation (tracking changes separately). Later, you can merge your work back into the common area.

Shared folders are quick and simple, but can’t beat these features.

Learn the Lingo

Most version control systems involve the following concepts, though the labels may be different.

Basic Setup

• Repository (repo): The database storing the files.
• Server: The computer storing the repo.
• Client: The computer connecting to the repo.
• Working Set/Working Copy: Your local directory of files, where you make changes.
• Trunk/Main: The primary location for code in the repo. Think of code as a family tree — the trunk is the main line.

Basic Actions

• Add: Put a file into the repo for the first time, i.e. begin tracking it with Version Control.
• Revision: What version a file is on (v1, v2, v3, etc.).
• Head: The latest revision in the repo.
• Check out: Download a file from the repo.
• Check in: Upload a file to the repository (if it has changed). The file gets a new revision number, and people can “check out” the latest one.
• Checkin Message: A short message describing what was changed.
• Changelog/History: A list of changes made to a file since it was created.
• Update/Sync: Synchronize your files with the latest from the repository. This lets you grab the latest revisions of all files.
• Revert: Throw away your local changes and reload the latest version from the repository.

Advanced Actions

• Branch: Create a separate copy of a file/folder for private use (bug fixing, testing, etc). Branch is both a verb (“branch the code”) and a noun (“Which branch is it in?”).
• Diff/Change/Delta: Finding the differences between two files. Useful for seeing what changed between revisions.
• Merge (or patch): Apply the changes from one file to another, to bring it up-to-date. For example, you can merge features from one branch into another. (At Microsoft this was called Reverse Integrate and Forward Integrate)
• Conflict: When pending changes to a file contradict each other (both changes cannot be applied).
• Resolve: Fixing the changes that contradict each other and checking in the correct version.
• Locking: Taking control of a file so nobody else can edit it until you unlock it. Some version control systems use this to avoid conflicts.
• Breaking the lock: Forcibly unlocking a file so you can edit it. It may be needed if someone locks a file and goes on vacation (or “calls in sick” the day Halo 3 comes out).
• Check out for edit: Checking out an “editable” version of a file. Some VCSes have editable files by default, others require an explicit command.

And a typical scenario goes like this:

Alice adds a file (list.txt) to the repository. She checks it out, makes a change (puts “milk” on the list), and checks it back in with a checkin message (“Added required item.”). The next morning, Bob updates his local working set and sees the latest revision of list.txt, which contains “milk”. He can browse the changelog or diff to see that Alice put “milk” the day before.

Visual Examples

This guide is purposefully high-level: most tutorials throw a bunch of text commands at you. Let’s cover the high-level concepts without getting stuck in the syntax (the Subversion manual is always there, don’t worry). Sometimes it’s nice to see what’s possible.

Checkins

The simplest scenario is checking in a file (list.txt) and modifying it over time.

Each time we check in a new version, we get a new revision (r1, r2, r3, etc.). In Subversion you’d do:

svn add list.txt
# ...modify the file...
svn ci list.txt -m "Changed the list"


The -m flag supplies the message for this checkin.

Checkouts and Editing

In reality, you might not keep checking in a file. You may have to check out, edit and check in. The cycle looks like this:

If you don’t like your changes and want to start over, you can revert to the previous version and start again (or stop). When checking out, you get the latest revision by default. If you want, you can specify a particular revision. In Subversion, run:

svn checkout http://path/to/trunk   # get the latest version of the files
# ...edit file...
svn revert list.txt                 # throw away changes
svn update -r 2 list.txt            # get a particular version


Diffs

The trunk has a history of changes as a file evolves. Diffs are the changes you made while editing: imagine you can “peel” them off and apply them to a file:

For example, to go from r1 to r2, we add eggs (+Eggs). Imagine peeling off that red sticker and placing it on r1, to get r2.

And to get from r2 to r3, we add Juice (+Juice). To get from r3 to r4, we remove Juice and add Soup (-Juice, +Soup).

Many version control systems store diffs rather than full copies of the file. This saves disk space: 4 revisions of a file doesn’t mean we have 4 copies; we have 1 copy and 3 small diffs. Pretty nifty, eh? In SVN, we diff two revisions of a file like this:

svn diff -r3:4 list.txt


Diffs help us notice changes (“How did you fix that bug again?”) and even apply them from one branch to another.

Bonus question: what’s the diff from r1 to r4?

+Eggs
+Soup


Notice how “Juice” wasn’t even involved — the direct jump from r1 to r4 doesn’t need that change, since Juice was overridden by Soup.
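
If you'd like to check the bonus answer yourself, here's a small Python sketch. It treats a diff as "items removed, items added", which is a simplification of what a real diff tool does. The list contents are my assumption (starting from just Milk); only the +/- changes come from the text above.

```python
def diff(old, new):
    """Return (removed, added) items between two versions of the list."""
    removed = [item for item in old if item not in new]
    added = [item for item in new if item not in old]
    return removed, added

def apply_diff(items, change):
    """Peel a (removed, added) diff off and stick it onto a list."""
    removed, added = change
    return [item for item in items if item not in removed] + added

# Assumed contents of each revision.
r1 = ["Milk"]
r2 = ["Milk", "Eggs"]
r3 = ["Milk", "Eggs", "Juice"]
r4 = ["Milk", "Eggs", "Soup"]

print(diff(r3, r4))   # (['Juice'], ['Soup'])
print(diff(r1, r4))   # ([], ['Eggs', 'Soup'])

# One full copy plus the small diffs is enough to rebuild the latest revision.
rebuilt = r1
for old, new in [(r1, r2), (r2, r3), (r3, r4)]:
    rebuilt = apply_diff(rebuilt, diff(old, new))
```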

Branching

Branches let us copy code into a separate folder so we can monkey with it separately:

For example, we can create a branch for new, experimental ideas for our list: crazy things like Rice or Eggo waffles. Depending on the version control system, creating a branch (copy) may change the revision number.

Now that we have a branch, we can change our code and work out the kinks. (“Hrm… waffles? I don’t know what the boss will think. Rice is a safe bet.”). Since we’re in a separate branch, we can make changes and test in isolation, knowing our changes won’t hurt anyone. And our branch history is under version control.

In Subversion, you create a branch simply by copying a directory to another.

svn copy http://path/to/trunk http://path/to/branch


So branching isn’t too tough of a concept: Pretend you copied your code into a different directory. You’ve probably branched your code in school projects, making sure you have a “fail safe” version you can return to if things blow up.

Merging

Branching sounds simple, right? Well, it’s not — figuring out how to merge changes from one branch to another can be tricky.

Let’s say we want to get the “Rice” feature from our experimental branch into the mainline. How would we do this? Diff r6 and r7 and apply that to the main line?

Wrongo. We only want to apply the changes that happened in the branch! That means we diff r5 and r6, and apply that to the main trunk:

If we diffed r6 and r7, we would lose the “Bread” feature that was in main. This is a subtle point — imagine “peeling off” the changes from the experimental branch (+Rice) and adding that to main. Main may have had other changes, which is ok — we just want to insert the Rice feature.

In Subversion, merging is very close to diffing. Inside the main trunk, run the command:

svn merge -r5:6 http://path/to/branch


This command diffs r5-r6 in the experimental branch and applies it to the current location. Older versions of Subversion (before 1.5) had no easy way to keep track of which merges had been applied, so if you weren’t careful you could apply the same changes twice; the advice then was to keep a changelog message reminding you that you’d already merged r5-r6 into main. Subversion 1.5 and later record merge history for you.
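
Here's the Rice merge as a toy Python sketch, again treating a diff as "items removed, items added". The list contents are assumptions chosen to match the story (main gains Bread, the branch gains Rice), and the helper names are mine, not Subversion's.

```python
def diff(old, new):
    """Return (removed, added) items between two versions of the list."""
    removed = [item for item in old if item not in new]
    added = [item for item in new if item not in old]
    return removed, added

def apply_diff(items, change):
    """Apply a (removed, added) diff to a list."""
    removed, added = change
    return [item for item in items if item not in removed] + added

# Assumed contents of each revision.
r5 = ["Milk", "Eggs", "Soup"]            # branch, freshly copied from main
r6 = ["Milk", "Eggs", "Soup", "Rice"]    # branch after adding Rice
r7 = ["Milk", "Eggs", "Soup", "Bread"]   # main after adding Bread

good = apply_diff(r7, diff(r5, r6))   # peel off +Rice and stick it on main
bad = apply_diff(r7, diff(r7, r6))    # "make main look like r6": Bread is lost
twice = apply_diff(good, diff(r5, r6))   # forgetting we already merged

print(good)    # ['Milk', 'Eggs', 'Soup', 'Bread', 'Rice']
print(bad)     # ['Milk', 'Eggs', 'Soup', 'Rice']
print(twice)   # ['Milk', 'Eggs', 'Soup', 'Bread', 'Rice', 'Rice']
```

The last line is the "same change applied twice" problem: nothing in this toy remembers that r5-r6 was already merged.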

Conflicts

Many times, the VCS can automatically merge changes to different parts of a file. Conflicts can arise when changes appear that don’t gel: Joe wants to remove eggs and replace it with cheese (-eggs, +cheese), and Sue wants to replace eggs with a hot dog (-eggs, +hot dog).

At this point it’s a race: if Joe checks in first, that’s the change that goes through (and Sue can’t make her change).

When changes overlap and contradict like this, the VCS may report a conflict and not let you check in — it’s up to you to check in a newer version that resolves this dilemma. A few approaches:

• Re-apply your changes. Sync to the latest version (r4) and re-apply your changes to this file: Add hot dog to the list that already has cheese.
• Override their changes with yours. Check out the latest version (r4), copy over your version, and check your version in. In effect, this removes cheese and replaces it with hot dog.

Conflicts are infrequent but can be a pain. Usually I update to the latest and re-apply my changes.
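
For a feel of how a VCS decides "automatic merge" versus "conflict", here's a toy three-way merge in Python. It's a sketch under a big simplifying assumption (all three versions have the same number of lines); real merge tools are much smarter, and the function name is made up.

```python
def merge3(base, mine, theirs):
    """Line-by-line three-way merge for same-length lists (toy version)."""
    merged, conflicts = [], []
    for b, m, t in zip(base, mine, theirs):
        if m == t:        # both agree (or neither changed it)
            merged.append(m)
        elif m == b:      # only they changed it
            merged.append(t)
        elif t == b:      # only I changed it
            merged.append(m)
        else:             # both changed it, differently
            merged.append(None)
            conflicts.append((b, m, t))
    return merged, conflicts

base = ["milk", "eggs"]
joe = ["milk", "cheese"]    # -eggs, +cheese
sue = ["milk", "hot dog"]   # -eggs, +hot dog

print(merge3(base, joe, sue))
# (['milk', None], [('eggs', 'cheese', 'hot dog')])

# Changes to different lines merge without help.
print(merge3(base, ["soy milk", "eggs"], sue))
# (['soy milk', 'hot dog'], [])
```

The conflict entry is where a human has to step in and pick cheese, hot dog or both.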

Tagging

Who would have thought a version control system would be Web 2.0 compliant? Many systems let you tag (label) any revision for easy reference. This way you can refer to “Release 1.0” instead of a particular build number:

In Subversion, tags are just branches that you agree not to edit; they are around for posterity, so you can see exactly what your version 1.0 release contained. Hence they end in a stub — there’s nowhere to go.

(in trunk)
svn copy http://path/to/revision http://path/to/tag


Real-life example: Managing Windows Source Code

We guessed that Windows was managed out of a shared folder, but it’s not the case. So how’s it done?

• There’s a main line with stable builds of Windows.
• Each group (Networking, User Interface, Media Player, etc.) has its own branch to develop new features. These are under development and less stable than main.

You develop new features in your branch and “Reverse Integrate (RI)” to get them into Main. Later, you “Forward Integrate” to bring the latest changes from Main into your branch:

Let’s say we’re at Media Player 10 and IE 6. The Media Player team makes version 11 in their own branch. When it’s ready and tested, there’s a patch from 10 – 11 which is applied to Main (just like the “Rice” example, but a tad more complicated). This is a reverse integration, from the branch to the trunk. The IE team can do the same thing.

Later, the Media Player team can pick up the latest code from other teams, like IE. In this case, Media Player forward integrates and gets the latest patches from main into their branch. This is like pulling in the “Bread” feature into the experimental branch, but again, more complicated.

So it’s RI and FI. Aye aye. This arrangement lets changes percolate throughout the branches, while keeping new code out of the main line. Cool, eh?

In reality, there’s many layers of branches and sub-branches, along with quality metrics that determine when you get to RI. But you get the idea: branches help manage complexity. Now you know the basics of how one of the largest software projects is organized.

Key Takeaways

My goal was to share high-level thoughts about version control systems. Here are the basics:

• Use version control. Seriously, it’s a good thing, even if you’re not writing an OS. It’s worth it for backups alone.
• Take it slow. I’m only now looking into branching and merging for my projects. Just get a handle on using version control and go from there. If you’re a small project, branching/merging may not be an issue. Large projects often have experienced maintainers who keep track of the branches and patches.
• Keep Learning. There’s plenty of guides for SVN, CVS, RCS, Git, Perforce or whatever system you’re using. The important thing is to know the concepts and realize every system has its own lingo and philosophy. Eric Sink has a detailed version control guide also.

These are the basics — as time goes on I’ll share specific lessons I’ve learned from my projects. Now that you’ve figured out a regular VCS, try an illustrated guide to distributed version control.
