Monday 20 February 2012

To rebase or not to rebase, that is the question

In recent weeks I’ve made my first code contributions to open source projects on GitHub. A number of GitHub-hosted projects like their contributors to rebase their changes before issuing a pull request. Since I normally use Mercurial, which doesn’t enable rebasing by default (although it is easily turned on with an extension), I haven’t used rebase much except to experiment with it. And rebase is notorious for causing problems if you use it incorrectly.

Why would you want to rebase?

1. Simplified history

The main argument for rebasing is that it makes the history of your repository much simpler. Instead of seeing lots of short-lived branches and merge commits, you get one single linear history for your branch.

This might seem like a case of OCD on the part of the developers wanting rebase, but unfortunately, the task of looking back through source control history is something we’ve all had to do at some time or another, so the desire for it to appear as simple as possible is understandable.

If, like me, you quite often forget to do a pull before making a change to your local clone, you may end up requiring a merge when you do get round to doing a push. More often than not, it is a trivial merge, since your change doesn’t conflict with the most recent commits that you forgot to fetch. In this scenario, rebase is able to present your change as though you had remembered to do the pull before starting work, and eliminates the superfluous merge commit.

Also if you work on a feature for a few weeks, you might end up with lots of intermediate merge commits as you keep pulling from the source repository. Rebasing simplifies the history of your feature by eliminating the many (often trivial) merge commits.
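To make the difference concrete, here is a minimal sketch of a commit graph in Python. It is a toy model, not how Git or Mercurial store history: the `Commit` class and the `merge`, `rebase` and `history` helpers are names invented for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    """A toy commit: a name plus a tuple of parent commits."""
    name: str
    parents: tuple = ()


def merge(ours, theirs, name):
    """A merge keeps both lines of history and adds a two-parent commit."""
    return Commit(name, (ours, theirs))


def rebase(commits, onto):
    """Replay `commits` (oldest first) on top of `onto` as new commits."""
    tip = onto
    for commit in commits:
        tip = Commit(commit.name + "'", (tip,))
    return tip


def history(tip):
    """List the names of all commits reachable from `tip`."""
    seen, order, stack = set(), [], [tip]
    while stack:
        commit = stack.pop()
        if commit.name in seen:
            continue
        seen.add(commit.name)
        order.append(commit.name)
        stack.extend(reversed(commit.parents))
    return order


base = Commit("A")
upstream = Commit("B", (base,))   # the commit I forgot to pull
mine = Commit("C", (base,))       # my local change, made on the old base

merged = merge(mine, upstream, "M")
rebased = rebase([mine], upstream)

print(history(merged))    # four commits, including the merge commit M
print(history(rebased))   # three commits in a straight line
```

The rebased tip has a single parent all the way down, which is exactly the linear history described above. Note that the replayed commit is a new object (`C'`), not the original `C`.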

2. Commit combining

At the same time as rebasing, it is a common practice to combine your commits into one. Again, this is more about keeping the history of your repository nice and clean. If you made 20 commits to implement a single feature, perhaps with a few of them going in directions that you later backtracked out of, do you really want all those to be pushed up into the master repository?

Again, at some point in the future, someone will be trawling back through history trying to find where something went wrong, and it is a waste of their time to look at a revision that perhaps contains a silly mistake (might not even build), when your work could be combined into one revision.

Having your feature contained in a single commit also makes life simpler for the code reviewer, since they can just look at the diff for that one revision (although your tools ought to be able to show you a combined diff for a range of revisions; I noticed that GitHub does this very elegantly with pull requests that contain multiple revisions).

Some people have called commit combining a “pretend you are perfect” feature – where you show the final outcome of your work without revealing all the stupid mistakes you made along the way. But that shouldn’t be the point of combining commits. You do it to save time for whoever in the future needs to look back at the history of the project.
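As a rough sketch of what combining commits amounts to, the Python below treats each commit as a set of file changes and computes the net change between the starting point and the final result. The `apply_changes` and `squash` functions and the file names are hypothetical, made up for this example; real tools work on diffs rather than whole-file contents.

```python
def apply_changes(tree, changes):
    """Return a new tree with `changes` applied (None means delete the file)."""
    new_tree = dict(tree)
    for path, content in changes.items():
        if content is None:
            new_tree.pop(path, None)
        else:
            new_tree[path] = content
    return new_tree


def squash(base_tree, commits, message):
    """Collapse `commits` into one commit holding only the net change."""
    final_tree = base_tree
    for _message, changes in commits:
        final_tree = apply_changes(final_tree, changes)
    paths = set(base_tree) | set(final_tree)
    net = {
        path: final_tree.get(path)
        for path in paths
        if base_tree.get(path) != final_tree.get(path)
    }
    return (message, net)


base_tree = {"README": "hello"}
work_in_progress = [
    ("start feature", {"feature.py": "v1"}),
    ("try a different approach", {"experiment.py": "dead end"}),
    ("back out the experiment", {"experiment.py": None}),
    ("finish feature", {"feature.py": "v2"}),
]

single_commit = squash(base_tree, work_in_progress, "Add feature")
print(single_commit)
```

The dead-end experiment leaves no trace in the combined commit, because the file it added was removed again before the end. That is the time saving for the future reader, and also the information that is thrown away.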

What are the dangers?

So if rebase makes our history simpler, why would some people not want to use it? There are a few dangers associated with rebasing and combining commits.

1. Rebasing published revisions

The biggest danger with rebase is rebasing already published revisions. This is most likely to happen if you accidentally rebase the wrong way round (rebasing changes you pulled in from elsewhere onto your own work), something I imagine would be quite easy for a beginner with DVCS to do by mistake. Or maybe you forgot that you had already pushed the revisions you are about to rebase to a public repository, or pushed them unintentionally.

Doing this means that the revisions you rebased, rather than disappearing, will keep coming back again, and end up getting merged back in alongside the rebased versions of the same change. It can be very hard with a DVCS to get rid of revisions you no longer care about once they have been published.
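The reason the old revisions keep coming back is that a rebased revision is a different revision as far as the DVCS is concerned, because its identifier depends on its parent. The sketch below is a deliberate simplification: the `commit_id` function is invented for this example and hashes only the parent id and a description, whereas real identifiers cover more than that.

```python
import hashlib


def commit_id(parent_id, change):
    """Toy identifier: a hash of the parent id and the change description."""
    return hashlib.sha1(f"{parent_id}:{change}".encode()).hexdigest()


root = commit_id("", "initial commit")
upstream_tip = commit_id(root, "someone else's change")

# My change, committed on top of root and already pushed to the public repo.
original = commit_id(root, "my change")
public_repo = {root, upstream_tip, original}

# Rebasing replays the same change on a new parent, so it gets a new id.
rebased = commit_id(upstream_tip, "my change")
local_repo = {root, upstream_tip, rebased}

# The next pull brings the original back, alongside the rebased copy.
after_pull = local_repo | public_repo

print(original != rebased)   # same change, two different revisions
print(len(after_pull))       # both copies of "my change" are now present
```

Once anyone else has pulled the original, deleting it locally does not help: it simply returns with the next pull.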

In Mercurial 2.1, there is a new feature called phases, which gives the repository the ability to know which revisions have been published. This means that commands like rebase can now refuse to work on published revisions, making it a much safer command to use. It will be interesting to see how well this works in practice (and if it does work, maybe Mercurial will allow rebase and other history-changing extensions to be enabled by default). Having said that, since Mercurial (unlike Git) pushes and pulls all heads by default, you might find you end up sharing a work-in-progress branch earlier than you intended.
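The idea behind phases can be sketched in a few lines of Python. This is only a toy model under my own assumptions, not Mercurial's implementation: the `rebase` and `push` functions, the `PublishedRevisionError` exception and the revision names are all hypothetical, although "public" and "draft" are the names Mercurial uses for these phases.

```python
class PublishedRevisionError(Exception):
    """Raised when asked to rewrite a revision that is already public."""


def rebase(revisions, phases):
    """Pretend to rebase `revisions`, refusing if any of them is public."""
    published = [rev for rev in revisions if phases.get(rev) == "public"]
    if published:
        raise PublishedRevisionError(
            f"refusing to rebase public revisions: {published}"
        )
    return [rev + "'" for rev in revisions]


def push(revisions, phases):
    """Pushing marks the pushed revisions as public."""
    for rev in revisions:
        phases[rev] = "public"


phases = {"a1": "public", "b2": "draft", "c3": "draft"}

# Draft revisions have not been shared, so rewriting them is allowed.
rewritten = rebase(["b2", "c3"], phases)

# After a push, the same request is refused.
push(["b2"], phases)
try:
    rebase(["b2", "c3"], phases)
    refused = False
except PublishedRevisionError as error:
    refused = True
    print(error)
```

The safety comes from the repository tracking what has been shared, rather than relying on the developer to remember.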

2. Loss of information provided by a merge commit

One of the benefits of rebase, the removal of the merge commit, is also one of its dangers. It is possible for a rebase to complete successfully, with no merge conflicts, but for the resulting code to be broken (e.g. one developer adds a call to functionX, while another developer renames functionX to functionY).

The original commit you made may well have been working perfectly and passing all its unit tests when it was in its own branch, but now that it has been rebased, there is a commit with your name against it that doesn’t build or contains bugs.

With an explicit merge commit, it is much easier to identify the point at which things went wrong. This remains the main reason why I am not convinced that rebase should be a major part of my workflow. The important thing to remember is that a rebase is just like a merge – it needs to be tested before it can be considered a success, even if there were no conflicts.
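The functionX example can be reproduced in miniature. In the Python sketch below, a "repository" is just a dict of file contents, and conflicts are detected per file, which is cruder than the per-hunk detection real tools use. The `replay` and `builds` helpers and the file names are assumptions made for this illustration.

```python
base = {
    "lib.py": "def function_x():\n    return 42\n",
    "app.py": "result = None\n",
}

# Another developer renames function_x to function_y upstream.
upstream = {**base, "lib.py": "def function_y():\n    return 42\n"}

# Meanwhile, my commit adds a call to function_x in a different file.
my_change = {"app.py": "result = function_x()\n"}


def replay(tree, change, base):
    """Replay `change` onto `tree`; conflict only if both touched a file."""
    for path in change:
        if tree[path] != base[path]:
            raise RuntimeError(f"conflict in {path}")
    return {**tree, **change}


def builds(tree):
    """Stand-in for 'compiles and passes its tests': run lib.py then app.py."""
    try:
        exec(tree["lib.py"] + tree["app.py"], {})
    except NameError:
        return False
    return True


my_branch = replay(base, my_change, base)      # my commit as originally made
rebased = replay(upstream, my_change, base)    # the same commit after rebasing

print(builds(my_branch))   # True: it worked where I wrote it
print(builds(rebased))     # False: no conflict was reported, yet it is broken
```

The replay reports no conflict because the two changes touch different files, so only running the result reveals the problem. With a merge, the breakage would sit in the merge commit; after a rebase, it sits in my commit.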

3. Loss of intermediate revisions

The goal of rebasing and collapsing is to get rid of intermediate revisions. But I wonder whether you could shoot yourself in the foot by over-enthusiastic collapsing of multiple revisions into one. Often after getting a feature working, I might quickly go over the code and do some last-minute cleanup: renaming a few classes, deleting TODO comments and so on. But what if I accidentally break something in these final commits? If I collapse to a single commit, it is too late to revert the offending revision, or to rewind to the last good revision and make the changes correctly. Keeping your intermediate commits allows you to backtrack to the last good point.
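A small sketch shows what is lost. Here each revision simply records whether it is broken, and the `last_good` helper (a name invented for this example) walks back through history looking for the most recent revision that passes.

```python
def last_good(history, passes):
    """Return the most recent revision that passes, or None if none does."""
    for revision in reversed(history):
        if passes(revision):
            return revision
    return None


def passes(revision):
    """Stand-in for building the revision and running its tests."""
    return not revision["broken"]


full_history = [
    {"name": "feature working", "broken": False},
    {"name": "rename classes", "broken": False},
    {"name": "delete TODO comments", "broken": True},
]

# The same work collapsed into one commit inherits the final, broken state.
squashed_history = [{"name": "feature (collapsed)", "broken": True}]

print(last_good(full_history, passes))       # the "rename classes" revision
print(last_good(squashed_history, passes))   # None: nothing to rewind to
```

With the intermediate revisions kept, there is a known good point just one step back. After collapsing, the only option is to fix forward.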

Summary

In short, I think rebase is a useful tool to have available, but one that should be used with caution. Innovations like Mercurial’s phases could make it much safer, but on the whole, I prefer my source control history to show what really happened, rather than what I would have liked to happen in an ideal world.

As always, I welcome your thoughts on this in the comments.

2 comments:

Anonymous said...

all your base are belong to us

Unknown said...

Thanks for this, this is single-handedly the best explanation of rebase and phases I have seen!