Monday 20 February 2012

To rebase or not to rebase, that is the question

In recent weeks I’ve made my first code contributions to open source projects on GitHub. A number of GitHub-hosted projects like their contributors to rebase their changes before issuing a pull request. Since I normally use Mercurial, which doesn’t enable rebasing by default (although it is easily turned on with an extension), I haven’t used rebase much except to experiment with it. And rebase is notorious for causing problems if you use it incorrectly.

Why would you want to rebase?

1. Simplified history

The main argument for rebasing is that it makes the history of your repository much simpler. Instead of seeing lots of short-lived branches and merge commits, you get one single linear history for your branch.

This might seem like a case of OCD on the part of the developers wanting rebase, but unfortunately, the task of looking back through source control history is something we’ve all had to do at some time or another, so the desire for it to appear as simple as possible is understandable.

If, like me, you quite often forget to do a pull before making a change to your local clone, you may end up requiring a merge when you do get round to doing a push. More often than not it is a trivial merge, since your change doesn’t conflict with the most recent commits that you forgot to fetch. In this scenario, rebase is able to present your change as though you had remembered to do the pull before starting work, and eliminates the superfluous merge commit.
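To make the forgotten-pull scenario concrete, here is a minimal sketch using git. It builds throwaway repositories in a temporary directory; the repository names (seed, upstream.git, theirs, mine), the file names and the main branch name are all invented for the demo, and a local bare repository stands in for the shared server.

```shell
#!/bin/sh
# Sketch only: every repository, file and branch name here is invented.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
cd "$(mktemp -d)"

# Build a shared "upstream" repository with one commit on main.
git init -q seed
cd seed
git symbolic-ref HEAD refs/heads/main
echo base > base.txt
git add base.txt
git commit -q -m "Initial commit"
cd ..
git clone -q --bare seed upstream.git

# Two developers clone it.
git clone -q upstream.git theirs
git clone -q upstream.git mine

# Someone else pushes a change...
cd theirs
echo theirs > theirs.txt
git add theirs.txt
git commit -q -m "Their change"
git push -q origin main
cd ..

# ...while I commit locally, having forgotten to pull first.
cd mine
echo mine > mine.txt
git add mine.txt
git commit -q -m "My change"

# Replay my commit on top of the fetched work instead of merging it.
git pull -q --rebase origin main

merge_commits=$(git rev-list --merges --count HEAD)
total_commits=$(git rev-list --count HEAD)
echo "merge commits: $merge_commits, total commits: $total_commits"
```

With a plain git pull the same situation would produce a fourth, merge-only commit; with --rebase the history should stay linear.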

Also if you work on a feature for a few weeks, you might end up with lots of intermediate merge commits as you keep pulling from the source repository. Rebasing simplifies the history of your feature by eliminating the many (often trivial) merge commits.

2. Commit combining

At the same time as rebasing, it is a common practice to combine your commits into one. Again, this is more about keeping the history of your repository nice and clean. If you made 20 commits to implement a single feature, perhaps with a few of them going in directions that you later backtracked out of, do you really want all those to be pushed up into the master repository?

Again, at some point in the future, someone will be trawling back through history trying to find where something went wrong, and it is a waste of their time to look at a revision that perhaps contains a silly mistake (might not even build), when your work could be combined into one revision.

Having your feature contained in a single commit also makes life simpler for the code reviewer, since they can just look at the diff for that one revision (although your tools ought to be able to show you a combined diff for a range of revisions – I noticed that github does this very elegantly with pull requests that contain multiple revisions).

Some people have called commit combining a “pretend you are perfect” feature – where you show the final outcome of your work without revealing all the stupid mistakes you made along the way. But that shouldn’t be the point of combining commits. You do it to save time for whoever in the future needs to look back at the history of the project.
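As a rough illustration of commit combining, the sketch below collapses three work-in-progress commits into one using git merge --squash. This is only one of several ways to do it (an interactive rebase is another), and the repository, branch and file names are made up for the example.

```shell
#!/bin/sh
# Sketch only: repository, branch and file names are invented.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
cd "$(mktemp -d)"

git init -q repo
cd repo
git symbolic-ref HEAD refs/heads/main
echo base > base.txt
git add base.txt
git commit -q -m "Initial commit"

# Three small commits on a feature branch.
git checkout -q -b feature
for step in 1 2 3; do
    echo "step $step" >> feature.txt
    git add feature.txt
    git commit -q -m "Work in progress $step"
done

# Stage the combined result of the branch on main, then commit it once.
git checkout -q main
git merge --squash feature >/dev/null 2>&1
git commit -q -m "Add feature"

commits_on_main=$(git rev-list --count main)
if git diff --quiet main feature; then same_content=yes; else same_content=no; fi
echo "commits on main: $commits_on_main, same content as feature: $same_content"
```

The main branch ends up with the same files as the feature branch, but only one commit describing the whole feature.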

What are the dangers?

So if rebase makes our history simpler, why would some people not want to use it? There are a few dangers associated with rebasing and combining commits.

1. Rebasing published revisions

The biggest danger with rebase is rebasing already published revisions. This is most likely to happen if you accidentally rebase the wrong way round (rebasing changes you pulled in from elsewhere onto your own work), something I imagine would be quite easy for a beginner with DVCS to do by mistake. Or maybe you forgot that you had already pushed the revisions you are about to rebase to a public repository, or pushed them unintentionally.

Doing this means that the revisions you rebased, rather than disappearing, will keep coming back again, and end up getting merged back in alongside the rebased versions of the same change. It can be very hard with a DVCS to get rid of revisions you no longer care about once they have been published.

In Mercurial 2.1, there is a new feature called phases, which gives the repository the ability to know which revisions have been published. This means that commands like rebase can now refuse to work on published revisions, making it a much safer command to use. It will be interesting to see how well this works in practice, and if it does work, maybe Mercurial will allow rebase and other history-changing extensions to be enabled by default. Having said that, since Mercurial (unlike git) pushes and pulls all heads by default, you might find you end up sharing a work in progress branch earlier than you intended.

2. Loss of information provided by a merge commit

One of the benefits of rebase, the removal of the merge commit, is also one of its dangers. It is possible for a rebase to complete successfully, with no merge conflicts, but for the resulting code to be broken (e.g. one developer adds a call to functionX, while another developer renames functionX to functionY).

The original commit you made may well have been working perfectly and passing all its unit tests when it was in its own branch, but now it has been rebased there is a commit with your name against it that doesn’t build or contains bugs.

With an explicit merge commit, it is much easier to identify the point at which things went wrong. This remains the main reason why I am not convinced that rebase should be a major part of my workflow. The important thing to remember is that a rebase is just like a merge – it needs to be tested before it can be considered a success, even if there were no conflicts.
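The functionX/functionY situation can be reproduced in a few lines. The sketch below is a toy, with invented file names (lib.sh, caller.sh) and shell functions standing in for real code, but it shows a rebase that reports no conflicts and still leaves the feature commit broken.

```shell
#!/bin/sh
# Sketch only: lib.sh, caller.sh and the branch names are invented.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
cd "$(mktemp -d)"

git init -q repo
cd repo
git symbolic-ref HEAD refs/heads/main

# Base commit: a tiny library defining function_x.
echo 'function_x() { echo "hello"; }' > lib.sh
git add lib.sh
git commit -q -m "Add function_x"

# Feature branch: a new script that calls function_x.
git checkout -q -b feature
printf '. ./lib.sh\nfunction_x\n' > caller.sh
git add caller.sh
git commit -q -m "Call function_x"
if sh caller.sh >/dev/null 2>&1; then before_rebase=pass; else before_rebase=fail; fi

# Meanwhile on main, function_x is renamed to function_y.
git checkout -q main
echo 'function_y() { echo "hello"; }' > lib.sh
git commit -q -a -m "Rename function_x to function_y"

# The two changes touch different files, so the rebase completes cleanly.
git checkout -q feature
git rebase -q main

# Re-run the same check on the rebased commit.
if sh caller.sh >/dev/null 2>&1; then after_rebase=pass; else after_rebase=fail; fi
echo "before rebase: $before_rebase, after rebase: $after_rebase"
```

The check passes on the original branch and fails after the rebase, even though git never reported a conflict, which is why the rebased result needs to be built and tested like any merge.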

3. Loss of intermediate revisions

The goal of rebasing and collapsing is to get rid of intermediate revisions. But I wonder whether you could shoot yourself in the foot by over-enthusiastic collapsing of multiple revisions into one. Often after getting a feature working, I might quickly go over the code and do some last minute cleanup, refactoring a few class names, deleting TODO comments etc. But what if I accidentally break something in these final commits? If I collapse to a single commit it is too late to do a revert of the offending revision, or to rewind to the last good revision and make the changes correctly. Keeping your intermediate commits allows you to backtrack to the last good point.
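One possible safety net, sketched below, is to leave a branch pointing at the unsquashed commits before collapsing them, so the intermediate revisions stay reachable locally. The branch name feature-unsquashed and the other names are hypothetical, and this is just one approach rather than an established convention.

```shell
#!/bin/sh
# Sketch only: repository, branch and file names are invented.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
cd "$(mktemp -d)"

git init -q repo
cd repo
git symbolic-ref HEAD refs/heads/main
echo base > base.txt
git add base.txt
git commit -q -m "Initial commit"

# Three commits on the feature branch, the last being "cleanup".
git checkout -q -b feature
for step in implement fix cleanup; do
    echo "$step" >> feature.txt
    git add feature.txt
    git commit -q -m "Feature: $step"
done

# Keep a pointer to the intermediate revisions before collapsing.
git branch feature-unsquashed

# Collapse the three commits into one on the feature branch.
git reset -q --soft main
git commit -q -m "Add feature"

squashed=$(git rev-list --count main..feature)
preserved=$(git rev-list --count main..feature-unsquashed)
echo "commits on feature: $squashed, commits still on backup: $preserved"
```

If the cleanup commit turns out to have broken something, the last good revision can still be checked out from the backup branch.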

Summary

In short, I think rebase is a useful tool to have available, but one that should be used with caution. Innovations like Mercurial’s phases could make it much safer, but on the whole, I prefer my source control history to show what really happened, rather than what I would have liked to happen in an ideal world.

As always, I welcome your thoughts on this in the comments.

Saturday 18 February 2012

Modular WPF Screencast Part 3

Due to popular demand, I have finally got round to recording part three of my modular WPF application screencast series. Rather than jumping directly to a fully featured solution, I wanted to show how we might evolve an architecture step by step, and without being afraid to make some wrong choices that we will need to refactor later.

  • Part 1 covered setting up the framework to use MEF
  • Part 2 covered adding MVVM to make it unit testable

In this episode I walk through adding a new feature, the ability to cancel switching between modules, which turns out to be a bit more tricky than we might have anticipated, and ends up with us creating a templated list of buttons to replace our original ListBox. (n.b. an even better choice might have been to use a tab control, but I didn’t think of that when I created this tutorial, so maybe that can be another refactoring for a future episode).

Also, loads of people are asking for the code, so I have made the Mercurial repository publicly available. The reason you don’t see me coding in the video is that I am just switching between revisions of my repository (which also keeps the video much more succinct). I’ve bought a new headset too, so the voice quality ought to be better than before (although I’ll probably reduce the level slightly for next time). I also tried capturing at 1280x720, so there is a bit more screen real estate. Let me know if this is better or worse.

Thursday 16 February 2012

Fork First or Just in Time?

I’ve been following the progress of Code 52 a bit over recent weeks. It is an audacious attempt to create a new open source project every week for a year. They also seem willing to accept code contributions, so I was tempted to download a few of their projects and make some minor improvements.

I’ve been using Mercurial for my projects at CodePlex for some time now, but I hadn’t used git in anger, and since Code52 store all their projects on the very impressive github, it gave me a good excuse to learn.

I read a few tutorials on how to fork a repository in github. The official guide is good, and Scott Hanselman wrote a great post on how to contribute to Code 52 projects. But one thing they all have in common is a workflow of Fork, Clone, Commit, [optional: Pull & Merge], Push, Pull Request. This has always struck me as being the wrong way round. (CodePlex recommends essentially the same workflow for Mercurial).

Fork First

The reason I don’t like this workflow is that it assumes the first thing I want to do is create a fork. But that’s not how I typically interact with an open source project. My workflow goes like this:

  1. I come across a new open source project and maybe I find it interesting
  2. Often I will just want to download compiled binaries, but maybe I want to explore the code to see how it was implemented
  3. I clone it (git/hg clone) and maybe I will get round to playing with it later
  4. I attempt to build it locally and maybe it succeeds on my machine (surprising how often it doesn’t)
  5. I attempt to use it and maybe I find a bug or I wish it had a new feature
  6. I report the bug or feature to the developers, and maybe I think I could fix it myself
  7. I explore the source code, and maybe I understand it well enough to make a change
  8. I begin coding a fix/feature, and maybe I get it working
  9. I realise my code needs cleaning up before I issue a pull request, and maybe I get round to doing so
  10. If I have made it this far, now is the time I am ready to push to a public fork and issue a pull request. I estimate I get to this step on less than 1 percent of open source projects I come across.

As you can see, it is only at step 10 that I need to have a fork, but the tutorials all want me to make my fork at step 3. This results in lots of projects having multiple forks that have never been pushed to, or that have been pushed to without any pull request ever being submitted, leaving you wondering what the status of the changes is.

Just in time fork

In my opinion, forks (which are really just publicly visible clones) should be made just in time. Currently, Code 52’s Pretzel project has 47 forks, and as far as I can tell, many (most?) of them have had no changes pushed to them at all. (In fact, a nice github feature would be to hide forks that have not been pushed to yet, and to highlight forks that have pull requests outstanding).

The just in time fork workflow isn’t difficult. First clone from the main repository. Think of this clone as your private fork if that helps.

git clone https://github.com/Code52/pretzel.git

Once you decide to make some changes to your repo, you can make a branch to work on (not strictly necessary, but recommended).

git checkout -b my-new-feature

Now work away on your feature. You can pull in changes from the master repository, and optionally merge them into your working branch whenever you like.

Once you are sure that you want to contribute to the project, create your public fork on github. Then add it as a remote:

git remote add myfork https://myname@github.com/myfork/pretzel.git

You can now easily push to your github fork. I think it is probably best to also have a feature branch on your github fork, which means that if you wanted to contribute another unrelated feature, you could do that in another branch, and have two pull requests outstanding that weren’t dependent on each other.

git push myfork my-new-feature

The github gui makes it very easy to issue a pull request from a branch.
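The whole just in time sequence can be tried out offline. In the sketch below, local bare repositories stand in for the project's github repository and for the fork (on github the fork would be created with the Fork button, not with a clone), and the paths, file name and commit messages are invented for the demo.

```shell
#!/bin/sh
# Sketch only: local bare repositories stand in for github; names are invented.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
root=$(mktemp -d)
cd "$root"

# Stand-in for the main project repository.
git init -q seed
cd seed
git symbolic-ref HEAD refs/heads/main
echo base > base.txt
git add base.txt
git commit -q -m "Initial commit"
cd "$root"
git clone -q --bare seed upstream.git

# Clone the main repository first; no fork exists yet.
git clone -q upstream.git pretzel
cd pretzel
git checkout -q -b my-new-feature
echo feature > feature.txt
git add feature.txt
git commit -q -m "Add my new feature"

# Only now, with something worth contributing, create the fork.
cd "$root"
git clone -q --bare upstream.git myfork.git

# Add the fork as a second remote and push just the feature branch to it.
cd pretzel
git remote add myfork "$root/myfork.git"
git push -q myfork my-new-feature

on_fork=$(git ls-remote --heads myfork my-new-feature | wc -l | tr -d ' ')
on_upstream=$(git ls-remote --heads origin my-new-feature | wc -l | tr -d ' ')
echo "branch on fork: $on_fork, branch on upstream: $on_upstream"
```

The feature branch should exist only on the fork, while origin still points at the main repository, so pulling in upstream changes keeps working as before.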

Summary

Why create dozens of unused forks when it is straightforward to create them at the point they are needed? Am I missing some important reason why you shouldn’t work like this? Let me know in the comments.