Monday 27 January 2014

Announcing Understanding Distributed Version Control

I’m very pleased to announce that my third Pluralsight course has been published today. This one is entitled “Understanding Distributed Version Control”. Regular followers of my blog will know this is a subject I often write about and have spoken on at various developer groups. This course draws on and expands that material to provide what I hope will be a really accessible introduction to what Distributed Version Control systems are, how they work, what the workflow is, and why you should consider using them.

The course is aimed at anyone who is interested in finding out what all the fuss around DVCS is about. I know that when I started investigating DVCS, it seemed quite confusing at first, so I have tried to make the type of course that I wish I had seen back then. I focus in particular on explaining the way that DVCS systems store the revision history (in a graph structure known as a “DAG”). Personally I think that once this concept becomes clear in your mind, much of the complexity of DVCS goes away.

I have also tried to show what benefits there are to switching from centralized to distributed version control systems. I know many developers quite sensibly take an “if it ain’t broke don’t fix it” attitude, and so can be reluctant to consider such a fundamental shift in the way they use version control. I’ve tried to show that whether you work alone on a “single developer project”, or work on open source, or work in a large team of developers commercially, DVCS has some very compelling benefits. I’ve also tried to be open about some of the limitations of DVCS compared with centralized tools. I am very enthusiastic about the power and potential of DVCS, but I do recognise that there are still some rough edges in the tooling at the moment.

The final thing I try to do in this course is give you a feel for the general workflow involved in using DVCS day to day. I’ve done this by showing demos in both Mercurial and Git. I quite deliberately chose to use two different DVCS tools to show that the main principles are the same, irrespective of the particular tool you choose. Since there are many other great tutorials as well as Pluralsight courses going into more detail on the specifics of individual DVCS such as Git or Mercurial, I didn’t want to repeat that material. Really my course is designed as a precursor to those courses, helping you understand the big picture, before you get into the nitty-gritty of learning the command line syntax for various operations.

I hope you enjoy the course and find it helpful. I love to hear feedback from people who have watched the course, so let me know how you get on with it. I always want to know how I can improve my courses, so any constructive criticism will also be gratefully received.

Saturday 18 January 2014

Five Audio Processing Tasks that are a Lot Harder than you Think

I regularly get emails along these lines…

“Hi, I am a student and I need to make program to take MP3 speech recording and output what the words are. I tried Microsoft speech recognition but it kept getting wrong. I need 100% accuracy. And if there is more than one person speaking it needs to say who is speaking. It has to use NAudio, but I am new to audio programming, so please send me the codes to do this. Please hurry because I need it by Friday.”

It seems to be a common problem that people new to audio processing greatly underestimate how difficult some tasks are. So here I present the top five audio processing problems I get asked about. All of which you might find frustratingly difficult to solve…

1) Speech Recognition

Speech recognition is slowly becoming more and more mainstream. Apple has Siri, Windows comes with built-in speech recognition, Google have their own speech recognition technology. The trouble is, despite their huge R&D budgets, none of these three leading software companies have actually produced speech recognition engines that don’t get it hilariously wrong on a regular basis.

Speech recognition is such a difficult problem that most existing products need to spend a certain amount of time learning the way you speak (regional accents can be a huge problem), and often they are simply looking for matches in a limited set of keywords (e.g. “open”, “play”, “search”) which can increase their reliability.

One of the biggest issues is that in human speech two completely different sentences can sound exactly the same. For example, how does a computer know if you said “I would like an ice-cream” or “Eye wood lie can I scream”? Humans know because the second sentence is complete nonsense. For a computer to know that, it also needs to learn about the rules of grammar, and understand the wider context of the audio it is transcribing.

In short, if anyone asks me if they can implement a speech recognition algorithm from scratch using NAudio, I tell them to give up, unless they’ve got a lot of time, and have a large team of signal processing experts they can call on.

2) Speaker Recognition

This is a related problem to speech recognition, where you attempt to determine who is speaking in a recording of a conversation. Again, this is fraught with difficulties. First, how does the computer know how many speakers there are? Ideally, it should be given a voice sample of each speaker individually, to build up some kind of profile. You might have some success using an FFT to get the pitch information, but this would only likely be successful if the speakers had very different voices (e.g. a male and female voice, or an adult and a child).
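
To make that concrete, here is a minimal sketch of the kind of naive FFT-based pitch estimate described above. It’s my own illustration in Python with numpy, not code from any speech product; it simply picks the strongest spectral peak, which for a real voice is often a harmonic rather than the fundamental, and that is exactly why this sort of approach only separates very different voices.

import numpy as np

def estimate_pitch(samples, sample_rate):
    # Return the frequency (Hz) of the strongest spectral peak.
    windowed = samples * np.hanning(len(samples))
    spectrum = np.abs(np.fft.rfft(windowed))
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / sample_rate)
    usable = freqs > 50  # ignore DC and rumble below typical speech pitch
    peak = np.argmax(spectrum[usable])
    return freqs[usable][peak]

# fake "voice": a pure 220Hz tone, one second at 44.1kHz
sr = 44100
t = np.arange(sr) / sr
print(estimate_pitch(np.sin(2 * np.pi * 220 * t), sr))  # roughly 220.0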

Doubtless there are some state of the art algorithms being developed somewhere to do this, but I know of none in the public domain, and anyone who solves this problem is likely to keep their technique a closely guarded secret.

3) Transcribing Music

Is it possible to take a piece of music and turn it into sheet music, indicating what notes were played when? Well it depends very much on what exactly is being played. If you have a recording of a single monophonic instrument being played, then pitch detection may well give you a decent transcription.

But if you have recorded a polyphonic instrument, such as a piano or a guitar (or worse yet, a whole band), then things get a whole lot more difficult. It becomes a lot less clear when a note starts and stops, and which note exactly is being played. One of the big issues is harmonics. When you play a middle C on a piano, you don’t just hear a single frequency (261Hz), but rather a whole host of other frequencies as well. It’s these harmonics that give each instrument its rich and distinctive sound. This means there is inevitably some amount of guesswork involved in determining which note(s) exactly were being played in order to produce the complex set of frequencies that have been detected.
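
As a rough illustration of the problem (my own example, using approximate pitch values rather than anything from the post), here is how the harmonic series of middle C collides with that of the G above it:

C4, G4 = 261.63, 392.00  # approximate fundamental frequencies in Hz
print([round(C4 * n, 1) for n in range(1, 7)])  # harmonics of C4
print([round(G4 * n, 1) for n in range(1, 7)])  # harmonics of G4
# The third harmonic of C4 (about 785Hz) sits almost on top of the second
# harmonic of G4 (784Hz), so a C and a G played together can look to an FFT
# very much like a single C with slightly odd harmonics.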

4) BPM Detection

This request I think comes from DJs who want to group tracks together by their BPM. In theory this should be easy – detect each beat, count how many beats there are in a given time interval, and then calculate the BPM.

The trouble is that there is no rule in music that the kick drum must play on every beat, or that the snare must only hit on beats 2 and 4. Some music has no percussion, and if there are drums, they can be very busy or very sparse. So even if you did create an algorithm that detected the “transient” for each kick or snare hit (or strum on the guitar, or strike of a bongo), you would need to come up with a strategy for ignoring the ones that weren’t on the beat. For example, if the music is in 12/8 you could end up detecting a BPM that is far too high.

Depending on the type of music you are analysing, you may actually get reasonable success with a primitive BPM detection algorithm. For example, if it is all four to the floor dance music then you might be able to get consistently good results. Probably it would be best to measure the BPM in several places in the song, and select the most common one.
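
Here is a hedged sketch of that last idea in Python (mine, not a production algorithm), assuming you already have a list of detected beat timestamps, which is of course the hard part: measure the tempo over several windows and keep the most common answer.

from collections import Counter

def estimate_bpm(beat_times, window_seconds=15):
    # beat_times: sorted timestamps (in seconds) of detected transients
    counts = Counter()
    t = beat_times[0]
    while t + window_seconds <= beat_times[-1]:
        beats = sum(1 for b in beat_times if t <= b < t + window_seconds)
        counts[round(beats * 60 / window_seconds)] += 1
        t += window_seconds
    return counts.most_common(1)[0][0] if counts else None

# a perfectly steady 128 BPM track lasting one minute
print(estimate_bpm([i * 60 / 128 for i in range(128)]))  # 128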

5) Song Matching

This problem is where you have a snippet of audio, maybe someone humming a tune, or a recording made with your phone, and want to match it to a database of songs. What song is my snippet from?

This turns out to be extremely difficult. You could try to solve it by matching the melody, looking for similar patterns of notes, but even that is fraught with difficulty: how do you extract just the melody from each song and store it in such a way that it can easily be matched against?

One additional complicating factor is that the same song can be sung in different keys and at different speeds. You can easily recognise a song you know even if it is sung by someone with a completely different sounding voice and on a completely different instrument to the original performer. But a computer will find that a lot harder to do.
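
One partial answer to the key problem, sketched below with made-up note data (this is my own illustration, not a description of how any real service works), is to compare the intervals between notes rather than the notes themselves, which makes the match independent of transposition. It does nothing about differences in tempo.

def intervals(midi_notes):
    return [b - a for a, b in zip(midi_notes, midi_notes[1:])]

original   = [60, 62, 64, 65, 67]  # C D E F G
transposed = [65, 67, 69, 70, 72]  # the same tune sung a fourth higher
print(intervals(original) == intervals(transposed))  # True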

This is a problem that some big music companies are attempting to tackle, as it could be used to help them identify illegal distribution of songs. But it’s very unlikely that they will reveal the secrets of any algorithms they do come up with. So if you do want to tackle this problem, don’t expect to find much by way of helpful information.

Conclusion

I don’t write this to suggest that these five tasks are impossible. With enough effort and ingenuity I am sure great solutions to each one can be found. What I am trying to say is that there are many ways in which the human brain is still vastly superior to state of the art software technology, particularly when it comes to the types of recognition tasks discussed here. So if you have dreams of creating the next killer audio application using NAudio, by all means try, but make sure you have realistically set your expectations of what can be achieved. And if you do want to tackle one of these five problems listed above, prepare to spend lots of time learning advanced DSP techniques, and learning to live with much less than 100% accuracy.

Friday 17 January 2014

Delete Dead Code

If you are working on a large codebase the chances are there is some dead code in there. Maybe a class that isn’t used any more, an if statement whose condition can never be satisfied, or a click handler for a button that is no longer visible on the GUI. This code could simply be deleted and it would have no negative effect on your application.

But what harm is it actually doing? Does it even matter if your code is littered with vestigial methods? The impact on build times and memory consumption is probably negligible, so why care about the presence of dead code? Why spend time eliminating it from your codebase?

The reason to remove it is simple: dead code wastes time. There are several ways in which this happens. First, unused code can break the build. A change you make in live code can cause a compile error in dead code, forcing you to investigate and fix it. Or the unit tests for the dead code start failing, and again an investigation is needed. The trouble is, usually it’s not obvious that you’re fixing problems in dead code, so you end up wasting your time fixing it when you could just delete it.

I once worked on a large VS solution where an entire project stopped compiling with an ILMerge error caused by an unrelated feature added elsewhere. It took someone three days to fix. On further investigation, the non-building project was in fact completely obsolete. Had it been deleted at the time it was obsoleted, those three days would have been saved.

Dead code also hinders future development. When you are contemplating making a change to a method, you’ll probably search the codebase for all places that might be affected. So you do a “Find all references”, and discover that there are 12 places this method is called. For each of those 12 places, you’ll need to read the code and understand what it is doing, before you know if your change is safe to make. Each one of those references will take time to understand, and if they seem particularly complex, might even dissuade you from making your desired change. If any of those references came from dead code, then it was a waste of your time even thinking about them.

Perhaps you don’t believe me. Here’s a simple experiment you can do. Next time you come across a class that has been unused for several years, take a look at the source code history. In my experience, an unused class will end up being modified at least once every four months, which means it is probably being read and thought about every couple of weeks. It’s slowing you down by making you think about it.

In conclusion, if you come across dead code in your codebase, delete it. Make sure you are the last person whose time is wasted by it.

Tuesday 14 January 2014

Storing Large Files in DVCS

I’m hoping to release a new Pluralsight course shortly on Distributed Version Control. It’s a generalised course, not specifically about Git or Mercurial, although I use those two as my main examples.

One of the topics I briefly touch on is the issue of storing large files within a DVCS repository. If you have used a centralized version control system, you may be used to putting all kinds of huge binary files into source control, such as test data, installers, dependencies. But this is generally not recommended in the world of DVCS. Why?

Well, it’s not because it doesn’t work. If you want to put files that are hundreds of megabytes in size into a DVCS, it will have no problem doing so.

Slow Clone

One of the main reasons is the fact that with distributed version control, clone gets everything. It gets all the files and folders in your repository. This is quite different from the way that centralized systems work. With centralized version control, you can usually ask just to get the latest version of a single folder within a repository, and work with that. So you can avoid having to download large files if you know you don’t need them.

With DVCS, a clone gets not only the latest versions of all files in your repository, but all historical versions too. This means that if you add a huge file and then delete it in a later commit, you won’t make clone any quicker. It will still need to download that huge file, in case you want to go back in history to that version.

So storing huge files in a DVCS repo will make clone slow. Of course, once you’ve done the clone, everyday operations will be nice and quick again. So you may decide you can put up with a slow clone in order to store all the files you want to in your repository.

Memory Usage

Another issue is that some DVCS tools are written assuming they can load individual files into memory. Mercurial, for example, will warn you when you add a file over about 10MB that it can sometimes need 3-4 times the size of the file in RAM. So if you were adding a 500MB file, you could quite easily run into an out-of-memory error. Git doesn’t give you any warnings, so it may not suffer from the same problem, although I have read reports of people having memory issues when dealing with huge files.

Not Source Code

There are other reasons not to store huge files in repositories. They are typically binary files, which are rarely mergable and very hard to diff. It may be better to recognise that these assets are not source code, and to store them elsewhere on a server. There are a number of extensions for DVCS tools like Git and Mercurial that make it easier to host properly versioned copies of large binary files on a centralized server, allowing you to create a hybrid distributed/centralized system. (e.g. see git-annex)

One approach is for a large files server to make the files available over http, with a separate URL not just for each file, but for each version of the file. This is important, because if you go back to build a historical version of your product, you will want the big file in the state it was in at the time.

http://mylargefilesserver/biginstaller.exe/v1/
http://mylargefilesserver/welcome_video.mp4/v1/
http://mylargefilesserver/welcome_video.mp4/v2/
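
A build script can then fetch exactly the version it needs, for example with curl (the server and file names here are just the hypothetical ones above):

curl -o welcome_video.mp4 http://mylargefilesserver/welcome_video.mp4/v2/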

It’s up to you whether the inconvenience of having to manage a separate large files server is worth the advantage of keeping your main repository size small. It probably also depends on whether these large files are essential for developers to be able to hotfix old versions, or whether they can do without them, and only the build machine actually needs everything.

Erasing a Large File from Repository History

If you inadvertently checked a large file into your repository, and now wish you hadn’t, it can be really hard to get rid of, especially if new commits have been made since you added it. The usual approach to fixing this is to rewrite history, creating a brand new repository with all the same changes but with the large file explicitly excluded.

In Mercurial you can do this quite easily with the convert extension. You need to create a filemap (e.g. filemap.txt), which indicates what files you want to exclude (the filemap can also be used to rename or move things). Here I simply want to remove a large SDK that shouldn’t have been added to source control, so my filemap contains a single line:

exclude "DXSDK_Jun10.exe"

Now we can run the convert extension to create a new repository that excludes DXSDK_Jun10.exe:

hg convert --filemap filemap.txt original_repo new_repo

And now you have a new repository without the problematic large file. One word of caution though. While the hashes of all the revisions before the large file was added will stay the same, all the hashes of revisions afterwards will change. So everyone needs to clone the new repository and base their new work off that. If anyone is using an old clone containing the large file, there is a chance it could end up getting pulled back into the main repository (it’s the same problem you can run into if you rebase published commits).
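
As a quick sanity check (my own suggestion, not part of the convert step), you can ask Mercurial to list every file that appears anywhere in the new repository’s history, and confirm that DXSDK_Jun10.exe is no longer mentioned:

hg manifest --all -R new_repo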

Hopefully in the future we’ll see some innovation in DVCS tools for making large file support a smoother experience. I think there is some scope for a lazy clone, where large files (particularly deleted ones) don’t have to come down as part of the initial clone, but are pulled down on demand if needed.

Monday 13 January 2014

Git Stash for Mercurial Users

One of the really nice features of git is the stash command. It allows you to put some work in progress to the side and get back to a clean working directory, without needing to make an actual commit of half-finished code.

Mercurial doesn’t come with a stash command, but it does have an extension called shelve which does the same thing.
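
For reference, the core git commands being mirrored here are:

git stash
git stash list
git stash pop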

Enabling the Shelve Extension

To turn it on, you need the following lines in your mercurial.ini file (or to enable it in just a single repository, add them to .hg/hgrc):

[extensions]
shelve =

Or if you are using TortoiseHg, you can easily enable the extension in the settings dialog.

Shelving your Changes

Shelving is extremely straightforward. Simply type:

hg shelve

or you can name your shelveset:

hg shelve --name "some name"

Note that shelving only includes changes to files that Mercurial is already tracking, so if you’ve created any new files, do an hg add first to make sure they are included in the shelveset.
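
So a typical sequence might look like this (the file name is just a placeholder for this example):

hg add NewFeature.cs
hg shelve --name feature-spike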

Now you have a clean working folder, allowing you to work on a different task and make a commit.

Unshelving

Unshelving is really simple. If you have just one shelveset, then all you need to type is:

hg unshelve

If you have more than one shelveset then you can use the name you gave it:

hg unshelve --name "some name"

Note that this will actually perform a merge if necessary (that is, if you have modified the same files in the meantime). No commit happens when you unshelve; it simply gets you back to where you were, with the same pending changes.

As you can see, this extension is really simple to use, and very useful to have at your disposal if you need to make a quick context switch, or if you start doing some work that turns out to be more complicated than you thought.