Monday, 20 October 2008

Find Orphaned Source Files Using LINQ

On a project I am working on, a growing number of files are in source control but are not actually referenced by any .csproj file. I decided to write a quick-and-dirty command-line program to find these files and, at the same time, learn a bit of LINQ to XML.

During development, I ran into a couple of tricky issues. The first was how to combine some foreach loops into a single LINQ statement, and the second was how to construct the regex for matching source files. I could probably have solved both myself with a bit of time reading books, but I decided to throw them out onto Stack Overflow. Both were answered within a couple of minutes of asking. I have to say this site is incredible. Rather than treating it as a last resort for questions on which I have exhausted my own resources, I now think of it more as a super-knowledgeable co-worker whom you can ask a quick question and get a pointer in the right direction.

Here's the final code. I'm sure it could easily be turned into one nested LINQ query and improved on a little, but it does what I need. Feel free to suggest refactorings and enhancements in the comments.

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml.Linq;

namespace SolutionChecker
{
    public class Program
    {
        // Matches file names ending in .cs, except those ending in .g.cs.
        public const string SourceFilePattern = @"(?<!\.g)\.cs$";

        static void Main(string[] args)
        {
            string path = (args.Length > 0) ? args[0] : GetWorkingFolder();
            Regex regex = new Regex(SourceFilePattern);
            var allSourceFiles = from file in Directory.GetFiles(path, "*.cs", SearchOption.AllDirectories)
                                 where regex.IsMatch(file)
                                 select file;
            var projects = Directory.GetFiles(path, "*.csproj", SearchOption.AllDirectories);
            var activeSourceFiles = FindCSharpFiles(projects);
            var orphans = from sourceFile in allSourceFiles                          
                          where !activeSourceFiles.Contains(sourceFile)
                          select sourceFile;
            int count = 0;
            foreach (var orphan in orphans)
            {
                Console.WriteLine(orphan);
                count++;
            }
            Console.WriteLine("Found {0} orphans", count);
        }

        static string GetWorkingFolder()
        {
            return Path.GetDirectoryName(typeof(Program).Assembly.CodeBase.Replace("file:///", String.Empty));
        }

        static IEnumerable<string> FindCSharpFiles(IEnumerable<string> projectPaths)
        {
            string xmlNamespace = "{http://schemas.microsoft.com/developer/msbuild/2003}";
            
            return from projectPath in projectPaths
                   let xml = XDocument.Load(projectPath)
                   let dir = Path.GetDirectoryName(projectPath)
                   from c in xml.Descendants(xmlNamespace + "Compile")
                   let inc = c.Attribute("Include").Value
                   where inc.EndsWith(".cs")
                   select Path.Combine(dir, inc);
        }
    }
}
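
To show what the SourceFilePattern regex in the listing above accepts and rejects, here is a small sketch. The class and method names (SourceFilePatternDemo, IsHandWrittenSource) are hypothetical and only for illustration. The negative lookbehind is presumably there to skip generated files such as the *.g.cs files that WPF builds produce; that reading is my assumption.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SourceFilePatternDemo
{
    // Same pattern as SourceFilePattern in the listing above.
    public const string Pattern = @"(?<!\.g)\.cs$";

    // Hypothetical helper: true when the name ends in .cs
    // and the .cs is not immediately preceded by ".g".
    public static bool IsHandWrittenSource(string fileName)
    {
        return Regex.IsMatch(fileName, Pattern);
    }
}
```

With this helper, "Program.cs" and "Form1.Designer.cs" match, while "Window1.g.cs" and "Program.cs.bak" do not. Note that the match is case-sensitive, so a file named "PROGRAM.CS" would be skipped unless RegexOptions.IgnoreCase is added.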

2 comments:

Joel Haasnoot said...

Using a HashSet as the result of FindCSharpFiles() provides a significant speed improvement. Our 1000+ file project is analyzed much, much faster.

Unknown said...

Thanks for the tip, Joel. I haven't really got into the HashSet class, as our project is still on .NET 2.0. Looks like it's a useful class.
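
Joel's suggestion can be sketched roughly as follows. In the listing above, FindCSharpFiles returns a deferred LINQ query, so each call to activeSourceFiles.Contains(sourceFile) re-runs that query (reloading project files) for every source file. Copying the results into a HashSet<string> once avoids this. The OrphanFinder class and FindOrphans method are hypothetical names, and the case-insensitive comparer is my assumption for Windows-style paths, not something the original code does.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class OrphanFinder
{
    // Hypothetical helper: returns the files in allSourceFiles
    // that do not appear in activeSourceFiles.
    public static List<string> FindOrphans(
        IEnumerable<string> allSourceFiles,
        IEnumerable<string> activeSourceFiles)
    {
        // Enumerate the (possibly deferred) query exactly once.
        // Assumption: paths are compared case-insensitively.
        var active = new HashSet<string>(
            activeSourceFiles, StringComparer.OrdinalIgnoreCase);

        // Each lookup is now a hash probe rather than a fresh enumeration.
        return allSourceFiles.Where(f => !active.Contains(f)).ToList();
    }
}
```

In Main, this would stand in for the orphans query, for example FindOrphans(allSourceFiles, FindCSharpFiles(projects)). It does not normalize paths, so an Include value containing "..\" segments would still need Path.GetFullPath before comparison.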