Monday, 3 September 2012

Screen-Scraping in C# using LINQPad and HTML Agility Pack

I am a big fan of LINQPad for prototyping small bits of code, but every now and then you need to reference another library. LINQPad allows this: edit the Query Properties and add a reference to a DLL in the GAC or at a specific path. The new LINQPad beta release makes life even easier by letting you reference NuGet packages directly.

I recently wanted to do a simple bit of screen-scraping, to extract the results from a web page containing football scores. By examining the HTML with Firebug I could quickly see that the fixtures were in a series of tables, one per month, each with a header row followed by one row per result. The HTML looked something like this:

<table class="fixtures" width="502" border="0">
<tbody>
    <tr class="fixture-header">
        <th class="first" scope="col" colspan="6">November</th>
        <th class="goals-for" title="Goals For" scope="col">F</th>
        <th class="goals-against" title="Goals Against" scope="col">A</th>
        <th class="tv-channel" scope="col">
        <th class="last" scope="col"> </th>
    </tr>
    <tr class="home">
        <td class="first ">04</td>
        <td class="month"> Wed </td>
        <td class="fixture-icon">
        <td class="competition">UEFA Champions League</td>
        <td class="home-away">H</td>
        <td class="bold opposition ">
        <td class="goals-for"> 4 </td>
        <td class="goals-against"> 1 </td>
        <td class="tv-channel"> </td>
        <td class="menu-button" valign="middle">
    </tr>

To navigate around HTML in .NET, by far the best library I have found is the HTML Agility Pack. Adding it to your LINQPad query is very simple with the new beta. Press F4 to bring up Query Properties, click Add NuGet, find the HTML Agility Pack in the list and click Add To Query.

Now we are ready to load the document and find all the tables with the class of “fixtures”. An XPath expression with an attribute predicate does this in one step:

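var web = new HtmlAgilityPack.HtmlWeb();
var doc = web.Load("http://www.arsenal.com/fixtures/fixtures-reports?season=2009-2010&x=11&y=15");
// SelectNodes returns null, not an empty collection, when nothing matches
var fixturesTables = doc.DocumentNode.SelectNodes("//table[@class='fixtures']")
    ?? Enumerable.Empty<HtmlAgilityPack.HtmlNode>();
foreach(var fixturesTable in fixturesTables)
{
    // ...
}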
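
If you want to experiment with the XPath without hitting the live site, something like the following sketch may help. It loads a made-up HTML string (not the real page) and compares an exact `@class` test with a token-style match. The `XPathClassDemo` class and `CountMatches` helper are names I have invented for illustration; `HtmlDocument.LoadHtml` and `SelectNodes` are the HTML Agility Pack calls. It is written as a plain console program; in LINQPad you could paste the body of `Main` into a query and swap `Console.WriteLine` for `Dump`.

```csharp
using System;
using HtmlAgilityPack;

public static class XPathClassDemo
{
    // Hypothetical helper: counts the nodes matched by an XPath query.
    // SelectNodes returns null when nothing matches, so that case is mapped to 0.
    public static int CountMatches(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        var nodes = doc.DocumentNode.SelectNodes(xpath);
        return nodes == null ? 0 : nodes.Count;
    }

    public static void Main()
    {
        // Made-up sample markup for illustration only.
        const string html =
            "<table class='fixtures'><tr><td>a</td></tr></table>" +
            "<table class='fixtures wide'><tr><td>b</td></tr></table>" +
            "<table class='other'><tr><td>c</td></tr></table>";

        // Exact string comparison: class='fixtures wide' is not matched.
        Console.WriteLine(CountMatches(html, "//table[@class='fixtures']"));

        // Token-style match: treats the class attribute as a space-separated list.
        Console.WriteLine(CountMatches(html,
            "//table[contains(concat(' ', normalize-space(@class), ' '), ' fixtures ')]"));

        // Nothing matches here, so the null result is reported as 0.
        Console.WriteLine(CountMatches(html, "//table[@class='missing']"));
    }
}
```

The token-style expression is longer, but it keeps working if the site adds a second class to the table or stray whitespace to the attribute.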

Having got each fixture table, I then ignore the top row (which has a class of “fixture-header”), and use the classes on each of the table cells to pull out the information I am interested in. Note that a test such as @class='first ' is an exact string comparison, so the trailing space in the page's markup has to be reproduced. Finally, I use the handy Dump extension method in LINQPad to output my information to the results window:

// This loop goes inside the foreach over the fixture tables above.
// "tr" matches direct children only; use ".//tr" if the raw HTML has an explicit <tbody>
// (Firebug shows one even when the page source does not contain it).
foreach(var fixture in fixturesTable.SelectNodes("tr"))
{
    // header rows have a class of fixture-header
    var fixtureClass = fixture.GetAttributeValue("class", "");
    if(!fixtureClass.Contains("fixture-header"))
    {
        var day = fixture.SelectSingleNode("td[@class='first ']").InnerText.Trim();
        var month = fixture.SelectSingleNode("td[@class='month']").InnerText.Trim();
        var venue = fixture.SelectSingleNode("td[@class='home-away']").InnerText.Trim();
        // GetAttributeValue falls back to "" for cells without a class attribute
        var oppositionNode = fixture.SelectNodes("td").FirstOrDefault(n => n.GetAttributeValue("class", "").Contains("opposition"));
        var oppositionLink = oppositionNode.SelectSingleNode("a");
        var opposition = oppositionLink.InnerText.Trim();
        var matchReportUrl = oppositionLink.Attributes["href"].Value.Trim();
        var goalsFor = fixture.SelectSingleNode("td[@class='goals-for']").InnerText.Trim();
        var goalsAgainst = fixture.SelectSingleNode("td[@class='goals-against']").InnerText.Trim();
        string.Format("{0} {1} {2} {3} {4}-{5}", day, month, venue, opposition, goalsFor, goalsAgainst).Dump();
    }
}

This gives me the data I am after, and from here it is easy to convert it into any other format I want, such as XML, or to insert it into a database (something that LINQPad also makes very easy).

04 Sun H Blackburn Rovers 6-2
17 Sat H Birmingham City 3-1
20 Tue A AZ Alkmaar 1-1
25 Sun A West Ham United 2-2
28 Wed H Liverpool 2-1
31 Sat H Tottenham Hotspur 3-0
04 Wed H AZ Alkmaar 4-1
07 Sat A Wolverhampton W. 4-1
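
As one illustration of the conversion to XML mentioned above, here is a rough sketch using LINQ to XML. The `Fixture` class, the `FixtureXml.ToXml` helper and the element and attribute names are all my own assumptions rather than anything from the original query. The sample rows are copied from the output above; in a real query you would fill the list inside the scraping loop instead.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical container for one scraped row; property names are illustrative.
public class Fixture
{
    public string Day { get; set; }
    public string Weekday { get; set; }
    public string Venue { get; set; }
    public string Opposition { get; set; }
    public string GoalsFor { get; set; }
    public string GoalsAgainst { get; set; }
}

public static class FixtureXml
{
    // Builds a <fixtures> element with one <fixture> child per row.
    public static XElement ToXml(IEnumerable<Fixture> fixtures)
    {
        return new XElement("fixtures",
            fixtures.Select(f => new XElement("fixture",
                new XAttribute("day", f.Day),
                new XAttribute("weekday", f.Weekday),
                new XAttribute("venue", f.Venue),
                new XElement("opposition", f.Opposition),
                new XElement("goalsFor", f.GoalsFor),
                new XElement("goalsAgainst", f.GoalsAgainst))));
    }

    public static void Main()
    {
        // Two rows taken from the results shown above.
        var fixtures = new List<Fixture>
        {
            new Fixture { Day = "04", Weekday = "Wed", Venue = "H", Opposition = "AZ Alkmaar", GoalsFor = "4", GoalsAgainst = "1" },
            new Fixture { Day = "07", Weekday = "Sat", Venue = "A", Opposition = "Wolverhampton W.", GoalsFor = "4", GoalsAgainst = "1" },
        };

        // XElement.ToString() produces indented XML.
        Console.WriteLine(ToXml(fixtures));
    }
}
```

In LINQPad you could call Dump on the XElement instead of writing it to the console.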

And the really nice thing about using LINQPad for this rather than creating a Visual Studio project is that the entire thing is stored in a single compact .linq file, without all the extraneous noise of .sln, .csproj and AssemblyInfo files.

2 comments:

Anonymous said...

Hey, can you help me with your example? I'm trying to repeat your project, but I have a problem with:

var doc = web.Load

Error 1 'HtmlAgilityPack.HtmlWeb' does not contain a definition for 'Load' and no extension method 'Load' accepting a first argument of type 'HtmlAgilityPack.HtmlWeb' could be found (are you missing a using directive or an assembly reference?)


Can you help me with this?

Unknown said...

It ought to work. Are you sure you referenced the latest HTML Agility Pack?