Web Scraping with PowerShell

Web scraping can be a touchy subject when it comes to legality. Scraping copyrighted content and passing it off as your own clearly crosses the line, but scraping can also be very useful when done within legal limits.

A few months back, I was shopping for a new car and found myself checking Craigslist multiple times a day with the same search criteria. I got tired of filling out the “Filter” form over and over again and thought to myself, “I bet I could script this.” Since Craigslist listings are already public, and I was only scraping them for my own personal use, I felt it was fine.

First, I defined a few variables and created two arrays to hold my data. I then looped through the number of pages set in $MaxPages, each time invoking a new web request and adding the resulting HTML to my $Pages array.

$MaxPages = 2
$MinPrice = 2500
$MaxPrice = 10000
$Cars = @()
$Pages = @()

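# Request each results page; the s offset advances by 100 for each page
For ($CurrentPage=0; $CurrentPage -lt $MaxPages; $CurrentPage++) {
    $WebPage = Invoke-WebRequest "http://eastidaho.craigslist.org/search/cta?hasPic=1&min_price=$MinPrice&max_price=$MaxPrice&s=$($CurrentPage * 100)"
    # ParsedHtml is only available in Windows PowerShell (it relies on the Internet Explorer engine)
    $Pages += $WebPage.ParsedHtml.body.innerHTML
}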
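To see exactly which URLs the loop requests, the URL construction can be pulled out into a small helper. This is only a sketch: the function name Get-SearchUrl and its PageSize parameter are hypothetical, and the default of 100 simply mirrors the $CurrentPage * 100 offset used above.

```powershell
# Hypothetical helper (not part of the original script): builds the search URL
# for a zero-based page index, using the same query parameters as the loop above.
function Get-SearchUrl {
    param(
        [int]$Page,
        [int]$MinPrice,
        [int]$MaxPrice,
        [int]$PageSize = 100   # assumed step between pages, matching the offset above
    )
    $Offset = $Page * $PageSize
    "http://eastidaho.craigslist.org/search/cta?hasPic=1&min_price=$MinPrice&max_price=$MaxPrice&s=$Offset"
}

# Preview the URLs for the first two pages without sending any requests
$Urls = 0..1 | ForEach-Object { Get-SearchUrl -Page $_ -MinPrice 2500 -MaxPrice 10000 }
$Urls
```

Printing the URLs first is a cheap way to check the query string in a browser before pointing Invoke-WebRequest at it.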

My $Pages array now contains two pages of results (I could have told it to loop through any number of pages). Before we can extract any data from these pages, we need to “explode” them, or break them up into individual results. Inspecting the markup in my browser, I found that each car posting begins with an <li> element that has the result-row CSS class.

We can then split the long result set on that opening tag and store the pieces in a new $Results array.

$Results = $Pages -split "<li class=`"result-row`""
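To get a feel for what -split returns, here is a minimal sketch run against a made-up string. The sample markup is an assumption for illustration and is not real Craigslist output.

```powershell
# Made-up sample markup: a list wrapper followed by two result rows
$Sample = '<ul class="rows"><li class="result-row" data-pid="1">A</li><li class="result-row" data-pid="2">B</li></ul>'

# -split treats the right-hand side as a regular expression and drops the delimiter itself
$Pieces = $Sample -split '<li class="result-row"'

$Pieces.Count   # 3: the markup before the first row, then one piece per row
$Pieces[0]      # <ul class="rows">

# The first piece is whatever came before the first row, so it can be skipped
$Rows = $Pieces | Select-Object -Skip 1
$Rows.Count     # 2
```

Note that the first element is never a listing. It holds everything that came before the first result row, which is worth keeping in mind when looping over the pieces.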

Now, here comes the fun part. We loop through all of the items in that array and run each one through a set of regular expressions, each looking for an individual piece of data. When -match finds a match, PowerShell stores it in the automatic $matches variable, which we can then assign to a custom PowerShell object.

In this case, I collected the Craigslist listing ID, description, price, date posted, and location for each car, but the same approach extends to any other field you can write a pattern for.

Finally, once our custom object has been populated, we append it to our $Cars array (which was declared outside of this loop, so we don’t wipe out our results on every iteration).

ForEach ($Item in $Results) {

    # clear the previous item's values so a failed match doesn't reuse stale data
    $ID = $Description = $Price = $DatePosted = $Location = $null

    # search for a 10-digit number
    if ($Item -match '\d{10}') {
        $ID = $matches[0]
    }

    # search for the text inside an anchor tag whose 'data-id' attribute holds a 10-digit number
    if ($Item -match 'data-id="\d{10}">(.*?)</a>') {
        # capture group 1 holds only the text between the tags
        $Description = $matches[1]
    }

    # look for 3 to 6 digits that are preceded by a dollar sign
    if ($Item -match '\$\d{3,6}') {
        $Price = $matches[0]
    }

    # look for a specific date-time format
    if ($Item -match '\d{4}-\d{2}-\d{2} \d{2}:\d{2}') {
        $DatePosted = $matches[0]
    }

    # look for a single word wrapped in parentheses, then remove the parentheses
    if ($Item -match '\(\w+\)') {
        $Location = $matches[0]
        $Location = ($Location -replace "[()]")
    }

    # insert our stripped data into a custom PowerShell object
    $ItemObject = New-Object -TypeName PSObject -Property @{
        'ID' = $ID
        'Description' = $Description
        'Price' = $Price
        'DatePosted' = $DatePosted
        'Location' = $Location
    }
    $Cars += $ItemObject
}
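The matching step can be tried in isolation on a single fragment. The sketch below uses a made-up listing string (real Craigslist markup may look different) and reads the values from capture groups in $Matches rather than trimming the full match afterwards.

```powershell
# Made-up listing fragment for illustration; real Craigslist markup may differ
$Item = ' data-pid="1234567890"><a href="/cto/1234567890.html" data-id="1234567890">2009 Honda Civic</a> <span class="price">$4500</span> <time datetime="2017-03-01 14:05">Mar 1</time> (Rexburg)</li>'

# $Matches[0] is the whole match; $Matches[1], $Matches[2], ... are the capture groups
if ($Item -match 'data-id="(\d{10})">(.*?)</a>') {
    $ID = $Matches[1]
    $Description = $Matches[2]
}

# Capture only the digits so the price can be converted to a number
if ($Item -match '\$(\d{3,6})') {
    $Price = [int]$Matches[1]
}

"$ID | $Description | $Price"
```

Because -match overwrites $Matches on every successful match, each value has to be copied out before the next pattern runs.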

Once we’ve looped through all the results, we can display them by piping our array to Format-Table. Because I’m picky, I wanted the columns displayed in a specific order, which I listed explicitly.

$Cars | Format-Table ID,Description,Price,Location,DatePosted
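The explicit column list matters because a plain hashtable passed to New-Object does not guarantee property order. As an alternative sketch (a different technique from the one used in this post), casting a hashtable literal to [PSCustomObject] keeps the properties in the order they were written. The sample values are made up.

```powershell
# Sample values are made up for illustration
$Car = [PSCustomObject]@{
    ID          = '1234567890'
    Description = '2009 Honda Civic'
    Price       = '$4500'
    Location    = 'Rexburg'
    DatePosted  = '2017-03-01 14:05'
}

# Property names come back in the order they were written
$Names = $Car.PSObject.Properties.Name
$Names -join ','

# Select-Object can still reorder or trim columns before display or export
$Car | Select-Object Description, Price
```

With objects built this way, $Cars | Format-Table on its own would already show the columns in the declared order.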

Running all of the above code, we get the following results:

Pretty cool, eh? Again, this can be extended to cover other search criteria by modifying the query string in the first step and adding more regular expressions.
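As one example of an additional regular expression, here is a rough sketch that pulls a model year out of each description and filters on it. The pattern, the Year property, and the sample descriptions are all assumptions for illustration and are not part of the script in this post.

```powershell
# Made-up descriptions for illustration
$Descriptions = '2009 Honda Civic', 'Ford F-150 runs great', '1998 Toyota Camry'

$Parsed = foreach ($Description in $Descriptions) {
    # Hypothetical extra pattern: a standalone four-digit year starting with 19 or 20
    $Year = if ($Description -match '\b(?:19|20)\d{2}\b') { [int]$Matches[0] } else { $null }
    [PSCustomObject]@{ Description = $Description; Year = $Year }
}

# Keep only listings from 2005 or newer; listings without a year are dropped
$Recent = $Parsed | Where-Object { $_.Year -ge 2005 }
$Recent
```

Filtering on the parsed objects like this complements the query string: the query string narrows what Craigslist returns, and Where-Object narrows what you keep.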

And if all of that was too difficult to follow, here’s the complete code block:

$MaxPages = 2
$MinPrice = 2500
$MaxPrice = 10000
$Cars = @()
$Pages = @()

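# Request each results page; the s offset advances by 100 for each page
For ($CurrentPage=0; $CurrentPage -lt $MaxPages; $CurrentPage++) {
    $WebPage = Invoke-WebRequest "http://eastidaho.craigslist.org/search/cta?hasPic=1&min_price=$MinPrice&max_price=$MaxPrice&s=$($CurrentPage * 100)"
    # ParsedHtml is only available in Windows PowerShell (it relies on the Internet Explorer engine)
    $Pages += $WebPage.ParsedHtml.body.innerHTML
}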

$Results = $Pages -split "<li class=`"result-row`""

ForEach ($Item in $Results) {

    # clear the previous item's values so a failed match doesn't reuse stale data
    $ID = $Description = $Price = $DatePosted = $Location = $null

    # search for a 10-digit number
    if ($Item -match '\d{10}') {
        $ID = $matches[0]
    }

    # search for the text inside an anchor tag whose 'data-id' attribute holds a 10-digit number
    if ($Item -match 'data-id="\d{10}">(.*?)</a>') {
        # capture group 1 holds only the text between the tags
        $Description = $matches[1]
    }

    # look for 3 to 6 digits that are preceded by a dollar sign
    if ($Item -match '\$\d{3,6}') {
        $Price = $matches[0]
    }

    # look for a specific date-time format
    if ($Item -match '\d{4}-\d{2}-\d{2} \d{2}:\d{2}') {
        $DatePosted = $matches[0]
    }

    # look for a single word wrapped in parentheses, then remove the parentheses
    if ($Item -match '\(\w+\)') {
        $Location = $matches[0]
        $Location = ($Location -replace "[()]")
    }

    # insert our stripped data into a custom PowerShell object
    $ItemObject = New-Object -TypeName PSObject -Property @{
        'ID' = $ID
        'Description' = $Description
        'Price' = $Price
        'DatePosted' = $DatePosted
        'Location' = $Location
    }
    $Cars += $ItemObject
}

$Cars | Format-Table ID,Description,Price,Location,DatePosted
