Testing Scrapers

One of the first things people new to the project tend to notice is that there aren’t a lot of tests in the scrapers.

Over the years we’ve evolved a de facto policy of somewhat discouraging tests, which is definitely an unusual stance to take and warrants explanation.

Intentionally Fragile Scrapers

When it comes to scrapers, there are two major types of breakage:

  1. the scraper collects bad information and inserts it into the database
  2. the scraper encounters an error and quits without importing data

Given a choice, the second is greatly preferable. Once bad data makes it into the database it can be difficult to detect and remove. On the other hand, the second can be set up to alert us immediately so that someone can evaluate the proper fix.

The best way to favor the second over the first is to write “intentionally fragile” scrapers. That is, scrapers that raise an exception when they see unexpected input.

While it is possible to write a resilient scraper that recovers from unexpected input, such scrapers are by nature more likely to produce the first kind of breakage, so we encourage scraper writers to be conservative about which errors they suppress.

Here’s an example of an overly permissive scraper:

# xpath() returns a list, so take the first text node
party_abbr = doc.xpath('//span[@class="partyabbr"]/text()')[0].strip()
if party_abbr == 'D':
    party = 'Democratic'
elif party_abbr == 'R':
    party = 'Republican'
else:
    # haven't seen this yet, but let's just keep things moving
    party = party_abbr

The following would be preferred:

# xpath() returns a list, so take the first text node
party_abbr = doc.xpath('//span[@class="partyabbr"]/text()')[0].strip()
party = {'D': 'Democratic', 'R': 'Republican'}[party_abbr]

This code would raise a KeyError the first time a new party is found. This forces someone to take a look and add an entry for the new party, after which the scraper can run again with correct data.

Testing Scrapers Is Hard

On most software projects, a failing test means that something is broken, and passing tests should mean that things are working just fine.

In our experience, however, the majority of the “breaks” that occur in scrapers are due to upstream site changes.

In the past the fragile nature of scrapers has led to people writing a lot of bad tests, which is where our stance of somewhat discouraging tests has come from. An example of a bad test:

import lxml.html


def extract_name(doc):
    # xpath() returns a list of matching elements; take the first one
    return doc.xpath('//h2[@class="legislatorName"]')[0].text_content().strip()


def test_extract_name():
    # probably a snapshot of the page at some point in time
    EXAMPLE_LEGISLATOR_HTML = '...'

    doc = lxml.html.fromstring(EXAMPLE_LEGISLATOR_HTML)
    assert extract_name(doc) == 'Erica Example'

With a test like this:
  • As soon as the HTML changes, the scraper will start failing, but the tests will still pass.
  • The scraper will then be updated, breaking the test.
  • The test HTML will be updated, fixing the test.

Because the initial scraper breakage is never predicted by a failing test, this type of test serves no real purpose. It just results in extra code to maintain every time the scraper needs a slight change.

Other Strategies

Of course, this isn’t to say that we should just abandon the idea of testing altogether.

If you’re more comfortable writing tests, by all means, write a test. For example, if you’re parsing a particularly nasty PDF and want to run your code against some test data, a test might make sense there as a way to be confident in your own code.

We also have some other strategies to help ensure data quality:

Validate Scraper Output

Scraper output is verified against JSON schemas that protect against common regressions (missing sources, invalidly formatted districts, etc.). Most of these checks can be done effectively across the board on scraper output, so they apply universally across all 50 states.
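To make the idea concrete, here is a minimal sketch of output validation using only the standard library. It is a simplified stand-in for real schema validation, not the project's actual schemas: the field names (full_name, district, sources), the district pattern, and both function names are assumptions made up for illustration.

```python
import re

# Hypothetical rule: districts are non-empty and made of letters, digits,
# spaces, or hyphens. The real schemas may be stricter or looser.
DISTRICT_RE = re.compile(r"^[A-Za-z0-9 \-]+$")


def validate_legislator(record):
    """Return a list of problems found in one scraped record."""
    problems = []
    if not (record.get("full_name") or "").strip():
        problems.append("missing full_name")
    if not record.get("sources"):
        problems.append("missing sources")
    district = str(record.get("district") or "")
    if not DISTRICT_RE.match(district):
        problems.append("invalid district: %r" % district)
    return problems


def validate_all(records):
    """Raise ValueError if any record fails, so bad data is never imported."""
    errors = {}
    for index, record in enumerate(records):
        problems = validate_legislator(record)
        if problems:
            errors[index] = problems
    if errors:
        raise ValueError("validation failed: %r" % errors)
```

Because the checks only look at the shape of the output, the same function can run over every scraper's results without knowing anything about the site the data came from.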

We also aim for our underlying libraries like billy to be as well-tested as possible. (To be 100% clear, our lax testing philosophy only applies to site-specific scraper code, not these support libraries.)

Run Scrapers Regularly

In a sense, the scrapers are tested every night by being run. This is why the intentionally fragile approach is so important: those failures are in essence the same as integration test failures. Of course, this doesn’t tell us whether a scraper is picking up bad data, but combined with validation we can be fairly confident in our data.
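As a rough illustration of treating nightly runs as integration tests, the sketch below runs a set of scrapers and reports any that raise. The run_all function, the shape of the scrapers mapping, and the save callable are all hypothetical; the real nightly tooling is not described here.

```python
import traceback


def run_all(scrapers, save):
    """Run each scraper and import its data only if the scrape succeeds.

    `scrapers` maps a name to a zero-argument callable returning records,
    and `save` is whatever imports those records (both hypothetical).
    Returns a dict of scraper name -> traceback for every failure.
    """
    failures = {}
    for name, scrape in sorted(scrapers.items()):
        try:
            records = list(scrape())
        except Exception:
            # Treat any exception like a failed integration test:
            # record it for a human to review and import nothing.
            failures[name] = traceback.format_exc()
            continue
        save(name, records)
    return failures
```

An intentionally fragile scraper that hits an unknown party abbreviation would show up in the returned failures with its KeyError traceback, and none of its data would be saved.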

Test Utilities

One area we can definitely improve upon is our use of (and then thorough testing of) common functions. Right now (largely because of the great variety of authors, etc.) many scrapers do similar things like conversion of party abbreviations and whitespace normalization in slightly different ways. We should be making a push to use common utility functions and thoroughly test those.
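A sketch of what such shared helpers and their tests might look like follows. The function names and the party table are hypothetical and only illustrate the idea; unlike site-specific scraper code, helpers like these have stable inputs, so tests for them stay meaningful.

```python
# Hypothetical shared table; a real one would cover more parties.
PARTIES = {"D": "Democratic", "R": "Republican"}


def normalize_whitespace(text):
    """Collapse every run of whitespace to a single space and trim the ends."""
    return " ".join(text.split())


def party_name(abbr):
    """Map a party abbreviation to its full name.

    Stays intentionally fragile: an unknown abbreviation raises KeyError.
    """
    return PARTIES[abbr.strip()]


def test_normalize_whitespace():
    assert normalize_whitespace("  Erica \n\t Example ") == "Erica Example"


def test_party_name():
    assert party_name(" D ") == "Democratic"
    try:
        party_name("X")
    except KeyError:
        pass
    else:
        raise AssertionError("unknown party should raise KeyError")
```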