Contributing to Scrapers

Scrapers are at the core of what Open States does. Each state requires several custom scrapers designed to extract bills, legislators, committees, and votes from the state website. Altogether there are around 200 scrapers, each one essentially independent, which means there is always more work to do, but fortunately plenty of prior work to learn from.

Checking out

Fork and clone the main scraper repository:

  • Visit https://github.com/openstates/openstates-scrapers and click the ‘Fork’ button.

  • Clone your fork using your tool of choice or the command line:

    $ git clone git@github.com:yourname/openstates-scrapers.git
    Cloning into 'openstates-scrapers'...
    
  • And remember to install pre-commit:

    $ pre-commit install
    pre-commit installed at .git/hooks/pre-commit
    

Warning

Before cloning on a Windows computer, you will need to disable Git’s line-ending conversion: git config --global core.autocrlf false. After cloning and entering the repo, you’ll likely want to set global line-ending conversion back to true, and set local (repo-only) conversion to false, as in the sketch below.
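A concrete sketch of that sequence, using standard git commands (the repo URL comes from the clone step above):

$ git config --global core.autocrlf false
$ git clone git@github.com:yourname/openstates-scrapers.git
$ cd openstates-scrapers
$ git config --global core.autocrlf true
$ git config core.autocrlf false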

Repository overview

At this point you’ll have a local openstates-scrapers directory. Within it, you’ll find a directory called scrapers; let’s take a look at it:

$ ls scrapers
__init__.py dc          in          mn          nj          pr          va
ak          de          ks          mo          nm          ri          vi
al          fl          ky          ms          nv          sc          vt
ar          ga          la          mt          ny          sd          wa
az          hi          ma          nc          oh          tn          wi
ca          ia          md          nd          ok          tx          wv
co          id          me          ne          or          ut          wy
ct          il          mi          nh          pa          utils

This directory has 50+ Python modules: one per jurisdiction (states, DC, and territories), plus a shared utils module.

Let’s look inside one:

$ ls scrapers/nc
__init__.py    bills.py       committees.py  people.py      votes.py

Some states’ directories will differ a bit, but all will have __init__.py and bills.py.

The __init__.py file for each state contains basic metadata about the state, including a list of legislative sessions.
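As a rough, illustrative sketch, a state’s __init__.py looks something like the block below. Treat every name here as an assumption: the actual base class, import paths, scraper class, and metadata fields come from the Open States scraping framework, so check a real state module for the true shape.

# Illustrative sketch only -- all names below are assumptions.
from openstates.scrape import State  # assumed import path

from .bills import NCBillScraper  # assumed scraper class name


class NorthCarolina(State):
    # maps the names used on the command line to scraper classes
    scrapers = {
        "bills": NCBillScraper,
    }
    # basic metadata, one entry per legislative session
    legislative_sessions = [
        {
            "identifier": "2017",
            "name": "2017-2018 Session",
            "start_date": "2017-01-11",
        },
    ]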

Other files contain the scrapers themselves, typically named bills.py, votes.py, etc.

Running Your First Scraper

Let’s run your state’s bills scraper (substitute your state for ‘nc’ below):

$ docker-compose run --rm scrape nc bills --fastmode --scrape

The parameters you pass after docker-compose run --rm scrape are handed straight to os-update. Here we’re saying that we want to run NC’s bills scraper in “fast mode”. By default, os-update also imports the results into a Postgres database; the --scrape flag skips that step and only scrapes.

You’ll see the run plan, which summarizes what the update will attempt to capture; in this case we’re scraping the state website’s data into JSON files:

nc (scrape)
  bills: {}

Then legislative posts and organizations get created, which is mostly boilerplate:

08:46:35 INFO openstates: save jurisdiction North Carolina as jurisdiction_ocd-jurisdiction-country:us-state:nc-government.json
08:46:35 INFO openstates: save organization North Carolina General Assembly as organization_01d6327c-72d2-11e7-8df8-0242ac130003.json
08:46:35 INFO openstates: save organization Executive Office of the Governor as organization_01d63560-72d2-11e7-8df8-0242ac130003.json
08:46:35 INFO openstates: save organization Senate as organization_01d636e6-72d2-11e7-8df8-0242ac130003.json
08:46:35 INFO openstates: save post 1 as post_01d63a06-72d2-11e7-8df8-0242ac130003.json
08:46:35 INFO openstates: save post 2 as post_01d63b96-72d2-11e7-8df8-0242ac130003.json
08:46:35 INFO openstates: save post 3 as post_01d63cea-72d2-11e7-8df8-0242ac130003.json
08:46:35 INFO openstates: save post 4 as post_01d63e34-72d2-11e7-8df8-0242ac130003.json
08:46:35 INFO openstates: save post 5 as post_01d63f74-72d2-11e7-8df8-0242ac130003.json

And then the actual data scraping begins, defaulting to the most recent legislative session:

08:46:36 INFO openstates: no session specified, using 2017
08:46:36 INFO scrapelib: GET - http://www.ncga.state.nc.us/gascripts/SimpleBillInquiry/displaybills.pl?Session=2017&tab=Chamber&Chamber=Senate
08:46:38 INFO scrapelib: GET - http://www.ncga.state.nc.us/gascripts/BillLookUp/BillLookUp.pl?Session=2017&BillID=S1
08:46:39 INFO openstates: save bill SR 1 in 2017 as bill_03c7edb4-72d2-11e7-8df8-0242ac130003.json
08:46:39 INFO scrapelib: GET - http://www.ncga.state.nc.us/gascripts/BillLookUp/BillLookUp.pl?Session=2017&BillID=S2
08:46:39 INFO openstates: save bill SJR 2 in 2017 as bill_044a5fc4-72d2-11e7-8df8-0242ac130003.json
08:46:39 INFO scrapelib: GET - http://www.ncga.state.nc.us/gascripts/BillLookUp/BillLookUp.pl?Session=2017&BillID=S3
08:46:40 INFO openstates: save bill SB 3 in 2017 as bill_04e8c66e-72d2-11e7-8df8-0242ac130003.json
08:46:40 INFO scrapelib: GET - http://www.ncga.state.nc.us/gascripts/BillLookUp/BillLookUp.pl?Session=2017&BillID=S4
08:46:41 INFO openstates: save bill SB 4 in 2017 as bill_05781f08-72d2-11e7-8df8-0242ac130003.json
08:46:41 INFO scrapelib: GET - http://www.ncga.state.nc.us/gascripts/BillLookUp/BillLookUp.pl?Session=2017&BillID=S5

This part can take a while: some scrapers run for hours, depending on the number of bills and the speed of the state’s website.
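Note the “no session specified, using 2017” line above: scrapers default to the most recent session, but you can usually target an older one. A hedged sketch, assuming os-update accepts key=value scraper arguments as pupa did (check os-update’s help output for the exact syntax):

$ docker-compose run --rm scrape nc bills --fastmode --scrape session=2017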

Note

It is often fine to bail out of the scrape (Ctrl-C) after it has gathered a bit of data, rather than letting the entire run finish.

To review the data you just fetched, you can browse the _data/nc/ directory and inspect the JSON files. If you’re trying to make a small fix this is often sufficient: you can confirm that the scraped data looks correct and move on.
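For example, you can pretty-print one of the scraped bill files with Python’s built-in json.tool (the filename here is taken from the log output above; yours will differ):

$ python -m json.tool _data/nc/bill_03c7edb4-72d2-11e7-8df8-0242ac130003.json | head -n 20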

Note

It is of course possible that the scrape fails. If so, there’s a good chance that isn’t your fault, especially if it starts to run and then errors out. Scrapers do break, and there’s no guarantee North Carolina didn’t change their bill pages yesterday, breaking our tutorial here.

If that’s the case and you think the issue is with the scraper, feel free to get in touch with us or file an issue.

At this point you’re ready to run scrapers and contribute fixes. Hop onto our GitHub ticket queue, pick an issue to solve, and then submit a Pull Request!

Importing Data

Optionally, if you’d like to see how your scraped data imports into the database (perhaps to diagnose an issue that happens after the scrape), pop over to getting a working database to see how to set up a local database that you can import data into.

Once that’s done, make sure that the db image from openstates.org is running:

$ docker ps
CONTAINER ID        IMAGE                       COMMAND                  CREATED             STATUS              PORTS                    NAMES
27fe691ad7c5        mdillon/postgis:11-alpine   "docker-entrypoint.s…"   3 hours ago         Up 3 hours          0.0.0.0:5405->5432/tcp   openstatesorg_db_1

Your output will vary, but if you don’t see something named openstatesorg_db running, you should run this command (from the openstates.org directory, not your scraper directory):

$ docker-compose up -d db

Now, when you want to run imports, you can drop the --scrape portion of the command you’ve been running. Or, if you just want to test the import of already-scraped data, you can replace it with --import.
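To summarize the three modes, using only the flags described above (substitute your own state and scraper):

$ docker-compose run --rm scrape nc bills --fastmode --scrape   # scrape only
$ docker-compose run --rm scrape nc bills --fastmode            # scrape, then import
$ docker-compose run --rm scrape nc bills --fastmode --import   # import already-scraped data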

An import looks something like this:

$ docker-compose run --rm scrape fl bills --fastmode
... (truncated) ...
23:03:34 ERROR openstates: cannot resolve pseudo id to Person: ~{"name": "Grant, M."}
23:03:36 ERROR openstates: cannot resolve pseudo id to Person: ~{"name": "Rodrigues, R."}
fl (import)
  bills: {}
import:
  bill: 0 new 0 updated 2620 noop
  jurisdiction: 0 new 0 updated 1 noop
  vote_event: 21 new 12 updated 533 noop

The errors about unresolved pseudo-ids can safely be ignored; as long as you see the final run report, the data you scraped is available in your database.

The run report shows how many objects of each type were created and updated, as well as the number of items that already exactly matched what was in the database (“noop”). These stats are useful for spot checking whether your local changes to a scraper have the impact you expect.
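If you’d like to spot check directly in the database, you can connect with psql. A minimal sketch, assuming the 5405 port mapping shown in the docker ps output above and that the database and user are both named openstates (check openstates.org’s docker-compose.yml for the real credentials; the table name is an assumption based on the opencivicdata schema):

$ psql -h localhost -p 5405 -U openstates openstates
openstates=# SELECT COUNT(*) FROM opencivicdata_bill;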