Contributing to Scrapers¶
Scrapers are at the core of what Open States does. Each state requires several custom scrapers designed to extract bills, legislators, committees, and votes from the state website. All together there are around 200 scrapers, each one essentially independent, which means there is always more work to do, but fortunately there is plenty of prior work to learn from.
Checking Out¶
Fork and clone the main scraper repository:
- Visit https://github.com/openstates/openstates-scrapers and click the 'Fork' button.
- Clone your fork using your tool of choice or the command line:
$ git clone git@github.com:yourname/openstates-scrapers.git
Cloning into 'openstates-scrapers'...
- Remember to install pre-commit hooks:
$ pre-commit install
pre-commit installed at .git/hooks/pre-commit
- Be sure to run poetry install to fetch the correct versions of the dependencies.
Warning
Before cloning on a Windows computer, you will need to disable line-ending conversion:
$ git config --global core.autocrlf false
After cloning and entering the repo, you'll likely want to set global line-ending conversion back to true, and set local conversion to false.
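Those follow-up commands would look something like this (a minimal sketch; adjust to your own global settings):
$ git config --global core.autocrlf true
$ git config core.autocrlf false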
Repository Overview¶
At this point you'll have a local openstates-scrapers directory. Within it, you'll find a directory called scrapers; let's take a look at it:
$ ls scrapers
__init__.py dc in mn nj pr va
ak de ks mo nm ri vi
al fl ky ms nv sc vt
ar ga la mt ny sd wa
az hi ma nc oh tn wi
ca ia md nd ok tx wv
co id me ne or ut wy
ct il mi nh pa utils
This directory has 50+ Python packages, one for each state (plus DC and the territories).
Let's look inside one:
$ ls scrapers/nc
__init__.py bills.py votes.py
Some states' directories will differ a bit, but all will have __init__.py and bills.py.
The __init__.py file for each state has basic metadata on the state, including a list of sessions. Other files contain the scrapers, typically named bills.py, votes.py, etc.
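To make that concrete, here is a rough sketch of the shape of a state's __init__.py, assuming the current openstates.scrape API; the class name, scraper, and session fields below are illustrative, so check a real state module for the authoritative structure:
# scrapers/nc/__init__.py -- illustrative sketch, not the real file
from openstates.scrape import State

from .bills import NCBillScraper  # defined in bills.py


class NorthCarolina(State):
    # scrapers this state exposes to os-update, keyed by name
    scrapers = {
        "bills": NCBillScraper,
    }
    # basic session metadata; os-update defaults to the most recent session
    legislative_sessions = [
        {
            "identifier": "2017",
            "name": "2017-2018 Session",
            "start_date": "2017-01-11",
        },
    ]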
At the root, you'll also find a directory called scrapers_next. This directory also has Python modules for each state. Inside a state, you'll find people and potentially committee scrapers written using spatula. The plan is to port all scrapers to this framework and have scrapers_next replace the scrapers directory.
Running Your First Scraper¶
Let's run your state's bills scraper (substitute your state for 'nc' below):
$ docker-compose run --rm scrape nc bills --fastmode --scrape
The parameters you pass after docker-compose run --rm scrape are passed to os-update. Here we're saying that we're running NC's scrapers, and that we want to do it in "fast mode". By default, os-update imports results into a postgres database; the --scrape flag skips that step.
The following arguments are optional; to bring up this list in the CLI, use -h or --help:
-h, --help                                   show this help message and exit
--debug                                      open debugger on error
--loglevel {LOGLEVEL}                        set log level; options are DEBUG|INFO|WARNING|ERROR|CRITICAL (default is INFO)
--scrape                                     only run the scrape step
--import                                     only run the import step
--nonstrict                                  skip validation on save
--fastmode                                   use cache and turn off throttling
--datadir {SCRAPED_DATA_DIR}                 data directory
--cachedir {CACHE_DIR}                       cache directory
-r, --rpm {SCRAPELIB_RPM}                    scraper requests per minute
--timeout {SCRAPELIB_TIMEOUT}                scraper timeout
--no-verify                                  skip TLS verification
--retries {SCRAPELIB_RETRIES}                scraper retries
--retry_wait {SCRAPELIB_RETRY_WAIT_SECONDS}  scraper retry wait
--realtime                                   load bills into the database in realtime; requires configuring an AWS S3 bucket and the openstates-realtime lambda function
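These flags compose; for example, a hypothetical re-run of the NC scrape using the cache and verbose logging would look like:
$ docker-compose run --rm scrape nc bills --scrape --fastmode --loglevel DEBUG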
Back in our NC run, you'll first see the run plan, which is what the update aims to capture; in this case we're scraping the state website's data into JSON files:
nc (scrape)
bills: {}
Then legislative posts and organizations get created, which is mostly boilerplate:
08:46:35 INFO openstates: save jurisdiction North Carolina as jurisdiction_ocd-jurisdiction-country:us-state:nc-government.json
08:46:35 INFO openstates: save organization North Carolina General Assembly as organization_01d6327c-72d2-11e7-8df8-0242ac130003.json
08:46:35 INFO openstates: save organization Executive Office of the Governor as organization_01d63560-72d2-11e7-8df8-0242ac130003.json
08:46:35 INFO openstates: save organization Senate as organization_01d636e6-72d2-11e7-8df8-0242ac130003.json
08:46:35 INFO openstates: save post 1 as post_01d63a06-72d2-11e7-8df8-0242ac130003.json
08:46:35 INFO openstates: save post 2 as post_01d63b96-72d2-11e7-8df8-0242ac130003.json
08:46:35 INFO openstates: save post 3 as post_01d63cea-72d2-11e7-8df8-0242ac130003.json
08:46:35 INFO openstates: save post 4 as post_01d63e34-72d2-11e7-8df8-0242ac130003.json
08:46:35 INFO openstates: save post 5 as post_01d63f74-72d2-11e7-8df8-0242ac130003.json
And then the actual data scraping begins, defaulting to the most recent legislative session:
08:46:36 INFO openstates: no session specified, using 2017
08:46:36 INFO scrapelib: GET - http://www.ncga.state.nc.us/gascripts/SimpleBillInquiry/displaybills.pl?Session=2017&tab=Chamber&Chamber=Senate
08:46:38 INFO scrapelib: GET - http://www.ncga.state.nc.us/gascripts/BillLookUp/BillLookUp.pl?Session=2017&BillID=S1
08:46:39 INFO openstates: save bill SR 1 in 2017 as bill_03c7edb4-72d2-11e7-8df8-0242ac130003.json
08:46:39 INFO scrapelib: GET - http://www.ncga.state.nc.us/gascripts/BillLookUp/BillLookUp.pl?Session=2017&BillID=S2
08:46:39 INFO openstates: save bill SJR 2 in 2017 as bill_044a5fc4-72d2-11e7-8df8-0242ac130003.json
08:46:39 INFO scrapelib: GET - http://www.ncga.state.nc.us/gascripts/BillLookUp/BillLookUp.pl?Session=2017&BillID=S3
08:46:40 INFO openstates: save bill SB 3 in 2017 as bill_04e8c66e-72d2-11e7-8df8-0242ac130003.json
08:46:40 INFO scrapelib: GET - http://www.ncga.state.nc.us/gascripts/BillLookUp/BillLookUp.pl?Session=2017&BillID=S4
08:46:41 INFO openstates: save bill SB 4 in 2017 as bill_05781f08-72d2-11e7-8df8-0242ac130003.json
08:46:41 INFO scrapelib: GET - http://www.ncga.state.nc.us/gascripts/BillLookUp/BillLookUp.pl?Session=2017&BillID=S5
This part can take a while: some scrapers run for hours, depending on the number of bills and the speed of the state's website.
Note
It is often desirable to bail out of a scrape (Ctrl-C) once it has gathered a bit of data, rather than letting the entire run finish.
To review the data you just fetched, you can browse the _data/nc/ directory and inspect the JSON files. If you're trying to make a small fix, this is often sufficient: you can confirm that the scraped data looks correct and move on.
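For a quick spot check, a minimal Python sketch like this works, assuming the NC run above and the standard identifier and title fields in each bill file:
# spot_check.py -- print a count and a few identifiers to eyeball the output
import json
import pathlib

bills = sorted(pathlib.Path("_data/nc").glob("bill_*.json"))
print(f"{len(bills)} bills scraped")
for path in bills[:5]:
    data = json.loads(path.read_text())
    print(data["identifier"], data.get("title", "")[:60])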
Please see our document on Querying Scraper Output Data for tools you can use to investigate data issues across a set of many scraped data output files.
Note
It is of course possible that the scrape fails. If so, there's a good chance that isn't your fault, especially if it starts to run and then errors out. Scrapers do break, and there's no guarantee North Carolina didn't change their legislator page yesterday, breaking our tutorial here.
If that's the case and you think the issue is with the scraper, feel free to get in touch with us or file an issue.
At this point you're ready to run scrapers and contribute fixes. Hop onto our GitHub ticket queue, pick an issue to solve, and then submit a Pull Request!
Importing Data¶
Optionally, if you'd like to see how your scraped data imports into the database, perhaps to diagnose an issue that is happening after the scrape, pop over to getting a working database to see how to get a local database that you can import data into.
Once that's done, make sure that the db image from openstates.org is running:
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
27fe691ad7c5 mdillon/postgis:11-alpine "docker-entrypoint.s…" 3 hours ago Up 3 hours 0.0.0.0:5405->5432/tcp openstatesorg_db_1
Your output will vary, but if you don't see something named openstatesorg_db running, you should run this command (from the openstates.org directory, not your scraper directory):
$ docker-compose up -d db
Now, when you want to run imports, you can drop the --scrape portion of the command you've been running. Or, if you just want to test the import of already scraped data, you can replace it with --import.
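For instance, an import-only run for the NC data scraped earlier (assuming it is still in your data directory) would be:
$ docker-compose run --rm scrape nc bills --import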
A full run, including the import step, looks something like this:
$ docker-compose run --rm scrape fl bills --fastmode
... (truncated) ...
23:03:34 ERROR openstates: cannot resolve pseudo id to Person: ~{"name": "Grant, M."}
23:03:36 ERROR openstates: cannot resolve pseudo id to Person: ~{"name": "Rodrigues, R."}
fl (import)
bills: {}
import:
bill: 0 new 0 updated 2620 noop
jurisdiction: 0 new 0 updated 1 noop
vote_event: 21 new 12 updated 533 noop
The errors about unresolved pseudo-ids can safely be ignored; as long as you see the final run report, the data you scraped is available in your database.
The run report shows the number of objects of each type that were created and updated, as well as the total number of items that already exactly matched what was in the database (noop). These stats are useful for checking whether your local changes to a scraper have the impact you expect.
Running Spatula Scrapers¶
Let's run a people scraper:
$ poetry run spatula scrape scrapers_next.nc.people.SenList
The command to run these scrapers is structured differently: the parameters give the exact location of the class you want to run, as directory.state.file.class.
Note
Class names vary, and legislator scrapes are commonly split by chamber, so make sure you're passing the right class in your command.
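One quick way to see which classes a state's module defines is a generic shell one-liner (not part of the Open States tooling):
$ grep -n "^class " scrapers_next/nc/people.py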
The actual data scraping should look something like:
INFO:scrapers_next.nc.people.SenList:fetching https://www.ncleg.gov/Members/MemberTable/S
INFO:scrapers_next.nc.people.LegDetail:fetching https://www.ncleg.gov/Members/Biography/S/430
INFO:scrapers_next.nc.people.LegDetail:fetching https://www.ncleg.gov/Members/Biography/S/431
INFO:scrapers_next.nc.people.LegDetail:fetching https://www.ncleg.gov/Members/Biography/S/432
INFO:scrapers_next.nc.people.LegDetail:fetching https://www.ncleg.gov/Members/Biography/S/433
INFO:scrapers_next.nc.people.LegDetail:fetching https://www.ncleg.gov/Members/Biography/S/434
To review the data you scraped, you can inspect the JSON files in the dated directory within _scrapes/. Each time you run a scrape, a new numbered folder is created within the dated directory, so you can easily compare older data to new.
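For instance, you can diff two runs (the date and run numbers here are illustrative; substitute your own):
$ diff -r _scrapes/2023-01-15/001 _scrapes/2023-01-15/002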
Note
If a scrape fails, it's likely an issue with the scraper. Feel free to get in touch with us or file an issue.
Spatula is flexible and has several other useful CLI commands that are worth checking out as well.
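For example, spatula's test command (assuming your spatula version supports it) lets you run a single page class against one URL without a full scrape; the URL here is taken from the log output above:
$ poetry run spatula test scrapers_next.nc.people.LegDetail --source https://www.ncleg.gov/Members/Biography/S/430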
At this point you're ready to run spatula scrapers and contribute fixes. Hop onto our GitHub ticket queue, pick an issue to solve, and then submit a Pull Request!