Converting Scrapers to pupa

Note

This document is a work-in-progress; if any part of it is unclear, suggest changes or improvements.

As of early 2017, we’ve embarked on a process to switch away from our legacy backend (billy), which has been in use since 2009, to a more modern backend (pupa) based on the Open Civic Data specification and tools. We’ve written in a few places about why this change is important and worth our time.

This task will require updates to every single one of our scrapers. Given that this is such a big task, and one that will enable many further improvements, converting scrapers from billy to pupa is one of the best ways to help out on Open States right now. Follow this guide and start converting a state!

Before You Start

If you haven’t already, you should read our Start Contributing to Open States guide. It’s important to know how scrapers are run, before starting to convert them.

It’s also important to make sure that the state you’ve chosen to convert from billy to pupa hasn’t already been converted! This is tracked in ticket #1442. If your state of choice isn’t available, consider picking another state, or fixing some Open States bugs.

Begin

Comment on the aforementioned tracking ticket so that no one else accidentally works on this state as well, duplicating effort. And once you have a Git branch with some conversion work in it, open a WIP PR (Work-in-Progress Pull Request) against the Open States repository, so that others can follow along.

Converting Metadata

For example purposes, let’s look at what it takes to convert North Carolina. Going forward, you can replace nc with the abbreviation for your selected state.

Each state has static metadata found in openstates/{{state}}/__init__.py. The first step will be to convert this metadata from the billy format to the new pupa format.

  1. Start by moving the existing metadata to our new billy_metadata/ directory; until all states are in pupa format, we’re going to need to keep the billy-specific metadata around:

    $ git mv openstates/nc/__init__.py billy_metadata/nc.py
    
  2. Edit billy_metadata/nc.py:

    Delete everything in the module besides the metadata dictionary and any imports it requires. You may want to leave session_list in place temporarily as well, as we’ll be using it in the next step.

    Example diff: NC billy_metadata

  3. Create a new openstates/nc/__init__.py:

    There’s a script to help with this step. It isn’t guaranteed to work perfectly, but should at least provide a good starting point:

    $ ./scripts/convert_metadata.py nc > openstates/nc/__init__.py
    

    Then, set the url property inside to a valid URL representing the state’s government. You may need to modify the file further, such as indicating the number of seats in the upper and lower houses of your state.

    Delete the session_list function from billy_metadata/nc.py and add it to the jurisdiction subclass in the newly created openstates/nc/__init__.py file, renamed to get_session_list.

    Example diff: updated NC metadata

At this point you should have a fairly complete OCD jurisdiction defined. Next, we’ll move on to converting the legislator scraper.

Converting Legislators

We won’t be able to test our pupa metadata file until we write a pupa scraper, so let’s do that!

For a state in the billy framework, we have a legislators.py file that contains a scraper subclassed from billy.scrape.legislators.LegislatorScraper. This scraper captures and saves Legislator objects.

In pupa, this scraper is called a people scraper instead. (This is because OCD can easily model individuals who aren’t members of a legislature.)

As you’ll see, pupa scrapers yield scraped objects, whereas a billy scraper would call save_legislator; yield and yield from expose Python 3’s powerful generator and subroutine capabilities.

Before diving in, it’s helpful to look over the docs for billy legislator scrapers and pupa person scrapers.

  1. Rename the scraper file:

    $ git mv openstates/nc/legislators.py openstates/nc/people.py
    
  2. Open people.py and update the import statement:

    At the top of the file, you’ll see something like:

    from billy.scrape.legislators import LegislatorScraper, Legislator
    

    We instead want:

    from pupa.scrape import Person, Scraper
    

    (pupa doesn’t have different Scraper subclasses.)

    Also, rename the scraper class defined in the file (so that it refers to Person rather than Legislator), and make it a subclass of Scraper rather than LegislatorScraper. Using the North Carolina example, you would convert this:

    class NCLegislatorScraper(LegislatorScraper):
    

    to this:

    class NCPersonScraper(Scraper):
    

    Note that if your class also subclasses from something else like LXMLMixin, do not remove that. For example, if you had class NCLegislatorScraper(LegislatorScraper, LXMLMixin):, then you would change it to class NCPersonScraper(Scraper, LXMLMixin):.

  3. Update the scrape method’s signature:

    In billy scrapers, the scrape method signature is scrape(term, chambers), and serves as an entrypoint for the scraper class.

    pupa scrapers also use a scrape method as an entrypoint, but the parameters are all optional.

    Because most legislator scrapers only scrape the current session, we’ll drop the term argument, and the chambers argument can be made into an optional chamber argument.

    The NC scraper already had a scrape_chamber method that was invoked by the scrape method. So, we updated our scrape method to dispatch like this:

    def scrape(self, chamber=None):
        if chamber:
            yield from self.scrape_chamber(chamber)
        else:
            yield from self.scrape_chamber('upper')
            yield from self.scrape_chamber('lower')
    

    pupa scrape methods (which are generators) must yield objects. Since the NC scraper’s scrape_chamber method (also a generator) will collect and yield the Person objects initially, the scrape method must yield from that generator itself.
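
The dispatch pattern can be tried without pupa installed. The following is a minimal sketch, assuming a stand-in class (DemoPersonScraper is hypothetical and yields plain dictionaries rather than pupa Person objects), that shows how yield from passes along everything scrape_chamber produces:

```python
class DemoPersonScraper:
    """Hypothetical stand-in for a pupa Scraper subclass; no network access."""

    def scrape_chamber(self, chamber):
        # A real scraper would parse the legislature's site and yield Person
        # objects; plain dicts with placeholder names are used here instead.
        for name in ('Member A', 'Member B'):
            yield {'name': name, 'primary_org': chamber}

    def scrape(self, chamber=None):
        # Same dispatch as the NC scraper: one chamber if given, else both.
        if chamber:
            yield from self.scrape_chamber(chamber)
        else:
            yield from self.scrape_chamber('upper')
            yield from self.scrape_chamber('lower')


scraper = DemoPersonScraper()
all_people = list(scraper.scrape())
upper_only = list(scraper.scrape(chamber='upper'))
print(len(all_people), len(upper_only))
```

Calling self.scrape_chamber(chamber) on its own, without yield from, would only create a generator object that is never iterated, so none of its objects would reach the caller.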

  4. Update the portion of the code that creates and saves Legislator objects:

    The billy scrapers create Legislator objects, and then call self.save_legislator. We’ll need to turn self.save_legislator into a yield of Person objects.

    This change is typically minimal; there’s a lot of code in billy legislator scrapers, but very little of it should need to be edited for the purposes of pupa.

    Instead of instantiating Legislator objects, instantiate Person objects. You should also name the variable that will hold your Person as person, whereas your Legislator was probably assigned to leg or legislator. Then, in all remaining code, replace leg/legislator with person.

    All arguments passed to Person need to be named; some states’ old scrapers do not assign names to all arguments, so you will need to add argument names in those cases. Also, some named arguments have changed names. Your Person should only take these five arguments:

    • primary_org (may have been chamber)
    • district
    • name (may have been full_name)
    • party
    • image (may have been photo_url)

    Using the chamber to primary_org change as an example, your instantiation of the Legislator will probably say either chamber=chamber or just chamber, but in either case should be changed to primary_org=chamber when instantiating the Person. Note that there is no need to change the variable name earlier in the code.

    Instead of passing url as an argument, add any such links with Person.add_link.

    term should no longer be given as an argument. Any extra arguments that were given to Legislator (besides the five listed above) can be placed in Person’s extras dictionary. For example, if Legislator was given a town_represented, you would instead do something like this:

    person.extras['town_represented'] = town_represented
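
The argument changes described above can be summarized as a small sorting exercise. This is only a sketch: split_legislator_args is a hypothetical helper that exists in neither billy nor pupa, and the sample values are made up. It takes a dictionary of billy-style Legislator arguments and separates it into the five Person arguments, the links, and the extras:

```python
# Hypothetical helper (not part of billy or pupa): it only rearranges a dict
# of billy-style Legislator arguments into the shape this guide describes.
RENAMED = {'chamber': 'primary_org', 'full_name': 'name', 'photo_url': 'image'}
PERSON_ARGS = {'primary_org', 'district', 'name', 'party', 'image'}


def split_legislator_args(billy_args):
    """Return (person_kwargs, links, extras) for a dict of billy arguments."""
    person_kwargs, links, extras = {}, [], {}
    for key, value in billy_args.items():
        key = RENAMED.get(key, key)
        if key == 'term':
            continue  # term is no longer passed at all
        elif key == 'url':
            links.append(value)  # to be added later with person.add_link
        elif key in PERSON_ARGS:
            person_kwargs[key] = value
        else:
            extras[key] = value  # to be stored in person.extras
    return person_kwargs, links, extras


# Made-up sample values, for illustration only.
person_kwargs, links, extras = split_legislator_args({
    'term': '2017-2018',
    'chamber': 'upper',
    'district': '1',
    'full_name': 'Jane Doe',
    'party': 'Democratic',
    'photo_url': 'https://example.com/jane.jpg',
    'url': 'https://example.com/jane',
    'town_represented': 'Raleigh',
})
print(person_kwargs, links, extras)
```

In a real scraper you would usually make these edits by hand at the point where the Legislator was instantiated; the helper is just a compact way to see which argument ends up where.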
    

    For contact information, instead of using add_office, you’ll use Person’s add_contact_detail method. For example, adding a district office’s phone number might look something like this:

    person.add_contact_detail(type='voice', value=contact_info['phone'], note='District Office')
    

    Note that contact_info['phone'] above should be replaced with wherever that phone number was stored earlier in the code. The type comes from the Popolo standard.
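
When an office has several fields, it can help to build the add_contact_detail arguments in a loop. The sketch below makes two assumptions: office_to_contact_details is a hypothetical helper, and the field-to-type mapping (other than 'voice', which appears above) is a guess that should be checked against the contact types pupa accepts. The sample contact values are made up.

```python
# Assumed mapping from billy office fields to contact types; only 'voice'
# appears in this guide, so verify the others before relying on them.
FIELD_TO_TYPE = {
    'phone': 'voice',
    'fax': 'fax',
    'email': 'email',
    'address': 'address',
}


def office_to_contact_details(office, note):
    """Build add_contact_detail keyword arguments from one office dict."""
    details = []
    for field, contact_type in FIELD_TO_TYPE.items():
        value = office.get(field)
        if value:  # skip fields the old scraper left empty or missing
            details.append({'type': contact_type, 'value': value, 'note': note})
    return details


# Made-up sample data standing in for whatever the old scraper collected.
contact_info = {'phone': '919-555-0100', 'email': 'jane.doe@example.com', 'fax': None}
details = office_to_contact_details(contact_info, note='District Office')
for kwargs in details:
    # With a real Person object: person.add_contact_detail(**kwargs)
    print(kwargs)
```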

    Instead of self.save_legislator(Legislator) from billy, end with yield person. (Make sure that any function that creates Person objects outside of scrape is invoked by scrape using yield from, as described above.)

    Again, it might be a good idea to look over the docs for billy legislator scrapers and pupa person scrapers.

    Since you’re also switching from Python 2 (billy) to Python 3 (pupa), you may need to make syntax changes to the module. For instance, if a dictionary’s iteritems() method is used anywhere, it will have to be replaced by items(). Also, xrange will need to be replaced by range.
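
As a quick illustration of those two replacements, here is a short sketch with the Python 2 spellings kept in comments (the dictionary and its values are made up for the example):

```python
votes_by_district = {'1': 12, '2': 7}

# Python 2: for district, votes in votes_by_district.iteritems():
for district, votes in votes_by_district.items():
    print(district, votes)

# Python 2: pages = [page for page in xrange(1, 4)]
pages = [page for page in range(1, 4)]

# Python 2's print statement is also a function call in Python 3.
print(pages)
```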

    At this point, your person scraper should essentially be converted.

    Example diff: converted legislator scraper (there may be significant differences between the North Carolina example and your state)

  5. Revisiting the metadata:

    We now need to make one small change to the metadata (i.e., the __init__.py file) to let pupa know about our person scraper. Import our new scraper at the top of openstates/nc/__init__.py:

    from .people import NCPersonScraper
    

    And within the Jurisdiction object, update the scrapers dictionary to look like:

    scrapers = {
        'people': NCPersonScraper,
    }
    
  6. Running your first scraper:

    Now let’s try giving it a run:

    $ docker-compose run scrape nc
    

    This runs pupa scrapers for the state. A second script is then executed, back-porting the scraped pupa data to billy format; since the API and website currently rely on the billy format, this is necessary during the transition off of billy.

You’ll probably see output like:

no pupa_settings on path, using defaults
nc (scrape)
  people: {}
Not checking sessions...
15:35:05 INFO pupa: save jurisdiction North Carolina as jurisdiction_ocd-jurisdiction-country:us-state:nc-government.json
15:35:05 INFO pupa: save organization North Carolina General Assembly as organization_6ecadcc4-0122-11e7-91f7-0242ac130003.json
15:35:05 INFO pupa: save organization Senate as organization_6ecae228-0122-11e7-91f7-0242ac130003.json
15:35:05 INFO pupa: save post 1 as post_6ecb36e2-0122-11e7-91f7-0242ac130003.json
15:35:05 INFO pupa: save post 2 as post_6ecb3840-0122-11e7-91f7-0242ac130003.json
15:35:05 INFO pupa: save post 3 as post_6ecb3976-0122-11e7-91f7-0242ac130003.json
15:35:05 INFO pupa: save post 4 as post_6ecb3ab6-0122-11e7-91f7-0242ac130003.json

The people: {} line describes what type of data pupa is trying to scrape, that it has found your Person scraper, and that it is running without any arguments.

Next, you see the line Not checking sessions..., which we’ll revisit later.

If all goes well, the scraper will run for a while, writing JSON objects to the _data directory as it goes.

Finally, you’ll see output like:

nc (scrape)
  people: {}
jurisdiction scrape:
  duration:  0:00:00.561228
  objects:
    jurisdiction: 1
    organization: 5
    post: 170
people scrape:
  duration:  0:00:03.910275
  objects:
    membership: 340
    person: 170

This is the result of the scrape, including the metadata and person objects that were successfully collected.

Once that is done you’ll see the to-billy conversion begin, ultimately ending in some lines like:

15:43:34 INFO billy: billy-update abbr=nc
    actions=import,report
    types=bills,legislators,votes,committees,alldata
    sessions=2017
    terms=2017-2018
15:43:35 INFO billy: Finished importing 170 legislator files.
15:43:35 INFO billy: imported 0 vote files
15:43:35 INFO billy: imported 0 bill files
15:43:35 INFO billy: imported 0 committee files

The important part to check is the {{n}} legislator files, which ought to match the number of person objects reported by pupa.
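
If you would like to double-check the pupa side of that comparison, you can count the JSON files the scrape wrote. This is a rough sketch under two assumptions: that files are named with their object type before the first underscore (as in the organization_… and post_… names in the log above), and that they live in a directory such as _data/nc, which is a guess. count_scraped_objects is a hypothetical helper, not part of pupa.

```python
import collections
import pathlib


def count_scraped_objects(data_dir):
    """Count JSON files by the type prefix before the first underscore."""
    counts = collections.Counter()
    for path in pathlib.Path(data_dir).glob('*.json'):
        counts[path.name.split('_', 1)[0]] += 1
    return counts


if __name__ == '__main__':
    # '_data/nc' is an assumed location; point this at wherever your run
    # actually wrote its files.
    print(count_scraped_objects('_data/nc'))
```

The count reported for person can then be compared with the number of legislator files billy says it imported.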

Once you get to this point, you have successfully converted a scraper to pupa! Congratulations, and thank you! Let’s make sure your hard work gets integrated.

Creating Your Pull Request

Once you have this work done, go ahead and let us know so that we can avoid duplicating effort.

The preferred way to do this is to open a work-in-progress PR, naming your PR something like [WIP] Convert {{state}} to pupa. A helpful guide to making PRs with GitHub is here: https://help.github.com/articles/creating-a-pull-request/

Someone from the team will review the PR and may request some minor fixes, but whatever its status, your work will be helpful. If you’d like to continue on, Converting Scrapers to pupa (continued) has information on converting the remaining types of scrapers.