Text Extraction
The bill scrapers (see contributing-to-scrapers) scrape the web and pull
down metadata, including links to various versions of the bills. As a
later step, we extract the actual text of the bill so that it can be
indexed for search and other uses.
Checking out
Fork and clone the text-extraction repository:
- Visit https://github.com/openstates/text-extraction and click the 'Fork' button.
- Clone your fork using your tool of choice or the command line:
  $ git clone git@github.com:yourname/text-extraction.git
  Cloning into 'text-extraction'...
- And remember to install pre-commit:
  $ pre-commit install
  pre-commit installed at .git/hooks/pre-commit
Repository overview
The text extraction code itself is written as a standalone Python script,
text_extract.py, that uses configuration and utility functions from
within extract/.
You'll also notice a directory called raw/, which contains a sampling of
bills for each state that we can use to test text-extraction.
Typically, if you're making changes in the repository, you'll be editing
files within extract/. We'll come back to that later.
Running text_extract
Just like in other repositories, we'll use docker-compose to run the
code. In this case docker-compose is running text_extract.py, an
all-in-one tool that has a few useful subcommands:
Usage: text_extract.py [OPTIONS] COMMAND [ARGS]...
Options:
  --help  Show this message and exit.
Commands:
  reindex-state  rebuild the search index objects for a given state
  sample         obtain a sample of bills to extract text from
  status         print a status table showing the current condition of...
  test           run sample on all states, used for CI
  update         update the saved bill text in the database
For the purposes of development, sample and update are the only two
commands that you'll need to look at.
Let's go ahead and run sample against NC:
$ docker-compose run --rm text-extract sample nc
raw/nc/2017-HR 924-Edition 1.pdf => text/nc/2017-HR 924-Edition 1.pdf.txt (1507 bytes)
raw/nc/2017-HB 1034-Edition 1.pdf => text/nc/2017-HB 1034-Edition 1.pdf.txt (3096 bytes)
raw/nc/2019-SB 421-Edition 1.pdf => text/nc/2019-SB 421-Edition 1.pdf.txt (961 bytes)
raw/nc/2019-HB 430-Edition 1.pdf => text/nc/2019-HB 430-Edition 1.pdf.txt (4831 bytes)
raw/nc/2017-SB 753-Edition 1.pdf => text/nc/2017-SB 753-Edition 1.pdf.txt (719 bytes)
raw/nc/2019-HB 788-Edition 1.pdf => text/nc/2019-HB 788-Edition 1.pdf.txt (2674 bytes)
raw/nc/2017-SB 373-Filed.pdf => text/nc/2017-SB 373-Filed.pdf.txt (18538 bytes)
raw/nc/2019-SB 574-Filed.pdf => text/nc/2019-SB 574-Filed.pdf.txt (1712 bytes)
raw/nc/2017-SJR 686-Resolution 2017-12.pdf => text/nc/2017-SJR 686-Resolution 2017-12.pdf.txt (15928 bytes)
raw/nc/2017-HB 1007-Filed.pdf => text/nc/2017-HB 1007-Filed.pdf.txt (6248 bytes)
nc: processed 10, 0 skipped, 0 missing, 0 empty
The exact output and number of bills will vary across states, but should be pretty similar.
This command just did a lot:
- Read in the file raw/nc.csv to get a list of bills to sample.
- Downloaded those files (assuming this was the first run) to raw/nc/ so future runs will be faster.
- Used the extraction function(s) defined in extract/__init__.py for NC to extract text from the given documents.
- Wrote that output to text/nc/ so you can compare.
You'll also notice that it helpfully prints the number of bytes of text
extracted, which is useful as a first check. Let's go ahead and look at
the shortest one, text/nc/2017-SB 753-Edition 1.pdf.txt (your run may
differ; pick whichever you prefer):
$ cat "text/nc/2017-SB 753-Edition 1.pdf.txt"
A BILL TO BE ENTITLED
AN ACT PROVIDING THAT THE DEPOSIT OF CURRENCY AND COINS INTO A CASH
VAULT THAT PHYSICALLY SECURES THE CASH AND ELECTRONICALLY
RECORDS THE DEPOSIT DAILY IN AN OFFICIAL DEPOSITORY BANK QUALIFIES
AS A DAILY DEPOSIT UNDER THE LOCAL GOVERNMENT BUDGET AND FISCAL
CONTROL ACT FOR FRANKLIN AND WAKE COUNTIES AND THE
MUNICIPALITIES IN THOSE COUNTIES.
The General Assembly of North Carolina enacts:
SECTION 1. Section 2 of S.L. 2011-89 reads as rewritten:
"SECTION 2. This act applies only to the City of Winston-Salem only.Winston-Salem,
Franklin County and the municipalities in Franklin County, and Wake County and the
municipalities in Wake County."
SECTION 2. This act is effective when it becomes law.
This looks complete, but to check, go ahead and open the equivalent
source file, in this case raw/nc/2017-SB 753-Edition 1.pdf, and confirm
visually that all the text was extracted. Don't worry about formatting
or the preamble, as we'll often exclude that and just aim for the
interesting bits of the text.
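The byte-count check described above can also be done after the fact. The following is a minimal sketch, not part of the repository: smallest_outputs is a hypothetical helper name, and it assumes only that the extracted files end in .txt and live in a directory such as text/nc/.

```python
from pathlib import Path


def smallest_outputs(text_dir, limit=5):
    """Return (size_in_bytes, filename) pairs for the smallest .txt files.

    Hypothetical helper for illustration; it only inspects file sizes and
    does not look at the extracted text itself.
    """
    sizes = [(p.stat().st_size, p.name) for p in Path(text_dir).glob("*.txt")]
    # Tuples sort by size first, so the shortest extractions come first.
    return sorted(sizes)[:limit]


# Example use after a sample run:
#     for size, name in smallest_outputs("text/nc"):
#         print(size, name)
```

Zero-byte or unusually small files are the ones worth opening next to their source document first.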
Making changes
Let's say that we discover that a state has started publishing its
bills in a new format. Perhaps Alabama switches from PDF to HTML. It'd
first be good to add some of these new bills to the sample csv, which
you can do manually or by invoking sample with the --resample flag:
docker-compose run --rm text-extract sample --resample al
Running this would print some warnings and produce some zero-byte files.
To actually handle the HTML documents we'd open up extract/__init__.py
and find the CONVERSION_FUNCTIONS dictionary, where you'll see a line like:
CONVERSION_FUNCTIONS = {
    "al": {"application/pdf": extract_line_numbered_pdf},
    ...
The way extraction works is by matching a document found in a scrape to
an appropriate function. In this case, PDFs will be sent through the
extract_line_numbered_pdf function.
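As a rough illustration of that matching step, here is a minimal sketch. It assumes the lookup is keyed first by state and then by media type, as the dictionary shape suggests. pick_extractor and extract are hypothetical names, and the extract_line_numbered_pdf defined here is only a stand-in for the real function in extract/.

```python
def extract_line_numbered_pdf(data):
    # Stand-in only: the real function lives in extract/ and parses PDF bytes.
    return "text from pdf"


# Same shape as the CONVERSION_FUNCTIONS dictionary in extract/__init__.py.
CONVERSION_FUNCTIONS = {
    "al": {"application/pdf": extract_line_numbered_pdf},
}


def pick_extractor(state, media_type):
    """Hypothetical helper: find the function for a state and media type."""
    return CONVERSION_FUNCTIONS.get(state, {}).get(media_type)


def extract(state, media_type, data):
    """Hypothetical helper: run the matching function, if there is one."""
    func = pick_extractor(state, media_type)
    if func is None:
        # Nothing registered for this media type, so no text comes out.
        return ""
    return func(data)
```

Under these assumptions, an HTML document for a state that only registers a PDF function yields no text, which is consistent with the zero-byte files mentioned earlier.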
If the new HTML was wrapped in a given element, perhaps with
<div id="billtext">, we could just update that line to look like:
CONVERSION_FUNCTIONS = {
    "al": {
        "application/pdf": extract_line_numbered_pdf,
        "text/html": extractor_for_element_by_id("billtext"),
    },
    ...
And we'd be good to go.
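To make the idea of an extractor built from an element id more concrete, here is a minimal sketch using only the standard library. It is not the repository's extractor_for_element_by_id, whose implementation is not shown here. make_extractor_by_id is a hypothetical name, and the sketch assumes reasonably well-formed HTML where non-void tags are closed.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not change the depth.
VOID_TAGS = {
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
}


class _TextById(HTMLParser):
    """Collect the text found inside the element with a given id."""

    def __init__(self, element_id):
        super().__init__()
        self.element_id = element_id
        self.depth = 0  # greater than zero while inside the target element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == self.element_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag not in VOID_TAGS:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def make_extractor_by_id(element_id):
    """Hypothetical factory: return a function that extracts one element's text."""

    def extract(data):
        html = data.decode("utf-8", errors="replace") if isinstance(data, bytes) else data
        parser = _TextById(element_id)
        parser.feed(html)
        parser.close()
        return "".join(parser.chunks).strip()

    return extract
```

The factory pattern is what lets a single dictionary entry carry its own configuration: the id is fixed once, and the returned function only needs the document.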
Tips & Tricks
- Functions already exist for common configurations of PDF, HTML, Word Doc, and even OCR. Rarely will you need to write a custom function, so always look at the options first.
- When dealing with PDFs, most are handled by either extract_line_numbered_pdf or extract_sometimes_numbered_pdf. The difference is that "sometimes numbered PDF" accounts for cases where 90% or so of bills are numbered, but a few (often resolutions) are not numbered.
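The distinction between always-numbered and sometimes-numbered documents can be pictured with plain text. The sketch below is an assumption about the general idea, not how the repository's PDF functions work: it strips leading line numbers only when most non-blank lines carry one. Both function names and the threshold are hypothetical.

```python
import re

# A leading line number: optional spaces, digits, then at least one space.
LINE_NUMBER = re.compile(r"^\s*\d+\s+")


def strip_line_numbers(text):
    """Remove a leading line number from every line that has one."""
    return "\n".join(LINE_NUMBER.sub("", line) for line in text.splitlines())


def strip_if_mostly_numbered(text, threshold=0.5):
    """Strip line numbers only when most non-blank lines start with one.

    The threshold is an illustrative assumption, not a value from the
    repository.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return text
    numbered = sum(1 for line in lines if LINE_NUMBER.match(line))
    if numbered / len(lines) > threshold:
        return strip_line_numbers(text)
    # Unnumbered documents (often resolutions) pass through unchanged.
    return text
```

A check like this avoids eating legitimate leading digits in a document that was never numbered in the first place.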
Formatting Guidelines
How far do we go? Should we strip punctuation? Newlines? Whitespace? Section headings?
- Try not to be too aggressive with punctuation stripping. Search indices/etc. can easily strip it later, but it can be handy if someone decides they want to search for things like "§ 143C-4-8.b".
- Ideally leave newlines as-is since it makes looking at changes a lot nicer for humans and stripping newlines out for final products (search/text comparison/etc.) is trivial.
- Collapsing spaces/etc. is recommended, but not required.
- Removal of section headers/etc. is fine, but if the only reason you're writing a new function instead of using a common one is to do this, reconsider.
When in doubt, ask. You may have encountered something we haven't considered yet, and we can discuss the best practice and add it here.
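As one way to follow the guidelines above, here is a minimal sketch of a cleanup step that collapses spaces while leaving newlines and punctuation alone. tidy_text is a hypothetical name, not a function from the repository.

```python
import re


def tidy_text(text):
    """Collapse runs of spaces and tabs; keep newlines and punctuation."""
    lines = (re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines())
    return "\n".join(lines)
```

Because the line structure is preserved, diffs between two extraction runs stay readable, and a later step can still drop newlines or punctuation if it needs to.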
Should we include bill digests?
There isn't a need to, but it doesn't hurt if separating the two is difficult.