Serving US Legislative Markup Docs as JSON Using Django

The purpose of the API I created at authorityspoke.com is to take provisions of the US Code and Constitution that have been published in XML by the US government, extract their headings and content, and then serve those as JSON in response to web requests. So a question that a prospective user might ask is, why download laws as JSON when they’re already available for free as XML or as web pages that reflect their natural structure as documents published by Congress?

The answer is that I created the authorityspoke.com API to solve a problem I faced in creating the AuthoritySpoke Python library. The purpose of this library is to create computable objects that represent judicial holdings about the meanings of legislative provisions. In order for these objects to be useful, they need to contain a precise description of exactly what legislative text is being construed by the court.

Luckily, the Office of the Law Revision Counsel of the US House of Representatives uses a very standardized publishing process based on the United States Legislative Markup (USLM) XML standard. With USLM, a particular point-in-time version of a title of the US Code can be parsed automatically to get the text at a particular citation. However, an unsolved problem for me is that USLM editions of the US Code only go back to the year 2013, so by relying on them I can’t provide accurate snapshots of US Code provisions that were repealed before 2013.

My first approach, up through version 0.3 of AuthoritySpoke, was to require users to obtain the XML files themselves and then use the Python library to extract the text of the relevant provision. That approach had all kinds of problems. It was time-consuming and error-prone for users, it could fail if there were any errors or irregularities in the materials published by the government, and it required XML parsing libraries to be included among the dependencies of the AuthoritySpoke library. Also, the XML files didn’t contain the dates when each provision came into effect and was later repealed, so users would have to input that information themselves. For AuthoritySpoke to be reasonably usable, it needed a way to access a provision’s text, structure, and effective date range with a single API call. It is possible to create an API that serializes and loads Python objects to and from XML rather than JSON, but JSON is the easier choice when using serialization libraries like marshmallow and Django REST Framework. So I started creating a JSON schema for legislative passages.

The USLM User’s Guide provides a great system for building identifiers for legislative provisions that look more like URLs than conventional legal citations. It involves creating nomenclature for each “level” of the citation, such as “section”, “subsection”, or “clause”. The names for each level are joined together with slashes. Optionally, the identifier can be followed by an @ sign and the date of the desired version of the provision. So for instance, if you wanted the current version of the United States Code paragraph about calculating credits against federal prison sentences for good behavior (18 U.S.C. Section 3624(b)(1)), you would use the identifier /us/usc/t18/s3624/b/1. But if you instead wanted the version of that paragraph in effect at the beginning of the year 2018, you would use /us/usc/t18/s3624/b/1@2018-01-01. And if you wanted to move up the tree and see the whole section about calculating a prisoner’s date of release, you could remove the last two parts of the identifier and just query for /us/usc/t18/s3624.
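As a rough illustration of how these identifiers break down, here is a minimal Python sketch that splits an identifier into its path and optional date, and moves up the tree by dropping trailing levels. The names `ProvisionQuery`, `parse_identifier`, and `ancestor` are hypothetical helpers invented for this example; they are not part of the USLM standard or of any library mentioned here.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class ProvisionQuery:
    """A USLM-style path plus an optional version date (hypothetical helper)."""

    path: str
    version_date: Optional[date] = None

    @property
    def levels(self) -> List[str]:
        # "/us/usc/t18/s3624/b/1" -> ["us", "usc", "t18", "s3624", "b", "1"]
        return [part for part in self.path.split("/") if part]

    def ancestor(self, levels_up: int = 1) -> "ProvisionQuery":
        # Drop trailing levels to move up the tree, keeping the same date.
        if not 0 < levels_up < len(self.levels):
            raise ValueError("levels_up must leave at least one level")
        kept = self.levels[:-levels_up]
        return ProvisionQuery("/" + "/".join(kept), self.version_date)


def parse_identifier(identifier: str) -> ProvisionQuery:
    """Split 'path@YYYY-MM-DD' into its path and optional date."""
    path, _, date_text = identifier.partition("@")
    version_date = date.fromisoformat(date_text) if date_text else None
    return ProvisionQuery(path.rstrip("/"), version_date)


query = parse_identifier("/us/usc/t18/s3624/b/1@2018-01-01")
section = query.ancestor(levels_up=2)  # path becomes "/us/usc/t18/s3624"
```

Keeping the date separate from the path means the same path can be queried at several points in time without rebuilding the string.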

The treelike structure of the US Code and the USLM identifier format dictated the structure of the API I wanted to create. Since I’m a Python programmer who prefers tools with large user bases and a lot of documentation, the only web frameworks I considered were Flask and Django. I was a bit overwhelmed by the range of choices that needed to be made to set up a Flask project, so I chose Django for its more opinionated approach. The choice of Django basically locked me in to using a relational database instead of a document database, and since I wanted to use Django’s object-relational mapper (ORM), the database interface needed to be based on SQL, not GraphQL. So I needed a way to store the tree-shaped data about the citations that exist in the various versions of the US Code, with the constraints that every citation can have many children but can have only one parent, and that a particular citation might not exist at every point in time.

My first strategy for saving the citation tree to a database was to use django-mptt. MPTT stands for “Modified Preorder Tree Traversal”. This seemed to work well at first, but then as the number of rows in the database table rose into the tens of thousands, it slowed down so much that it would have been infeasible to finish populating the database. This slowness turned out to be a known issue, as shown in this GitHub discussion where django-mptt’s creator says he no longer uses it for new projects. Even though django-mptt has more GitHub stars than any other Django tree traversal library, it seems not to be a good choice for large projects.

I had better luck with django-treebeard. It has almost as much documentation as django-mptt, and it ran fast enough to be usable for building my database. However, I have to admit I see trouble on the horizon here too, because django-treebeard is updated very rarely, and I’m concerned it won’t maintain compatibility with future versions of Django. I think a longer-term solution might involve using PostgreSQL’s ltree data type for hierarchical data, which ships with PostgreSQL as an extension module that has to be enabled in each database. Unfortunately I haven’t been able to find much in the way of tutorials for using ltree in Django.

Next I needed to decide how to handle changes in the text at a particular citation over time, or renumbering of the same text to a different citation. I chose a design that stored the actual text of each provision in a separate record linked to its citation by a foreign key, because one citation can have different text content at different times, while the same text content may exist at multiple citations, whether at the same time or at different times. For instance, in the year 2014, section 55 of USC Title 2 was renumbered, along with two of its subsections and the “continuation” text after the last subsection, from their old citation to section 6316 of USC Title 2.
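To make the relationship between citations, text, and dates concrete, here is a minimal sketch in plain Python rather than Django models. The `Placement` class and `text_id_at` function are hypothetical names for this example and do not reflect the actual database schema. The dates below are placeholders chosen for illustration, not the real effective dates of the renumbering.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class Placement:
    """One period during which a text passage sat at a citation (hypothetical)."""

    citation: str             # USLM-style path
    text_id: int              # key of the stored text passage
    start_date: date
    end_date: Optional[date]  # None means still in effect


def text_id_at(placements: List[Placement], citation: str, when: date) -> Optional[int]:
    """Return the id of the text in effect at `citation` on `when`, if any."""
    for p in placements:
        if p.citation != citation:
            continue
        started = p.start_date <= when
        not_ended = p.end_date is None or when <= p.end_date
        if started and not_ended:
            return p.text_id
    return None


# The same text (text_id=1) appears at two citations at different times.
# Placeholder dates, for illustration only.
placements = [
    Placement("/us/usc/t2/s55", 1, date(2013, 1, 1), date(2014, 6, 30)),
    Placement("/us/usc/t2/s6316", 1, date(2014, 7, 1), None),
]

old_location = text_id_at(placements, "/us/usc/t2/s55", date(2013, 12, 31))
new_location = text_id_at(placements, "/us/usc/t2/s6316", date(2015, 1, 1))
```

Because each placement carries its own date range, a renumbering adds a new row for the same text instead of copying the text itself.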

Here’s a class diagram showing how a section citation such as “/us/usc/t2/s6316” has a one-to-many relationship with its subsections, but it potentially has a many-to-many relationship with the text found at a citation, when the database stores the location of the text at different dates.

The idea of using tree-structured data required every enacted text passage to be located at a unique citation, but the law as codified isn’t quite so simple. The “continuation” XML elements contain statutory text, but they typically aren’t given identifiers. I chose to label them with the identifier of the provision immediately above, followed by “-con”. There are also a surprising number of places in the United States Code where the same identifier is used for multiple provisions in effect at the same time. In that case, I appended “-dup” to the later instance of the same citation, and if there were more than two, I also used “-du0”, “-du1”, and so on. As far as I can tell, that was enough to keep provisions from being overwritten in the database.
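The labeling scheme described above could be sketched roughly as follows. This is only one reading of the rule: it assumes the second occurrence of an identifier gets “-dup”, the third gets “-du0”, the fourth “-du1”, and so on. The function names `label_duplicates` and `continuation_label`, and the sample paths, are hypothetical.

```python
from collections import Counter
from typing import Iterable, List


def continuation_label(preceding_identifier: str) -> str:
    """Label unnumbered continuation text after the provision just above it."""
    return f"{preceding_identifier}-con"


def label_duplicates(identifiers: Iterable[str]) -> List[str]:
    """Give repeated identifiers unique labels, in document order."""
    seen: Counter = Counter()
    labeled = []
    for identifier in identifiers:
        count = seen[identifier]  # how many times it has appeared so far
        if count == 0:
            labeled.append(identifier)  # first instance keeps its name
        elif count == 1:
            labeled.append(f"{identifier}-dup")  # second instance
        else:
            # Assumption: third and later instances are numbered from 0.
            labeled.append(f"{identifier}-du{count - 2}")
        seen[identifier] += 1
    return labeled


# Made-up paths: "/x/a" appears four times, "/x/b" once.
labels = label_duplicates(["/x/a", "/x/a", "/x/b", "/x/a", "/x/a"])
```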

Here’s an example of fetching a provision from the API using the Legislice client (using an API token from the AuthoritySpoke account profile page):

from legislice.download import Client

# MY_API_TOKEN is a string holding your own token from the account profile page
client = Client(api_token=MY_API_TOKEN)
client.fetch(path="/us/const/amendment/XV")

The resulting JSON response is this:

{
  "heading": "AMENDMENT XV.",
  "content": "",
  "children": [
    {
      "heading": "Suffrage not to be abridged for race, color, etc.",
      "content": "The right of citizens of the United States to vote shall not be denied or abridged by the United States or by any State on account of race, color, or previous condition of servitude.",
      "children": [],
      "end_date": null,
      "node": "/us/const/amendment/XV/1",
      "start_date": "1870-03-30",
      "url": "https://authorityspoke.com/api/v1/us/const/amendment/XV/1/"
    },
    {
      "heading": "Section 2.",
      "content": "The Congress shall have power to enforce this article by appropriate legislation.",
      "children": [],
      "end_date": null,
      "node": "/us/const/amendment/XV/2",
      "start_date": "1870-03-30",
      "url": "https://authorityspoke.com/api/v1/us/const/amendment/XV/2/"
    }
  ],
  "end_date": null,
  "node": "/us/const/amendment/XV",
  "start_date": "1870-03-30",
  "url": "https://authorityspoke.com/api/v1/us/const/amendment/XV/",
  "parent": "https://authorityspoke.com/api/v1/us/const/amendment/"
}
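Because the response nests each provision’s children inside it, a short recursive function is enough to collect every passage that actually contains text. The sketch below uses only the standard library and works on a trimmed copy of the response above, keeping only the fields it reads. The function name `iter_passages` is hypothetical and is not part of the Legislice client.

```python
import json
from typing import Any, Dict, Iterator, Tuple


def iter_passages(provision: Dict[str, Any]) -> Iterator[Tuple[str, str]]:
    """Yield (node, content) for a provision and its descendants, depth first."""
    if provision.get("content"):  # skip nodes that only carry a heading
        yield provision["node"], provision["content"]
    for child in provision.get("children", []):
        yield from iter_passages(child)


# Trimmed copy of the JSON response shown above.
response_text = """
{
  "heading": "AMENDMENT XV.",
  "content": "",
  "node": "/us/const/amendment/XV",
  "children": [
    {
      "heading": "Suffrage not to be abridged for race, color, etc.",
      "content": "The right of citizens of the United States to vote shall not be denied or abridged by the United States or by any State on account of race, color, or previous condition of servitude.",
      "node": "/us/const/amendment/XV/1",
      "children": []
    },
    {
      "heading": "Section 2.",
      "content": "The Congress shall have power to enforce this article by appropriate legislation.",
      "node": "/us/const/amendment/XV/2",
      "children": []
    }
  ]
}
"""

provision = json.loads(response_text)
passages = list(iter_passages(provision))
for node, content in passages:
    print(node, "->", content[:40])
```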

The lingering problem with this API client is that in order to fetch a provision, it requires the user to have access to the provision’s identifier in the format used by the USLM schema and the Office of the Law Revision Counsel. Not many lawyers know that format. The logical solution is to provide a tool that converts conventional citations to the USLM style prior to making the API query. Preferably, this tool would extract the citations automatically from a document of interest to the user, rather than requiring users to retrieve the citations themselves. That’s a subject for another time.