Design Notes: YAML and JSON Schemas in AuthoritySpoke

AuthoritySpoke version 0.7 is available on PyPI, bringing with it a new data input format using YAML files. For documentation on that feature, check out the just-published user guide or the API documentation. In this blog post I’ll go further into my reasoning behind the changes, and where I see AuthoritySpoke going next.

I planned for AuthoritySpoke to load two kinds of data: machine-serialized JSON objects, and also handmade test data. I wanted this handmade data to allow various kinds of abbreviations, and even to be tolerant of certain kinds of errors. And then I made the fateful decision to create just one set of data loading schemas for loading both kinds of data. That was probably the most costly design mistake I’ve made on AuthoritySpoke (that I know of!) so far.

The functions that expanded abbreviated text in input files turned out to be easily the most finicky and error-prone parts of AuthoritySpoke. They also had a tendency to break in inscrutable ways when I modified functions in far-away parts of AuthoritySpoke that I had assumed were safely isolated from the text expansion functions. And they caused workflows that should have been simple to become lengthy and hard to debug.

It was as if all the data that I loaded into AuthoritySpoke first had to be placed on a very long conveyor belt (or to be more literal, a very tall call stack) where the data would be poked and tweaked and adjusted by a long series of functions that corrected typos, expanded abbreviations, and the like. When something went wrong, I’d have to inspect all of the functions along the conveyor belt until I found the one that wasn’t working as designed.

The Marshmallow data serialization library was permissive enough to let me introduce all kinds of anomalies into the data loading process, but in some ways I used that freedom to shoot myself in the foot. And of course, when I tried to use open source libraries to automatically generate a publishable OpenAPI specification by analyzing the schemas I’d written, the result made no sense because I’d used the serializers in nonstandard ways. (AuthoritySpoke’s current OpenAPI specification is better, I think.)

Also, the first process I established for loading handmade data was for the user to create a JSON file. But really, nobody wants to create JSON files by hand without purpose-built tools. So in version 0.7, my solution is to create a separate data loading workflow for handmade data, which should now be in YAML instead of in JSON. Here’s an example of a YAML file using the new data input format, with one of the rules from the “Beard Act” test dataset that I posted about before.

- holdings:
    - inputs:
        - type: fact
          content: "{the suspected beard} was facial hair"
        - type: fact
          content: the length of the suspected beard was >= 5 millimetres
        - type: fact
          content: the suspected beard occurred on or below the chin
      outputs:
        - type: fact
          content: the suspected beard was a beard
      enactments:
        - node: /test/acts/47/4
          exact:
            "In this Act, beard means any facial hair no shorter than 5 millimetres
            in length that: occurs on or below the chin"
      universal: true

(Nobody wants to create YAML files by hand either, but that’s a problem for another day.)

The YAML data loading module can now be kept separate from the rest of AuthoritySpoke, where it’ll be less likely to hurt anyone, and the workflow for loading data from JSON won’t include any features for handling abbreviations or typos. Most importantly for me, I’ll be able to write unit tests that get closer to isolating just the functions they’re really trying to test, without touching the text formatting functions.
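One way to picture the separation described above is a strict loader that accepts only fully spelled-out records, plus an expansion step that runs only in the handmade-data workflow. The sketch below is a minimal illustration using plain dictionaries rather than Marshmallow schemas. The function names are hypothetical, and so is the abbreviation being expanded (a bare string standing in for a fact); neither is taken from AuthoritySpoke’s actual YAML format.

```python
from typing import Any, Dict, List, Union

Record = Dict[str, Any]


def load_strict(record: Record) -> Record:
    """Load a machine-serialized record, rejecting anything abbreviated."""
    if not isinstance(record, dict):
        raise TypeError(f"expected a mapping, got {type(record).__name__}")
    missing = {"type", "content"} - set(record)
    if missing:
        raise ValueError(f"missing required keys: {sorted(missing)}")
    return {"type": record["type"], "content": record["content"]}


def expand_handmade(item: Union[str, Record]) -> Record:
    """Expand a hypothetical abbreviation: a bare string becomes a fact record."""
    if isinstance(item, str):
        return {"type": "fact", "content": item.strip()}
    return item


def load_handmade(items: List[Union[str, Record]]) -> List[Record]:
    """Handmade workflow: expand abbreviations first, then reuse the strict loader."""
    return [load_strict(expand_handmade(item)) for item in items]


handmade = [
    "the suspected beard was facial hair",
    {"type": "fact", "content": "the suspected beard was a beard"},
]
loaded = load_handmade(handmade)
```

Because `load_strict` never calls `expand_handmade`, a unit test for the strict path can’t be broken by a change to the expansion rules.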

I considered switching from Marshmallow to the trendier Pydantic serializer, but I decided against it for two related reasons. First, the AuthoritySpoke classes that represent units of legal analysis already have a very complicated subclass inheritance pattern. Pydantic requires any class that’s going to be serialized to also inherit from a Pydantic serialization parent class. I was afraid that inheriting from yet another parent class would add even more complexity, with unforeseen consequences. Second, I’ve had good experiences applying the design concept of dependency inversion. I want to think of serialization libraries as implementation details, not as core features of AuthoritySpoke. By sticking with Marshmallow, I can keep the serialization schemas in their own modules, separate from the core business logic. The core modules of AuthoritySpoke don’t have to “know about” the serializer classes, and I can write unit tests for the core business logic that don’t touch Marshmallow in any way.
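The dependency direction described above can be sketched in a few lines. This is a simplified stand-in, not AuthoritySpoke’s real code: the `Fact` class here has only two fields, the `contradicts` method and the hand-written `FactSchema` class are hypothetical, and the schema is plain Python rather than a Marshmallow schema. The point is only that the schema imports the core class, and the core class imports nothing from the schema.

```python
import json
from dataclasses import dataclass


# --- "core" module: business logic only, no serializer imports ---
@dataclass(frozen=True)
class Fact:
    content: str
    truth: bool = True  # simplified field, assumed for illustration

    def contradicts(self, other: "Fact") -> bool:
        """Two facts conflict when they say opposite things about the same content."""
        return self.content == other.content and self.truth != other.truth


# --- "schema" module: depends on the core, never the other way round ---
class FactSchema:
    def load(self, data: dict) -> Fact:
        return Fact(content=data["content"], truth=data.get("truth", True))

    def dump(self, fact: Fact) -> dict:
        return {"type": "fact", "content": fact.content, "truth": fact.truth}


schema = FactSchema()
fact = schema.load({"content": "the suspected beard was a beard"})
denial = Fact("the suspected beard was a beard", truth=False)

# Round trip through JSON text and back using only the schema module.
round_trip = schema.load(json.loads(json.dumps(schema.dump(fact))))
```

A test of `Fact.contradicts` needs only the first half of this sketch, so swapping the serialization library later would leave that test untouched.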

The biggest challenge remaining in AuthoritySpoke’s data schema (including the simpler non-YAML schema) is that it’s a polymorphic schema, meaning more than one object schema can occur in the same place. For instance, an “input” or “output” for AuthoritySpoke’s Holding class could be a Fact, or it could be an item of Evidence, or other things. To support polymorphism, AuthoritySpoke needs to import not just Marshmallow but also a related library called marshmallow-oneofschema. I’ve learned that I should get nervous when I import a software package without a large and active community, and for me the easiest way to measure that community is GitHub stars, which basically correspond to satisfied users. Marshmallow has 5,500 stars, which is not that high compared to the 21,000 stars that its competitor Django REST Framework has. (Pydantic has 6,500.) But if I want to generate an OpenAPI specification for my Marshmallow schema, I have to also download apispec, which has 859 stars at the time of writing. Then my polymorphic schema requires me to grab marshmallow-oneofschema, which has a mere 96 stars. And then the polymorphic part of my schema needs to be included in the OpenAPI specification too, so I have to import apispec-oneofschema, which has just eight stars including mine. Pretty scary. These libraries could have trouble in the future, and I expect to be relying on them a lot as I move forward with AuthoritySpoke.
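To make “polymorphic schema” concrete, here is a minimal sketch of the underlying idea: look at a `type` key in each record and dispatch to the loader for that type. It uses only the standard library and is not the API of marshmallow-oneofschema. The `Fact` and `Evidence` classes are stripped-down stand-ins, and the `description` field, the `LOADERS` table, and `load_factor` are hypothetical names invented for this example.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Union


@dataclass
class Fact:
    content: str


@dataclass
class Evidence:
    description: str  # hypothetical field name, for illustration only


Factor = Union[Fact, Evidence]

# One loader per value that may appear under the "type" key.
LOADERS: Dict[str, Callable[[Dict[str, Any]], Factor]] = {
    "fact": lambda fields: Fact(content=fields["content"]),
    "evidence": lambda fields: Evidence(description=fields["description"]),
}


def load_factor(data: Dict[str, Any]) -> Factor:
    """Pick a loader based on the record's "type" key, then build the object."""
    fields = dict(data)  # copy so the caller's record is not mutated
    type_name = fields.pop("type", None)
    try:
        loader = LOADERS[type_name]
    except KeyError:
        raise ValueError(f"unsupported type: {type_name!r}") from None
    return loader(fields)


inputs = [
    load_factor({"type": "fact", "content": "the suspected beard was facial hair"}),
    load_factor({"type": "evidence", "description": "a photograph of the defendant"}),
]
```

Every place in the schema that accepts “a Fact or an item of Evidence” would call something like `load_factor`, which is why the polymorphic piece ends up being a dependency of both the loading code and the generated OpenAPI specification.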

The future of AuthoritySpoke depends on getting it working with web APIs, not to mention a web user interface. Version 0.7 does a lot to simplify one of AuthoritySpoke’s data models to make it suitable for the web. An even simpler data model would be better, but I think the foundation exists to design ways to share and organize judicial rule models on authorityspoke.com.