Around the beginning of 2021, the Free Law Project extracted the code that it’s been using to link case citations within CourtListener, and released it as a new open source Python package called Eyecite. I think Eyecite could become the most widely useful open source legal analysis tool to be released by anyone so far. It seems to have incredible potential for citation network analysis, and for preparing caselaw for natural language processing. I’m sure tools like this have existed inside commercial publishers for a long time, but providing these capabilities to open source developers could make a huge difference in expanding access to law.

Eyecite is built atop two arduous research projects that were themselves released as Python packages: Courts-DB and Reporters-DB. These provide the data that lets Eyecite know which strings are valid case citations, and what courts published the opinions at each citation. Courts-DB and Reporters-DB were also created by the Free Law Project, building on earlier work by Frank Bennett and the Legal Resource Registry.

I’ll use the rest of this blog post to try out Eyecite’s basic features and give my first impressions. Eyecite is still under active development, and I’m testing the version on the current master branch, which isn’t an official release, so it could be extra buggy.

Detecting Citations

I tested Eyecite’s citation detection feature on the first paragraph of the discussion section of the US Supreme Court’s recent opinion in Google v. Oracle America.

>>> import eyecite
text_from_opinion = """Copyright and patents, the Constitution says,
    are to “promote the Progress of Science and useful Arts,
    by securing for limited Times to Authors and Inventors the
    exclusive Right to their respective Writings and Discoveries.”
    Art. I, §8, cl. 8. Copyright statutes and case law have made
    clear that copyright has practical objectives. It grants an
    author an exclusive right to produce his work (sometimes for
    a hundred years or more), not as a special reward, but in order
    to encourage the production of works that others might reproduce
    more cheaply. At the same time, copyright has negative features.
    Protection can raise prices to consumers. It can impose special
    costs, such as the cost of contacting owners to obtain reproduction
    permission. And the exclusive rights it awards can sometimes stand
    in the way of others exercising their own creative powers. See
    generally Twentieth Century Music Corp. v. Aiken, 422 U. S. 151,
    156 (1975); Mazer v. Stein, 347 U. S. 201, 219 (1954)."""

Eyecite successfully discovered all three citations in the paragraph.

>>> citations = eyecite.get_citations(text_from_opinion)
>>> len(citations)
3

Eyecite also successfully found that the first citation wasn’t a citation to a case.

>>> citations[0]
NonopinionCitation(
    token=SectionToken(
        data='§8,',
        start=254,
        end=257),
    index=93,
    span_start=None,
    span_end=None)

The only slight problem was that Eyecite captured just three characters of the non-opinion citation. If I needed to strip non-opinion citations out of the text for some reason, it would have been better if it had captured the full citation text “Art. I, §8, cl. 8”.

>>> citations[0].token.data
'§8,'
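
One possible workaround, sketched here with only the standard library: take the start and end offsets from the SectionToken and search a small window around them for a fuller citation. The regex and the helper name expand_section_span are my own assumptions, not part of Eyecite, and the pattern only covers citations shaped like “Art. I, §8, cl. 8”.

```python
import re
from typing import Tuple

# Hypothetical pattern: "Art. <roman numeral>, §<n>" with an optional ", cl. <n>".
CONSTITUTION_CITE = re.compile(
    r"Art\.\s+[IVX]+,\s*§\s*\d+(?:,\s*cl\.\s*\d+)?"
)

def expand_section_span(
    text: str, start: int, end: int, window: int = 20
) -> Tuple[int, int]:
    """Return a wider (start, end) span if a fuller citation covers the token."""
    search_from = max(0, start - window)
    search_to = min(len(text), end + window)
    for match in CONSTITUTION_CITE.finditer(text, search_from, search_to):
        # Only accept a match that contains the start of the original token.
        if match.start() <= start <= match.end():
            return match.start(), match.end()
    # No fuller citation found nearby: keep the original span.
    return start, end

sample = "Writings and Discoveries.” Art. I, §8, cl. 8. Copyright statutes"
token_start = sample.index("§8,")
span = expand_section_span(sample, token_start, token_start + 3)
print(sample[span[0]:span[1]])
```

When no fuller citation is found near the token, the helper just hands back the original offsets, so it is safe to call on every non-opinion citation.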

Eyecite identified the other two citations in the paragraph as case citations. It came up with an amazing amount of information about them, almost all of which looks correct (it only came up with “Corp.” for the plaintiff’s name).

>>> citations[1]
FullCaseCitation(
    token=CitationToken(
        data='422 U. S. 151',
        start=984,
        end=997,
        volume='422',
        reporter='U. S.',
        page='151',
        exact_editions=(),
        variation_editions=(
            Edition(
                reporter=Reporter(
                    short_name='U.S.',
                    name='United States Supreme Court Reports',
                    cite_type='federal',
                    is_scotus=True),
                short_name='U.S.',
                start=datetime.datetime(1875, 1, 1, 0, 0),
                end=None),),
        short=False,
        extra_match_groups={}),
    index=365,
    span_start=None,
    span_end=None,
    reporter='U.S.',
    page='151',
    volume='422',
    canonical_reporter='U.S.',
    plaintiff='Corp.',
    defendant='Aiken,',
    pin_cite='156',
    extra=None,
    court='scotus',
    year=1975,
    parenthetical=None,
    reporter_found='U. S.',
    exact_editions=(),
    variation_editions=(
        Edition(
            reporter=Reporter(
                short_name='U.S.',
                name='United States Supreme Court Reports',
                cite_type='federal',
                is_scotus=True),
            short_name='U.S.',
            start=datetime.datetime(1875, 1, 1, 0, 0),
            end=None),),
    all_editions=(
        Edition(
            reporter=Reporter(
                short_name='U.S.',
                name='United States Supreme Court Reports',
                cite_type='federal',
                is_scotus=True),
            short_name='U.S.',
            start=datetime.datetime(1875, 1, 1, 0, 0),
            end=None),),
    edition_guess=Edition(
        reporter=Reporter(
            short_name='U.S.',
            name='United States Supreme Court Reports',
            cite_type='federal',
            is_scotus=True),
        short_name='U.S.',
        start=datetime.datetime(1875, 1, 1, 0, 0),
        end=None)
    )

Of course, the court that issued the cited opinion, and the reporter where it was published, are identified correctly.

>>> citations[1].court
'scotus'
>>> citations[1].reporter
'U.S.'

Eyecite can’t extract the exact date of the cited case, but it can get the start and end dates for the reporter series where the case was published, and it can also get the year from the parenthetical in the citation.

>>> citations[1].year
1975
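
Those two pieces of information can be cross-checked against each other. As a small sketch, the hypothetical helper year_fits_edition below takes plain datetime values like the start and end fields in the Edition output shown earlier (not Eyecite objects) and tests whether the year from the parenthetical falls inside the reporter series’ date range. Treating end=None as “still being published” is my assumption.

```python
from datetime import datetime
from typing import Optional

def year_fits_edition(year: int, start: datetime, end: Optional[datetime]) -> bool:
    """True if `year` falls within the edition's publication range."""
    if year < start.year:
        return False
    # An end of None is treated as an open-ended, ongoing series.
    return end is None or year <= end.year

# Dates copied from the Edition output for 'U.S.' shown earlier.
us_start, us_end = datetime(1875, 1, 1), None
print(year_fits_edition(1975, us_start, us_end))
print(year_fits_edition(1850, us_start, us_end))
```

A check like this could flag a citation whose year and reporter don’t fit together, which often signals a typo or a misread reporter abbreviation.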

Cleaning up Opinion Text

It’s also worth noticing how Eyecite handles “Id.” citations. I grabbed a paragraph from the Facts section of Google v. Oracle America with an example of an “Id.” citation. But this time, because the text looks like it probably has a problem with line breaks or whitespace, I’ll also try out Eyecite’s utility function for cleaning up opinion text.

facts_section = """Google envisioned an Android platform that was free and
    open, such that software developers could use the tools
    found there free of charge. Its idea was that more and more
    developers using its Android platform would develop ever
    more Android-based applications, all of which would make
    Google’s Android-based smartphones more attractive to ultimate consumers.
    Consumers would then buy and use ever
    more of those phones. Oracle America, Inc. v. Google Inc.,
    872 F. Supp. 2d 974, 978 (ND Cal. 2012); App. 111, 464.
    That vision required attracting a sizeable number of skilled
    programmers.
    At that time, many software developers understood and
    wrote programs using the Java programming language, a
    language invented by Sun Microsystems (Oracle’s predecessor). 872 F. Supp. 2d, at 975, 977. About six million programmers had spent considerable time learning, and then
    using, the Java language. App. 228. Many of those programmers used Sun’s own popular Java SE platform to develop new programs primarily for use in desktop and laptop
    computers. Id., at 151–152, 200. That platform allowed
    developers using the Java language to write programs that
    were able to run on any desktop or laptop computer, regardless of the underlying hardware (i.e., the programs were in
    large part “interoperable”). 872 F. Supp. 2d, at 977. Indeed, one of Sun’s slogans was “‘write once, run anywhere.’”
    886 F. 3d, at 1186."""

To use the clean_text function, you pass a parameter containing the names of the cleaning functions you want to use.

>>> clean_facts_section = eyecite.clean_text(facts_section, ["all_whitespace"])

I can verify that the cleaning function removed some whitespace by comparing the length of the two text strings.

>>> len(facts_section) - len(clean_facts_section)
78
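
To make the effect concrete, here is a rough standard-library approximation of what a whitespace-collapsing cleaner does. I’m assuming that "all_whitespace" behaves roughly like replacing every run of whitespace with a single space. The function collapse_whitespace is my own stand-in, not Eyecite’s code.

```python
import re

def collapse_whitespace(text: str) -> str:
    """Replace each run of spaces, tabs, and newlines with one space."""
    return re.sub(r"\s+", " ", text)

raw = "more Android-based applications,\n    all of which would make"
cleaned = collapse_whitespace(raw)
print(cleaned)
# Number of characters removed by the cleanup.
print(len(raw) - len(cleaned))
```

Because cleaning changes the length of the text, character offsets computed on the cleaned string only line up with the cleaned string, so later steps should keep working with that same version of the text.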

Handling Short and “Id.” Citations

Running the get_citations function again, I found that it discovered all five citations.

>>> facts_section_citations = eyecite.get_citations(clean_facts_section)
>>> len(facts_section_citations)
5

Eyecite has special ShortCitation and IdCitation classes that will capture all the information available from a citation even when it’s not a full citation. Eyecite’s string representation of the ShortCitation class still looks a little wonky in the version I’m testing…

>>> print(facts_section_citations[1])
None, 872 F. Supp. 2d, at 975

…but by looking at the token attribute I can see that Eyecite found a lot of useful information.

>>> facts_section_citations[1].token
CitationToken(
    data='872 F. Supp. 2d, at 975',
    start=757,
    end=780,
    volume='872',
    reporter='F. Supp. 2d',
    page='975',
    exact_editions=(
        Edition(
            reporter=Reporter(
                short_name='F. Supp.',
                name='Federal Supplement',
                cite_type='federal',
                is_scotus=False),
            short_name='F. Supp. 2d',
            start=datetime.datetime(1988, 1, 1, 0, 0),
            end=datetime.datetime(2014, 8, 21, 0, 0)),),
    variation_editions=(),
    short=True,
    extra_match_groups={})

In the short citation 872 F. Supp. 2d, at 975, 977, the start page of the cited opinion is omitted, but Eyecite has recognized the pin cite to two different pages.

>>> facts_section_citations[1].pin_cite
'975, 977'
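
Since pin_cite comes back as a string, a little parsing is needed before the pages can be used as numbers. Here is a minimal sketch that assumes the field is a comma-separated list in which each entry may carry a prefix like “at” or be a page range. The helper pin_cite_pages is hypothetical and keeps only the first page of each entry.

```python
import re
from typing import List

def pin_cite_pages(pin_cite: str) -> List[int]:
    """Return the first page number of each comma-separated entry."""
    pages = []
    for part in pin_cite.split(","):
        match = re.search(r"\d+", part)
        if match:
            pages.append(int(match.group()))
    return pages

print(pin_cite_pages("975, 977"))
# A range such as 151-152 (written with an en dash) yields its first page.
print(pin_cite_pages("at 151\u2013152, 200"))
```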

The next citation is an “Id.” citation, which provides even less information than a ShortCitation.

>>> facts_section_citations[2]
Id., at 151

It looks like Eyecite wasn’t able to collect much from the “Id.” citation, other than the pin cite and the position of the citation in the text I provided.

>>> facts_section_citations[2].__dict__
{
    'token': IdToken(data='Id.,', start=1041, end=1045),
    'index': 310,
    'span_start': None,
    'span_end': 1052,
    'pin_cite': 'at 151'
}

It might look like we’re going to have to match that “Id.” citation to the case it references manually. But no! Eyecite has another trick up its sleeve. If we pass an ordered list of citations to Eyecite’s resolve_citations function, it’ll match up the “Id.” citation to the case cited by its antecedent citation.

>>> resolved_citations = eyecite.resolve_citations(facts_section_citations)

Basically, Eyecite uses the citations it recognizes to create Resource objects, and those Resources become keys in a lookup table that maps each Resource to all the citations referring to it. When you look up the correct Resource in resolved_citations, you get every citation that refers to that Resource, including any “Id.” citations. I think this feature is still under development, and honestly I’d like to see more documentation about how to use it efficiently. But there are definitely great gains to be made from a tool that can understand “Id.” and “Supra” citations automatically.
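
The lookup-table idea can be illustrated with a toy model. This is not Eyecite’s implementation: the Cite tuple, the resolve function, and the matching rules (short citations match an earlier full citation with the same volume and reporter, and “Id.” points at whatever was resolved immediately before) are simplifying assumptions of mine.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Optional, Tuple

class Cite(NamedTuple):
    kind: str                # "full", "short", or "id"
    volume: Optional[str]
    reporter: Optional[str]
    text: str

def resolve(cites: List[Cite]) -> Dict[str, List[Cite]]:
    """Group citations under a key for the opinion they refer to."""
    table: Dict[str, List[Cite]] = defaultdict(list)
    key_by_volume: Dict[Tuple[Optional[str], Optional[str]], str] = {}
    previous_key: Optional[str] = None
    for cite in cites:
        if cite.kind == "full":
            # A full citation introduces a new resource key.
            key: Optional[str] = cite.text
            key_by_volume[(cite.volume, cite.reporter)] = cite.text
        elif cite.kind == "short":
            # Match a short cite to an earlier full cite in the same volume.
            key = key_by_volume.get((cite.volume, cite.reporter))
        else:
            # An "Id." cite points at whatever was resolved just before it.
            key = previous_key
        if key is not None:
            table[key].append(cite)
        previous_key = key
    return dict(table)

cites = [
    Cite("full", "872", "F. Supp. 2d", "872 F. Supp. 2d 974"),
    Cite("short", "872", "F. Supp. 2d", "872 F. Supp. 2d, at 975"),
    Cite("id", None, None, "Id., at 151"),
]
for key, group in resolve(cites).items():
    print(key, "->", [c.text for c in group])
```

One caveat worth keeping in mind with any antecedent rule like this: in the passage above, the “Id.” appears to follow “App. 228”, which wasn’t detected as a citation, so matching it to the previous detected case may not always reflect what the court meant.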

Annotating Citations in Text

Eyecite’s annotate function is exciting for anybody publishing caselaw online. It can add HTML links or other markup to the text that Eyecite just searched through for citations. CourtListener’s URL structure doesn’t seem to lend itself to automatically creating links, so instead I’ll give an example of automatically creating links to Harvard’s case.law website. I’ll start by getting a list of citations again.

>>> discussion_text = eyecite.clean_text(text_from_opinion, ["all_whitespace"])
>>> discussion_citations = eyecite.get_citations(discussion_text)

Next, I need a function that can generate the URL for a court opinion on case.law based on its CaseCitation object. Unfortunately, Eyecite’s CaseCitation object doesn’t provide the same abbreviation style that case.law uses for the names of reporters, so I had to add a mockup of a conversion table using the reporter_abbreviations variable. But the CaseCitation object does supply the volume and page fields for the reporter where the case is published, and the pin_cite field seems to be easy to transform into the format case.law needs.

import re
from urllib.parse import urlunparse, ParseResult
from eyecite.models import CaseCitation

def url_from_citation(cite: CaseCitation) -> str:
    """Make a URL for linking to an opinion on case.law."""
    reporter_abbreviations = {
        'U.S.': "us",
        "F. Supp.": "f-supp"
    }
    reporter = reporter_abbreviations[cite.canonical_reporter]

    if cite.pin_cite:
        # Assumes that the first number in the pin_cite field is
        # the correct HTML fragment identifier for the URL.
        page_number = re.search(r'\d+', cite.pin_cite).group()
        fragment = f"p{page_number}"
    else:
        fragment = ""

    url_parts = ParseResult(
        scheme='https',
        netloc='cite.case.law',
        path=f'/{reporter}/{cite.volume}/{cite.page}/',
        params='',
        query='',
        fragment=fragment)

    return urlunparse(url_parts)

>>> url_from_citation(citations[2])
'https://cite.case.law/us/347/201/#p219'
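
The two entries in the mockup table suggest a simple rule: lowercase the abbreviation, drop the periods, and join the remaining words with hyphens. Here is a sketch of that rule as a function. The name reporter_slug is hypothetical, and whether case.law follows this pattern for every reporter is an assumption I haven’t verified.

```python
import re

def reporter_slug(reporter: str) -> str:
    """Lowercase, drop periods, and join the remaining words with hyphens."""
    without_periods = reporter.lower().replace(".", "")
    return re.sub(r"\s+", "-", without_periods.strip())

print(reporter_slug("U.S."))
print(reporter_slug("F. Supp."))
print(reporter_slug("F. Supp. 2d"))
```

A function like this could replace the hand-written table for reporters that follow the pattern, with the table kept as an override for any that don’t.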

Now I can write a short function to make annotations in the expected format, and then use Eyecite to insert these links in the text anywhere that Eyecite finds a case citation.

def make_annotations(
    citations: list[CaseCitation]) -> list[tuple[tuple[int, int], str, str]]:
    result = []
    for cite in citations:
        if isinstance(cite, CaseCitation):
            caselaw_url = url_from_citation(cite)
            result.append(
                (cite.span(),
                f'<a href="{caselaw_url}">',
                "</a>")
            )
    return result

>>> annotations = make_annotations(discussion_citations)
>>> annotated_text = eyecite.annotate(discussion_text, annotations)
>>> print(annotated_text)
Copyright and patents, the Constitution says, are to promote the Progress of Science and useful Arts, by securing for limited Times to Authors and Inventors the exclusive Right to their respective Writings and Discoveries. Art. I, §8, cl. 8. Copyright statutes and case law have made clear that copyright has practical objectives. It grants an author an exclusive right to produce his work (sometimes for a hundred years or more), not as a special reward, but in order to encourage the production of works that others might reproduce more cheaply. At the same time, copyright has negative features. Protection can raise prices to consumers. It can impose special costs, such as the cost of contacting owners to obtain reproduction permission. And the exclusive rights it awards can sometimes stand in the way of others exercising their own creative powers. See generally Twentieth Century Music Corp. v. Aiken, <a href="https://cite.case.law/us/422/151/#p156">422 U. S. 151</a>, 156 (1975); Mazer v. Stein, <a href="https://cite.case.law/us/347/201/#p219">347 U. S. 201</a>, 219 (1954).

We can see that the annotate function has inserted hyperlink markup around the citations near the end of the text passage. And by displaying the text as Markdown, we can verify that the generated links go to the right places on case.law.

>>> from IPython.display import display, Markdown
>>> display(Markdown(annotated_text))

Copyright and patents, the Constitution says, are to “promote the Progress of Science and useful Arts, by securing for limited Times to Authors and Inventors the exclusive Right to their respective Writings and Discoveries.” Art. I, §8, cl. 8. Copyright statutes and case law have made clear that copyright has practical objectives. It grants an author an exclusive right to produce his work (sometimes for a hundred years or more), not as a special reward, but in order to encourage the production of works that others might reproduce more cheaply. At the same time, copyright has negative features. Protection can raise prices to consumers. It can impose special costs, such as the cost of contacting owners to obtain reproduction permission. And the exclusive rights it awards can sometimes stand in the way of others exercising their own creative powers. See generally Twentieth Century Music Corp. v. Aiken, 422 U. S. 151, 156 (1975); Mazer v. Stein, 347 U. S. 201, 219 (1954).
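
For anyone curious how span-based annotation of this kind can work, here is a minimal standard-library sketch. It is not Eyecite’s implementation. The function insert_markup is my own, it takes annotations in the same ((start, end), before, after) shape used above, and it assumes the spans don’t overlap.

```python
from typing import List, Tuple

Annotation = Tuple[Tuple[int, int], str, str]

def insert_markup(text: str, annotations: List[Annotation]) -> str:
    """Wrap each (start, end) span of `text` in its before/after strings."""
    # Work from the last span to the first so earlier offsets stay valid.
    for (start, end), before, after in sorted(annotations, reverse=True):
        text = text[:start] + before + text[start:end] + after + text[end:]
    return text

sample = "Mazer v. Stein, 347 U. S. 201, 219 (1954)."
start = sample.index("347")
end = start + len("347 U. S. 201")
linked = insert_markup(sample, [((start, end), '<a href="#">', "</a>")])
print(linked)
```

Applying the insertions back to front is the key design choice: each insertion lengthens the string, so inserting front to back would shift every later offset.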

Overall, Eyecite is a powerful tool with great potential to help the legal field gain the benefits of Python’s data analysis and data science ecosystem.