How can formalizing regulations as computer code help us check whether the regulations satisfy reasonable policy objectives? Computer science researcher (and Catala creator) Denis Merigoux has tried to answer that question with a project called MLang. As Merigoux explained more fully on his blog, this project became possible because the French government open-sourced the code it uses to calculate residents’ taxes, which was written in a domain-specific language called M. The government didn’t, however, release the compiler that would be needed to actually run any code written in that language. MLang is a compiler for M code written by Merigoux, and it has two features of special interest here: it translates the tax calculation function to Python, and it enables formal verification of properties of the tax calculation using the Z3 automatic theorem prover.

I’ve had a renewed interest in logic solvers ever since Raymond Hettinger’s PyCon presentation on the subject. And yet I’m suspicious of logic solvers too: back when I was just getting interested in legal informatics, I had the misconception that legal reasoning was “too logical” for a mainstream language like Python and that I therefore needed to learn a logic programming language. That’s not a good reason to resort to a logic programming language. But I digress.

In Merigoux’s blog post about MLang, he suggests using logic solvers to check whether the interactions of various tax provisions can ever make an increase in a taxpayer’s pre-tax income produce a net decrease in the taxpayer’s post-tax income. It’s true that if you wrote the same code in Python, you wouldn’t be able to generate a formal proof that an unwanted anomaly like this could never happen (unless, perhaps, you wrote your Python code entirely with one of Python’s logic-solving libraries, which would be unusual). But there is an alternative to using formal verification to search for such edge cases. You could instead use a property-based testing library like Hypothesis.
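
If you haven’t seen Hypothesis before, the shape of a property-based test is simple: you assert a property that should hold for all inputs, and Hypothesis generates a batch of randomized inputs trying to falsify it. Here’s a minimal, generic example of my own (not from the MLang post):

from hypothesis import given, strategies

@given(strategies.lists(strategies.integers()))
def test_sorted_output_is_ordered(xs):
    result = sorted(xs)
    # Every element must be <= its successor.
    assert all(a <= b for a, b in zip(result, result[1:]))

Run this with pytest, and Hypothesis will try empty lists, single elements, and large random lists, shrinking any failure it finds down to a minimal example.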

I looked at the Python code machine-generated by MLang to try to find the lines of code that could cause a marginal tax rate over 100%. Unfortunately, like most machine-generated code, it’s pretty obscure, and I didn’t find what I was looking for. Here’s a truncated version of the critical function:

# The following keys must be present in the input:
# 1AX: CIMR - exceptional salary income - declarant 1
# 4XD: CIMR - exceptional receipts not retained
# 1AO: Alimony received - Declarant 1
# 1BS: Pensions, retirement benefits, annuities - Declarant 2
# 4BE: Micro-foncier regime - gross receipts
# 1AJ: Salaries - Declarant 1
# 0AM: Checkbox: marital status "married"
# 0CF: Number of minor or disabled children
def extracted(input_variables):

    # First we extract the input variables from the dictionary:
    var_1ax = input_variables["1AX"]
    var_4xd = input_variables["4XD"]
    var_1ao = input_variables["1AO"]
    var_1bs = input_variables["1BS"]
    var_4be = input_variables["4BE"]
    var_1aj = input_variables["1AJ"]
    var_0am = input_variables["0AM"]
    var_0cf = input_variables["0CF"]

    # Defined in file chap-83.m:51:7-31
    micfr_831010_0 = m_round(var_4be * 0.300000)
    # Defined in file chap-3.m, from 620:12 to 623:33
    sombcosv_301257_1 = var_1aj + var_1ao
    # Defined in file res-ser1.m:293:12-60
    bool_0am_111060_0 = m_cond(
        (m_cond((var_0am > 0.000000), 1.000000, 0.000000) > 0.000000), 1.000000, 0.000000)

    # ... about 4000 lines of arithmetic omitted ...

    out = {}
    out["iinet_221190_0"] = iinet_221190_0
    out["irnet_221230_0"] = irnet_221230_0
    out["napcr_221940_0"] = napcr_221940_0
    out["txmoyimp_222120_1"] = txmoyimp_222120_1
    out["revkire_871125_0"] = revkire_871125_0
    out["nbpt_601000_0"] = nbpt_601000_0
    out["ian_301050_1"] = ian_301050_1
    out["cimr_201908_0"] = cimr_201908_0
    out["csg_101210_0"] = csg_101210_0
    out["rdsn_101210_1"] = rdsn_101210_1
    out["psol_101210_2"] = psol_101210_2
    return out
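
To give a sense of how the function is meant to be invoked, here’s a hypothetical call. The values are invented, the keys are the ones listed in the comment above, and the output names are whatever the generated code exports, so my reading of them is a guess:

input_variables = {
    "1AX": 0.0,      # exceptional salary income (CIMR)
    "4XD": 0.0,      # exceptional receipts not retained
    "1AO": 0.0,      # alimony received, declarant 1
    "1BS": 0.0,      # pensions, declarant 2
    "4BE": 0.0,      # micro-foncier gross receipts
    "1AJ": 30000.0,  # salary, declarant 1
    "0AM": 1.0,      # married
    "0CF": 2.0,      # two minor children
}
result = extracted(input_variables)
print(result["irnet_221230_0"])  # presumably the net income tax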

So, not too easy to follow. It even made me wonder: if it’s just a long series of math operations and readability isn’t the goal, why compile this to Python rather than some lower-level language like C? One reason might be that, as Python, it can be called from other Python tools, for instance to calculate the output of a Docassemble interview. And it would probably miss the point to criticize the style and maintainability of this Python code, because all the editing and maintenance is supposed to happen in the M source, and the Python version exists mostly to make the code executable.

Since I couldn’t figure out how to run Hypothesis against the actual code exported from MLang, I made a toy example instead, and tried out Hypothesis on that. In this example, an education tax credit gradually decreases as the taxpayer’s salary increases from 20000 to 22000. However, the decrease in the credit is multiplied by the number of children the taxpayer has, so the total credit can shrink faster than the salary grows: in the phase-out band, each extra euro of salary adds 0.5 + 0.2 * num_children euros of tax, which exceeds one euro once the taxpayer has three or more children. Here’s the tax calculation code (tax.py), followed by the Hypothesis test function (test_hypo.py).

# tax.py
def calculate_education_tax_credit(salary: int) -> int:
    # The per-child credit phases out linearly from 800 to 400
    # as the salary climbs from 20000 to 22000.
    if salary < 20000:
        credit_per_child = 800
    elif 20000 <= salary < 22000:
        credit_per_child = 800 - (salary - 20000) * 0.2
    else:
        credit_per_child = 400
    return int(credit_per_child)


def calculate_tax(salary: int, num_children: int) -> int:
    # A flat 50% tax, reduced by the education credit for each child.
    credit_per_child = calculate_education_tax_credit(salary)
    total_tax = salary * 0.5 - num_children * credit_per_child
    return int(total_tax)

# test_hypo.py
from hypothesis import given, strategies, settings

from tax import calculate_tax


@given(
    salary=strategies.integers(),
    salary_raise=strategies.integers(min_value=1),
    num_children=strategies.integers(),
)
@settings(max_examples=100)  # 100 is actually the Hypothesis default
def test_raise_in_salary_never_causes_a_loss(salary, salary_raise, num_children):
    """
    Test that a raise in salary never triggers a tax increase greater than the raise.

    Running this test should show a failure, because a salary raise can cause a tax
    increase of more than 100% of the amount of the raise.
    """
    income_tax_without_raise = calculate_tax(salary=salary, num_children=num_children)
    income_tax_after_raise = calculate_tax(
        salary=salary + salary_raise, num_children=num_children
    )
    increased_tax = income_tax_after_raise - income_tax_without_raise

    assert increased_tax <= salary_raise

Hypothesis came up with a falsifying example in 0.14 seconds:

---------------------------------------- Hypothesis -----------------------------------------
Falsifying example: test_raise_in_salary_never_causes_a_loss(
salary=75, salary_raise=21647, num_children=75,
)
================================== short test summary info ==================================
FAILED test_hypo.py::test_raise_in_salary_never_causes_a_loss - assert 21653 <= 21647
===================================== 1 failed in 0.14s =====================================

I wasn’t quite satisfied with this example, because most people don’t have 75 children or an annual salary of 75 euros. So I limited the possible range of salaries to 10000-200000, and I limited the number of children the taxpayer can have to 0-10. But this caused the test to “pass”, that is, to fail to find a counterexample. By narrowing the allowed input values to more realistic ranges, I had apparently decreased the share of generated examples that would conflict with my test assertion. So I used the @settings decorator on the function to increase the number of generated examples from the default of 100 to 10000. After that, the test started failing again, this time with a scenario that fit within my parameters and seemed reasonably plausible.
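
For reference, the narrowed version of the test looked something like this (the bounds are the ones just described; the docstring is omitted):

from hypothesis import given, strategies, settings

from tax import calculate_tax


@given(
    salary=strategies.integers(min_value=10000, max_value=200000),
    salary_raise=strategies.integers(min_value=1),
    num_children=strategies.integers(min_value=0, max_value=10),
)
@settings(max_examples=10000)
def test_raise_in_salary_never_causes_a_loss(salary, salary_raise, num_children):
    income_tax_without_raise = calculate_tax(salary=salary, num_children=num_children)
    income_tax_after_raise = calculate_tax(
        salary=salary + salary_raise, num_children=num_children
    )
    assert income_tax_after_raise - income_tax_without_raise <= salary_raise

Here’s the new falsifying example: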

---------------------------------------- Hypothesis -----------------------------------------
Falsifying example: test_raise_in_salary_never_causes_a_loss(
salary=19621, salary_raise=1910, num_children=9,
)
================================== short test summary info ==================================
FAILED test_hypo.py::test_raise_in_salary_never_causes_a_loss - assert 1911 <= 1910
===================================== 1 failed in 0.54s =====================================

Undoubtedly, a disadvantage of property-based testing is that it’s never guaranteed to find every counterexample, no matter how many tests it generates. However, even when I increased the number of trials to 100000, the test ran in 0.54 seconds on a desktop. The MLang blog post mentioned that when using formal verification techniques it became “necessary to reduce the mathematical complexity of the function to its essential bits to limit the solver’s resource consumption”, which makes me think property-based testing has the edge over formal verification when it comes to performance. If I were testing for anomalous effects of tax legislation, I’d probably go as far as I could with Hypothesis before resorting to a theorem prover.
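
For comparison, here’s a minimal sketch of what handing the same property on my toy tax code to Z3 might look like, using the z3-solver Python bindings. To be clear, this is my own encoding, not MLang’s, and it ignores the int() truncation in the Python version (num_children is also modeled as a real number for simplicity):

from z3 import If, Real, Solver, sat  # pip install z3-solver

salary = Real("salary")
salary_raise = Real("salary_raise")
num_children = Real("num_children")


def tax(s, children):
    # Mirrors calculate_tax and calculate_education_tax_credit,
    # minus the int() truncation.
    credit = If(s < 20000, 800, If(s < 22000, 800 - (s - 20000) * 0.2, 400))
    return s * 0.5 - children * credit


solver = Solver()
solver.add(10000 <= salary, salary <= 200000)
solver.add(salary_raise >= 1)
solver.add(0 <= num_children, num_children <= 10)
# Ask the solver for a counterexample: a raise that increases
# the tax bill by more than the raise itself.
solver.add(tax(salary + salary_raise, num_children) - tax(salary, num_children) > salary_raise)

if solver.check() == sat:
    print(solver.model())  # a falsifying assignment
else:
    print("Proved: no raise can increase the tax by more than the raise.")

Unlike Hypothesis, an unsatisfiable answer here would be a proof over the whole input space, not just over the examples the tool happened to try.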