Lab · Accountable AI
30 MIN · BREAK
Hands-on
Auditing fairness for an AI court-recommendation system.
Strategies for Accountable AI · Wharton Executive Education
You are an auditor, assigned by New York City, to evaluate a machine-learning system
used by NYC courts to recommend whether defendants should be Released on Recognizance
(ROR) — released without bail on a promise to appear.
The system was built by an outside vendor and trained on historical court data.
Judges see its recommendations and may follow or ignore them. Your job: decide
whether the system is delivering equitable outcomes across demographic groups,
and what should happen if it isn't.
30 MINUTES
02 MIN · Look at the data below.
12 MIN · Question 1. Are these the right fairness metrics?
12 MIN · Question 2. What additional data would you need?
04 MIN · Hold Question 3 for the group debrief when we return.
The data
What the court shared with you.
When you asked the court for data, they gave you exactly what's below — the
AI system's ROR_at_Arraign recommendation rates over the past
twelve months, alongside the historical rate at which judges granted ROR for
the same demographic groups (the data used to train the model).
Release-on-Recognizance rates by race · last 12 months
| Race | AI recommendation rate | Judges' historical rate |
|---|---|---|
| White | 64% | 68% |
| Asian / Pacific Islander | 62% | 65% |
| Hispanic | 53% | 51% |
| Black | 49% | 44% |
Illustrative figures patterned on real ROR-disparity findings. Largest
gap shown in red.
What to notice
The AI recommends ROR for Black defendants at 49% vs. 64% for
White defendants — a 15-point gap. The judges' historical gap was
larger (24 points). The AI narrowed the disparity, but did not
eliminate it. Bring that observation into Question 1.
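The gap arithmetic above can be reproduced with a short sketch. The rates are copied from the table; the choice of White defendants as the reference group and the added ratio view are assumptions made for illustration, not part of the court's data.

```python
# ROR rates in percent, transcribed from the table above.
rates = {
    "White": {"ai": 64, "judges": 68},
    "Asian / Pacific Islander": {"ai": 62, "judges": 65},
    "Hispanic": {"ai": 53, "judges": 51},
    "Black": {"ai": 49, "judges": 44},
}


def gap_vs_reference(rates, source, reference="White"):
    """Percentage-point gap: reference group's rate minus each group's rate."""
    ref = rates[reference][source]
    return {group: ref - row[source] for group, row in rates.items()}


def ratio_vs_reference(rates, source, reference="White"):
    """Each group's rate divided by the reference group's rate (assumed extra view)."""
    ref = rates[reference][source]
    return {group: round(row[source] / ref, 2) for group, row in rates.items()}


ai_gaps = gap_vs_reference(rates, "ai")
judge_gaps = gap_vs_reference(rates, "judges")
ai_ratios = ratio_vs_reference(rates, "ai")
judge_ratios = ratio_vs_reference(rates, "judges")

for group in rates:
    print(
        f"{group}: AI gap {ai_gaps[group]} pts (ratio {ai_ratios[group]}), "
        f"judges gap {judge_gaps[group]} pts (ratio {judge_ratios[group]})"
    )
```

A point gap and a ratio can tell different stories when base rates differ, so it may be worth looking at both before settling on an answer to Question 1.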
Pre-break
01 · Is this the right way to measure fairness?
12 MIN · INDIVIDUAL OR PAIR
Looking at the ROR_at_Arraign rates by Gender, Race, and Ethnicity in
the AI recommendations sheet:
- Are these rates a good metric for evaluating fairness in this context?
If not, what would a better metric account for?
- What contextual factors beyond the raw numbers should an auditor weigh
in a high-stakes criminal-justice setting?
- Should it matter whether the AI's recommendations are more — or less —
unequal than the judges' historical decisions would have been without the
AI? Why?
02 · What's missing to make the decision?
12 MIN · INDIVIDUAL OR PAIR
The table above isn't enough to render a defensible audit. What more
would you ask for, and from whom?
- From the courts. What do you need on the outcomes after
an ROR recommendation — appearance rates, re-arrest, charge severity?
What about the cases the AI wasn't used on?
- From the vendor. What would tell you whether the model itself
is doing something defensible — training procedure, features used,
validation across groups, recent updates?
- What if the vendor can't share what you ask? U.S. data-privacy
law is largely implemented at the state level. If the vendor is in
California (or anywhere else), that state's rules constrain what data
can cross state lines. How does that limit your audit?
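To make the outcomes request concrete, here is a minimal sketch of one check that appearance data would enable: the court-appearance rate among defendants the AI recommended for ROR, split by group. Every record, the group labels "A" and "B", and the field layout are hypothetical; none of this is in the data the court shared.

```python
from collections import defaultdict

# Hypothetical records: (group label, AI recommended ROR?, defendant appeared?)
records = [
    ("A", True, True),
    ("A", True, True),
    ("A", True, False),
    ("A", False, True),
    ("B", True, True),
    ("B", True, True),
    ("B", False, True),
    ("B", False, False),
]


def appearance_rate_among_recommended(records):
    """Share of ROR-recommended defendants in each group who appeared in court."""
    recommended = defaultdict(int)
    appeared = defaultdict(int)
    for group, was_recommended, did_appear in records:
        if was_recommended:
            recommended[group] += 1
            appeared[group] += int(did_appear)
    return {group: appeared[group] / recommended[group] for group in recommended}


result = appearance_rate_among_recommended(records)
for group, rate in sorted(result.items()):
    print(f"Group {group}: {rate:.0%} of ROR-recommended defendants appeared")
```

One limitation to keep in mind: appearance outcomes are typically observable only for defendants who were actually released, so a check like this cannot by itself say how detained defendants would have behaved.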
For the group debrief — after the break
03 · Where should accountability lie?
GROUP DISCUSSION
Assume you've judged the system unfair. Who in this chain should be
penalized?
- The vendor. They built the algorithm and chose how to train it.
- The court system. They picked the vendor, deployed the system, and
generated the historical data the model learned from.
- The judges. They use the recommendations to decide.
Make your argument. Consider:
- Who has the greatest influence over the fairness of the outcome?
- Who should bear responsibility for how the system is used — and why?
- What remedies would you recommend, and to which party?
Want to dig deeper after class? The full dataset (≈7,000 defendant records
with demographics, charges, and rulings) is available as an Excel workbook:
Fairness_Assignment_Data.xlsx.
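For anyone exploring the records, a minimal sketch of computing per-group ROR rates follows. It assumes a sheet has been exported to CSV with a `Race` column and a `ROR_at_Arraign` column coded 0/1; the actual column names and encoding in the workbook may differ, and the sample rows below are made up.

```python
import csv
import io
from collections import defaultdict

# Hypothetical rows standing in for a CSV export of one workbook sheet.
SAMPLE_CSV = """Race,ROR_at_Arraign
White,1
White,0
Black,1
Black,0
Black,0
Hispanic,1
"""


def ror_rate_by_group(rows, group_col="Race", outcome_col="ROR_at_Arraign"):
    """Return {group: (ROR rate, record count)}; assumes the outcome is coded 0/1."""
    counts = defaultdict(lambda: [0, 0])  # group -> [ROR count, total records]
    for row in rows:
        counts[row[group_col]][0] += int(row[outcome_col])
        counts[row[group_col]][1] += 1
    return {group: (ror / total, total) for group, (ror, total) in counts.items()}


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
summary = ror_rate_by_group(rows)
for group, (rate, n) in sorted(summary.items()):
    print(f"{group}: {rate:.0%} ROR across {n} records")
```

Reporting the record count next to each rate helps flag groups too small to support a firm conclusion. Passing a different `group_col` would give the same breakdown for another demographic column.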