AI-Assisted Moot Courts: Simulating Justice-Specific Questioning in Oral Arguments

Kylie Zhang, Nimra Nadeem, Lucia Zheng, Dominik Stammbach, Peter Henderson

TL;DR

This paper introduces a two-layer evaluation framework that assesses both the realism and the pedagogical usefulness of simulated oral argument questions using complementary proxy metrics. Under this framework, simulated questions are often perceived as realistic by human annotators and achieve high recall of ground-truth substantive legal issues.

Abstract

In oral arguments, judges probe attorneys with questions about the factual record, legal claims, and the strength of their arguments. To prepare for this questioning, both law schools and practicing attorneys rely on moot courts: practice simulations of appellate hearings. Leveraging a dataset of U.S. Supreme Court oral argument transcripts, we examine whether AI models can effectively simulate justice-specific questioning for moot court-style training. Evaluating oral argument simulation is challenging because there is no single correct question for any given turn. Instead, effective questioning should reflect a combination of desirable qualities, such as anticipating substantive legal issues, detecting logical weaknesses, and maintaining an appropriately adversarial tone. We introduce a two-layer evaluation framework that assesses both the realism and pedagogical usefulness of simulated questions using complementary proxy metrics. We construct and evaluate both prompt-based and agentic oral argument simulators. We find that simulated questions are often perceived as realistic by human annotators and achieve high recall of ground truth substantive legal issues. However, models still face substantial shortcomings, including low diversity in question types and sycophancy. Importantly, these shortcomings would remain undetected under naive evaluation approaches.

Paper Structure (66 sections, 3 equations, 23 figures, 6 tables)

Figures (23)

  • Figure 1: Oral Argument Simulation Pipeline. A single task sample takes a) the facts of the case, b) the legal question, c) the context of the last $n-1$ turns in the oral argument, and d) the name of the justice $j$ who speaks next in the conversation. The simulator predicts that justice's $n^{th}$ turn. We implement two types of moot-court simulators: (1) prompt-based, where we apply 3 prompt variants to both open and closed base models, and (2) agentic, where we give larger reasoning models (Gemini and GPT variants) access to tools, including closed search over case docket files and historical voting trends of justices. Finally, we evaluate each oral argument simulator using a two-layer evaluation framework, which assesses quality based on realism of simulation and pedagogical usefulness.
  • Figure 2: Overview of our Evaluation Framework. We evaluate oral argument simulators using two complementary layers. Realism assesses baseline plausibility through a) adversarial tests to check whether simulated justices respond appropriately to overtly provocative advocate behaviors, and b) human preference judgments. Pedagogical usefulness evaluates whether the simulator exhibits properties important for its use in moot court-style training settings, including a) coverage of substantive legal issues, b) diversity of question types, c) detection of logical fallacies, and d) an appropriately adversarial tone of questioning.
  • Figure 3: The models we test push back against egregious decorum violations less than 40% of the time on the set of direct decorum violations and less than 10% of the time on the politicized rage bait and complete concession sets.
  • Figure 4: How well different models address issues using our broad topical comparison, aggregated across all 30 sampled sections.
  • Figure 5: Distribution of generated turns across the LEGALBENCH, METACOG and STETSON categories for both prompt-based and agentic simulators, with the ground truth distribution highlighted in green. For all three classification schemes, models are less diverse than the ground truth turns (as indicated by the flatter ground truth distribution).
  • ...and 18 more figures
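
The task sample in Figure 1 and the "flatter distribution" reading of Figure 5 can be made concrete with a minimal sketch. Everything named here is an assumption for illustration: the `TaskSample` field names, the prompt wording in `build_prompt`, and the use of normalized Shannon entropy as a flatness score are not taken from the paper, which may define its inputs and diversity measure differently.

```python
import math
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass
class TaskSample:
    """One simulation input, following Figure 1 (field names are hypothetical)."""
    case_facts: str
    legal_question: str
    prior_turns: List[str]  # the last n-1 turns of the oral argument
    next_justice: str       # name of the justice j who speaks next


def build_prompt(sample: TaskSample) -> str:
    """Assemble a plain-text prompt; the wording is illustrative, not the paper's."""
    history = "\n".join(sample.prior_turns)
    return (
        f"Facts of the case:\n{sample.case_facts}\n\n"
        f"Legal question:\n{sample.legal_question}\n\n"
        f"Transcript so far:\n{history}\n\n"
        f"Write the next turn spoken by {sample.next_justice}."
    )


def normalized_entropy(labels: List[str], num_categories: int) -> float:
    """Shannon entropy of the label frequencies, divided by log(num_categories).

    A value near 1 means a flat (diverse) distribution over question types;
    a value near 0 means almost every turn falls in a single category.
    """
    if num_categories < 2 or not labels:
        return 0.0
    counts = Counter(labels)
    total = len(labels)
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(num_categories)
```

With this scoring, a simulator whose turns are spread evenly over the categories of a scheme scores close to 1, while one that keeps asking the same type of question scores close to 0, which is one way to quantify the gap between generated and ground truth turns that Figure 5 shows visually.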