
CriticAL: Critic Automation with Language Models

Michael Y. Li, Vivek Vajipey, Noah D. Goodman, Emily B. Fox

TL;DR

CriticAL (Critic Automation with Language Models) uses LLMs to generate summary statistics that capture discrepancies between model predictions and data, then applies hypothesis tests to evaluate their significance. It can be viewed as a verifier that validates models and their critiques by embedding them in a hypothesis testing framework.

Abstract

Understanding the world through models is a fundamental goal of scientific research. While large language model (LLM) based approaches show promise in automating scientific discovery, they often overlook the importance of criticizing scientific models. Criticizing models deepens scientific understanding and drives the development of more accurate models. Automating model criticism is difficult because it traditionally requires a human expert to define how to compare a model with data and to evaluate whether the discrepancies are significant; both steps rely heavily on understanding the modeling assumptions and the domain. Although LLM-based critic approaches are appealing, they introduce new challenges: LLMs might hallucinate the critiques themselves. Motivated by this, we introduce CriticAL (Critic Automation with Language Models). CriticAL uses LLMs to generate summary statistics that capture discrepancies between model predictions and data, and applies hypothesis tests to evaluate their significance. We can view CriticAL as a verifier that validates models and their critiques by embedding them in a hypothesis testing framework. In experiments, we evaluate CriticAL across key quantitative and qualitative dimensions. In settings where we synthesize discrepancies between models and datasets, CriticAL reliably generates correct critiques without hallucinating incorrect ones. We show that both human and LLM judges consistently prefer CriticAL's critiques over alternative approaches in terms of transparency and actionability. Finally, we show that CriticAL's critiques enable an LLM scientist to improve upon human-designed models on real-world datasets.
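
The core loop described in the abstract (compute a summary statistic on the observed data, compare it with the same statistic on data simulated from the model, and test whether the gap is significant) can be illustrated with a minimal sketch. Everything here is an assumption for illustration rather than the authors' implementation: the function names `empirical_p_value` and `criticize` are hypothetical, the statistics are hand-written stand-ins for what an LLM would propose, the p-value is a two-sided empirical p-value, and the Bonferroni correction is one possible way to account for testing several statistics.

```python
import numpy as np

def empirical_p_value(observed_stat, replicated_stats):
    """Two-sided empirical p-value of an observed statistic against replicates."""
    replicated_stats = np.asarray(replicated_stats, dtype=float)
    n = replicated_stats.size
    # Add-one smoothing keeps the p-value strictly positive.
    upper = (1 + np.sum(replicated_stats >= observed_stat)) / (n + 1)
    lower = (1 + np.sum(replicated_stats <= observed_stat)) / (n + 1)
    return min(1.0, 2 * min(upper, lower))

def criticize(y_obs, y_rep, statistics, alpha=0.05):
    """Test each summary statistic; flag those significant after Bonferroni.

    y_obs: observed data, shape (n,)
    y_rep: datasets simulated from the model, shape (num_replicates, n)
    statistics: dict mapping a name to a function of one dataset
    """
    threshold = alpha / len(statistics)  # Bonferroni correction (assumed)
    report = {}
    for name, stat_fn in statistics.items():
        observed = stat_fn(y_obs)
        replicated = np.array([stat_fn(y) for y in y_rep])
        p = empirical_p_value(observed, replicated)
        report[name] = {
            "observed": float(observed),
            "p_value": p,
            "significant": p < threshold,
        }
    return report

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical setup: heavy-tailed data, but the model assumes a standard normal.
    y_obs = rng.standard_t(df=2, size=500)
    y_rep = rng.normal(size=(1000, 500))
    stats = {
        "mean": np.mean,
        "max_abs": lambda y: np.max(np.abs(y)),
    }
    for name, result in criticize(y_obs, y_rep, stats).items():
        print(name, result)
```

Only statistics whose p-value falls below the corrected threshold are reported as critiques, which is what lets a test of this kind act as a filter on proposed discrepancies.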

Paper Structure

This paper contains 42 sections, 1 equation, 11 figures, 2 tables, 1 algorithm.

Figures (11)

  • Figure 1: Criticizing scientific models with CriticAL. First, an LLM generates summary statistics that capture potential discrepancies that are tailored to the model and dataset; the LLM conditions on dataset metadata and a symbolic representation of a scientific model. We use these summary statistics to perform hypothesis tests to evaluate the significance of each discrepancy.
  • Figure 2: Illustrating how CriticAL avoids hallucinated revisions. CriticAL hypothesizes discrepancies via summary statistics and makes targeted changes to the initial model, which is missing the feature floor. In contrast, the naive method hallucinates (see the LLM explanation in the figure for details) and introduces spurious features (e.g., county, soil) into the initial model. In the revised model programs, we highlight spurious features in red and correct features in green.
  • Figure 3: CriticAL attempts fewer, more targeted revisions. The critiques produced by the naive approach drive greedy model revisions that indiscriminately add both spurious (red) and correct (green) features; we indicate features used in revised models as dark-colored squares. In contrast, CriticAL leads to fewer revisions because it filters discrepancies by significance. Furthermore, those revisions generally target the correct missing feature (floor).
  • Figure 4: Statistical analysis of CriticAL's ability to discover discrepancies and avoid hallucinations. (left) True positive rate (TPR) vs. false positive rate (FPR) at different significance thresholds. (right) FPR against significance threshold. CriticAL correctly identifies more discrepancies than the pre-specified method, at the same FPR level. The FPR is calibrated with the significance threshold, showing that CriticAL systematically avoids hallucinations.
  • Figure 5: CriticAL criticisms have higher win rates versus naively generated criticisms. Critiques are rated on three qualitative criteria by LLM-based judges (GPT-4o and Claude 3.5 Sonnet). LLM-based judges are aligned with human evaluators: GPT-4o and Claude 3.5 Sonnet have 100% alignment for transparent and tailored preferences, and are 80% and 90% aligned for actionable preferences, respectively. Error bars represent 95% confidence intervals (Wilson score).
  • ...and 6 more figures
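
The Figure 5 caption reports 95% confidence intervals computed with the Wilson score method. As a small illustration of that standard formula (not the authors' evaluation code), the sketch below computes a Wilson interval for a win rate; the function name `wilson_interval` and the example counts are hypothetical.

```python
import math

def wilson_interval(wins, n, z=1.96):
    """Wilson score interval for a binomial proportion.

    wins: number of successes (e.g. pairwise comparisons won)
    n: total number of trials
    z: normal quantile; 1.96 corresponds to roughly 95% coverage
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= wins <= n:
        raise ValueError("wins must lie between 0 and n")
    p_hat = wins / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width

if __name__ == "__main__":
    # Hypothetical counts, chosen only to exercise the function.
    low, high = wilson_interval(wins=5, n=10)
    print(f"win rate 0.50, 95% CI [{low:.3f}, {high:.3f}]")
```

Unlike the simpler normal-approximation interval, the Wilson interval stays within [0, 1] and behaves sensibly when the win rate is close to 0 or 1, which matters for win rates near 100%.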