Great Models Think Alike and this Undermines AI Oversight

Shashwat Goel, Joschka Struber, Ilze Amanda Auzina, Karuna K Chandra, Ponnurangam Kumaraguru, Douwe Kiela, Ameya Prabhu, Matthias Bethge, Jonas Geiping

TL;DR

As LM capabilities grow, evaluating and supervising them at scale becomes harder, prompting the use of other LMs for oversight. The authors introduce CAPA (Chance Adjusted Probabilistic Agreement), a metric that accounts for model accuracy and uses output probabilities to quantify functional similarity between LMs. Through studies of LLM-as-a-judge and training on LM annotations, they show that judges exhibit affinity bias toward similar models and that gains from weak-to-strong generalization depend on complementarity rather than similarity. They also reveal a troubling trend: as capabilities increase, models' mistakes become more correlated, signaling risks from correlated failures in AI oversight. The work emphasizes reporting model similarity and lays groundwork for more robust, diversity-aware oversight in growing AI ecosystems.

Abstract

As Language Model (LM) capabilities advance, evaluating and supervising them at scale is getting harder for humans. There is hope that other language models can automate both these tasks, which we refer to as "AI Oversight". We study how model similarity affects both aspects of AI oversight by proposing Chance Adjusted Probabilistic Agreement (CAPA): a metric for LM similarity based on overlap in model mistakes. Using CAPA, we first show that LLM-as-a-judge scores favor models similar to the judge, generalizing recent self-preference results. Then, we study training on LM annotations, and find complementary knowledge between the weak supervisor and strong student model plays a crucial role in gains from "weak-to-strong generalization". As model capabilities increase, it becomes harder to find their mistakes, and we might defer more to AI oversight. However, we observe a concerning trend: model mistakes are becoming more similar with increasing capabilities, pointing to risks from correlated failures. Our work underscores the importance of reporting and correcting for model similarity, especially in the emerging paradigm of AI oversight.

Paper Structure

This paper contains 70 sections, 23 equations, 23 figures, 14 tables.

Figures (23)

  • Figure 1: Our Main Contributions. We develop a novel probabilistic metric for model similarity, CAPA ($\kappa_{p}$), which adjusts for chance agreement due to accuracy. Using this, we find that (1) LLM-as-a-judge scores are biased towards more similar models, controlling for the model's capability; (2) the gain from training strong models on annotations of weak supervisors (weak-to-strong generalization) is higher when the two models are more different; and (3) concerningly, model errors are getting more correlated as capabilities increase.
  • Figure 2: Metric comparison for independent models with uncorrelated predictions. In this simulation, for each model we select an independent random subset of samples as correct, with the first having a fixed $90\%$ accuracy, while for the second accuracy is varied from $50\%$ to $90\%$. CAPA correctly reports 0 similarity when models have uncorrelated errors.
  • Figure 3: Judgment Score Relation with Model Similarity, for across-family pairs only. Each line is a regression fit between judgment and similarity scores. The circle marker indicates that only across-family judge-model pairs are plotted. For each fit we report the corresponding Pearson correlation, $r$. We find a significant positive correlation between judgment scores and CAPA across all judges; $**$ indicates $p < 0.01$.
  • Figure 4: Similarity vs Gain from Weak-to-Strong Training. Across 12 model pairs, the strong student gains more from weak-to-strong training on tasks where it is more different from the weak supervisor ($p < 0.01$).
  • Figure 5: Role of Complementary Knowledge and Elicitation in Weak-to-Strong Generalization. We decompose the accuracy of the weak-to-strong trained model on four parts of the test data distribution, based on the correctness of the weak supervisor and an oracle strong elicited model which uses ground-truth annotations. Sub-rectangles represent weak, strong model pairs. Results are averaged across 15 tasks. Complementary knowledge transfer explains weak-to-strong model accuracy beyond elicitation.
  • ...and 18 more figures
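
The simulation described in the Figure 2 caption can be roughly sketched in code. This is a minimal illustration under stated assumptions, not the paper's CAPA definition: it uses a simplified, Cohen's-kappa-style score on binary correct/incorrect outcomes, whereas CAPA as described above also uses output probabilities. The names `chance_adjusted_agreement` and `simulate`, the sample count, and the seed are hypothetical choices made for this example.

```python
import numpy as np


def chance_adjusted_agreement(correct_a, correct_b):
    """Kappa-style agreement on binary correctness vectors.

    Simplified stand-in for a chance-adjusted similarity score (an
    assumption for illustration, not the paper's CAPA formula).
    """
    correct_a = np.asarray(correct_a, dtype=bool)
    correct_b = np.asarray(correct_b, dtype=bool)
    acc_a, acc_b = correct_a.mean(), correct_b.mean()
    # Observed agreement: both correct or both wrong on the same sample.
    observed = np.mean(correct_a == correct_b)
    # Agreement expected from two independent models with these accuracies.
    expected = acc_a * acc_b + (1 - acc_a) * (1 - acc_b)
    if np.isclose(expected, 1.0):
        return 0.0  # degenerate case: both models always (in)correct
    return float((observed - expected) / (1.0 - expected))


def simulate(n_samples=20000, acc_a=0.9,
             accs_b=(0.5, 0.6, 0.7, 0.8, 0.9), seed=0):
    """Draw independent random 'correct' subsets for two models."""
    rng = np.random.default_rng(seed)
    results = {}
    for acc_b in accs_b:
        a = np.zeros(n_samples, dtype=bool)
        b = np.zeros(n_samples, dtype=bool)
        # Each model gets its own independent random subset marked correct.
        a[rng.choice(n_samples, int(round(acc_a * n_samples)), replace=False)] = True
        b[rng.choice(n_samples, int(round(acc_b * n_samples)), replace=False)] = True
        raw = float(np.mean(a == b))  # unadjusted agreement rate
        results[acc_b] = (raw, chance_adjusted_agreement(a, b))
    return results


if __name__ == "__main__":
    for acc_b, (raw, adjusted) in simulate().items():
        print(f"acc_b={acc_b:.1f}  raw={raw:.3f}  adjusted={adjusted:+.3f}")
```

In this sketch the raw agreement rate rises as the second model's accuracy approaches that of the first, even though the two correctness subsets are drawn independently, while the adjusted score stays close to zero. This is the behavior the caption attributes to a metric that corrects for chance agreement due to accuracy.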