Rich Insights from Cheap Signals: Efficient Evaluations via Tensor Factorization

Felipe Maia Polo, Aida Nematzadeh, Virginia Aglietti, Adam Fisch, Isabela Albuquerque

TL;DR

We propose a statistical model based on tensor factorization that merges cheap autorater data with a limited set of human gold-standard labels. The method is robust to autorater quality, predicts human preferences on a per-prompt basis more accurately than standard baselines, and provides tight confidence intervals for key statistical parameters of interest.

Abstract

Moving beyond evaluations that collapse performance across heterogeneous prompts toward fine-grained evaluation at the prompt level, or within relatively homogeneous subsets, is necessary to diagnose generative models' strengths and weaknesses. Such fine-grained evaluations, however, suffer from a data bottleneck: human gold-standard labels are too costly at this scale, while automated ratings are often misaligned with human judgment. To resolve this challenge, we propose a novel statistical model based on tensor factorization that merges cheap autorater data with a limited set of human gold-standard labels. Specifically, our approach uses autorater scores to pretrain latent representations of prompts and generative models, and then aligns those pretrained representations to human preferences using a small calibration set. This sample-efficient methodology is robust to autorater quality, more accurately predicts human preferences on a per-prompt basis than standard baselines, and provides tight confidence intervals for key statistical parameters of interest. We also showcase the practical utility of our method by constructing granular leaderboards based on prompt qualities and by estimating model performance solely from autorater scores, eliminating the need for additional human annotations.
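
The two-stage idea in the abstract (pretrain latent representations on autorater scores, then align them to human labels with a small calibration set) can be illustrated with a toy sketch. This is not the paper's model: it assumes a single autorater, uses a truncated SVD of a prompt-by-model score matrix as a stand-in for the tensor factorization, and uses a linear least-squares fit as a stand-in for the human-alignment step. All sizes, noise levels, and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 200 prompts, 12 generative models, latent rank 3.
n_prompts, n_models, rank = 200, 12, 3

# Synthetic low-rank "capability" matrix (assumed ground truth for the demo).
U_true = rng.normal(size=(n_prompts, rank))
V_true = rng.normal(size=(n_models, rank))
capability = U_true @ V_true.T

# Cheap signal: autorater scores for every (prompt, model) pair,
# but on a shifted and compressed scale relative to humans.
auto_scores = 0.5 * capability + 1.0 + 0.1 * rng.normal(size=capability.shape)
# Expensive signal: human scores; only a small subset will be used below.
human_scores = capability + 0.1 * rng.normal(size=capability.shape)

# Stage 1 (pretraining): factorize the centered autorater scores.
mean_auto = auto_scores.mean()
U, s, Vt = np.linalg.svd(auto_scores - mean_auto, full_matrices=False)
prompt_emb = U[:, :rank] * s[:rank]   # latent prompt representations
model_emb = Vt[:rank].T               # latent model representations
pretrained = prompt_emb @ model_emb.T + mean_auto

# Stage 2 (alignment): fit slope and intercept on ~5% of human labels.
calib = rng.random(capability.shape) < 0.05
X = np.column_stack([pretrained[calib], np.ones(calib.sum())])
coef, *_ = np.linalg.lstsq(X, human_scores[calib], rcond=None)
aligned = coef[0] * pretrained + coef[1]

# Compare errors on the pairs that had no human label during calibration.
held_out = ~calib
mse_raw = np.mean((auto_scores[held_out] - human_scores[held_out]) ** 2)
mse_aligned = np.mean((aligned[held_out] - human_scores[held_out]) ** 2)
print(f"raw autorater MSE: {mse_raw:.3f}, aligned MSE: {mse_aligned:.3f}")
```

In this toy setup the autorater is systematically miscalibrated but informative, so a handful of human labels is enough to correct its scale. The paper's method additionally handles ordinal labels, multiple raters, and uncertainty quantification, none of which this sketch attempts.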

Paper Structure

This paper contains 57 sections, 1 theorem, 19 equations, 17 figures, 3 tables, 1 algorithm.

Key Result

Theorem C.7

Under Conditions \ref{cond:samp}-\ref{cond:columns}, the cutoffs $\beta^{(k)}_{y}$, the capability tensor $\Psi$, and the factor matrices $\Theta, A, \Gamma$ are identifiable.
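
This summary does not reproduce the model's equations, so the following is only a sketch of how a statement of this kind is commonly set up, not the paper's definitions. It assumes a rank-$R$ CP (canonical polyadic) factorization of the capability tensor over prompts $i$, generative models $j$, and raters $k$, combined with an ordered-logit link through the cutoffs. The index roles assigned to $\Theta$, $A$, and $\Gamma$, the link function $\sigma$, the rank $R$, and the sign convention are all assumptions.

```latex
% Assumed CP form of the capability tensor (not taken from the paper):
\Psi_{ijk} = \sum_{r=1}^{R} \Theta_{ir} \, A_{jr} \, \Gamma_{kr}

% Assumed ordinal link: rater k's label for model j on prompt i
\Pr\big( Y^{(k)}_{ij} \le y \big) = \sigma\big( \beta^{(k)}_{y} - \Psi_{ijk} \big)
```

Under a CP form like this, a classical sufficient condition (due to Kruskal) for the factor matrices to be unique up to column permutation and scaling is $k_\Theta + k_A + k_\Gamma \ge 2R + 2$, where $k_M$ denotes the Kruskal rank of $M$. Whether Theorem C.7 relies on exactly this condition is not stated in this summary.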

Figures (17)

  • Figure 1: Test cross-entropy loss comparison between our proposed method (default and fine-tuned) and baseline approaches (Constant, Prompt-specific, and P2L) across three benchmarks: Gecko (left), BigGen Bench (center), and LMArena (right). Our methods consistently achieve lower losses for different human annotation budgets, demonstrating the benefits of prompt-specific modeling and auxiliary autorater data.
  • Figure 2: Category cohesion rankings for Gecko (left) and BigGen Bench (right). Categories are ordered from most to least cohesive based on the metric described in § \ref{sec:fine-grained}.
  • Figure 3: Model rankings with $95\%$ simultaneous confidence intervals for two Gecko categories: "Lang/Compositional" (left) and "Additive" (right), estimated using only $10\%$ of human annotations. The plots reveal performance discrepancies across different skills; for instance, Imagen matches SDXL in compositional tasks but underperforms in additive tasks.
  • Figure 4: Model rankings with $95\%$ simultaneous confidence intervals for two BigGen Bench tasks: "Multi-step" (left) and "Interplanetary Diplomacy" (right), estimated using a $10\%$ sample of human annotations.
  • Figure 5: Fine-grained comparison of model capabilities using $10\%$ of human annotations. Top: Difference in estimated scores between Imagen and Muse on Gecko prompts, colored by category. Imagen excels in text rendering, while Muse shows advantages in object counting. Bottom: Difference between LLaMa-2-13b and GPT-3.5-Turbo on BigGen Bench prompts. GPT-3.5-Turbo demonstrates a significant advantage in reasoning-related prompts.
  • ...and 12 more figures

Theorems & Definitions (2)

  • Definition C.4: Kruskal rank \cite{rhodes2010concise, kruskal1977three, kruskal1989rank}
  • Theorem C.7
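
For readers unfamiliar with the notion in Definition C.4: the Kruskal rank of a matrix is the largest $k$ such that every set of $k$ columns is linearly independent. It is at most the ordinary rank and can be strictly smaller. The brute-force sketch below illustrates the standard definition and Kruskal's classical sufficient condition for uniqueness of a three-way CP decomposition. The function names are hypothetical, the subset enumeration is only practical for small matrices, and the paper's own conditions for Theorem C.7 may differ.

```python
from itertools import combinations

import numpy as np


def kruskal_rank(M, tol=1e-10):
    """Largest k such that every set of k columns of M is linearly independent."""
    n_cols = M.shape[1]
    k = 0
    for size in range(1, n_cols + 1):
        every_subset_independent = all(
            np.linalg.matrix_rank(M[:, list(cols)], tol=tol) == size
            for cols in combinations(range(n_cols), size)
        )
        if not every_subset_independent:
            break  # some subset of this size is dependent, so stop here
        k = size
    return k


def satisfies_kruskal_condition(A, B, C):
    """Check k_A + k_B + k_C >= 2R + 2 for rank-R factor matrices (R columns each)."""
    R = A.shape[1]
    return kruskal_rank(A) + kruskal_rank(B) + kruskal_rank(C) >= 2 * R + 2


# Two identical columns: ordinary rank is 2, but Kruskal rank is 1.
dup = np.array([[1.0, 1.0, 0.0],
                [0.0, 0.0, 1.0]])
print(np.linalg.matrix_rank(dup), kruskal_rank(dup))

eye = np.eye(3)
print(satisfies_kruskal_condition(eye, eye, eye))
```

The duplicate-column example shows why the Kruskal rank is the relevant quantity for identifiability: two components with identical loadings in one mode cannot be told apart from that mode alone, even though the matrix as a whole has higher ordinary rank.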