Representation Consistency for Accurate and Coherent LLM Answer Aggregation

Junqi Jiang, Tom Bewley, Salim I. Amoukou, Francesco Leofante, Antonio Rago, Saumitra Mishra, Francesca Toni

TL;DR

Representation Consistency (RC) is a test-time scaling approach that uses cached internal activations to improve LLM answer aggregation without retraining the model or issuing additional model queries. RC weights each candidate answer by a combination of how often it occurs and how consistent the model's activations are across the responses that lead to it. Its two variants, RC-D (dense, raw activations) and RC-S (sparse, encoded via pretrained sparse autoencoders, SAEs), consistently improve accuracy (up to ~4%) over self-consistency across four open-source LLMs and four reasoning datasets. Consistency in the sparse signals used by RC-S also aligns well with the common notion of coherent reasoning, suggesting that sparse latent activations capture meaningful reasoning structure. The work demonstrates practical gains in open-source settings and opens avenues for combining activation-based signals with other test-time scaling methods and with interpretability research.

Abstract

Test-time scaling improves large language models' (LLMs) performance by allocating more compute budget during inference. To achieve this, existing methods often require intricate modifications to prompting and sampling strategies. In this work, we introduce representation consistency (RC), a test-time scaling method for aggregating answers drawn from multiple candidate responses of an LLM regardless of how they were generated, including variations in prompt phrasing and sampling strategy. RC enhances answer aggregation by not only considering the number of occurrences of each answer in the candidate response set, but also the consistency of the model's internal activations while generating the set of responses leading to each answer. These activations can be either dense (raw model activations) or sparse (encoded via pretrained sparse autoencoders). Our rationale is that if the model's representations of multiple responses converging on the same answer are highly variable, this answer is more likely to be the result of incoherent reasoning and should be down-weighted during aggregation. Importantly, our method only uses cached activations and lightweight similarity computations and requires no additional model queries. Through experiments with four open-source LLMs and four reasoning datasets, we validate the effectiveness of RC for improving task performance during inference, with consistent accuracy improvements (up to 4%) over strong test-time scaling baselines. We also show that consistency in the sparse activation signals aligns well with the common notion of coherent reasoning.
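
To make the aggregation idea concrete, the following is a minimal Python sketch and not the authors' implementation. It assumes that each response is summarised by a single cached activation vector, that consistency within an answer group is the mean pairwise cosine similarity of those vectors, and that the final score is a convex combination of consistency and relative frequency controlled by a hypothetical weight `lam`. The names `rc_aggregate`, `mean_pairwise_cosine`, `lam`, and `singleton_consistency` are illustrative and do not come from the paper; the exact scoring rule used by RC may differ.

```python
from collections import defaultdict

import numpy as np


def mean_pairwise_cosine(acts, singleton_consistency=0.0):
    """Mean cosine similarity over all distinct pairs of rows in `acts`.

    Assumption: a group with a single response has no pairs, so it gets
    `singleton_consistency` (0.0 by default, i.e. no evidence of consistency).
    """
    n = acts.shape[0]
    if n < 2:
        return singleton_consistency
    norms = np.linalg.norm(acts, axis=1, keepdims=True)
    unit = acts / np.clip(norms, 1e-12, None)  # guard against zero vectors
    sims = unit @ unit.T
    upper = np.triu_indices(n, k=1)  # indices of distinct pairs only
    return float(sims[upper].mean())


def rc_aggregate(answers, activations, lam=0.5):
    """Pick an answer by combining frequency with activation consistency.

    answers:     list of final answers, one per candidate response
    activations: array of shape (num_responses, dim), one cached vector
                 per response (dense, or sparse SAE codes)
    lam:         hypothetical trade-off weight; lam=0 is plain majority voting
    """
    activations = np.asarray(activations, dtype=float)
    groups = defaultdict(list)
    for i, answer in enumerate(answers):
        groups[answer].append(i)

    scores = {}
    for answer, idx in groups.items():
        frequency = len(idx) / len(answers)
        consistency = mean_pairwise_cosine(activations[idx])
        scores[answer] = lam * consistency + (1.0 - lam) * frequency

    best = max(scores, key=scores.get)
    return best, scores


# Toy example: the majority answer has scattered activations,
# the minority answer has identical ones.
answers = ["Sydney", "Sydney", "Sydney", "Canberra", "Canberra"]
acts = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [1.0, 1.0], [1.0, 1.0]])
best, scores = rc_aggregate(answers, acts, lam=0.5)
print(best, scores)
```

With these toy vectors, the sketch selects "Canberra" at `lam=0.5` even though "Sydney" is more frequent, and falls back to the majority answer at `lam=0`. Under the same assumptions, a sparse variant would pass SAE-encoded activations as `activations` instead of the raw ones.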

Paper Structure

This paper contains 24 sections, 12 equations, 10 figures, 6 tables.

Figures (10)

  • Figure 1: An illustrative example of representation consistency. When aggregating an answer from multiple LLM responses (in the blue boxes on the middle left) sampled from semantically equivalent rephrasings of the same question (in the pink box on the left), we take into account the consistency of model internal activations within each response group (points in the blue circle). In this case, the answer "Canberra" is chosen (on the right) because the activations of its corresponding responses are more similar (the area covered by the orange points is smaller than that of the violet points).
  • Figure 2: Accuracy results (%) summarised for each configuration of number of responses, dataset, and model. We report the absolute results for the main baseline, SC, to the left of each subfigure, with a dashed line for easy comparison. For the remaining methods, we report results relative to SC; the performance differences are shown on top of each bar.
  • Figure 3: All results for Llama3.1-8B-IT
  • Figure 4: All results for Gemma2-2B-IT
  • Figure 5: All results for Gemma2-9B-IT
  • ...and 5 more figures