Sample, Align, Synthesize: Graph-Based Response Synthesis with ConGrs

Sayan Ghosh, Shahzaib Saqib Warraich, Dhruv Tarsadiya, Gregory Yauney, Swabha Swayamdipta

TL;DR

This work introduces Consensus Graphs (ConGrs), a DAG-based representation that captures shared and divergent information across multiple LM responses to a single prompt. ConGrs are built by lexical alignment (Needleman–Wunsch) to identify anchor spans (consensus nodes) and a secondary LM to group semantically equivalent divergences (disagreement nodes). The authors present two decoding strategies—consensus decoding (aggregation) and guided self-verification (intervention)—and demonstrate improvements in factuality for long-form generation, abstention control in refusals, and reasoning performance on MATH/AIME tasks, often with substantial cost savings. By leveraging response variation as an epistemic signal, ConGrs offer a flexible, metadata-free approach to synthesize more reliable LM outputs across diverse tasks.

Abstract

Language models can be sampled multiple times to access the distribution underlying their responses, but existing methods cannot efficiently synthesize rich epistemic signals across different long-form responses. We introduce Consensus Graphs (ConGrs), a flexible DAG-based data structure that represents both shared information and semantic variation in a set of sampled LM responses to the same prompt. We construct ConGrs using a lightweight lexical sequence alignment algorithm from bioinformatics, supplemented by the targeted use of a secondary LM judge. Further, we design task-dependent decoding methods to synthesize a single, final response from our ConGr data structure. Our experiments show that synthesizing responses from ConGrs improves factual precision on two biography generation tasks by up to 31% over an average response and reduces reliance on LM judges by more than 80% compared to other methods. We also use ConGrs for three refusal-based tasks requiring abstention on unanswerable queries and find that the abstention rate increases by up to 56%. We apply our approach to the MATH and AIME reasoning tasks and find an improvement over self-verification and majority vote baselines of up to 6 points of accuracy. We show that ConGrs provide a flexible method for capturing variation in LM responses and using the epistemic signals provided by response variation to synthesize more effective responses.
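To make the lexical alignment step concrete, here is a minimal sketch of pairwise Needleman–Wunsch global alignment over word tokens. It is an illustration under stated assumptions, not the authors' implementation: the function names (`needleman_wunsch`, `anchor_tokens`), the scoring values, and the whitespace tokenization are all hypothetical, and it aligns only two responses, whereas a ConGr is built from a whole set of sampled responses.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align token lists a and b.

    Returns a list of (token_a, token_b) pairs; None marks a gap.
    Scoring values are illustrative assumptions, not taken from the paper.
    """
    n, m = len(a), len(b)

    def sub(i, j):
        # Substitution score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align the two tokens
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            pairs.append((a[i - 1], b[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            pairs.append((a[i - 1], None))
            i -= 1
        else:
            pairs.append((None, b[j - 1]))
            j -= 1
    pairs.reverse()
    return pairs


def anchor_tokens(pairs):
    """Tokens aligned identically in both responses (candidate consensus text)."""
    return [x for x, y in pairs if x == y]


if __name__ == "__main__":
    r1 = "Marie Curie was born in Warsaw".split()
    r2 = "Marie Curie was born in 1867 in Warsaw".split()
    for pair in needleman_wunsch(r1, r2):
        print(pair)
```

Tokens that align identically are candidates for consensus text, while gaps and mismatches mark where the responses diverge. Extending this to many responses requires aligning each new response against a graph rather than a single sequence, which this pairwise sketch does not attempt.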

Paper Structure

This paper contains 45 sections, 7 figures, 16 tables, 5 algorithms.

Figures (7)

  • Figure 1: Consensus Graphs (ConGrs) capture the variation in a set of sampled LM responses. A ConGr is a weighted DAG of: consensus nodes for text spans present in all responses and disagreement nodes for lexical differences between responses. A node's weighted degree represents the fraction of responses which contain the information in that node. In the above example task of generating factual biographies from 3 sampled responses, disagreement nodes with lower weighted degree might indicate possible hallucinations. For this reason, none of the information in the disagreement nodes is included in the final synthesized response (further details in the section on consensus decoding).
  • Figure 2: Responses for a biography generation task from an aligned model (Qwen2.5-72B-Instruct) show more lexical overlap across ordered segments of responses than a shuffled baseline. Differences in lexical similarity (measured with Jaccard similarity over word sets) are significant as measured using a paired $t$-test for all quantiles.
  • Figure 3: From a set of responses, we construct a ConGr by 1) using Needleman-Wunsch alignment (Lee et al., 2002) to construct a lexical DAG where each node's text is a single token, 2) merging consecutive sequences of nodes that are present in all responses to create consensus nodes, 3) extracting paths between consecutive pairs of consensus nodes and using an LM to create semantic equivalence classes, and 4) creating a disagreement node for each semantic equivalence class.
  • Figure 4: Consensus decoding (left) uses a ConGr to combine text present in many responses. Guided self-verification (right) uses a ConGr to localize possible errors in reasoning chains.
  • Figure 5: Consensus decoding with ConGrs achieves the best trade-off between FActScore (the fraction of a response's claims that are true) and the number of true claims provided. Up and to the right is better. $\tau$ is the selection threshold for consensus decoding. $\Theta$ is the analogous parameter for the ASC baseline. Top row: Biography factuality. Bottom row: PopQA.
  • ...and 2 more figures
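
The captions for Figures 1 and 5 describe node weights as the fraction of responses containing a node's information, and a selection threshold $\tau$ for consensus decoding. The sketch below illustrates that idea only. The `Node` class, the `support` field, and `threshold_decode` are hypothetical names introduced here, the nodes are assumed to be given in topological order, and joining the kept text with spaces is a simplification; the paper's actual decoding procedure may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Hypothetical ConGr node: a text span plus the responses that contain it."""
    text: str
    support: frozenset  # indices of the sampled responses containing this span


def node_weight(node, num_responses):
    """Fraction of sampled responses that contain this node's information."""
    return len(node.support) / num_responses


def threshold_decode(nodes, num_responses, tau):
    """Keep nodes whose weight is at least tau, in the given (topological) order.

    Simplifying assumption: kept spans are joined with single spaces.
    """
    kept = [n.text for n in nodes if node_weight(n, num_responses) >= tau]
    return " ".join(kept)


if __name__ == "__main__":
    # Toy graph for three sampled responses (illustrative data, not from the paper).
    nodes = [
        Node("Marie Curie was a physicist", frozenset({0, 1, 2})),  # consensus
        Node("born in Warsaw", frozenset({0, 1})),                  # disagreement
        Node("born in Paris", frozenset({2})),                      # disagreement
        Node("who won two Nobel Prizes.", frozenset({0, 1, 2})),    # consensus
    ]
    print(threshold_decode(nodes, 3, tau=1.0))
    print(threshold_decode(nodes, 3, tau=0.5))
```

With `tau=1.0` only spans present in every response survive, which matches the behavior described for Figure 1; lowering `tau` admits majority-supported disagreement nodes, trading precision for a larger number of claims.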