CaLM: Contrasting Large and Small Language Models to Verify Grounded Generation

I-Hung Hsu; Zifeng Wang; Long T. Le; Lesly Miculicich; Nanyun Peng; Chen-Yu Lee; Tomas Pfister

TL;DR

CaLM is a novel verification framework that uses smaller LMs, which rely less on parametric memory and excel at processing relevant information given a query, to validate the output of larger LMs.

Abstract

Grounded generation aims to equip language models (LMs) with the ability to produce more credible and accountable responses by accurately citing verifiable sources. However, existing methods, whether they feed LMs raw or preprocessed materials, remain prone to errors. To address this, we introduce CaLM, a novel verification framework. CaLM leverages the insight that a robust grounded response should be consistent with information derived solely from its cited sources. Our framework empowers smaller LMs, which rely less on parametric memory and excel at processing relevant information given a query, to validate the output of larger LMs. A larger LM's response is verified when it closely aligns with the smaller LM's output, which relies exclusively on the cited documents. Responses showing discrepancies are iteratively refined through a feedback loop. Experiments on three open-domain question-answering datasets demonstrate significant average performance gains of 1.5% to 7% absolute, without requiring any model fine-tuning.

Paper Structure (24 sections, 12 figures, 6 tables)

Introduction
Problem Statement
Automated Verification for Grounded Generation
Key Factors for Automated Verification
Contrasting Large and Small LMs for Automated Verification
Analyzing Model Size Impact on LMs' Sensitivity to Input Document Relevance
CaLM Framework
Experimental Setup
The ASQA dataset
The QAMPARI dataset
The ELI5 dataset
Compared Methods
Experimental Results
Main Results
Analysis
...and 9 more sections

Figures (12)

Figure 1: Comparison of different categories of existing inference methods for grounded generation. (a) A single-run LLM can easily hallucinate because of the high complexity of the task. (b) Preprocessing methods reduce task complexity, but hallucinations can propagate from the preprocessing steps. (c) We propose using verification and rectification to ensure LLMs generate outputs with complete citations and accurate answers, maintaining quality.
Figure 2: A study of LM performance as a function of the input document's relevance score, using the ASQA dataset. We show that, within the same LM family, smaller LMs demonstrate higher sensitivity to the relevance of the input document when anchored to the largest model in the family. (a) An illustration of the function. The function is monotonically increasing, since accuracy always increases as the input document's relevance score increases. Hence, studying the second-order relative improvement reveals the incremental performance gain of the LM as the input document's relevance keeps increasing. (b) The results for relative improvement. (c) The results of the second-order relative improvement analysis. From (b) and (c), we observe that smaller models tend to exhibit greater relative improvements and achieve larger incremental performance gains than their larger counterparts.
Figure 3: Overview of CaLM. Top: The flow diagram of our method. Bottom: A detailed depiction of each step's operation. The algorithm starts with a retriever extracting a relevant document pool $p$ for the input query (Step (1)). Then, the main language model (LM) takes the first batch of documents and employs retrieval-augmented generation to produce an answer candidate, which cites relevant supporting documents (Step (2)). Subsequently, this candidate is validated by contrasting it with the verifier output from the verifier LM (Steps (3) & (4)). Our verifier LM evaluates citation quality by accessing only the documents cited in the main LM's response, rather than the same input documents. If the response is sufficiently consistent, we accept the answer candidate directly. If it is inconsistent, we break the answer candidate down into individual statements and retain only those corroborated by similar arguments in the verifier output, for further correction in the next iteration (Step (5)).
Figure 4: Iterative performance improvements on the QAMPARI dataset. We use GPT-3.5-Turbo as the main LM for this study.
Figure 5: A case study of CaLM on the ASQA dataset. The question is "Who sings don't tell me what to do?" and the reference short answers are "Pam Tillis", "Marty Stuart", and "Baby Animals".
...and 7 more figures
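
The verification loop described in the Figure 3 caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `calm_sketch`, `cited_ids`, `similarity`, and `split_statements` are hypothetical, citations are assumed to be bracketed 1-based indices such as `[2]`, consistency is approximated with token-level Jaccard overlap (the caption does not say which measure is used), and the batch size, threshold, iteration limit, and the fallback of returning the last candidate are all placeholders. The two LMs are passed in as plain callables, and the retrieval step (Step (1)) is assumed to have already produced `pool`.

```python
import re
from typing import Callable, Dict, List, Sequence

# Assumption: citations appear as bracketed 1-based indices, e.g. "[2]".
CITATION = re.compile(r"\[(\d+)\]")

MainLM = Callable[[str, Dict[int, str], List[str]], str]  # query, docs, kept statements
VerifierLM = Callable[[str, Dict[int, str]], str]         # query, cited docs only


def cited_ids(text: str) -> List[int]:
    """Sorted, de-duplicated document indices cited in the text."""
    return sorted({int(n) for n in CITATION.findall(text)})


def _tokens(text: str) -> set:
    """Lowercased word tokens with citation markers stripped."""
    return set(re.findall(r"[a-z0-9]+", CITATION.sub(" ", text.lower())))


def similarity(a: str, b: str) -> float:
    """Token-level Jaccard overlap; a stand-in for the real consistency measure."""
    ta, tb = _tokens(a), _tokens(b)
    return len(ta & tb) / len(ta | tb) if ta and tb else 0.0


def split_statements(text: str) -> List[str]:
    """Naive sentence split on ., !, or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def calm_sketch(query: str, pool: Sequence[str], main_lm: MainLM,
                verifier_lm: VerifierLM, batch_size: int = 2,
                threshold: float = 0.5, max_iters: int = 3) -> str:
    kept: List[str] = []   # statements carried over from the previous iteration
    candidate = ""
    for it in range(max_iters):
        start = it * batch_size
        # Step (2): the main LM answers from the next batch of documents.
        batch = {start + j + 1: doc
                 for j, doc in enumerate(pool[start:start + batch_size])}
        if not batch:
            break
        candidate = main_lm(query, batch, kept)
        # Step (3): the verifier LM sees only the documents the candidate cites.
        cited = {i: pool[i - 1] for i in cited_ids(candidate)
                 if 1 <= i <= len(pool)}
        verifier_out = verifier_lm(query, cited)
        # Step (4): accept the candidate if the two outputs are consistent.
        if similarity(candidate, verifier_out) >= threshold:
            return candidate
        # Step (5): keep only statements corroborated by the verifier output.
        verifier_statements = split_statements(verifier_out)
        kept = [s for s in split_statements(candidate)
                if any(similarity(s, v) >= threshold
                       for v in verifier_statements)]
    # Assumption: fall back to the last candidate when no answer is verified.
    return candidate
```

In this sketch, `main_lm` and `verifier_lm` would wrap calls to a larger and a smaller model respectively. Restricting the verifier to `cited` is the part that mirrors the caption most directly: the verifier never sees the full input batch, only the documents the candidate actually cites.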
