FineRef: Fine-Grained Error Reflection and Correction for Long-Form Generation with Citations

Yixing Peng; Licheng Zhang; Shancheng Fang; Yi Liu; Peijian Gu; Quan Wang

TL;DR

FineRef is a framework based on Fine-grained error Reflection. It explicitly teaches the model to self-identify and correct two key citation errors, mismatch and irrelevance, on a per-citation basis, and it exhibits strong generalization and robustness in domain transfer settings and noisy retrieval scenarios.

Abstract

Generating with citations is crucial for trustworthy Large Language Models (LLMs), yet even advanced LLMs often produce mismatched or irrelevant citations. Existing methods over-optimize citation fidelity while overlooking relevance to the user query, which degrades answer quality and robustness in real-world settings with noisy or irrelevant retrieved content. Moreover, the prevailing single-pass paradigm struggles to deliver optimal answers in long-form generation that requires multiple citations. To address these limitations, we propose FineRef, a framework based on Fine-grained error Reflection, which explicitly teaches the model to self-identify and correct two key citation errors, mismatch and irrelevance, on a per-citation basis. FineRef follows a two-stage training strategy. The first stage instills an "attempt-reflect-correct" behavioral pattern via supervised fine-tuning, using fine-grained and controllable reflection data constructed by specialized lightweight models. Within this stage, an online self-reflective bootstrapping strategy improves generalization by iteratively enriching the training data with verified, self-improving examples. To further enhance the self-reflection and correction capability, the second stage applies process-level reinforcement learning with a multi-dimensional reward scheme that promotes reflection accuracy, answer quality, and correction gain. Experiments on the ALCE benchmark demonstrate that FineRef significantly improves both citation performance and answer accuracy. Our 7B model outperforms GPT-4 by up to 18% in Citation F1 and 4% in EM Recall, while also surpassing the state-of-the-art model across key evaluation metrics. FineRef also exhibits strong generalization and robustness in domain transfer settings and noisy retrieval scenarios.
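
The per-citation "attempt-reflect-correct" pattern described above can be illustrated with a toy sketch. This is not the paper's implementation: FineRef trains the model itself to reflect, whereas the snippet below substitutes a simple token-overlap heuristic for both checks. The names `Citation`, `token_overlap`, `reflect`, `correct`, and the `threshold` value are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Citation:
    statement: str  # a sentence from the draft answer (the "attempt")
    passage: str    # text of the passage that sentence cites


def token_overlap(a: str, b: str) -> float:
    """Fraction of tokens in `a` that also occur in `b`.

    Assumption: a crude stand-in for a learned support/relevance checker.
    """
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    return len(tokens_a & tokens_b) / len(tokens_a) if tokens_a else 0.0


def reflect(query: str, citations: List[Citation], threshold: float = 0.5) -> List[str]:
    """Label each citation as 'ok', 'mismatch', or 'irrelevance'."""
    labels = []
    for c in citations:
        if token_overlap(c.statement, c.passage) < threshold:
            # The cited passage does not support the statement.
            labels.append("mismatch")
        elif token_overlap(query, c.passage) < threshold:
            # The cited passage does not address the user query.
            labels.append("irrelevance")
        else:
            labels.append("ok")
    return labels


def correct(citations: List[Citation], labels: List[str]) -> List[Citation]:
    """Toy correction step: keep only citations judged 'ok'."""
    return [c for c, label in zip(citations, labels) if label == "ok"]
```

Calling `reflect(query, citations)` on a draft and then passing the labels to `correct` yields the filtered citation list. In the framework described in the abstract, both steps are produced by the trained model rather than by a fixed heuristic, and correction may rewrite a statement instead of dropping it.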

Paper Structure (24 sections, 10 equations, 4 figures, 4 tables)

Introduction
Task Formulation
Method
Behavioral Pattern Learning
Behavior Data Construction.
Initial Training.
Online Self-Reflective Bootstrapping.
Process-Level RL with Multi-Dimensional Rewards
Experiments
Datasets & Metrics
Experimental Setup
Baselines
Main Results
Domain Transfer
Analysis
...and 9 more sections

Figures (4)

Figure 1: An example of citation errors in generated response: orange indicates citations that do not match the referenced passage (mismatch), while purple denotes citations that are irrelevant to the query (irrelevance).
Figure 2: FineRef involves two training stages. (1) Behavior pattern learning stage: the model is supervised to generate the “attempt–reflection–correction” chain using fine-grained reflection data constructed via specialized FCM and reranker models, followed by online reflection bootstrapping, in which the model improves from self-generated reflection-correction data. (2) Process-level RL stage: a multi-dimensional reward function further enhances answer quality, reflection accuracy, and correction effectiveness.
Figure 3: Accuracy of self-reflection
Figure 4: Performance in the scenario with noisy passages
