Large Language Models Can Self-Improve in Long-context Reasoning

Siheng Li, Cheng Yang, Zesen Cheng, Lemao Liu, Mo Yu, Yujiu Yang, Wai Lam

TL;DR

This paper tackles the challenge of long-context reasoning in large language models by proposing SeaLong, a self-improvement framework that generates multiple reasoning outputs, scores them with Minimum Bayes Risk based on embedding consensus, and fine-tunes models either via supervised learning or preference optimization. The approach eliminates the need for human or expert-generated data by leveraging self-supervision and consensus signals, and demonstrates solid improvements across multiple models and long-context tasks, surpassing some baselines and staying competitive with larger models. Key findings show that increasing sampled outputs and using MBR with semantic embeddings substantially boosts correctness, and that SeaLong maintains or improves long-context performance with minimal impact on short-context tasks. The work highlights data-efficiency, discusses limitations in scoring quality and data sources, and opens pathways for scalable self-improvement in long-context reasoning.

Abstract

Large language models (LLMs) have achieved substantial progress in processing long contexts but still struggle with long-context reasoning. Existing approaches typically involve fine-tuning LLMs with synthetic data, which depends on annotations from human experts or advanced models like GPT-4, thus restricting further advancements. To address this issue, we investigate the potential for LLMs to self-improve in long-context reasoning and propose SeaLong, an approach specifically designed for this purpose. This approach is straightforward: we sample multiple outputs for each question, score them with Minimum Bayes Risk, and then apply supervised fine-tuning or preference optimization based on these outputs. Extensive experiments on several leading LLMs demonstrate the effectiveness of SeaLong, with an absolute improvement of 4.2 points for Llama-3.1-8B-Instruct. Furthermore, SeaLong achieves superior performance compared to prior approaches that depend on data produced by human experts or advanced models. We anticipate that this work will open new avenues for self-improvement techniques in long-context scenarios, which are essential for the continual advancement of LLMs.
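
The scoring step described above can be illustrated with a minimal sketch. It assumes each sampled output has already been mapped to an embedding vector (the embedding model is not specified here), and it uses cosine similarity as the utility, so an output's Minimum Bayes Risk score is its average similarity to the other samples. The function names `cosine` and `mbr_scores` are hypothetical and chosen for illustration; this is not the authors' implementation.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mbr_scores(embeddings: List[Sequence[float]]) -> List[float]:
    """Score each sample by its mean similarity to all other samples.

    Assumption: consensus among samples is measured with cosine similarity
    over precomputed embeddings. A higher score means the output agrees
    more with the rest of the sampled set.
    """
    n = len(embeddings)
    if n < 2:
        raise ValueError("need at least two samples to measure consensus")
    scores = []
    for i in range(n):
        total = sum(
            cosine(embeddings[i], embeddings[j]) for j in range(n) if j != i
        )
        scores.append(total / (n - 1))
    return scores


# Toy embeddings: the first two samples agree, the third is an outlier.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
toy_scores = mbr_scores(toy_embeddings)
best_index = toy_scores.index(max(toy_scores))
```

In this toy case the outlier receives the lowest score, which is the behavior the consensus signal relies on: outputs that disagree with most other samples are treated as less likely to be correct.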

Paper Structure

This paper contains 29 sections, 7 equations, 4 figures, 10 tables.

Figures (4)

  • Figure 1: Scaling up the number of sampled outputs improves the performance of both the oracle sample and MBR decoding (see the section on self-supervision). The results are based on Llama-3.1-8B-Instruct.
  • Figure 2: SeaLong consists of two stages: self-supervision creation and fine-tuning. Given a long context and a corresponding query, multiple outputs are sampled, each assigned a score based on Minimum Bayes Risk. Fine-tuning is then conducted using either the highest-scoring output for supervised fine-tuning or both high-scoring and low-scoring outputs for preference optimization.
  • Figure 3: Long-context performance of SeaLong with varying numbers of synthetic training examples, evaluated based on Llama-3.1-8B-Instruct fine-tuned on the corresponding dataset.
  • Figure 4: Long-context performance of SeaLong with varying numbers of samples per example during data synthesis, evaluated based on Llama-3.1-8B-Instruct fine-tuned on the corresponding dataset.
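
The fine-tuning stage summarized in the Figure 2 caption can be sketched as a small data-construction step. This is an illustrative sketch under stated assumptions, not the authors' code: the function `build_examples` and the record keys (`prompt`, `response`, `chosen`, `rejected`) are hypothetical, and taking the single lowest-scoring output as the rejected one is a simplification, since the caption only says that low-scoring outputs are used.

```python
from typing import Dict, List, Tuple


def build_examples(
    prompt: str, outputs: List[str], scores: List[float]
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Turn scored samples into one SFT record and one preference record.

    Assumptions: `scores` are consensus scores aligned with `outputs`;
    the highest-scoring output is the SFT target and the chosen response,
    and the lowest-scoring output is used as the rejected response.
    """
    if len(outputs) != len(scores):
        raise ValueError("outputs and scores must have the same length")
    if len(outputs) < 2:
        raise ValueError("need at least two outputs to form a preference pair")
    # Rank sample indices from highest to lowest score.
    ranked = sorted(range(len(outputs)), key=lambda i: scores[i], reverse=True)
    best, worst = ranked[0], ranked[-1]
    sft_record = {"prompt": prompt, "response": outputs[best]}
    preference_record = {
        "prompt": prompt,
        "chosen": outputs[best],
        "rejected": outputs[worst],
    }
    return sft_record, preference_record


# Toy usage: three sampled answers with made-up consensus scores.
sft_record, preference_record = build_examples(
    "long context + question",
    ["answer A", "answer B", "answer C"],
    [0.50, 0.55, 0.06],
)
```

The SFT record would feed supervised fine-tuning, while the preference record has the shape commonly expected by preference-optimization trainers.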