Re-Tuning: Overcoming the Compositionality Limits of Large Language Models with Recursive Tuning

Eric Pasewark; Kyle Montgomery; Kefei Duan; Dawn Song; Chenguang Wang

TL;DR

Re-Tuning tackles the challenge that large language models struggle with compositional tasks by training them to solve problems recursively. The method trains LLMs to generate recursive subproblem calls, solve base cases in separate contexts, and propagate subproblem solutions up the call stack to obtain the final answer. Across three representative tasks (integer addition, dynamic programming, parity) and two model sizes (LLaMA 7B/13B) using LoRA, Re-Tuning significantly improves accuracy and especially boosts out-of-distribution generalization compared to baseline and scratchpad prompting, while also reducing training memory use. The work also demonstrates strong sample efficiency and robustness to prompt variations, with ablation and error analyses clarifying how subproblem isolation and carryover of subproblem solutions contribute to performance. Overall, Re-Tuning presents a principled, memory-efficient approach to enabling recursive computation in LLMs, with potential applicability to other inherently recursive tasks and resource-constrained settings.

Abstract

We present a new method for large language models to solve compositional tasks. Although they have shown strong performance on traditional language understanding tasks, large language models struggle to solve compositional tasks, where the solution depends on solving smaller instances of the same problem. We propose a natural approach to solve compositional tasks recursively. Our method, Re-Tuning, tunes models to break down a problem into subproblems, solve those subproblems, and combine the results. We show that our method significantly improves model performance on three representative compositional tasks: integer addition, dynamic programming, and parity. Compared to state-of-the-art methods that keep intermediate steps towards solving the problems, Re-Tuning achieves significantly higher accuracy and is more GPU memory efficient.
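
The break-down, solve, and combine loop described above can be sketched as a thin recursive wrapper around an ordinary text-generation function. This is a minimal illustration, not the authors' implementation. The `Call: `, `Return: `, and `Answer: ` markers are assumed names chosen for this example. `toy_addition_model` is a hypothetical hand-written stand-in for a tuned LLM. The rule of dropping the leading digit to form the subproblem is likewise an assumption made for illustration.

```python
from typing import Callable

# Hypothetical markers; the real prompt format may differ.
CALL, RETURN, ANSWER = "Call: ", "Return: ", "Answer: "


def recursive_generate(generate: Callable[[str], str], prompt: str,
                       depth: int = 0, max_depth: int = 64) -> str:
    """Run `generate` on `prompt`, solving every emitted call in a fresh context."""
    if depth > max_depth:
        raise RecursionError("too many nested subproblem calls")
    context = prompt
    while True:
        out = generate(context)
        context += out
        lines = out.strip().splitlines()
        last = lines[-1] if lines else ""
        if last.startswith(CALL):
            # The subproblem sees only its own prompt, not the parent's context.
            sub_prompt = last[len(CALL):] + "\n"
            sub_answer = recursive_generate(generate, sub_prompt, depth + 1, max_depth)
            # Copy the subproblem's answer back into the parent context.
            context += f"{RETURN}{sub_answer}\n"
        elif last.startswith(ANSWER):
            return last[len(ANSWER):]
        else:
            raise ValueError(f"unexpected model output: {out!r}")


def toy_addition_model(context: str) -> str:
    """Stand-in for a tuned model: emits a call, or an answer once a return exists."""
    lines = context.strip().splitlines()
    a, b = (s.strip() for s in lines[0].split("+"))
    width = max(len(a), len(b))
    a, b = a.zfill(width), b.zfill(width)
    if width == 1:
        # Base case: single-digit addition is answered directly.
        return f"{ANSWER}{int(a) + int(b)}\n"
    if not lines[-1].startswith(RETURN):
        # Recursive case: ask for the sum of the numbers without their leading digits.
        return f"{CALL}{a[1:]} + {b[1:]}\n"
    # Combine step: add the leading digits (scaled) to the returned sub-sum.
    sub = int(lines[-1][len(RETURN):])
    total = (int(a[0]) + int(b[0])) * 10 ** (width - 1) + sub
    return f"{ANSWER}{total}\n"


if __name__ == "__main__":
    print(recursive_generate(toy_addition_model, "1234 + 5678\n"))  # prints 6912
```

In this sketch each nested call starts from a prompt that contains only the subproblem, so no context grows beyond one problem statement, one call, one return, and one answer. A scratchpad-style approach would instead keep every intermediate step in a single context.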

Paper Structure (40 sections, 11 figures, 5 tables, 1 algorithm)

Introduction
Approach
Experiments
  Experimental Setup
    Integer addition
    Dynamic programming
    Parity
  Main Results
    Integer addition
    Dynamic programming
    Parity
Analysis and Further Discussion
  Ablation Study
  Case Study
  Error Analysis
...and 25 more sections

Figures (11)

Figure 1: Summary of our approach and results. Top: Our Re-Tuning pipeline generates and processes all the recursive subproblems for each randomly generated problem instance in order to train the base LLM. For a new question, our Re-Tuning pipeline allows the model to call itself on a subproblem of reduced size, which enables the subproblem to be solved in a new context and return the answer to the initial context. The top right shows the generation procedure to solve 1234+5678. Each separate context is indicated by a blue bubble. The arrows indicate copying of generated prompts or solutions. Bottom Left: On most problems, Re-Tuning trains on significantly fewer tokens than the scratchpad method, saving considerable GPU memory. Bottom Right: On average, Re-Tuning outperforms the baseline and scratchpad methods across all tasks, especially as the problems grow in size and complexity.
Figure 1: Pseudocode for the RecursiveGenerate method, a lightweight recursive wrapper around the standard generation function used with the baseline and scratchpad methods.
Figure 2: Performance of LLaMA 7B (top) and LLaMA 13B (bottom) on Addition (left), Dynamic Programming (middle), and Parity (right). The in-distribution range is shaded in gray.
Figure 3: Case study on a dynamic programming problem. With scratchpad, the model makes an indexing error, while with Re-Tuning, the model correctly generates the recursive call.
Figure 4: Error classifications for each problem across samples of 20 instances per problem length.
...and 6 more figures
