Beyond Single-Task: Robust Multi-Task Length Generalization for LLMs

Yi Hu, Shijia Kang, Haotong Yang, Haotian Xu, Muhan Zhang

TL;DR

This work introduces Meta Rule-Following Fine-Tuning (Meta-RFFT) to achieve robust cross-task length generalization in large language models. The method first performs RF-pretraining on a diverse 74-task rule-following dataset spanning code, numeric, symbolic, and logical reasoning, and then adapts to unseen downstream tasks via minimal fine-tuning or one-shot prompts. Models trained this way generalize to longer problem lengths and to unseen rules. On an 86-task corpus, a 32B model fine-tuned with Meta-RFFT attains substantially higher accuracy on long-horizon tasks (e.g., 30-digit addition) than state-of-the-art long-CoT models, which suggests that the model transfers shared computational primitives rather than memorizing task-specific patterns. The findings also show that the approach generalizes to natural language rule formats and is more compute-efficient than RL-based alternatives, suggesting practical viability for real-world applications requiring strict rule adherence and scalable multi-task reasoning.

Abstract

Length generalization, the ability to solve problems longer than those seen during training, remains a critical challenge for large language models (LLMs). Previous work modifies positional encodings (PEs) and data formats to improve length generalization on specific symbolic tasks such as addition and sorting. However, these approaches are fundamentally limited to special tasks and often degrade general language performance. Furthermore, they are typically evaluated on small transformers trained from scratch on single tasks, and can cause a performance drop when applied during the post-training stage of practical LLMs with general capabilities. Hu et al. (2024) proposed Rule-Following Fine-Tuning (RFFT) to improve length generalization in the post-training stage of LLMs. Despite its compatibility with practical models and its strong performance, RFFT is also designed for single tasks, requiring re-training for each individual task with extensive examples. In this paper, we study length generalization in multi-task settings and propose Meta Rule-Following Fine-Tuning (Meta-RFFT), the first framework enabling robust cross-task length generalization. As our first contribution, we construct a large length generalization dataset containing 86 tasks spanning code execution, number processing, and symbolic and logical reasoning, beyond the common addition or multiplication tasks. Secondly, we show that cross-task length generalization is possible with Meta-RFFT. After training on a large number of tasks and instances, the models achieve remarkable length generalization ability on unseen tasks with minimal fine-tuning or one-shot prompting. For example, after fine-tuning on 1- to 5-digit addition, our 32B model achieves 95% accuracy on 30-digit addition, significantly outperforming state-of-the-art reasoning models (DeepSeek-R1-671B: 72%), despite never seeing this task during RF-pretraining.

Paper Structure

This paper contains 54 sections, 2 equations, 20 figures, 9 tables.

Figures (20)

  • Figure 1: Comparison of input-output sequences across three methods: direct answer, scratchpad (top left), and RFFT (right), with single-task performance results shown at the bottom left.
  • Figure 1: The statistics of our dataset. We list the number of tasks collected from each data source and their corresponding split in the RF-pretraining stage or the downstream adaptation stage.
  • Figure 2: The pipeline of Meta-RFFT. LC, BBH, SR, and NUPA stand for LeetCode, Big-Bench Hard (Suzgun et al., 2022), Symbolic Reasoning (Wei et al., 2023), and the NUPA Benchmark (Yang et al., 2024), respectively.
  • Figure 2: Overall metrics of performance of different methods across all 12 test tasks. Here, ACC_Len30 measures average accuracy at length 30; Max_Len_90% represents maximum length sustaining $\geq$90% accuracy averaged across tasks.
  • Figure 3: Length generalization performance of direct answer, scratchpad, vanilla RFFT and Meta-RFFT on LeetCode and NUPA tasks. The shaded region represents the in-distribution test results (length $\leq 5$), while the unshaded background corresponds to out-of-distribution lengths (length $\geq 6$). Here the base model is Qwen2.5-7B-Instruct.
  • ...and 15 more figures
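
The two summary metrics named in the caption above (ACC_Len30 and Max_Len_90%) can be illustrated with a minimal sketch. This is not the authors' evaluation code. The function names, the dictionary layout, and the toy accuracy numbers are all hypothetical. The sketch also assumes one particular reading of "maximum length sustaining ≥90% accuracy": the largest tested length L such that accuracy stays at or above the threshold at every tested length up to L.

```python
from statistics import mean


def max_len_sustaining(acc_by_len, threshold=0.9):
    """Largest tested length up to which accuracy never drops below threshold.

    Assumed interpretation: scan lengths in increasing order and stop at the
    first one that falls below the threshold. Returns 0 if the shortest
    tested length already fails.
    """
    best = 0
    for length in sorted(acc_by_len):
        if acc_by_len[length] < threshold:
            break
        best = length
    return best


def overall_metrics(results, eval_len=30, threshold=0.9):
    """Average both metrics across tasks.

    `results` maps a task name to {problem length: accuracy in [0, 1]}.
    """
    return {
        "acc_at_len": mean(task[eval_len] for task in results.values()),
        "max_len": mean(
            max_len_sustaining(task, threshold) for task in results.values()
        ),
    }


# Made-up accuracies for two hypothetical tasks (not results from the paper).
toy_results = {
    "task_a": {5: 1.0, 10: 0.95, 20: 0.92, 30: 0.75},
    "task_b": {5: 1.0, 10: 0.85, 20: 0.5, 30: 0.25},
}

metrics = overall_metrics(toy_results)
print(metrics)
```

Under this reading, a task that recovers above the threshold at a longer length after dipping below it does not get credit for the longer length. A different definition (for example, the largest length with accuracy above the threshold anywhere) would give different values.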