Recursive Inference Scaling: A Winning Path to Scalable Inference in Language and Multimodal Systems

Ibrahim Alabdulmohsin, Xiaohua Zhai

TL;DR

The paper introduces Recursive Inference Scaling (RINS), a compute-aware, plug-in recursion strategy that exploits the fractal self-similarity of language to scale inference without increasing model size. RINS partitions the model into blocks and applies an early block recursively before passing its output to a final block. Under a fixed training compute budget and parameter count, it outperforms more than 55 other recursive variants, including repeat-all-over (RAO) and latent recurrent thinking. The paper shows that stochastic RINS with lightweight linear adapters is a no-regret option, that RINS improves multimodal systems (e.g., 0-shot ImageNet accuracy for SigLIP-B/16), and that the derived data scaling laws indicate better asymptotic limits and better scaling exponents. It also analyzes memory considerations via KV cache sharing and reports that the benefits of recursion are domain-specific: language shows an advantage while vision does not. Collectively, the results position RINS as a viable component for LLM pretraining and inference-time scaling in language and multimodal systems.

Abstract

Inspired by recent findings on the fractal geometry of language, we introduce Recursive INference Scaling (RINS) as a complementary, plug-in recipe for scaling inference time in language and multimodal systems. RINS is a particular form of recursive depth that significantly outperforms more than 55 other variants, including the recent "repeat-all-over" (RAO) strategy in MobileLLM (Liu et al., 2024) and latent recurrent thinking (Geiping et al., 2025). Unlike prior works, we carry out our comparisons in a compute-matched regime, and demonstrate that for a fixed model size and training compute budget, RINS substantially improves language modeling performance. It also generalizes beyond pure language tasks, delivering gains in multimodal systems, including a +2% improvement in 0-shot ImageNet accuracy for SigLIP-B/16. Additionally, by deriving data scaling laws, we show that RINS improves both the asymptotic performance limits and the scaling exponents. More importantly, with light-weight (linear) adapters (comprising <1% of model parameters) and stochastic dropout, RINS offers a no-regret strategy, meaning that RINS-enabled pretraining improves performance in language modeling even when recursive depth is not applied at inference time. This corresponds to improving performance in a training compute-, parameter-, and inference-matched regime, suggesting its potential as a viable component of LLM pretraining.

Paper Structure

This paper contains 34 sections, 9 figures, 6 tables.

Figures (9)

  • Figure 1: Left: In RINS, the model $f:\mathcal{X}\to\mathcal{Y}$ is split into two parts: the first block $f_A:\mathcal{X}\to\mathcal{X}$ is applied iteratively to its own output $r$ times before passing the output to the second block. Right: Illustrative examples of models with different signatures and degrees. From top to bottom: (1) Baseline (Signature: AB, Degree: 1), a feedforward architecture with no recursion. (2) Repeat-all-over (RAO) (Liu et al., 2024), where the entire model is recursively applied to its own output. When recursion is done twice, it has a signature of ABAB. (3) RINS with signature A$^3$B. (4) (A$^3$B)$_2$, whose degree is 2, in which the same parameter-sharing signature is applied to each of the two blocks A and B.
  • Figure 2: Language models are trained on 200B tokens. The $x$-axis is the training cost in units of layer $\times$ step. Notably, the performance advantage of RINS increases with longer training. The long-sequence baseline, using a context length of 1,536 tokens, exhibits lower performance due to processing fewer examples to maintain the same FLOPs count. See Figure \ref{fig:llm_long_dur} for longer training durations and Figure \ref{fig:stoch_lm} for larger (1B parameter) models, further demonstrating the value of RINS. Sharp drops in perplexity near the end of training are due to learning rate cooldown.
  • Figure 3: Performance of stochastic RINS (A$^3$B) with varying inference costs for 1B parameter LMs. The $x$-axis represents the training compute cost. The legend indicates the inference cost of each stochastic RINS configuration relative to the baseline; e.g., $1.5\times$ denotes a 50% increase in inference cost. For $p_s=0$, RINS@1x is significantly worse, with perplexity scores $>3$. As expected, RINS converges in performance to the baseline as $p_s\to 1$. Similar results using C4 are in Appendix \ref{app:one_b_c4}.
  • Figure 4: Left two plots: The $y$-axis corresponds to performance when RINS is enabled during training but disabled at inference time in 600M-parameter LMs. Stochastic RINS with linear adapters and $p_s=\frac{1}{2}$ matches the baseline at $1\times$ the inference cost, while $p_s=0.8$ results in a better language model, even though all models have the same training compute, parameter count, and inference cost. We speculate this is because RINS provides a better inductive bias. Right two plots: RINS with $p_s=0.5$ improves performance with KV cache sharing, although the improvement diminishes.
  • Figure 5: NumPy-like syntax for models with a fixed signature and degree. When no stochastic depth is applied, we have $p_{s} = 0$. In RINS, we expand $p_{s}$ into a tuple of the form $(0, p_s, p_s, \ldots, p_s, 0)$, where the first and last entries are zero to guarantee they are executed, which is equivalent to sampling the number of recursion rounds from a binomial distribution as described in Section \ref{sect:stoch}.
  • ...and 4 more figures
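
The structure described in the captions for Figures 1 and 5 can be illustrated with a small NumPy sketch: a first block is applied up to $r$ times before a second block, with per-application skip probabilities $(0, p_s, \ldots, p_s, 0)$. This is not the authors' code. The names `rins_forward` and `relative_inference_cost`, the toy `tanh` and linear blocks, and the assumption that blocks A and B cost the same per application are all hypothetical choices made for illustration, and the linear adapters are omitted.

```python
import numpy as np


def rins_forward(x, f_a, f_b, r=3, p_s=0.0, rng=None):
    """Sketch of an A^r B forward pass with stochastic skipping.

    f_a maps X -> X and is applied up to r times; f_b maps X -> Y.
    Returns the output and the number of block applications executed.
    """
    rng = np.random.default_rng() if rng is None else rng
    blocks = [f_a] * r + [f_b]
    # Skip probabilities (0, p_s, ..., p_s, 0): the first application of A
    # and the final block B are never skipped.
    skip_probs = [0.0] + [p_s] * (r - 1) + [0.0]
    n_applied = 0
    for block, p in zip(blocks, skip_probs):
        if rng.random() < p:
            continue  # this recursion round is dropped
        x = block(x)
        n_applied += 1
    return x, n_applied


def relative_inference_cost(n_a_rounds, cost_a=1.0, cost_b=1.0):
    """Cost of n_a_rounds applications of A plus one of B, relative to AB.

    Assumption (not from the source): cost is additive over block
    applications, and A and B cost the same unless told otherwise.
    """
    return (n_a_rounds * cost_a + cost_b) / (cost_a + cost_b)


# Toy blocks standing in for the two halves of a network (hypothetical).
rng = np.random.default_rng(0)
d, k = 8, 4
w_a = rng.normal(size=(d, d)) / np.sqrt(d)
w_b = rng.normal(size=(d, k)) / np.sqrt(d)
f_a = lambda h: np.tanh(h @ w_a)  # X -> X
f_b = lambda h: h @ w_b           # X -> Y

x = rng.normal(size=(2, d))
y_full, n_full = rins_forward(x, f_a, f_b, r=3, p_s=0.0, rng=rng)  # A, A, A, B
y_base, n_base = rins_forward(x, f_a, f_b, r=3, p_s=1.0, rng=rng)  # A, B
```

With $p_s = 0$ every round runs, and with $p_s = 1$ the pass reduces to the baseline AB, mirroring the caption's remark that RINS converges to the baseline as $p_s \to 1$. Under the equal-cost assumption, two rounds of A give `relative_inference_cost(2) == 1.5`, which is one way to read the $1.5\times$ legend entry in Figure 3.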

Theorems & Definitions (2)

  • Definition 2.1
  • Definition 2.2