Vision Tiny Recursion Model (ViTRM): Parameter-Efficient Image Classification via Recursive State Refinement

Ange-Clément Akazan; Abdoulaye Koroko; Verlon Roel Mbingui; Choukouriyah Arinloye; Hassan Fifen; Rose Bandolo

Abstract

The success of deep learning in computer vision has been driven by models of increasing scale, from deep Convolutional Neural Networks (CNNs) to large Vision Transformers (ViTs). While effective, these architectures are parameter-intensive and demand significant computational resources, which limits their deployment in resource-constrained environments. Inspired by Tiny Recursive Models (TRM), which show that small recursive networks can solve complex reasoning tasks through iterative state refinement, we introduce the Vision Tiny Recursion Model (ViTRM): a parameter-efficient architecture that replaces the $L$-layer ViT encoder with a single tiny $k$-layer block ($k{=}3$) applied recursively $N$ times. Despite using up to $6\times$ fewer parameters than CNN-based models and up to $84\times$ fewer than ViT, ViTRM maintains competitive performance on CIFAR-10 and CIFAR-100. This demonstrates that recursive computation is a viable, parameter-efficient alternative to architectural depth in vision.
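As a rough illustration of why weight sharing saves parameters, the sketch below counts the weights of a standard transformer encoder layer and compares an $L$-layer stack with a shared $k$-layer block. The width `d_model = 192`, the depth `L = 12`, the MLP ratio of 4, and all function names are illustrative assumptions, not values taken from the paper. The ratio it yields covers only the encoder at equal width and is not meant to reproduce the reported $6\times$ and $84\times$ figures.

```python
def encoder_layer_params(d_model: int, mlp_ratio: int = 4) -> int:
    """Weights and biases of one standard transformer encoder layer (assumed layout)."""
    # Four d x d projections (query, key, value, output), each with a bias.
    attn = 4 * d_model * d_model + 4 * d_model
    # Two linear layers: d -> hidden and hidden -> d, each with a bias.
    hidden = mlp_ratio * d_model
    mlp = d_model * hidden + hidden + hidden * d_model + d_model
    # Two LayerNorms, each with a scale and a shift vector.
    norms = 2 * 2 * d_model
    return attn + mlp + norms


def stacked_encoder_params(d_model: int, num_layers: int) -> int:
    """An L-layer encoder with separate weights per layer."""
    return num_layers * encoder_layer_params(d_model)


def recursive_encoder_params(d_model: int, k: int, n_recursions: int) -> int:
    """A k-layer block reused n_recursions times; reuse adds no parameters."""
    del n_recursions  # shared weights: the count does not depend on N
    return k * encoder_layer_params(d_model)


# Hypothetical configuration, chosen only for illustration.
stacked = stacked_encoder_params(d_model=192, num_layers=12)
shared = recursive_encoder_params(d_model=192, k=3, n_recursions=8)
print(stacked, shared, stacked / shared)
```

Under these assumptions the encoder shrinks by a factor of $L/k$ regardless of $N$. Recursion trades parameters for compute: each extra pass costs another forward pass through the shared block.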

Paper Structure (27 sections, 6 equations, 3 figures, 2 tables)

This paper contains 27 sections, 6 equations, 3 figures, 2 tables.

Introduction
Related work
Transformers in Vision
Iterative and Recursive Refinement in Vision
Object-Centric Latents and Iterative Attention
Deep Equilibrium and Fixed-Point Methods
Adaptive Computation and Dynamic Inference
Recursive Reasoning Models: From HRM to TRM and Beyond
Vision Tiny Recursion Model
Patch Embedding
Recurrent States
Recursive Update Rules
Classification and Halting Heads
Training
Experiments
...and 12 more sections

Figures (3)

Figure 1: ViTRM: Recursive Reasoning with Working Memory and Deep Supervision. Top: At each reasoning step $t$, the model alternates between two phases sharing the same transformer weights $\theta$. In the Refine Memory phase, the concatenation of visual tokens $x$, answer token $y$, and memory tokens $z$ is processed by the shared transformer $T^\theta$ for $M$ iterations, retaining only the updated memory $z$. In the Update Answer phase, the concatenation of $y$ and $z$ is fed to $T^\theta$ to produce a new answer token $y$. This process is repeated $T$ times and then passed to two shallow MLP heads: a classification head producing class logits $\hat{y}_t$, and a halting head producing a halting probability $q_t$. Inference stops when $q_t > \tau$. Bottom: During training, deep supervision is applied at each of the $N$ reasoning steps. At step $n$, the predicted output $\hat{y}_n$ is supervised with a combined loss $L_n = L_\text{cls} + L_\text{halt}$, using stop-gradient to prevent interference between steps.
Figure 2: Ablation results for supervision depth $N_{\text{supervision}}$ on CIFAR-10.
Figure 3: Ablation results for reasoning depth $n_{\text{latent\_steps}}$ on CIFAR-10.
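The inference loop described in the Figure 1 caption can be sketched as follows. This is a minimal control-flow sketch, not the authors' implementation. It assumes the nesting is $N$ supervision steps, each containing $T$ cycles of $M$ memory refinements followed by one answer update, with both heads evaluated after every supervision step. The names `vitrm_infer`, `toy_block`, `toy_cls_head`, and `toy_halt_head` are hypothetical, and the toy block is a simple averaging stand-in for the shared transformer $T^\theta$.

```python
import math
from typing import Callable, List, Tuple

Token = List[float]
Tokens = List[Token]


def vitrm_infer(
    x: Tokens, y: Token, z: Tokens,
    block: Callable[[Tokens], Tokens],
    cls_head: Callable[[Token], List[float]],
    halt_head: Callable[[Token], float],
    n_sup: int, t_cycles: int, m_refine: int, tau: float,
) -> Tuple[List[float], float, int]:
    """Run up to n_sup reasoning steps (n_sup >= 1), halting early once q > tau."""
    num_x = len(x)
    logits, q, steps = [], 0.0, 0
    for n in range(1, n_sup + 1):
        for _ in range(t_cycles):
            # Refine Memory: process [x, y, z] and keep only the memory tokens.
            for _ in range(m_refine):
                out = block(x + [y] + z)
                z = out[num_x + 1:]
            # Update Answer: process [y, z] and keep only the answer token.
            y = block([y] + z)[0]
        # Heads read the answer token after each supervision step (assumption).
        logits, q, steps = cls_head(y), halt_head(y), n
        if q > tau:
            break
    return logits, q, steps


def toy_block(tokens: Tokens) -> Tokens:
    """Stand-in for the shared transformer: pull each token toward the token mean."""
    dim = len(tokens[0])
    mean = [sum(tok[i] for tok in tokens) / len(tokens) for i in range(dim)]
    return [[0.5 * tok[i] + 0.5 * mean[i] for i in range(dim)] for tok in tokens]


def toy_cls_head(y: Token) -> List[float]:
    return list(y)  # placeholder: treat the answer token as the logits


def toy_halt_head(y: Token) -> float:
    return 1.0 / (1.0 + math.exp(-sum(y)))  # sigmoid of a scalar summary


x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # three visual tokens
y0 = [0.0, 0.0]                             # initial answer token
z0 = [[0.0, 0.0], [0.0, 0.0]]               # two memory tokens
logits, q, steps = vitrm_infer(x, y0, z0, toy_block, toy_cls_head,
                               toy_halt_head, n_sup=4, t_cycles=2,
                               m_refine=3, tau=1.0)
```

With `tau=1.0` the halting condition never fires, so all `n_sup` steps run and the shared block is called `n_sup * t_cycles * (m_refine + 1)` times. Lowering `tau` lets easy inputs exit after fewer steps. The training-time deep supervision and stop-gradient from the bottom panel are not modeled here.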
