Fast Reasoning Segmentation for Images and Videos

Yiqing Shen, Mathias Unberath

TL;DR

Open-set reasoning segmentation segments objects specified by implicit queries, but current methods depend on large multimodal LLMs. FastReasonSeg uses digital twin representations to decouple perception from reasoning, so that smaller LLMs can perform complex spatial-temporal reasoning. A large teacher LLM is trained on these structured representations, and its reasoning is transferred through two-stage distillation: supervised fine-tuning followed by reinforcement learning with a joint format- and accuracy-based reward plus a reasoning-alignment term. The approach achieves state-of-the-art results on both video and image benchmarks, and the distilled 0.6B model delivers real-time throughput (7.79 FPS) with far lower memory use, enabling edge deployment for embodied AI applications. The results indicate that reasoning quality can be preserved in compact models by combining structured digital twins with principled distillation, a step toward practical autonomous systems.

Abstract

Reasoning segmentation enables open-set object segmentation via implicit text queries, and thus serves as a foundation for embodied agents that must operate autonomously in real-world environments. However, existing reasoning segmentation methods require multimodal large language models with billions of parameters, which exceeds the computational capabilities of the edge devices on which embodied AI systems are typically deployed. Distillation offers a pathway to compress these models while preserving their capabilities. Yet existing distillation approaches fail to transfer the multi-step reasoning that reasoning segmentation demands, because they focus on matching output predictions and intermediate features rather than preserving reasoning chains. The emerging paradigm of reasoning over digital twin representations presents an opportunity for more effective distillation by reframing the problem. We therefore propose FastReasonSeg, which employs digital twin representations that decouple perception from reasoning. Our distillation scheme first applies supervised fine-tuning on teacher-generated reasoning chains, followed by reinforcement fine-tuning with joint rewards that evaluate both segmentation accuracy and reasoning quality alignment. Experiments on two video benchmarks (JiTBench, RVTBench) and two image benchmarks (ReasonSeg, LLM-Seg40K) demonstrate that FastReasonSeg achieves state-of-the-art reasoning segmentation performance. Moreover, the distilled 0.6B variant outperforms models with 20 times more parameters while achieving 7.79 FPS throughput with only 2.1 GB of memory consumption. This efficiency enables real-time reasoning segmentation in resource-constrained environments.
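
To make the idea of a joint reward concrete, the following is a minimal sketch of how a format term, a segmentation-accuracy term, and a reasoning-alignment term might be combined into one scalar during reinforcement fine-tuning. It is not the paper's implementation: the `<think>`/`<answer>` template, the token-overlap alignment measure, the weights, and all function names are assumptions chosen for illustration only.

```python
import re


def format_reward(response: str) -> float:
    """Return 1.0 if the response follows a (hypothetical) output template.

    Assumed template: <think>...</think> followed by <answer>...</answer>.
    """
    pattern = r"\s*<think>.+?</think>\s*<answer>.+?</answer>\s*"
    return 1.0 if re.fullmatch(pattern, response, flags=re.DOTALL) else 0.0


def alignment_reward(student_chain: str, teacher_chain: str) -> float:
    """Stand-in alignment score: Jaccard overlap of lowercase word sets.

    This is only a placeholder for whatever reasoning-alignment measure
    is actually used; it is an assumption of this sketch.
    """
    student_tokens = set(student_chain.lower().split())
    teacher_tokens = set(teacher_chain.lower().split())
    if not student_tokens and not teacher_tokens:
        return 1.0
    return len(student_tokens & teacher_tokens) / len(student_tokens | teacher_tokens)


def joint_reward(
    response: str,
    student_chain: str,
    teacher_chain: str,
    seg_score: float,
    w_format: float = 0.2,   # hypothetical weight
    w_acc: float = 0.5,      # hypothetical weight
    w_align: float = 0.3,    # hypothetical weight
) -> float:
    """Weighted sum of format, accuracy, and reasoning-alignment terms.

    `seg_score` is a segmentation accuracy in [0, 1], e.g. a mask IoU
    computed elsewhere.
    """
    return (
        w_format * format_reward(response)
        + w_acc * seg_score
        + w_align * alignment_reward(student_chain, teacher_chain)
    )


if __name__ == "__main__":
    chain = "the cup is left of the plate"
    response = f"<think>{chain}</think><answer>object_3</answer>"
    print(joint_reward(response, chain, chain, seg_score=0.5))
```

A weighted sum is only one possible design. In this sketch, a response that breaks the template loses the format term but still receives credit for the other two terms.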

Paper Structure

This paper contains 19 sections, 2 equations, 3 figures, 5 tables.

Figures (3)

  • Figure 1: Overall framework of the proposed FastReasonSeg.
  • Figure 2: Qualitative comparison of video reasoning segmentation results on JiTBench. The figure displays two representative examples requiring complex spatial-temporal reasoning. Each row presents segmentation outputs from different methods across video frames, with green masks indicating accurate predictions and red masks denoting incorrect segmentations. FastReasonSeg-0.6B produces more accurate and consistent masks across temporal sequences compared to baseline approaches, despite operating with fewer parameters than competing methods.
  • Figure 3: Ablation study examining individual digital twin representation components on JiTBench in terms of region similarity ($\mathcal{J}$).
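
Region similarity ($\mathcal{J}$) is conventionally the Jaccard index (intersection over union) between predicted and ground-truth masks, averaged over video frames. The sketch below illustrates that computation on binary masks stored as nested lists. The function names and the empty-mask convention are assumptions of this example, not details taken from the paper's evaluation code.

```python
from typing import List, Sequence

Mask = Sequence[Sequence[int]]  # binary mask as rows of 0/1 values


def frame_iou(pred: Mask, target: Mask) -> float:
    """Intersection over union of two binary masks of equal shape."""
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else intersection / union


def region_similarity(pred_frames: List[Mask], target_frames: List[Mask]) -> float:
    """Mean per-frame IoU over a video (the J score for one sequence)."""
    if len(pred_frames) != len(target_frames):
        raise ValueError("prediction and ground truth must have the same number of frames")
    if not pred_frames:
        raise ValueError("at least one frame is required")
    scores = [frame_iou(p, t) for p, t in zip(pred_frames, target_frames)]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    predictions = [[[1, 1], [0, 0]], [[1, 0], [0, 0]]]
    ground_truth = [[[1, 0], [0, 0]], [[1, 0], [0, 0]]]
    print(region_similarity(predictions, ground_truth))
```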