Fast Reasoning Segmentation for Images and Videos
Yiqing Shen, Mathias Unberath
TL;DR
Open-set reasoning segmentation segments objects from implicit text queries, but it currently depends on large multimodal LLMs. FastReasonSeg uses digital twin representations to decouple perception from reasoning, so that smaller LLMs can perform complex spatial-temporal reasoning. It trains a large teacher LLM on these structured representations and transfers its reasoning through a two-stage distillation: supervised fine-tuning, followed by reinforcement learning with a joint format- and accuracy-based reward plus a reasoning-alignment term. The approach achieves state-of-the-art results on both video and image benchmarks, and the distilled 0.6B model runs at 7.79 FPS with far lower memory use, which makes edge deployment for embodied AI applications feasible. The work shows that structured digital twins combined with this distillation scheme can preserve reasoning quality in compact models.
Abstract
Reasoning segmentation enables open-set object segmentation via implicit text queries, and therefore serves as a foundation for embodied agents that must operate autonomously in real-world environments. However, existing methods for reasoning segmentation require multimodal large language models with billions of parameters, which exceed the computational capabilities of the edge devices on which embodied AI systems are typically deployed. Distillation offers a pathway to compress these models while preserving their capabilities. Yet existing distillation approaches fail to transfer the multi-step reasoning capabilities that reasoning segmentation demands, because they focus on matching output predictions and intermediate features rather than preserving reasoning chains. The emerging paradigm of reasoning over digital twin representations presents an opportunity for more effective distillation by reframing the problem. We therefore propose FastReasonSeg, which employs digital twin representations that decouple perception from reasoning. Our distillation scheme first applies supervised fine-tuning on teacher-generated reasoning chains, and then applies reinforcement fine-tuning with joint rewards that evaluate both segmentation accuracy and reasoning quality alignment. Experiments on two video benchmarks (JiTBench, RVTBench) and two image benchmarks (ReasonSeg, LLM-Seg40K) demonstrate that FastReasonSeg achieves state-of-the-art reasoning segmentation performance. Moreover, the distilled 0.6B variant outperforms models with 20 times more parameters while achieving 7.79 FPS throughput with only 2.1GB memory consumption. This efficiency makes real-time reasoning segmentation feasible in resource-constrained environments.
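To make the joint reward in the reinforcement fine-tuning stage more concrete, the following is a minimal sketch of how a format term, a segmentation accuracy term, and a reasoning alignment term could be combined into one scalar. Everything here is an assumption made for illustration rather than the authors' formulation: the `<think>`/`<answer>` tag layout, the function names, the weights, the use of mask IoU as the accuracy term, and the token-overlap (Jaccard) score that stands in for reasoning alignment.

```python
import re

def mask_iou(pred, target):
    """Intersection over union of two binary masks given as nested lists."""
    p = [bool(v) for row in pred for v in row]
    t = [bool(v) for row in target for v in row]
    union = sum(a or b for a, b in zip(p, t))
    if union == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    inter = sum(a and b for a, b in zip(p, t))
    return inter / union

def format_reward(response):
    """1.0 if the response follows the assumed <think>...</think><answer>...</answer> layout."""
    pattern = r"\s*<think>.+?</think>\s*<answer>.+?</answer>\s*"
    return 1.0 if re.fullmatch(pattern, response, flags=re.DOTALL) else 0.0

def reasoning_alignment(student_chain, teacher_chain):
    """Hypothetical stand-in: Jaccard overlap between student and teacher tokens."""
    s = set(student_chain.lower().split())
    t = set(teacher_chain.lower().split())
    if not s and not t:
        return 1.0
    return len(s & t) / len(s | t)

def joint_reward(response, pred_mask, gt_mask, teacher_chain,
                 w_format=0.2, w_acc=0.5, w_align=0.3):
    """Weighted sum of the three terms; the weights are illustrative only."""
    match = re.search(r"<think>(.*?)</think>", response, flags=re.DOTALL)
    student_chain = match.group(1) if match else ""
    return (w_format * format_reward(response)
            + w_acc * mask_iou(pred_mask, gt_mask)
            + w_align * reasoning_alignment(student_chain, teacher_chain))

if __name__ == "__main__":
    response = "<think>the red cup</think><answer>obj_1</answer>"
    reward = joint_reward(response, [[1, 1], [0, 0]], [[1, 0], [0, 0]], "the red mug")
    print(f"joint reward: {reward:.2f}")
```

In this sketch a malformed response still earns accuracy credit but loses both the format term and the alignment term, since no reasoning chain can be extracted from it. Whether the actual reward gates accuracy on format in the same way is not stated in the abstract.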
