Online Reasoning Video Segmentation with Just-in-Time Digital Twins

Yiqing Shen, Bohan Liu, Chenjia Li, Lalithkumar Seenivasan, Mathias Unberath

TL;DR

This work tackles online video reasoning segmentation (RS) by decoupling perception from high-level reasoning through a just-in-time digital twin. An LLM planner selects specialist vision models and constructs a DAG-based execution graph, while online perception builds a dynamic scene graph (the digital twin) that supports semantic, spatial, and temporal reasoning without fine-tuning LLMs. A just-in-time approach preserves fine-grained spatial-temporal details and enables efficient online processing, with a sliding-window temporal integration and an alpha-smoothed mask generation scheme. The authors also present a new video RS benchmark with 200 videos and 895 implicit queries to assess semantic, spatial, and temporal reasoning, and demonstrate strong improvements over state-of-the-art methods on both video and image RS tasks, highlighting the framework’s potential for embodied AI and real-time robotics.
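
The summary above does not give the exact smoothing formula, so the following is only a minimal sketch of how a sliding window and an alpha blend could be combined to stabilize per-frame masks in an online setting. The function name `smooth_masks`, the parameters `alpha`, `window`, and `threshold`, and the choice of an exponential blend over a window mean are all assumptions made for illustration, not the authors' scheme. Masks are flattened to lists of foreground probabilities to keep the example dependency-free.

```python
from collections import deque


def smooth_masks(frame_probs, alpha=0.6, window=3, threshold=0.5):
    """Yield one binary mask per incoming frame (hypothetical helper).

    frame_probs: iterable of per-frame foreground probabilities (flat lists).
    alpha:       weight of the current window estimate in the blend (assumed).
    window:      number of recent frames kept in the sliding window (assumed).
    """
    recent = deque(maxlen=window)  # sliding window; old frames drop out automatically
    smoothed = None
    for probs in frame_probs:
        recent.append(probs)
        n = len(recent)
        # Average each pixel over the frames currently in the window.
        window_mean = [sum(frame[i] for frame in recent) / n for i in range(len(probs))]
        if smoothed is None:
            smoothed = window_mean
        else:
            # Alpha blend: new evidence vs. the running estimate.
            smoothed = [alpha * w + (1 - alpha) * s for w, s in zip(window_mean, smoothed)]
        yield [1 if p >= threshold else 0 for p in smoothed]


# The first pixel flickers in frame 2 (0.2), but the smoothed mask keeps it on.
frames = [[0.9, 0.1], [0.2, 0.1], [0.9, 0.8]]
masks = list(smooth_masks(frames, alpha=0.5, window=2))
# masks == [[1, 0], [1, 0], [1, 0]]
```

Because the generator consumes frames one at a time and keeps only `window` frames in memory, it fits an online stream. The cost is a short lag before a newly appearing object, such as the second pixel in frame 3, crosses the threshold.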

Abstract

Reasoning segmentation (RS) aims to identify and segment objects of interest based on implicit text queries. As such, RS is a catalyst for embodied AI agents, enabling them to interpret high-level commands without requiring explicit step-by-step guidance. However, current RS approaches rely heavily on the visual perception capabilities of multimodal large language models (LLMs), leading to several major limitations. First, they struggle with queries that require multiple steps of reasoning or that involve complex spatial/temporal relationships. Second, they necessitate LLM fine-tuning, which may require frequent updates to maintain compatibility with contemporary LLMs and may increase the risk of catastrophic forgetting. Finally, being primarily designed for static images or offline video processing, they scale poorly to online video data. To address these limitations, we propose an agent framework that disentangles perception and reasoning for online video RS without LLM fine-tuning. Our innovation is the introduction of a just-in-time digital twin concept, where, given an implicit query, an LLM plans the construction of a low-level scene representation from high-level video using specialist vision models. We refer to this approach to creating a digital twin as "just-in-time" because the LLM planner anticipates the need for specific information and requests only this limited subset instead of always evaluating every specialist model. The LLM then performs reasoning on this digital twin representation to identify target objects. To evaluate our approach, we introduce a new comprehensive video reasoning segmentation benchmark comprising 200 videos with 895 implicit text queries. The benchmark spans three reasoning categories (semantic, spatial, and temporal) and three levels of reasoning-chain complexity.
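
The "just-in-time" idea, requesting only the specialist models whose outputs the query needs, can be illustrated with a small sketch. In the paper the LLM planner makes this decision. Here a fixed set `required_info` stands in for the planner's output, and the registry `SPECIALISTS`, its model names, and the function `plan_specialists` are hypothetical names introduced only for this example.

```python
# Hypothetical registry: specialist model name -> kinds of information it provides.
SPECIALISTS = {
    "detector": {"boxes", "labels"},
    "segmenter": {"masks"},
    "depth_estimator": {"depth"},
    "tracker": {"tracks"},
}


def plan_specialists(required_info, registry=SPECIALISTS):
    """Return the specialists needed to cover required_info, skipping all others.

    required_info stands in for what an LLM planner would derive from the query.
    """
    required = set(required_info)
    # Keep a specialist only if it contributes at least one required item.
    selected = [name for name, provides in registry.items() if provides & required]
    covered = set().union(*(registry[name] for name in selected))
    missing = required - covered
    if missing:
        raise ValueError(f"no specialist provides: {sorted(missing)}")
    return sorted(selected)


# A spatial query such as "the object closest to the camera" might need
# masks and depth, but no tracking.
plan = plan_specialists({"masks", "depth"})
# plan == ["depth_estimator", "segmenter"]
```

The point of the sketch is the omission: `tracker` and `detector` are never invoked for this query, which is where the online efficiency of a just-in-time representation would come from.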

Paper Structure

This paper contains 28 sections, 5 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: Overview of our proposed agent-based framework for video reasoning segmentation. Given an implicit text query, the framework operates in two main stages: (1) the planning stage, where an LLM planner analyzes the query to construct an execution graph and selects query-specific specialist vision models; (2) the execution stage, where perception nodes $V_p$ process incoming video frames to construct and maintain a just-in-time digital twin through scene graph generation and temporal integration. The reasoning nodes $V_r$ then operate on this digital twin representation, combining semantic reasoning nodes (handled by the base LLM) and spatial/temporal reasoning nodes (executed through LLM-coder-generated operations) to produce the final segmentation masks.
  • Figure 2: Three representative examples from our video RS benchmark dataset showcasing increasing reasoning complexity across different categories. Each example is annotated with structured graphs showing the reasoning relationships required at each level.
  • Figure 3: Distribution of samples across different reasoning categories and difficulty levels in our video RS benchmark. Left: Sample distribution for spatial reasoning queries shows a focus on L1 and L2 complexity. Middle: Semantic reasoning samples are concentrated in L3, reflecting more complex multi-step queries. Right: Temporal reasoning samples are relatively balanced across difficulty levels. The pie charts show the overall distribution of samples across difficulty levels (top) and reasoning categories (bottom), indicating a balanced representation of different reasoning types with spatial (52.7%) and semantic (38.7%) categories being predominant.
  • Figure 4: Qualitative comparison of segmentation results on four examples. For each example, we show from top to bottom: input video frames, ground truth masks, LISA-13B results, and our method's results. Our approach demonstrates superior performance in maintaining temporal consistency, understanding complex spatial relationships, and handling multi-step reasoning queries compared to LISA-13B, especially evident in the more challenging Level 3 scenarios.
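
The two-stage design in the Figure 1 caption, a planned execution graph whose perception nodes $V_p$ feed reasoning nodes $V_r$, can be sketched as a dependency graph that is run in topological order. The node names and the graph below are illustrative assumptions, not the graph the planner actually produces. Only `graphlib.TopologicalSorter` from the Python standard library (3.9+) is a real API here.

```python
from graphlib import TopologicalSorter

# Hypothetical execution graph: node -> set of nodes it depends on.
EXECUTION_GRAPH = {
    "detect_objects": set(),
    "estimate_depth": set(),
    "build_scene_graph": {"detect_objects", "estimate_depth"},
    "semantic_reasoning": {"build_scene_graph"},
    "spatial_reasoning": {"build_scene_graph"},
    "select_target": {"semantic_reasoning", "spatial_reasoning"},
}

# Assumed split into perception nodes (V_p) and reasoning nodes (V_r).
V_P = {"detect_objects", "estimate_depth", "build_scene_graph"}
V_R = set(EXECUTION_GRAPH) - V_P


def execution_order(graph):
    """Return a valid run order; raises graphlib.CycleError if the graph is not a DAG."""
    return list(TopologicalSorter(graph).static_order())


order = execution_order(EXECUTION_GRAPH)
perception_steps = [node for node in order if node in V_P]
reasoning_steps = [node for node in order if node in V_R]
```

In this particular graph every reasoning node depends on `build_scene_graph`, so all of `perception_steps` run before any of `reasoning_steps`, mirroring the caption's description of reasoning operating on the digital twin rather than on raw frames.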