Constructing and Interpreting Digital Twin Representations for Visual Reasoning via Reinforcement Learning

Yiqing Shen, Mathias Unberath

TL;DR

DT-R1 reframes visual reasoning as RL-driven construction of, and reasoning over, structured digital twin representations, enabling a single LLM to handle diverse reasoning visual tasks (RVTs) across image and video. It introduces a structured rollout and a rule-based reward that balances output format validity with final-answer accuracy, and it is trained via GRPO with LoRA. The approach yields consistent improvements over task-specific baselines across segmentation, grounding, VQA, and summarization benchmarks, and shows strong cross-domain generalization. This unified framework reduces architectural specialization and opens pathways for scalable, multi-modal visual reasoning, while highlighting efficiency and multi-sensory extensions as future directions.

Abstract

Visual reasoning may require models to interpret images and videos and respond to implicit text queries across diverse output formats, from pixel-level segmentation masks to natural language descriptions. Existing approaches rely on supervised fine-tuning with task-specific architectures. For example, reasoning segmentation, grounding, summarization, and visual question answering each demand distinct model designs and training, preventing unified solutions and limiting cross-task and cross-modality generalization. Hence, we propose DT-R1, a reinforcement learning framework that trains large language models to construct digital twin representations of complex multi-modal visual inputs and then reason over these high-level representations as a unified approach to visual reasoning. Specifically, we train DT-R1 using GRPO with a novel reward that validates both structural integrity and output accuracy. Evaluations on six visual reasoning benchmarks, covering two modalities and four task types, demonstrate that DT-R1 consistently achieves improvements over state-of-the-art task-specific models. DT-R1 opens a new direction where visual reasoning emerges from reinforcement learning with digital twin representations.
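
The idea of a rule-based reward that checks both structure and answer correctness, fed into GRPO, can be illustrated with a minimal sketch. This is not the authors' implementation. The tag names (`think`, `answer`), the exact-match accuracy rule, and the weights `w_format` and `w_acc` are all assumptions made for illustration, and the paper's actual reward terms may differ. The group-relative advantage follows the standard GRPO formulation, in which each rollout's reward is normalized by the mean and standard deviation of its sampling group.

```python
import re
from statistics import mean, pstdev

# Hypothetical tag names: the abstract only says the rollout is structured.
REQUIRED_TAGS = ("think", "answer")


def format_reward(rollout: str) -> float:
    """Return 1.0 if each required tag pair appears exactly once, in order."""
    pos = 0
    for tag in REQUIRED_TAGS:
        pattern = rf"<{tag}>(.*?)</{tag}>"
        matches = list(re.finditer(pattern, rollout, flags=re.DOTALL))
        if len(matches) != 1 or matches[0].start() < pos:
            return 0.0
        pos = matches[0].end()
    return 1.0


def accuracy_reward(rollout: str, target: str) -> float:
    """Exact-match check on the answer span (assumed rule, e.g. for multiple-choice VQA)."""
    m = re.search(r"<answer>(.*?)</answer>", rollout, flags=re.DOTALL)
    if m is None:
        return 0.0
    return 1.0 if m.group(1).strip().lower() == target.strip().lower() else 0.0


def total_reward(rollout: str, target: str,
                 w_format: float = 0.5, w_acc: float = 1.0) -> float:
    """Weighted sum of the two terms; the weights are illustrative assumptions."""
    return w_format * format_reward(rollout) + w_acc * accuracy_reward(rollout, target)


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantage: normalize each reward within its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Usage: score a group of rollouts sampled for one query, then normalize.
group = [
    "<think>the cup is left of the plate</think><answer>B</answer>",
    "<think>unsure</think><answer>A</answer>",
    "free text without any structure",
]
rewards = [total_reward(r, target="B") for r in group]
advantages = group_relative_advantages(rewards)
```

In this sketch, a well-formed but wrong rollout still earns the format term, so it is ranked above an unstructured one within the group. For segmentation or grounding outputs, the exact-match rule would need to be replaced by a task-appropriate metric such as an overlap score.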

Paper Structure

This paper contains 21 sections, 2 equations, 4 figures, 7 tables.

Figures (4)

  • Figure 1: Overall framework of DT-R1 for unified reasoning visual tasks. Given an implicit text query and a visual input (image or video), DT-R1 generates a structured rollout sequence through reinforcement learning.
  • Figure 2: Prompt template for DT-R1. The query placeholder is replaced with the actual query during training and inference.
  • Figure 3: Qualitative comparison of video reasoning segmentation results on JiTBench examples. Each row shows the segmentation results from different methods across time. Green masks indicate correct segmentation where the predicted region matches the ground truth, while red masks denote incorrect predictions. DT-R1 consistently produces accurate segmentations across both temporal sequences, correctly identifying the target regions throughout the video frames.
  • Figure 4: Analysis of the impact of iterative reasoning cycles on DT-R1 performance. (a) Performance across segmentation, grounding, and visual question answering tasks as the maximum number of iterations increases from 1 to 10, showing saturation at 5 iterations. (b) Computational efficiency: the average number of reasoning iterations actually used versus the theoretical maximum, with efficiency percentages indicating the model's adaptive termination behavior.