ReMoT: Reinforcement Learning with Motion Contrast Triplets

Cong Wan, Zeyu Guo, Jiangyang Li, SongLin Dong, Yifan Bai, Lin Peng, Zhiheng Ma, Yihong Gong

TL;DR

The resulting model achieves state-of-the-art performance on a new benchmark and multiple standard VLM benchmarks, culminating in a remarkable 25.1% performance leap on spatio-temporal reasoning tasks.

Abstract

We present ReMoT, a unified training paradigm to systematically address the fundamental shortcomings of VLMs in spatio-temporal consistency -- a critical failure point in navigation, robotics, and autonomous driving. ReMoT integrates two core components: (1) A rule-based automatic framework that generates ReMoT-16K, a large-scale (16.5K triplets) motion-contrast dataset derived from video meta-annotations, surpassing costly manual or model-based generation. (2) Group Relative Policy Optimization, which we empirically validate yields optimal performance and data efficiency for learning this contrastive reasoning, far exceeding standard Supervised Fine-Tuning. We also construct the first benchmark for fine-grained motion contrast triplets to measure a VLM's discrimination of subtle motion attributes (e.g., opposing directions). The resulting model achieves state-of-the-art performance on our new benchmark and multiple standard VLM benchmarks, culminating in a remarkable 25.1% performance leap on spatio-temporal reasoning tasks.
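To make the second component concrete, here is a minimal sketch of the group-relative advantage at the core of Group Relative Policy Optimization: several responses are sampled for one prompt, and each reward is normalized against the mean and standard deviation of its own group. The function name, the `eps` constant, the use of the population standard deviation, and the example rewards are illustrative assumptions, not details taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the group sampled for the same prompt.

    `eps` (an assumed constant) avoids division by zero when every
    response in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to one prompt, scored 1.0 (correct) or 0.0 (wrong)
# by a hypothetical rule-based reward.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Responses that score above the group average receive a positive advantage and the rest a negative one. A group with identical rewards yields all-zero advantages and therefore contributes no learning signal.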

Paper Structure (48 sections, 9 equations, 8 figures, 16 tables, 1 algorithm)


Figures (8)

  • Figure 1: Common failure modes of large Vision–Language Models (VLMs) in spatio–temporal reasoning. The figure presents four multi‑image examples drawn from navigation, robotic manipulation, indoor exploration, and game simulation scenarios. Each example provides multiple related images and a question about their spatial or temporal relationship. Recent VLMs (GPT‑4o, Claude‑Sonnet‑4.5, Gemini‑2.5‑Pro, Qwen3‑VL) give incorrect responses—such as reversing camera rotation, misjudging object openness, or confusing character motion—which are indicated by red crosses. The errors illustrate that current VLMs struggle to reason consistently about spatial correspondence and physical change across multiple views.
  • Figure 2: Overview of the Triplet Motion Contrasts pipeline. Raw videos and meta‑annotations, such as camera parameters, are processed with rule‑based operations to construct motion‑contrast triplets that encode spatial and temporal changes. The figure shows representative cases, including camera rotation, manipulation, and masked‑frame contrast, as well as the training paradigms (SFT, SFT+GRPO, GRPO).
  • Figure 3: Data scaling analysis across construction pipelines. (a) Our multi-expert pipeline shows smooth scaling, with GRPO reaching 0.61 and cross-validation variants peaking at 0.64--0.66. (b) VLM-generated data exhibits volatile scaling and a lower ceiling ($\sim$0.49). Dashed lines show average word counts, which decline as models learn more concise reasoning with increased data.
  • Figure 4: Visual Comparisons. We compare Qwen3-VL and ReMoT across four challenging scenarios spanning gripper state transitions, camera movement analysis, object segmentation, and directional spatial reasoning. These tasks require distinguishing subtle motion attributes where visual appearances are highly similar but semantic meanings differ significantly. Qwen3-VL frequently misinterprets ambiguous cases and produces contradictory conclusions (underlined in red), while ReMoT leverages structured reasoning chains (highlighted in green) to accurately resolve fine-grained distinctions by integrating temporal dynamics and spatial relationships.
  • Figure 5: Example 1. Prompt: The image showed to you is what the robot seen by its eyes. In the image, the robotic arm on the left is the robot's left arm, and the robotic arm on the right is the robot's right arm. Focus only on robot arm/gripper motion across the three images. Please select from the following options the vertical movement direction of the left robotic arm from Image 1 to Image 2? A: Up, B: Down, C: No movement. Please select from the following options the vertical movement direction of the left robotic arm from Image 1 to Image 3? A: Up, B: No movement, C: Down. Please select from the following options the vertical movement direction of the left robotic arm from Image 2 to Image 3? A: Down, B: No movement, C: Up. Answer all three questions above in order. Only return the correct option A, B, or C for each of the three questions in order inside <answer></answer>, e.g., <answer>CAB</answer>. Answer: BAC
  • ...and 3 more figures
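
The Figure 5 prompt asks for three option letters inside answer tags, a format that can be checked by rule. The sketch below shows one way such a response might be parsed and scored against the reference answer. The helper names, the tolerant regular expression, and the all-or-nothing scoring are assumptions made for illustration; the reward actually used in the paper is not described in this section.

```python
import re

# Accepts "<answer>BAC</answer>" and tolerates stray spaces such as "< answer>".
ANSWER_RE = re.compile(r"<\s*answer\s*>\s*([A-Ca-c]{3})\s*<\s*/\s*answer\s*>")


def parse_answer(response):
    """Return the three option letters in upper case, or None if absent."""
    match = ANSWER_RE.search(response)
    return match.group(1).upper() if match else None


def triplet_reward(response, gold):
    """Hypothetical all-or-nothing reward: 1.0 only if all three letters match."""
    return 1.0 if parse_answer(response) == gold else 0.0


# The reference answer for the Figure 5 example is "BAC".
reward_correct = triplet_reward("The arm lowers, then rises. <answer>BAC</answer>", "BAC")
reward_partial = triplet_reward("<answer>BAA</answer>", "BAC")
reward_missing = triplet_reward("B, A, C", "BAC")
```

Scoring the three questions jointly means a response earns credit only when its answers for the 1→2, 1→3, and 2→3 pairs are all correct, which is one simple way to reward consistency across a triplet. Per-question partial credit would be an equally plausible design.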