MotionBits: Video Segmentation through Motion-Level Analysis of Rigid Bodies

Howard H. Qian, Kejia Ren, Yu Xiang, Vicente Ordonez, Kaiyu Hang

TL;DR

This paper introduces MotionBit, a novel concept that, unlike prior formulations, defines the smallest unit in motion-based segmentation through kinematic spatial twist equivalence, independent of semantics. It addresses the gap left by segmentation models trained on semantic grouping.

Abstract

Rigid bodies constitute the smallest manipulable elements in the real world, and understanding how they physically interact is fundamental to embodied reasoning and robotic manipulation. Thus, accurate detection, segmentation, and tracking of moving rigid bodies are essential for enabling reasoning modules to interpret and act in diverse environments. However, current segmentation models trained on semantic grouping are limited in their ability to provide meaningful interaction-level cues for completing embodied tasks. To address this gap, we introduce MotionBit, a novel concept that, unlike prior formulations, defines the smallest unit in motion-based segmentation through kinematic spatial twist equivalence, independent of semantics. In this paper, we contribute (1) the MotionBit concept and definition, (2) a hand-labeled benchmark, called MoRiBo, for evaluating moving rigid-body segmentation across robotic manipulation and human-in-the-wild videos, and (3) a learning-free graph-based MotionBits segmentation method that outperforms state-of-the-art embodied perception methods by 37.3% in macro-averaged mIoU on the MoRiBo benchmark. Finally, we demonstrate the effectiveness of MotionBits segmentation for downstream embodied reasoning and manipulation tasks, highlighting its importance as a fundamental primitive for understanding physical interactions.

Paper Structure (19 sections, 13 equations, 15 figures, 3 tables, 1 algorithm)

Figures (15)

  • Figure 1: The top row shows a robot physically interacting with a variety of complex composite objects, each constructed from colored blocks that have been glued together. The bottom row highlights the output of a standard semantic segmentation model, which incorrectly over-segments the objects. In contrast, the proposed MotionBits segmentation correctly groups composite objects. Example real-world robotic applications requiring accurate MotionBits segmentation are shown in \ref{fig:application_study_visualizations}.
  • Figure 2: Illustration of spatial twist for two rigid objects. Transparent objects represent initial positions, and solid objects represent positions after motion. The translations of body frames are represented by purple linear velocity vectors $\upsilon_{\{x\}}$. Although $\upsilon_{\{a_1\}} \neq \upsilon_{\{a_2\}}$ and $\upsilon_{\{b_1\}} \neq \upsilon_{\{b_2\}}$, transforming them to the fixed world frame $\{s\}$ yields identical linear velocities $\upsilon_{\{s\}}^x$ for frames on the same rigid object $x$ but distinct linear velocities for different rigid objects. Intuitively, the instantaneous motion observed at the world frame will appear identical for body frames on the same rigid object, regardless of their local motions, but will differ across rigid bodies.
  • Figure 3: Examples from the new MoRiBo benchmark across both tracks, robotic manipulation and human-in-the-wild. Hand-labeled final-frame segmentation masks of moving rigid bodies are provided for every video.
  • Figure 4: Our learning-free graph-based method for online MotionBits segmentation.
  • Figure 5: Qualitative comparison of moving rigid-body segmentation on the two-track MoRiBo benchmark between Qwen2.5-VL (QwenVL) [Qwen2.5-VL], Segment Any Motion in Videos (SAMIV) [seganymo], and our method.
  • ...and 10 more figures
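
The spatial-twist intuition in the Figure 2 caption can be checked numerically. The following is a minimal sketch, not the paper's implementation. It assumes each rigid body's motion is given by an angular velocity and the world-frame velocity of one reference point, and it uses the standard rigid-body relation v_s = v_p - ω × p to express the linear velocity observed at the origin of the fixed world frame {s}. All names (`point_velocity`, `spatial_twist`) and all numeric values are hypothetical and chosen only for illustration.

```python
import numpy as np


def point_velocity(omega, v_ref, p_ref, p):
    """World-frame velocity of point p on a rigid body that rotates with
    angular velocity omega and whose reference point p_ref moves at v_ref."""
    return v_ref + np.cross(omega, p - p_ref)


def spatial_twist(omega, p, v_p):
    """Spatial twist [omega, v_s] seen from the world origin, computed from
    a body point p and its world-frame velocity v_p (v_s = v_p - omega x p)."""
    v_s = v_p - np.cross(omega, p)
    return np.concatenate([omega, v_s])


# Rigid body A (illustrative values): spinning about z while translating.
omega_a = np.array([0.0, 0.0, 1.0])
p_ref_a = np.array([1.0, 0.0, 0.0])
v_ref_a = np.array([0.2, 0.0, 0.0])
a1 = np.array([1.5, 0.5, 0.0])   # origin of body frame {a1}
a2 = np.array([0.5, -0.5, 0.0])  # origin of body frame {a2}

# Rigid body B (illustrative values): a different rotation and translation.
omega_b = np.array([0.0, 0.0, -0.5])
p_ref_b = np.array([-1.0, 2.0, 0.0])
v_ref_b = np.array([0.0, 0.3, 0.0])
b1 = np.array([-1.0, 2.5, 0.0])  # origin of body frame {b1}

# Local linear velocities differ between frames on the same body...
v_a1 = point_velocity(omega_a, v_ref_a, p_ref_a, a1)
v_a2 = point_velocity(omega_a, v_ref_a, p_ref_a, a2)
v_b1 = point_velocity(omega_b, v_ref_b, p_ref_b, b1)

# ...but the spatial twists agree within a body and differ across bodies.
twist_a1 = spatial_twist(omega_a, a1, v_a1)
twist_a2 = spatial_twist(omega_a, a2, v_a2)
twist_b1 = spatial_twist(omega_b, b1, v_b1)

print("same body, local velocities equal:", np.allclose(v_a1, v_a2))
print("same body, spatial twists equal:  ", np.allclose(twist_a1, twist_a2))
print("different bodies, twists equal:   ", np.allclose(twist_a1, twist_b1))
```

In this sketch, grouping frames by equal spatial twist recovers the two rigid bodies without using any semantic information, which is the property the MotionBit definition relies on. How twists are estimated from video and compared under noise is not covered here.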

Theorems & Definitions (1)

  • Definition 1: MotionBit