RoboAct-CLIP: Video-Driven Pre-training of Atomic Action Understanding for Robotics

Zhiyuan Zhang, Yuxin He, Yong Sun, Junyu Shi, Lijiang Liu, Qiang Nie

TL;DR

This work tackles the lack of temporal action understanding in Vision-Language Models for robotics by introducing RoboAct-CLIP, which decouples temporal action dynamics from object and environment features. It combines a Temporal Diff-Transformer with a Feature Disentanglement module and a dataset reconstruction pipeline to learn atomic action representations and enable recombinable, text-guided understanding. The approach is trained with a CLIP-based objective plus disentanglement and auxiliary supervision, and validated on RH20T-derived data with simulation and real-robot experiments, showing a 12 percentage-point improvement over strong baselines in simulated manipulation tasks. The results suggest that temporal-aware, disentangled representations improve generalization and robustness for language-guided robotic manipulation in open-world settings.

Abstract

Visual Language Models (VLMs) have emerged as pivotal tools for robotic systems, enabling cross-task generalization, dynamic environmental interaction, and long-horizon planning through multimodal perception and semantic reasoning. However, existing open-source VLMs, predominantly trained for generic vision-language alignment tasks, fail to effectively model the temporally correlated action semantics that are crucial for robotic manipulation. While current image-based fine-tuning methods partially adapt VLMs to robotic applications, they fundamentally disregard temporal evolution patterns in video sequences and suffer from visual feature entanglement between robotic agents, manipulated objects, and environmental contexts, thereby limiting semantic decoupling capability for atomic actions and compromising model generalizability. To overcome these challenges, this work presents RoboAct-CLIP with dual technical contributions: 1) A dataset reconstruction framework that performs semantic-constrained action unit segmentation and re-annotation on open-source robotic videos, constructing purified training sets containing singular atomic actions (e.g., "grasp"); 2) A temporal-decoupling fine-tuning strategy based on the Contrastive Language-Image Pretraining (CLIP) architecture, which disentangles temporal action features across video frames from object-centric characteristics to achieve hierarchical representation learning of robotic atomic actions. Experimental results in simulated environments demonstrate that the RoboAct-CLIP pretrained model achieves a 12% higher success rate than baseline VLMs, along with superior generalization in multi-object manipulation tasks.
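
For readers unfamiliar with the CLIP-style objective mentioned in the abstract, the following is a minimal sketch of a generic symmetric contrastive loss between video-clip embeddings and text embeddings. It is not the paper's actual training loss, which also includes disentanglement and auxiliary terms not shown here. The function names, the `temperature` value of 0.07, and the assumption that each clip has already been pooled into a single embedding vector are all illustrative choices, not details taken from the paper.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def clip_contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Generic symmetric contrastive loss (hypothetical helper, not from the paper).

    video_emb: (batch, dim) array, one pooled embedding per video clip (assumed).
    text_emb:  (batch, dim) array, row i is the caption paired with clip i.
    """
    v = l2_normalize(video_emb)
    t = l2_normalize(text_emb)
    logits = v @ t.T / temperature          # (batch, batch) similarity matrix
    idx = np.arange(len(v))                 # matching pairs sit on the diagonal
    loss_v2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # video -> text
    loss_t2v = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> video
    return 0.5 * (loss_v2t + loss_t2v)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clips = rng.normal(size=(4, 8))         # stand-in for pooled clip features
    captions_aligned = clips.copy()         # perfectly matched text embeddings
    captions_shuffled = clips[::-1].copy()  # deliberately mismatched pairing
    print("aligned loss: ", clip_contrastive_loss(clips, captions_aligned))
    print("shuffled loss:", clip_contrastive_loss(clips, captions_shuffled))
```

In this sketch the loss is small when each clip embedding is closest to its own caption and grows when the pairing is shuffled, which is the property a contrastive objective relies on to align the two modalities.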

Paper Structure

This paper contains 14 sections, 20 equations, 3 figures, 3 tables, 1 algorithm.

Figures (3)

  • Figure 1: Overall framework of RoboAct-CLIP.
  • Figure 2: Visualization of RoboAct-CLIP performing four manipulation tasks in the Franka Kitchen environment. Each row shows a different task. Our model demonstrates precise control throughout the action sequences, successfully completing diverse manipulation tasks requiring different interaction patterns.
  • Figure 3: Execution sequence of the real-world manipulation task using RoboAct-CLIP.