How Physics and Background Attributes Impact Video Transformers in Robotic Manipulation: A Case Study on Planar Pushing

Shutong Jin; Ruiyu Wang; Muhammad Zahid; Florian T. Pokorny

TL;DR

This study investigates how dataset composition, specifically physics attributes and background dynamics, affects video-transformer performance in robotic planar pushing. Using the large real-world CloudGripper-Push-1K dataset and the Video Occlusion Transformer (VOT) with three choices of 2D-spatial encoder, the authors run ablations across 18 sub-datasets to assess zero-shot generalization and fine-tuning behavior. Key findings: background dynamics can improve generalization despite the added scene complexity, the models are highly sensitive to color changes, and the amount of fine-tuning data strongly affects achievable performance, with differences between architectures. The work offers practical guidance for dataset design and transfer-learning strategies in video-transformer-based robotic manipulation, and makes the dataset and code available to the community.

Abstract

As model and dataset sizes continue to scale in robot learning, the need to understand how the composition and properties of a dataset affect model performance becomes increasingly urgent to ensure cost-effective data collection and model performance. In this work, we empirically investigate how physics attributes (color, friction coefficient, shape) and scene background characteristics, such as the complexity and dynamics of interactions with background objects, influence the performance of Video Transformers in predicting planar pushing trajectories. We investigate three primary questions: How do physics attributes and background scene characteristics influence model performance? What kind of changes in attributes are most detrimental to model generalization? What proportion of fine-tuning data is required to adapt models to novel scenarios? To facilitate this research, we present CloudGripper-Push-1K, a large real-world vision-based robot pushing dataset comprising 1278 hours and 460,000 videos of planar pushing interactions with objects with different physics and background attributes. We also propose Video Occlusion Transformer (VOT), a generic modular video-transformer-based trajectory prediction framework which features 3 choices of 2D-spatial encoders as the subject of our case study. The dataset and source code are available at https://cloudgripper.org.

Paper Structure (13 sections, 2 equations, 5 figures, 3 tables)

Introduction
Related Work
Model and Dataset Scalability
Existing Video Datasets
Video Transformers
Dataset
Model
Experiments
Implementation Details
Experiments with Varying Scene Background Characteristics
Experiments with Varying Physics Attributes
Fine-Tuning and Qualitative Observations
Limitations and Conclusions

Figures (5)

Figure 1: An illustration of the data collection process on the CloudGripper platform, an open-source cloud robotics testbed with 32 robot arm cells. We collect planar robot pushing interaction videos from two camera views. (a) Collected top and bottom camera views and CloudGripper robot cell side view. (b) An illustration of target and background objects used in the case study. (c) Illustration of object and gripper self-occlusion present in collected top camera videos.
Figure 2: An illustration of CloudGripper-Push-1K with selected trajectories. (a) Target object Ball. (b) Target object Foam and one background object. (c) Target object Cube and two background objects. (d) Target object Icosahedron and four background objects.
Figure 3: (a) VOT model structure: following the generic approach detailed in VTN [neimark2021video], our framework employs a modular design, starting with a 2D-spatial encoder followed by a temporal encoder. (b) The three types of 2D-spatial encoders adopted in the design. (c) Temporal encoder. (d) The three types of attention mechanisms used in the 2D-spatial encoders.
Figure 4: PE vs. fine-tuning dataset size. Models are pre-trained for 90 epochs on the Single dataset, then fine-tuned on Quintuple data for 10 epochs and evaluated on Quintuple. The x-axis represents the number of fine-tuning videos. (a) Fine-tuning curves of VOT-MaxViT (VOT-1) and VOT-MaxViT-2 (VOT-2) with the x-axis ranging from 300 to 3000. (b) Fine-tuning curves of VOT-Swin-T (SWIN) with the x-axis ranging from 700 to 7000.
Figure 5: Examples of predicted and corresponding ground-truth trajectories, with Ball being the target object. (a) A complicated long trajectory of an object colliding with the boundary four times. (b) A successful prediction that featured heavy occlusion in the top camera view. (c) A failure mode. Note that the discontinuities in labels and predicted trajectories are a result of the downsampling of input videos. Refer to the supplementary video for more details.
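
The modular layout described in the Figure 3(a) caption (a per-frame 2D-spatial encoder followed by a temporal encoder) can be illustrated with a minimal pure-Python sketch. This is not the authors' VOT implementation. The linear map standing in for the spatial encoder, the projection-free single-head attention, the per-frame (x, y) output head, and all function names are assumptions made for illustration only.

```python
import math
import random


def spatial_encoder(frame, weights):
    """Hypothetical stand-in for a 2D-spatial encoder:
    flatten one frame and apply a linear map to get a frame token."""
    pixels = [p for row in frame for p in row]
    return [sum(w * p for w, p in zip(w_row, pixels)) for w_row in weights]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_encoder(tokens):
    """Single-head self-attention across frame tokens.
    Learned query/key/value projections are omitted for brevity (assumption)."""
    d = len(tokens[0])
    mixed = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        attn = softmax(scores)
        mixed.append([sum(a * v[i] for a, v in zip(attn, tokens)) for i in range(d)])
    return mixed


def predict_trajectory(video, spatial_w, head_w):
    """Spatial encoding per frame, temporal mixing across frames,
    then a linear head mapping each token to an (x, y) position (assumed output format)."""
    tokens = [spatial_encoder(frame, spatial_w) for frame in video]
    mixed = temporal_encoder(tokens)
    return [tuple(sum(w * t for w, t in zip(w_row, tok)) for w_row in head_w)
            for tok in mixed]


# Toy usage: 4 frames of 3x3 "pixels", 5-dimensional tokens, random weights.
rng = random.Random(0)
T, H, W, D = 4, 3, 3, 5
video = [[[rng.random() for _ in range(W)] for _ in range(H)] for _ in range(T)]
spatial_w = [[rng.uniform(-1, 1) for _ in range(H * W)] for _ in range(D)]
head_w = [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(2)]
trajectory = predict_trajectory(video, spatial_w, head_w)
print(len(trajectory), "predicted (x, y) points")
```

Because the spatial encoder is an argument-free function of a single frame, a different encoder can be swapped in without touching the temporal stage, which is the point of the modular design.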
