TrackDeform3D: Markerless and Autonomous 3D Keypoint Tracking and Dataset Collection for Deformable Objects

Yeheng Zong, Yizhou Chen, Alexander Bowler, Chia-Tung Yang, Ram Vasudevan

Abstract

Structured 3D representations such as keypoints and meshes offer compact, expressive descriptions of deformable objects, jointly capturing geometric and topological information useful for downstream tasks such as dynamics modeling and motion planning. However, robustly extracting such representations remains challenging, as current perception methods struggle to handle complex deformations. Moreover, large-scale 3D data collection remains a bottleneck: existing approaches either require prohibitive data collection efforts, such as labor-intensive annotation or expensive motion capture setups, or rely on simplifying assumptions that break down in unstructured environments. As a result, large-scale 3D datasets and benchmarks for deformable objects remain scarce. To address these challenges, this paper presents an affordable and autonomous framework for collecting 3D datasets of deformable objects using only RGB-D cameras. The proposed method identifies 3D keypoints and robustly tracks their trajectories, incorporating motion consistency constraints to produce temporally smooth and geometrically coherent data. TrackDeform3D is evaluated against several state-of-the-art tracking methods across diverse object categories and demonstrates consistent improvements in both geometric and tracking accuracy. Using this framework, this paper presents a high-quality, large-scale dataset consisting of 6 deformable objects, totaling 110 minutes of trajectory data.

Paper Structure

This paper contains 27 sections, 1 equation, 7 figures, 4 tables, 2 algorithms.

Figures (7)

  • Figure 1: This paper presents TrackDeform3D, a system for keypoint initialization, tracking, and dataset generation for 1D and 2D deformable objects during dual-arm robotic manipulation. For each object category, the figure illustrates tracked keypoint trajectories during manipulation and the resulting 3D trajectory data. The 3D trajectory is visualized from a perspective that is orthogonal to the visualized RGB camera frame. Color encodes spatial position across keypoints, while transparency encodes temporal progression, with earlier frames appearing more transparent and later frames more opaque.
  • Figure 2: Overview of TrackDeform3D. Given an RGB-D video, TrackDeform3D first lifts depth images to point clouds and segments the deformable object via point cloud differencing (§\ref{sec:Segmentation}). From the first segmented frame, the object type is classified and 3D keypoints are initialized by detecting anchor points, generating warm-start positions, inferring object topology, and solving a constrained geometric optimization (§\ref{sec:problem_formulation}, §\ref{sec:initialization}). During tracking, anchor points are re-detected at each frame and the same optimization is solved recursively, warm-started from the previous solution. A temporal moving-average filter is applied to suppress high-frequency jitter, producing smooth and temporally consistent 3D keypoint trajectories (§\ref{sec:tracking}).
  • Figure 3: An illustration of anchor point placement on different deformable object categories: DLO, BDLO, fabric, and T-shirt.
  • Figure 4: Top: Experimental setup. Bottom: Constructed deformable objects.
  • Figure 5: Top: Qualitative comparison of keypoint tracking results between TrackDeform3D and all baseline methods on a cloth sequence at the start and end of manipulation. While CDCPD2, the best-performing baseline, keeps track of the deformable object, it exhibits inconsistent edge lengths and distorted mesh topology over time. SpatialTracker and CoTracker both show significant keypoint drift by the end of the sequence. Bottom: Edge length RMSE and Chamfer distance over time for all methods.
  • ...and 2 more figures
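
Several of the operations named in the captions above (lifting depth images to point clouds, temporal moving-average filtering, and the edge length RMSE and Chamfer distance metrics) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names, the pinhole intrinsics `fx`, `fy`, `cx`, `cy`, the causal averaging window, and the exact metric definitions (unsquared, symmetric Chamfer distance; RMSE against fixed rest lengths) are all assumptions made for illustration, and the paper may define them differently.

```python
import numpy as np


def lift_depth_to_points(depth, fx, fy, cx, cy):
    """Back-project a depth image (H, W) into an (N, 3) point cloud.

    Assumes a pinhole camera model; pixels with depth <= 0 are dropped.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    valid = depth > 0
    z = depth[valid]
    x = (u[valid] - cx) * z / fx
    y = (v[valid] - cy) * z / fy
    return np.stack([x, y, z], axis=1)


def moving_average(traj, window=5):
    """Causal moving average over time for a (T, K, 3) keypoint trajectory.

    Frame t is averaged with up to `window - 1` preceding frames.
    The window size is a hypothetical choice, not a value from the paper.
    """
    out = np.empty_like(traj, dtype=float)
    for t in range(traj.shape[0]):
        start = max(0, t - window + 1)
        out[t] = traj[start:t + 1].mean(axis=0)
    return out


def chamfer_distance(a, b):
    """Symmetric Chamfer distance between point sets a (N, 3) and b (M, 3).

    Assumed form: mean nearest-neighbour distance in both directions, summed.
    Brute force, so only suitable for small point sets.
    """
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    return float(d.min(axis=1).mean() + d.min(axis=0).mean())


def edge_length_rmse(keypoints, edges, rest_lengths):
    """RMSE between current edge lengths and reference (rest) lengths.

    keypoints: (K, 3) positions; edges: (E, 2) integer index pairs;
    rest_lengths: (E,) reference length of each edge.
    """
    diffs = keypoints[edges[:, 0]] - keypoints[edges[:, 1]]
    lengths = np.linalg.norm(diffs, axis=1)
    return float(np.sqrt(np.mean((lengths - rest_lengths) ** 2)))
```

Under these assumptions, a low edge length RMSE indicates that tracked keypoints preserve the object's inferred topology, while a low Chamfer distance indicates that they stay close to the observed point cloud. The two metrics are complementary, which is consistent with Figure 5 reporting both.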