UniStateDLO: Unified Generative State Estimation and Tracking of Deformable Linear Objects Under Occlusion for Constrained Manipulation

Kangchen Lv, Mingrui Yu, Shihefeng Wang, Xiangyang Ji, Xiang Li

TL;DR

Deformable linear objects present severe perception challenges under occlusion, hindering reliable manipulation. UniStateDLO addresses this with a unified diffusion-based framework that performs both single-frame state estimation and cross-frame tracking, using a two-branch architecture to fuse global robustness with local geometric precision. Trained purely on synthetic data, it generalizes zero-shot to real DLOs and delivers real-time, temporally coherent reconstructions that enable stable closed-loop manipulation in constrained environments. Extensive simulations and real-world experiments demonstrate superior occlusion robustness and tracking stability compared with state-of-the-art baselines, establishing UniStateDLO as a strong front-end perception module for DLO manipulation.

Abstract

Perception of deformable linear objects (DLOs), such as cables, ropes, and wires, is the cornerstone of successful downstream manipulation. Although vision-based methods have been extensively explored, they remain highly vulnerable to occlusions that commonly arise in constrained manipulation environments due to surrounding obstacles, large and varying deformations, and limited viewpoints. Moreover, the high dimensionality of the state space, the lack of distinctive visual features, and the presence of sensor noise further compound the challenges of reliable DLO perception. To address these open issues, this paper presents UniStateDLO, the first complete deep-learning-based DLO perception pipeline that achieves robust performance under severe occlusion, covering both single-frame state estimation and cross-frame state tracking from partial point clouds. Both tasks are formulated as conditional generative problems, leveraging the strong capability of diffusion models to capture the complex mapping between highly partial observations and high-dimensional DLO states. UniStateDLO effectively handles a wide range of occlusion patterns, including initial occlusion, self-occlusion, and occlusion caused by multiple objects. In addition, it exhibits strong data efficiency, as the entire network is trained solely on a large-scale synthetic dataset, enabling zero-shot sim-to-real generalization without any real-world training data. Comprehensive simulation and real-world experiments demonstrate that UniStateDLO outperforms all state-of-the-art baselines in both estimation and tracking, producing globally smooth yet locally precise DLO state predictions in real time, even under substantial occlusions. Its integration as the front-end module in a closed-loop DLO manipulation system further demonstrates its ability to support stable feedback control in complex, constrained 3-D environments.

Paper Structure

This paper contains 42 sections, 28 equations, 20 figures, 5 tables, 1 algorithm.

Figures (20)

  • Figure 1: We propose UniStateDLO, a novel unified perception framework for deformable linear objects (DLOs) that supports both single-frame state estimation and cross-frame tracking of DLOs under severe occlusions. Leveraging diffusion-based generative modeling, UniStateDLO reconstructs complete DLO configurations from even highly partial point clouds with strong accuracy, robustness and real-time performance. Trained entirely on synthetic data, it generalizes in a zero-shot manner to diverse real-world DLOs and provides a reliable perception front-end for constrained manipulation tasks.
  • Figure 2: Illustration of the DLO perception task and the notation of key variables. Given partial DLO point clouds (red points) extracted from RGB-D images, single-frame state estimation and cross-frame tracking aim to reconstruct a sequential chain of nodes (blue connected dots), either independently from each frame or across a temporal sequence.
  • Figure 3: Overview of the proposed UniStateDLO pipeline, comprising Single-Frame State Estimation for initialization and Cross-Frame State Tracking for sequential motion tracking. Given a partial DLO point cloud, the state estimation module first produces coarse predictions through two complementary branches based on PointNet++ features, and then refines them via a diffusion model. For cross-frame tracking, a KNN-based feature aggregation module extracts node-wise local features around the previous frame's predictions, followed by another diffusion model that infers per-node cross-frame motion.
  • Figure 4: Illustration of the predicted point-wise heatmap values and unit offsets. Within the neighborhood of a node, points closer to the node have a higher heatmap value (visualized as a darker color), and the unit offset is the normalized direction from the input point to the desired node.
  • Figure 5: Illustration of the diffusion-based fusion module. The nodes estimated by regression (blue points) are always globally smooth but imprecise, whereas voting results (orange points) are locally precise but unreliable inside the occluded region. Conditioned on the coarse estimates from both branches, a diffusion-based generative model incorporating graph convolutional layers fuses their outputs to obtain the final node sequence (purple points).
  • ...and 15 more figures
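
The Figure 4 caption describes two per-point quantities: a heatmap value that grows as an input point gets closer to a node, and a unit offset that points from the input point toward that node. The sketch below is one possible way to compute such quantities for a single point and node. The linear falloff `1 - dist / radius`, the `radius` parameter, and the function name `heatmap_and_unit_offset` are assumptions made for illustration; the captions shown here do not state the paper's exact definitions.

```python
import math


def heatmap_and_unit_offset(point, node, radius):
    """Return (heatmap value, unit offset) of one input point w.r.t. one node.

    Assumed for illustration only: the heatmap falls off linearly with
    distance and is zero outside a neighborhood of the given radius.
    """
    # Vector from the input point to the node.
    diff = [n - p for p, n in zip(point, node)]
    dist = math.sqrt(sum(d * d for d in diff))

    # Points outside the node's neighborhood get no heatmap value or offset.
    if dist >= radius:
        return 0.0, (0.0, 0.0, 0.0)

    # Closer points get a higher value: 1 at the node, 0 at the radius.
    heat = 1.0 - dist / radius

    # A point exactly on the node has no well-defined direction.
    if dist == 0.0:
        return heat, (0.0, 0.0, 0.0)

    # Unit offset: normalized direction from the point to the node.
    return heat, tuple(d / dist for d in diff)


if __name__ == "__main__":
    heat, offset = heatmap_and_unit_offset(
        (0.0, 0.0, 0.0), (0.02, 0.0, 0.0), radius=0.04
    )
    print(heat, offset)
```

In a voting scheme of this kind, each visible point can propose a node location by stepping along its unit offset, with the heatmap value indicating how much weight that proposal deserves. How UniStateDLO aggregates these votes is not specified in the captions above.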