cuRoboV2: Dynamics-Aware Motion Generation with Depth-Fused Distance Fields for High-DoF Robots

Balakumar Sundaralingam, Adithyavairavan Murali, Stan Birchfield

TL;DR

A ground-up codebase redesign for discoverability enabled LLM coding assistants to author up to 73% of new modules, including hand-optimized CUDA kernels, demonstrating that well-structured robotics code can unlock productive human--LLM collaboration.

Abstract

Effective robot autonomy requires motion generation that is safe, feasible, and reactive. Current methods are fragmented: fast planners output physically unexecutable trajectories, reactive controllers struggle with high-fidelity perception, and existing solvers fail on high-DoF systems. We present cuRoboV2, a unified framework with three key innovations: (1) B-spline trajectory optimization that enforces smoothness and torque limits; (2) a GPU-native TSDF/ESDF perception pipeline that generates dense signed distance fields covering the full workspace, unlike existing methods that only provide distances within sparsely allocated blocks, up to 10x faster and in 8x less memory than the state-of-the-art at manipulation scale, with up to 99% collision recall; and (3) scalable GPU-native whole-body computation, namely topology-aware kinematics, differentiable inverse dynamics, and map-reduce self-collision, that achieves up to 61x speedup while also extending to high-DoF humanoids (where previous GPU implementations fail). On benchmarks, cuRoboV2 achieves 99.7% success under 3kg payload (where baselines achieve only 72--77%), 99.6% collision-free IK on a 48-DoF humanoid (where prior methods fail entirely), and 89.5% retargeting constraint satisfaction (vs. 61% for PyRoki); these collision-free motions yield locomotion policies with 21% lower tracking error than PyRoki and 12x lower cross-seed variance than mink. A ground-up codebase redesign for discoverability enabled LLM coding assistants to author up to 73% of new modules, including hand-optimized CUDA kernels, demonstrating that well-structured robotics code can unlock productive human--LLM collaboration. Together, these advances provide a unified, dynamics-aware motion generation stack that scales from single-arm manipulators to full humanoids.

Paper Structure (69 sections, 8 equations, 39 figures, 4 tables, 5 algorithms)

Figures (39)

  • Figure 1: Trajectory optimization loop. Each iteration evaluates B-spline waypoints, computes kinematics and inverse dynamics in parallel, evaluates costs (scene collision, self-collision, configuration-space bounds) concurrently on separate CUDA streams, aggregates them, backpropagates gradients through all forward operations, and updates the B-spline control points via L-BFGS.
  • Figure 2: Local support property of cubic B-splines. Perturbing knot $u_3$ only affects 4 neighboring curve segments (orange region). Interpolation points (dots) are uniformly spaced in time; during optimization, each point within this region contributes a gradient to $u_3$ (accumulated with GPU warp-level reductions). (For simplicity, $N_{\text{interp}}=2$ is shown.)
  • Figure 3: TSDF-ESDF pipeline. Depth images are fused into a block-sparse TSDF via voxel-centric projection (Sec. \ref{sec:tsdf_integration}), and known geometry is stamped analytically. After each update, TSDF weights are decayed (time + frustum factors), and blocks below a weight threshold are tombstoned and recycled. On demand, a dense ESDF is generated in three stages: (1) seeding surface sites from zero-crossings, (2) propagating distances via PBA+, and (3) recovering signs for geometry-layer voxels beyond the truncation band.
  • Figure 4: Block-sparse TSDF storage. Only blocks near observed surfaces are allocated. A hash table maps block coordinates to pool indices via CAS. Each voxel stores two independent float16 channels (depth and geometry) whose minimum is returned at query time. Recycled blocks are managed through a free-list stack.
  • Figure 5: Voxel-Project depth integration. (a) Per-pixel rays discover blocks touched by the current depth frame. (b) Duplicate keys are filtered and surviving blocks are allocated via CAS, yielding $K$ pool indices. (c) Phase 4 reverses the mapping: each voxel projects itself into the image, reads the depth at the projected pixel, and writes the signed distance directly, eliminating all atomic contention.
  • ...and 34 more figures
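
The local support property described in the Figure 2 caption can be checked numerically. The following is a minimal NumPy sketch, not the paper's implementation: it assumes a uniform cubic B-spline in the standard basis-matrix form, treats $u_3$ as the control point at zero-based index 3, and uses $N_{\text{interp}}=2$ samples per segment as in the caption. The function names `evaluate_bspline` and `affected_segments` are hypothetical.

```python
import numpy as np

# Standard basis matrix of a uniform cubic B-spline.
# Rows correspond to t^3, t^2, t, 1; columns to the 4 control points of a segment.
BASIS = np.array([
    [-1.0,  3.0, -3.0, 1.0],
    [ 3.0, -6.0,  3.0, 0.0],
    [-3.0,  0.0,  3.0, 0.0],
    [ 1.0,  4.0,  1.0, 0.0],
]) / 6.0


def evaluate_bspline(control_points, n_interp=2):
    """Sample a 1-D uniform cubic B-spline at n_interp points per segment.

    Returns an array of shape (n_segments, n_interp), where segment i
    depends only on control points i, i+1, i+2, i+3.
    """
    control_points = np.asarray(control_points, dtype=float)
    n_segments = len(control_points) - 3
    t = np.arange(n_interp) / n_interp                # local parameter in [0, 1)
    powers = np.stack([t**3, t**2, t, np.ones_like(t)], axis=1)
    weights = powers @ BASIS                          # shape (n_interp, 4)
    samples = np.empty((n_segments, n_interp))
    for i in range(n_segments):
        samples[i] = weights @ control_points[i:i + 4]
    return samples


def affected_segments(control_points, index, delta=1.0):
    """Return the indices of segments that change when one control point moves."""
    before = evaluate_bspline(control_points)
    perturbed = np.array(control_points, dtype=float)
    perturbed[index] += delta
    after = evaluate_bspline(perturbed)
    changed = np.abs(after - before).max(axis=1) > 1e-12
    return np.where(changed)[0].tolist()


if __name__ == "__main__":
    # 10 control points give 7 curve segments; only 4 of them see control point 3.
    print(affected_segments(np.zeros(10), index=3))  # prints [0, 1, 2, 3]
```

Because only these four segments depend on the perturbed control point, only their sample points contribute to its gradient during optimization, which is what keeps the per-control-point reduction small.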