Robot-DIFT: Distilling Diffusion Features for Geometrically Consistent Visuomotor Control

Yu Deng, Yufeng Jin, Xiaogang Jia, Jiahong Xue, Gerhard Neumann, Georgia Chalvatzaki

TL;DR

Robot-DIFT addresses the core problem that geometry-sensitive visuomotor control requires representations preserving fine spatial structure, which standard discriminative backbones often discard. It distills the geometric priors of a pretrained diffusion model into a deterministic Spatial-Semantic Feature Pyramid Network via Manifold Distillation, enabling real-time, drift-free control. Pretrained on the DROID robot-demonstration dataset, the approach yields a multi-scale, geometry-preserving backbone and demonstrates superior geometric consistency and control performance over discriminative baselines and other diffusion-based approaches. The results on RoboCasa, LIBERO-10, and real-robot Panda experiments show improved precision in contact-rich tasks and favorable latency, underscoring the practical impact of aligning perception and action through diffusion priors.

Abstract

We hypothesize that a key bottleneck in generalizable robot manipulation is not solely data scale or policy capacity, but a structural mismatch between current visual backbones and the physical requirements of closed-loop control. While state-of-the-art vision encoders (including those used in VLAs) optimize for semantic invariance to stabilize classification, manipulation typically demands geometric sensitivity: the ability to map millimeter-level pose shifts to predictable feature changes. Their discriminative objective creates a "blind spot" for fine-grained control, whereas generative diffusion models inherently encode geometric dependencies within their latent manifolds, encouraging the preservation of dense multi-scale spatial structure. However, directly deploying stochastic diffusion features for control is hindered by stochastic instability, inference latency, and representation drift during fine-tuning. To bridge this gap, we propose Robot-DIFT, a framework that decouples the source of geometric information from the process of inference via Manifold Distillation. By distilling a frozen diffusion teacher into a deterministic Spatial-Semantic Feature Pyramid Network (S2-FPN), we retain the rich geometric priors of the generative model while ensuring temporal stability, real-time execution, and robustness against drift. Pretrained on the large-scale DROID dataset, Robot-DIFT demonstrates superior geometric consistency and control performance compared to leading discriminative baselines, supporting the view that how a model learns to see dictates how well it can learn to act.

Paper Structure (29 sections, 9 equations, 2 figures, 4 tables)

Figures (2)

  • Figure 1: Overview of Robot-DIFT. Our framework distills generative diffusion priors into a deterministic backbone for efficient robotic control. Manifold Distillation: A Student U-Net is trained to replicate the multi-scale feature manifold of a frozen Teacher. The S2-FPN fuses these features to capture semantic context and fine geometry, while a Manifold Loss anchors the Student to the diffusion prior to prevent representation drift. Deployment: By discarding the Teacher, we enable a single-forward-pass pipeline that preserves geometric sensitivity without the latency or stochasticity of iterative sampling.
  • Figure 2: Real-robot task suite. We evaluate four manipulation tasks that progressively increase geometric sensitivity, from coarse semantic grounding to contact-rich interaction with tight tolerances: Sort Cup (place yellow cup in pink zone), Open Lid, Insert Pin, and Press Switch.
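
The Manifold Distillation step in the Figure 1 caption, where a Student replicates the multi-scale features of a frozen Teacher, can be illustrated with a small feature-matching objective. The sketch below is only an assumption about the general shape of such a loss: this summary does not give the paper's actual Manifold Loss, so a weighted per-scale mean squared error is swapped in as a stand-in. The function names, the `weights` parameter, and the toy feature values are all hypothetical.

```python
from typing import Optional, Sequence


def scale_mse(student: Sequence[float], teacher: Sequence[float]) -> float:
    """Mean squared error between two flattened feature maps of equal length."""
    if len(student) != len(teacher) or len(student) == 0:
        raise ValueError("features must be non-empty and of equal length")
    return sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(student)


def manifold_distillation_loss(
    student_feats: Sequence[Sequence[float]],
    teacher_feats: Sequence[Sequence[float]],
    weights: Optional[Sequence[float]] = None,
) -> float:
    """Weighted sum of per-scale MSE terms.

    Hypothetical stand-in for the paper's Manifold Loss: each entry of
    student_feats / teacher_feats is one pyramid scale, flattened to a list.
    The teacher features are treated as fixed targets (frozen Teacher).
    """
    if len(student_feats) != len(teacher_feats):
        raise ValueError("student and teacher must expose the same number of scales")
    if weights is None:
        weights = [1.0] * len(student_feats)  # assumption: equal weight per scale
    if len(weights) != len(student_feats):
        raise ValueError("one weight per scale is required")
    return sum(
        w * scale_mse(s, t)
        for w, s, t in zip(weights, student_feats, teacher_feats)
    )


# Toy example with two scales (values are made up for illustration).
teacher = [[1.0, 2.0, 3.0, 4.0], [0.5, -0.5]]
student = [[1.0, 2.0, 3.0, 2.0], [0.5, 0.5]]
loss = manifold_distillation_loss(student, teacher)
print(loss)  # 1.5  (scale 1 contributes 4/4 = 1.0, scale 2 contributes 1/2 = 0.5)
```

In a real pipeline the lists would be replaced by feature tensors from the Teacher and Student networks, and only the Student's parameters would receive gradients. At deployment the Teacher is discarded, so this term exists only during training.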