OmniGuide: Universal Guidance Fields for Enhancing Generalist Robot Policies

Yunzhou Song, Long Le, Yong-Hyun Park, Jie Wang, Junyao Shi, Lingjie Liu, Jiatao Gu, Eric Eaton, Dinesh Jayaraman, Kostas Daniilidis

TL;DR

OmniGuide is a flexible framework that improves VLA performance on complex tasks by leveraging arbitrary sources of guidance, such as 3D foundation models, semantic-reasoning VLMs, and human pose models. It matches or surpasses the performance of prior methods designed to incorporate specific sources of guidance into VLA policies.

Abstract

Vision-language-action (VLA) models have shown great promise as generalist policies for a large range of relatively simple tasks. However, they demonstrate limited performance on more complex tasks, such as those requiring complex spatial or semantic understanding, manipulation in clutter, or precise manipulation. We propose OmniGuide, a flexible framework that improves VLA performance on such tasks by leveraging arbitrary sources of guidance, such as 3D foundation models, semantic-reasoning VLMs, and human pose models. We show how many kinds of guidance can be naturally expressed as differentiable energy functions, with task-specific attractors and repellers located in 3D space, that influence the sampling of VLA actions. In this way, OmniGuide enables guidance sources with complementary task-relevant strengths to improve a VLA model's performance on challenging tasks. Extensive experiments in both simulation and real-world environments, across diverse sources of guidance, demonstrate that OmniGuide significantly improves the success and safety rates of state-of-the-art generalist policies (e.g., $\pi_{0.5}$, GR00T N1.6). Critically, our unified framework matches or surpasses the performance of prior methods designed to incorporate specific sources of guidance into VLA policies. Project Page: https://omniguide.github.io/
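
To make the idea of an energy function with attractors and repellers concrete, the following is a minimal sketch and not the paper's formulation. It assumes a quadratic attractor toward a target point and a squared-hinge repeller around a spherical obstacle; the names `energy_and_grad` and `descend`, the weights, and the step size are all hypothetical choices for illustration. The gradient is written out by hand so the example needs only the standard library.

```python
import math

def energy_and_grad(x, target, obstacle, radius, w_att=1.0, w_rep=5.0):
    """Return (energy, gradient) at a 3D point x (assumed toy energy)."""
    # Attractor term: 0.5 * w_att * ||x - target||^2
    diff_t = [xi - ti for xi, ti in zip(x, target)]
    energy = 0.5 * w_att * sum(c * c for c in diff_t)
    grad = [w_att * c for c in diff_t]

    # Repeller term: 0.5 * w_rep * max(0, radius - ||x - obstacle||)^2,
    # which is only active inside the safety margin around the obstacle.
    diff_o = [xi - oi for xi, oi in zip(x, obstacle)]
    dist = math.sqrt(sum(c * c for c in diff_o))
    penetration = radius - dist
    if penetration > 0.0 and dist > 1e-9:
        energy += 0.5 * w_rep * penetration ** 2
        # Gradient of the hinge term points toward the obstacle center,
        # so descending the energy pushes the point away from it.
        grad = [g - w_rep * penetration * c / dist
                for g, c in zip(grad, diff_o)]
    return energy, grad

def descend(x, target, obstacle, radius, step=0.1, iters=200):
    """Plain gradient descent on the combined energy."""
    x = list(x)
    for _ in range(iters):
        _, g = energy_and_grad(x, target, obstacle, radius)
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x

if __name__ == "__main__":
    goal = [1.0, 0.0, 0.0]
    sphere_center = [0.5, 0.1, 0.0]
    print(descend([0.0, 0.0, 0.0], goal, sphere_center, radius=0.2))
```

Because both terms are differentiable almost everywhere, several guidance sources can be combined by summing their energies, and the summed gradient is what steers the point. The repeller here is a soft penalty, so it discourages but does not strictly forbid entering the margin.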

Paper Structure

This paper contains 20 sections, 1 theorem, 25 equations, 14 figures, 1 algorithm.

Key Result

Proposition 1

Let ${\Omega} = \{\mathbf{x} \in \mathbb{R}^3 \mid 0 < \mathrm{SDF}_O(\mathbf{x}) \le d\}$, where $d\in(0,\infty)$. Then the integral $Z = \int_\Omega \mathrm{SDF}_O(\mathbf{x}) \, d\mathbf{x}$ is finite.
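
One way to see why such a statement can hold is the bounding argument below. It is a sketch under an added assumption that does not appear in the text above, namely that the object $O$ is bounded and fits inside a ball of radius $R$ centered at a point $\mathbf{c}$ (both $R$ and $\mathbf{c}$ are introduced here for illustration only). It is not necessarily the argument used in the paper's proof.

```latex
% Assumption (not from the source): O \subseteq B(\mathbf{c}, R) for some R < \infty.
% Every point of \Omega lies outside O and within distance d of it, hence
\Omega \subseteq B(\mathbf{c}, R + d)
\quad\Longrightarrow\quad
\mathrm{Vol}(\Omega) \le \tfrac{4}{3}\pi (R + d)^3 .
% On \Omega the integrand satisfies 0 < \mathrm{SDF}_O(\mathbf{x}) \le d, so
0 \le Z = \int_\Omega \mathrm{SDF}_O(\mathbf{x}) \, d\mathbf{x}
\le d \cdot \mathrm{Vol}(\Omega)
\le \tfrac{4}{3}\pi \, d \, (R + d)^3 < \infty .
```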

Figures (14)

  • Figure 1: OmniGuide unifies different kinds of guidance via attractive and repulsive fields to improve the performance of generalist robot policies.
  • Figure 2: Method Overview. At each denoising step, OmniGuide first estimates the clean action chunk $\tilde{\mathbf{A}}^{\tau}$ using the base policy $\mathbf{v}_\theta$ and decodes it into joint space. A differentiable dynamics/kinematic model is then used to obtain the robot's Cartesian trajectories $\mathbf{X}$, which are evaluated using the energy functions $\mathcal{L}_\mathbf{y}$ extracted from foundation models. Finally, the gradient is backpropagated through the robot model and all neural networks, yielding a guidance vector on the noisy latent action chunk $\mathbf{A}^{\tau}$.
  • Figure 3: Guidance Visualization. We visualize the guidance gradient (lines) on the predicted Cartesian trajectories (dots), which is backpropagated to the latent space as denoising guidance. The gradient of the collision energy repels the trajectories from obstacles, and the gradient of the semantic energy attracts the gripper toward the grounded target. The two kinds of guidance blend naturally in 3D space, yielding a joint guidance gradient that steers the denoising toward a safe and task-oriented state.
  • Figure 4: Qualitative Results of Simulation Experiments. OmniGuide demonstrates strong generalization capabilities and flexibility across diverse tasks. We show the individual and joint effects of our method on different tasks. Left: collision avoidance guidance for the TurnSinkSpout task. Middle: semantic grounding guidance for the Multi-Choice task. Right: both kinds of guidance combined for the Multi-Choice task in a cluttered scene.
  • Figure 5: Quantitative Results of Simulation Experiments. OmniGuide is an effective method for enhancing the base VLA's task performance and safety. Each component, including the initialization and denoising guidance, yields an improvement over the base policy, with denoising guidance being more effective, while our combined method produces the largest boost.
  • ...and 9 more figures
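
The guided denoising loop described in the Figure 2 caption can be sketched on a toy problem. Everything below is an illustrative assumption rather than the paper's implementation: the "policy" is a hand-written linear velocity field (`toy_velocity`), the robot is a hypothetical planar 2-link arm, the action chunk is a single joint configuration, the clean estimate uses the flow-matching-style rule $\mathbf{A} + (1-\tau)\mathbf{v}$ with $\tau$ running from 0 (noise) to 1 (clean), and central finite differences stand in for backpropagation so that the example needs no autodiff library.

```python
import math

L1, L2 = 1.0, 1.0  # link lengths of a hypothetical planar 2-link arm

def fk(q):
    """Forward kinematics: joint angles -> end-effector (x, y)."""
    return (L1 * math.cos(q[0]) + L2 * math.cos(q[0] + q[1]),
            L1 * math.sin(q[0]) + L2 * math.sin(q[0] + q[1]))

def toy_velocity(a, tau, a_prior):
    """Stand-in for a base policy: pulls the action toward a_prior.
    (tau is unused here; a learned velocity field would depend on it.)"""
    return [p - x for x, p in zip(a, a_prior)]

def clean_estimate(a, tau, a_prior):
    """One-step estimate of the clean action: a + (1 - tau) * v(a, tau)."""
    v = toy_velocity(a, tau, a_prior)
    return [x + (1.0 - tau) * vi for x, vi in zip(a, v)]

def task_energy(q, target):
    """Attractor energy on the Cartesian end-effector position."""
    x, y = fk(q)
    return 0.5 * ((x - target[0]) ** 2 + (y - target[1]) ** 2)

def guidance_grad(a, tau, a_prior, target, h=1e-5):
    """Gradient of E(FK(clean_estimate(a))) w.r.t. the noisy action a,
    computed by central finite differences instead of backpropagation."""
    grad = []
    for i in range(len(a)):
        hi, lo = list(a), list(a)
        hi[i] += h
        lo[i] -= h
        e_hi = task_energy(clean_estimate(hi, tau, a_prior), target)
        e_lo = task_energy(clean_estimate(lo, tau, a_prior), target)
        grad.append((e_hi - e_lo) / (2.0 * h))
    return grad

def sample(a0, a_prior, target, scale, steps=10):
    """Euler integration of the velocity field with an energy-gradient term."""
    a = list(a0)
    dt = 1.0 / steps
    for k in range(steps):
        tau = k * dt
        v = toy_velocity(a, tau, a_prior)
        g = guidance_grad(a, tau, a_prior, target)
        # Follow the policy, nudged downhill on the guidance energy.
        a = [x + dt * (vi - scale * gi) for x, vi, gi in zip(a, v, g)]
    return a

if __name__ == "__main__":
    goal = fk([0.6, 0.9])
    for s in (0.0, 1.0):
        q = sample([1.5, -0.5], [0.3, 0.6], goal, scale=s)
        print(s, task_energy(q, goal))
```

Setting `scale=0.0` recovers unguided sampling, which makes it easy to compare the final task energy with and without guidance. In a real system the finite-difference gradient would be replaced by automatic differentiation through the kinematic model and the policy network, as the caption describes.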

Theorems & Definitions (2)

  • Proposition 1
  • proof