MResT: Multi-Resolution Sensing for Real-Time Control with Vision-Language Models

Saumya Saxena, Mohit Sharma, Oliver Kroemer

TL;DR

MResT tackles the challenge of real-time, language-conditioned robotic manipulation under partial observability by fusing multi-spatial and multi-temporal sensing. It combines slow, frozen vision-language model features from a third-person view with fast, task-specific local features from a first-person view and high-frequency force-torque data, using cross-attention transformers to fuse modalities and output continuous actions via a lightweight MLP. The approach is evaluated across coarse, precise, and dynamic manipulation tasks, in both simulation and real-world settings, showing ~2x average improvements over strong multi-task baselines and better generalization to unseen visual-semantic targets, with real-world gains up to ~3x. By freezing large VLMs and leveraging asymmetric data augmentations, MResT achieves robust generalization while enabling high-frequency reactive control, offering practical impact for robust, adaptable robotic manipulation in varied environments.
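The cross-attention fusion mentioned above can be illustrated with a minimal sketch. This is not the authors' implementation: it is a single-head, single-query scaled dot-product attention with no learned projections, and it assumes (for illustration only) that a fast local feature acts as the query over a set of slow global tokens. The function name `cross_attention` and the toy vectors are hypothetical.

```python
import math


def cross_attention(query, keys, values):
    """Single-head scaled dot-product cross-attention for one query vector.

    Simplification (assumption): no learned Q/K/V projections, one head,
    one query. `query` is a list of floats; `keys` and `values` are lists
    of equal length containing lists of floats.
    """
    d = len(query)
    # Similarity of the query to every key, scaled by sqrt(d).
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys
    ]
    # Numerically stable softmax over the scores.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Attention output: weighted sum of the value vectors.
    fused = [
        sum(w * v[i] for w, v in zip(weights, values))
        for i in range(len(values[0]))
    ]
    return fused, weights


# Hypothetical toy inputs: one fast local feature attends over two slow tokens.
local_query = [1.0, 0.0]
global_keys = [[1.0, 0.0], [1.0, 0.0]]
global_values = [[2.0, 0.0], [4.0, 0.0]]
fused, weights = cross_attention(local_query, global_keys, global_values)
```

Because the two keys are identical in this toy input, the attention weights are equal and the fused vector is the mean of the two value vectors. In a real model the fused output would be passed on to the action head.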

Abstract

Leveraging sensing modalities across diverse spatial and temporal resolutions can improve performance on robotic manipulation tasks. Multi-spatial resolution sensing provides hierarchical information captured at different spatial scales and enables both coarse and precise motions. Simultaneously, multi-temporal resolution sensing enables the agent to exhibit high reactivity and real-time control. In this work, we propose a framework, MResT (Multi-Resolution Transformer), for learning generalizable language-conditioned multi-task policies that utilize sensing at different spatial and temporal resolutions, using networks of varying capacities to effectively perform real-time control of precise and reactive tasks. We leverage off-the-shelf pretrained vision-language models to operate on low-frequency global features, along with small non-pretrained models to adapt to high-frequency local feedback. Through extensive experiments in 3 domains (coarse, precise, and dynamic manipulation tasks), we show that our approach significantly improves (2X on average) over recent multi-task baselines. Further, our approach generalizes well to visual and geometric variations in target objects and to varying interaction forces.
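One way to picture the multi-temporal idea is a control loop that refreshes the expensive global features only occasionally and reuses the cached result in between, while the cheap local features are recomputed at every step. The sketch below is an illustration under stated assumptions, not the paper's code: `run_multirate_loop`, the `slow_period` value, the observation keys, and all the toy encoder stubs are hypothetical names chosen for this example.

```python
def run_multirate_loop(num_steps, slow_period, slow_encoder, fast_encoder,
                       policy_head, get_obs):
    """Run a control loop with a slow (cached) and a fast (per-step) branch.

    The slow encoder is called once every `slow_period` steps; its output is
    cached and reused until the next refresh. The fast encoder and the policy
    head run at every step.
    """
    cached_global = None
    slow_calls = 0
    actions = []
    for t in range(num_steps):
        obs = get_obs(t)
        if t % slow_period == 0:
            # Low-frequency branch: stands in for a large frozen model.
            cached_global = slow_encoder(obs["third_person"], obs["instruction"])
            slow_calls += 1
        # High-frequency branch: stands in for small task-specific networks.
        local = fast_encoder(obs["first_person"], obs["force_torque"])
        actions.append(policy_head(cached_global, local))
    return actions, slow_calls


# Toy stand-ins (assumptions) so the loop can be run end to end.
def toy_get_obs(t):
    return {
        "third_person": t,
        "first_person": t,
        "force_torque": 0.1 * t,
        "instruction": "pick the red block",
    }


def toy_slow_encoder(image, instruction):
    return float(image)


def toy_fast_encoder(image, force_torque):
    return float(image) + force_torque


def toy_policy_head(global_feat, local_feat):
    # A real head would map fused features to a continuous action.
    return (global_feat, local_feat)


actions, slow_calls = run_multirate_loop(
    num_steps=10, slow_period=4,
    slow_encoder=toy_slow_encoder, fast_encoder=toy_fast_encoder,
    policy_head=toy_policy_head, get_obs=toy_get_obs,
)
```

With these toy settings the slow encoder runs at steps 0, 4, and 8 only, so the action at step 5 is computed from the global feature cached at step 4 together with a fresh local feature. The design trade-off is that the global context may be slightly stale, in exchange for a loop rate bounded by the fast branch rather than the large model.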

Paper Structure (22 sections, 7 figures, 9 tables)

Figures (7)

  • Figure 1: Our proposed approach uses sensing at different spatial and temporal resolutions for real-time control of coarse, precise, and dynamic tasks while enabling generalization to novel visual features and interactions.
  • Figure 2: Overall architecture: Global low-frequency information is extracted from third-person camera images using slow inference networks, while local high-frequency information is extracted from first-person camera images and proprioceptive and force-torque feedback using fast inference networks. These sensing modalities are then fused at different frequencies to enable real-time, high-frequency control.
  • Figure 3: Task settings for evaluating our proposed approach. Left: Precision tasks. Middle-left: Dynamic tasks. Middle-right: Coarse tasks. Right: Real world pick and insertion tasks.
  • Figure 4: Temporal resolution and robustness baselines used to compare our multi-resolution approach.
  • Figure 5: Example failure case for the MT-Dynamic (Ballbot) task. As the figure shows, if the robot approaches the object but does not react fast enough to the object contact, the block can topple, resulting in task failure.
  • ...and 2 more figures