MResT: Multi-Resolution Sensing for Real-Time Control with Vision-Language Models
Saumya Saxena, Mohit Sharma, Oliver Kroemer
TL;DR
MResT addresses real-time, language-conditioned robotic manipulation under partial observability by fusing sensing at multiple spatial and temporal resolutions. It combines slow, frozen vision-language model (VLM) features from a third-person view with fast, task-specific local features from a first-person view and high-frequency force-torque data. Cross-attention transformers fuse the modalities, and a lightweight MLP outputs continuous actions. The approach is evaluated on coarse, precise, and dynamic manipulation tasks in both simulation and the real world, showing ~2x average improvement over strong multi-task baselines, better generalization to unseen visual-semantic targets, and real-world gains of up to ~3x. By keeping the large VLM frozen and applying asymmetric data augmentations, MResT generalizes robustly while still supporting high-frequency reactive control.
Abstract
Leveraging sensing modalities across diverse spatial and temporal resolutions can improve performance on robotic manipulation tasks. Multi-spatial resolution sensing provides hierarchical information captured at different spatial scales and enables both coarse and precise motions. Simultaneously, multi-temporal resolution sensing enables the agent to exhibit high reactivity and real-time control. In this work, we propose a framework, MResT (Multi-Resolution Transformer), for learning generalizable language-conditioned multi-task policies that utilize sensing at different spatial and temporal resolutions, using networks of varying capacities to effectively perform real-time control of precise and reactive tasks. We leverage off-the-shelf pretrained vision-language models to operate on low-frequency global features, along with small non-pretrained models to adapt to high-frequency local feedback. Through extensive experiments in 3 domains (coarse, precise, and dynamic manipulation tasks), we show that our approach significantly improves (2X on average) over recent multi-task baselines. Further, our approach generalizes well to visual and geometric variations in target objects and to varying interaction forces.
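The multi-temporal idea described above can be illustrated with a minimal sketch: an expensive global encoder is run only every few control steps and its output is cached, while cheap local and force-torque inputs are read on every step. This is not the authors' implementation. The class name `MultiRateController`, the `slow_period` value, and the three callables are hypothetical stand-ins, and fusion here is plain concatenation rather than the cross-attention used by MResT.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Sequence

Vector = List[float]


@dataclass
class MultiRateController:
    """Hypothetical multi-rate control loop (illustrative only).

    slow_encoder stands in for a frozen VLM on the third-person view,
    fast_encoder for a small trained encoder on the first-person view,
    policy_head for the action-producing MLP.
    """
    slow_encoder: Callable[[Sequence[float]], Vector]
    fast_encoder: Callable[[Sequence[float]], Vector]
    policy_head: Callable[[Vector], Vector]
    slow_period: int = 4  # assumed refresh interval, not taken from the paper
    slow_calls: int = field(default=0, init=False)
    _cached_global: Optional[Vector] = field(default=None, init=False)
    _step: int = field(default=0, init=False)

    def act(self, third_person: Sequence[float],
            first_person: Sequence[float],
            force_torque: Sequence[float]) -> Vector:
        # Refresh the expensive global feature only every `slow_period` steps.
        if self._step % self.slow_period == 0:
            self._cached_global = list(self.slow_encoder(third_person))
            self.slow_calls += 1
        # Local features and force-torque readings are used on every step.
        local = list(self.fast_encoder(first_person))
        # Simplification: concatenate instead of cross-attention fusion.
        fused = self._cached_global + local + list(force_torque)
        self._step += 1
        return self.policy_head(fused)
```

With `slow_period=4`, ten calls to `act` invoke the slow encoder only three times (at steps 0, 4, and 8), while the fast path runs on all ten. This is the source of the reactivity the abstract refers to: the control rate is bounded by the small models rather than by the large one.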
