Precise Mobile Manipulation of Small Everyday Objects

Arjun Gupta, Rishik Sathua, Saurabh Gupta

TL;DR

This work addresses the challenge of precise mobile manipulation of small everyday objects in novel environments by introducing Servoing with Vision Models (SVM), a training-free closed-loop framework that integrates visual servoing with vision foundation models. By out-painting the end-effector to mitigate occlusion and using open-vocabulary detectors or point trackers for target specification, SVM achieves robust 3D target localization and precise manipulation. In large-scale real-world tests across 10 environments and 72 object instances, SVM attains 71% zero-shot success—substantially outperforming open-loop control and large imitation-learning baselines—demonstrating strong generalization and practical viability for everyday tasks. The approach offers a modular, perception-driven alternative to end-to-end imitation learning, with potential impact on real-world service and domestic robotics where precise interaction with small objects is required.

Abstract

Many everyday mobile manipulation tasks require precise interaction with small objects, such as grasping a knob to open a cabinet or pressing a light switch. In this paper, we develop Servoing with Vision Models (SVM), a closed-loop framework that enables a mobile manipulator to tackle such precise tasks involving the manipulation of small objects. SVM uses state-of-the-art vision foundation models to generate 3D targets for visual servoing to enable diverse tasks in novel environments. Naively doing so fails because of occlusion by the end-effector. SVM mitigates this using vision models that out-paint the end-effector, thereby significantly enhancing target localization. We demonstrate that, aided by out-painting methods, open-vocabulary object detectors can serve as a drop-in module for SVM to seek semantic targets (e.g., knobs), and point tracking methods can help SVM reliably pursue interaction sites indicated by user clicks. We conduct a large-scale evaluation spanning 10 novel environments across 6 buildings, including 72 different object instances. SVM obtains a 71% zero-shot success rate on manipulating unseen objects in novel environments in the real world, outperforming an open-loop control method by an absolute 42% and an imitation learning baseline trained on 1000+ demonstrations by an absolute 50%.
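One way to build intuition for why closing the loop can help with precision is the toy simulation below. It is only a sketch under stated assumptions and is not the controller used in the paper: the perception model (an estimate whose error is 5% of the current distance to the target), the target position, the gain, and all function names are hypothetical choices made for illustration.

```python
import numpy as np

# Hypothetical values, chosen only for illustration (not from the paper).
TRUE_TARGET = np.array([0.6, 0.1, 0.4])  # target position in metres
BIAS_DIR = np.array([0.0, 1.0, 0.0])     # direction of the perception error


def estimate_target(position, rel_error=0.05):
    """Toy perception: the estimate is off by rel_error times the current distance."""
    distance = np.linalg.norm(TRUE_TARGET - position)
    return TRUE_TARGET + rel_error * distance * BIAS_DIR


def open_loop_reach(start):
    """Estimate the target once from the start pose and move straight there."""
    return estimate_target(np.asarray(start, dtype=float))


def closed_loop_reach(start, gain=0.5, steps=30):
    """Re-estimate the target every step and move a fraction of the remaining error."""
    position = np.asarray(start, dtype=float)
    for _ in range(steps):
        error = estimate_target(position) - position
        position = position + gain * error
    return position


start = np.zeros(3)
open_err = np.linalg.norm(open_loop_reach(start) - TRUE_TARGET)
closed_err = np.linalg.norm(closed_loop_reach(start) - TRUE_TARGET)
print(f"open-loop error:   {open_err:.4f} m")
print(f"closed-loop error: {closed_err:.2e} m")
```

Under this assumed noise model, the open-loop reach inherits the full error of the initial estimate, while the closed-loop reach shrinks the error at every step because each new estimate is made from closer to the target. Real perception errors need not behave this way.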

Paper Structure

This paper contains 11 sections, 1 equation, 7 figures, 2 tables.

Figures (7)

  • Figure 1: Many everyday mobile manipulation tasks require reaching a precise interaction site before executing a motion primitive, e.g., precisely reaching a knob / handle to pull open a cupboard in (a) and (b), or precisely reaching a user-indicated button / book (shown via the red dot) before pushing it in (c) and (d). Open-loop execution cannot meet the high precision needed for these tasks. In this paper, we develop Servoing with Vision Models (SVM), a training-free framework that closes the loop to enable a commodity mobile manipulator to tackle these tasks.
  • Figure 2: When using off-the-shelf detectors on wrist camera data, knob detections (indicated by the red point in the second column) are incorrect. Errors stem from occlusion of the knob by the end-effector (top) and from the presence of the end-effector, an out-of-distribution object, even when the knob is unoccluded (bottom). Out-painting the end-effector (right two columns) fixes this.
  • Figure 3: Servoing with Vision Models (SVM) is a framework for precise reaching for mobile manipulators. Starting from an input RGB-D wrist camera image with a target specified either via a semantic label (e.g., handle) or a user-clicked point on the image, SVM outputs whole-body control commands to convey the end-effector to the target location by closing the loop with visual feedback. SVM first paints out the end-effector using a video out-painting model, then uses vision foundation models to continuously detect the target object (or track the desired target point) and compute 3D servoing targets, which are passed to a servo to obtain whole-body control commands (see the method section).
  • Figure 4: Our evaluation consists of 10 environments across 6 buildings, including 72 different object instances. Note that we exclusively test on novel objects in novel environments not used for training or development in any manner.
  • Figure 5: SVM vs. Open-Loop (Eye-in-Hand) baseline. (top) In opening a cabinet with a knob, slight errors in getting to the target cause the end-effector to slip off, leading to failure for the baseline, whereas our method is able to successfully complete the task. (bottom) Slight errors in getting to the target cause failure, whereas SVM successfully turns the lights off. Note the high quality of CoTracker's track (blue dot).
  • ...and 2 more figures
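
The Figure 3 caption describes turning a detected or tracked pixel in an RGB-D wrist image into a 3D servoing target. A minimal sketch of that step, assuming a standard pinhole camera model and a simple proportional, step-limited servo, could look like the following. The intrinsics, pixel coordinates, gain, step limit, and function names are all hypothetical, and the sketch ignores the camera-to-gripper offset and the whole-body control described in the paper.

```python
import numpy as np


def backproject_pixel(u, v, depth_m, fx, fy, cx, cy):
    """Pinhole back-projection of pixel (u, v) with depth in metres into the camera frame."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return np.array([x, y, depth_m])


def servo_step(target_cam, gain=0.5, max_step_m=0.05):
    """Proportional step toward the target, with its length clipped to max_step_m."""
    step = gain * np.asarray(target_cam, dtype=float)
    norm = np.linalg.norm(step)
    if norm > max_step_m:
        step = step * (max_step_m / norm)
    return step


# Hypothetical intrinsics and a hypothetical detection at pixel (380, 240), 0.5 m away.
fx, fy, cx, cy = 600.0, 600.0, 320.0, 240.0
target = backproject_pixel(380, 240, 0.5, fx, fy, cx, cy)
step = servo_step(target)
print("3D target in camera frame (m):", target)
print("commanded step (m):", step)
```

In a closed-loop setting, these two functions would be called again on every new frame, so the commanded step always reflects the latest target estimate rather than one computed before the motion began.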