Which Reconstruction Model Should a Robot Use? Routing Image-to-3D Models for Cost-Aware Robotic Manipulation

Akash Anand, Aditya Agarwal, Leslie Pack Kaelbling

Abstract

Robotic manipulation tasks require 3D mesh reconstructions of varying quality: dexterous manipulation demands fine-grained surface detail, while collision-free planning tolerates coarser representations. Multiple reconstruction methods offer different cost-quality tradeoffs, from Image-to-3D models, whose output quality depends heavily on the input viewpoint, to view-invariant methods such as structured light scanning. Querying all models is computationally prohibitive, motivating per-input model selection. We propose SCOUT, a novel routing framework that decouples reconstruction scores into two components: (1) the relative performance of viewpoint-dependent models, captured by a learned probability distribution, and (2) the overall image difficulty, captured by a scalar partition function estimate. As the learned network operates only over the viewpoint-dependent models, view-invariant pipelines can be added, removed, or reconfigured without retraining. SCOUT also supports arbitrary cost constraints at inference time, accommodating the multi-dimensional cost constraints common in robotics. We evaluate on the Google Scanned Objects, BigBIRD, and YCB datasets under multiple mesh quality metrics, demonstrating consistent improvements over routing baselines adapted from the LLM literature across various cost constraints. We further validate the framework through robotic grasping and dexterous manipulation experiments. We release the code and additional results on our website.
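To make the decoupling concrete, below is a minimal routing sketch. It is an illustration of the idea stated in the abstract, not the paper's released code: the names (`route`, `predict_relative`, `predict_log_Z`) and the scalar budget are hypothetical simplifications, and SCOUT itself supports multi-dimensional cost constraints. The key assumption, following the abstract, is that an absolute quality score for each viewpoint-dependent model factors as the learned relative probability times the partition function estimate, so fixed precomputed scores for view-invariant pipelines can be merged in without retraining.

```python
import numpy as np

def route(image_features, predict_relative, predict_log_Z,
          invariant_scores, costs, budget):
    """Sketch of decoupled, cost-aware routing (hypothetical interface).

    predict_relative: image -> probabilities over viewpoint-dependent models
    predict_log_Z:    image -> scalar log partition-function estimate
    invariant_scores: {name: fixed score} for view-invariant pipelines
    costs:            {name: scalar cost}, e.g. a weighted latency/memory sum
    budget:           maximum allowed cost for this query
    """
    p = predict_relative(image_features)      # relative performance, (1)
    log_Z = predict_log_Z(image_features)     # image difficulty, (2)
    # Absolute score for model i factors as p_i * Z(image).
    scores = {f"view_dep_{i}": float(p_i) * float(np.exp(log_Z))
              for i, p_i in enumerate(p)}
    scores.update(invariant_scores)           # merged in without retraining
    feasible = {m: s for m, s in scores.items()
                if costs.get(m, float("inf")) <= budget}
    return max(feasible, key=feasible.get) if feasible else None

# Purely illustrative usage with stubbed predictors:
probs = np.array([0.7, 0.3])
choice = route(
    image_features=None,
    predict_relative=lambda x: probs,
    predict_log_Z=lambda x: np.log(0.8),
    invariant_scores={"structured_light": 0.95},
    costs={"view_dep_0": 1.0, "view_dep_1": 2.5, "structured_light": 30.0},
    budget=5.0,
)
print(choice)  # -> "view_dep_0": best score among models within budget
```

Because the learned components never score the view-invariant entries, adding or removing such a pipeline only changes the `invariant_scores` and `costs` dictionaries, matching the abstract's claim that view-invariant methods can be reconfigured without retraining.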

Figures (15)

  • Figure 1: Overview of SCOUT. Given an input image, (a) SCOUT routes to the most suitable reconstruction model and generates the selected 3D reconstruction (compared against other candidate models). (b) Grasp proposals generated on the reconstructed mesh, evaluated on the ground-truth mesh; colliding grasps are shown in red. (c) Utility of SCOUT's reconstruction-aware routing in downstream robust robot grasping and dexterous manipulation.
  • Figure 2: Effect of the number of view-invariant methods on routing performance when evaluated on viewpoint-dependent methods. With decoupling, regret over viewpoint-dependent models remains constant regardless of how many view-invariant methods are included during training. Without decoupling, regret increases as the number of view-invariant methods grows.
  • Figure 3: Deferral curves for the latency $\odot$ memory (a) and latency (b) cost coefficient vectors, showing the normalized utility achieved by each method as a function of the allocated cost.
  • Figure 4: Degenerate and nondegenerate views of a cracker box.
  • Figure 5: Different lighting conditions for a fixed viewpoint.
  • ...and 10 more figures