AeroPlace-Flow: Language-Grounded Object Placement for Aerial Manipulators via Visual Foresight and Object Flow

Sarthak Mishra, Rishabh Dev Yadav, Naveen Nair, Wei Pan, Spandan Roy

TL;DR

AeroPlace-Flow is a training-free framework for language-grounded aerial object placement. It unifies visual foresight with explicit 3D geometric reasoning and object flow, and it produces executable placement targets without requiring predefined poses or task-specific training.

Abstract

Precise object placement remains underexplored in aerial manipulation, where most systems rely on predefined target coordinates and focus primarily on grasping and control. Specifying exact placement poses, however, is cumbersome in real-world settings, where users naturally communicate goals through language. In this work, we present AeroPlace-Flow, a training-free framework for language-grounded aerial object placement that unifies visual foresight with explicit 3D geometric reasoning and object flow. Given RGB-D observations of the object and the placement scene, along with a natural language instruction, AeroPlace-Flow first synthesizes a task-complete goal image using image editing models. The imagined configuration is then grounded into metric 3D space through depth alignment and object-centric reasoning, enabling the inference of a collision-aware object flow that transports the grasped object to a language- and contact-consistent placement configuration. The resulting motion is executed via standard trajectory tracking for an aerial manipulator. AeroPlace-Flow produces executable placement targets without requiring predefined poses or task-specific training. We validate our approach through extensive simulation and real-world experiments, demonstrating reliable language-conditioned placement across diverse aerial scenarios with an average success rate of 75% on hardware.

Paper Structure (15 sections, 1 equation, 5 figures, 3 tables)

This paper contains 15 sections, 1 equation, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Overview of AeroPlace-Flow. Given a natural language instruction and RGB-D observations of the object and placement scene, our method infers a collision-free object flow for aerial manipulation in 3 main steps. (1) Visual Foresight: A language-conditioned image editing model generates a goal image of the scene with the object placed according to the instruction. (2) Object Flow Extraction: The generated image is converted into a metrically consistent 3D scene, contact footprints are estimated, and the original object geometry is used to compute a collision-free object flow trajectory. (3) Placement Execution: The aerial manipulator tracks the inferred object flow to execute the placement. Bottom: Hardware demonstrations of language-conditioned aerial placement tasks in diverse scenarios. *Cable connected to drone is only for supplying power.
  • Figure 2: Generating Visual Foresight. Given RGB-D observations of the object $(I_{obj}, D_{obj})$ and scene $(I_{scene}, D_{scene})$ with a task instruction $L$, an image generation model $\pi$ produces a goal image $I_{\text{gen}}$ depicting the desired final placement.
  • Figure 3: Object flow inference and placement execution. Given the generated goal image $I_{\text{gen}}$, we first recover a metrically consistent 3D scene and extract point clouds for the object $\mathcal{P}_{obj}$, generated object $\mathcal{P}_{obj\text{-}gen}$, and world $\mathcal{P}_{world}$. A contact footprint between $\mathcal{P}_{obj\text{-}gen}$ and $\mathcal{P}_{world}$ is estimated to identify the support region. The original object geometry $\mathcal{P}_{obj}$ is then aligned with the current gripper pose and virtually placed on the contact footprint to obtain the desired placement configuration. Known point correspondences between the gripped and placed object are used to generate an initial linear object flow, which is refined through optimization with collision and smoothness constraints to produce a collision-free trajectory. The resulting object flow $\mathbf{P}_{1:T}$ is executed by the aerial manipulator using standard trajectory tracking.
  • Figure 4: Representative examples from the 100-task benchmark evaluation. Each row illustrates the full AeroPlace-Flow pipeline for a language-conditioned placement task. From left to right: input object and scene images, generated visual foresight from the image editing model, reconstructed 3D scene, estimated contact footprint (colored in green), resulting placed object pose, and the inferred collision-free object flow trajectory.
  • Figure 5: Overview of the custom benchmark. Each row illustrates a representative task instance with (left) the target object, (middle) the scene configuration, and (right) the corresponding ground-truth final state after placement.
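
The initial linear object flow mentioned in the Figure 3 caption can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `linear_object_flow` and `min_clearance`, the array shapes, and the brute-force distance check are assumptions made for illustration, and the collision- and smoothness-constrained optimization that refines the flow is not reproduced here. The sketch assumes the gripped and placed point clouds are given as row-aligned arrays, so that row $i$ of one corresponds to row $i$ of the other.

```python
import numpy as np


def linear_object_flow(p_gripped, p_placed, num_steps):
    """Interpolate each point linearly from its gripped to its placed position.

    p_gripped, p_placed: (N, 3) arrays with row-wise point correspondences
    (hypothetical inputs; the paper's data layout is not specified here).
    Returns an array of shape (num_steps, N, 3), i.e. one point set per step.
    """
    p_gripped = np.asarray(p_gripped, dtype=float)
    p_placed = np.asarray(p_placed, dtype=float)
    if p_gripped.shape != p_placed.shape:
        raise ValueError("point sets must have identical shapes")
    if num_steps < 2:
        raise ValueError("num_steps must be at least 2")
    # Interpolation weights from 0 (gripped) to 1 (placed), broadcast over points.
    alphas = np.linspace(0.0, 1.0, num_steps)[:, None, None]
    return (1.0 - alphas) * p_gripped[None] + alphas * p_placed[None]


def min_clearance(flow, world_points):
    """Smallest distance from any object point to any world point, per step.

    flow: (T, N, 3) object flow; world_points: (M, 3) scene point cloud.
    Brute force, O(T * N * M) memory; adequate only for small point sets.
    """
    world_points = np.asarray(world_points, dtype=float)
    diffs = flow[:, :, None, :] - world_points[None, None, :, :]
    return np.linalg.norm(diffs, axis=-1).min(axis=(1, 2))


if __name__ == "__main__":
    gripped = np.array([[0.0, 0.0, 1.0], [0.1, 0.0, 1.0]])
    placed = np.array([[1.0, 0.0, 0.2], [1.1, 0.0, 0.2]])
    world = np.array([[0.5, 0.0, 0.0], [1.0, 0.0, 0.0]])
    flow = linear_object_flow(gripped, placed, num_steps=5)
    print(flow.shape)
    print(min_clearance(flow, world))
```

Per-point linear interpolation is only a starting guess: it does not preserve rigidity when the placement involves a rotation, and it ignores obstacles. A per-step clearance such as the one returned by `min_clearance` is one simple quantity a refinement stage could penalize, which is consistent with the caption's statement that the initial flow is refined under collision and smoothness constraints.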