LILAC: Language-Conditioned Object-Centric Optical Flow for Open-Loop Trajectory Generation

Motonari Kambara, Koki Seno, Tomoya Kaichi, Yanan Wang, Komei Sugiura

Abstract

We address language-conditioned robotic manipulation using flow-based trajectory generation, which enables training on human and web videos of object manipulation and requires only minimal embodiment-specific data. This task is challenging because generating object trajectories from pre-manipulation images and natural language instructions requires appropriate instruction-flow alignment. To tackle this challenge, we propose the flow-based Language Instruction-guided open-Loop ACtion generator (LILAC). This flow-based Vision-Language-Action (VLA) model generates object-centric 2D optical flow from an RGB image and a natural language instruction, and converts the flow into a 6-DoF manipulator trajectory. LILAC incorporates two key components: a Semantic Alignment Loss, which strengthens language conditioning to generate instruction-aligned optical flow, and a Prompt-Conditioned Cross-Modal Adapter, which aligns learned visual prompts with image and text features to provide rich cues for flow generation. In our experiments, LILAC outperformed existing approaches in generated flow quality across multiple benchmarks. Furthermore, in physical object manipulation experiments with free-form instructions, LILAC achieved a higher task success rate than existing methods. The project page is available at https://lilac-75srg.kinsta.page/.
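The flow-to-trajectory conversion in LILAC is performed by a learned Action De-Tokenizer that also conditions on the image, instruction, and depth (see Figure 2 below). As a point of reference only, the following is a minimal geometric sketch of the purely kinematic part of such a lift: back-projecting 2D flow waypoints through an aligned depth map and known pinhole intrinsics into a camera-frame path with a fixed orientation. The function name, shapes, and the reuse of a single pre-manipulation depth map are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def flow_to_camera_path(flow_uv: np.ndarray, depth: np.ndarray,
                        fx: float, fy: float, cx: float, cy: float) -> np.ndarray:
    """Lift one tracked point's 2D flow into a camera-frame 6-DoF path (sketch).

    flow_uv : (T, 2) pixel coordinates (u, v) of a tracked object point over T steps.
    depth   : (H, W) depth map aligned with the pre-manipulation RGB image, in meters.
              Reusing this single depth map at every step is a simplifying assumption.
    fx, fy, cx, cy : pinhole intrinsics of the same camera.
    Returns (T, 6): XYZ position in meters followed by a fixed zero roll-pitch-yaw.
    """
    path = np.zeros((len(flow_uv), 6))
    for t, (u, v) in enumerate(flow_uv):
        z = depth[int(round(v)), int(round(u))]   # depth at the tracked pixel (row = v, col = u)
        path[t, 0] = (u - cx) * z / fx            # standard pinhole back-projection
        path[t, 1] = (v - cy) * z / fy
        path[t, 2] = z                            # orientation columns stay zero
    return path
```

A separate inverse-kinematics or planning layer would still be needed to execute such a camera-frame path; in LILAC the flow-to-trajectory conversion is instead learned by the Action De-Tokenizer rather than hand-crafted.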

Figures (10)

  • Figure 1: Overview of LILAC, a 2D object-centric optical flow-based Vision-and-Language trajectory generation framework. In this figure, 'Act. De-Tokenizer' denotes the Action De-Tokenizer. Given an RGB image and a natural language instruction, LILAC generates 2D object flow and converts it into a 6-DoF robot trajectory.
  • Figure 2: Overview of the LILAC framework. The inputs to the flow generation module are a single RGB image $\bm{x}_{\mathrm{img}}$ and a natural language instruction $\bm{x}_{\mathrm{inst}}$. In this module, visual prompts and bounding boxes are first generated by the MLLM; the 2D flow is then generated autoregressively by the Prompt-Conditioned Cross-Modal Adapter and the Semantic Reconstruction Decoder. Given the generated flow, $\bm{x}_{\mathrm{img}}$, $\bm{x}_{\mathrm{inst}}$, and $\bm{x}_{\mathrm{depth}}$, the Action De-Tokenizer produces a 6-DoF manipulator trajectory.
  • Figure 3: Examples of visual prompts generated by GPT-4o. Given a text prompt and an RGB image, GPT-4o outputs the start and end points of an arrow that represents the rough shape of the action. In this figure, the red arrows indicate the rendered visual prompts. A hedged sketch of such a query follows this figure list.
  • Figure 4: Qualitative results on the Robot Flow benchmark. The given $\bm{x}_{\mathrm{inst}}$ were (a) "Take the yellow block to put it on top of the tower." and (b) "Close top drawer.", respectively.
  • Figure 5: Qualitative results with and without visual prompts. 'Visual prompt', 'w/o Visual prompt', and 'w/ Visual prompt' denote the visual prompt generated by the MLLM, the flow generated without a visual prompt, and the flow generated with a visual prompt, respectively. The given instructions are: (a) "Move 7up can near brown chip bag." (b) "Pick black chip bag."
  • ...and 5 more figures
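
Figure 3 above describes the visual prompts as the start and end points of an arrow that GPT-4o produces from the RGB image and the instruction. Below is a minimal sketch of such a query using the OpenAI Python client; the prompt wording, the JSON field names (`start`, `end`), and the pixel-coordinate convention are assumptions made for illustration and are not taken from the paper.

```python
import base64
import json

from openai import OpenAI

def arrow_visual_prompt(image_path: str, instruction: str) -> dict:
    """Ask GPT-4o for the start and end pixels of an arrow sketching the action."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": (f'Instruction: "{instruction}". Return JSON of the form '
                          '{"start": [x, y], "end": [x, y]} giving the start and end '
                          'pixel of an arrow that roughly sketches the required motion.')},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)
```

The returned endpoints correspond to the arrows rendered in red in Figure 3.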