Flex: End-to-End Text-Instructed Visual Navigation from Foundation Model Features

Makram Chahine, Alex Quach, Alaa Maalouf, Tsun-Hsuan Wang, Daniela Rus

TL;DR

This paper tackles end-to-end vision-language navigation under open-set text instructions with limited demonstrations. It introduces Flex, a minimalist framework that freezes Vision-Language Model encoders to produce dense patch-wise text-vision features and trains a lightweight policy head via imitation learning. Key findings show that patch-level fusion with two-object training enables robust generalization to unseen goals, objects, and real-world scenes, including zero-shot sim-to-real transfer. The approach reduces data and computation compared to large-scale RL or language-driven planners, enabling interactive, open-vocabulary robotic navigation in practical settings.

Abstract

End-to-end learning directly maps sensory inputs to actions, creating highly integrated and efficient policies for complex robotics tasks. However, such models often struggle to generalize beyond their training scenarios, limiting adaptability to new environments, tasks, and concepts. In this work, we investigate the minimal data requirements and architectural adaptations necessary to achieve robust closed-loop performance with vision-based control policies under unseen text instructions and visual distribution shifts. Our findings are synthesized in Flex (Fly lexically), a framework that uses pre-trained Vision Language Models (VLMs) as frozen patch-wise feature extractors, generating spatially aware embeddings that integrate semantic and visual information. We demonstrate the effectiveness of this approach on a quadrotor fly-to-target task, where agents trained via behavior cloning on a small simulated dataset successfully generalize to real-world scenes with diverse novel goals and command formulations.
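As a rough illustration of the pipeline the abstract describes, the Python sketch below builds a grid of per-patch features with a frozen encoder and fits a small policy head by behavior cloning. Every name in it is hypothetical: `toy_frozen_encoder` is a stand-in for a pre-trained VLM and is not a real model, and the one-mask-per-patch scheme, the linear head, and the squared-error loss are assumptions made for illustration, not the authors' implementation. The four outputs follow the command dimensions named in the Figure 4 caption ($v_x, v_y, v_z, \dot{\psi}$).

```python
def mask_to_patch(image, row, col, patch):
    """Return a copy of `image` with every pixel outside one patch set to 0."""
    top, left = row * patch, col * patch
    return [
        [px if top <= y < top + patch and left <= x < left + patch else 0.0
         for x, px in enumerate(line)]
        for y, line in enumerate(image)
    ]


def toy_frozen_encoder(masked_image, instruction):
    """Hypothetical stand-in for a frozen VLM: it has no trainable parameters."""
    n_pixels = sum(len(line) for line in masked_image)
    mean = sum(sum(line) for line in masked_image) / n_pixels
    text_scale = len(instruction.split()) / 10.0  # crude text conditioning (assumption)
    return [mean, mean * text_scale, 1.0]


def patch_features(image, instruction, patch):
    """Encode one masked copy of the image per patch, giving a grid of features."""
    grid = len(image) // patch
    return [
        [toy_frozen_encoder(mask_to_patch(image, r, c, patch), instruction)
         for c in range(grid)]
        for r in range(grid)
    ]


def flatten(features):
    return [v for row in features for cell in row for v in cell]


def policy(weights, features):
    """Linear policy head: flattened feature grid -> [v_x, v_y, v_z, yaw_rate]."""
    flat = flatten(features)
    return [sum(w * v for w, v in zip(w_row, flat)) for w_row in weights]


def bc_step(weights, features, expert, lr=0.05):
    """One gradient step on the squared error to the expert command.

    Only the policy weights change; the encoder is never updated.
    Returns the new weights and the loss measured before the update.
    """
    flat = flatten(features)
    errors = [p - e for p, e in zip(policy(weights, features), expert)]
    loss = sum(e * e for e in errors)
    new_weights = [
        [w - lr * 2.0 * err * v for w, v in zip(w_row, flat)]
        for w_row, err in zip(weights, errors)
    ]
    return new_weights, loss


# Toy 4x4 grayscale frame and a single expert demonstration (made-up values).
image = [[0.0, 0.1, 0.2, 0.3],
         [0.1, 0.2, 0.3, 0.4],
         [0.2, 0.3, 0.4, 0.5],
         [0.3, 0.4, 0.5, 0.6]]
features = patch_features(image, "Navigate to the blue pyramid", patch=2)
expert_command = [1.0, 0.0, 0.0, 0.1]
weights = [[0.0] * len(flatten(features)) for _ in expert_command]

losses = []
for _ in range(20):
    weights, loss = bc_step(weights, features, expert_command)
    losses.append(loss)
print(losses[0], losses[-1])  # the imitation loss shrinks as the head fits the expert
```

Because the encoder is frozen, its features can be computed once per frame-instruction pair and reused, so in this sketch training only touches the small weight matrix of the head.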

Paper Structure

This paper contains 29 sections, 5 equations, 9 figures, 5 tables, 2 algorithms.

Figures (9)

  • Figure 1: Flex pipeline: The input image is masked and, together with a user-specified text instruction, encoded via a pre-trained VLM to create a grid of rich per-patch features. A policy network trained on these features directly computes robot commands.
  • Figure 2: Success Rate (%) as a function of Dataset Complexity, feature extractor Resolution, Policy Network architecture, and VLM Encoder model choice across all five simulation test scenarios (per column). Darker lines correspond to the in-distribution (InD) scene, and lighter lines to the out-of-distribution (OoD) background. Each data point is obtained from 100 runs with the command syntax "Navigate to the [OBJECT]".
  • Figure 3: Flex sample real test run: Frames from a test run with the text instruction "Fly to the man with a wig". Time increases from left to right. In the last frame, the cardboard cutout is blown off the tripod support by the drone propellers. The wig remains.
  • Figure 4: Absolute error to expert per policy network for each output dimension ($v_x, v_y$, and $v_z$ in m/s; $\dot{\psi}$ in rad/s). Each data point is derived from 22.5k frame-instruction pairs.
  • Figure 5: Feature clustering and visualization through the 64-patch ViT policy network. The instruction is "Navigate to the blue pyramid", paired with a frame from the OoD simulation scene (top left, with a grid overlay separating the patches). The top row depicts the cluster memberships by color, with the goal belonging to blue. The bottom row visualizes the features' t-SNE embeddings in 3D.
  • ...and 4 more figures
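
The error metric named in the Figure 4 caption (absolute error to the expert, per output dimension) can be sketched as follows. This is a minimal illustration with made-up numbers: the function name is hypothetical, and averaging over frames is an assumption, since the caption does not state how the per-frame errors are aggregated.

```python
def mean_abs_error_per_dim(predicted, expert):
    """Mean absolute error between policy and expert commands, per output dimension.

    Each command is [v_x, v_y, v_z, yaw_rate]; averaging over frames is an assumption.
    """
    n_frames = len(predicted)
    n_dims = len(predicted[0])
    return [
        sum(abs(p[d] - e[d]) for p, e in zip(predicted, expert)) / n_frames
        for d in range(n_dims)
    ]


# Two made-up frames of policy output and matching expert commands.
predicted = [[1.0, 0.0, 0.0, 0.5], [0.5, 0.0, 0.0, 0.0]]
expert = [[0.5, 0.0, 0.0, 0.5], [0.5, 0.5, 0.0, 0.25]]
errors = mean_abs_error_per_dim(predicted, expert)
print(errors)  # [0.25, 0.25, 0.0, 0.125]
```

Keeping the dimensions separate matters because the first three are velocities in m/s and the last is a yaw rate in rad/s, so a single pooled number would mix units.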