Flex: End-to-End Text-Instructed Visual Navigation from Foundation Model Features

Makram Chahine; Alex Quach; Alaa Maalouf; Tsun-Hsuan Wang; Daniela Rus

TL;DR

This paper tackles end-to-end vision-language navigation under open-set text instructions with limited demonstrations. It introduces Flex, a minimalist framework that freezes Vision-Language Model encoders to produce dense patch-wise text-vision features and trains a lightweight policy head via imitation learning. Key findings show that patch-level fusion with two-object training enables robust generalization to unseen goals, objects, and real-world scenes, including zero-shot sim-to-real transfer. The approach reduces data and computation compared to large-scale RL or language-driven planners, enabling interactive, open-vocabulary robotic navigation in practical settings.
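The patch-wise text-vision fusion mentioned above can be illustrated with a small sketch. It is not the authors' implementation: the random arrays stand in for frozen VLM outputs, and `fuse_patches`, the cosine-similarity weighting, and the grid size are all hypothetical choices made for illustration.

```python
import numpy as np

def fuse_patches(patch_feats, text_emb):
    """Append a per-patch text-similarity channel to patch features.

    patch_feats: (H, W, D) array, a stand-in for frozen per-patch image features.
    text_emb:    (D,) array, a stand-in for the frozen text-encoder output.
    Returns (fused, sim) with shapes (H, W, D + 1) and (H, W).
    """
    # Normalise so the dot product below is a cosine similarity.
    p = patch_feats / (np.linalg.norm(patch_feats, axis=-1, keepdims=True) + 1e-8)
    t = text_emb / (np.linalg.norm(text_emb) + 1e-8)
    sim = p @ t  # (H, W): how well each patch matches the instruction
    fused = np.concatenate([patch_feats, sim[..., None]], axis=-1)
    return fused, sim

# Toy data (assumption): a 4x4 patch grid with 8-dimensional features.
rng = np.random.default_rng(0)
H, W, D = 4, 4, 8
patch_feats = rng.normal(size=(H, W, D))
text_emb = rng.normal(size=D)

# Plant the "goal" at patch (1, 2) by making its feature parallel to the text.
patch_feats[1, 2] = 3.0 * text_emb

fused, sim = fuse_patches(patch_feats, text_emb)
best = tuple(int(i) for i in np.unravel_index(np.argmax(sim), sim.shape))
print("most text-aligned patch:", best)
```

Because the similarity is kept per patch rather than pooled into one global score, a downstream policy head can still see where in the image the instructed goal lies, which is the property the TL;DR attributes to patch-level fusion.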

Abstract

End-to-end learning directly maps sensory inputs to actions, creating highly integrated and efficient policies for complex robotics tasks. However, such models often struggle to generalize beyond their training scenarios, limiting adaptability to new environments, tasks, and concepts. In this work, we investigate the minimal data requirements and architectural adaptations necessary to achieve robust closed-loop performance with vision-based control policies under unseen text instructions and visual distribution shifts. Our findings are synthesized in Flex (Fly lexically), a framework that uses pre-trained Vision Language Models (VLMs) as frozen patch-wise feature extractors, generating spatially aware embeddings that integrate semantic and visual information. We demonstrate the effectiveness of this approach on a quadrotor fly-to-target task, where agents trained via behavior cloning on a small simulated dataset successfully generalize to real-world scenes with diverse novel goals and command formulations.
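As a rough illustration of training a lightweight head by behavior cloning on top of frozen features, the sketch below fits a linear policy with gradient descent on a mean-squared imitation loss. Everything here is assumed for illustration: the synthetic features, the 4-dimensional action (for example, three velocity components and a yaw rate), the linear head, and the hypothetical `train_bc_head` helper. The paper's actual policy head and loss may differ.

```python
import numpy as np

def train_bc_head(features, expert_actions, lr=0.5, steps=200):
    """Fit a linear policy head by gradient descent on a mean-squared BC loss.

    features:       (N, D) array, stand-in for frozen fused features per frame.
    expert_actions: (N, K) array, stand-in for demonstrated control commands.
    Returns (W, b, losses), where losses[i] is the loss before update i.
    """
    n, d = features.shape
    k = expert_actions.shape[1]
    W = np.zeros((d, k))
    b = np.zeros(k)
    losses = []
    for _ in range(steps):
        err = features @ W + b - expert_actions   # (N, K) prediction error
        losses.append(float(np.mean(err ** 2)))
        # Gradients of mean(err**2) taken over all N*K entries.
        W -= lr * 2.0 * features.T @ err / (n * k)
        b -= lr * 2.0 * err.mean(axis=0) / k
    return W, b, losses

# Synthetic demonstrations (assumption): actions are a noiseless linear
# function of the features, so the head can fit them almost exactly.
rng = np.random.default_rng(0)
features = rng.normal(size=(256, 16))
true_W = rng.normal(size=(16, 4))
expert_actions = features @ true_W

W, b, losses = train_bc_head(features, expert_actions)
print(f"BC loss: {losses[0]:.3f} -> {losses[-1]:.6f}")
```

Only `W` and `b` are updated; the feature extractor is never touched, which mirrors the frozen-encoder setup described in the abstract and keeps the number of trained parameters small.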
