I-Perceive: A Foundation Model for Active Perception with Language Instructions

Yongxi Huang, Zhuohang Wang, Wenjing Tang, Cewu Lu, Panpan Cai

TL;DR

I-Perceive is a foundation model for active perception conditioned on natural language instructions, designed for mobile manipulators and indoor environments. It significantly outperforms state-of-the-art VLMs in both the prediction accuracy and the instruction following of generated camera views, and it exhibits strong zero-shot generalization to novel scenes and tasks.

Abstract

Active perception, the ability of a robot to proactively adjust its viewpoint to acquire task-relevant information, is essential for robust operation in unstructured real-world environments. Although active perception is critical for downstream tasks such as manipulation, existing approaches have largely been confined to local settings (e.g., table-top scenes) with fixed perception objectives (e.g., occlusion reduction). Addressing active perception with open-ended intents in large-scale environments remains an open challenge. To bridge this gap, we propose I-Perceive, a foundation model for active perception conditioned on natural language instructions, designed for mobile manipulators and indoor environments. I-Perceive predicts camera views that follow open-ended language instructions, based on image-based scene contexts. By fusing a Vision-Language Model (VLM) backbone with a geometric foundation model, I-Perceive bridges semantic and geometric understanding, thus enabling effective reasoning for active perception. We train I-Perceive on a diverse dataset comprising real-world scene-scanning data and simulation data, both processed via an automated and scalable data generation pipeline. Experiments demonstrate that I-Perceive significantly outperforms state-of-the-art VLMs in both prediction accuracy and instruction following of generated camera views, and exhibits strong zero-shot generalization to novel scenes and tasks.

Paper Structure (37 sections, 3 equations, 5 figures, 3 tables)

Figures (5)

  • Figure 1: I-Perceive is a foundation model for vision-language active perception. Given context images and a natural language instruction, the model predicts a target camera pose that fulfills the observation intent. By fusing VLM semantic features into a geometric backbone, I-Perceive demonstrates strong zero-shot generalization to unseen real-world environments.
  • Figure 2: I-Perceive model architecture. The VLM backbone is not shown in detail for simplicity. Semantic features are extracted from intermediate layers of the VLM and fused into semantic tokens in the S-VGGT geometric backbone.
  • Figure 3: Qualitative results of I-Perceive and VLM baselines on our test set. The start frame and one context frame are shown on the left, with the language instruction at the bottom. The target views predicted by each method are visualized as differently colored frustums in 3D space, together with their corresponding rendered RGB images.
  • Figure 4: (a) Real photos captured by mobile cameras in indoor environments. (b) Rendered images from Coohom. Input frames are shown on the left with language instructions at the bottom. Predicted target views for different instructions are visualized as colored frustums in 3D space, along with their rendered RGB images on the right. The rendered RGB images are obtained directly by projecting the dense point cloud estimated by S-VGGT onto the predicted target views.
  • Figure 5: Calling I-Perceive in a closed-loop manner. The first row shows the RGB images captured at each step, and the second row visualizes the current target pose prediction as green frustums in 3D space.
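
The Figure 4 caption says the rendered RGB images come from projecting a dense point cloud onto the predicted target views. The sketch below illustrates that general idea with a plain pinhole camera and a per-pixel depth test. It is not the authors' implementation: the function name `render_point_cloud`, the world-to-camera pose convention, and the intrinsics `fx`, `fy`, `cx`, `cy` are assumptions made for illustration.

```python
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float, float]
Color = Tuple[int, int, int]


def render_point_cloud(
    points: Sequence[Point],
    colors: Sequence[Color],
    rotation: Sequence[Sequence[float]],  # 3x3 world-to-camera rotation (assumed convention)
    translation: Sequence[float],         # world-to-camera translation (assumed convention)
    fx: float, fy: float, cx: float, cy: float,  # hypothetical pinhole intrinsics
    width: int, height: int,
) -> List[List[Optional[Color]]]:
    """Project colored 3D points into a pinhole camera, keeping the nearest point per pixel."""
    image: List[List[Optional[Color]]] = [[None] * width for _ in range(height)]
    depth = [[float("inf")] * width for _ in range(height)]
    for p, color in zip(points, colors):
        # World frame -> camera frame: x_cam = R * x_world + t
        xc, yc, zc = (
            sum(rotation[i][j] * p[j] for j in range(3)) + translation[i]
            for i in range(3)
        )
        if zc <= 0:
            continue  # point lies behind the camera
        # Perspective projection to pixel coordinates
        u = int(round(fx * xc / zc + cx))
        v = int(round(fy * yc / zc + cy))
        # Depth test: only the closest point is kept at each pixel
        if 0 <= u < width and 0 <= v < height and zc < depth[v][u]:
            depth[v][u] = zc
            image[v][u] = color
    return image


# Toy usage: identity pose, four points, 5x5 image.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
points = [(0.0, 0.0, 2.0), (0.0, 0.0, 1.0), (1.0, 0.0, 2.0), (0.0, 0.0, -1.0)]
colors = [(255, 0, 0), (0, 255, 0), (0, 0, 255), (9, 9, 9)]
image = render_point_cloud(
    points, colors, identity, [0.0, 0.0, 0.0], 2.0, 2.0, 2.0, 2.0, 5, 5
)
```

In this toy case the green point occludes the red one at the image center, the blue point lands one pixel to the right, and the point behind the camera is dropped. Pixels that no point reaches stay `None`, which is why renderings of this kind typically show holes where the point cloud is sparse.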