See, Act, Adapt: Active Perception for Unsupervised Cross-Domain Visual Adaptation via Personalized VLM-Guided Agent

Tianci Tang; Tielong Cai; Hongwei Wang; Gaoang Wang

TL;DR

This work transforms a vision-language model into a low-level pose controller through a two-stage training pipeline: first fine-tuning it on rule-based exploration trajectories that systematically probe indoor scenes, and then refining the policy via unsupervised reinforcement learning that constructs rewards from the perception module's outputs and confidence.

Abstract

Pre-trained perception models excel in generic image domains but degrade significantly in novel environments such as indoor scenes. The conventional remedy is fine-tuning on downstream data, which incurs catastrophic forgetting of prior knowledge and demands costly, scene-specific annotations. We propose a paradigm shift through Sea$^2$ (See, Act, Adapt): rather than adapting the perception modules themselves, we adapt how they are deployed through an intelligent pose-control agent. Sea$^2$ keeps all perception modules frozen, requires no downstream labels during training, and uses only scalar perceptual feedback to navigate the agent toward informative viewpoints. Specifically, we transform a vision-language model (VLM) into a low-level pose controller through a two-stage training pipeline: first fine-tuning it on rule-based exploration trajectories that systematically probe indoor scenes, and then refining the policy via unsupervised reinforcement learning that constructs rewards from the perception module's outputs and confidence. Unlike prior active perception methods that couple exploration with specific models or collect data for retraining them, Sea$^2$ directly leverages off-the-shelf perception models for various tasks without retraining. We conduct experiments on three visual perception tasks (visual grounding, segmentation, and 3D box estimation) and obtain performance improvements of 13.54%, 15.92%, and 27.68%, respectively, on the ReplicaCAD dataset.
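
The label-free reward described in the abstract can be illustrated with a minimal sketch. The outline of the paper lists a format reward and a confidence reward; everything else here is an assumption made for illustration: the `<action>...</action>` output syntax, the action names, the `PerceptionOutput` type, the use of a confidence difference between consecutive views, and the weights `w_format` and `w_conf` are all hypothetical and not taken from the paper.

```python
import re
from dataclasses import dataclass

# Hypothetical action syntax and action set: the source does not specify
# how the VLM controller encodes its pose-control commands.
ACTION_PATTERN = re.compile(
    r"<action>(move_forward|turn_left|turn_right|look_up|look_down|stop)</action>"
)


@dataclass
class PerceptionOutput:
    """Scalar feedback from a frozen perception module (assumed to lie in [0, 1])."""
    confidence: float


def format_reward(response: str) -> float:
    """Return 1.0 if the response is exactly one well-formed action, else 0.0."""
    return 1.0 if ACTION_PATTERN.fullmatch(response.strip()) else 0.0


def confidence_reward(before: PerceptionOutput, after: PerceptionOutput) -> float:
    """Reward the change in perception confidence caused by the action.

    Using a difference between consecutive views is an assumption; no
    ground-truth label is needed either way.
    """
    return after.confidence - before.confidence


def total_reward(
    response: str,
    before: PerceptionOutput,
    after: PerceptionOutput,
    w_format: float = 0.5,  # hypothetical weight
    w_conf: float = 1.0,    # hypothetical weight
) -> float:
    """Weighted sum of the format and confidence terms."""
    return w_format * format_reward(response) + w_conf * confidence_reward(before, after)
```

For example, `total_reward("<action>turn_left</action>", PerceptionOutput(0.25), PerceptionOutput(0.75))` combines a satisfied format check with a confidence gain of 0.5. The perception module is only queried, never updated, which matches the frozen-module setting described above.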

Paper Structure (30 sections, 7 equations, 7 figures, 4 tables)

Introduction
Related Work
Embodied Active Perception
Multimodal Foundation Models
RL-based Post-training for VLMs
Method
Overview and Problem Formulation
Optimization Objective.
Framework Design
Perception Module Interface.
Training Pipeline
Supervised Fine-Tuning (SFT)
Reinforcement Learning (RL)
Format Reward.
Confidence Reward.
...and 15 more sections

Figures (7)

Figure 1: Overview of unsupervised cross-domain adaptation via a VLM-guided active perception agent. (Left) The source domain consists of generic images (e.g., COCO [Lin et al., 2015], web images) on which the perception models are pre-trained. (Middle) Our embodied agent actively explores the indoor environment and adjusts its camera pose to capture information-rich observations that maximize perception quality. (Right) The target domain consists of embodied scenes (e.g., Habitat [Savva et al., 2019]) in which our VLM-guided agent is trained to bridge the domain gap without supervision.
Figure 2: Illustration of the active perception process for segmentation. The agent decomposes the instruction into task-specific metadata and reasons about the scene to execute camera-adjusting actions. Starting from a highly occluded initial view (yellow) where the perception output (green mask) for the target object (red box) is poor, it follows a trajectory of navigational steps (blue) to reduce visual ambiguity. The final viewpoint (red) offers a significantly clearer perspective for the perception module, yielding a substantial improvement in perception score compared to the initial state.
Figure 3: Illustration of our Sea$^2$ framework. In Stage 1, the VLM is fine-tuned on rule-based trajectories generated by heuristic algorithms to align it with spatial reasoning and control formats. In Stage 2, the VLM serves as a low-level pose controller for the agent and is further refined using unsupervised reinforcement learning with GRPO. The agent interacts with the environment, receiving observations and taking actions, and its policy is optimized with rewards derived from the confidence and results of the selected frozen perception module (e.g., grounding confidence, mask area). The selected perception module remains frozen throughout training, ensuring no catastrophic forgetting of prior knowledge. The final policy enables the agent to navigate to informative viewpoints that enhance the performance of the perception modules without requiring any downstream annotations.
Figure 4: Illustration of the active perception process for visual grounding. From a poor initial view (yellow) where the prediction (green box) for the target (red box) is inaccurate, the agent takes navigational steps (blue) to reduce ambiguity, reaching a final viewpoint (red) that greatly improves the perception result.
Figure 5: Illustration of the active perception process for segmentation. From a poor initial view (yellow) where the prediction (green box) for the target (red box) is inaccurate, the agent takes navigational steps (blue) to reduce ambiguity, reaching a final viewpoint (red) that greatly improves the perception result.
...and 2 more figures
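
The Figure 3 caption states that Stage 2 refines the policy with GRPO. As a rough sketch of the general GRPO idea (not the authors' implementation), each sampled response is scored relative to the other responses drawn for the same prompt, so no learned value function is needed. The function name, the `eps` constant, and the choice of population standard deviation are assumptions; implementations differ on these details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one group of responses sampled for the same state.

    Each advantage is (reward - group mean) / (group std + eps). The eps term
    (an assumed constant) avoids division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical rollouts from the same viewpoint, scored by a scalar reward.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Responses with above-average reward receive positive advantages and are reinforced, while below-average ones are suppressed. When every response in a group earns the same reward, all advantages are zero and that group contributes no gradient signal.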
