Act, Sense, Act: Learning Non-Markovian Active Perception Strategies from Large-Scale Egocentric Human Data

Jialiang Li, Yi Qiao, Yunhan Guo, Changwen Chen, Wenzhao Lian

TL;DR

The paper tackles generalizable robotic manipulation in unconstrained settings by reframing active perception as a non-Markovian decision process driven by information gain and decision branching. It introduces CoMe-VLA, a cognitive and memory-aware vision-language-action (VLA) framework that distills priors from large-scale human egocentric data and aligns human and robot hand-eye coordination in a unified egocentric action space, enabling robust long-horizon behavior. The model is trained in three stages (cognitive pretraining, cognition–action pretraining, and robot finetuning) and relies on a dual-track memory and a cognitive auxiliary head; it performs strongly across diverse active-perception tasks on a wheel-based humanoid. The results indicate that leveraging human priors significantly reduces robot data requirements while maintaining high success rates, and that behavior remains robust under dynamic perturbations, a step toward practical active perception in unstructured environments.

Abstract

Achieving generalizable manipulation in unconstrained environments requires the robot to proactively resolve information uncertainty, i.e., the capability of active perception. However, existing methods are often confined to a limited set of sensing behaviors, restricting their applicability to complex environments. In this work, we formalize active perception as a non-Markovian process driven by information gain and decision branching, providing a structured categorization of visual active perception paradigms. Building on this perspective, we introduce CoMe-VLA, a cognitive and memory-aware vision-language-action (VLA) framework that leverages large-scale human egocentric data to learn versatile exploration and manipulation priors. Our framework integrates a cognitive auxiliary head for autonomous sub-task transitions and a dual-track memory system that maintains consistent self and environmental awareness by fusing proprioceptive and visual temporal contexts. By aligning human and robot hand-eye coordination behaviors in a unified egocentric action space, we train the model progressively in three stages. Extensive experiments on a wheel-based humanoid demonstrate the strong robustness and adaptability of our proposed method across diverse long-horizon tasks spanning multiple active perception scenarios.

Paper Structure (61 sections, 18 equations, 8 figures, 7 tables)

This paper contains 61 sections, 18 equations, 8 figures, 7 tables.

Figures (8)

  • Figure 2: Comparison between passive perception and active perception. Left: Passive perception, where actions are executed based on static observation and limited information is acquired when targets are out of view or occluded. Right: Active perception, which enables proactive actions to resolve task-relevant information uncertainty, illustrated through three paradigms, i.e., information discovery via viewpoint change or manipulation, and information enrichment.
  • Figure 3: VR-based immersive teleoperation system. The operator performs perception-driven control based solely on egocentric video streaming from the robot.
  • Figure 4: CoMe-VLA Overview. CoMe-VLA integrates a pre-trained VLM (Qwen3-VL-2B; Bai et al., 2025) with a transformer-based proprioceptive memory encoder to construct temporal visual-semantic and proprioceptive contexts, which are fed into a flow-matching action decoder to generate a 29-D action chunk. The VLM also outputs a cognitive latent token for the cognitive auxiliary head, which predicts a binary label for autonomous task transition. CoMe-VLA is trained in three stages using human and robot data. See the main text for details.
  • Figure 5: Evaluated Tasks. All tasks are designed with uncertain initial conditions: the locations of the target or task-critical objects are unknown to the model before execution and vary across multiple configurations. Croissant Search: Find a croissant that is on the table but initially out of view. Can Disposal: Throw cans into a dustbin that is in the room but initially out of view. Bottle Retrieval: Open the cabinet to retrieve an initially occluded bottle. Cylinder Hunt: Uncover two inverted bowls to locate an initially hidden cylinder. Ring Peg: Bring the peg closer to the camera to gather sufficient information for precise insertion.
  • Figure 6: Ablation on Cognitive Grounding.
  • ...and 3 more figures
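
The control flow described in the Figure 4 caption (temporal visual and proprioceptive contexts feeding an action decoder, plus a binary head that triggers sub-task transitions) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: `DualTrackMemory`, `run_episode`, `toy_policy`, and the `horizon` parameter are hypothetical names, and the stub policy stands in for the VLM, memory encoder, and flow-matching decoder. Only the 29-D action size and the binary transition label come from the caption.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, List, Sequence, Tuple

ACTION_DIM = 29  # action dimensionality stated in the Figure 4 caption


@dataclass
class DualTrackMemory:
    """Two bounded histories, one visual and one proprioceptive (hypothetical layout)."""
    horizon: int = 4
    visual: Deque[str] = field(default_factory=deque)
    proprio: Deque[Sequence[float]] = field(default_factory=deque)

    def push(self, frame: str, state: Sequence[float]) -> None:
        self.visual.append(frame)
        self.proprio.append(state)
        # Keep only the most recent `horizon` steps on both tracks.
        while len(self.visual) > self.horizon:
            self.visual.popleft()
            self.proprio.popleft()


def toy_policy(subtask: str, memory: DualTrackMemory) -> Tuple[List[List[float]], bool]:
    """Stub policy: returns a zero action chunk and a binary transition flag.

    The flag depends on the whole remembered history, not just the latest
    frame, which is what makes the decision non-Markovian in this toy setup.
    """
    chunk = [[0.0] * ACTION_DIM for _ in range(2)]
    target = subtask.split()[-1]
    transition = any(target in frame for frame in memory.visual)
    return chunk, transition


def run_episode(subtasks, observations, policy, memory):
    """Advance to the next sub-task whenever the policy's binary head fires."""
    idx, log = 0, []
    for frame, state in observations:
        if idx >= len(subtasks):
            break  # all sub-tasks completed
        memory.push(frame, state)
        chunk, transition = policy(subtasks[idx], memory)
        assert all(len(action) == ACTION_DIM for action in chunk)
        log.append((subtasks[idx], len(memory.visual)))
        if transition:
            idx += 1
    return idx, log


observations = [("empty table", [0.0]), ("croissant on plate", [0.1]),
                ("croissant on plate", [0.2])]
done, log = run_episode(["find croissant"], observations, toy_policy, DualTrackMemory())
```

In this toy run the transition flag fires on the second observation, so the single sub-task is marked complete and the third observation is never consumed. A real system would replace `toy_policy` with learned components and would likely use image tensors rather than strings for the visual track.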