SENSOR: Imitate Third-Person Expert's Behaviors via Active Sensoring
Kaichen Huang, Minghao Shao, Shenghua Wan, Hai-Hang Sun, Shuai Feng, Le Gan, De-Chuan Zhan
TL;DR
The paper tackles the misalignment between expert and agent viewpoints in visual imitation learning, where domain alignment alone struggles under large perspective gaps. It introduces SENSOR, a model-based framework that uses active sensoring to adjust the agent's viewpoint to match the expert's. SENSOR combines a world model (RSSM), separate motor and sensor policies, a discriminator ensemble for robust rewards, and an adaptive $\epsilon$-reward that balances exploration and exploitation. Learning relies on a two-encoder, two-policy architecture and a likelihood-based ELBO objective that encourages accurate latent dynamics and observation reconstruction. On DMC locomotion tasks, SENSOR achieves superior performance and stability across hard perspectives, and ablations confirm the importance of separate actors, ensemble discrimination, and adaptive rewards. A variant with fully decoupled dynamics (SENSOR-decoupled) underperforms due to instability and looser theoretical guarantees. Overall, the work shows that active sensoring can reduce viewpoint-induced imprecision in IL and improve robustness and sample efficiency in perception-driven control, and it outlines directions for extending the approach to changing expert perspectives.
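For orientation, the likelihood-based ELBO mentioned above can be illustrated with the standard one-step bound used to train RSSM world models. This is a sketch of the generic RSSM objective, not necessarily the exact loss used by SENSOR. The symbols are assumptions introduced here: $o_t$ is the image observation, $s_t$ the latent state, $a_t$ the action (which in this setting would presumably include both motor and sensor commands), $q$ the encoder-based posterior, and $p$ the learned prior and decoder.

$$
\ln p(o_{1:T} \mid a_{1:T}) \ge \sum_{t=1}^{T} \Big( \mathbb{E}_{q(s_t \mid o_{\le t}, a_{<t})}\big[\ln p(o_t \mid s_t)\big] - \mathbb{E}_{q(s_{t-1} \mid o_{\le t-1}, a_{<t-1})}\big[\mathrm{KL}\big[q(s_t \mid o_{\le t}, a_{<t}) \,\|\, p(s_t \mid s_{t-1}, a_{t-1})\big]\big] \Big)
$$

The first term rewards observation reconstruction and the second keeps the latent dynamics prior close to the posterior, which corresponds to the two properties the summary attributes to the objective.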
Abstract
In many real-world visual Imitation Learning (IL) scenarios, the agent's and the expert's perspectives are misaligned, which can cause imitation to fail. Previous methods have generally addressed this problem through domain alignment, which incurs extra computation and storage costs and fails to handle the \textit{hard cases} where the viewpoint gap is too large. To alleviate these problems, we introduce active sensoring in the visual IL setting and propose a model-based SENSory imitatOR (SENSOR) that automatically changes the agent's perspective to match the expert's. SENSOR jointly learns a world model to capture the dynamics of latent states, a sensor policy to control the camera, and a motor policy to control the agent. Experiments on visual locomotion tasks show that SENSOR can efficiently simulate the expert's perspective and strategy, and that it outperforms most baseline methods.
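The split between a motor policy and a sensor policy can be made concrete with a small sketch. The following is a toy illustration, not the authors' implementation. The dimensions, the `LinearTanhPolicy` class, the `act` helper, and the choice to feed both policies the same latent vector are all assumptions made here for illustration; in SENSOR the latent would be inferred by the learned world model rather than sampled at random.

```python
import numpy as np

rng = np.random.default_rng(0)

# All sizes below are hypothetical and chosen only for illustration.
LATENT_DIM = 32   # size of the world-model latent state
MOTOR_DIM = 6     # number of joint actuators
SENSOR_DIM = 3    # camera controls, e.g. azimuth, elevation, distance


class LinearTanhPolicy:
    """Toy deterministic policy: action = tanh(W @ latent + b), bounded in [-1, 1]."""

    def __init__(self, latent_dim, action_dim, rng):
        self.W = rng.normal(scale=0.1, size=(action_dim, latent_dim))
        self.b = np.zeros(action_dim)

    def __call__(self, latent):
        return np.tanh(self.W @ latent + self.b)


# Two separate actors with independent parameters.
motor_policy = LinearTanhPolicy(LATENT_DIM, MOTOR_DIM, rng)
sensor_policy = LinearTanhPolicy(LATENT_DIM, SENSOR_DIM, rng)


def act(latent):
    """Query both policies on the same latent state and return the two actions."""
    motor_action = motor_policy(latent)    # moves the agent's body
    sensor_action = sensor_policy(latent)  # moves the agent's camera
    return motor_action, sensor_action


# Stand-in for a latent state inferred by a world model (assumption).
latent = rng.normal(size=LATENT_DIM)
motor_action, sensor_action = act(latent)
print(motor_action.shape, sensor_action.shape)
```

Keeping the two actors separate means that camera control and body control have independent parameters, so a change to one policy does not directly overwrite the other.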
