SENSOR: Imitate Third-Person Expert's Behaviors via Active Sensoring

Kaichen Huang, Minghao Shao, Shenghua Wan, Hai-Hang Sun, Shuai Feng, Le Gan, De-Chuan Zhan

TL;DR

The paper tackles the misalignment between expert and agent viewpoints in visual imitation learning, where domain alignment alone struggles under large perspective gaps. It introduces SENSOR, a model-based framework that uses active sensoring to adjust the agent's viewpoint to match the expert's, combining a world model (RSSM), separate motor and sensor policies, a discriminator ensemble for robust rewards, and an adaptive $\epsilon$-reward to balance exploration and exploitation. Sensor learning is facilitated by a two-encoder, two-policy architecture and a likelihood-based ELBO objective that encourages accurate latent dynamics and observation reconstruction. Empirical results on DMC locomotion tasks show that SENSOR outperforms the baselines in performance and stability across hard perspectives, with ablations confirming the importance of separate actors, ensemble discrimination, and adaptive rewards; a variant with fully decoupled dynamics (SENSOR-decoupled) underperforms due to instability and looser theoretical guarantees. Overall, the work shows that active sensoring can reduce viewpoint-induced mismatch in IL, improving robustness and sample efficiency on the evaluated visual control tasks, and it outlines directions for extending to changing expert perspectives.

Abstract

In many real-world visual Imitation Learning (IL) scenarios, there is a misalignment between the agent's and the expert's perspectives, which might lead to the failure of imitation. Previous methods have generally solved this problem by domain alignment, which incurs extra computation and storage costs, and these methods fail to handle the "hard cases" where the viewpoint gap is too large. To alleviate the above problems, we introduce active sensoring in the visual IL setting and propose a model-based SENSory imitatOR (SENSOR) to automatically change the agent's perspective to match the expert's. SENSOR jointly learns a world model to capture the dynamics of latent states, a sensor policy to control the camera, and a motor policy to control the agent. Experiments on visual locomotion tasks show that SENSOR can efficiently simulate the expert's perspective and strategy, and outperforms most baseline methods.
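
The split into a motor policy and a sensor policy can be illustrated with a minimal sketch. It follows the Figure 3 caption in spirit (two encoders $G_z$ and $G_c$, two policies $\pi^z$ and $\pi^c$, and a joint action $a=\text{concat}(a^z,a^c)$), but everything else is an assumption made for illustration: the linear maps standing in for networks, the tanh squashing, the random weights, and all dimensions are hypothetical and not taken from the paper.

```python
import math
import random

def linear(weights, x):
    """Apply a weight matrix (list of rows) to a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def tanh_policy(weights, features):
    """Hypothetical stand-in for a policy network: tanh of a linear map."""
    return [math.tanh(v) for v in linear(weights, features)]

def random_matrix(rows, cols, rng):
    """Random weights, used only so the sketch runs end to end."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)

# Assumed sizes, chosen only for illustration.
LATENT_DIM, FEATURE_DIM = 8, 4
MOTOR_DIM, SENSOR_DIM = 6, 3

G_z = random_matrix(FEATURE_DIM, LATENT_DIM, rng)   # motor encoder (assumed linear)
G_c = random_matrix(FEATURE_DIM, LATENT_DIM, rng)   # sensor encoder (assumed linear)
pi_z = random_matrix(MOTOR_DIM, FEATURE_DIM, rng)   # motor policy weights
pi_c = random_matrix(SENSOR_DIM, FEATURE_DIM, rng)  # sensor policy weights

def joint_action(s):
    """Map a latent state s to the concatenated action (a_z, a_c)."""
    z = linear(G_z, s)           # motor state extracted from s
    c = linear(G_c, s)           # sensor state extracted from s
    a_z = tanh_policy(pi_z, z)   # action that controls the agent's body
    a_c = tanh_policy(pi_c, c)   # action that controls the camera
    return a_z + a_c             # list concatenation: a = concat(a_z, a_c)

s = [rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]
a = joint_action(s)
```

In this sketch the two policies see different features of the same latent state, and only their concatenated output would be passed on to a dynamics model.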

Paper Structure

This paper contains 27 sections, 1 theorem, 12 equations, 10 figures, 2 tables, 1 algorithm.

Key Result

Proposition 2.1

(Divergence in latent space) Given a POMDP $\mathcal{M}$, let $h_t=(o_{\leq t},a_{<t})$ be the history and $\hat{s}_t=q(h_t)$ its latent representation. Let $s_t\sim P(s_t|h_t)\approx P(s_t|\hat{s}_t)$, $a^z\sim \pi^z$ and $a^c\sim \pi^c$, and let $\mathbb{D}_f$ denote an $f$-divergence. Then

Figures (10)

  • Figure 1: Left: Intuitive distinction between observations rendered from different viewpoints. The top is the expert viewpoint, while the bottom is a poorer agent viewpoint. Domain adaptation methods learn an encoder $f$ to map different observations to the same embedding $z$, while active vision explicitly adjusts the agent's viewpoint through a sensor policy $\pi^c$. Right: We evaluate SENSOR and competing methods over three seeds in Cheetah Run under the two hard initial perspectives shown on the left. We report the mean (solid) and standard deviation (shaded) of normalized return. SENSOR beats the other methods in both performance and stability.
  • Figure 2: Results illustrating the limitations of domain alignment methods. (a): The relationship between the mutual information of different viewpoints and performance across the viewpoint settings learned by DisentanGAIL. Each point represents a single viewpoint, and the label next to it states how that viewpoint differs from the expert's. Details in \ref{domain-limitation}. (b): The t-SNE plot (van der Maaten and Hinton, 2008) shows the difference between the embeddings of the two domains learned by two agents trained on different viewpoints. The top panel is a viewpoint far from the expert's and the bottom panel is a closer one.
  • Figure 3: The main framework of the SENSOR method. Left: We use the Recurrent State Space Model (RSSM) structure from PlaNet to capture the transitions of the latent state $s$, and design two encoders $G_z$ and $G_c$ to extract the motor state $z$ and the sensor state $c$ from $s$. We propose two policy networks $\pi^z$ and $\pi^c$ to make decisions based on $z$ and $c$ respectively, and discriminators $D_{\psi}$ to provide reward signals for actor updates. Additionally, we feed the concatenation $a=\text{concat}(a^z,a^c)$ into the dynamics model $p_{\theta}$ and the state encoder $q_{\omega}$ to compute the prior and the posterior of the state. Right: We apply $p_{\theta}$ and $q_{\omega}$ to compute the prior and the posterior of batch data sampled from $\mathcal{B}_E\cup\mathcal{B}_{\pi}$, and then we update the world model by minimizing the consistency loss $\mathcal{L}_c$ and the reconstruction loss $\mathcal{L}_r$ mentioned in \ref{s-components}.
  • Figure 4: Evaluation results of SENSOR and the baseline methods over three seeds in two visual control tasks in DMC for $1$M steps with different initializations of the agent's perspective. The specific settings of the environment and the initial viewpoint are shown above each figure. The solid lines represent the average episodic returns, and the shaded areas around them represent the variance of the performance across seeds. The gray dotted line denotes the return of the expert policy, while the yellow dotted line is the performance of Behavior Cloning. SENSOR outperforms the other methods in both performance and stability under different views in different environments.
  • Figure 5: (a): Camera parameters can be represented as a tuple $(d,a,e)$, where $d$ denotes the distance from the camera to the target point $O$, $a$ is the horizontal angle relative to $O$, and $e$ is the vertical angle relative to $O$. Detailed perspective specifications in \ref{exp}. (b): Selected hard viewpoints for later experiments.
  • ...and 5 more figures
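
The camera tuple $(d,a,e)$ from the Figure 5 caption can be turned into a Cartesian camera position with a standard spherical-coordinate conversion. The sketch below is only an illustration under stated assumptions: the function name `camera_position`, the z-up axis convention, angles in degrees, the horizontal angle measured from the +x axis, and the vertical angle measured from the horizontal plane are all choices made here, and the simulator used in the paper may use a different convention.

```python
import math

def camera_position(d, a, e, target=(0.0, 0.0, 0.0)):
    """Convert (distance, horizontal angle, vertical angle) to a position.

    Assumed convention (not from the paper): z is up, angles are in degrees,
    a is measured in the x-y plane from the +x axis, and e is measured
    upward from the horizontal plane through the target point O.
    """
    az = math.radians(a)
    el = math.radians(e)
    x = target[0] + d * math.cos(el) * math.cos(az)
    y = target[1] + d * math.cos(el) * math.sin(az)
    z = target[2] + d * math.sin(el)
    return (x, y, z)

def distance_to_target(position, target=(0.0, 0.0, 0.0)):
    """Euclidean distance from a camera position to the target point."""
    return math.dist(position, target)

# A camera 3 units away, rotated 45 degrees horizontally, raised 30 degrees.
pos = camera_position(3.0, 45.0, 30.0)
```

Under this convention the camera always sits on a sphere of radius $d$ around $O$, so changing $a$ or $e$ moves the viewpoint without changing its distance to the target.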

Theorems & Definitions (2)

  • Proposition 2.1
  • Proof of Proposition 2.1
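
For reference, the $f$-divergence named in Proposition 2.1 has the following standard textbook definition. This is general background rather than the proposition's own statement, and the symbols $P$, $Q$ and $x$ are generic placeholders introduced here.

```latex
% Standard definition, for distributions P and Q with P absolutely
% continuous with respect to Q, and f convex with f(1) = 0:
\mathbb{D}_f(P \,\|\, Q) = \mathbb{E}_{x \sim Q}\!\left[ f\!\left( \frac{P(x)}{Q(x)} \right) \right]
% Example: choosing f(t) = t \log t recovers the KL divergence,
\mathbb{D}_{\mathrm{KL}}(P \,\|\, Q) = \mathbb{E}_{x \sim P}\!\left[ \log \frac{P(x)}{Q(x)} \right]
```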