Self-Supervised Interpretable End-to-End Learning via Latent Functional Modularity

Hyunki Seong, David Hyunchul Shim

TL;DR

MoNet introduces a functionally modular end-to-end network for autonomous navigation that combines perception, planning, and control with an internal latent decision and a self-attention-based perception path. It trains with a supervised policy loss $\mathcal{L}_{\pi}$ and a self-supervised latent-guided contrastive loss $\mathcal{L}_{LGC}$, which promotes task-specific planning without task-level labels. A post-hoc explainability pipeline maps latent decisions to perceptual saliency and decoded task probabilities. The approach yields interpretable online inferences, including spatial saliency maps and probabilistic task intents, while achieving robust sensorimotor performance across multiple indoor driving tasks and environments. This work advances explainable AI in robotics by linking latent planning decisions to human-interpretable signals without sacrificing end-to-end learning efficiency or control quality.

Abstract

We introduce MoNet, a novel functionally modular network for self-supervised and interpretable end-to-end learning. By leveraging its functional modularity with a latent-guided contrastive loss function, MoNet efficiently learns task-specific decision-making processes in latent space without requiring task-level supervision. Moreover, our method incorporates an online, post-hoc explainability approach that enhances the interpretability of end-to-end inferences without compromising sensorimotor control performance. In real-world indoor environments, MoNet demonstrates effective visual autonomous navigation, outperforming baseline models by 7% to 28% in task specificity analysis. We further explore the interpretability of our network through post-hoc analysis of perceptual saliency maps and latent decision vectors. This provides valuable insights into the incorporation of explainable artificial intelligence into robotic learning, encompassing both perceptual and behavioral perspectives. Supplementary materials are available at https://sites.google.com/view/monet-lgc.

Paper Structure

This paper contains 31 sections, 10 equations, 9 figures, and 2 tables.

Figures (9)

  • Figure 1: Our approach incorporates a functionally modular end-to-end network architecture, which includes a post-hoc method for an interpretable latent decision-making process.
  • Figure 2: Overview of our method. While the entire end-to-end network is optimized by the supervised imitation loss $\mathcal{L}_{\pi}$, the planning module is updated by the latent-guided contrastive loss $\mathcal{L}_{LGC}$, which is directed by the latent vector $z^p$.
  • Figure 3: Our self-supervised contrastive learning scheme assesses the similarity of the perceptual features to decide on positive and negative latent decision samples.
  • Figure 4: Overview of our post-hoc behavior interpretation process.
  • Figure 5: Hardware and experimental setup.
  • ...and 4 more figures
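
The captions for Figures 2 and 3 describe a contrastive scheme in which the similarity of perceptual features decides which latent decision samples count as positives and which as negatives. The exact formulation of $\mathcal{L}_{LGC}$ is not given in this section, so the following is only a minimal sketch of that idea under stated assumptions: the InfoNCE-style form, the function name `latent_guided_contrastive_loss`, and the `sim_threshold` and `temperature` parameters are all hypothetical and are not taken from the paper.

```python
import numpy as np

def latent_guided_contrastive_loss(features, latents,
                                   sim_threshold=0.8, temperature=0.1):
    """Hypothetical InfoNCE-style sketch, not the paper's definition.

    features: (n, d_f) perceptual features, n >= 2, no zero rows (assumed).
    latents:  (n, d_z) latent decision vectors, no zero rows (assumed).
    Pairs whose perceptual features have cosine similarity >= sim_threshold
    are treated as positives; all other pairs act as negatives.
    """
    # L2-normalize so that dot products are cosine similarities.
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    z = latents / np.linalg.norm(latents, axis=1, keepdims=True)

    n = f.shape[0]
    off_diag = ~np.eye(n, dtype=bool)

    # Positives are selected from perceptual-feature similarity (assumption).
    pos_mask = ((f @ f.T) >= sim_threshold) & off_diag

    # Latent similarities form the logits; self-pairs are excluded.
    logits = np.where(off_diag, (z @ z.T) / temperature, -np.inf)

    # Numerically stable row-wise log-softmax.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_prob = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

    # Average log-probability of positives for each anchor that has any.
    pos_counts = pos_mask.sum(axis=1)
    valid = pos_counts > 0
    if not valid.any():
        return 0.0
    pos_log_prob = np.where(pos_mask, log_prob, 0.0).sum(axis=1)
    return float(-(pos_log_prob[valid] / pos_counts[valid]).mean())


# Toy usage: two perceptual clusters, with latents that either follow
# the clusters or cut across them.
feats = np.array([[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.01, 1.0]])
aligned = feats.copy()
crossed = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
loss_aligned = latent_guided_contrastive_loss(feats, aligned)
loss_crossed = latent_guided_contrastive_loss(feats, crossed)
```

In this toy setup the loss is lower when latent vectors group the same way the perceptual features do, which is the behavior the Figure 3 caption suggests. In a full training loop, a term like this would be combined with the supervised imitation loss $\mathcal{L}_{\pi}$; how the two are weighted is not stated here.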