Activating Self-Attention for Multi-Scene Absolute Pose Regression

Miso Lee, Jihwan Kim, Jae-Pil Heo

TL;DR

This work presents an auxiliary loss that aligns queries and keys, preventing the distortion of query-key space and encouraging the model to find global relations by self-attention, thus outperforming existing methods in both outdoor and indoor scenes.

Abstract

Multi-scene absolute pose regression addresses the demand for fast and memory-efficient camera pose estimation across various real-world environments. Recently, a transformer-based model has been devised to regress the camera pose directly across multiple scenes. Despite its potential, the transformer encoders are underutilized because the self-attention map collapses, leaving them with low representation capacity. This work highlights the problem and investigates it from a new perspective: distortion of the query-key embedding space. Based on statistical analysis, we reveal that queries and keys are mapped into completely different spaces, while only a few keys are blended into the query region. This leads to the collapse of the self-attention map, as all queries are considered similar to those few keys. Therefore, we propose simple but effective solutions to activate self-attention. Concretely, we present an auxiliary loss that aligns queries and keys, preventing the distortion of the query-key space and encouraging the model to find global relations by self-attention. In addition, a fixed sinusoidal positional encoding is adopted instead of an undertrained learnable one, so that appropriate positional clues are reflected in the inputs of self-attention. As a result, our approach resolves the aforementioned problem effectively, thus outperforming existing methods in both outdoor and indoor scenes.
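
As a rough illustration of the query-key alignment idea, the sketch below computes a penalty that shrinks as the centroid of the queries approaches the centroid of the keys. It is only a minimal sketch under stated assumptions: the squared L2 distance between centroids is an assumed form, and the names `centroid` and `qk_alignment_loss` are hypothetical. The exact loss used in the paper may differ.

```python
def centroid(vectors):
    """Return the element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / n for d in range(dim)]


def qk_alignment_loss(queries, keys):
    """Squared L2 distance between the query and key centroids.

    Assumption: this distance form is chosen for illustration only;
    it is not confirmed to be the paper's exact formulation.
    """
    q_bar = centroid(queries)
    k_bar = centroid(keys)
    return sum((q - k) ** 2 for q, k in zip(q_bar, k_bar))


# Toy example: queries and keys sit in separate regions of a 2D space.
queries = [[1.0, 0.0], [3.0, 2.0]]
keys = [[-1.0, 0.0], [-3.0, -2.0]]
loss = qk_alignment_loss(queries, keys)
print(loss)  # → 20.0
```

In a training setup, a term of this kind would presumably be added to the pose regression objective with a weighting factor, so that minimizing it pulls the query and key regions together.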

Paper Structure

This paper contains 18 sections, 6 equations, 9 figures, 8 tables.

Figures (9)

  • Figure 1: The figure shows query-key spaces, self-attention maps, and attended keys from the orientation transformer encoder of the baseline and ours, respectively. (a) In the case of the baseline, queries and keys are mapped into separate regions, while only a few keys are blended into the query region. Consequently, the self-attention map collapses and the whole image features are represented by a few meaningless keys, indicating a waste of the transformer encoder's learning capacity. (b) In contrast, our solution makes queries and keys interact with each other, activating self-attention. This allows the model to obtain crucial global relations within image features, capturing salient global features.
  • Figure 2: (a) shows the purity levels in query regions with the baseline on 7Scenes, referring to Eq. \ref{eq:purity}. Note that the purity is 1.0 when the query region is composed only of queries, and slightly lower than 1.0 when a small subset of keys resides in the query region. According to (a), statistical evidence supports that the blending of a few keys into the query region occurs prevalently across the entire dataset, in both the position and orientation transformer encoders. (b) illustrates the increasing tendency of the distance between the query region and the key region in the encoder. The two regions move away from each other even at the beginning of training. Here, the distance between the query region and the key region is averaged across layers and heads.
  • Figure 3: The figure shows L2 distances between the top-left token and other tokens based on the fixed 2D sinusoidal positional encoding and the learnable positional embedding, respectively. Here, the learnable positional embedding is the result of training with the baseline. The fixed positional encoding preserves the order of the input sequence, whereas with the learnable positional embedding, tokens that do not share the same height or width as the top-left token are all treated as farther away, in an essentially random manner.
  • Figure 4: Fig. \ref{fig:main} illustrates the training pipeline with our solutions. We apply additional objectives $\mathcal{L}_{\text{QKA}_t}$ and $\mathcal{L}_{\text{QKA}_r}$ to the model to activate the self-attention modules. Specifically, queries $Q$ and keys $K$ interact with each other by forcing the centroid of query region $\bar{\mathbf{q}}$ and the centroid of key region $\bar{\mathbf{k}}$ to become closer. Here, we encode all input queries and keys with fixed 2D sinusoidal positional encoding to ensure active interaction between $Q$ and $K$ with reliable positional clues.
  • Figure 5: The figure shows the attention entropy of the baseline and ours for each encoder layer. It demonstrates that our solutions significantly improve the utilization of the encoder's learning capacity.
  • ...and 4 more figures
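
The attention entropy mentioned for Figure 5 can be illustrated with a small sketch. It assumes the common definition, namely the Shannon entropy of each softmax attention row averaged over queries; the paper's exact measurement protocol is not given here, and the helper names `softmax` and `attention_entropy` are hypothetical. A collapsed map, in which every query attends to the same key, yields an entropy near zero, while a uniform map over N keys yields log N.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_entropy(attn_rows):
    """Mean Shannon entropy (in nats) over the rows of an attention map.

    Each row is one query's probability distribution over the keys.
    """
    entropies = []
    for row in attn_rows:
        entropies.append(-sum(p * math.log(p) for p in row if p > 0.0))
    return sum(entropies) / len(entropies)


# Toy maps with 3 queries and 4 keys (illustrative values only).
uniform = [softmax([0.0, 0.0, 0.0, 0.0]) for _ in range(3)]
collapsed = [softmax([20.0, 0.0, 0.0, 0.0]) for _ in range(3)]

print(attention_entropy(uniform))    # equals log(4), the maximum for 4 keys
print(attention_entropy(collapsed))  # close to zero: all queries pick one key
```

Under this definition, higher entropy in an encoder layer indicates that queries spread their attention over more keys, which is the behavior the caption associates with better use of the encoder's capacity.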