Hyperbolic Attention Networks

Caglar Gulcehre, Misha Denil, Mateusz Malinowski, Ali Razavi, Razvan Pascanu, Karl Moritz Hermann, Peter Battaglia, Victor Bapst, David Raposo, Adam Santoro, Nando de Freitas

TL;DR

The paper introduces hyperbolic attention networks that map neural activations into hyperbolic space and reframe attention as hyperbolic matching and hyperbolic aggregation. By leveraging the hyperboloid and Klein models, the approach enables hyperbolic versions of attention-based Relational Networks and Transformers. Empirical results across scale-free graphs, relational reasoning benchmarks, and neural machine translation show improved generalization, particularly in low-capacity models, suggesting hyperbolic geometry as a principled inductive bias for hierarchical and power-law structured data. The work highlights the feasibility and benefits of operating directly in hyperbolic space to enhance relational reasoning and compactness of representations.

Abstract

We introduce hyperbolic attention networks to endow neural networks with enough capacity to match the complexity of data with hierarchical and power-law structure. A few recent approaches have successfully demonstrated the benefits of imposing hyperbolic geometry on the parameters of shallow networks. We extend this line of work by imposing hyperbolic geometry on the activations of neural networks. This allows us to exploit hyperbolic geometry to reason about embeddings produced by deep networks. We achieve this by re-expressing the ubiquitous mechanism of soft attention in terms of operations defined for hyperboloid and Klein models. Our method shows improvements in terms of generalization on neural machine translation, learning on graphs and visual question answering tasks while keeping the neural representations compact.
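
The matching and aggregation steps mentioned above can be illustrated with a minimal sketch. It assumes that attention weights come from a softmax over negative hyperboloid distances (scaled by a hypothetical `beta` and shifted by a hypothetical `c`) and that values are aggregated with the Einstein midpoint in the Klein model. All function names are illustrative and not taken from the authors' code, and the time-like coordinate is stored last by convention here.

```python
import math

def to_hyperboloid(radius, direction):
    # Assumed pseudo-polar mapping: (r, d) -> (sinh(r) * d, cosh(r)),
    # with `direction` a unit vector and the time-like coordinate last.
    return [math.sinh(radius) * d for d in direction] + [math.cosh(radius)]

def minkowski_dot(x, y):
    # Minkowski inner product: space-like terms minus the time-like term.
    return sum(a * b for a, b in zip(x[:-1], y[:-1])) - x[-1] * y[-1]

def hyperboloid_distance(x, y):
    # d(x, y) = arccosh(-<x, y>_M); clamp to 1 to absorb rounding error.
    return math.acosh(max(1.0, -minkowski_dot(x, y)))

def to_klein(x):
    # Project a hyperboloid point into the Klein ball.
    return [a / x[-1] for a in x[:-1]]

def lorentz_factor(k):
    # gamma(k) = 1 / sqrt(1 - ||k||^2) for a point k in the Klein ball.
    return 1.0 / math.sqrt(1.0 - sum(a * a for a in k))

def hyperbolic_attention(query, keys, values, beta=1.0, c=0.0):
    # Matching: scores are negative scaled distances; softmax normalises them.
    # (With a softmax the shift c cancels; it matters for other normalisers.)
    scores = [-beta * hyperboloid_distance(query, k) - c for k in keys]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    weights = [e / sum(exps) for e in exps]

    # Aggregation: Einstein midpoint of the values in the Klein model.
    klein_values = [to_klein(v) for v in values]
    gammas = [lorentz_factor(k) for k in klein_values]
    denom = sum(w * g for w, g in zip(weights, gammas))
    dim = len(klein_values[0])
    midpoint = [
        sum(w * g * k[i] for w, g, k in zip(weights, gammas, klein_values)) / denom
        for i in range(dim)
    ]
    return weights, midpoint

query = to_hyperboloid(0.5, [1.0, 0.0])
keys = [to_hyperboloid(0.5, [1.0, 0.0]), to_hyperboloid(2.0, [0.0, 1.0])]
weights, midpoint = hyperbolic_attention(query, keys, keys)
```

In this toy call the first key coincides with the query, so it receives the larger weight, and the aggregated point stays inside the unit ball because the Einstein midpoint is a convex combination of Klein points.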

Paper Structure

This paper contains 22 sections, 11 equations, 9 figures, 1 table.

Figures (9)

  • Figure 1: An intuitive depiction of how images might be embedded in 2D. The location of the embeddings reflects the similarity between each image and that of a pug. Since the number of instances within a given semantic distance from the central object grows exponentially, the Euclidean space is not able to compactly represent such structure (left). In hyperbolic space (right) the volume grows exponentially, allowing for sufficient room to embed the images. For visualization, we have shrunk the images in this Euclidean diagram, a trick also used by Escher.
  • Figure 2: The computational graph for the self-attention mechanism of the hyperbolic Transformer. Blocks show the individual operations, and arrows represent the interactions between them.
  • Figure 3: Left: Performance of the Recursive Transformer models on the Shortest Path Length Prediction task on graphs of various sizes. The black dashed line indicates chance performance. Center: Results on Link Prediction Tasks. Right: Histogram of the radii for a model trained on a graph with 100 and 400 nodes.
  • Figure 4: Left: Comparison of our low-capacity models on the Sort-of-CLEVR dataset. "EA" refers to the model that uses hyperbolic attention weights with Euclidean aggregation. Right: Performance of the Relation Network extended with an attention mechanism in either Euclidean or hyperbolic space on the CLEVR dataset.
  • Figure 5: Relationships between different representations of points used in the paper. Left: The relationship between pseudo-polar coordinates in $\mathbb{R}^n$ and the hyperboloid in $\mathbb{R}^{n+1}$. Right: Projections relating the hyperboloid, Klein and Poincaré models of hyperbolic space.
  • ...and 4 more figures
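
The relationships named in the Figure 5 caption can be sketched with the standard projections between the three models. This is a hedged illustration using textbook formulas, not the authors' code: the function names are hypothetical, the time-like coordinate is assumed to be stored last, and the pseudo-polar mapping (r, d) -> (sinh(r) d, cosh(r)) is an assumption about the parameterisation.

```python
import math

def pseudo_polar_to_hyperboloid(radius, direction):
    # Assumed mapping from a radius and a unit direction in R^n
    # to a point on the hyperboloid in R^(n+1), time-like coordinate last.
    return [math.sinh(radius) * d for d in direction] + [math.cosh(radius)]

def hyperboloid_to_klein(x):
    # Divide the space-like coordinates by the time-like coordinate.
    return [a / x[-1] for a in x[:-1]]

def hyperboloid_to_poincare(x):
    # Divide the space-like coordinates by (1 + time-like coordinate).
    return [a / (1.0 + x[-1]) for a in x[:-1]]

def klein_to_poincare(k):
    # p = k / (1 + sqrt(1 - ||k||^2))
    scale = 1.0 + math.sqrt(1.0 - sum(a * a for a in k))
    return [a / scale for a in k]

def poincare_to_klein(p):
    # k = 2p / (1 + ||p||^2)
    scale = 1.0 + sum(a * a for a in p)
    return [2.0 * a / scale for a in p]

def norm(v):
    return math.sqrt(sum(a * a for a in v))

r = 1.2
x = pseudo_polar_to_hyperboloid(r, [0.6, 0.8])
klein = hyperboloid_to_klein(x)
poincare = hyperboloid_to_poincare(x)
via_klein = klein_to_poincare(klein)
```

Under these assumptions a point at radius r lands at Euclidean norm tanh(r) in the Klein ball and tanh(r/2) in the Poincaré ball, and going to the Poincaré ball directly or via the Klein ball gives the same point.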