Language Models as Hierarchy Encoders

Yuan He; Zhangdie Yuan; Jiaoyan Chen; Ian Horrocks

TL;DR

This paper proposes a novel approach to re-train transformer encoder-based LMs as Hierarchy Transformer encoders (HiTs) by harnessing the expansive nature of hyperbolic space, and demonstrates the effectiveness and transferability of the re-trained hierarchy encoders.

Abstract

Interpreting hierarchical structures latent in language is a key limitation of current language models (LMs). While previous research has implicitly leveraged these hierarchies to enhance LMs, approaches for their explicit encoding are yet to be explored. To address this, we introduce a novel approach to re-train transformer encoder-based LMs as Hierarchy Transformer encoders (HiTs), harnessing the expansive nature of hyperbolic space. Our method situates the output embedding space of pre-trained LMs within a Poincaré ball with a curvature that adapts to the embedding dimension, followed by training on hyperbolic clustering and centripetal losses. These losses are designed to effectively cluster related entities (input as texts) and organise them hierarchically. We evaluate HiTs against pre-trained LMs, standard fine-tuned LMs, and several hyperbolic embedding baselines, focusing on their capabilities in simulating transitive inference, predicting subsumptions, and transferring knowledge across hierarchies. The results demonstrate that HiTs consistently outperform all baselines in these tasks, underscoring the effectiveness and transferability of our re-trained hierarchy encoders.
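
The abstract names the ingredients (a Poincaré ball whose curvature adapts to the embedding dimension, plus hyperbolic clustering and centripetal losses) but does not spell out the formulas. The following is a minimal sketch, not the authors' implementation. It assumes the ball has radius $\sqrt{d}$ (curvature $-1/d$) and that both losses take a triplet-style hinge form; the function names, the margins, and the hinge form itself are illustrative assumptions.

```python
import math

def curvature_for_dim(d):
    # Assumption: a Poincaré ball of radius sqrt(d) has curvature -1/d; we keep c = 1/d > 0.
    return 1.0 / d

def _sq_norm(x):
    return sum(a * a for a in x)

def poincare_dist(u, v, c):
    # Standard geodesic distance in a Poincaré ball of curvature -c.
    diff = _sq_norm([a - b for a, b in zip(u, v)])
    denom = (1.0 - c * _sq_norm(u)) * (1.0 - c * _sq_norm(v))
    return math.acosh(1.0 + 2.0 * c * diff / denom) / math.sqrt(c)

def hyperbolic_norm(u, c):
    # Hyperbolic distance from the origin to u.
    return 2.0 * math.atanh(math.sqrt(c * _sq_norm(u))) / math.sqrt(c)

def clustering_loss(anchor, positive, negative, c, margin=1.0):
    # Hypothetical hinge form: a related entity should be closer than an unrelated one.
    gap = poincare_dist(anchor, positive, c) - poincare_dist(anchor, negative, c)
    return max(gap + margin, 0.0)

def centripetal_loss(child, parent, c, margin=0.1):
    # Hypothetical hinge form: a parent should sit closer to the origin than its child.
    return max(hyperbolic_norm(parent, c) - hyperbolic_norm(child, c) + margin, 0.0)

# Toy 4-dimensional points (made up for illustration, not taken from the paper).
c = curvature_for_dim(4)
parent = [0.1, 0.0, 0.0, 0.0]
child = [0.9, 0.1, 0.0, 0.0]
unrelated = [-0.9, 0.0, 0.3, 0.0]
loss_cluster = clustering_loss(child, parent, unrelated, c, margin=0.5)
loss_centri = centripetal_loss(child, parent, c)
```

With these toy points both losses are zero, because the parent is nearer the origin than the child and the unrelated point is much farther from the child than the parent is. Swapping `child` and `parent` in `centripetal_loss` gives a positive penalty.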

Paper Structure (19 sections, 6 equations, 3 figures, 8 tables)

Introduction
Preliminaries
Language Models
Hyperbolic Geometry
Hierarchy
Hierarchy Transformer Encoder
Evaluation
Task Definition
Dataset Construction
Baselines
Results
Analysis of HiT Embeddings
Related Work
Conclusion
Limitations and Future Work
...and 4 more sections

Figures (3)

Figure 1: Illustration of how hierarchies are explicitly encoded in HiTs. The square ($d$-dimensional hyper-cube) refers to the output embedding space of transformer encoder-based LMs whose final activation function is typically $\tanh$, and the circumscribed circle ($d$-dimensional hyper-sphere) refers to the Poincaré ball of radius $\sqrt{d}$. The distance and norm metrics involved in our hyperbolic losses are defined w.r.t. this manifold.
Figure 2: Illustration of the impact of $\mathcal{L}_{\text{HiT}}$ during training. In Euclidean space, it seems contradictory that both "phone" and "computer" are pulled towards "e-device" but are also pushed away from each other. However, in principle this is not a problem in hyperbolic space, where distances increase exponentially relative to Euclidean distances as one moves from the origin to the boundary of the manifold.
Figure 3: Distribution of WordNet entity embeddings generated by HiT w.r.t. their hyperbolic norms.
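
The geometry described in the captions of Figures 1 and 2 can be checked numerically. The sketch below is illustrative only. It assumes $\tanh$-bounded outputs and a Poincaré ball of radius $\sqrt{d}$ as stated in Figure 1; the input values and the helper name `hyperbolic_norm` are made up for the example.

```python
import math

def hyperbolic_norm(x, d):
    # Hyperbolic distance from the origin in a Poincaré ball of radius sqrt(d).
    r = math.sqrt(sum(a * a for a in x)) / math.sqrt(d)
    return 2.0 * math.sqrt(d) * math.atanh(r)

d = 8

# tanh keeps every coordinate inside (-1, 1), so the embedding lies in the
# hyper-cube, whose circumscribed sphere has radius sqrt(d) (Figure 1).
raw = [3.0, -2.5, 4.0, 1.5, -3.5, 2.0, -4.5, 5.0]  # hypothetical pre-activation values
emb = [math.tanh(a) for a in raw]
inside_ball = math.sqrt(sum(a * a for a in emb)) < math.sqrt(d)

# Moving along the diagonal towards the boundary: equal-looking Euclidean steps
# cost more and more hyperbolic distance (the effect Figure 2 relies on).
fractions = (0.5, 0.9, 0.99, 0.999)
norms = [hyperbolic_norm([f] * d, d) for f in fractions]
for f, n in zip(fractions, norms):
    print(f"Euclidean radius = {f} * sqrt(d), hyperbolic norm = {n:.3f}")
```

The printed norms grow without bound as the Euclidean radius approaches $\sqrt{d}$. This is why sibling entities can both stay close to a parent near the origin while remaining far from each other near the boundary.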
