EIT: Enhanced Interactive Transformer

Tong Zheng, Bei Li, Huiwen Bao, Tong Xiao, Jingbo Zhu

TL;DR

The paper addresses the underexplored consensus aspect of multi-head self-attention by introducing Enhanced Multi-Head Attention (EMHA) within an Enhanced Interactive Transformer (EIT). It combines a many-to-many mapping (M2M) to expand attention capacity with two hierarchical interactions—inner-subspace (ISI) and cross-subspace (CSI)—to promote head consensus without sacrificing information. Empirical results across machine translation, grammar error correction, abstractive summarization, and language modeling show consistent improvements over standard Transformers, with a computationally efficient variant (E-Eit) offering favorable latency-accuracy tradeoffs. The findings demonstrate that balancing complementarity and consensus in multi-view learning for transformer heads yields more robust representations and facilitates practical deployment through easier pruning and modest resource overhead.

Abstract

Two principles, the complementary principle and the consensus principle, are widely acknowledged in the multi-view learning literature. However, the current design of multi-head self-attention, an instance of multi-view learning, prioritizes complementarity while ignoring consensus. To address this problem, we propose an enhanced multi-head self-attention (EMHA). First, to satisfy the complementary principle, EMHA removes the one-to-one mapping constraint among queries and keys in multiple subspaces and allows each query to attend to multiple keys. On top of that, we develop a method to fully encourage consensus among heads by introducing two interaction models, namely inner-subspace interaction and cross-subspace interaction. Extensive experiments on a wide range of language tasks (e.g., machine translation, abstractive summarization, grammar correction, and language modeling) show its superiority, with a very modest increase in model size. Our code will be available at: https://github.com/zhengkid/EIT-Enhanced-Interactive-Transformer.
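
As a rough illustration of the mapping change described in the abstract, the sketch below contrasts the one-to-one pairing of standard multi-head attention (query head i with key head i, giving M attention maps) with one possible reading of the many-to-many mapping, in which every query head is paired with every key head (giving M x M maps). This is a toy sketch under stated assumptions, not the authors' implementation: the helper names (`attention_map`, `one_to_one_maps`, `many_to_many_maps`, `random_heads`), the random inputs, and the sizes are all hypothetical, and the inner-subspace and cross-subspace interactions are not modeled.

```python
import math
import random


def matmul_t(a, b):
    """Return a @ b^T for matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row_a, row_b)) for row_b in b] for row_a in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_map(q, k):
    """Scaled dot-product attention weights for one (query head, key head) pair."""
    scale = math.sqrt(len(q[0]))
    return [softmax([s / scale for s in row]) for row in matmul_t(q, k)]


def one_to_one_maps(queries, keys):
    # Standard multi-head attention: query head i is paired only with key head i.
    return [attention_map(q, k) for q, k in zip(queries, keys)]


def many_to_many_maps(queries, keys):
    # Assumed many-to-many reading: every query head is paired with every key head.
    return [attention_map(q, k) for q in queries for k in keys]


def random_heads(num_heads, seq_len, head_dim, rng):
    """Hypothetical stand-in for projected per-head queries or keys."""
    return [
        [[rng.gauss(0.0, 1.0) for _ in range(head_dim)] for _ in range(seq_len)]
        for _ in range(num_heads)
    ]


rng = random.Random(0)
M, T, D = 4, 3, 8  # heads, sequence length, per-head dimension (illustrative values)
queries = random_heads(M, T, D, rng)
keys = random_heads(M, T, D, rng)

standard = one_to_one_maps(queries, keys)
m2m = many_to_many_maps(queries, keys)
print(len(standard), len(m2m))  # 4 16
```

In this sketch the standard maps are a subset of the many-to-many maps: the pair (i, i) sits at index `i * M + i` of `m2m`. How the larger set of maps is then combined is exactly what the two interaction models in the abstract address, and that part is left out here.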

Paper Structure (84 sections, 7 equations, 10 figures, 14 tables)

Figures (10)

  • Figure 1: The illustration of many-to-many mapping scheme ($M=4$).
  • Figure 2: Illustration of dual enhanced interaction in Eit ($M=4$). We omit the ReLU for simplicity.
  • Figure 3: Illustration of dual enhanced interaction in efficient Eit ($M=4$). We omit the ReLU for simplicity.
  • Figure 4: Cosine similarity among attention maps of different models on En-De task.
  • Figure 5: Analysis of token correlation on En-De task.
  • ...and 5 more figures