Addressing Information Loss and Interaction Collapse: A Dual Enhanced Attention Framework for Feature Interaction

Yi Xu, Zhiyuan Lu, Xiaochen Li, Jinxin Hu, Hong Wen, Zulong Chen, Yu Zhang, Jing Zhang

TL;DR

The paper tackles two core limitations of Transformer-based CTR models: information loss in feature interactions from inner-product attention and interaction collapse due to long-tailed feature distributions. It proposes a Dual Enhanced Attention framework comprising Combo-ID Attention, which memorizes feature interactions via an independent memory codebook (with gated Siamese codebooks to mitigate collisions), and Collapse-avoiding Attention, which uses dynamic thresholding to filter low-information interactions; these are fused through multiple schemes to yield robust attention scores for prediction. Key contributions include the memory-based Combo-ID mechanism, the dynamic thresholding strategy for long-tail features, and versatile fusion methods, all validated on a large industrial dataset where the method outperforms strong baselines in AUC and GAUC. The work demonstrates practical impact for production CTR systems by preserving rich interaction signals while avoiding degradation from data sparsity, suggesting a viable path for scalable, reliable recommendation models.

Abstract

The Transformer has proven to be an effective approach to feature interaction for CTR prediction, achieving considerable success in previous work. However, it also faces challenges in handling feature interactions. First, Transformers may lose information when capturing feature interactions: by relying on inner products to represent pairwise relationships, they compress the raw interaction information, which can degrade fidelity. Second, because feature distributions are long-tailed, feature fields with low information-abundance embeddings constrain the information abundance of other fields, leading to collapsed embedding matrices. To tackle these issues, we propose a Dual Attention Framework for Enhanced Feature Interaction, known as Dual Enhanced Attention. This framework integrates two attention mechanisms: the Combo-ID attention mechanism and the collapse-avoiding attention mechanism. The Combo-ID attention mechanism directly retains feature interaction pairs to mitigate information loss, while the collapse-avoiding attention mechanism adaptively filters out low information-abundance interaction pairs to prevent interaction collapse. Extensive experiments conducted on industrial datasets demonstrate the effectiveness of Dual Enhanced Attention.
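To make the idea of retaining interaction pairs concrete, the following is a minimal sketch of a Combo-ID style score, not the authors' implementation. It assumes a single hashed codebook and a linear projection to a scalar; the `|` separator, the MD5-based bucket function, the codebook size, and all names (`combo_id`, `bucket`, `combo_id_scores`) are hypothetical choices made for illustration.

```python
import hashlib
import random


def combo_id(id_a: str, id_b: str) -> str:
    """Concatenate two feature IDs into one Combo-ID (the separator is an assumption)."""
    return f"{id_a}|{id_b}"


def bucket(combo: str, num_buckets: int) -> int:
    """Deterministically hash a Combo-ID to a codebook row (hash choice is an assumption)."""
    digest = hashlib.md5(combo.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets


def combo_id_scores(feature_ids, codebook, proj):
    """Return an n x n score matrix: one looked-up, projected embedding per feature pair."""
    n = len(feature_ids)
    scores = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            row = codebook[bucket(combo_id(feature_ids[i], feature_ids[j]), len(codebook))]
            # Project the pair embedding to a scalar attention score.
            scores[i][j] = sum(w * e for w, e in zip(proj, row))
    return scores


# Toy setup: randomly initialised codebook and projection (they would be learned in practice).
rng = random.Random(0)
NUM_BUCKETS, DIM = 64, 4
codebook = [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(NUM_BUCKETS)]
proj = [rng.gauss(0.0, 1.0) for _ in range(DIM)]
features = ["user:42", "item:7", "merchant:3"]
scores = combo_id_scores(features, codebook, proj)
```

In this sketch each pair score comes from its own codebook entry rather than from an inner product of two field embeddings, which is the contrast the abstract draws. Because the Combo-ID here is order-sensitive, the pairs (i, j) and (j, i) may land in different buckets. Two distinct pairs can also collide in the same bucket, which is the problem the gated Siamese codebooks are described as mitigating.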

Paper Structure

This paper contains 11 sections, 12 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Comparison of Inner-product and Combo-ID
  • Figure 2: Illustration of the Combo-ID Attention Mechanism and the Collapse-avoiding Attention Mechanism: (a) A set of input features, e.g., user IDs, item IDs, and merchant attributes. (b) In the Combo-ID attention mechanism, each pair of features is combined by generating a unique Combo-ID through concatenation of their individual feature IDs. (c) The Gated Siamese Codebook method employs $k$ codebooks with distinct hash functions, using Siamese representations to gate and re-weight the main codebook's outputs, reducing misrepresentation of long-tail feature interactions. (d) After re-weighting the main codebook, each embedding is projected into a scalar, eventually forming the attention score matrix of the Combo-ID Attention Mechanism. (e) Traditional self-attention uses the inner product to calculate attention scores. (f) The dynamic-thresholding strategy filters out embeddings with low information abundance by using the average modulus length within a batch as a threshold. (g) The attention score matrix of the Collapse-avoiding Attention Mechanism. (h) The final attention score matrix is obtained by fusing the score matrices of the Combo-ID Attention Mechanism and the Collapse-avoiding Attention Mechanism.
  • Figure 3: The impact of collisions on individual features
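
The dynamic-thresholding step described in Figure 2(f) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the threshold is taken as the mean L2 norm over all field embeddings in a batch, and a pair's inner-product score is zeroed when either embedding falls below it. The zeroing rule and the names `dynamic_threshold` and `collapse_avoiding_scores` are hypothetical.

```python
import math


def l2_norm(vec):
    """Modulus length (L2 norm) of one embedding."""
    return math.sqrt(sum(x * x for x in vec))


def dynamic_threshold(batch):
    """Average modulus length over every field embedding in the batch."""
    norms = [l2_norm(e) for sample in batch for e in sample]
    return sum(norms) / len(norms)


def collapse_avoiding_scores(sample, threshold):
    """Inner-product scores; pairs touching a below-threshold embedding are zeroed.

    Zeroing the filtered pairs is an assumption made for this sketch.
    """
    keep = [l2_norm(e) >= threshold for e in sample]
    n = len(sample)
    scores = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if keep[i] and keep[j]:
                scores[i][j] = sum(a * b for a, b in zip(sample[i], sample[j]))
    return scores


# Toy batch: 2 samples, 3 fields each, 2-dimensional embeddings.
batch = [
    [[3.0, 4.0], [0.3, 0.4], [6.0, 8.0]],
    [[0.0, 1.0], [3.0, 0.0], [0.6, 0.8]],
]
tau = dynamic_threshold(batch)
scores = collapse_avoiding_scores(batch[0], tau)
```

Because the threshold is recomputed from each batch, it moves with the embedding norms instead of being a fixed hyperparameter. In the toy batch, the second field of the first sample has a norm well below the batch average, so every pair involving it is filtered out.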