Modeling Selective Feature Attention for Representation-based Siamese Text Matching

Jianxiang Zang; Hui Liu

TL;DR

The paper studies attention at the level of embedding features in representation-based Siamese text matching. It introduces Feature Attention (FA), which applies a squeeze-and-excitation mechanism to reweight embedding features, and Selective Feature Attention (SFA), which adds a dynamic selection mechanism on top of a stacked BiGRU Inception structure to enable multi-scale semantic extraction. FA preserves the input tensor shape: it squeezes the sequence into a per-feature descriptor $\bm{s}_d=\frac{1}{L}\sum_{l=1}^L\bm{x}_{l,d}$, computes excitation weights $\bm{e}=\sigma(\delta(\bm{s}\bm{W}_{FC1})\bm{W}_{FC2})$, and rescales each feature as $\bm{u}_d=\bm{e}_d\cdot\bm{x}_d$. SFA is a three-phase block (split-and-fusion, squeeze-and-excitation, selection) built on a bottleneck autoencoder and a multi-branch BiGRU, enabling adaptive, gradient-balanced learning across semantic scales and improving training stability. Empirically, FA improves performance across six lightweight baselines and seven benchmarks, while SFA yields larger gains, with ESIM+SFA and DRCN+SFA reaching notable accuracies (e.g., around 80.9–81.9% in reported tests) while adding only modest parameter overhead. The work demonstrates the practical value of embedding-feature attention in NLP and suggests broad applicability to related tasks such as text classification and entity recognition.
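The three FA equations in the TL;DR can be traced step by step in a few lines of code. The following is a minimal sketch, not the authors' implementation: it assumes the usual squeeze-and-excitation convention in which $\sigma$ is the sigmoid and $\delta$ is ReLU (this page does not state it), and the sizes `L`, `D`, the reduction ratio `R`, and the random weights are hypothetical values chosen only for illustration.

```python
import math
import random


def matvec(v, w):
    """Row vector v (length n) times matrix w (n x m) -> list of length m."""
    return [sum(v[i] * w[i][j] for i in range(len(v))) for j in range(len(w[0]))]


def feature_attention(x, w_fc1, w_fc2):
    """Reweight the D embedding features of a length-L sequence x (L x D)."""
    seq_len, dim = len(x), len(x[0])
    # Squeeze: s_d = (1/L) * sum_l x[l][d]
    s = [sum(row[d] for row in x) / seq_len for d in range(dim)]
    # Excitation: e = sigmoid(relu(s W_FC1) W_FC2); ReLU/sigmoid are assumed
    hidden = [max(0.0, h) for h in matvec(s, w_fc1)]
    e = [1.0 / (1.0 + math.exp(-z)) for z in matvec(hidden, w_fc2)]
    # Scale: u[l][d] = e[d] * x[l][d], so the L x D shape is preserved
    u = [[e[d] * row[d] for d in range(dim)] for row in x]
    return u, e


random.seed(0)
L, D, R = 5, 8, 4  # hypothetical sequence length, feature size, reduction ratio
x = [[random.gauss(0, 1) for _ in range(D)] for _ in range(L)]
w_fc1 = [[random.gauss(0, 0.1) for _ in range(D // R)] for _ in range(D)]
w_fc2 = [[random.gauss(0, 0.1) for _ in range(D)] for _ in range(D // R)]
u, e = feature_attention(x, w_fc1, w_fc2)
```

Because `e` holds one weight per feature and is shared across all `L` positions, the block changes how much each embedding dimension contributes without altering the tensor shape, which is what lets it be dropped in after an existing encoder.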

Abstract

Representation-based Siamese networks have risen to popularity in lightweight text matching due to their low deployment and inference costs. While word-level attention mechanisms have been implemented within Siamese networks to improve performance, we propose Feature Attention (FA), a novel downstream block designed to enrich the modeling of dependencies among embedding features. Employing "squeeze-and-excitation" techniques, the FA block dynamically adjusts the emphasis on individual features, enabling the network to concentrate more on features that significantly contribute to the final classification. Building upon FA, we introduce a dynamic "selection" mechanism called Selective Feature Attention (SFA), which leverages a stacked BiGRU Inception structure. The SFA block facilitates multi-scale semantic extraction by traversing different stacked BiGRU layers, encouraging the network to selectively concentrate on semantic information and embedding features across varying levels of abstraction. Both the FA and SFA blocks offer a seamless integration capability with various Siamese networks, showcasing a plug-and-play characteristic. Experimental evaluations conducted across diverse text matching baselines and benchmarks underscore the indispensability of modeling feature attention and the superiority of the "selection" mechanism.
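To make the idea of "selection" across stacked BiGRU layers more concrete, the sketch below fuses the outputs of several branches with per-feature weights that sum to one across branches. This is an illustration under stated assumptions rather than the paper's exact formulation: the softmax-over-branches form is borrowed from selective-kernel-style designs, the branch outputs are plain lists standing in for BiGRU outputs, and `select_across_branches` and `branch_logits` are hypothetical names.

```python
import math


def select_across_branches(branch_outputs, branch_logits):
    """Fuse M branch outputs (each L x D) with per-feature softmax weights.

    branch_logits is M x D: one score per branch and feature, assumed to come
    from an excitation step that is not modeled here.
    """
    num_branches = len(branch_outputs)
    seq_len, dim = len(branch_outputs[0]), len(branch_outputs[0][0])
    weights = []  # weights[d][m]; for each feature d the M weights sum to 1
    for d in range(dim):
        col = [branch_logits[m][d] for m in range(num_branches)]
        peak = max(col)  # subtract the max for numerical stability
        exps = [math.exp(c - peak) for c in col]
        total = sum(exps)
        weights.append([v / total for v in exps])
    # Weighted sum over branches, feature by feature
    return [
        [
            sum(weights[d][m] * branch_outputs[m][l][d] for m in range(num_branches))
            for d in range(dim)
        ]
        for l in range(seq_len)
    ]


# Two stand-in branches (L = 2, D = 2) with equal logits: the result is their mean.
branch_a = [[1.0, 1.0], [1.0, 1.0]]
branch_b = [[3.0, 3.0], [3.0, 3.0]]
fused = select_across_branches([branch_a, branch_b], [[0.0, 0.0], [0.0, 0.0]])
```

Raising one branch's logit for a given feature shifts that feature's fused value toward that branch, which is the sense in which the network can favor a shallower or deeper level of abstraction per feature.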

Paper Structure (14 sections, 13 equations, 5 figures, 2 tables)

Introduction
Feature Attention
Squeeze-and-Excitation Network
FA Block
Selective Feature Attention
Inception Structure
SFA Block
Efficient Gradient Management
Experimental Results & Analysis
Main Results
Ablation Study
Inception Networks
Attention Analysis
Conclusion

Figures (5)

Figure 1: Our more robust downstream attention, composed of (a) Word-level Interaction Attention and (b) Feature Attention.
Figure 2: Selective Feature Attention
Figure 3: Ablation study on the components of SFA block on QQP and SNLI datasets, using RE2 and ESIM as baselines.
Figure 4: The average increase in evaluation accuracy (%) of SFA blocks on QQP and SNLI with different Inception networks (using RE2 and ESIM as baselines), along with the associated parameters and average inference latency growth.
Figure 5: The heatmap of the dot product matrix for sentence pair embeddings, where deeper colors indicate higher levels of activated word-level attention.
