Modeling Selective Feature Attention for Representation-based Siamese Text Matching
Jianxiang Zang, Hui Liu
TL;DR
The paper addresses attention at the embedding-feature level in representation-based Siamese text matching. It introduces FA, which applies a squeeze-and-excitation mechanism to reweight embedding features, and SFA, which adds a dynamic selection mechanism on top of a stacked BiGRU Inception structure to enable multi-scale semantic extraction. FA preserves the input tensor shape: it squeezes the sequence dimension into a per-feature descriptor $\bm{s}_d=\frac{1}{L}\sum_{l=1}^L\bm{x}_{l,d}$, computes excitation weights $\bm{e}=\sigma(\delta(\bm{s}\bm{W}_{FC1})\bm{W}_{FC2})$, and reweights each feature as $\bm{u}_d=\bm{e}_d\cdot\bm{x}_d$. SFA is a three-phase block (split-and-fusion, squeeze-and-excitation, selection) built on a bottleneck autoencoder and a multi-branch BiGRU, enabling adaptive, gradient-balanced learning across semantic scales and improving training stability. Empirically, FA improves performance across six lightweight baselines and seven benchmarks, while SFA yields larger gains, with ESIM+SFA and DRCN+SFA reaching notable accuracies (e.g., around 80.9–81.9% in reported tests) at a modest parameter overhead. The work demonstrates the practical value of embedding-feature attention in NLP and suggests broad applicability to related tasks such as text classification and entity recognition.
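The three FA steps summarized above (squeeze, excitation, reweighting) can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation. It assumes that $\delta$ is ReLU and $\sigma$ is the sigmoid, following the usual squeeze-and-excitation convention, and that the two fully connected layers have no bias. The function name `feature_attention`, the helper names, and the hidden size used in the demo are hypothetical.

```python
import math
import random


def relu(v):
    """Element-wise ReLU (assumed choice for delta)."""
    return [max(0.0, a) for a in v]


def sigmoid(v):
    """Element-wise, numerically stable sigmoid (assumed choice for sigma)."""
    out = []
    for a in v:
        if a >= 0:
            out.append(1.0 / (1.0 + math.exp(-a)))
        else:
            z = math.exp(a)
            out.append(z / (1.0 + z))
    return out


def vecmat(s, w):
    """Row vector s (length D_in) times matrix w (D_in x D_out)."""
    return [sum(s[i] * w[i][j] for i in range(len(s))) for j in range(len(w[0]))]


def feature_attention(x, w_fc1, w_fc2):
    """Reweight embedding features of one sequence.

    x      : L x D list of lists (L tokens, D embedding features)
    w_fc1  : D x H matrix, w_fc2 : H x D matrix (bias-free, an assumption)
    Returns (u, e): u has the same L x D shape as x, e holds D feature weights.
    """
    seq_len, dim = len(x), len(x[0])
    # Squeeze: s_d = (1/L) * sum_l x[l][d]
    s = [sum(x[l][d] for l in range(seq_len)) / seq_len for d in range(dim)]
    # Excitation: e = sigma(delta(s W_FC1) W_FC2)
    e = sigmoid(vecmat(relu(vecmat(s, w_fc1)), w_fc2))
    # Reweight: u_d = e_d * x_d, applied at every position l
    u = [[e[d] * x[l][d] for d in range(dim)] for l in range(seq_len)]
    return u, e


if __name__ == "__main__":
    random.seed(0)
    L, D, H = 5, 8, 2  # H is a hypothetical bottleneck size
    x = [[random.gauss(0, 1) for _ in range(D)] for _ in range(L)]
    w1 = [[random.gauss(0, 0.1) for _ in range(H)] for _ in range(D)]
    w2 = [[random.gauss(0, 0.1) for _ in range(D)] for _ in range(H)]
    u, e = feature_attention(x, w1, w2)
    print(len(u), len(u[0]))  # output shape matches the input shape
```

Because each weight in `e` lies strictly between 0 and 1 and is shared across all positions, the block rescales whole feature columns without changing the tensor shape, which is what lets it be placed after an existing encoder without further changes.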
Abstract
Representation-based Siamese networks have risen to popularity in lightweight text matching due to their low deployment and inference costs. While word-level attention mechanisms have been implemented within Siamese networks to improve performance, we propose Feature Attention (FA), a novel downstream block designed to enrich the modeling of dependencies among embedding features. Employing "squeeze-and-excitation" techniques, the FA block dynamically adjusts the emphasis on individual features, enabling the network to concentrate more on features that significantly contribute to the final classification. Building upon FA, we introduce a dynamic "selection" mechanism called Selective Feature Attention (SFA), which leverages a stacked BiGRU Inception structure. The SFA block facilitates multi-scale semantic extraction by traversing different stacked BiGRU layers, encouraging the network to selectively concentrate on semantic information and embedding features across varying levels of abstraction. Both the FA and SFA blocks offer a seamless integration capability with various Siamese networks, showcasing a plug-and-play characteristic. Experimental evaluations conducted across diverse text matching baselines and benchmarks underscore the indispensability of modeling feature attention and the superiority of the "selection" mechanism.
