Unleashing the Potential of Two-Tower Models: Diffusion-Based Cross-Interaction for Large-Scale Matching

Yihan Wang, Fei Xiong, Zhexin Han, Qi Song, Kaiqiao Zhan, Ben Wang

TL;DR

This work tackles the inherent limitation of late interaction in traditional two-tower recommender systems by introducing T2Diff, a diffusion-based cross-interaction framework for large-scale matching. It uses a diffusion module to reconstruct the next positive user intention and a mixed-attention module to fuse this reconstructed intent with historical and session-based user behavior, enabling rich cross-feature interactions without sacrificing efficiency. Empirical results on two public datasets and an industrial deployment show substantial improvements over state-of-the-art two-tower methods, with strong offline gains in Recall@K and MRR@K and notable online engagement improvements. The approach demonstrates practical impact for scalable, accurate candidate retrieval in real-world systems, balancing high accuracy with low-latency inference.
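For readers unfamiliar with the two offline metrics named above, the following is a minimal sketch of how Recall@K and MRR@K are commonly computed for next-item retrieval. It assumes one ground-truth next item per user, which this summary does not confirm as the paper's exact protocol. The function names (`recall_at_k`, `mrr_at_k`, `evaluate`) are illustrative and not taken from the paper.

```python
from typing import List, Sequence, Tuple


def recall_at_k(ranked: Sequence[int], target: int, k: int) -> float:
    """Return 1.0 if the target item is among the top-k retrieved items, else 0.0."""
    return 1.0 if target in ranked[:k] else 0.0


def mrr_at_k(ranked: Sequence[int], target: int, k: int) -> float:
    """Return the reciprocal rank of the target within the top-k, or 0.0 if absent."""
    for rank, item in enumerate(ranked[:k], start=1):
        if item == target:
            return 1.0 / rank
    return 0.0


def evaluate(rankings: List[Sequence[int]], targets: List[int], k: int) -> Tuple[float, float]:
    """Average both metrics over users (assumption: one target item per user)."""
    n = len(targets)
    recall = sum(recall_at_k(r, t, k) for r, t in zip(rankings, targets)) / n
    mrr = sum(mrr_at_k(r, t, k) for r, t in zip(rankings, targets)) / n
    return recall, mrr


# Toy data: three users, each with a ranked candidate list and one true next item.
rankings = [[3, 1, 2], [5, 4, 6], [7, 8, 9]]
targets = [1, 6, 0]
recall, mrr = evaluate(rankings, targets, k=2)
```

In this toy example only the first user's target falls inside the top 2 (at rank 2), so Recall@2 is 1/3 and MRR@2 is 1/6.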

Abstract

Two-tower models are widely adopted in the industrial-scale matching stage across a broad range of application domains, such as content recommendation, advertising systems, and search engines. These models handle large-scale candidate item screening efficiently by separating user and item representations. However, this decoupled design also neglects potential information interaction between the user and item representations. Current state-of-the-art (SOTA) approaches include adding a shallow fully connected layer (i.e., COLD), which is limited by performance and can only be used in the ranking stage. For performance reasons, another approach attempts to capture historical positive interaction information from the other tower by treating it as input features (i.e., DAT). Later research showed that the gains achieved by this method remain limited because it lacks guidance on the user's next intent. To address these challenges, we propose a "cross-interaction decoupling architecture" within our matching paradigm. This user-tower architecture leverages a diffusion module to reconstruct the next positive intention representation and employs a mixed-attention module to facilitate comprehensive cross-interaction. While generating the next positive intention, we further improve the accuracy of its reconstruction by explicitly extracting the temporal drift within user behavior sequences. Experiments on two real-world datasets and one industrial dataset demonstrate that our method significantly outperforms the SOTA two-tower models, and that our diffusion approach outperforms other generative models in reconstructing item representations.
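The late-interaction limitation described in the abstract can be illustrated with a minimal sketch of generic two-tower retrieval. This is not T2Diff itself: the dot-product scoring, the toy embeddings, and the names `dot` and `retrieve_top_k` are assumptions made for illustration. The point is that the user and item vectors are produced independently and meet only in the final similarity score.

```python
import heapq
from typing import Dict, List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product: the only place where user and item information interact."""
    return sum(a * b for a, b in zip(u, v))


def retrieve_top_k(user_vec: Sequence[float],
                   item_vecs: Dict[int, Sequence[float]],
                   k: int) -> List[int]:
    """Return the ids of the k items whose embeddings score highest against the user."""
    return heapq.nlargest(k, item_vecs, key=lambda item_id: dot(user_vec, item_vecs[item_id]))


# Hypothetical precomputed item-tower outputs; in practice these are indexed offline.
item_vecs = {
    10: [1.0, 0.0],
    11: [0.0, 1.0],
    12: [0.7, 0.7],
}
# Hypothetical user-tower output, computed at request time without seeing any item.
user_vec = [0.9, 0.1]

top_items = retrieve_top_k(user_vec, item_vecs, k=2)
```

Because the item vectors do not depend on the user, they can be computed and indexed ahead of time, which is what keeps matching cheap. The cost is that no item-side signal reaches the user tower, which is the gap the abstract describes.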

Paper Structure

This paper contains 20 sections, 24 equations, 5 figures, 5 tables, 2 algorithms.

Figures (5)

  • Figure 1: Real-world two-stage recommender system. (a) The two-stage architecture involves matching, which scores a large number of items, and ranking, which further refines the scoring for a smaller subset. (b) An intuitive view of the accuracy and efficiency of matching and ranking methods, where the proposed matching method is derived from ranking and optimized into a cross-interaction architecture.
  • Figure 2: (a) The main architecture of T2Diff, which includes a mixed-attention module and a diffusion module. The forward slash on the connection between the diffusion module and the embedding layer indicates stop-gradient. (b) The details of the diffusion module, which uses different processes for training and inference. Each reverse step in the diffusion module uses a U-Net approximator.
  • Figure 3: The detailed implementation of activation units used for target-attention.
  • Figure 4: This figure delineates the distribution of similarities between $z_0$ and $\hat{z}_0$ across the diffusion process. The horizontal axis represents cosine similarity, while the vertical axis corresponds to the number of iterations.
  • Figure 5: Live experiment results. On the x-axis is the date; on the y-axis is the relative difference in percentage between the experiment and control.