Beyond Interleaving: Causal Attention Reformulations for Generative Recommender Systems

Hailing Cheng

TL;DR

This work proposes a principled reformulation of generative recommendation that aligns sequence modeling with underlying causal structures and attention theory, and introduces two novel architectures that eliminate interleaved dependencies to reduce sequence complexity by 50%.

Abstract

Generative Recommender Systems (GR) increasingly model user behavior as a sequence generation task by interleaving item and action tokens. While effective, this formulation introduces significant structural and computational inefficiencies: it doubles sequence length, incurs quadratic overhead, and relies on implicit attention to recover the causal relationship between an item and its associated action. Furthermore, interleaving heterogeneous tokens forces the Transformer to disentangle semantically incompatible signals, leading to increased attention noise and reduced representation efficiency.

In this work, we propose a principled reformulation of generative recommendation that aligns sequence modeling with underlying causal structures and attention theory. We demonstrate that current interleaving mechanisms act as inefficient proxies for similarity-weighted action pooling. To address this, we introduce two novel architectures that eliminate interleaved dependencies to reduce sequence complexity by 50%: Attention-based Late Fusion for Actions (AttnLFA) and Attention-based Mixed Value Pooling (AttnMVP). These models explicitly encode the $i_n \rightarrow a_n$ causal dependency while preserving the expressive power of Transformer-based sequence modeling.

We evaluate our framework on large-scale product recommendation data from a major social network. Experimental results show that AttnLFA and AttnMVP consistently outperform interleaved baselines, achieving evaluation loss improvements of 0.29% and 0.80%, and significant gains in Normalized Entropy (NE). Crucially, these performance gains are accompanied by training time reductions of 23% and 12%, respectively. Our findings suggest that explicitly modeling item-action causality provides a superior design paradigm for scalable and efficient generative ranking.
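
As a rough illustration of the sequence-length claim (a back-of-the-envelope count, not a derivation from the paper), assume a user history of $N$ item-action pairs and count only the pairwise attention scores computed per head and per layer. The symbols $L_{\text{interleaved}}$ and $L_{\text{items}}$ are introduced here purely for illustration:

$$
L_{\text{interleaved}} = 2N, \qquad L_{\text{items}} = N, \qquad \frac{L_{\text{interleaved}}^{2}}{L_{\text{items}}^{2}} = \frac{(2N)^{2}}{N^{2}} = 4 .
$$

For example, with $N = 512$ the interleaved sequence yields $1024^{2} = 1{,}048{,}576$ scores against $512^{2} = 262{,}144$ for an item-only sequence. Components whose cost is linear in sequence length are only halved, and other training costs do not depend on sequence length at all, so end-to-end savings would be expected to fall well short of this factor of four.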

Paper Structure (6 sections, 2 equations, 8 figures, 2 tables)


Figures (8)

  • Figure 1: Interleaved generative recommenders treat items and actions as a single token stream. Action $a_2$ attends to all prior tokens, obscuring the direct causal dependency $i_2 \rightarrow a_2$ and introducing attention noise.
  • Figure 2: True causal structure of user interactions. Each action $a_n$ is a response to the corresponding item $i_n$, conditioned on prior history. This structure is not explicitly represented by interleaved self-attention.
  • Figure 3: Traditional Generative Recommender architecture (interleaving item and action tokens). The item and action tokens are interleaved before the Transformer layers.
  • Figure 4: Illustrative toy sequences for Users A and B. The sequences demonstrate contrasting behavioral patterns: User A consistently exhibits positive interactions (e.g., "Like") with dog-related items and negative interactions with cat-related items, while User B exhibits the inverse preference profile. This highlights the model's task of capturing item-action dependencies for future state prediction.
  • Figure 5: Attention-based Late Fusion for Actions (AttnLFA). Item embeddings are transformed through a series of Transformer blocks (labeled as "Transformers" for clarity) to generate latent sequence representations. These representations serve as both Queries and Keys for the subsequent attention mechanism. In the final stage, action embeddings are integrated as Values via a causally-constrained attention pooling operation, conditioned on the sequence context. The resulting aggregated action representation is then passed to the prediction head for the final output.
  • ...and 3 more figures
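
The pooling step described in the Figure 5 caption can be sketched on the toy users from Figure 4. This is a minimal illustration under stated assumptions, not the authors' implementation: the Transformer blocks are replaced by fixed hand-written item vectors, the function name `causal_action_pooling` is hypothetical, the one-dimensional action embeddings (+1 for "Like", -1 for a negative interaction) are invented for the example, and the mask is assumed to be strictly causal, so position n pools only actions from positions before n.

```python
import math


def causal_action_pooling(item_reps, action_embs):
    """Pool past action embeddings, weighted by item-item similarity.

    item_reps[n]   : latent representation of item n (query and key)
    action_embs[n] : embedding of the action taken on item n (value)
    Assumption: position n only sees actions at positions j < n, so the
    action being predicted is never visible to its own query.
    """
    key_dim = len(item_reps[0])
    value_dim = len(action_embs[0])
    scale = 1.0 / math.sqrt(key_dim)
    pooled = []
    for n, query in enumerate(item_reps):
        if n == 0:
            # No history yet: fall back to a zero vector (an assumption).
            pooled.append([0.0] * value_dim)
            continue
        # Scaled dot-product scores against strictly earlier items.
        scores = [
            scale * sum(q * k for q, k in zip(query, item_reps[j]))
            for j in range(n)
        ]
        # Numerically stable softmax over the visible history.
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Similarity-weighted average of past action embeddings.
        pooled.append([
            sum(w * action_embs[j][d] for j, w in enumerate(weights))
            for d in range(value_dim)
        ])
    return pooled


# Hypothetical toy data: two item types and two action types.
DOG, CAT = [4.0, 0.0], [0.0, 4.0]
LIKE, DISLIKE, UNKNOWN = [1.0], [-1.0], [0.0]

items = [DOG, CAT, DOG, CAT, DOG]  # the last item is the one to score
actions_a = [LIKE, DISLIKE, LIKE, DISLIKE, UNKNOWN]  # User A
actions_b = [DISLIKE, LIKE, DISLIKE, LIKE, UNKNOWN]  # User B

pooled_a = causal_action_pooling(items, actions_a)
pooled_b = causal_action_pooling(items, actions_b)
print(pooled_a[-1], pooled_b[-1])
```

For the final dog item, the pooled value is strongly positive for User A and strongly negative for User B, because the query attends almost entirely to earlier dog items and averages the actions taken on them. The placeholder action at the last position is never read, which is what the strict mask is meant to guarantee.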