MixFormer: Co-Scaling Up Dense and Sequence in Industrial Recommenders

Xu Huang, Hao Zhang, Zhifang Fan, Yunwen Huang, Zhuoxing Wei, Zheng Chai, Jinan Ni, Yuchao Zheng, Qiwei Chen

TL;DR

This work proposes MixFormer, a unified Transformer-style architecture for recommender systems that jointly models sequential behaviors and feature interactions within a single backbone. It also introduces a user-item decoupling strategy whose efficiency optimizations significantly reduce redundant computation and inference latency.

Abstract

As industrial recommender systems enter a scaling-driven regime, Transformer architectures have become increasingly attractive for scaling models toward larger capacity and longer sequences. However, existing Transformer-based recommendation models remain structurally fragmented: sequence modeling and feature interaction are implemented as separate modules with independent parameterization. Such designs introduce a fundamental co-scaling challenge, because a limited computational budget must be split between dense feature interaction and sequence modeling, and the resulting allocation is suboptimal. In this work, we propose MixFormer, a unified Transformer-style architecture tailored for recommender systems, which jointly models sequential behaviors and feature interactions within a single backbone. Through a unified parameterization, MixFormer enables effective co-scaling across both dense capacity and sequence length, mitigating the trade-off observed in decoupled designs. Moreover, the integrated architecture facilitates deep interaction between sequential and non-sequential representations, allowing high-order feature semantics to directly inform sequence aggregation and enhancing overall expressiveness. To ensure industrial practicality, we further introduce a user-item decoupling strategy whose efficiency optimizations significantly reduce redundant computation and inference latency. Extensive experiments on large-scale industrial datasets demonstrate that MixFormer consistently achieves superior accuracy and efficiency. Furthermore, large-scale online A/B tests on two production recommender systems, Douyin and Douyin Lite, show consistent improvements in user engagement metrics, including active days and in-app usage duration.
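
The idea that feature semantics "directly inform sequence aggregation" can be illustrated with a toy cross-attention step, in which each non-sequential feature token acts as a query that pools the behavior sequence. This is only a minimal sketch under strong assumptions, not the paper's implementation: the function names (`cross_attend`, `softmax`, `dot`), the single head, and the absence of learned projections are all hypothetical simplifications.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_attend(feature_queries, sequence_tokens):
    """Each feature token (query) computes a weighted average of the sequence.

    Assumption: keys and values are the raw sequence embeddings, with no
    learned projections and a single attention head.
    """
    dim = len(sequence_tokens[0])
    pooled = []
    for q in feature_queries:
        weights = softmax([dot(q, k) / math.sqrt(dim) for k in sequence_tokens])
        pooled.append([
            sum(w * v[d] for w, v in zip(weights, sequence_tokens))
            for d in range(dim)
        ])
    return pooled


# Two feature tokens query a sequence of three behavior embeddings.
feature_queries = [[1.0, 0.0], [0.0, 1.0]]
sequence_tokens = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
pooled = cross_attend(feature_queries, sequence_tokens)
```

Each query yields its own summary of the same sequence, so different feature tokens emphasize different behaviors. In this toy input the first query weights the first behavior embedding most heavily and the second query weights the second.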

Paper Structure (28 sections, 9 equations, 6 figures, 2 tables)

Figures (6)

  • Figure 1: Overview of the MixFormer framework. The architecture consists of a feature embedding and split layer followed by $L$ stacked MixFormer blocks and task-specific heads. Each MixFormer block integrates three core components: a Query Mixer, Cross-Attention, and Output Fusion.
  • Figure 2: Overview of the User-Item Decoupled MixFormer with a request-level reduction technique. The green modules denote user-side computations that can be request-level shared, while the red modules indicate item-side computations that are performed independently for each candidate item and therefore cannot be shared.
  • Figure 3: Ablation study on the modules in MixFormer. AUC gain is measured relative to the proposed MixFormer-small. QM, CA, OF, HM, and SA are abbreviations of Query Mixer, Cross Attention, Output Fusion, HeadMixing, and SelfAttention, respectively. PT-FFN and PFFN denote per-layer FFN and per-head FFN, respectively.
  • Figure 4: Scaling study over FLOPs. The sequence length is fixed at 512. $(A+)B$ denotes that the size of module A is fixed while the size of module B is scaled.
  • Figure 5: Scaling study over sequence length. The sequence length is varied over {512, 2048, 8192, 10000}.
  • ...and 1 more figure
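
The request-level sharing described in the Figure 2 caption can be sketched as follows: user-side work runs once per request and its result is reused for every candidate, while item-side work runs once per candidate. This is a hedged illustration only. The functions `user_side`, `item_side`, and `score_request` are hypothetical stand-ins, and the arithmetic inside them is a placeholder, not the model's actual computation.

```python
def user_side(user_features):
    """Stand-in for the user-side computation that can be shared per request."""
    user_side.calls += 1  # Count invocations to show the sharing.
    return sum(user_features)


user_side.calls = 0


def item_side(user_state, item_features):
    """Stand-in for the item-side computation, run for each candidate."""
    return user_state * sum(item_features)


def score_request(user_features, candidates):
    # Computed once per request and shared by every candidate item.
    user_state = user_side(user_features)
    # Computed independently for each candidate item.
    return [item_side(user_state, item) for item in candidates]


scores = score_request([0.5, 1.5], [[1.0], [2.0], [3.0]])
```

With three candidates, `user_side` runs once instead of three times. Under this sketch, the saving grows with the number of candidates scored per request.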