MCSD: An Efficient Language Model with Diverse Fusion

Hua Yang; Duohai Li; Shiman Li

TL;DR

MCSD is a linear-scaling language model that replaces standard attention with a diverse fusion mechanism built on the multi-channel slope and decay (MCSD) block. By reformulating inference as a recurrent representation, MCSD achieves $O(1)$ space and $O(N)$ time complexity, enabling efficient edge deployment while maintaining competitive performance. Empirical results show that MCSD attains higher throughput and lower GPU memory consumption than Transformer baselines of similar size, with strong performance on downstream tasks and scaling behavior consistent with established scaling laws. The work presents MCSD as a promising foundation for resource-constrained deployment and embodied intelligence, and names domain generalization and hardware synchronization as future directions.

Abstract

Transformers excel in Natural Language Processing (NLP) because they capture long-term dependencies well, but the resource consumption of self-attention grows quadratically with sequence length. To address this, we propose the MCSD model, an efficient language model with linear scaling and fast inference. The MCSD model relies on diverse feature fusion, primarily through the multi-channel slope and decay (MCSD) block, to represent features robustly. This block comprises slope and decay sections that extract features across diverse temporal receptive fields, capturing both local and global information. In addition, the MCSD block fuses these diverse features element-wise to further strengthen fine-grained feature extraction. For inference, we reformulate the computation as a recurrent representation, reducing space complexity to $O(1)$ and time complexity to $O(N)$. Our experiments show that MCSD attains higher throughput and lower GPU memory consumption than Transformers, while maintaining performance comparable to larger-scale language models on benchmark tests. These attributes position MCSD as a promising base for edge deployment and embodied intelligence.
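To make the complexity claim concrete, the sketch below shows how a causal decay-weighted sum can be computed either directly over the full history or through a single running state. It is only an illustration under simplifying assumptions: a scalar decay factor `lam` and scalar inputs, with the hypothetical function names `decay_parallel` and `decay_recurrent`. It is not the MCSD block itself, whose multi-channel matrices, perturbation, and gating are defined in the paper.

```python
def decay_parallel(xs, lam):
    """Direct form: y[t] = sum over s <= t of lam**(t - s) * xs[s].

    Every output re-reads the whole history, so the work per token
    grows with the sequence position.
    """
    return [
        sum(lam ** (t - s) * xs[s] for s in range(t + 1))
        for t in range(len(xs))
    ]


def decay_recurrent(xs, lam):
    """Recurrent form of the same quantity: h = lam * h + x.

    Only the scalar state h is carried between steps, so the memory
    needed per step is constant and the total work is linear in len(xs).
    """
    h = 0.0
    for x in xs:
        h = lam * h + x
        yield h


if __name__ == "__main__":
    xs = [1.0, 2.0, 3.0, 4.0]
    lam = 0.5  # assumed decay factor, chosen only for illustration
    print(decay_parallel(xs, lam))
    print(list(decay_recurrent(xs, lam)))
```

Both functions produce the same sequence; the recurrent one is the shape of computation that allows constant-size state at inference time.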

Paper Structure (13 sections, 21 equations, 9 figures, 2 tables)

Introduction
Methodology
Architecture
MCSD block
Fast inference for the MCSD block
Experiment
Experimental Details
Scaling Curves
Inference Cost
Downstream Task Comparison
Ablation Study
Conclusion
Limitations

Figures (9)

Figure 1: The architecture of our MCSD model (left). The proposed MCSD block (right) with decay and slope sections.
Figure 2: The slope section comprises multi-channel slope and slope perturbation, integrating past positional information via distinct slope matrices and conveying historical data to current features through element-wise multiplication, respectively. A gating mechanism filters this output, predominantly preserving current information.
Figure 3: The decay section encompasses multi-channel decay and decay perturbation, integrating past positional data via distinct decay matrices and updating historical information through element-wise multiplication with current features. A gating mechanism selectively filters this output, primarily conserving historical information.
Figure 4: Scaling curves for MCSD illustrate a linear decline in loss value with growing training token volume, culminating in convergence. Larger model parameter counts correlate with diminished converged loss values.
Figure 5: GPU memory versus sequence length curves for MCSD and Transformer.
...and 4 more figures
