Hybrid Dual-Path Linear Transformations for Efficient Transformer Architectures

Vladimer Khasia

TL;DR

We introduce the Hybrid Dual-Path Linear (HDPL) operator, which decomposes the affine transformation into two topologically distinct pathways: a sparse block-diagonal component for high-rank local processing, and a low-rank Variational Autoencoder (VAE) bottleneck for global context regularization.

Abstract

Standard Transformer architectures rely heavily on dense linear transformations, treating feature projection as a monolithic, full-rank operation. We argue that this formulation is inefficient and lacks the structural inductive bias needed to distinguish local feature preservation from global context integration. To address this, we introduce the Hybrid Dual-Path Linear (HDPL) operator, which decomposes the affine transformation into two topologically distinct pathways: a sparse block-diagonal component for high-rank local processing, and a low-rank Variational Autoencoder (VAE) bottleneck for global context regularization. By "surgically" replacing specific projections (Query, Key, Value, Gate, Up) with HDPL operators while retaining standard dense layers for aggregation (Output, Down), we achieve a better balance of efficiency and representational power. Experiments on the FineWeb-Edu dataset demonstrate that the HDPL architecture outperforms a standard Llama-style baseline, lowering validation loss while using 6.8% fewer parameters. Beyond immediate performance gains, we discuss how the explicit materialization of a probabilistic latent space within the Transformer backbone serves as a vital architectural affordance, offering new pathways for inference-time or hypernetwork-induced control, continual adaptation, interpretability, and cross-model or cross-modal synchronization. The code is available at https://github.com/VladimerKhasia/HDPL
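To make the efficiency argument concrete, the following is a minimal sketch that counts the weights in a dense projection and in a dual-path replacement. The layout is an assumption for illustration, not the paper's specification: the local path is taken to be num_blocks equal square-tiled blocks, the global path an encoder that emits a mean and a log-variance of size rank followed by a decoder, and biases are ignored. The function names and the sizes used are hypothetical.

```python
def dense_params(d_in: int, d_out: int) -> int:
    """Weights in a standard dense projection (bias ignored)."""
    return d_in * d_out


def dual_path_params(d_in: int, d_out: int, num_blocks: int, rank: int) -> int:
    """Weights in a hypothetical block-diagonal + VAE-bottleneck layer.

    Assumed layout (not taken from the paper):
      - local path: num_blocks independent blocks, each of shape
        (d_in / num_blocks) x (d_out / num_blocks)
      - global path: encoder d_in -> 2 * rank (mean and log-variance),
        decoder rank -> d_out
    """
    if d_in % num_blocks or d_out % num_blocks:
        raise ValueError("num_blocks must divide both d_in and d_out")
    local = num_blocks * (d_in // num_blocks) * (d_out // num_blocks)
    encoder = d_in * 2 * rank
    decoder = rank * d_out
    return local + encoder + decoder


# Illustrative sizes only; these are not the paper's hyperparameters.
d_in, d_out, num_blocks, rank = 512, 512, 4, 32
dense = dense_params(d_in, d_out)
hybrid = dual_path_params(d_in, d_out, num_blocks, rank)
print(dense, hybrid, f"{1 - hybrid / dense:.1%} fewer weights")
```

The per-layer saving in this sketch is not comparable to the 6.8% figure above, which is a whole-model count: only some projections are replaced, and the Output and Down layers, along with the rest of the model, stay dense.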

Paper Structure (24 sections, 11 equations, 1 figure, 2 tables, 1 algorithm)

Introduction
Methodology
The Hybrid Dual-Path Operator
Path 1: Block-Diagonal Local Projection
Path 2: Variational Global Context
Algorithm Specification
Training Objective and Regularization
Surgical Integration into Transformer
Complexity Analysis
Experiments
Experimental Setup
Results and Analysis
Discussion
Architectural Implications of the Dual-Path Topology
The Strategic Utility of the VAE Bottleneck
...and 9 more sections

Figures (1)

Figure 2: Validation Loss Trajectories. Comparison of the Baseline (Light Blue) and Surgical Hybrid (Dark Blue) models over 20,000 training steps. The Surgical Hybrid method exhibits strictly better sample efficiency, achieving lower loss values earlier in training and converging to a superior optimum (4.28 vs 4.32) despite having fewer parameters.
