LORS: Low-rank Residual Structure for Parameter-Efficient Network Stacking

Jialin Li, Qiang Nie, Weifu Fu, Yuhuan Lin, Guangpin Tao, Yong Liu, Chengjie Wang

TL;DR

LORS introduces a parameter-efficient approach for stacked neural modules by decomposing weights into shared components plus low-rank residuals across layers, with static (LORST) and adaptive (LORSA) variants. By applying LORS to AdaMixer's decoders, the method achieves up to ~70% decoder parameter reduction while maintaining, and in some cases improving, object detection performance on MS COCO. The approach leverages cross-layer sharing and low-rank private contributions to capture layer-specific nuances, effectively regularizing the stacked structure. The technique is broadly applicable to transformer-like stacks and offers a practical pathway to deploy large, depth-heavy models with reduced parameter footprints and maintained performance.
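The saving from sharing can be illustrated with a little arithmetic. The sketch below is not taken from the paper. It assumes each stacked layer would otherwise own a full d x h weight matrix, and that under LORS all layers share one such matrix while each layer adds a few low-rank factor pairs B (d x rank) and A (rank x h). The function names, the `groups` parameter, and the sizes are hypothetical and chosen only for illustration.

```python
def stacked_params(num_layers: int, d: int, h: int) -> int:
    """Parameter count when every layer owns its own full d x h matrix."""
    return num_layers * d * h


def lors_params(num_layers: int, d: int, h: int, rank: int, groups: int = 1) -> int:
    """Parameter count with one shared d x h matrix plus low-rank residuals.

    Assumption: each layer adds `groups` factor pairs, B (d x rank) and
    A (rank x h), so a layer's private cost is groups * rank * (d + h).
    """
    shared = d * h
    private_per_layer = groups * rank * (d + h)
    return shared + num_layers * private_per_layer


# Hypothetical sizes for illustration only (not the paper's configuration).
full = stacked_params(num_layers=6, d=256, h=256)
lors = lors_params(num_layers=6, d=256, h=256, rank=16, groups=1)
reduction = 1 - lors / full

print(f"full stack: {full} parameters")
print(f"shared + low-rank: {lors} parameters")
print(f"reduction: {reduction:.1%}")
```

Because the private cost grows with rank * (d + h) rather than d * h, the saving increases with stack depth and shrinks as the rank or the number of factor pairs goes up.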

Abstract

Deep learning models, particularly those based on transformers, often employ numerous stacked structures, which possess identical architectures and perform similar functions. While effective, this stacking paradigm leads to a substantial increase in the number of parameters, posing challenges for practical applications. In today's landscape of increasingly large models, stacking depth can even reach dozens, further exacerbating this issue. To mitigate this problem, we introduce LORS (LOw-rank Residual Structure). LORS allows stacked modules to share the majority of parameters, requiring a much smaller number of unique ones per module to match or even surpass the performance of using entirely distinct ones, thereby significantly reducing parameter usage. We validate our method by applying it to the stacked decoders of a query-based object detector, and conduct extensive experiments on the widely used MS COCO dataset. Experimental results demonstrate the effectiveness of our method, as even with a 70% reduction in the parameters of the decoder, our method still enables the model to achieve comparable or

Paper Structure

This paper contains 12 sections, 8 equations, 5 figures, 11 tables.

Figures (5)

  • Figure 1: The LORS calculation process, which could be adaptive or static, depending on whether an adaptively generated kernel is used in the matrix manipulation for private parameters.
  • Figure 2: Pseudo-code for obtaining a static weight parameter for one layer.
  • Figure 3: Pseudo-code for obtaining an adaptive weight parameter for one layer.
  • Figure 4: The overall pipeline of our proposed LORS. It consists of an adaptive part and a static part that work together, each further composed of shared and private components. The figure illustrates the entire computation process within one layer of the stacked layers, with an enlarged example of the $i$-th layer for demonstration.
  • Figure 5: Visualizing input features of each layer in DeiT-Tiny.
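
The pseudo-code of Figures 2 and 3 is not reproduced on this page. As a rough, non-authoritative sketch of the idea described in the TL;DR and in the Figure 1 caption, the code below builds one layer's weight as a shared matrix plus a sum of low-rank products. It assumes the static form W_i = W_shared + sum_k B_k A_k and an adaptive form in which a small kernel E is placed between the factors, W_i = W_shared + sum_k B_k E A_k. How the kernel is generated from the input is not specified here, so it is simply passed in. All function names and the toy values are hypothetical.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def matmul(x: Matrix, y: Matrix) -> Matrix:
    """Plain-Python matrix product of x (m x n) and y (n x p)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def add(x: Matrix, y: Matrix) -> Matrix:
    """Element-wise sum of two matrices of the same shape."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]


def static_weight(shared: Matrix, factors: Sequence[Tuple[Matrix, Matrix]]) -> Matrix:
    """Assumed static form: W_i = W_shared + sum_k B_k @ A_k."""
    weight = [row[:] for row in shared]  # copy so the shared matrix is untouched
    for b, a in factors:
        weight = add(weight, matmul(b, a))
    return weight


def adaptive_weight(
    shared: Matrix, factors: Sequence[Tuple[Matrix, Matrix]], kernel: Matrix
) -> Matrix:
    """Assumed adaptive form: W_i = W_shared + sum_k B_k @ E @ A_k.

    `kernel` (E, rank x rank) stands in for an adaptively generated kernel.
    """
    weight = [row[:] for row in shared]
    for b, a in factors:
        weight = add(weight, matmul(matmul(b, kernel), a))
    return weight


# Toy values: a 2 x 2 shared matrix and one rank-1 factor pair for this layer.
w_shared = [[1.0, 0.0], [0.0, 1.0]]
b_i = [[1.0], [2.0]]   # d x rank
a_i = [[3.0, 4.0]]     # rank x h

w_static = static_weight(w_shared, [(b_i, a_i)])
w_adaptive = adaptive_weight(w_shared, [(b_i, a_i)], kernel=[[0.5]])
print(w_static)
print(w_adaptive)
```

In this sketch only `b_i`, `a_i`, and the kernel differ between layers, while `w_shared` is reused by every layer in the stack.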