Differentiable Model Scaling using Differentiable Topk

Kai Liu; Ruohui Wang; Jianfei Gao; Kai Chen

TL;DR

This paper tackles the inefficiency of Neural Architecture Search (NAS) by introducing Differentiable Model Scaling (DMS), which uses a fully differentiable top-k operator to directly model and optimize both network width and depth. The differentiable top-k consists of importance normalization and a soft masking mechanism, enabling stable gradient-based optimization of structural hyperparameters under a resource-constraint loss. DMS yields three pipelines (with or without pretraining) and demonstrates strong improvements across vision, object detection, and language modeling tasks, achieving higher accuracy with far lower search costs than state-of-the-art NAS and pruning methods. The approach is presented as broadly applicable, scalable, and practical for real-world model development, with plans to release code publicly.
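The two stages named above, importance normalization followed by soft masking, can be sketched as follows. This is a minimal illustration and not the authors' implementation. It assumes that normalization is rank-based (the fraction of elements each one beats) and that the soft mask is a sigmoid with sharpness `lam`; neither form is stated in this summary. The function names and the example scores are hypothetical.

```python
import math


def normalize_importance(importance):
    """Rank-normalize raw scores: fraction of elements each score strictly beats.

    Assumed form of "importance normalization"; values land in [0, 1).
    """
    n = len(importance)
    return [sum(1 for other in importance if c > other) / n for c in importance]


def soft_topk_mask(importance, pruning_ratio, lam):
    """Soft mask in (0, 1): near 1 when normalized importance exceeds the pruning ratio.

    Assumed form: sigmoid(lam * (c'_i - a)), where a is the learnable pruning ratio.
    """
    normalized = normalize_importance(importance)
    return [1.0 / (1.0 + math.exp(-lam * (c - pruning_ratio))) for c in normalized]


# Hypothetical raw importance scores for four width elements (e.g. channels).
scores = [0.3, 2.0, 1.1, 0.7]
mask = soft_topk_mask(scores, pruning_ratio=0.4, lam=100.0)

# Elements whose mask is above 0.5 are the ones effectively kept.
kept = [i for i, m in enumerate(mask) if m > 0.5]
```

Because the mask is a smooth function of `pruning_ratio`, a gradient can flow from the task loss and a resource-constraint loss back to that single structural parameter, which is the property the TL;DR refers to as "fully differentiable".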

Abstract

Over the past few years, as large language models have ushered in an era of emergent intelligence, interest in scaling networks has intensified. Many network architectures are still designed manually, which often results in sub-optimal configurations. Although Neural Architecture Search (NAS) methods have been proposed to automate this process, they suffer from low search efficiency. This study introduces Differentiable Model Scaling (DMS), which improves the efficiency of searching for the optimal width and depth of a network. DMS models both width and depth in a direct and fully differentiable way, making them easy to optimize. We evaluate DMS across diverse tasks, from vision to NLP, and across various network architectures, including CNNs and Transformers. Results consistently indicate that DMS finds improved structures and outperforms state-of-the-art NAS methods. Specifically, for image classification on ImageNet, DMS improves the top-1 accuracy of EfficientNet-B0 and DeiT-Tiny by 1.4% and 0.6%, respectively, and outperforms the state-of-the-art zero-shot NAS method, ZiCo, by 1.3% while requiring only 0.4 GPU days for searching. For object detection on COCO, DMS improves the mAP of Yolo-v8-n by 2.0%. For language modeling, our pruned Llama-7B outperforms the prior method with lower perplexity and higher zero-shot classification accuracy. We will release our code in the future.

Paper Structure

This paper contains 39 sections, 8 equations, 4 figures, 16 tables.

Introduction
Related Work
Stochastic Search Methods
Gradient-based Methods
Method
Differentiable Top-k
Importance Normalization
Soft Mask Generation
Element Evaluation
Differentiable Model Scaling
Experiment
Comparison with Different Search Methods
Comparison with Gradient-based Methods
Comparison with Evolutionary Algorithm
Comparison with SOTA NAS Methods
...and 24 more sections

Figures (4)

Figure 1: Different Gradient-based Modeling Strategies for Width and Depth. All strategies use learnable parameters to generate an element mask that selects width elements or depth elements. Subfigure (a) illustrates four methods to generate the element mask, while (b) shows how the mask is used to search width and depth. (a.1) Multiple Element Selection: The element count is transformed into a multiple-element selection. (a.2) Single Number Selection: The element count is transformed into a selection from multiple numbers. (a.3) Gradient Estimate Topk: The element count is directly modeled yet non-differentiable. (a.4) Our Differentiable Topk: The element count is directly modeled and is fully differentiable. "Direct" means that the learnable parameters directly model the structural hyperparameters, while "Differentiable" means that the gradient of the learnable parameters can be computed in a fully differentiable manner.
Figure 2: Forward and Backward Graph of Our Differentiable Top-k. We set the maximal element number $N=\lambda=100$ and the pruning ratio $a\in \{0.25,0.5,0.75\}$. The x-axis represents the normalized element importance $c'_i$. (a) demonstrates the forward process, where the y-axis represents the soft mask $m_i$. (b) illustrates the backward process, where the y-axis represents the gradient of $m_i$ with respect to $a$, $\frac{\partial m_i}{\partial a}$.
Figure 3: We draw these three plots based on Table \ref{tab:eff} and Table \ref{tab:eff_more}. Larger dots represent the "High" search cost level and smaller dots represent the "Low" search cost level. Dashed lines represent models trained with distillation. (a) Performance Comparison with Low-Search-Cost Methods: our method outperforms these methods significantly. (b) Performance Comparison with High-Search-Cost Methods: our method achieves comparable or even better performance, while every high-search-cost method costs dozens of times more than our total search cost. (c) Search Cost Comparison with High-Search-Cost Methods: we compare our search cost with that of high-search-cost methods. We only plot methods with a precise search cost estimate. The search costs of the high-search-cost methods that are not plotted are also larger than 100 GPU days. Search costs are shown on a log scale.
Figure 4: Visualization of Our Searched Structure. The x-axis represents the layers' width (channels/features), while the y-axis represents the layers. As $\text{DMS}_{\text{np}}$-EN-B0 has more layers than EfficientNet-B0, the widths of the extra layers for EfficientNet-B0 are shown as 0.
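
The backward curve described for Figure 2(b) can be worked out in closed form under one assumption that this summary does not state: that the soft mask is a sigmoid $\sigma$ of the gap between the normalized importance and the pruning ratio, scaled by $\lambda$. As a sketch under that assumption:

$$
m_i = \sigma\big(\lambda\,(c'_i - a)\big), \qquad \frac{\partial m_i}{\partial a} = -\lambda\, m_i\,(1 - m_i)
$$

In this form the gradient is never positive, so raising the pruning ratio can only lower each mask value. Its magnitude peaks at $c'_i = a$, where $m_i = 0.5$ and $\left|\frac{\partial m_i}{\partial a}\right| = \lambda/4$, which is 25 for $\lambda = 100$. Elements whose normalized importance lies far from $a$ receive almost no gradient.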
