Disentangled Lottery Tickets: Identifying and Assembling Core and Specialist Subnetworks
Sadman Mohammad Nasif, Md Abrar Jahin, M. F. Mridha
TL;DR
The paper addresses the inefficiency of finding winning tickets via Iterative Magnitude Pruning by introducing the Disentangled Lottery Ticket (DiLT) framework, which separates a universal core subnetwork from task-specific specialist subnetworks through partitioned training. It uses the Gromov-Wasserstein distance and spectral clustering to quantify functional similarity across network layers and automatically reveal modular structure, validating the core-specialist decomposition on ImageNet and Stanford Cars with ResNet and ViT. The key contributions are the DiLT hypothesis, a two-stage methodology for identifying and analyzing disentangled subnetworks, and experiments showing that the core ticket improves transferability while specialist tickets enable modular assembly, with the union ticket outperforming COLT. This reframes pruning as a tool for discovering and engineering modularity, offering practical gains in transfer learning and specialization, and pointing to the potential emergence of expert subnetworks within a single dense model.
Abstract
The Lottery Ticket Hypothesis (LTH) suggests that large neural networks contain sparse, trainable "winning tickets" capable of matching the performance of the full model, but identifying them through Iterative Magnitude Pruning (IMP) is computationally expensive. Recent work introduced COLT, an accelerator that discovers a "consensus" subnetwork by intersecting masks from models trained on disjoint data partitions; however, this approach discards all non-overlapping weights, assuming they are unimportant. This paper challenges that assumption and proposes the Disentangled Lottery Ticket (DiLT) Hypothesis, which posits that the intersection mask represents a universal, task-agnostic "core" subnetwork, while the non-overlapping difference masks capture specialized, task-specific "specialist" subnetworks. A framework is developed to identify and analyze these components, using the Gromov-Wasserstein (GW) distance to quantify functional similarity between layer representations and spectral clustering to reveal modular structure. Experiments on ImageNet and fine-grained datasets such as Stanford Cars, using ResNet and Vision Transformer architectures, show that the "core" ticket provides superior transfer learning performance, the "specialist" tickets retain domain-specific features enabling modular assembly, and the full re-assembled "union" ticket outperforms COLT, demonstrating that non-consensus weights play a critical functional role. This work reframes pruning as a process for discovering modular, disentangled subnetworks rather than merely compressing models.
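The core, specialist, and union masks described above reduce to elementwise set operations on binary pruning masks. The following is a minimal sketch of that mask algebra, not the authors' implementation. The function name `disentangle_masks` and the toy masks are hypothetical, and it assumes each specialist mask is a partition's own mask minus the shared intersection, which is one plausible reading of "non-overlapping difference masks".

```python
import numpy as np


def disentangle_masks(masks):
    """Split per-partition binary pruning masks into core, specialist, and union masks.

    Assumption (not confirmed by the source): a specialist mask is the set of
    weights a partition keeps that are not part of the shared intersection.
    """
    stacked = np.stack([np.asarray(m, dtype=bool) for m in masks])
    core = stacked.all(axis=0)    # weights kept by every partition (intersection)
    union = stacked.any(axis=0)   # weights kept by at least one partition
    specialists = [m & ~core for m in stacked]  # partition-specific remainder
    return core, specialists, union


# Toy masks over six weights, as if produced by pruning two models
# trained on disjoint data partitions (values are illustrative only).
mask_a = np.array([1, 1, 0, 1, 0, 0], dtype=bool)
mask_b = np.array([1, 0, 1, 1, 0, 0], dtype=bool)

core, (spec_a, spec_b), union = disentangle_masks([mask_a, mask_b])

print("core:        ", core.astype(int))
print("specialist A:", spec_a.astype(int))
print("specialist B:", spec_b.astype(int))
print("union:       ", union.astype(int))
```

In this toy case the core keeps the two weights both partitions agree on, each specialist keeps the single weight unique to its partition, and the union is the core plus both specialists. A COLT-style consensus ticket would correspond to `core` alone, so the weights in `spec_a` and `spec_b` are exactly the ones it discards.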
