Network of Theseus (like the ship)

Vighnesh Subramaniam, Colin Conwell, Boris Katz, Andrei Barbu, Brian Cheung

TL;DR

NoT introduces a framework to decouple training and deployment architectures by progressively substituting parts of a guide network with target modules while aligning intermediate representations. Using representational similarity metrics like CKA and a differentiable variant D-MNN, it preserves performance across broad cross-architectural conversions and even when starting from untrained guides. The approach demonstrates substantial maintenance of accuracy over naive replacements and reveals insights about bottlenecks and the relationship between alignment and task performance. This opens avenues for tailoring inference-time architectures for deployment constraints without re-solving the full optimization problem from scratch.

Abstract

A standard assumption in deep learning is that the inductive bias introduced by a neural network architecture must persist from training through inference. The architecture you train with is the architecture you deploy. This assumption constrains the community from selecting architectures that may have desirable efficiency or design properties due to difficulties with optimization. We challenge this assumption with Network of Theseus (NoT), a method for progressively converting a trained, or even untrained, guide network architecture part-by-part into an entirely different target network architecture while preserving the performance of the guide network. At each stage, components in the guide network architecture are incrementally replaced with target architecture modules and aligned via representational similarity metrics. This procedure largely preserves the functionality of the guide network even under substantial architectural changes, such as converting a convolutional network into a multilayer perceptron, or GPT-2 into a recurrent neural network. By decoupling optimization from deployment, NoT expands the space of viable inference-time architectures, opening opportunities for better accuracy-efficiency tradeoffs and enabling more directed exploration of the architectural design space.
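
To make the alignment step concrete, the following is a minimal sketch of a CKA-based similarity loss. It assumes the standard linear form of CKA computed on activation matrices of shape (samples, features) and a loss of the form 1 - CKA; the summary does not state which CKA variant or loss form the authors use, so the function names `linear_cka` and `cka_loss` and the loss definition are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np


def linear_cka(x, y):
    """Linear CKA between activations x (n, p) and y (n, q).

    Assumption: standard linear CKA on column-centered features.
    The two matrices may have different feature widths.
    """
    # Center each feature so the score depends only on covariance structure.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    # Squared Frobenius norm of the cross-covariance, normalized by the
    # Frobenius norms of the two self-covariances.
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return cross / (norm_x * norm_y)


def cka_loss(guide_acts, target_acts):
    """Hypothetical alignment loss: 0 when the representations match under CKA."""
    return 1.0 - linear_cka(guide_acts, target_acts)
```

Because linear CKA is invariant to orthogonal transformations and isotropic scaling of the features, a target module can score as perfectly aligned without reproducing the guide's activations coordinate by coordinate, which is one reason a similarity metric is a natural fit when guide and target modules have different widths or structure.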

Paper Structure

This paper contains 25 sections, 16 equations, 10 figures, 8 tables, 1 algorithm.

Figures (10)

  • Figure 1: Network of Theseus. A network can be converted to any desired target network by replacing each piece of the original network incrementally, part-by-part. Each original part is replaced by optimizing the representational alignment, $\ell_{SIM}$, of the target part to the original part. After all original parts are replaced, only the target network remains and can be trained on any downstream task (i.e. standard training).
  • Figure 2: Alternative replacement schedules: Sequential: Each layer is replaced while holding each previously replaced layer frozen. Independent: Each layer is replaced independently, routing the input from the original layer below to the target layer above. Joint: Each target layer is trained jointly or simultaneously without any progressive conversion.
  • Figure 3: NoT makes any-to-any architecture conversion possible. Without the requirement of functional intermediate models, the target network can be any architecture. For instance, we convert the deeper ResNet-50 to the shallower ResNet-18. Blue squares are a set of multiple ResNet-50 blocks. These are replaced with a single ResNet-18 block. ResNet-18 blocks are added until we have a full ResNet-18 network.
  • Figure 4: Progressive layer replacement preserves performance across replacements. We visualize progressive layer replacement across all patches. We apply a patch, minimize the CKA loss, and fine-tune the resultant hybrid network until full replacement. This is compared with naive replacement with no CKA alignment. We compare forward replacement (X$\rightarrow$Y) and reverse replacement (Y$\rightarrow$X) in both settings, and compare using D-MNN in the ResNet-18$\rightarrow$MLP setting. Across all progressive replacements, we far exceed naive replacement with no alignment. Our method is not sensitive to replacement order.
  • Figure 5: Representational similarity across stages reveals difficult layers: (left) We show CKA similarity losses (log-scaled) across all stages of progressive replacement. We see that CKA loss decreases across all stages for all layers. (right) We plot the final CKA loss for the last stage across all layers. We can identify bottleneck layers that are more difficult to align. Specifically, layers 6, 8, 13, and the final layers have higher loss. Layers 6, 8, and 13 are associated with downsampling in ResNet-18.
  • ...and 5 more figures
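
The difference between the sequential and independent schedules in Figure 2 comes down to which activations feed each target layer while it is being aligned. The toy sketch below illustrates only that routing. It is not the paper's procedure: target layers here are plain linear maps, and alignment is replaced by an exact least-squares fit to the guide's activations instead of gradient-based optimization of a similarity loss. All function names are hypothetical.

```python
import numpy as np


def fit_linear(inputs, outputs):
    # Least-squares stand-in for alignment: find W with inputs @ W ≈ outputs.
    w, *_ = np.linalg.lstsq(inputs, outputs, rcond=None)
    return w


def guide_activations(x, guide_layers):
    # acts[0] is the input; acts[i + 1] is the output of guide layer i.
    acts = [x]
    for layer in guide_layers:
        acts.append(layer(acts[-1]))
    return acts


def replace_sequential(x, guide_layers):
    """Each target layer is fed by the already-replaced (frozen) target layers."""
    acts = guide_activations(x, guide_layers)
    targets, h = [], x
    for i in range(len(guide_layers)):
        w = fit_linear(h, acts[i + 1])  # match the guide's output at layer i
        targets.append(w)
        h = h @ w  # earlier target layers are not revisited
    return targets


def replace_independent(x, guide_layers):
    """Each target layer is fed by the guide's own activation from the layer below."""
    acts = guide_activations(x, guide_layers)
    return [fit_linear(acts[i], acts[i + 1]) for i in range(len(guide_layers))]
```

In the sequential variant, errors made by earlier target layers are visible to later ones, which can compensate for them; in the independent variant, every layer sees clean guide activations and can be fit in parallel, but nothing corrects for mismatch once the target layers are chained together.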