Model Alignment Search
Satchel Grant
TL;DR
Model Alignment Search (MAS) is a causal, bidirectional approach to measuring neural similarity: it learns an invertible alignment for each model and performs interchange interventions on behaviorally relevant subspaces. The paper situates MAS between model stitching and Distributed Alignment Search (DAS), offering a more behavior-driven measure of similarity that reduces computational requirements and reveals subtleties that correlational methods miss. It demonstrates MAS on numeric task families and toxicity-model case studies, introduces a Counterfactual Latent (CL) auxiliary objective for causally inaccessible networks (CLMAS), and shows that MAS can isolate specific causal information while maintaining competitive alignment with fewer resources. The findings advocate for causal methods in neural similarity analysis and outline directions for extending network alignment methodologies, including biological applications and larger-scale multi-model comparisons.
Abstract
When can we say that two neural systems perform a task in the same way? What nuances do we miss when we fail to causally probe the representations of the systems, and how do we establish bidirectional causal relationships? In this work, we introduce a method that bidirectionally transfers neural activity between artificial neural networks and uses their resulting behavior as a measure of functional similarity. We first show that the method can be used to transfer behavior from one frozen neural network (NN) to another in a manner similar to model stitching, and we show how the method can differ from correlative similarity measures like Representational Similarity Analysis. Next, we empirically and theoretically show how the method can be equivalent to model stitching when desired, or can take a form with a more restrictive focus on shared causal information; in both forms, it reduces the number of matrices required for a comparison of n models to be linear in n. We then present a case study on number-related tasks showing that the method can be used to examine specific subtypes of causal information, demonstrating that numbers can be encoded differently in recurrent models depending on the task. We present another case study showing that MAS can reveal misalignment in fine-tuned DeepSeek-r1-Qwen-1.5B models. Lastly, we augment the loss function with a counterfactual latent (CL) auxiliary objective to improve causal relevance when one of the two networks is causally inaccessible (as is often the case in comparisons with biological networks). We use our results to encourage the use of causal methods in neural similarity analyses and to suggest future explorations of network similarity methodology for model misalignment.
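As a rough illustration of the bidirectional transfer described above, the following toy sketch patches a subspace of one model's hidden state with the corresponding subspace from another model, using one invertible alignment matrix per model. It is a minimal sketch under stated assumptions, not the paper's implementation: the orthogonal parameterization, the equal hidden sizes, the choice of the first `k` aligned dimensions as the shared subspace, and all function names (`random_orthogonal`, `interchange`, `matrices_needed`) are hypothetical. In practice the alignments would presumably be trained so that the frozen models' behavior after patching matches the desired counterfactual behavior; that training loop is omitted here. The pairwise count in `matrices_needed` assumes one learned map per ordered pair of models, which is also an assumption made for comparison.

```python
import numpy as np


def random_orthogonal(dim, rng):
    # Stand-in for a learned alignment: the Q factor of a QR
    # decomposition is orthogonal, hence trivially invertible.
    q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
    return q


def interchange(h_tgt, h_src, q_tgt, q_src, k):
    """Overwrite the first k aligned dimensions of h_tgt with those of h_src.

    Both hidden states are rotated into their aligned spaces, the first k
    coordinates are swapped in from the source, and the result is mapped
    back into the target model's native hidden space.
    """
    z_tgt = q_tgt @ h_tgt
    z_src = q_src @ h_src
    z_new = z_tgt.copy()
    z_new[:k] = z_src[:k]
    # q_tgt is orthogonal, so its inverse is its transpose.
    return q_tgt.T @ z_new


def matrices_needed(n_models):
    """Return (pairwise, per_model) matrix counts.

    Assumption: pairwise mapping uses one matrix per ordered pair of
    models, while a per-model alignment uses one matrix per model.
    """
    return n_models * (n_models - 1), n_models


rng = np.random.default_rng(0)
dim, k = 4, 2  # toy hidden size and shared-subspace size (assumed)
q_a, q_b = random_orthogonal(dim, rng), random_orthogonal(dim, rng)
h_a, h_b = rng.standard_normal(dim), rng.standard_normal(dim)

# The same two matrices serve both transfer directions.
h_a_patched = interchange(h_a, h_b, q_a, q_b, k)  # model B into model A
h_b_patched = interchange(h_b, h_a, q_b, q_a, k)  # model A into model B

print(matrices_needed(5))
```

Because each model owns a single alignment matrix, adding a model to the comparison adds one matrix rather than one per existing model, which is the sense in which the count grows linearly in n.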
