Words in Motion: Extracting Interpretable Control Vectors for Motion Transformers
Omer Sahin Tas, Royden Wagner
TL;DR
This work addresses the tension between interpretability and accuracy in motion forecasting by analyzing the hidden states of multimodal motion transformers with linear probes, which reveal latent space regularities. It introduces interpretable motion features (the 'words' of motion), fits control vectors from the differences between hidden states with opposing features, and refines them with sparse autoencoders (SAEs) to enable activation steering of forecasts at inference time. The analysis shows neural collapse toward interpretable features, SAE-refined control vectors produce more linear changes in predictions when scaled, and the approach generalizes zero-shot to unseen dataset characteristics with negligible computational overhead. The results suggest that mechanistic interpretability and controllability are achievable in motion transformers, with practical implications for safer and more transparent autonomous driving systems.
Abstract
Transformer-based models generate hidden states that are difficult to interpret. In this work, we analyze hidden states and modify them at inference, with a focus on motion forecasting. We use linear probing to analyze whether interpretable features are embedded in hidden states. Our experiments reveal high probing accuracy, indicating latent space regularities with functionally important directions. Building on this, we use the directions between hidden states with opposing features to fit control vectors. At inference, we add our control vectors to hidden states and evaluate their impact on predictions. Remarkably, such modifications preserve the feasibility of predictions. We further refine our control vectors using sparse autoencoders (SAEs). This leads to more linear changes in predictions when scaling control vectors. Our approach enables mechanistic interpretation as well as zero-shot generalization to unseen dataset characteristics with negligible computational overhead.
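The steering idea described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes the control vector is the difference of class means between hidden states with opposing features (one simple way to fit a direction between them), it uses synthetic hidden states, and the names `fit_control_vector`, `steer`, `h_fast`, and `h_slow` are hypothetical.

```python
import numpy as np


def fit_control_vector(h_pos: np.ndarray, h_neg: np.ndarray) -> np.ndarray:
    """Fit a control vector as the difference of class means.

    Assumption: a mean difference stands in for whatever fitting
    procedure the paper uses on hidden states with opposing features.
    h_pos, h_neg: arrays of shape (num_samples, hidden_dim).
    """
    return h_pos.mean(axis=0) - h_neg.mean(axis=0)


def steer(hidden: np.ndarray, control_vector: np.ndarray, scale: float = 1.0) -> np.ndarray:
    """Add a scaled control vector to hidden states at inference."""
    return hidden + scale * control_vector


# Synthetic stand-ins for hidden states of two opposing features
# (for example high speed versus low speed), separated along axis 0.
rng = np.random.default_rng(0)
hidden_dim = 16
direction = np.zeros(hidden_dim)
direction[0] = 1.0

h_fast = rng.normal(size=(200, hidden_dim)) + 2.0 * direction
h_slow = rng.normal(size=(200, hidden_dim)) - 2.0 * direction

control_vector = fit_control_vector(h_fast, h_slow)

# Steering the "slow" hidden states with scale 1.0 moves their mean
# onto the mean of the "fast" hidden states; scale 0.0 is a no-op.
steered = steer(h_slow, control_vector, scale=1.0)
unchanged = steer(h_slow, control_vector, scale=0.0)
```

In a real model the steered hidden states would be passed on to the remaining layers and the decoder, and the scale would be swept to observe how the forecast changes. The addition itself is a single vector sum per hidden state, which is consistent with the negligible overhead noted in the abstract.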
