Words in Motion: Extracting Interpretable Control Vectors for Motion Transformers

Omer Sahin Tas, Royden Wagner

TL;DR

This work tackles the interpretability-versus-accuracy tension in motion forecasting by analyzing hidden states of multimodal motion transformers with linear probes to reveal latent space regularities. It introduces interpretable motion features (the 'words' of motion), fits directionally opposing control vectors from latent space differences, and refines them with sparse autoencoders to enable inference-time activation steering of forecasts. The approach demonstrates neural collapse toward interpretable features, enables precise, linear control of predictions with SAEs, and shows zero-shot generalization to unseen dataset characteristics with minimal computational overhead. The results suggest that mechanistic interpretability and controllability can be achieved in motion transformers, with practical implications for safer and more transparent autonomous driving systems.

Abstract

Transformer-based models generate hidden states that are difficult to interpret. In this work, we analyze hidden states and modify them at inference, with a focus on motion forecasting. We use linear probing to analyze whether interpretable features are embedded in hidden states. Our experiments reveal high probing accuracy, indicating latent space regularities with functionally important directions. Building on this, we use the directions between hidden states with opposing features to fit control vectors. At inference, we add our control vectors to hidden states and evaluate their impact on predictions. Remarkably, such modifications preserve the feasibility of predictions. We further refine our control vectors using sparse autoencoders (SAEs). This leads to more linear changes in predictions when scaling control vectors. Our approach enables mechanistic interpretation as well as zero-shot generalization to unseen dataset characteristics with negligible computational overhead.
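
As a rough illustration of the fit-then-steer idea described in the abstract, the sketch below fits a control vector from hidden states with opposing features and adds a scaled copy of it to hidden states at inference. It is a minimal sketch under stated assumptions, not the authors' implementation: the difference of class means is only one simple way to obtain a direction between opposing groups, and the function names, array shapes, synthetic data, and the scaling factor `tau` value are all hypothetical.

```python
import numpy as np


def fit_control_vector(hidden_pos: np.ndarray, hidden_neg: np.ndarray) -> np.ndarray:
    """Fit a control vector as the difference of class means.

    Assumption: each input has shape (num_samples, hidden_dim) and holds
    hidden states for one of two opposing features (e.g. accelerating
    versus decelerating). The mean difference is an illustrative choice.
    """
    return hidden_pos.mean(axis=0) - hidden_neg.mean(axis=0)


def steer(hidden: np.ndarray, control_vector: np.ndarray, tau: float) -> np.ndarray:
    """Add the control vector, scaled by tau, to every hidden state."""
    return hidden + tau * control_vector


# Synthetic stand-in for hidden states: two groups separated along axis 0.
rng = np.random.default_rng(0)
dim = 8
direction = np.zeros(dim)
direction[0] = 1.0
h_accel = rng.normal(size=(64, dim)) + 2.0 * direction
h_decel = rng.normal(size=(64, dim)) - 2.0 * direction

v = fit_control_vector(h_accel, h_decel)

# Steer a few new hidden states; a negative tau pushes toward "decelerating".
h = rng.normal(size=(4, dim))
h_steered = steer(h, v, tau=-0.5)
```

In an actual motion transformer, the steered hidden states would be passed on to the remaining layers and the decoder, and the effect would be measured on the resulting forecasts; that part depends on the model and is omitted here.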

Paper Structure (40 sections, 4 equations, 18 figures, 10 tables)

Figures (18)

  • Figure 1: Words in Motion. (a) We classify motion features in an interpretable way, as in natural language. (b) We measure the degree to which these interpretable features are embedded in the hidden states $\bm{H}_{i,:}$ of transformer models with linear probes. Furthermore, we use our discrete features and sparse autoencoding to fit interpretable control vectors $\bm{V}_{i,:}$ that allow for modifying motion forecasts at inference. The training of the sparse autoencoder is shown with red arrows ($\rightarrow$) and the fitting of control vectors with blue arrows ($\rightarrow$).
  • Figure 2: Linear probing accuracies for RedMotion, Wayformer, and HPTR on the validation split of the AV2F dataset.
  • Figure 3: Normalized standard deviation representation quality metric for RedMotion, Wayformer, and HPTR.
  • Figure 4: Linear probing accuracies at module 0, module 1, and module 2 for classifying speed, acceleration, direction, and agent type on the validation split of the Waymo dataset.
  • Figure 5: Modifying hidden states to control a vehicle at an intersection. We add our acceleration control vector scaled with $\tau=-20$ and $\tau=100$ to enforce a strong deceleration and a moderate acceleration. The focal agent is highlighted in orange, dynamic agents are blue, and static agents are grey. Lanes are black lines and road markings are white lines.
  • ...and 13 more figures