Mechanistic Interpretability of Binary and Ternary Transformers
Jason Li
TL;DR
This paper investigates whether binarized and ternarized transformer networks offer interpretability advantages over full-precision models. It applies mechanistic interpretability to the toy problem of modular addition: it reverse-engineers the learned algorithms, compares their Fourier (clock-like) representations with those of full-precision networks, and examines grokking dynamics. The study is the first to apply mechanistic interpretability to binary and ternary transformers, and it shows that these discretized models tend to learn algorithms similar to those of full-precision models (with some added noise) rather than simpler, more interpretable strategies. The findings suggest that discretization alone does not yield more interpretable algorithms in this setting. This motivates future work on other tasks, optimization techniques, and fully binarized or ternarized architectures to better assess any interpretability benefits.
Abstract
Recent research (arXiv:2310.11453, arXiv:2402.17764) has proposed binary and ternary transformer networks as a way to significantly reduce memory usage and improve inference speed in Large Language Models (LLMs) while maintaining accuracy. In this work, we apply techniques from mechanistic interpretability to investigate whether such networks learn algorithms that are distinctly different from, or similar to, those learned by full-precision transformer networks. In particular, we reverse engineer the algorithms learned for the toy problem of modular addition and find that binary and ternary networks learn algorithms similar to those of full-precision networks. This provides evidence against the possibility of using binary and ternary networks as a more interpretable alternative in the LLM setting.
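To make the two ingredients of the abstract concrete, the following is a minimal sketch and not the paper's implementation. It assumes ternary weights are obtained by absmean scaling followed by rounding and clipping, in the style of arXiv:2402.17764, and it illustrates the Fourier (clock-like) computation for modular addition with hand-picked frequencies. The function names, the modulus p = 7, and the frequency set are hypothetical choices made only for illustration.

```python
import numpy as np

def ternarize(w, eps=1e-8):
    """Map a weight tensor to {-1, 0, +1} plus a per-tensor scale.

    Assumption: absmean scaling, then round and clip, in the style of
    arXiv:2402.17764. The paper's exact quantizer may differ.
    """
    scale = np.mean(np.abs(w)) + eps
    q = np.clip(np.round(w / scale), -1, 1)
    return q, scale

def clock_logits(a, b, p, freqs):
    """Score every candidate c in 0..p-1 by sum_k cos(2*pi*k*(a + b - c) / p).

    Each cosine equals 1 exactly when c == (a + b) mod p, so the argmax
    recovers the modular sum. `freqs` is a hypothetical set of key frequencies.
    """
    c = np.arange(p)
    angles = 2 * np.pi * np.outer(freqs, a + b - c) / p  # shape (len(freqs), p)
    return np.cos(angles).sum(axis=0)

if __name__ == "__main__":
    w = np.array([[0.9, -0.05, -1.2], [0.1, 0.0, 2.0]])
    q, scale = ternarize(w)
    print(q)

    p, freqs = 7, [1, 2, 3]  # hypothetical small prime modulus and frequencies
    correct = all(
        int(np.argmax(clock_logits(a, b, p, freqs))) == (a + b) % p
        for a in range(p)
        for b in range(p)
    )
    print(correct)
```

In this sketch the quantizer and the clock computation are independent. Whether a network restricted to ternary weights arrives at something like `clock_logits` is the empirical question the paper addresses.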
