Unveiling Transformer Perception by Exploring Input Manifolds
Alessandro Benfenati, Alfio Ferrara, Alessio Marta, Davide Riva, Elisabetta Rocchetti
TL;DR
This work studies how Transformers perceive their inputs by treating the input space as a manifold that successive layers deform. It develops a mathematically grounded framework that uses a singular pullback metric to define equivalence classes (sets of inputs yielding the same output distribution) and introduces two algorithms, SiMEC and SiMExp, to explore within and across these classes. The method reconstructs equivalence classes and produces interpretable representations by mapping embeddings back to human-readable formats, as demonstrated on ViT and BERT models across image and text tasks. In practice, the approach offers a principled way to study model sensitivity and to generate interpretable alternative inputs within controlled equivalence classes, with potential impact on explainability and robust input analysis for large Transformer architectures.
Abstract
This paper introduces a general method for exploring equivalence classes in the input space of Transformer models. The approach is grounded in a mathematical theory that describes the internal layers of a Transformer architecture as sequential deformations of the input manifold. Using the eigendecomposition of the pullback, through the Jacobian of the model, of the distance metric defined on the output space, we reconstruct equivalence classes in the input space and navigate across them. Our method enables two complementary exploration procedures. The first retrieves input instances that produce the same class probability distribution as the original instance, thus identifying elements within the same equivalence class. The second discovers instances that yield a different class probability distribution, effectively navigating toward distinct equivalence classes. Finally, we demonstrate how the retrieved instances can be meaningfully interpreted by projecting their embeddings back into a human-readable format.
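
To make the pullback construction concrete, the following is a minimal sketch and not the paper's SiMEC or SiMExp implementation. It assumes a Euclidean metric on the output space (the abstract does not specify the metric) and replaces the Transformer with a hypothetical toy model, a linear map followed by softmax. All function and variable names are illustrative. Under these assumptions the pullback metric is G = J^T J, where J is the Jacobian of the model. Eigenvectors of G with near-zero eigenvalue point along the equivalence class, while the eigenvector with the largest eigenvalue points toward inputs with a different output distribution.

```python
import numpy as np

def model(x, W):
    """Hypothetical toy model: linear map followed by softmax."""
    z = W @ x
    z = z - z.max()          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def jacobian(x, W):
    """Analytic Jacobian of softmax(W x) with respect to x."""
    p = model(x, W)
    return (np.diag(p) - np.outer(p, p)) @ W

def pullback_metric(x, W):
    """Pullback of the (assumed) Euclidean output metric: G = J^T J."""
    J = jacobian(x, W)
    return J.T @ J

rng = np.random.default_rng(0)
W = 0.5 * rng.normal(size=(3, 5))   # 5 input dimensions, 3 classes
x = rng.normal(size=5)

G = pullback_metric(x, W)
eigvals, eigvecs = np.linalg.eigh(G)  # eigenvalues in ascending order

null_dir = eigvecs[:, 0]   # near-zero eigenvalue: stays in the class
top_dir = eigvecs[:, -1]   # largest eigenvalue: leaves the class

step = 1e-2
p0 = model(x, W)
change_null = np.linalg.norm(model(x + step * null_dir, W) - p0)
change_top = np.linalg.norm(model(x + step * top_dir, W) - p0)

print("eigenvalues:", eigvals)
print("output change along null direction:", change_null)
print("output change along top direction: ", change_top)
```

In this toy model the output is exactly invariant along the null directions, because the model is linear before the softmax. For a nonlinear model such as a Transformer the invariance holds only to first order, so an exploration procedure would presumably need to recompute the eigendecomposition after each small step.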
