On Linearizing Structured Data in Encoder-Decoder Language Models: Insights from Text-to-SQL
Yutong Shao, Ndapa Nakashole
TL;DR
This work examines how encoder-decoder LMs process linearization-based structured data representations (SDR) in text-to-SQL, using a prefix-tuned T5 model on the Spider dataset. Through probing and causal tracing, the authors show that linearized inputs preserve crucial low-level textual information and that node relationships are encoded in an ego-centric manner: the encoding of each structure node is largely dedicated to that node itself. They reveal redundancy (duplicative robustness) between encoder self-attention and decoder cross-attention for modality fusion, and they identify a pipeline-like internal process that mirrors schema linking, syntax prediction, and node selection. The findings suggest opportunities for model compression and for more informed design of SDR systems, and they give a deeper mechanistic account of how linearization-based approaches handle inherently non-linear structured data. Overall, the study advances interpretability for SDR in encoder-decoder LMs and offers guidance for future research and optimization.
Abstract
Structured data, prevalent in tables, databases, and knowledge graphs, is challenging to represent. With the advent of large language models (LLMs), there has been a shift towards linearization-based methods, which process structured data as sequential token streams, in contrast to approaches that model structure explicitly, often as a graph. Crucially, there remains a gap in our understanding of how these linearization-based methods handle structured data, which is inherently non-linear. This work investigates the linear handling of structured data in encoder-decoder language models, specifically T5. Our findings reveal the model's ability to mimic human-designed processes such as schema linking and syntax prediction, indicating that it learns meaningful structure beyond simple token sequencing. We also uncover insights into the model's internal mechanisms, including the ego-centric nature of structure node encodings and the potential for model compression due to redundancy in modality fusion. Overall, this work sheds light on the inner workings of linearization-based methods and may provide guidance for future research.
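
To make the notion of linearization concrete, the following is a minimal sketch of how a question and a database schema might be flattened into a single token stream for an encoder-decoder model. The function name `linearize_schema`, the ` | ` and ` : ` separators, and the toy schema are illustrative assumptions; they are not taken from this paper and may differ from the serialization the authors actually use.

```python
from typing import Dict, List


def linearize_schema(question: str, db_id: str, schema: Dict[str, List[str]]) -> str:
    """Flatten a question and a database schema into one input string.

    Hypothetical format (an assumption, not the paper's exact serialization):
        question | db_id | table : col , col | table : col , col
    """
    # Each table becomes "table : col , col , ..." in insertion order.
    tables = [f"{table} : {' , '.join(columns)}" for table, columns in schema.items()]
    return f"{question} | {db_id} | " + " | ".join(tables)


# Toy schema invented for illustration only.
schema = {
    "singer": ["singer_id", "name", "age"],
    "concert": ["concert_id", "singer_id", "year"],
}
text = linearize_schema("How many singers are there?", "concert_singer", schema)
print(text)
```

In such a string, the graph structure (which column belongs to which table, which columns act as keys) is only implicit in token order and separators, which is why it is not obvious how the model recovers it.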
