On Linearizing Structured Data in Encoder-Decoder Language Models: Insights from Text-to-SQL

Yutong Shao; Ndapa Nakashole

TL;DR

This work examines how encoder-decoder LMs process linearization-based structured data representations (SDR) in text-to-SQL, using a prefix-tuned T5 model on the Spider dataset. Through probing and causal tracing, the authors show that linearized inputs preserve crucial low-level textual information and that node relationships are encoded in an ego-centric manner, with structure-node encodings largely dedicated to their own nodes. They find that encoder self-attention and decoder cross-attention are redundant for modality fusion, which makes the model robust to disruption of either one, and they identify a pipeline-like inner process that mirrors schema linking, syntax prediction, and node selection. The findings suggest opportunities for model compression and for more informed design of SDR systems, and they give a more mechanistic account of how linearization-based approaches handle inherently non-linear structured data.

Abstract

Structured data, prevalent in tables, databases, and knowledge graphs, poses a significant challenge in its representation. With the advent of large language models (LLMs), there has been a shift towards linearization-based methods, which process structured data as sequential token streams, diverging from approaches that explicitly model structure, often as a graph. Crucially, there remains a gap in our understanding of how these linearization-based methods handle structured data, which is inherently non-linear. This work investigates the linear handling of structured data in encoder-decoder language models, specifically T5. Our findings reveal the model's ability to mimic human-designed processes such as schema linking and syntax prediction, indicating a deep, meaningful learning of structure beyond simple token sequencing. We also uncover insights into the model's internal mechanisms, including the ego-centric nature of structure node encodings and the potential for model compression due to modality fusion redundancy. Overall, this work sheds light on the inner workings of linearization-based methods and could potentially provide guidance for future research.
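To make "linearization" concrete, the following is a minimal sketch of flattening a question and a database schema into a single token stream. The separator format, the `linearize` helper, and the example schema are illustrative assumptions, not the serialization used in the paper; only the table and column names `singer`, `song_name`, and `age` come from the Figure 2 example.

```python
def linearize(question, db_id, schema):
    """Flatten a question and a schema into one string.

    Assumed (hypothetical) format:
        question | db_id | table : col , col | table : col , col
    """
    parts = [question, db_id]
    for table, columns in schema.items():
        # Each table becomes one segment: its name followed by its columns.
        parts.append(f"{table} : {' , '.join(columns)}")
    return " | ".join(parts)


# Illustrative schema; the database name and the second table are made up.
schema = {
    "singer": ["singer_id", "song_name", "age"],
    "concert": ["concert_id", "year"],
}

text = linearize(
    "What are the song names of singers older than 30?",
    "example_db",
    schema,
)
print(text)
```

In a string like this, the graph structure (which column belongs to which table) survives only through token order and separators, which is the gap in understanding that the abstract points to.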

Paper Structure

This paper contains 46 sections, 7 figures, 24 tables.

Introduction
The Rise of Linearization-based Methods.
Open Problems and Our Contributions.
Related Work
Structured Data Representation for Text-to-SQL.
Model Behavior Analysis and Interpretation.
Preliminaries
Terminology.
Research Questions
Preliminary Intuition and Open Questions.
Probing Study
Probing Tasks
Node Name Reconstruction (NR).
Link Prediction (LP).
Probing Results
...and 31 more sections

Figures (7)

Figure 1: The input to the text-to-SQL parser consists of the natural-language query (blue), the relevant structured data (red), and other tokens (gray). "Self-node" refers to the input tokens corresponding to the expected output node, where a node is either a column name or a table name; "structure-context" refers to all structured input tokens excluding the self-node. The output is the predicted SQL query (top).
Figure 2: An illustrative sample showing the restoring effect of each encoder intermediate state. The decoder prompt: SELECT song_name FROM singer WHERE ==> age. Restoring the self-node hidden state at any layer can recover the correct prediction, while almost all other states do not have such an effect. More samples are available in Figure 5.
Figure 3: Error type analysis on decoder cross-attention corruption on the text or structure part.
Figure 4: Error type analysis on decoder self-attention corruption on various layer ranges.
Figure 5: Encoder state restoration effectiveness. Multi-token nodes are usually harder to recover by restoring a single state. Supplementary to Figure 2.
...and 2 more figures
