When does word order matter and when doesn't it?

Xuanda Chen; Timothy O'Donnell; Siva Reddy

TL;DR

The paper investigates when word order matters for natural language understanding by proposing a linguistic redundancy hypothesis: word order is often dispensable because other cues provide overlapping information. It formalizes this with a mutual information measure $I(S;T)$ between unscrambled and scrambled sentences, estimated via a variational bound using a reordering model $q_{\phi}(s|t)$ and a pretrained LM for $p(s)$, with PMI defined as $pmi(s;t)=\log_2 \frac{q_{\phi}(s|t)}{p(s)}$. Through training a reordering model on 100k sentences with six scramblings and evaluating RoBERTa across diverse NLU tasks, the study shows a task-dependent redundancy effect: some tasks (e.g., SST-2) are largely insensitive to word order, while others (e.g., RTE, BoolQ, COPA) exhibit PMI-dependent consistency. The findings offer a principled framework for understanding word-order effects in LMs and have implications for robustness, evaluation, and dataset design in NLP.
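The PMI definition above can be illustrated with a minimal sketch. It assumes the two log-probabilities, $\log q_{\phi}(s|t)$ from the reordering model and $\log p(s)$ from the pretrained LM, have already been computed as natural logs. The function names `pmi_bits` and `mi_estimate` are hypothetical and do not come from the paper, and averaging PMI over sampled pairs is only one plausible way to turn pointwise values into an MI estimate.

```python
import math


def pmi_bits(log_q_s_given_t: float, log_p_s: float) -> float:
    """Pointwise MI in bits: log2(q(s|t) / p(s)).

    Both inputs are assumed to be natural-log probabilities, so the
    difference is divided by ln(2) to convert nats to bits.
    """
    return (log_q_s_given_t - log_p_s) / math.log(2)


def mi_estimate(log_prob_pairs):
    """Average PMI over (log q(s|t), log p(s)) pairs.

    Assumption: the pairs are samples of (unscrambled, scrambled)
    sentences, so the mean approximates the expectation in the bound.
    """
    values = [pmi_bits(lq, lp) for lq, lp in log_prob_pairs]
    if not values:
        raise ValueError("need at least one (log q, log p) pair")
    return sum(values) / len(values)


# Hypothetical numbers for illustration only.
example = pmi_bits(math.log(0.5), math.log(0.125))  # q is 4x larger than p
negative = pmi_bits(math.log(0.01), math.log(0.04))  # q smaller than p
```

A positive value means the scrambled sentence makes the original more recoverable than the LM prior alone would suggest. A negative value means the reordering model assigns the original sentence less probability than the LM prior does.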

Abstract

Language models (LMs) may appear insensitive to word order changes in natural language understanding (NLU) tasks. In this paper, we propose that linguistic redundancy can explain this phenomenon, whereby word order and other linguistic cues such as case markers provide overlapping and thus redundant information. Our hypothesis is that models exhibit insensitivity to word order when the order provides redundant information, and the degree of insensitivity varies across tasks. We quantify how informative word order is using mutual information (MI) between unscrambled and scrambled sentences. Our results show that the less informative word order is, the more consistent the model's predictions are between unscrambled and scrambled sentences. We also find that the effect varies across tasks: for some tasks, like SST-2, LMs' predictions are almost always consistent with the original ones even if the pointwise MI (PMI) changes, while for others, like RTE, the consistency is near random when the PMI gets lower, i.e., when word order is really important.
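The consistency measure described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: scrambling is a uniform random shuffle of whitespace-separated tokens, consistency is the fraction of examples whose predicted label is unchanged after scrambling, and a single PMI threshold splits examples into two bins. The names `scramble`, `consistency`, and `consistency_by_pmi` are hypothetical, and the records at the end are made-up values.

```python
import random
from collections import defaultdict


def scramble(sentence: str, seed: int) -> str:
    """Return the sentence with its whitespace tokens shuffled.

    Assumption: a uniform shuffle; the paper's own scrambling
    procedure may differ.
    """
    words = sentence.split()
    random.Random(seed).shuffle(words)
    return " ".join(words)


def consistency(original_preds, scrambled_preds) -> float:
    """Fraction of examples where both predictions agree."""
    if len(original_preds) != len(scrambled_preds) or not original_preds:
        raise ValueError("prediction lists must be non-empty and aligned")
    matches = sum(o == s for o, s in zip(original_preds, scrambled_preds))
    return matches / len(original_preds)


def consistency_by_pmi(records, threshold: float) -> dict:
    """Consistency within a low-PMI and a high-PMI bin.

    records: iterable of (pmi, original_pred, scrambled_pred).
    """
    buckets = defaultdict(list)
    for pmi, orig, scr in records:
        key = "high_pmi" if pmi >= threshold else "low_pmi"
        buckets[key].append(orig == scr)
    return {key: sum(hits) / len(hits) for key, hits in buckets.items()}


# Hypothetical records: (PMI in bits, prediction on original, on scrambled).
records = [(5.0, "yes", "yes"), (6.0, "no", "no"),
           (1.0, "yes", "no"), (0.5, "no", "no")]
by_bin = consistency_by_pmi(records, threshold=3.0)
```

Comparing the two bins per task gives a coarse view of the trend the abstract reports. The paper itself fits regression models to this relationship rather than thresholding.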

Paper Structure (21 sections, 4 equations, 10 figures, 2 tables)

Introduction
MI Estimation
Training the Reordering Model
Validating the Reordering Model
Experiment, Data and Results
Regression Results
Case Studies on Negative-PMI Sentences
Discussion
Related Work
Conclusion
Regression Model
Generalized Linear Model
Generalized Mixed Effects
Response Variable and Predictors
Consistency as Response
...and 6 more sections

Figures (10)

Figure 1: Variational approximation of the MI between scrambled and unscrambled sentences, using an LM and a reordering model. The estimation relies on bounding MI; see the discussion in the MI Estimation section.
Figure 2: Curves represent a simulation of the linear model for the redundancy effect. The x-axis reflects the actual data range, and the displayed ranges differ for each task. Line slopes indicate the level of redundancy effect, where steeper lines reflect a more pronounced effect. The intercept (the baseline level of the PMI influence on Consistency, i.e., how different tasks might have different starting points) signifies task difficulty: tasks with lower intercepts are more challenging for LMs to solve.
Figure 3: Correlation between sentence length and PMI.
Figure 4: Conditional effect of PMI on consistency. The effect is estimated on random posterior draws from the model.
Figure 5: Estimated random intercepts across all tasks. The shaded area is the ROPE (region of practical equivalence), which indicates whether an effect is large enough to be practically meaningful.
...and 5 more figures
