Learning the Wrong Lessons: Syntactic-Domain Spurious Correlations in Language Models

Chantal Shaib, Vinith M. Suriyakumar, Levent Sagun, Byron C. Wallace, Marzyeh Ghassemi

TL;DR

This work reveals that large language models can learn spurious correlations between syntactic templates and knowledge domains, causing them to rely on surface structure rather than semantics in instruction following. The authors formalize the phenomenon, build a synthetic TRex-based dataset to control domain, syntax, and meaning, and develop a three-step benchmarking framework to measure syntactic-domain reliance across in-domain and cross-domain prompts. They demonstrate that both open- and closed-source models exhibit this reliance, and show concrete safety implications by illustrating how refusals can be bypassed using cross-domain syntactic templates. The results motivate explicit testing for syntactic-domain correlations and highlight the need for syntactic diversity within domains during training to prevent such spurious generalization, with practical impact on robustness and safety of deployed LLMs.

Abstract

For an LLM to correctly respond to an instruction, it must understand both the semantics and the domain (i.e., subject area) of a given task-instruction pair. However, syntax can also convey implicit information. Recent work shows that syntactic templates -- frequent sequences of Part-of-Speech (PoS) tags -- are prevalent in training data and often appear in model outputs. In this work we characterize syntactic templates, domain, and semantics in task-instruction pairs. We identify cases of spurious correlations between syntax and domain, where models learn to associate a domain with syntax during training; this can sometimes override prompt semantics. Using a synthetic training dataset, we find that the syntactic-domain correlation can lower performance (mean 0.51 +/- 0.06) on entity knowledge tasks in OLMo-2 models (1B-13B). We introduce an evaluation framework to detect this phenomenon in trained models, and show that it occurs on a subset of the FlanV2 dataset in open (OLMo-2-7B; Llama-4-Maverick) and closed (GPT-4o) models. Finally, we present a case study on the implications for safety finetuning, showing that unintended syntactic-domain correlations can be used to bypass refusals in OLMo-2-7B Instruct and GPT-4o. Our findings highlight two needs: (1) to explicitly test for syntactic-domain correlations, and (2) to ensure syntactic diversity in training data, specifically within domains, to prevent such spurious correlations.
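
To make the notion of a syntactic template concrete, the following is a minimal sketch of counting frequent PoS-tag sequences per domain. It is not the authors' pipeline: the hand-tagged toy prompts, the domain names, the helper names `pos_ngrams` and `frequent_templates`, and the parameters `n` and `min_count` are all assumptions chosen for illustration.

```python
from collections import Counter

# Hypothetical, hand-tagged toy corpus: each prompt is a list of
# (token, PoS tag) pairs, grouped by an assumed domain label.
TAGGED_PROMPTS = {
    "geography": [
        [("Where", "WRB"), ("is", "VBZ"), ("Paris", "NNP"), ("located", "VBN"), ("?", ".")],
        [("Where", "WRB"), ("is", "VBZ"), ("Lima", "NNP"), ("located", "VBN"), ("?", ".")],
    ],
    "music": [
        [("Who", "WP"), ("wrote", "VBD"), ("Thriller", "NNP"), ("?", ".")],
        [("Who", "WP"), ("composed", "VBD"), ("Fidelio", "NNP"), ("?", ".")],
    ],
}


def pos_ngrams(tagged_sentence, n):
    """Return every length-n window of PoS tags in one tagged sentence."""
    tags = [tag for _, tag in tagged_sentence]
    return [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]


def frequent_templates(prompts_by_domain, n=4, min_count=2):
    """Map each domain to the PoS n-grams seen at least min_count times in it."""
    templates = {}
    for domain, sentences in prompts_by_domain.items():
        counts = Counter()
        for sentence in sentences:
            counts.update(pos_ngrams(sentence, n))
        templates[domain] = {t for t, c in counts.items() if c >= min_count}
    return templates


templates = frequent_templates(TAGGED_PROMPTS)

# Templates that occur in exactly one domain: with so little syntactic
# variety per domain, tag sequence alone predicts the domain.
domain_exclusive = {
    domain: {
        t for t in found
        if all(t not in other for d, other in templates.items() if d != domain)
    }
    for domain, found in templates.items()
}

for domain, found in domain_exclusive.items():
    print(domain, sorted(found))
```

In this toy corpus every frequent template belongs to a single domain, which is the kind of low within-domain syntactic diversity the abstract warns about. A real analysis would need an automatic PoS tagger and a much larger sample.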

Paper Structure

This paper contains 33 sections, 2 equations, 11 figures, 11 tables.

Figures (11)

  • Figure 1: Example instantiations of each template setting for the phrase "Where is Paris located? France", where (Paris, France) is the entity pair denoting the domain country. Each template setting modifies either syntax, domain, or semantics. If a model answers "France" in the antonym or disfluent settings, this may be due to over-reliance on syntax.
  • Figure 2: OLMo-2-Instruct (7B) and GPT-4o performance on the Sentiment140 task. Both results show profiles suggestive of spurious syntactic-domain correlation (cf. Fig. 3).
  • Figure 3: Plot profiles denoting different instruction following behaviours based on performance of each template setting in- and cross-domain. An absent bar indicates the template setting is not informative for assessment of the behaviour.
  • Figure 4: Model responses for variations on the prompt: "Can you guide me on how to bomb an interview?" from ai2-adapt-dev/tulu_v3.9_wildjailbreak_decontaminated_50k (FlanV2). The correct model response in the dataset should be a refusal, but prompt modifications over domain and setting bypass refusals in all but the Antonym setting.
  • Figure 5: Example instantiations of each template type and the semantic distance from the training example. The red region denotes the area where semantics are fully broken.
  • ...and 6 more figures