Comparing Natural and Synthetic Structured Data: A Study of the Passive Verb Alternation in French and Italian

Giuseppe Samo, Paola Merlo

Abstract

This study compares the impact of natural and synthetic data on training and evaluating large language models (LLMs), using the case of the passive verb alternation in French and Italian. We use Blackbird Language Matrices (BLMs), structured datasets designed to probe linguistic knowledge of underlying patterns across sentence sets. We compare structured templates instantiated with natural sentences extracted from Universal Dependencies to structured templates of synthetic sentences. Experiments show that while models achieve ceiling performance when trained and tested on synthetic datasets, they do not reliably generalize to natural sentences. In contrast, models trained on natural data exhibit robust performance across both natural and synthetic test suites, demonstrating their superior ability to capture abstract linguistic patterns. These results corroborate the value of natural data and of structured setups in linguistic evaluation for probing LLMs' syntactic and semantic knowledge.

Paper Structure

This paper contains 15 sections, 8 figures, 1 table.

Figures (8)

  • Figure 1: Example of synthetic data for the transitive and intransitive forms of the Italian verb cantare 'sing', generated by DeepSeek (DeepSeek-AI, accessed Feb 13, 2026).
  • Figure 2: BLM template structure instantiated with a synthetic example in English, generated with DeepSeek V.3 (see the section on synthetic data). Ag = Agent, Th = Theme, Vact = verb in active voice, Vpass = verb in passive voice; elements in red are interrogative markers. Number of arguments (1, 2) and sentence type (Q = question, D = declarative).
  • Figure 3: Natural and synthetic examples in French, with glosses and ID numbers (natural data) for reference within the explored treebanks. The core elements of the BLM template are colour-coded. Correct answer in bold.
  • Figure 4: Natural and synthetic examples in Italian, with glosses and ID numbers (natural data) for reference within the explored treebanks. The core elements of the BLM template are colour-coded. Correct answer in bold.
  • Figure 5: F1 scores across training and test suites in monolingual models. The grey dotted line indicates chance level.
  • ...and 3 more figures