Lost in Speech: Benchmarking, Evaluation, and Parsing of Spoken Code-Switching Beyond Standard UD Assumptions

Nemika Tyagi; Holly Hendrix; Nelvin Licona-Guevara; Justin Mackie; Phanos Kareen; Muhammad Imran; Megan Michelle Smith; Tatiana Gallego Hernande; Chitta Baral; Olga Kellert

TL;DR

This work addresses the fundamental mismatch between written-language assumptions and spoken code-switching (CSW) in syntactic parsing. It introduces a systems-oriented approach that decouples spoken-language phenomena from core syntax, grounded in an empirical taxonomy of spoken CSW phenomena. The authors present SpokeBench, a linguistically informed gold benchmark, and FLEX-UD, an ambiguity-aware evaluation metric, to expose the limitations of standard Universal Dependencies (UD) metrics. Their DECAP framework comprises specialized agents for phenomena handling, language-specific normalization, core UD assignment, and global verification. It yields robust, interpretable parses without retraining and reports improvements of up to 52.6% over existing parsing techniques. The findings argue for evaluation practices and annotation schemes that accommodate structural uncertainty in spoken CSW, with practical implications for parsing in bilingual contexts and beyond UD’s canonical assumptions.

Abstract

Spoken code-switching (CSW) challenges syntactic parsing in ways not observed in written text. Disfluencies, repetition, ellipsis, and discourse-driven structure routinely violate standard Universal Dependencies (UD) assumptions, causing parsers and large language models (LLMs) to fail despite strong performance on written data. These failures are compounded by rigid evaluation metrics that conflate genuine structural errors with acceptable variation. In this work, we present a systems-oriented approach to spoken CSW parsing. We introduce a linguistically grounded taxonomy of spoken CSW phenomena and SpokeBench, an expert-annotated gold benchmark designed to test spoken-language structure beyond standard UD assumptions. We further propose FLEX-UD, an ambiguity-aware evaluation metric, which reveals that existing parsing techniques perform poorly on spoken CSW by penalizing linguistically plausible analyses as errors. We then propose DECAP, a decoupled agentic parsing framework that isolates spoken-phenomena handling from core syntactic analysis. Experiments show that DECAP produces more robust and interpretable parses without retraining and achieves up to 52.6% improvements over existing parsing techniques. FLEX-UD evaluations further reveal qualitative improvements that are masked by standard metrics.
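
To make the point about rigid metrics concrete, the following is a minimal sketch of the standard unlabeled and labeled attachment scores (UAS and LAS) for a single sentence. It is not the FLEX-UD metric, whose definition is not given on this page. The function name `attachment_scores`, the `Arc` type, and the toy utterance are assumptions made for illustration and are not taken from SpokeBench.

```python
from typing import List, Tuple

# One token's analysis: (head index, dependency relation). Head 0 marks the root.
Arc = Tuple[int, str]


def attachment_scores(gold: List[Arc], pred: List[Arc]) -> Tuple[float, float]:
    """Return (UAS, LAS) for one sentence, assuming identical tokenization."""
    if len(gold) != len(pred):
        raise ValueError("gold and predicted parses must have the same length")
    if not gold:
        return 0.0, 0.0
    # UAS counts tokens whose head matches; LAS also requires the same label.
    head_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    full_hits = sum(g == p for g, p in zip(gold, pred))
    return head_hits / len(gold), full_hits / len(gold)


# Hypothetical toy utterance "uh I buy it" (illustrative only).
gold = [(3, "discourse"), (3, "nsubj"), (0, "root"), (3, "obj")]
# The prediction differs only in where the filler "uh" attaches.
pred = [(2, "discourse"), (3, "nsubj"), (0, "root"), (3, "obj")]

uas, las = attachment_scores(gold, pred)
print(uas, las)
```

Under exact-match scoring, the differing filler attachment counts as a full error for that token, whether or not an annotator would accept it. This is the kind of case an ambiguity-aware metric is meant to treat differently.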

Paper Structure (48 sections, 2 equations, 4 figures, 6 tables, 1 algorithm)

Introduction
Related Work
UD Parsing for Non-Canonical and Code-Switched Text.
LLMs for Annotation and Parsing.
Evaluation Limitations.
Spoken Code-Switching: A Non-Canonical Parsing Domain
Taxonomy of Spoken Code-Switching Phenomena
Category Definitions and Corpus Distribution
SpokeBench: A Gold Benchmark for Spoken CSW Parsing
Subset Selection Strategy.
Annotation Protocol and Adjudication.
Methodology
DECAP: A Decoupled Agentic Framework for Spoken CSW Parsing
Spoken-Phenomena Handler (SPH).
Language-Specific Resolver (LSR).
...and 33 more sections

Figures (4)

Figure 1: Illustration of the modality gap motivating this work. Parsers and LLMs typically produce well-formed dependency analyses for written text (left), but the same systems often misparse spoken code-switched utterances (right) due to disfluencies, repetition, and discourse phenomena that violate written-text assumptions.
Figure 2: Overview of the DECAP framework for spoken code-switching parsing, illustrated with a running example. The input utterance (“Entonces then I won’t I won’t buy anything uh”) passes through four stages. The Spoken-Phenomena Handler (SPH) detects disfluencies (e.g., repetition, discourse markers, fillers) and provides tokenization hints. The Language-Specific Resolver (LSR) applies conservative, language-aware normalization and contraction splitting. The Core UD Assigner constructs a dependency parse under these constraints, preserving reparanda and enforcing a single root. Finally, the Verifier and Ranker (V/R) enforces global UD validity and outputs a final annotation with confidence and penalty scores, demonstrating incremental resolution of ambiguity.
Figure 3: UPOS-LAS by category and parser. DECAP (the GPT-4.1 agent) performs best across all categories.
Figure 4: LAS by category and parser. DECAP (the GPT-4.1 agent) performs best across all categories.
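
The decoupled design described in the Figure 2 caption can be sketched as a chain of stages that pass along a shared state. This is a structural illustration only: the paper's stages are LLM agents, whereas every stage body below is a rule-based placeholder. `ParseState`, `FILLERS`, `run_pipeline`, and the function names are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

FILLERS = {"uh", "um"}  # assumption: a tiny illustrative filler list


@dataclass
class ParseState:
    tokens: List[str]
    notes: Dict[str, List[int]] = field(default_factory=dict)
    heads: List[int] = field(default_factory=list)  # 1-based heads, 0 = root
    valid: bool = False


def spoken_phenomena_handler(state: ParseState) -> ParseState:
    # Placeholder: flag filler tokens by index.
    state.notes["fillers"] = [
        i for i, tok in enumerate(state.tokens) if tok.lower() in FILLERS
    ]
    return state


def language_specific_resolver(state: ParseState) -> ParseState:
    # Placeholder: flag tokens that look like contractions (not split here).
    state.notes["contractions"] = [
        i for i, tok in enumerate(state.tokens) if "'" in tok or "’" in tok
    ]
    return state


def core_ud_assigner(state: ParseState) -> ParseState:
    # Placeholder, NOT a real parser: the first non-filler token becomes
    # the root and every other token attaches to it.
    fillers = set(state.notes.get("fillers", []))
    root = next((i for i in range(len(state.tokens)) if i not in fillers), 0)
    state.heads = [0 if i == root else root + 1 for i in range(len(state.tokens))]
    return state


def verifier(state: ParseState) -> ParseState:
    # Global check named in the caption: exactly one root.
    state.valid = state.heads.count(0) == 1
    return state


STAGES: List[Callable[[ParseState], ParseState]] = [
    spoken_phenomena_handler,
    language_specific_resolver,
    core_ud_assigner,
    verifier,
]


def run_pipeline(utterance: str) -> ParseState:
    state = ParseState(tokens=utterance.split())
    for stage in STAGES:
        state = stage(state)
    return state


state = run_pipeline("Entonces then I won’t I won’t buy anything uh")
print(state.notes, state.heads, state.valid)
```

Because each stage only reads and writes the shared state, a stage can be swapped out (for example, replacing a placeholder with a model call) without touching the others. That separation is the property the decoupling is meant to provide.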
