
Domain-Adapted Retrieval for In-Context Annotation of Pedagogical Dialogue Acts

Jinsook Lee, Kirk Vanacore, Zhuqian Zhou, Bakhtawar Ahtisham, Rene F. Kizilcec

Abstract

Automated annotation of pedagogical dialogue is a high-stakes task where LLMs often fail without sufficient domain grounding. We present a domain-adapted RAG pipeline for tutoring move annotation. Rather than fine-tuning the generative model, we adapt retrieval by fine-tuning a lightweight embedding model on tutoring corpora and indexing dialogues at the utterance level to retrieve labeled few-shot demonstrations. Evaluated across two real tutoring dialogue datasets (TalkMoves and Eedi) and three LLM backbones (GPT-5.2, Claude Sonnet 4.6, Qwen3-32b), our best configuration achieves Cohen's $\kappa$ of 0.526-0.580 on TalkMoves and 0.659-0.743 on Eedi, substantially outperforming no-retrieval baselines ($\kappa = 0.275$-$0.413$ and $0.160$-$0.410$). An ablation study reveals that utterance-level indexing, rather than embedding quality alone, is the primary driver of these gains, with top-1 label match rates improving from 39.7\% to 62.0\% on TalkMoves and 52.9\% to 73.1\% on Eedi under domain-adapted retrieval. Retrieval also corrects systematic label biases present in zero-shot prompting and yields the largest improvements for rare and context-dependent labels. These findings suggest that adapting the retrieval component alone is a practical and effective path toward expert-level pedagogical dialogue annotation while keeping the generative model frozen.

Paper Structure

This paper contains 42 sections, 2 equations, 4 figures, 20 tables.

Figures (4)

  • Figure 1: Overview of the proposed RAG-based annotation pipeline.
  • Figure 2: Normalized confusion matrices for TalkMoves test (top) and Eedi test (bottom) with Claude Sonnet 4.6. Domain-adapted retrieval (RAG_finetuned_utt, right) concentrates mass along the diagonal for all labels. Without retrieval, REA is systematically over-predicted on Eedi.
  • Figure 3: Confidence score distributions by model and retrieval condition on the combined test sets. Domain-adapted retrieval (RAG_finetuned_utt) concentrates prediction mass at higher confidence levels across all three models.
  • Figure 4: Cohen's $\kappa$ as a function of retrieval depth $k$ across three LLM backbones on TalkMoves test (top) and Eedi test (bottom). Left panels show RAG_no_finetune, right panels show RAG_finetuned_utt. Domain-adapted retrieval consistently outperforms general-purpose retrieval at every $k$ for all models.