TimeStampEval: A Simple LLM Eval and a Little Fuzzy Matching Trick to Improve Search Accuracy

James McCammon

TL;DR

TimeStampEval investigates how to reliably map quotes to precise timestamps in long transcripts when the quote matches the transcript semantically but not verbatim. The authors propose a simple two-stage approach: (1) a fast fuzzy pre-filter to narrow candidates, and (2) targeted LLM verification on short snippets, supported by a production-friendly prompt design (Text First Top: sentence text before the timestamps, query before the transcript). Key findings: prompt and layout choices drive most of the gains; off-by-one errors form a distinct failure mode that the right format can mitigate; and a modest thinking budget can dramatically boost accuracy for weaker prompts. A hybrid fuzzy-LLM pipeline also sharply reduces latency and cost per correct result while delivering near-ceiling accuracy, making timestamp retrieval feasible at production scale. The work gives practitioners concrete guidelines for fast, cheap, and reliable timestamp retrieval in production media pipelines: avoid verbatim-only matching, compress input with the Text First Top (TFT) format, use a fast fuzzy pre-filter, and apply LLM verification only to small snippets.

Abstract

Traditional fuzzy matching often fails when searching for quotes that are semantically identical but syntactically different across documents, a common issue when aligning official written records with speech-to-text transcripts. We introduce TimeStampEval, a benchmark for retrieving precise millisecond timestamps from long transcripts given non-verbatim quotes. Our simple two-stage method dramatically improves retrieval accuracy while cutting inference costs by over 90%. The motivating use case is an automated long-form podcast that assembles Congressional Record clips into AI-hosted narration. The technical challenge: given a sentence-timestamped transcript and a target quote that may differ due to transcription or editorial drift, return exact start and end boundaries. Standard algorithms handle verbatim text but break under fuzzier variants. Evaluating six modern LLMs on a 2,800-sentence (120k-token) transcript revealed four key findings. (1) Prompt design matters more than model choice: placing the query before the transcript and using compact formatting improved accuracy by 3-20 points while reducing token count by 30-40%. (2) Off-by-one errors form a distinct category, showing that models understand the task but misplace boundaries. (3) A modest reasoning budget (600-850 tokens) raises accuracy from 37% to 77% for weak setups and to above 90% for strong ones. (4) Our "Assisted Fuzzy" approach (RapidFuzz pre-filtering followed by LLM verification on short snippets) improves fuzzy match accuracy by up to 50 points while halving latency and reducing cost per correct result by up to 96%. Extended tests on ten transcripts (50k-900k tokens, 1989-2025) confirm robustness to transcript length, vocabulary drift, and domain change, maintaining 95-100% rejection accuracy for absent targets.
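
The two-stage idea in finding (4) can be illustrated with a minimal sketch. This is not the authors' implementation: it uses the standard-library `difflib` as a stand-in for RapidFuzz so the example runs without extra dependencies, and the `Sentence` layout, the `prefilter` and `locate` helpers, the `toy_verify` function (a placeholder for the LLM verification call), and the threshold values are all hypothetical.

```python
from difflib import SequenceMatcher
from typing import Callable, List, Optional, Tuple

# Assumed layout for one transcript sentence: (text, start_ms, end_ms).
Sentence = Tuple[str, int, int]
Verifier = Callable[[str, List[Sentence]], Optional[int]]


def similarity(a: str, b: str) -> float:
    """0-100 character-level similarity (difflib stand-in for a RapidFuzz scorer)."""
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def prefilter(quote: str, sentences: List[Sentence],
              top_k: int = 3, context: int = 1) -> List[Tuple[int, int]]:
    """Stage 1: rank sentences by fuzzy score and return short index windows."""
    ranked = sorted(range(len(sentences)),
                    key=lambda i: similarity(quote, sentences[i][0]),
                    reverse=True)
    windows = []
    for i in ranked[:top_k]:
        lo = max(0, i - context)
        hi = min(len(sentences), i + context + 1)
        windows.append((lo, hi))
    return windows


def locate(quote: str, sentences: List[Sentence], verify: Verifier,
           top_k: int = 3, context: int = 1) -> Optional[Tuple[int, int]]:
    """Stage 2: run the verifier on each short snippet, best candidate first."""
    for lo, hi in prefilter(quote, sentences, top_k, context):
        snippet = sentences[lo:hi]
        idx = verify(quote, snippet)  # index within the snippet, or None to reject
        if idx is not None:
            _, start_ms, end_ms = snippet[idx]
            return start_ms, end_ms
    return None  # quote judged absent from the transcript


def toy_verify(quote: str, snippet: List[Sentence]) -> Optional[int]:
    """Placeholder for the LLM call: accept the best sentence above a cutoff."""
    scores = [similarity(quote, text) for text, _, _ in snippet]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best if scores[best] >= 75.0 else None


transcript = [
    ("Mr. President, I rise today to speak about the budget.", 0, 4200),
    ("We can not keep spending money that we do not have.", 4200, 8100),
    ("I yield the floor.", 8100, 9500),
]
quote = "We cannot keep spending money we don't have."
print(locate(quote, transcript, toy_verify))
```

In a real pipeline the verifier would send only the short snippet to the model, which is where the reported latency and cost savings would come from; returning `None` corresponds to rejecting an absent target.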

Paper Structure

This paper contains 108 sections, 5 equations, 9 figures, 60 tables, 2 algorithms.

Figures (9)

  • Figure 1: AI-powered podcast production workflow
  • Figure 2: Performance comparison of Google's Gemini 2.5 Flash model across 8 different prompt configurations, testing JSON vs text formats with varying query placements (Top/Bottom) and sentence locations (First/Middle/End). Results show that the text format significantly outperforms JSON, with Text First Top achieving the highest accuracy at 91% and Text End Bottom performing worst at 39%.
  • Figure 3: Performance comparison of Google's Gemini 2.5 Flash model when given different thinking token budgets. Disabling thinking hurts performance regardless of format, but the effect is mitigated in the Text First Top format, which places the query at the top and the sentence text before the timestamp markers. Once the model is allowed to think, increasing its thinking budget has no noticeable impact on performance, likely because even with larger budgets the model typically thought for 1,000 tokens or fewer.
  • Figure 4: This chart shows how successive changes to prompt structure and transcript formatting dramatically increase exact-match accuracy for Google's Gemini 2.5 Flash model on the timestamp retrieval task. Moving from the native JSON format of the Speech-to-Text provider to a text format reduces token count by 30%, but also reduces accuracy by 13 percentage points. Beginning with the weakest text format ("Text End Bottom"), each bar represents an improvement: (1) moving the query from the bottom to the top of the prompt, (2) placing the sentence text before the timestamp fields (instead of after), and (3) enabling model thinking. Each adjustment leads to a significant gain, with accuracy rising from 39% to 96%. Moving to a more powerful (but also slower and more costly) model, Gemini 2.5 Pro, further increases accuracy, but only by a few percentage points. These results highlight the outsized impact of thoughtful prompt and data formatting, well before model-level changes or tuning.
  • Figure 5: Retrieval accuracy of Gemini 2.5 Flash across transcript lengths from 100k to 900k tokens. Accuracy is reported as exact match and fuzzy match over two independent runs. Performance remains stable through 400k tokens, after which both metrics begin to degrade, with fuzzy match showing earlier volatility.
  • ...and 4 more figures
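
The layout changes described in the captions for Figures 2 to 4 (query at the top versus the bottom, sentence text before versus after the timestamp fields) can be sketched as a small prompt builder. The exact line format, the `[start-end]` timestamp notation, and the `format_line` and `build_prompt` names are assumptions for illustration only; the source does not show the authors' actual prompt text.

```python
from typing import List, Tuple

# Assumed layout for one transcript sentence: (text, start_ms, end_ms).
Sentence = Tuple[str, int, int]


def format_line(sentence: Sentence, text_first: bool = True) -> str:
    """Render one sentence as compact text, with the text before or after the timestamps."""
    text, start_ms, end_ms = sentence
    stamps = f"[{start_ms}-{end_ms}]"  # hypothetical compact timestamp notation
    return f"{text} {stamps}" if text_first else f"{stamps} {text}"


def build_prompt(quote: str, sentences: List[Sentence],
                 text_first: bool = True, query_on_top: bool = True) -> str:
    """Assemble the prompt with the query placed before or after the transcript."""
    query = f"Quote to locate: {quote}"
    body = "\n".join(format_line(s, text_first) for s in sentences)
    parts = [query, body] if query_on_top else [body, query]
    return "\n\n".join(parts)


# "Text First Top" under these assumptions: text before timestamps, query on top.
print(build_prompt("I yield the floor", [("I yield the floor.", 8100, 9500)]))
```

Switching both flags to `False` gives the analogue of the weakest text layout named in the captions ("Text End Bottom"), which makes it easy to compare layouts on the same transcript.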