TimeStampEval: A Simple LLM Eval and a Little Fuzzy Matching Trick to Improve Search Accuracy
James McCammon
TL;DR
TimeStampEval investigates how to reliably map quotes to precise timestamps in long transcripts when the quote matches the transcript semantically but not verbatim. The paper proposes a simple two-stage approach: (1) a fast fuzzy pre-filter to narrow candidates, and (2) targeted LLM verification on short snippets, supported by a production-friendly prompt design (Text First Top, or TFT, with the query placed before the transcript). Key findings: prompt and layout choices drive most of the gains; off-by-one errors form a distinct failure mode that the right format can mitigate; and a modest thinking budget can dramatically boost accuracy for weaker prompts. A hybrid fuzzy-LLM pipeline also sharply reduces latency and cost per correct result while delivering near-ceiling accuracy, making timestamp retrieval feasible at production scale. The work offers concrete guidelines for practitioners (avoid verbatim-only matching, compress input with TFT, use a fast fuzzy pre-filter, and reserve LLM verification for small snippets) to achieve fast, cheap, and reliable timestamp retrieval in production media pipelines.
Abstract
Traditional fuzzy matching often fails when searching for quotes that are semantically identical but syntactically different across documents, a common issue when aligning official written records with speech-to-text transcripts. We introduce TimeStampEval, a benchmark for retrieving precise millisecond timestamps from long transcripts given non-verbatim quotes. Our simple two-stage method dramatically improves retrieval accuracy while cutting inference costs by over 90%. The motivating use case is an automated long-form podcast that assembles Congressional Record clips into AI-hosted narration. The technical challenge: given a sentence-timestamped transcript and a target quote that may differ due to transcription or editorial drift, return exact start and end boundaries. Standard algorithms handle verbatim text but break under fuzzier variants. Evaluating six modern LLMs on a 2,800-sentence (120k-token) transcript revealed four key findings. (1) Prompt design matters more than model choice: placing the query before the transcript and using compact formatting improved accuracy by 3-20 points while reducing token count by 30-40%. (2) Off-by-one errors form a distinct category, showing models understand the task but misplace boundaries. (3) A modest reasoning budget (600-850 tokens) raises accuracy from 37% to 77% for weak setups and to above 90% for strong ones. (4) Our "Assisted Fuzzy" approach (RapidFuzz pre-filtering followed by LLM verification on short snippets) improves fuzzy match accuracy by up to 50 points while halving latency and reducing cost per correct result by up to 96%. Extended tests on ten transcripts (50k-900k tokens, 1989-2025) confirm robustness to transcript length, vocabulary drift, and domain change, maintaining 95-100% rejection accuracy for absent targets.
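To make the two-stage idea concrete, here is a minimal sketch of a pre-filter followed by verification on a short snippet. It is an illustration under stated assumptions, not the paper's implementation: the standard-library `difflib.SequenceMatcher` stands in for RapidFuzz, the names `Sentence`, `prefilter`, `locate`, and `toy_verify` are hypothetical, and `toy_verify` is a word-overlap placeholder for the LLM call, which in practice would receive only the small snippet rather than the full transcript.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Callable, List, Optional, Tuple


@dataclass
class Sentence:
    start_ms: int
    end_ms: int
    text: str


# A verifier takes the quote and a short snippet and returns the (first, last)
# sentence indices inside the snippet, or None to reject the candidate.
Verifier = Callable[[str, List[Sentence]], Optional[Tuple[int, int]]]


def similarity(a: str, b: str) -> float:
    """Stand-in fuzzy score in [0, 1] (assumption: difflib instead of RapidFuzz)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def prefilter(sentences: List[Sentence], quote: str, window: int, top_k: int) -> List[int]:
    """Stage 1: score each window of consecutive sentences, keep the top_k start indices."""
    scored = []
    for i in range(len(sentences) - window + 1):
        text = " ".join(s.text for s in sentences[i:i + window])
        scored.append((similarity(quote, text), i))
    scored.sort(reverse=True)
    return [i for _, i in scored[:top_k]]


def locate(sentences: List[Sentence], quote: str, verify: Verifier,
           window: int = 1, top_k: int = 3, context: int = 1) -> Optional[Tuple[int, int]]:
    """Stage 2: verify each candidate on a small snippet and return (start_ms, end_ms)."""
    for i in prefilter(sentences, quote, window, top_k):
        lo = max(0, i - context)
        hi = min(len(sentences), i + window + context)
        span = verify(quote, sentences[lo:hi])
        if span is not None:
            first, last = span
            return sentences[lo + first].start_ms, sentences[lo + last].end_ms
    return None  # every candidate was rejected: treat the quote as absent


def toy_verify(quote: str, snippet: List[Sentence]) -> Optional[Tuple[int, int]]:
    """Placeholder for the LLM: accept the best sentence if it shares half the quote's words."""
    q = set(quote.lower().split())

    def overlap(j: int) -> int:
        return len(q & set(snippet[j].text.lower().split()))

    best = max(range(len(snippet)), key=overlap)
    return (best, best) if overlap(best) / max(len(q), 1) >= 0.5 else None


transcript = [
    Sentence(0, 4000, "Mr. Speaker, I rise today in support of the bill."),
    Sentence(4000, 9500, "This legislation is gonna help working families across the country."),
    Sentence(9500, 12000, "I yield back the balance of my time."),
]
quote = "This legislation is going to help working families across the country."

print(locate(transcript, quote, toy_verify))  # (4000, 9500)
print(locate(transcript, "The weather in Ohio was pleasant this spring.", toy_verify))  # None
```

The design point the sketch tries to capture is that the expensive verifier only ever sees a few sentences around each candidate, and that a rejection from every candidate is how an absent target is reported.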
