TAC: Timestamped Audio Captioning

Sonal Kumar; Prem Seetharaman; Ke Chen; Oriol Nieto; Jiaqi Su; Zhepei Wang; Rithesh Kumar; Dinesh Manocha; Nicholas J. Bryan; Zeyu Jin; Justin Salamon

TL;DR

TAC tackles the brittleness of large audio-language models in complex acoustic scenes by producing timestamped, dense descriptions that ground events in time. It couples a Dynamic Acoustic Mixer–driven synthetic curriculum with multitask prompts and a LoRA-tuned backbone to achieve state-of-the-art dense captioning and robust event grounding, further extended by TAC-V for audio-visual descriptions. A Describe-Then-Reason cascade allows TAC(-V) outputs to serve as semantic bridges for text-only LLMs, delivering leading audio and audio-visual reasoning scores across multiple benchmarks. The approach demonstrates scalable, interpretable reasoning grounded in temporally precise audio captions, while acknowledging sim-to-real gaps and outlining future domain adaptation and broader multimodal extensions.

Abstract

Large Audio Language Models struggle to disentangle overlapping events in complex acoustic scenes, yielding temporally inconsistent captions and frequent hallucinations. We introduce Timestamped Audio Captioner (TAC), a model that produces temporally grounded audio descriptions at varying degrees of detail and resolution. TAC is trained with a synthetic data pipeline that constructs challenging and dynamic mixtures from real-world audio sources, enabling robust learning under realistic polyphonic conditions. Across event detection and dense captioning, TAC outperforms all competing methods, with a low hallucination rate and accurate temporal grounding. We also introduce TAC-V, an audio-visual pipeline to generate semantically rich audio-visual descriptions. We then show that TAC and TAC-V serve as a "semantic bridge" for a text-only reasoner: a simple TAC$\rightarrow$LLM and TAC-V$\rightarrow$LLM cascade achieves state-of-the-art scores on benchmarks for both audio (MMAU-Pro, MMSU, MMAR) and audio-visual (DailyOmni, VideoHolmes) understanding and reasoning, respectively.
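The TAC$\rightarrow$LLM cascade described above can be pictured as a purely textual hand-off: timestamped descriptions are serialized and placed in front of a question for a text-only reasoner. The following is a minimal sketch under stated assumptions, not the authors' implementation. The `TimedEvent` type, the `[start - end]` line format, the prompt wording, and both function names are hypothetical, and no actual LLM call is made.

```python
from dataclasses import dataclass


@dataclass
class TimedEvent:
    """Hypothetical container for one timestamped caption segment."""
    start: float  # onset in seconds
    end: float    # offset in seconds
    description: str


def format_caption(events):
    """Serialize events as one line each, ordered by onset time."""
    ordered = sorted(events, key=lambda e: (e.start, e.end))
    return "\n".join(
        f"[{e.start:.1f}s - {e.end:.1f}s] {e.description}" for e in ordered
    )


def build_reasoning_prompt(events, question):
    """Assemble a text-only prompt: description first, then the question.

    The wording is an assumption for illustration; the paper's prompts
    are not reproduced here.
    """
    return (
        "Audio description (timestamped):\n"
        f"{format_caption(events)}\n\n"
        f"Question: {question}\n"
        "Answer using only the description above."
    )


if __name__ == "__main__":
    events = [
        TimedEvent(2.0, 5.5, "water boiling"),
        TimedEvent(0.0, 1.2, "a can is opened"),
    ]
    print(build_reasoning_prompt(events, "What is being prepared?"))
```

Because the reasoner only ever sees text, the same prompt builder could in principle be reused for audio-visual descriptions; that reuse is likewise an assumption of this sketch.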

Paper Structure

This paper contains 22 sections, 2 equations, 19 figures, 4 tables, and 1 algorithm.

Introduction
Related Work
Methodology
Dynamic Acoustic Mixer
Multitask prompts and output format
TAC Architecture and Training
TAC-V: TAC with Visuals
Evaluation
Experiments
Dense Captioning
Describe-Then-Reason
Audio Understanding & Reasoning
Audiovisual Understanding & Reasoning
Conclusion, Limitations, and Future Work
Appendix
...and 7 more sections

Figures (19)

Figure 1: Given only audio, TAC generates structured, timestamped descriptions of overlapping sound events. We visualize the timestamps produced by TAC as temporal lanes above. Colors indicate correspondence between text and temporal lanes.
Figure 2: The TAC Training Pipeline. Stage 1 synthesizes complex audio mixtures via our Dynamic Acoustic Mixer. In Stage 2, a Style Controller stochastically samples "description styles" (Keyword vs. Brief vs. Detailed) and timing resolutions, generating a diverse curriculum of instruction-tuned prompts.
Figure 3: An example of a synthetically generated training pair. Note how the "Reasoning Header" ("3 events total...") is algorithmically derived from the composition metadata, teaching the model to summarize before detailing.
Figure 4: An example output from our cascaded Audio-Visual pipeline. Note the integration of visual details ("metallic studio logo", "furrowed brow") with precise audio events, and the inclusion of FLAM confidence scores (e.g., $0.99$) alongside aligned transcriptions.
Figure 5: MMAU-Pro Example. The model combines distinct acoustic events (opening a can, boiling water) to deduce a specific recipe.
...and 14 more figures
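Figure 3 notes that the "Reasoning Header" is derived algorithmically from the composition metadata of each synthetic mixture. A rough sketch of how such a header might be computed is given below. Only the "N events total" prefix comes from the caption; the `(label, start, end)` tuple layout, the overlap count, and the function name `reasoning_header` are assumptions made for illustration.

```python
def reasoning_header(events):
    """Build a short summary line from mixture metadata.

    `events` is a list of (label, start_seconds, end_seconds) tuples.
    This layout is hypothetical, not the paper's metadata schema.
    """
    n = len(events)
    # Two events overlap when each one starts before the other ends.
    overlaps = sum(
        1
        for i in range(n)
        for j in range(i + 1, n)
        if events[i][1] < events[j][2] and events[j][1] < events[i][2]
    )
    noun = "event" if n == 1 else "events"
    return f"{n} {noun} total, {overlaps} overlapping pair(s)"


if __name__ == "__main__":
    mixture = [
        ("dog bark", 0.0, 2.0),
        ("rain", 1.0, 8.0),
        ("door slam", 5.0, 5.5),
    ]
    print(reasoning_header(mixture))
```

Since the mixer knows exactly which sources it placed and when, a header like this can be generated without any manual annotation, which is the property the caption highlights.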
