Resource-Efficient Reference-Free Evaluation of Audio Captions

Rehana Mahfuz; Yinyi Guo; Erik Visser

TL;DR

This work addresses the evaluation of audio captions in resource-constrained settings by introducing lightweight, reference-free confidence metrics that can be computed during inference without large pretrained models. It studies pooling-based confidence metrics ($AM$, $GM$, $SAM$, $SGM$) with temperature scaling, alongside CLAPScore and semantic entropy, and measures how well each is calibrated with reference-based correctness measures on AudioCaps and Clotho. Key findings: pooling-based metrics align well with both traditional and model-based correctness measures, selective pooling improves alignment for reference-based metrics, and temperature scaling substantially improves calibration across datasets. The approach supports caption evaluation on edge devices and may extend to other modalities such as image captions; practical tradeoffs for deployment on limited-resource platforms are discussed.

Abstract

Existing reference-free metrics for establishing the trustworthiness of systems that automatically generate text captions for audio, images and video rely on large pretrained models, which are impractical to accommodate in resource-constrained settings. To address this, we propose metrics that elicit the model's confidence in its own generation. To assess how well these metrics can replace correctness measures that leverage reference captions, we test their calibration with those correctness measures. We discuss why some of these confidence metrics align better with certain correctness measures. Further, we provide insight into why temperature scaling of confidence metrics is effective. Our main contribution is a suite of well-calibrated lightweight confidence metrics for reference-free evaluation of captions in resource-constrained settings.
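
To make the idea of pooling-based confidence with temperature scaling concrete, the following is a minimal sketch and not the authors' implementation. It assumes that the per-token confidence is the temperature-scaled softmax probability of the token the captioner emitted, that $AM$ and $GM$ denote the arithmetic and geometric means of those probabilities, and that calibration is scored with a Brier score computed as the mean squared gap between confidence and a correctness value in [0, 1]. All function names are hypothetical, and the selective variants ($SAM$, $SGM$) are not sketched because their selection rule is not given here.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over one decoding step's logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def token_confidences(step_logits, chosen_ids, temperature=1.0):
    """Probability assigned to each emitted token (assumed confidence signal)."""
    return [
        softmax(logits, temperature)[token_id]
        for logits, token_id in zip(step_logits, chosen_ids)
    ]


def arithmetic_mean(probs):
    """AM pooling: average of the per-token probabilities."""
    return sum(probs) / len(probs)


def geometric_mean(probs):
    """GM pooling, computed in log space; assumes every probability is > 0."""
    return math.exp(sum(math.log(p) for p in probs) / len(probs))


def brier_score(confidences, correctness):
    """Mean squared gap between confidence and correctness (lower is better)."""
    pairs = list(zip(confidences, correctness))
    return sum((c - y) ** 2 for c, y in pairs) / len(pairs)


# Toy example: a three-token caption over a four-word vocabulary.
toy_logits = [[2.0, 0.5, 0.1, -1.0], [0.2, 1.5, 0.3, 0.0], [0.0, 0.1, 3.0, 0.2]]
toy_tokens = [0, 1, 2]
for temperature in (0.5, 1.0, 2.0):
    probs = token_confidences(toy_logits, toy_tokens, temperature)
    print(temperature, arithmetic_mean(probs), geometric_mean(probs))
```

Both pooled values need only the logits the captioner already produces during decoding, which is why this style of metric adds almost no cost at inference time. In this sketch the temperature would be chosen on held-out data by minimizing `brier_score` against a chosen correctness measure.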

Paper Structure (23 sections, 10 equations, 6 figures, 4 tables)

Introduction
Related Work
Evaluating quality of generated text
In the presence of reference text
In the absence of reference text
Calibration
Procedure
Pooling-based metrics
Temperature Scaling
CLAPScore
Semantic Entropy
Experiments
Results
Identifying clusters in correctness measures
Evaluating Confidence Metrics
...and 8 more sections

Figures (6)

Figure 1: Our framework of obtaining confidence metrics and correctness measures.
Figure 2: Our framework of measuring calibration of confidence metrics with correctness measures.
Figure 3: Brier scores over temperatures for the AudioCaps dataset. Each plot shows the variation of all correctness measures over temperatures for a single confidence metric.
Figure 4: Pearson correlation between correctness measures for the AudioCaps dataset.
Figure 5: Variation of distribution over temperatures of the Arithmetic Mean confidence metric for the AudioCaps dataset.
...and 1 more figure
