Resource-Efficient Reference-Free Evaluation of Audio Captions
Rehana Mahfuz, Yinyi Guo, Erik Visser
TL;DR
This work tackles the challenge of evaluating audio captions in resource-constrained settings by introducing lightweight, reference-free confidence metrics that can be computed during inference without large pretrained models. It combines pooling-based confidences ($AM$, $GM$, $SAM$, $SGM$), temperature scaling, CLAPScore, and semantic entropy, and calibrates them against reference-based correctness measures, demonstrating strong alignment across multiple metrics on AudioCaps and Clotho. Key findings are that pooling-based metrics align well with both traditional and model-based correctness measures, that selective pooling improves alignment for reference-based metrics, and that temperature scaling substantially improves calibration across datasets. The approach enables reliable caption evaluation on edge devices and shows promise for extending to other modalities such as image captioning, with practical tradeoffs discussed for deployment on limited-resource platforms.
Abstract
To establish the trustworthiness of systems that automatically generate text captions for audio, images and video, existing reference-free metrics rely on large pretrained models, which are impractical to accommodate in resource-constrained settings. To address this, we propose metrics that elicit the model's confidence in its own generation. To assess how well these metrics can replace correctness measures that leverage reference captions, we test how well they are calibrated against those correctness measures. We discuss why some of these confidence metrics align better with certain correctness measures. Further, we provide insight into why temperature scaling of confidence metrics is effective. Our main contribution is a suite of well-calibrated, lightweight confidence metrics for reference-free evaluation of captions in resource-constrained settings.
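As a rough illustration of how a pooled, temperature-scaled confidence might be computed during inference, the sketch below turns per-step logits into token probabilities and pools them into a single caption-level score. Several things here are assumptions not stated in this section: that $AM$ and $GM$ denote the arithmetic and geometric mean of token probabilities, that the selective variants ($SAM$, $SGM$) pool over a subset of tokens (represented by a hypothetical caller-supplied `keep` mask, since the selection rule is not given here), and all function names and example values.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax: divide logits by T before normalizing."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def token_confidences(step_logits, token_ids, temperature=1.0):
    """Probability assigned to each generated token at its decoding step."""
    return [
        softmax(logits, temperature)[tok]
        for logits, tok in zip(step_logits, token_ids)
    ]


def arithmetic_mean(probs):
    """Assumed reading of AM: plain average of token probabilities."""
    return sum(probs) / len(probs)


def geometric_mean(probs):
    """Assumed reading of GM: exp of the mean log-probability (needs p > 0)."""
    return math.exp(sum(math.log(p) for p in probs) / len(probs))


def select(probs, keep):
    """Hypothetical selective pooling: keep only the masked-in tokens."""
    return [p for p, k in zip(probs, keep) if k]


# Toy example: three decoding steps over a three-token vocabulary.
step_logits = [[2.0, 0.5, 0.1], [0.2, 3.0, 0.3], [1.0, 0.9, 0.8]]
token_ids = [0, 1, 0]          # tokens actually emitted by the captioner
keep = [True, True, False]     # illustrative mask for the selective variants

probs = token_confidences(step_logits, token_ids, temperature=1.0)
am = arithmetic_mean(probs)
gm = geometric_mean(probs)
sam = arithmetic_mean(select(probs, keep))
sgm = geometric_mean(select(probs, keep))

# A temperature above 1 flattens each distribution, which lowers the
# confidence assigned to tokens that were the most likely choice.
softened = token_confidences(step_logits, token_ids, temperature=2.0)
```

Everything in the sketch uses quantities already available from the captioning model's decoder, which is the property that makes this kind of metric attractive when a separate large evaluation model cannot be hosted. In practice the temperature would be fitted on held-out data rather than fixed by hand.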
