CAF-Score: Calibrating CLAP with LALMs for Reference-free Audio Captioning Evaluation

Insung Lee; Taeyoung Jeong; Haejun Yoo; Du-Seong Chang; Myoung-Wan Koo

Abstract

While Large Audio-Language Models (LALMs) have advanced audio captioning, robust evaluation remains difficult. Reference-based metrics are expensive and often fail to assess acoustic fidelity, while Contrastive Language-Audio Pretraining (CLAP)-based approaches frequently overlook syntactic errors and fine-grained details. We propose CAF-Score, a reference-free metric that calibrates CLAP's coarse-grained semantic alignment with the fine-grained comprehension and syntactic awareness of LALMs. By combining contrastive audio-text embeddings with LALM reasoning, CAF-Score effectively detects syntactic inconsistencies and subtle hallucinations. Experiments on the BRACE benchmark demonstrate that our approach achieves the highest correlation with human judgments, even outperforming reference-based baselines in challenging scenarios. These results highlight the efficacy of CAF-Score for reference-free audio captioning evaluation. Code and results are available at https://github.com/inseong00/CAF-Score.

Paper Structure (29 sections, 5 equations, 9 figures, 4 tables)

Introduction
Related Works
Contrastive Audio-Text Alignment
Generative Evaluation with LALMs
Benchmarks for Audio Captioning Evaluation
Methodology
CLAP-based Alignment (S-CLAPScore)
LALM-based Evaluation (FLEUR)
CAF-Score (CLAP-aligned FLEUR score)
Experiments
Dataset
Experimental Setup
Baseline Models
Implementation Details
Evaluation Protocol
...and 14 more sections

Figures (9)

Figure 1: Overview of audio captioning evaluation metrics. Traditional metrics (top left) depend on ground-truth reference captions. Although CLAPScore enables reference-free evaluation, it lacks fine-grained semantic understanding and frequently overlooks syntactic errors. Inspired by FLEUR, a vision-domain approach, the proposed CAF-Score addresses these limitations by integrating the coarse-grained semantic alignment of CLAP with the fine-grained semantic reasoning and syntactic awareness of LALMs, resulting in stronger alignment with human preference judgments.
Figure 2: Overall architecture of CAF-Score. The framework comprises two parallel branches. The CLAP-based coarse-grained semantic alignment branch applies a sliding-window strategy to the input audio and computes the cosine similarity between each window and the candidate caption using the CLAP encoders; max pooling then selects the most salient segment score (S-CLAPScore). The LALM-based evaluation branch assesses caption fidelity using an LALM. Rather than relying on discrete text generation, it computes a FLEUR score from token probability distributions to capture fine-grained semantic and syntactic information. The final CAF-Score is obtained as a weighted combination of the two metrics. Notably, the framework operates entirely at inference time, using frozen pre-trained backbones without additional training or fine-tuning.
Figure 3: Performance variation of CAF-Score across different weighting parameters $\alpha$ on BRACE-Main.
Figure 4: Tie rates in raw scores for AudioFlamingo3 (AF3) and Qwen3-Omni models.
Figure 5: LALM-driven Correction on TIKTOK_1.wav.
...and 4 more figures
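
The two branches described in the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `embed_audio` stands in for a frozen CLAP audio encoder, `caption_emb` for the CLAP text embedding of the candidate caption, and `expected_score` assumes the LALM emits a single rating digit from 0 to 9, which simplifies the FLEUR formulation. All function names are hypothetical, and the choice of which branch `alpha` weights is also an assumption, since the captions only say that the final score is a weighted combination.

```python
import math
from typing import Callable, Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def sliding_windows(samples: Sequence[float], window: int, hop: int) -> List[Sequence[float]]:
    """Split audio into fixed-length windows; a trailing partial window is dropped."""
    if len(samples) <= window:
        return [samples]
    return [samples[s:s + window] for s in range(0, len(samples) - window + 1, hop)]


def s_clap_score(
    samples: Sequence[float],
    caption_emb: Sequence[float],
    embed_audio: Callable[[Sequence[float]], Sequence[float]],  # hypothetical encoder
    window: int,
    hop: int,
) -> float:
    """Max-pool the window-level cosine similarities to keep the most salient segment."""
    return max(
        cosine(embed_audio(w), caption_emb)
        for w in sliding_windows(samples, window, hop)
    )


def expected_score(digit_probs: Dict[int, float]) -> float:
    """Probability-weighted rating in [0, 1] from a distribution over digits 0..9.

    Assumption: a single rating digit; this simplifies the FLEUR score.
    """
    total = sum(digit_probs.values())
    return sum(d * p for d, p in digit_probs.items()) / (9 * total)


def caf_score(s_clap: float, fleur: float, alpha: float) -> float:
    """Weighted combination; here alpha weights the CLAP branch (an assumption)."""
    return alpha * s_clap + (1.0 - alpha) * fleur
```

With a real encoder, `embed_audio` would wrap the CLAP audio tower and `digit_probs` would come from the LALM's next-token distribution over the rating tokens. Figure 3 suggests that the weighting parameter is swept empirically rather than fixed in advance.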
