Mitigating Object and Action Hallucinations in Multimodal LLMs via Self-Augmented Contrastive Alignment
Kai-Po Chang, Wei-Yuan Cheng, Chi-Pin Huang, Fu-En Yang, Yu-Chiang Frank Wang
TL;DR
This work tackles object and action hallucinations in multimodal video captioning by introducing SANTA, a Self-Augmented Contrastive Alignment framework. SANTA combines hallucinative self-augmentation, which generates targeted negative captions, with a tracklet-phrase contrastive mechanism that aligns regional objects and relation-guided actions with their corresponding visual and temporal phrases. The approach jointly optimizes video-level grounding and region/relationship alignment, significantly reducing hallucinations on multiple benchmarks (MiraData-9k, FactVC, VidHal) and improving Dream1k captioning and video QA. Extensive ablations demonstrate the method's robustness and the contribution of each component, supporting SANTA as a strong approach to faithful video-language grounding in MLLMs.
Abstract
Recent advances in multimodal LLMs (MLLMs) have demonstrated a remarkable capability to generate descriptive captions for input videos. However, these models often produce factual inaccuracies in the generated descriptions, resulting in severe hallucination issues. While prior works have explored alleviating hallucinations for static images, jointly mitigating visual object and temporal action hallucinations for dynamic videos remains a challenging and unsolved task. To tackle this challenge, we propose a Self-Augmented Contrastive Alignment (SANTA) framework that promotes object and action faithfulness by suppressing spurious correlations and enforcing emphasis on visual facts. SANTA employs a hallucinative self-augmentation scheme to identify potential hallucinations latent in the MLLM and to transform the original captions into contrastive negatives. Furthermore, we develop a tracklet-phrase contrastive alignment that matches regional objects and relation-guided actions with their corresponding visual and temporal phrases. Extensive experiments demonstrate that SANTA outperforms existing methods in alleviating object and action hallucinations, yielding superior performance on hallucination examination benchmarks.
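To make the idea of contrasting a faithful phrase against hallucinated negatives concrete, the following is a minimal sketch of a generic InfoNCE-style contrastive loss. It is not the loss defined in the paper: the function names (`cosine`, `contrastive_loss`), the use of cosine similarity, the temperature value, and the toy two-dimensional embeddings are all assumptions made for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(visual, positive, negatives, temperature=0.1):
    """Generic InfoNCE-style loss (an illustrative assumption, not SANTA's loss).

    visual:    embedding of a visual region or tracklet (hypothetical input)
    positive:  embedding of the faithful phrase
    negatives: embeddings of hallucinated phrases used as negatives
    """
    # Index 0 holds the positive pair; the rest are negative pairs.
    logits = [cosine(visual, positive) / temperature]
    logits += [cosine(visual, neg) / temperature for neg in negatives]
    # Numerically stable log-sum-exp over all candidates.
    m = max(logits)
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
    # Negative log-probability assigned to the positive phrase.
    return log_denom - logits[0]


# Toy embeddings, chosen by hand purely for illustration.
tracklet = [1.0, 0.0]
faithful_phrase = [0.9, 0.1]
hallucinated_phrase = [0.0, 1.0]

loss_aligned = contrastive_loss(tracklet, faithful_phrase, [hallucinated_phrase])
loss_swapped = contrastive_loss(tracklet, hallucinated_phrase, [faithful_phrase])
print(loss_aligned < loss_swapped)
```

In this sketch the loss is low when the visual embedding is closer to the faithful phrase than to the hallucinated one, and high when the roles are swapped, which is the general behavior a contrastive alignment objective relies on.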
