Boosting Multi-modal Keyphrase Prediction with Dynamic Chain-of-Thought in Vision-Language Models

Qihang Ma; Shengyu Li; Jie Tang; Dingkang Yang; Shaodong Chen; Yingyi Zhang; Chao Feng; Jiao Ran

TL;DR

The paper tackles multi-modal keyphrase prediction (MMKP) by using vision-language models (VLMs) to fuse text and images, targeting absent and unseen keyphrase scenarios and limiting the train-test overlap that inflates results on existing benchmarks. It introduces three strategies (zero-shot and SFT baselines, Fine-tune-CoT with teacher-generated reasoning data, and a Dynamic CoT approach that adaptively injects CoT data during training), along with new datasets MMKP, MMKP-V2, and MMKP-360k to provide more realistic evaluation. CoT data is generated with GPT-4o to give the VLMs reasoning capabilities, while a dynamic threshold $\gamma$ on the SFT loss governs when training switches to CoT supervision, balancing generalization and efficiency. Experiments show that SFT-based VLMs outperform prior SOTA by significant margins, and that Dynamic CoT further improves generalization, particularly for unseen keyphrases, with reduced inference overhead; code is released at the provided URL.
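The threshold idea can be illustrated with a minimal sketch. It assumes that a sample whose direct-answer SFT loss exceeds $\gamma$ is treated as hard and supervised with the CoT target, while easier samples keep the plain keyphrase target. The names `Sample`, `select_target`, and `build_targets` are hypothetical, the direction of the comparison is an assumption, and $\gamma$ is passed in as a fixed number because its schedule is not described here.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Sample:
    prompt: str         # text (plus image reference) shown to the model
    direct_target: str  # keyphrases only
    cot_target: str     # teacher reasoning followed by the keyphrases


def select_target(sample: Sample, sft_loss: float, gamma: float) -> Tuple[str, bool]:
    """Pick the supervision target for one sample.

    Assumption: a loss above gamma marks the sample as hard, so the
    CoT target is used; otherwise the direct target is kept.
    Returns (target, used_cot).
    """
    if sft_loss > gamma:
        return sample.cot_target, True
    return sample.direct_target, False


def build_targets(
    samples: List[Sample], losses: List[float], gamma: float
) -> Tuple[List[str], float]:
    """Select targets for a batch and report the fraction that used CoT."""
    if len(samples) != len(losses):
        raise ValueError("samples and losses must have the same length")
    targets: List[str] = []
    n_cot = 0
    for sample, loss in zip(samples, losses):
        target, used_cot = select_target(sample, loss, gamma)
        targets.append(target)
        n_cot += int(used_cot)
    cot_fraction = n_cot / len(samples) if samples else 0.0
    return targets, cot_fraction
```

In this sketch, raising `gamma` sends fewer samples down the CoT path, which shortens the average training target; lowering it does the opposite.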

Abstract

Multi-modal keyphrase prediction (MMKP) aims to advance beyond text-only methods by incorporating multiple modalities of input information to produce a set of conclusive phrases. Traditional multi-modal approaches have been shown to have significant limitations in handling the challenging absence and unseen scenarios. Additionally, we identify shortcomings in existing benchmarks that overestimate model capability due to significant overlap between the training and test sets. In this work, we propose leveraging vision-language models (VLMs) for the MMKP task. Firstly, we use two widely used strategies, namely zero-shot and supervised fine-tuning (SFT), to assess the lower-bound performance of VLMs. Next, to improve the complex reasoning capabilities of VLMs, we adopt Fine-tune-CoT, which leverages high-quality CoT reasoning data generated by a teacher model to fine-tune smaller models. Finally, to address the "overthinking" phenomenon, we propose a dynamic CoT strategy which adaptively injects CoT data during training, allowing the model to flexibly leverage its reasoning capabilities during the inference stage. We evaluate the proposed strategies on various datasets, and the experimental results demonstrate the effectiveness of the proposed approaches. The code is available at https://github.com/bytedance/DynamicCoT.
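The absence and unseen scenarios, and the train-test overlap concern, can be made concrete with a short sketch. It assumes the usual keyphrase-generation reading: a keyphrase is absent when it does not occur verbatim in the post text, and unseen when it never appears among the training labels. These definitions and the helper names `normalize`, `is_absent`, and `unseen_ratio` are assumptions for illustration, not taken from the paper.

```python
from typing import List


def normalize(phrase: str) -> str:
    """Lowercase and collapse whitespace so comparisons are case-insensitive."""
    return " ".join(phrase.lower().split())


def is_absent(phrase: str, text: str) -> bool:
    """Assumed definition: the phrase does not occur verbatim in the text.

    This is a plain substring check, so it can also match inside longer
    words; a real evaluation would likely tokenize first.
    """
    return normalize(phrase) not in normalize(text)


def unseen_ratio(train_labels: List[List[str]], test_labels: List[List[str]]) -> float:
    """Fraction of test keyphrase occurrences never seen among training labels.

    A low value means heavy train-test overlap, which lets a model score
    well by recalling frequent training labels.
    """
    seen = {normalize(p) for labels in train_labels for p in labels}
    test = [normalize(p) for labels in test_labels for p in labels]
    if not test:
        return 0.0
    return sum(p not in seen for p in test) / len(test)
```

A benchmark whose `unseen_ratio` is close to zero mostly rewards memorized labels, which is the kind of overestimation the abstract describes.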

Paper Structure

Table of Contents

Figures (6)