Boosting Multi-modal Keyphrase Prediction with Dynamic Chain-of-Thought in Vision-Language Models

Qihang Ma, Shengyu Li, Jie Tang, Dingkang Yang, Shaodong Chen, Yingyi Zhang, Chao Feng, Jiao Ran

TL;DR

The paper tackles multi-modal keyphrase prediction (MMKP) by using vision-language models (VLMs) to fuse text and imagery, targeting the absent and unseen keyphrase scenarios and reducing the train-test overlap that inflates benchmark scores. It studies three strategies: zero-shot and SFT baselines, Fine-tune-CoT with teacher-generated reasoning data, and a Dynamic CoT approach that adaptively injects CoT data during training. Evaluation covers the existing MMKP dataset alongside MMKP-V2 and the authors' MMKP-360k, which are intended to provide a more realistic assessment. CoT data is generated with GPT-4o to give VLMs reasoning capabilities, while a dynamic threshold $\gamma$ on the SFT loss governs when to switch to CoT supervision, balancing generalization and efficiency. Experiments show that SFT-based VLMs outperform prior SOTA by significant margins, and that Dynamic CoT further improves generalization, particularly for unseen keyphrases, with reduced inference overhead. Code is released at https://github.com/bytedance/DynamicCoT.
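The threshold idea in the summary above can be illustrated with a minimal sketch. This is not the authors' implementation: the `Sample` class, the `select_target` function, the output format, and the rule that a sample whose SFT loss exceeds `gamma` receives CoT supervision are all assumptions made for illustration. The summary states only that a threshold on the SFT loss governs the switch.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """Hypothetical training sample; field names are illustrative only."""
    prompt: str
    keyphrases: str      # direct (non-CoT) target
    cot_rationale: str   # teacher-generated reasoning text


def select_target(sample: Sample, sft_loss: float, gamma: float) -> str:
    """Pick the supervision target for one sample.

    Assumption: a loss above gamma marks the sample as hard, so the
    target includes the rationale; otherwise the model is trained to
    output the keyphrases directly.
    """
    if sft_loss > gamma:
        return f"{sample.cot_rationale}\nKeyphrases: {sample.keyphrases}"
    return f"Keyphrases: {sample.keyphrases}"


sample = Sample(
    prompt="Post text plus image",
    keyphrases="coffee; latte art",
    cot_rationale="The image shows a cup with patterned foam.",
)
hard_target = select_target(sample, sft_loss=2.0, gamma=1.0)
easy_target = select_target(sample, sft_loss=0.5, gamma=1.0)
```

Under these assumptions, only samples the model still fits poorly pay the cost of the longer reasoning target, which is one way to read the claimed balance between generalization and efficiency.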

Abstract

Multi-modal keyphrase prediction (MMKP) aims to advance beyond text-only methods by incorporating multiple modalities of input information to produce a set of conclusive phrases. Traditional multi-modal approaches have proven to have significant limitations in handling the challenging absent and unseen scenarios. Additionally, we identify shortcomings in existing benchmarks that overestimate model capability due to significant overlap between the training and test sets. In this work, we propose leveraging vision-language models (VLMs) for the MMKP task. First, we use two widely used strategies, i.e., zero-shot and supervised fine-tuning (SFT), to assess the lower-bound performance of VLMs. Next, to improve the complex reasoning capabilities of VLMs, we adopt Fine-tune-CoT, which leverages high-quality CoT reasoning data generated by a teacher model to fine-tune smaller models. Finally, to address the "overthinking" phenomenon, we propose a dynamic CoT strategy that adaptively injects CoT data during training, allowing the model to flexibly leverage its reasoning capabilities during the inference stage. We evaluate the proposed strategies on various datasets, and the experimental results demonstrate the effectiveness of the proposed approaches. The code is available at https://github.com/bytedance/DynamicCoT.

Paper Structure

This paper contains 16 sections, 5 equations, 6 figures, 8 tables.

Figures (6)

  • Figure 1: (a) An example of multi-modal keyphrase prediction. (b) The performance of different models on the MMKP dataset wang2020cross. "Absent" refers to keyphrases that are absent from the input text. "Unseen" refers to keyphrases that do not appear in the training set's ground truth. (c) The number of seen and unseen keyphrases in the test set of the MMKP dataset and our MMKP-360k dataset.
  • Figure 2: Main framework of our proposed method. (a) CoT data production pipeline. (b) Dynamic CoT training pipeline.
  • Figure 3: Visualization of multi-modal embedding clustering for posts sharing the same keyphrase (the top five most frequent keyphrases) in the MMKP dataset.
  • Figure 4: Examples of Multi-modal Keyphrase Prediction. Green denotes correct keyphrase predictions, whereas red denotes incorrect keyphrase predictions.
  • Figure 5: Visualization of SFT models on the test set of the MMKP dataset.
  • ...and 1 more figure
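
The "absent" and "unseen" categories defined in the Figure 1 caption can be made concrete with a short sketch. The function name `categorize_keyphrases` and the matching rule (case-insensitive substring match against the text, exact match against the training keyphrases) are assumptions for illustration; the paper's actual matching procedure may differ, for example by stemming or tokenizing first.

```python
def categorize_keyphrases(keyphrases, input_text, train_keyphrases):
    """Label each keyphrase as absent and/or unseen.

    absent: the keyphrase does not occur in the input text.
    unseen: the keyphrase does not occur in the training ground truth.
    Matching here is a simplifying assumption: lowercase substring
    search for "absent", lowercase exact match for "unseen".
    """
    text = input_text.lower()
    seen = {kp.lower() for kp in train_keyphrases}
    labels = {}
    for kp in keyphrases:
        key = kp.lower()
        labels[kp] = {
            "absent": key not in text,
            "unseen": key not in seen,
        }
    return labels


labels = categorize_keyphrases(
    keyphrases=["coffee", "latte art"],
    input_text="Morning coffee at the corner cafe",
    train_keyphrases=["coffee", "travel"],
)
# "coffee" appears in the text and in training; "latte art" appears in neither.
```

A keyphrase that is both absent and unseen, such as "latte art" in this toy example, can only be recovered from the image or from the model's general knowledge, which is why these two categories are the hard cases the caption singles out.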