Soft-Prompting with Graph-of-Thought for Multi-modal Representation Learning

Juncheng Yang, Zuchao Li, Shuai Xie, Wei Yu, Shijun Li, Bo Du

TL;DR

This work addresses the limitation of linear chain-of-thought prompting in multi-modal tasks by introducing Aggregation-Graph-of-Thought (AGoT), a graph-based soft-prompting mechanism that aggregates and flows prompts across multiple reasoning views. By modeling each reasoning step as a graph with subnodes that capture diverse perspectives and using a flow controller to regulate information transfer, AGoT enhances text-image retrieval, VQA, and cross-domain generalization, while integrating visual features through a MetaNet within a CLIP-style training framework. The authors provide extensive ablations and demonstrations across 18 datasets, showing that multi-view prompt aggregation and dynamic prompt flow yield robust improvements over strong baselines such as CoOp, CoCoOp, KgCoOp, and CoT-PT. The approach holds promise for more adaptable and generalizable multi-modal reasoning, with potential applications in vision-language understanding and beyond.

Abstract

The chain-of-thought technique has been well received in multi-modal tasks. It is a step-by-step linear reasoning process that adjusts the length of the chain to improve the performance of generated prompts. However, human thought processes are predominantly non-linear, as they encompass multiple aspects simultaneously and employ dynamic adjustment and updating mechanisms. Therefore, we propose a novel Aggregation-Graph-of-Thought (AGoT) mechanism for soft-prompt tuning in multi-modal representation learning. The proposed AGoT models the human thought process not only as a chain but also treats each step as a reasoning aggregation graph, capturing the multiple aspects of thinking that single-step reasoning overlooks. This turns the entire reasoning process into prompt aggregation and prompt flow operations. Experiments show that our multi-modal model enhanced with AGoT soft-prompting achieves good results in several tasks such as text-image retrieval, visual question answering, and image recognition. In addition, we demonstrate that it has good domain generalization performance due to better reasoning.
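
The two operations named in the abstract, prompt aggregation and prompt flow, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`aggregate`, `flow`, `agot_chain`), the weighted-sum aggregation, and the fixed scalar `alpha` standing in for the flow controller are all assumptions made for illustration. Prompts are plain lists of floats rather than learned embeddings.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def aggregate(subnodes: Sequence[Vector], weights: Sequence[float]) -> Vector:
    """Hypothetical prompt aggregation: a normalised weighted sum of the
    subnode prompt vectors belonging to one reasoning step."""
    if not subnodes or len(subnodes) != len(weights):
        raise ValueError("need one weight per subnode and at least one subnode")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    dim = len(subnodes[0])
    return [
        sum((w / total) * node[i] for node, w in zip(subnodes, weights))
        for i in range(dim)
    ]


def flow(previous: Vector, aggregated: Vector, alpha: float) -> Vector:
    """Hypothetical prompt flow: blend the previous step's prompt with the
    newly aggregated one. `alpha` in [0, 1] is an assumed stand-in for the
    flow controller; here it is a fixed scalar."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [(1.0 - alpha) * p + alpha * a for p, a in zip(previous, aggregated)]


def agot_chain(
    initial: Vector,
    steps: Sequence[Tuple[Sequence[Vector], Sequence[float]]],
    alpha: float,
) -> Vector:
    """Run the chain: each step aggregates its subnodes, then the result
    flows into the running prompt."""
    prompt = initial
    for subnodes, weights in steps:
        prompt = flow(prompt, aggregate(subnodes, weights), alpha)
    return prompt


if __name__ == "__main__":
    # One reasoning step with two equally weighted subnode "views".
    step = ([[1.0, 0.0], [0.0, 1.0]], [1.0, 1.0])
    print(agot_chain([0.0, 0.0], [step], alpha=0.5))
```

In this sketch every reasoning step sees several subnode views at once, which is the contrast with a linear chain that the abstract draws. A trained model would learn the subnode prompts and weights rather than fix them by hand.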

Paper Structure (32 sections, 7 equations, 3 figures, 23 tables, 1 algorithm)


Figures (3)

  • Figure 1: Comparison between (a) ordinary step-by-step Chain-of-Thought and (b) Aggregation-Graph-of-Thought.
  • Figure 2: Illustration of the proposed AGoT. It learns a high-quality soft prompt with prompt aggregation (blue) and prompt flow operation (gray) for multi-view thinking and adaptation to complex multi-modal tasks.
  • Figure 3: Comparison of different reasoning steps on the Flickr30k dataset.