Knowledge-Decoupled Synergetic Learning: An MLLM based Collaborative Approach to Few-shot Multimodal Dialogue Intention Recognition

Bin Chen, Yu Zhang, Hongfei Ye, Ziyi Huang, Hongyang Chen

TL;DR

The paper tackles few-shot multimodal dialogue intention recognition in e-commerce, where jointly learning two related tasks causes a seesaw effect due to knowledge interference. It introduces Knowledge-Decoupled Synergetic Learning (KDSL), which uses a small MLLM with Monte Carlo Tree Search to decouple knowledge into an interpretable rule space, and post-trains a larger MLLM for collaborative prediction. Key contributions include identifying the interference phenomenon, a Monte Carlo Tree Search-based rule-generation pipeline, and a collaborative framework that combines a rule engine with a fine-tuned MLLM. On two Taobao datasets, KDSL improves online weighted F1 scores by $6.37\%$ and $6.28\%$ over the prior state-of-the-art method.

Abstract

Few-shot multimodal dialogue intention recognition is a critical challenge in the e-commerce domain. Previous methods have primarily enhanced model classification capabilities through post-training techniques. However, our analysis reveals that training for few-shot multimodal dialogue intention recognition involves two interconnected tasks, leading to a seesaw effect in multi-task learning. This phenomenon is attributed to knowledge interference stemming from the superposition of weight matrix updates during the training process. To address these challenges, we propose Knowledge-Decoupled Synergetic Learning (KDSL), which mitigates these issues by utilizing smaller models to transform knowledge into interpretable rules, while applying post-training to larger models. By facilitating collaboration between the large and small multimodal large language models for prediction, our approach demonstrates significant improvements. Notably, we achieve outstanding results on two real Taobao datasets, with enhancements of 6.37% and 6.28% in online weighted F1 scores compared to the state-of-the-art method, thereby validating the efficacy of our framework.
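
For readers unfamiliar with the reported metric, the following is a minimal sketch of how a weighted F1 score is conventionally computed: the per-class F1 is averaged with weights proportional to each class's share of the true labels. The function name `weighted_f1` and the label values are illustrative assumptions, and the paper's exact online evaluation protocol is not described in this section.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 scores.

    Illustrative helper (not from the paper): classes are taken from
    y_true, and each class's F1 is weighted by its share of y_true.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")

    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        # True positives: samples of this class predicted as this class.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += (n_true / total) * f1
    return score
```

Because the weights follow class frequency, frequent intents dominate the score, which is why this average is often preferred over an unweighted one when the label distribution is imbalanced.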

Paper Structure

This paper contains 11 sections, 2 figures, 2 tables.

Figures (2)

  • Figure 1: The Qwen2VL$_{7B}$-FT-Image and Qwen2VL$_{7B}$-FT-Intent models specialize in image and intent classification, respectively, using full fine-tuning. In contrast, Qwen2VL$_{7B}$-FT-joint is jointly trained for both tasks. Performance comparisons on Test Set 1 and Test Set 2 from Taobao show that single-task models outperform joint models with identical parameters, suggesting a seesaw effect in multimodal intent recognition for e-commerce applications.
  • Figure 2: We propose a collaborative pipeline integrating multimodal large and small language models. The large model, Qwen2-VL$_{7B}$, is fine-tuned on the Taobao few-shot multimodal dialogue intention dataset with data augmentation, enabling it to learn implicit patterns. The smaller model, Qwen2-VL$_{2B}$, uses Monte Carlo Tree Search to generate and collect rules, which are stored in a rule base. The fine-tuned Qwen2-VL$_{7B}$ and the rule base collaborate for prediction.
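
The Figure 2 caption states that the rule base and the fine-tuned model collaborate for prediction but does not say how their outputs are combined. The sketch below shows one possible arbitration scheme, assumed purely for illustration: a sufficiently confident matching rule decides the intent, and otherwise the model's prediction is used. The `Rule` structure, keyword matching, the `threshold` parameter, and the `model_predict` callable are all hypothetical, and image inputs are omitted for brevity.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Rule:
    """Hypothetical rule: if every keyword occurs in the dialogue, predict `intent`."""
    keywords: Tuple[str, ...]
    intent: str
    confidence: float  # assumed score attached to the rule when it was collected


def best_matching_rule(rules: Sequence[Rule], dialogue: str) -> Optional[Rule]:
    """Return the highest-confidence rule whose keywords all occur, or None."""
    text = dialogue.lower()
    matches = [r for r in rules if all(k in text for k in r.keywords)]
    return max(matches, key=lambda r: r.confidence, default=None)


def collaborative_predict(
    rules: Sequence[Rule],
    model_predict: Callable[[str], str],
    dialogue: str,
    threshold: float = 0.8,
) -> str:
    """Assumed arbitration: trust a confident rule, else fall back to the model."""
    rule = best_matching_rule(rules, dialogue)
    if rule is not None and rule.confidence >= threshold:
        return rule.intent
    # model_predict stands in for the fine-tuned large model's classifier.
    return model_predict(dialogue)
```

A rule-first design keeps decisions traceable to an interpretable rule whenever one applies; other combinations, such as passing matched rules to the model as additional context, are equally compatible with the caption.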