Knowledge-Decoupled Synergetic Learning: An MLLM based Collaborative Approach to Few-shot Multimodal Dialogue Intention Recognition

Bin Chen; Yu Zhang; Hongfei Ye; Ziyi Huang; Hongyang Chen

TL;DR

The paper tackles few-shot multimodal dialogue intention recognition in e-commerce, where jointly learning two related tasks produces a seesaw effect caused by knowledge interference. It introduces Knowledge-Decoupled Synergetic Learning (KDSL), which uses a small multimodal large language model (MLLM), guided by Monte Carlo Tree Search, to decouple knowledge into an interpretable rule space, post-trains a larger MLLM, and combines the two for collaborative prediction. Key contributions include identifying the interference phenomenon, a Monte Carlo Tree Search-based rule-generation pipeline, and a collaborative framework that merges a rule engine with a fine-tuned MLLM. On two Taobao datasets, KDSL improves online weighted F1 by 6.37% and 6.28% over the previous state-of-the-art method, demonstrating effective knowledge decoupling and cooperative reasoning for cross-modal, few-shot e-commerce tasks.
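The gains above are reported as weighted F1. For readers comparing such numbers, here is a minimal sketch of the standard support-weighted F1: each class's F1 is weighted by its number of true instances, the same definition as scikit-learn's `f1_score(y_true, y_pred, average="weighted")`. The intent labels are hypothetical, and the sketch does not reproduce the paper's online evaluation protocol.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores.

    Follows the usual definition (e.g. scikit-learn's average="weighted"):
    classes that never occur in y_true have zero support and thus zero weight.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("inputs must be non-empty")

    support = Counter(y_true)     # true instances per class (the weights)
    predicted = Counter(y_pred)   # predicted instances per class
    true_pos = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        # F1 = 2TP / (2TP + FP + FN) = 2TP / (n_pred + n_true); n_true >= 1 here
        f1 = 2 * true_pos[label] / (predicted[label] + n_true)
        score += f1 * n_true / total
    return score


# Hypothetical intent labels, for illustration only.
y_true = ["refund", "refund", "shipping", "size"]
y_pred = ["refund", "shipping", "shipping", "size"]
print(round(weighted_f1(y_true, y_pred), 4))  # 0.75
```

Because the weights are class supports, frequent intents dominate the score, which matters when comparing weighted-F1 gains on imbalanced intent distributions.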

Abstract

Few-shot multimodal dialogue intention recognition is a critical challenge in the e-commerce domain. Previous methods have primarily enhanced model classification capabilities through post-training techniques. However, our analysis reveals that training for few-shot multimodal dialogue intention recognition involves two interconnected tasks, leading to a seesaw effect in multi-task learning. This phenomenon is attributed to knowledge interference stemming from the superposition of weight-matrix updates during training. To address these challenges, we propose Knowledge-Decoupled Synergetic Learning (KDSL), which uses smaller models to transform knowledge into interpretable rules while applying post-training to larger models. By having the large and small multimodal large language models collaborate on prediction, our approach achieves substantial gains: on two real Taobao datasets, it improves online weighted F1 scores by 6.37% and 6.28% over the state-of-the-art method, validating the efficacy of our framework.
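The interference claim can be made concrete with a generic first-order argument. Assume, for illustration only, that two tasks with losses $\mathcal{L}_1$ and $\mathcal{L}_2$ share a weight matrix $W$ and are trained jointly by gradient descent with step size $\eta$; this is a standard multi-task-learning view, not the paper's own derivation. The combined update is the sum of the per-task updates, and its first-order effect on task 1's loss is:

```latex
\begin{aligned}
G_k &= \nabla_W \mathcal{L}_k(W), \quad k \in \{1, 2\} \\
\Delta W &= \Delta W^{(1)} + \Delta W^{(2)} = -\eta\,(G_1 + G_2) \\
\mathcal{L}_1(W + \Delta W) &\approx \mathcal{L}_1(W) + \langle G_1, \Delta W \rangle_F \\
&= \mathcal{L}_1(W) - \eta\left(\lVert G_1 \rVert_F^2 + \langle G_1, G_2 \rangle_F\right)
\end{aligned}
```

Task 1's loss therefore rises whenever $\langle G_1, G_2 \rangle_F < -\lVert G_1 \rVert_F^2$, that is, when the two gradients are strongly opposed, and the symmetric condition holds for task 2. Under this view, a step that helps one task can hurt the other, which is one way a seesaw can arise from superposed updates.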
