Evolutionary Multimodal Reasoning via Hierarchical Semantic Representation for Intent Recognition

Qianrui Zhou, Hua Xu, Yunjin Gu, Yifan Wang, Songze Li, Hanlei Zhang

TL;DR

HIER is a novel method that integrates HIerarchical semantic representation with Evolutionary Reasoning based on a Multimodal Large Language Model (MLLM). It also uses a self-evolution mechanism that refines semantic representations through MLLM feedback, allowing dynamic adaptation during inference.

Abstract

Multimodal intent recognition aims to infer human intents by jointly modeling various modalities, playing a pivotal role in real-world dialogue systems. However, current methods struggle to model the hierarchical semantics underlying complex intents and lack the capacity for self-evolving reasoning over multimodal representations. To address these issues, we propose HIER, a novel method that integrates HIerarchical semantic representation with Evolutionary Reasoning based on a Multimodal Large Language Model (MLLM). Inspired by human cognition, HIER introduces a structured reasoning paradigm that organizes multimodal semantics into three progressively abstracted levels. It starts with modality-specific tokens capturing localized semantic cues, which are then clustered via a label-guided strategy to form mid-level semantic concepts. To capture higher-order structure, inter-concept relations are selected using Jensen-Shannon (JS) divergence scores to highlight salient dependencies across concepts. These hierarchical representations are then injected into the MLLM via CoT-driven prompting, enabling step-wise reasoning. In addition, HIER uses a self-evolution mechanism that refines semantic representations through MLLM feedback, allowing dynamic adaptation during inference. Experiments on three challenging benchmarks show that HIER consistently outperforms state-of-the-art methods and MLLMs with 1-3% gains across all metrics. Code and more results are available at https://github.com/thuiar/HIER.
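
To make the relation-scoring idea concrete, the following is a minimal sketch of the Jensen-Shannon divergence between two discrete distributions, followed by a ranking of concept pairs by that score. It is not the authors' implementation. Representing each concept as a probability distribution, the function names, and the choice to rank the highest-divergence pairs first are all assumptions made for illustration; the abstract does not state how scores map to selected relations.

```python
import math
from itertools import combinations


def kl_divergence(p, q):
    """KL(p || q) in nats. Terms where p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, bounded by ln(2) in nats."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def rank_concept_pairs(concepts):
    """Score every unordered pair of concepts by JS divergence.

    `concepts` is a list of probability vectors (hypothetical representation).
    Returns (score, i, j) tuples sorted from highest to lowest score; the
    sort direction is an assumption, not something the paper summary states.
    """
    scored = [
        (js_divergence(concepts[i], concepts[j]), i, j)
        for i, j in combinations(range(len(concepts)), 2)
    ]
    return sorted(scored, reverse=True)


# Toy example with three made-up concept distributions.
toy_concepts = [
    [0.7, 0.2, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.2, 0.7],
]
ranking = rank_concept_pairs(toy_concepts)
```

In this toy example the two identical distributions score zero, so the pair formed by concepts 0 and 1 lands at the bottom of the ranking.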

Paper Structure (20 sections, 15 equations, 3 figures, 5 tables)

Figures (3)

  • Figure 1: Overview of the proposed HIER method. The model comprises three key steps: (1) Multimodal Concept Clustering, which groups semantically related tokens into mid-level concept representations via soft Spherical K-Means++ augmented by intent labels; (2) Multimodal Relation Selection, which captures informative inter-concept dependencies using an information bottleneck network and JS divergence; and (3) Evolutionary Multimodal Reasoning, which conducts hierarchical reasoning through a structured CoT and self-evolution mechanism, enhancing both reasoning depth and robustness.
  • Figure 2: Details of the self-evolution module. We first copy Qwen2-VL’s generation head to project concept and relation features into vocabulary logits. Conditioned on usefulness-assessment prompts, we then extract and normalize the logits of affirmative and negative responses to derive confidence scores, which in turn guide feature refinement for adaptive and robust reasoning.
  • Figure 3: Impact of concept and relation quantity in HIER evaluated on the MIntRec2.0 dataset.
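
The confidence-score step described in the Figure 2 caption can be sketched as a two-way softmax over the logits of an affirmative and a negative response. This is a hedged illustration only: the caption says the logits are "normalized" without giving the formula, so the softmax form, the function name, and the scalar inputs are assumptions. How the resulting score guides feature refinement is not specified in this summary and is left out.

```python
import math


def usefulness_confidence(yes_logit, no_logit):
    """Normalize two response logits into a confidence in [0, 1].

    Assumed form: a two-way softmax, which equals a sigmoid of the logit gap.
    Subtracting the maximum first keeps the exponentials from overflowing.
    """
    shift = max(yes_logit, no_logit)
    e_yes = math.exp(yes_logit - shift)
    e_no = math.exp(no_logit - shift)
    return e_yes / (e_yes + e_no)


# Hypothetical logits for one concept feature under a usefulness prompt.
confidence = usefulness_confidence(yes_logit=2.0, no_logit=0.5)
```

Equal logits give a confidence of 0.5, and the score approaches 1 as the affirmative logit grows relative to the negative one.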