'No' Matters: Out-of-Distribution Detection in Multimodality Long Dialogue

Rena Gao, Xuetong Wu, Siwen Luo, Caren Han, Feng Liu

TL;DR

This paper introduces a novel scoring framework, the Dialogue Image Aligning and Enhancing Framework (DIAEF), which integrates visual language models with newly proposed scores to detect OOD inputs in two key scenarios: (1) mismatches between the dialogue and image in an input pair, and (2) input pairs with previously unseen labels.

Abstract

Out-of-distribution (OOD) detection in multimodal contexts is essential for identifying deviations in combined inputs from different modalities, particularly in applications such as open-domain dialogue systems or real-life dialogue interactions. This paper aims to improve the user experience in multi-round long dialogues by efficiently detecting OOD dialogues and images. We introduce a novel scoring framework, the Dialogue Image Aligning and Enhancing Framework (DIAEF), which integrates visual language models with newly proposed scores to detect OOD inputs in two key scenarios: (1) mismatches between the dialogue and image in an input pair, and (2) input pairs with previously unseen labels. Our experimental results, derived from various benchmarks, demonstrate that for previously unseen labels, OOD detection that combines the image and the multi-round dialogue is more effective than using either modality independently. In the presence of mismatched pairs, our proposed score effectively identifies the mismatches and remains robust in long dialogues. This approach supports domain-aware, adaptive conversational agents and establishes baselines for future studies.

Paper Structure

This paper contains 11 sections, 1 theorem, 6 equations, 5 figures, 9 tables.

Key Result

Theorem 1

Under Assumption 1, we can show that the proposed DIAEF score satisfies the following:

Figures (5)

  • Figure 1: Motivating examples of an ID pair, a mismatched OOD pair, and a label OOD pair, where the ID label is 'cat' and the OOD label is 'sport'.
  • Figure 2: The workflow for the three motivating examples of cross-modal OOD detection: an ID pair, a mismatched OOD pair, and a label OOD pair. The workflow consists of three main parts. First, the dialogue and image are processed and passed into a visual language model to obtain the image and dialogue embeddings. Second, two label extractors are trained on the image and dialogue embeddings for predictions and score calculations. Finally, the score functions $s$, $s_T$ and $s_I$ are aggregated, and the threshold $\lambda$ is determined at a recall rate of 95%. FPR95 is reported to demonstrate that combining images and dialogue outperforms using images or dialogue alone.
  • Figure 3: Effect of $\gamma$ with $\alpha = 0.5$
  • Figure 4: Effect of $\alpha$ with $\gamma = 1$
  • Figure 5: An illustration of the effectiveness of $s(x_I, x_T)$
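
The Figure 2 caption mentions choosing a threshold $\lambda$ at a recall rate of 95% and reporting FPR95. The sketch below illustrates only that evaluation step, not the DIAEF score itself. It assumes that higher scores indicate in-distribution inputs (the caption does not state the score direction). The function name `fpr_at_recall` and the toy scores are hypothetical.

```python
import numpy as np

def fpr_at_recall(id_scores, ood_scores, recall=0.95):
    """Return (threshold, FPR) at a target recall on ID samples.

    Assumption (not from the source): a higher score means "more
    in-distribution", so a sample is accepted as ID when score >= threshold.
    """
    id_scores = np.asarray(id_scores, dtype=float)
    ood_scores = np.asarray(ood_scores, dtype=float)

    # Number of ID samples that must be accepted to reach the target recall.
    # The small epsilon guards against floating-point overshoot in recall * n.
    n = id_scores.size
    k = int(np.ceil(recall * n - 1e-9))

    # The k-th largest ID score is the highest threshold that still
    # accepts at least k ID samples.
    threshold = np.sort(id_scores)[::-1][k - 1]

    # False positive rate: fraction of OOD samples wrongly accepted as ID.
    fpr = float(np.mean(ood_scores >= threshold))
    return float(threshold), fpr

# Toy data: 20 ID scores (1..20) and 4 OOD scores.
lam, fpr = fpr_at_recall(np.arange(1, 21), [0.5, 1.5, 2.5, 10.0])
print(lam, fpr)
```

A lower FPR at the same recall means fewer OOD inputs slip past the threshold. Under the same assumptions, this is the quantity one would compare across the combined score and the single-modality scores.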

Theorems & Definitions (4)

  • Remark 1
  • Definition 1: Cross-Modal OOD Detection
  • Theorem 1
  • Proof 1