Leveraging Weak Cross-Modal Guidance for Coherence Modelling via Iterative Learning

Yi Bin, Junrong Liao, Yujuan Ding, Haoxuan Li, Yang Yang, See-Kiong Ng, Heng Tao Shen

TL;DR

This work tackles cross-modal coherence modeling in the absence of gold cross-modal order labels by introducing weak cross-modal guidance. The IterWeGO framework combines intra-modal semantic/context encoding, semantic-aligned cross-modal order guidance via CGO-MU, and an iterative boosting scheme to jointly train and infer orderings across image and text modalities. Empirical results on SIND and TACoS-Ordering show consistent improvements over strong baselines, with ablations validating the importance of weak guidance and iterative updates. The approach demonstrates practical benefits for building coherent multimodal narratives without costly annotations and highlights potential for broader multimodal reasoning tasks.

Abstract

Cross-modal coherence modeling is essential for intelligent systems: it helps them organize and structure information, and thereby understand and create content about the physical world as coherently as humans do. Previous work on cross-modal coherence modeling leveraged order information from another modality to assist coherence recovery in the target modality. Despite its effectiveness, labeled coherence information for the associated modality is not always available and can be costly to acquire, which makes cross-modal guidance hard to leverage. To tackle this challenge, this paper explores a new way to take advantage of cross-modal guidance without gold coherence labels and proposes the Weak Cross-Modal Guided Ordering (WeGO) model. More specifically, it uses high-confidence predicted pairwise orders in one modality as reference information to guide coherence modeling in the other. An iterative learning paradigm is further designed to jointly optimize coherence modeling in the two modalities, each with selected guidance from the other. The iterative cross-modal boosting also operates at inference time to further enhance coherence prediction in each modality. Experimental results on two public datasets demonstrate that the proposed method outperforms existing methods on cross-modal coherence modeling tasks. Ablation studies confirm the effectiveness of the major technical modules. Code is available at: https://github.com/scvready123/IterWeGO
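The idea of using high-confidence pairwise order predictions from one modality as weak guidance for the other, applied iteratively, can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the confidence threshold, the blending weight, the score-sum ranking, and the assumption that image i is aligned one-to-one with sentence i are all hypothetical choices made for illustration.

```python
from typing import Dict, List, Tuple

Matrix = List[List[float]]  # prob[i][j] = predicted P(item i precedes item j)


def select_confident_pairs(prob: Matrix, threshold: float = 0.9) -> Dict[Tuple[int, int], float]:
    """Keep only pairs (i, j), i < j, whose predicted order is confident in either direction."""
    confident = {}
    n = len(prob)
    for i in range(n):
        for j in range(i + 1, n):
            p = prob[i][j]
            if max(p, 1.0 - p) >= threshold:
                # Hard pseudo-label: 1.0 means "i before j", 0.0 means "j before i".
                confident[(i, j)] = 1.0 if p >= 0.5 else 0.0
    return confident


def apply_guidance(prob: Matrix, guidance: Dict[Tuple[int, int], float], weight: float = 0.5) -> Matrix:
    """Blend the target modality's pairwise probabilities with the weak guidance labels."""
    guided = [row[:] for row in prob]
    for (i, j), target in guidance.items():
        guided[i][j] = (1.0 - weight) * prob[i][j] + weight * target
        guided[j][i] = 1.0 - guided[i][j]
    return guided


def order_from_pairwise(prob: Matrix) -> List[int]:
    """Rank items by how strongly they are predicted to precede the others."""
    n = len(prob)
    scores = [sum(prob[i][j] for j in range(n) if j != i) for i in range(n)]
    return sorted(range(n), key=lambda i: -scores[i])


def iterative_boosting(image_prob: Matrix, text_prob: Matrix, steps: int = 2,
                       threshold: float = 0.9, weight: float = 0.5) -> Tuple[List[int], List[int]]:
    """Alternate guidance between the two modalities for a fixed number of steps."""
    for _ in range(steps):
        text_prob = apply_guidance(text_prob, select_confident_pairs(image_prob, threshold), weight)
        image_prob = apply_guidance(image_prob, select_confident_pairs(text_prob, threshold), weight)
    return order_from_pairwise(image_prob), order_from_pairwise(text_prob)


# Toy data: the image model is confident about the order 0, 1, 2; the text model is not.
image_prob = [[0.5, 0.95, 0.97], [0.05, 0.5, 0.92], [0.03, 0.08, 0.5]]
text_prob = [[0.5, 0.4, 0.7], [0.6, 0.5, 0.6], [0.3, 0.4, 0.5]]

print(order_from_pairwise(text_prob))             # text order without guidance
print(iterative_boosting(image_prob, text_prob))  # (image order, text order) with guidance
```

In this toy case the text model alone swaps the first two sentences, while guidance from the confident image predictions restores the order. A learned model would instead use such guidance as an additional training or inference signal, with cross-modal pairs matched by semantic alignment rather than by index.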

Paper Structure

This paper contains 18 sections, 2 equations, 4 figures, 4 tables, 1 algorithm.

Figures (4)

  • Figure 1: Uni-modal and Cross-Modal Coherence Modeling (CMCM) tasks, illustrated with a sentence ordering case and distinguished by whether guidance from another modality is used. This paper focuses on the CMCM task without cross-modal ORDER information, leveraging only weak guidance across modalities, as shown in the green part.
  • Figure 2: Framework of the proposed IterWeGO model. An iterative learning paradigm is designed to optimize the ordering models of two modalities jointly with continuous guidance from each other. The weak cross-modal order guidance is applied selectively at each learning step based on the predicted pairwise order through semantic cross-modal alignment.
  • Figure 3: Performance of two types of models (IterWeGO and IterWeGO w/o IB in training) with different numbers of Iterative Boosting (IB) steps during inference.
  • Figure 4: Three case studies. (a) and (b) show the image and sentence orders predicted by our IterWeGO, the key baseline NACON, and the variant IterWeGO-UM. (c) illustrates the iterative updating process for the two ordering tasks with cross-modal guidance during inference by our IterWeGO.