
MoEMoE: Question Guided Dense and Scalable Sparse Mixture-of-Expert for Multi-source Multi-modal Answering

Vinay Kumar Verma, Shreyas Sunil Kulkarni, Happy Mittal, Deepak Gupta

TL;DR

MoEMoE tackles multi-source, multi-modal question-answer generation by introducing a question-guided attention (QGA) mechanism and explicit question-source alignment within an encoder-decoder framework that employs a sparse mixture-of-experts (MoE) for scalability. The model uses separate, unshared encoders for the question and the context plus a single image encoder (SwinV2) to form source-aware embeddings, while QGA assigns token-level weights across sources, enabling robust, question-conditioned focus. Alignment losses for question-context and question-image ($\mathcal{L}_{QCA}$, $\mathcal{L}_{QIA}$) refine within-source attention, and an auxiliary load-balancing loss ($\mathcal{L}_{aux}$) supports the sparse MoE that scales the model to thousands of question types. Experiments on MXT 30PT, CMA-CLIP, and OHLSL with T5/Flan-T5 show state-of-the-art results and demonstrate that the decoder-focused MoE configuration, together with careful training and loss weighting, yields the best performance while maintaining efficiency. The work advances practical attribute-based answer generation in e-commerce and related domains by leveraging diverse sources and modalities while mitigating bias toward textual signals.

Abstract

Question Answering (QA) and Visual Question Answering (VQA) are well-studied problems in the language and vision domain. One challenging scenario involves multiple sources of information, each of a different modality, where the answer to the question may exist in one or more sources. This scenario contains richer information but is highly complex to handle. In this work, we formulate a novel question-answer generation (QAG) framework in an environment containing multi-source, multimodal information. The answer may belong to any or all sources; therefore, selecting the most prominent answer source or an optimal combination of all sources for a given question is challenging. To address this issue, we propose a question-guided attention mechanism that learns attention across multiple sources and decodes this information for robust and unbiased answer generation. To learn attention within each source, we introduce an explicit alignment between questions and various information sources, which facilitates identifying the most pertinent parts of the source information relative to the question. Scalability in handling diverse questions poses a challenge. We address this by extending our model to a sparse mixture-of-experts (sparse-MoE) framework, enabling it to handle thousands of question types. Experiments on T5 and Flan-T5 using three datasets demonstrate the model's efficacy, supported by ablation studies.
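The sparse-MoE idea mentioned above can be illustrated with a minimal sketch of top-k expert routing and a load-balancing auxiliary loss. This is not the authors' implementation: the function names (`route_top_k`, `load_balance_loss`, `moe_layer`), the toy scaling experts, and the choice of a Switch-Transformer-style loss (number of experts times the sum over experts of routed-token fraction times mean router probability) are all assumptions made for illustration, and the paper's $\mathcal{L}_{aux}$ may be defined differently.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k):
    """Pick the k highest-probability experts for one token.

    Returns (expert_indices, gate_weights renormalised to sum to 1).
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return top, [probs[i] / norm for i in top]


def load_balance_loss(all_router_logits):
    """Assumed Switch-Transformer-style auxiliary loss (hypothetical choice).

    loss = num_experts * sum_i f_i * P_i, where f_i is the fraction of tokens
    whose top-1 expert is i and P_i is the mean router probability of expert i.
    """
    num_tokens = len(all_router_logits)
    num_experts = len(all_router_logits[0])
    frac = [0.0] * num_experts
    mean_prob = [0.0] * num_experts
    for logits in all_router_logits:
        probs = softmax(logits)
        best = max(range(num_experts), key=lambda i: probs[i])
        frac[best] += 1.0 / num_tokens
        for i, p in enumerate(probs):
            mean_prob[i] += p / num_tokens
    return num_experts * sum(f * p for f, p in zip(frac, mean_prob))


def moe_layer(token, router_logits, experts, k=2):
    """Sum the outputs of the k selected experts, weighted by their gates."""
    idx, gates = route_top_k(router_logits, k)
    out = [0.0] * len(token)
    for i, g in zip(idx, gates):
        expert_out = experts[i](token)
        out = [o + g * e for o, e in zip(out, expert_out)]
    return out


if __name__ == "__main__":
    random.seed(0)
    # Four toy "experts": each scales its input by a fixed factor.
    experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
    logits = [[random.gauss(0, 1) for _ in experts] for _ in range(8)]
    print(moe_layer([1.0, -1.0], logits[0], experts, k=2))
    print(load_balance_loss(logits))
```

Only the k selected experts run for each token, which is what keeps compute roughly constant as the number of experts grows. Under this assumed loss, routing every token to the same expert is penalised relative to a spread in which the routed fractions and mean router probabilities are closer to uniform.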

Paper Structure

This paper contains 27 sections, 8 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Example of the multi-modal and multi-source attribute extraction using the proposed question answering mechanism.
  • Figure 2: The proposed model architecture consists of two T5 encoders for processing the question and context, along with one image encoder. The question and context are aligned using the Question Context Alignment Loss, while the question and image are aligned through the Question Image Alignment Loss.
  • Figure 3: Illustration of the Mixture of Experts (MoE) layer (Shazeer et al., 2017).
  • Figure 4: Results on the OHLSL dataset with the Flan-T5 architecture. We report the average Accuracy and Recall@90 for the OHL and SL categories across all attributes.
  • Figure 5: Ablation over the components of the proposed model. Performance drops significantly without question guidance (WoQG) or without alignment (WoAL), and the single-encoder variant (S-Enc) also shows degraded results.