IIU: Independent Inference Units for Knowledge-based Visual Question Answering

Yili Li, Jing Yu, Keke Gai, Gang Xiong

TL;DR

The paper addresses knowledge-based visual question answering, where answering a question requires external knowledge beyond the image content. It introduces Independent Inference Units (IIU), which split intra-modal clues across functionally independent reasoning units. A memory update module and inter-unit communication support interpretable, stepwise reasoning over a tri-graph representation (visual, semantic, and fact graphs). Among non-pretrained models, IIU reports state-of-the-art performance on OK-VQA (34.87% overall) and strong results on FVQA; ablations and visualizations show how unit independence, communication, and memory updating contribute to generalization and interpretability. Because processing is modular, the model yields explainable reasoning traces and limits the influence of redundant information during reasoning.

Abstract

Knowledge-based visual question answering requires external knowledge beyond the visible content to answer the question correctly. One limitation of existing methods is that they focus on modeling inter-modal and intra-modal correlations, which entangles complex multimodal clues in implicit embeddings and limits interpretability and generalization. The key challenge in solving this problem is to separate the information and process each part independently at the functional level. Reusing each processing unit improves the model's ability to generalize across different data. In this paper, we propose Independent Inference Units (IIU) for fine-grained multi-modal reasoning, which decompose intra-modal information into functionally independent units. Specifically, IIU processes each semantic-specific intra-modal clue with an independent inference unit, which also collects complementary information by communicating with other units. To further reduce the impact of redundant information, we propose a memory update module that gradually maintains semantically relevant memory throughout the reasoning process. Compared with existing non-pretrained multi-modal reasoning models on standard datasets, our model achieves a new state of the art, improving performance by 3% and surpassing basic pretrained multi-modal models. The experimental results show that our IIU model is effective at disentangling intra-modal clues as well as reasoning units, and that it provides explainable reasoning evidence. Our code is available at https://github.com/Lilidamowang/IIU.
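
The abstract describes three mechanisms: independent units, communication between units, and a memory update. The sketch below is one minimal way to picture how they could fit together in a single reasoning step. It is not the authors' implementation. The top-k activation rule, the tanh update, the dot-product attention, the scalar memory gate, and every name (Unit, inference_step, top_k) are assumptions made for illustration only.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class Unit:
    """One inference unit with private parameters and its own hidden state."""

    def __init__(self, dim, rng):
        # Hypothetical key vector used to score relevance to an input clue.
        self.key = [rng.uniform(-1, 1) for _ in range(dim)]
        self.W = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(dim)]
        self.h = [0.0] * dim

    def update(self, clue):
        # Assumed update rule: h <- tanh(W @ clue + h)
        self.h = [math.tanh(dot(row, clue) + h_i) for row, h_i in zip(self.W, self.h)]


def inference_step(units, clue, memory, top_k):
    """One step over at least two units; clue and memory share the units' dimension."""
    # 1. Activation (assumed top-k rule): only the most relevant units update.
    scores = [dot(u.key, clue) for u in units]
    active = sorted(range(len(units)), key=lambda i: scores[i], reverse=True)[:top_k]
    for i in active:
        units[i].update(clue)

    # 2. Communication: each active unit adds an attention-weighted mix of the
    #    other units' states, read from a snapshot taken before any mixing.
    snapshot = [list(u.h) for u in units]
    for i in active:
        others = [j for j in range(len(units)) if j != i]
        weights = softmax([dot(snapshot[i], snapshot[j]) for j in others])
        for d in range(len(clue)):
            units[i].h[d] += sum(w * snapshot[j][d] for w, j in zip(weights, others))

    # 3. Memory update: a scalar gate blends the active units' mean state
    #    into the running memory.
    summary = [sum(units[i].h[d] for i in active) / len(active) for d in range(len(clue))]
    gate = 1.0 / (1.0 + math.exp(-dot(clue, summary)))
    new_memory = [(1 - gate) * m + gate * s for m, s in zip(memory, summary)]
    return active, new_memory


rng = random.Random(0)
units = [Unit(3, rng) for _ in range(4)]
active, memory = inference_step(units, [1.0, 0.5, -0.2], [0.0, 0.0, 0.0], top_k=2)
print("active units:", active)
print("memory:", memory)
```

In this sketch, inactive units keep their previous hidden state, which is one simple way to model the active/inactive pattern mentioned in the figure captions. The paper's actual activation and update rules may differ.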

Paper Structure (14 sections, 9 equations, 4 figures, 3 tables)

Figures (4)

  • Figure 1: Illustration of our motivation. The questions on the left and right involve different objects, but they require the same reasoning abilities. By combining and reusing these reasoning abilities, the same reasoning process can answer different questions.
  • Figure 2: An overview of our model. The model contains two main modules: the Memory Extraction Module and the Independent Inference Module. After $t$ steps of inference, we use the hidden states of each unit to predict the answer on the fact graph.
  • Figure 3: Units Activation of Different Modalities. Dark indicates active, and light indicates inactive.
  • Figure 4: Units Activation of Redundant Information. Dark indicates active and light indicates inactive.