Dynamic Knowledge Integration for Enhanced Vision-Language Reasoning

Julian Perry, Surasakdi Siripong, Thanakorn Phonchai

TL;DR

This work tackles the limited external knowledge integration in Large Vision-Language Models by proposing Adaptive Knowledge-Guided Pretraining for LVLMs (AKGP-LVLM). The approach combines a knowledge encoder, a task-aware retrieval mechanism, and a lightweight Dynamic Knowledge Adaptor to align external knowledge with multimodal representations through a two-stage training regime and a contrastive alignment objective. It achieves state-of-the-art results on four knowledge-intensive benchmarks (OK-VQA, FVQA, SNLI-VE, NLVR2) and is complemented by thorough ablations and human evaluations that confirm improved correctness and relevance. The framework demonstrates robustness, scalability, and practical potential for real-world multimodal reasoning tasks, while highlighting areas for future work such as multi-hop reasoning and expanded knowledge coverage.

Abstract

Large Vision-Language Models (LVLMs) have demonstrated impressive capabilities in multimodal tasks, but their performance is often constrained by the lack of external knowledge integration, limiting their ability to handle knowledge-intensive tasks such as visual question answering and reasoning. To address this challenge, we propose a novel method, Adaptive Knowledge-Guided Pretraining for Large Vision-Language Models (AKGP-LVLM), which dynamically incorporates structured and unstructured knowledge into LVLMs during pretraining and fine-tuning. Our approach employs a knowledge encoder to represent external knowledge, a retrieval mechanism to select task-relevant information, and a dynamic adaptor to align multimodal and knowledge representations effectively. We evaluate our method on four benchmark datasets, demonstrating significant performance improvements over state-of-the-art models. Furthermore, human evaluations highlight the superior correctness and relevance of our model's outputs. Extensive analyses confirm the robustness, efficiency, and scalability of AKGP-LVLM, making it a compelling solution for real-world knowledge-intensive tasks.
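
As a rough illustration of two ingredients named above, selecting task-relevant knowledge and aligning it with a multimodal representation through a contrastive objective, the following is a minimal sketch and not the authors' implementation. It assumes cosine-similarity top-k retrieval and a generic InfoNCE-style loss; the function names (`retrieve_top_k`, `info_nce`), the temperature value, and the toy two-dimensional vectors are all hypothetical and do not come from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieve_top_k(query, knowledge, k):
    """Return indices of the k knowledge vectors most similar to the query.

    Assumption: retrieval is plain cosine-similarity ranking; the paper's
    task-aware mechanism may differ.
    """
    ranked = sorted(
        range(len(knowledge)),
        key=lambda i: cosine(query, knowledge[i]),
        reverse=True,
    )
    return ranked[:k]


def info_nce(query, positive, negatives, temperature=0.1):
    """Generic InfoNCE loss (an assumed stand-in for the alignment objective).

    The loss is small when the query is closer to the positive than to
    every negative, and large otherwise.
    """
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, n) / temperature for n in negatives]
    # Numerically stable log-sum-exp over all logits.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[0]


# Toy data: a multimodal query embedding and three knowledge embeddings.
query = [1.0, 0.0]
knowledge = [[0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]

top = retrieve_top_k(query, knowledge, k=2)
loss_aligned = info_nce(query, knowledge[0], knowledge[1:])
loss_misaligned = info_nce(query, knowledge[2], knowledge[:2])
```

In this toy setup the loss is lower when the positive is the knowledge vector nearest the query, which is the behaviour a contrastive alignment objective is meant to encourage.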

Paper Structure (27 sections, 10 equations, 3 tables)