Declarative Knowledge Distillation from Large Language Models for Visual Question Answering Datasets

Thomas Eiter, Jan Hadl, Nelson Higuera, Johannes Oetsch

TL;DR

This work prompts an LLM to extend an initial theory on VQA reasoning, given as an answer-set program, so that it meets the requirements of the VQA task. The results confirm that distilling knowledge from LLMs is a promising direction alongside data-driven rule-learning approaches.

Abstract

Visual Question Answering (VQA) is the task of answering a question about an image and requires processing multimodal input and reasoning to obtain the answer. Modular solutions that use declarative representations within the reasoning component have a clear advantage over end-to-end trained systems regarding interpretability. The downside is that crafting the rules for such a component can be an additional burden on the developer. We address this challenge by presenting an approach for declarative knowledge distillation from Large Language Models (LLMs). Our method is to prompt an LLM to extend an initial theory on VQA reasoning, given as an answer-set program, to meet the requirements of the VQA task. Examples from the VQA dataset are used to guide the LLM, validate the results, and repair incorrect rules using feedback from the answer-set programming (ASP) solver. We demonstrate that our approach works on the prominent CLEVR and GQA datasets. Our results confirm that distilling knowledge from LLMs is a promising direction alongside data-driven rule-learning approaches.
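
The loop described in the abstract (prompt, validate against dataset examples, repair with solver feedback) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function `distill`, the callables `propose` and `solve`, the parameter `max_repairs`, and the string-based program representation are all hypothetical names chosen for illustration. The LLM and the ASP solver are passed in as plain functions, so the sketch makes no assumption about any particular model or solver API.

```python
from typing import Callable, List, Optional, Tuple

# One VQA example: ASP facts encoding the scene and question, plus the gold answer.
Example = Tuple[str, str]
# Hypothetical solver interface: program text -> (answer or None, error message or None).
Solver = Callable[[str], Tuple[Optional[str], Optional[str]]]
# Hypothetical LLM interface: (theory, facts, expected answer, feedback) -> new rules.
Proposer = Callable[[str, str, str, Optional[str]], str]


def distill(theory: str, examples: List[Example], propose: Proposer,
            solve: Solver, max_repairs: int = 3) -> str:
    """Extend an initial ASP theory until it answers the given examples.

    Assumption: a candidate extension is accepted as soon as it yields the
    expected answer for the current example; re-checking earlier examples
    is omitted to keep the sketch short.
    """
    for facts, expected in examples:
        answer, error = solve(theory + "\n" + facts)
        if error is None and answer == expected:
            continue  # the current theory already handles this example
        feedback = error or f"expected {expected!r}, got {answer!r}"
        for _ in range(max_repairs):
            # Ask the LLM for new rules, passing along the latest feedback.
            rules = propose(theory, facts, expected, feedback)
            candidate = theory + "\n" + rules
            answer, error = solve(candidate + "\n" + facts)
            if error is None and answer == expected:
                theory = candidate  # validated: keep the extension
                break
            # Otherwise feed the solver error or wrong answer back to the LLM.
            feedback = error or f"expected {expected!r}, got {answer!r}"
    return theory
```

In this sketch, rules that fail validation are discarded rather than kept in the theory, and an example that still fails after `max_repairs` attempts is simply skipped. Both are design choices of the illustration, not claims about the paper.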

Paper Structure

This paper contains 18 sections, 2 equations, 2 figures, 7 tables.

Figures (2)

  • Figure 1: An overview of our knowledge distillation method.
  • Figure 2: GS-VQA takes the image and question as input and uses a question-driven approach to generate a partial scene graph. We generate an ASP representation of both the partial scene graph and the question. These are then solved together with an ASP theory to derive the correct answer.