Textbook Question Answering with Multi-modal Context Graph Understanding and Self-supervised Open-set Comprehension

Daesik Kim, Seonhoon Kim, Nojun Kwak

TL;DR

Textbook Question Answering with Multi-modal Context Graph Understanding and Self-supervised Open-set Comprehension tackles realistic multi-modal QA by modeling long textual lessons and diagrams as context graphs using a fusion GCN (f-GCN). It introduces a self-supervised open-set comprehension (SSOC) pretraining stage to address out-of-domain terminology and open-set issues before supervised QA. The approach uses visual and textual context graphs built via UDPnet and TF-IDF-guided dependency graphs, combined with RNN-based encoders and attention to predict answers, achieving state-of-the-art results on the TQA dataset. Ablation studies confirm the crucial roles of both f-GCN and SSOC in boosting performance on text and diagram questions alike.

Abstract

In this work, we introduce a novel algorithm for solving the textbook question answering (TQA) task, which poses more realistic QA problems than other recent tasks. Based on an analysis of the TQA dataset, we focus on two related issues. First, solving TQA problems requires comprehending multi-modal contexts in complicated input data. To extract knowledge features from long text lessons and merge them with visual features, we establish a context graph from texts and images, and propose a new module, f-GCN, based on graph convolutional networks (GCN). Second, scientific terms are not spread across chapters, and subjects are split in the TQA dataset. To overcome this so-called "out-of-domain" issue, we introduce a novel self-supervised open-set learning process that requires no annotations and runs before the model learns the QA problems. The experimental results show that our model significantly outperforms prior state-of-the-art methods. Moreover, ablation studies validate that both incorporating f-GCN to extract knowledge from multi-modal contexts and our newly proposed self-supervised learning process are effective for TQA problems.
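
As a rough illustration of the graph convolution that f-GCN is described as building on, the sketch below applies one standard GCN propagation step, $H' = \mathrm{ReLU}(\hat{D}^{-1/2}(A + I)\hat{D}^{-1/2} H W)$, to a toy context graph. This is the generic GCN formulation, not the paper's f-GCN; the function names, the three-node graph, and the identity weight matrix are all assumptions made for the example.

```python
import math


def normalize_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2} for a square adjacency matrix."""
    n = len(adj)
    # Add self-loops so each node keeps part of its own feature vector.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def gcn_layer(adj, features, weight):
    """One propagation step: ReLU(A_norm @ H @ W)."""
    propagated = matmul(matmul(normalize_adjacency(adj), features), weight)
    return [[max(0.0, v) for v in row] for row in propagated]


# Hypothetical context graph with three nodes in a chain: 0 - 1 - 2.
adj = [[0, 1, 0],
       [1, 0, 1],
       [0, 1, 0]]
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # one 2-d feature per node
weight = [[1.0, 0.0], [0.0, 1.0]]                # identity, for readability

hidden = gcn_layer(adj, features, weight)
```

Each row of `hidden` mixes a node's own features with those of its neighbours, which is how information spreads over a context graph. In a trained model `weight` would be learned rather than fixed.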

Paper Structure

This paper contains 25 sections, 9 equations, 14 figures, 4 tables, 1 algorithm.

Figures (14)

  • Figure 1: Examples of the textbook question answering task and a brief concept of our work. In this figure, we can see lessons which contain long essays and diagrams in the TQA. Related questions are also illustrated. With a self-supervised method, our model can comprehend contexts converted into context graphs in training and validation sets. Then it learns to solve questions only in the training set in a supervised manner.
  • Figure 2: Analysis of contexts in TQA and SQuAD datasets.
  • Figure 3: Overall framework of our model: (a) The preparation step for the $k$-th answer among $n$ candidates. The context $m$ is determined by TF-IDF score with the question and the $k$-th answer. Then, the context $m$ is converted to a context graph $m$. The question and the $k$-th answer are also embedded by GloVe and character embedding. This step is repeated for $n$ candidates. (b) The embedding step uses $RNN_C$ as a sequence embedding module and f-GCN as a graph embedding module. With attention methods, we can obtain combined features. After concatenation, $RNN_S$ and the fully connected module predict the final distribution in the solving step.
  • Figure 4: Illustration of f-GCN. Both textual and visual contexts are converted into $H_c^d$ and $H_c^t$. With attention methods, we obtain combined features of $H_c^t$ and $H_c^d$ (f-GCN1). Finally, we use GCN again to propagate over the entire features of the context graphs (f-GCN2).
  • Figure 5: Self-supervised open-set comprehension step in our model. We set contexts as candidates we should predict for the question and the $k$-th answer. For each answer, we obtain $n$ context candidates from TF-IDF methods and set the top-1 candidate as the correct context. While we use the same structure as in Figure 3, we can predict the final distribution after all the steps.
  • ...and 9 more figures
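
The captions for Figures 3 and 5 describe ranking contexts by TF-IDF score against the question and the $k$-th answer, then treating the top-1 candidate as the correct context during self-supervised comprehension. A minimal sketch of that ranking step follows, assuming whitespace tokenization, a smoothed IDF, and cosine similarity; none of these details are stated in the source, and `rank_contexts` and the example sentences are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption for this sketch)."""
    return text.lower().split()


def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (dicts) for tokenized documents."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF so that terms present in every document keep a weight.
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = [{t: c * idf[t] for t, c in Counter(doc).items()} for doc in docs]
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_contexts(query, contexts, top_n):
    """Return indices of the top_n contexts most similar to the query."""
    vectors, idf = tfidf_vectors([tokenize(c) for c in contexts])
    # Terms unseen in any context carry no weight in the query vector.
    q_vec = {t: c * idf[t] for t, c in Counter(tokenize(query)).items()
             if t in idf}
    scores = [cosine(q_vec, v) for v in vectors]
    order = sorted(range(len(contexts)), key=lambda i: scores[i], reverse=True)
    return order[:top_n]


contexts = [
    "the mantle lies between the crust and the core",
    "photosynthesis converts light energy into chemical energy",
    "plate tectonics describes the motion of the crust",
]
# Query = question text concatenated with one answer candidate.
query = "which layer lies between the crust and the core mantle"

candidates = rank_contexts(query, contexts, top_n=2)
pseudo_label = candidates[0]  # top-1 candidate serves as the "correct" context
```

Because the pseudo-label comes from retrieval scores rather than from human annotation, this kind of target can be generated for any lesson text, including lessons whose questions are not used for supervised training.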