Task Progressive Curriculum Learning for Robust Visual Question Answering

Ahmed Akl, Abdelwahed Khamis, Zhe Wang, Ali Cheraghian, Sara Khalifa, Kewen Wang

TL;DR

This work shows for the first time that robust Visual Question Answering is attainable by simply enhancing the training strategy. It proposes Task Progressive Curriculum Learning (TPCL), which breaks the main VQA problem into smaller, easier tasks based on the question type.

Abstract

Visual Question Answering (VQA) systems are known for their poor performance on out-of-distribution datasets, an issue that previous works addressed through ensemble learning, answer re-ranking, or artificially growing the training set. In this work, we show for the first time that robust Visual Question Answering is attainable by simply enhancing the training strategy. Our proposed approach, Task Progressive Curriculum Learning (TPCL), breaks the main VQA problem into smaller, easier tasks based on the question type. Then, it progressively trains the model on a (carefully crafted) sequence of tasks. We further support the method with a novel distribution-based difficulty measurer. Our approach is conceptually simple, model-agnostic, and easy to implement. We demonstrate TPCL's effectiveness through a comprehensive evaluation on standard datasets. Without either data augmentation or an explicit debiasing mechanism, it achieves state-of-the-art results on the VQA-CP v2, VQA-CP v1 and VQA v2 datasets. Extensive experiments demonstrate that TPCL outperforms the most competitive robust VQA approaches by more than 5% and 7% on VQA-CP v2 and VQA-CP v1, respectively. TPCL can also boost the performance of baseline VQA backbones by up to 28.5%.
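
The task decomposition and progressive training schedule described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the use of mean loss as a difficulty score (the paper uses a distribution-based measurer instead), the linear pacing rule, and the `hardest_first` flag are all assumptions made for illustration.

```python
import math
from collections import defaultdict
from statistics import mean


def group_by_question_type(samples):
    """Partition VQA samples into tasks, one task per question type."""
    tasks = defaultdict(list)
    for sample in samples:
        tasks[sample["question_type"]].append(sample)
    return dict(tasks)


def rank_tasks(task_losses, hardest_first=True):
    """Order tasks by mean loss (a simple stand-in difficulty score)."""
    scores = {task: mean(losses) for task, losses in task_losses.items()}
    return sorted(scores, key=scores.get, reverse=hardest_first)


def linear_pacing(round_idx, num_rounds, num_tasks):
    """Number of tasks exposed at a 1-indexed round, growing linearly (assumed rule)."""
    return min(num_tasks, max(1, math.ceil(num_tasks * round_idx / num_rounds)))


def build_curricula(ranked_tasks, num_rounds):
    """Return one task list per round; later rounds include more tasks."""
    return [
        ranked_tasks[: linear_pacing(r, num_rounds, len(ranked_tasks))]
        for r in range(1, num_rounds + 1)
    ]


# Toy data: three question types with hypothetical per-sample losses.
samples = [
    {"question_type": "how many", "question": "how many dogs are there?"},
    {"question_type": "is this", "question": "is this a kitchen?"},
    {"question_type": "how many", "question": "how many cups are on the table?"},
    {"question_type": "what is the", "question": "what is the man holding?"},
]
task_losses = {
    "how many": [2.0, 3.0],
    "is this": [0.5, 0.7],
    "what is the": [1.0, 1.5],
}

tasks = group_by_question_type(samples)
ranked = rank_tasks(task_losses)
curricula = build_curricula(ranked, num_rounds=3)
for round_idx, curriculum in enumerate(curricula, start=1):
    print(round_idx, curriculum)
```

In a dynamic variant, `rank_tasks` would be re-run on fresh losses after each round so that the task order follows the model's own feedback; the ordering direction is left as a parameter because it is a design choice of the curriculum.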

Paper Structure

This paper contains 20 sections, 5 equations, 8 figures, 8 tables, 1 algorithm.

Figures (8)

  • Figure 1: We propose a novel task-based curriculum learning scheme for VQA. While vanilla curriculum learning is instance-oriented, we propose a task-based curriculum where each task is mapped to a specific question type (e.g. "what is the", "how many"). We demonstrate the efficacy of this simple approach on the robust VQA problem.
  • Figure 2: Dynamic Curriculum Training (Algorithm 1). TPCL is essentially a task-based self-taught CL that adapts the training difficulty based on continual feedback from the VQA model during training. The training progresses from hard to easy so that the model focuses on the challenging tasks first, which enables out-of-distribution generalisation. (top) The VQA model is exposed to a sequence of curricula $\mathcal{Q}_1, \cdots, \mathcal{Q}_R$ that are determined using a pacing function ($\bullet$) and the (VQA) self-reported difficulty scores ($\bullet$). (right panel) TPCL introduces a task-specific difficulty measurer that 1) considers the distribution of all samples within the task (histogram) and 2) stabilises the scores by Optimal Transport-based consolidation over a $B$-length score history window.
  • Figure 3: Fixed Curriculum Tasks Order.
  • Figure 4: Task relatedness may explain the effectiveness of the linguistic curriculum. (left) Per-task transfer cost in the linguistic vs. random sequence. The transfer cost between a pair of tasks is inversely proportional to the overlap of their label sets; darker colours denote higher costs. (right) The total switching cost of the linguistic sequence is lower than that of random sequences, suggesting that task relatedness (through label overlap) in CL improves performance.
  • Figure 5: Loss distributions shift (horizontally) as training progresses. The distributions of losses for the question type "How many" at iterations 2 (blue) and 4 (red). As training progresses, the distributions shift to the left (towards zero). This creates regions of no overlap on the distribution support (e.g., the x-axis region between 7 and 8, where the red distribution is supported but the blue is not). This motivates the use of a geometrically-aware distributional metric such as Optimal Transport (khamis2024scalable).
  • ...and 3 more figures
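
The motivation given in the Figure 5 caption, that loss distributions with non-overlapping support call for a geometry-aware metric, can be illustrated with a small Python sketch. It is a toy under stated assumptions rather than the paper's difficulty measurer: `wasserstein_1d` computes the standard one-dimensional Wasserstein-1 distance between two empirical samples, `histogram_overlap` is a hypothetical bin-wise comparison added only for contrast, and the loss values are made up.

```python
from bisect import bisect_right
from collections import Counter


def wasserstein_1d(a, b):
    """Wasserstein-1 distance between two empirical 1D samples.

    Uses the identity W1 = integral of |F_a(x) - F_b(x)| dx, where F_a and
    F_b are the empirical CDFs, evaluated piecewise between sample points.
    """
    a, b = sorted(a), sorted(b)
    points = sorted(set(a) | set(b))
    total = 0.0
    for left, right in zip(points, points[1:]):
        cdf_a = bisect_right(a, left) / len(a)
        cdf_b = bisect_right(b, left) / len(b)
        total += abs(cdf_a - cdf_b) * (right - left)
    return total


def histogram_overlap(a, b, bin_width=1.0):
    """Shared probability mass of two histograms (1.0 means identical bins)."""
    hist_a = Counter(int(x // bin_width) for x in a)
    hist_b = Counter(int(x // bin_width) for x in b)
    return sum(
        min(hist_a[k] / len(a), hist_b[k] / len(b))
        for k in hist_a.keys() | hist_b.keys()
    )


# Hypothetical per-sample losses for one question type at three checkpoints.
losses_late = [0.1, 0.2, 0.3]
losses_near = [1.1, 1.2, 1.3]  # shifted by 1
losses_far = [5.1, 5.2, 5.3]   # shifted by 5

# Both comparisons share no histogram bins, so bin-wise overlap cannot tell
# a small shift from a large one.
print(histogram_overlap(losses_late, losses_near))
print(histogram_overlap(losses_late, losses_far))

# The transport distance grows with the size of the shift.
print(wasserstein_1d(losses_late, losses_near))
print(wasserstein_1d(losses_late, losses_far))
```

Because the transport distance depends on how far mass has to move along the loss axis, it stays informative even when the two histograms share no bins, which is the situation the caption describes.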