Attribute Diversity Determines the Systematicity Gap in VQA

Ian Berlot-Attwell, Kumar Krishna Agrawal, A. Michael Carrell, Yash Sharma, Naomi Saphra

TL;DR

This work studies the systematicity gap in visual question answering: the performance difference between reasoning on previously seen and unseen combinations of object attributes. It suggests that the more distinct attribute type combinations are seen during training, the more systematic the resulting model can be.

Abstract

Although modern neural networks often generalize to new combinations of familiar concepts, the conditions that enable such compositionality have long been an open question. In this work, we study the systematicity gap in visual question answering: the performance difference between reasoning on previously seen and unseen combinations of object attributes. To test, we introduce a novel diagnostic dataset, CLEVR-HOPE. We find that the systematicity gap is not reduced by increasing the quantity of training data, but is reduced by increasing the diversity of training data. In particular, our experiments suggest that the more distinct attribute type combinations are seen during training, the more systematic we can expect the resulting model to be.

Paper Structure (20 sections, 27 figures, 20 tables)

This paper contains 20 sections, 27 figures, 20 tables.

Figures (27)

  • Figure 1: Example image-question pairs for the sub-dataset of CLEVR-HOPE corresponding to rubber cylinder. The test sets are in gray. The rubber cylinder combination is omitted visually and textually in the train split and the IID test splits and occurs only in the OOD splits; its occurrences are emphasized in this figure. The train and complex sets are of comparable visual and textual complexity to CLEVR. The minimal sets consist only of existence questions, checking whether a single object matches a given pair of attribute values.
  • Figure 2: Systematicity gap (difference between OOD and IID accuracy) on the complex test split, averaged by (HOP) diversity for 29 HOPs, each with 3 runs.
  • Figure 3: Box plots of minimal-OOD test set performance on all 29 HOPs. The average performance for each HOP is produced by averaging over 3 trials. The variation captured by this boxplot is from the difference in average performance between HOPs, rather than from the variation within the 3 trials.
  • Figure 4: Box plots of complex-OOD test set performance on all 29 HOPs. As in Figure 3, each HOP is individually averaged over 3 trials.
  • Figure 5: Average systematicity gap on complex examples (i.e., complex-OOD test accuracy minus complex-IID test accuracy) with 1 standard deviation; averaged over 3 runs on each of the 29 HOPs. The systematicity gap plateaus, suggesting that the performance drop when generalizing to unseen combinations does not improve with additional training data.
  • ...and 22 more figures
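
The captions above define the systematicity gap as OOD test accuracy minus IID test accuracy, aggregated over several runs per HOP. A minimal sketch of that arithmetic follows. The function names, the HOP names, and the accuracy values are hypothetical and chosen only for illustration, and pooling all runs before taking the mean and sample standard deviation is an assumption, not necessarily the aggregation the authors use.

```python
from statistics import mean, stdev


def systematicity_gap(iid_acc: float, ood_acc: float) -> float:
    """Gap as described in the captions: OOD accuracy minus IID accuracy.

    A negative value means the model does worse on unseen attribute combinations.
    """
    return ood_acc - iid_acc


def summarize_gaps(runs: dict[str, list[tuple[float, float]]]) -> tuple[float, float]:
    """Pool per-run gaps across all HOPs; return (mean, sample standard deviation).

    `runs` maps a HOP name to a list of (iid_acc, ood_acc) pairs, one per run.
    Pooling every run into a single list is an assumption made for this sketch.
    """
    gaps = [
        systematicity_gap(iid, ood)
        for per_hop in runs.values()
        for iid, ood in per_hop
    ]
    return mean(gaps), stdev(gaps)


# Hypothetical accuracies, NOT results from the paper.
toy_runs = {
    "rubber cylinder": [(0.90, 0.80), (0.92, 0.84)],
    "red cube": [(0.88, 0.82), (0.90, 0.86)],
}
avg_gap, gap_sd = summarize_gaps(toy_runs)
```

With real data, `toy_runs` would hold one entry per HOP (29 in the paper) with one accuracy pair per run (3 in the paper).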