Reading Is Believing: Revisiting Language Bottleneck Models for Image Classification

Honori Udo; Takafumi Koshinaka

TL;DR

On a disaster image classification task, we show experimentally that a language bottleneck model combining a modern image captioner with a pre-trained language model can achieve classification accuracy exceeding that of black-box models.

Abstract

We revisit language bottleneck models as an approach to ensuring the explainability of deep learning models for image classification. Because converting images into language inevitably loses information, language bottleneck models have been considered less accurate than standard black-box models. Recent image captioners based on large-scale vision-and-language foundation models, however, can describe images in verbal detail with an accuracy that was previously believed to be unrealistic. On a disaster image classification task, we show experimentally that a language bottleneck model combining a modern image captioner with a pre-trained language model can achieve classification accuracy exceeding that of black-box models. We also demonstrate that a language bottleneck model and a black-box model appear to extract different features from images, and that fusing the two can create a synergistic effect, resulting in even higher classification accuracy.

Paper Structure (9 sections, 4 figures, 2 tables)

This paper contains 9 sections, 4 figures, 2 tables.

Introduction
Related Work
System Configuration
Image and Text Classifiers
Image Captioners
System Fusion
Experiments
Summary
Acknowledgment

Figures (4)

Figure 1: System configuration: a standard image-based classifier (left-hand side) and a text-based classifier combined with an image captioner (right-hand side).
Figure 2: Example images for different disaster types included in the CrisisNLP dataset (reproduced from CrisisNLP). The dataset also includes a "not disaster" type, which is not shown here.
Figure 3: Example results with an image to be classified as "hurricane."
Figure 4: Score-level fusion results using ViT-Base as the image-based classifier. The horizontal axis represents the fusion weight $w$ for the text-based classifier; $w=0$ and $w=1$ correspond to the image-based and text-based single-modal systems, respectively.
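
The score-level fusion referred to in Figure 4 can be illustrated with a minimal sketch. It assumes the fused score is a convex combination of the two classifiers' per-class scores, $s = (1-w)\,s_{\text{image}} + w\,s_{\text{text}}$, which is consistent with the caption ($w=0$ gives the image-based system and $w=1$ the text-based one) but is not confirmed here as the authors' exact rule. The function names and score values below are hypothetical.

```python
from typing import List, Sequence


def fuse_scores(image_scores: Sequence[float],
                text_scores: Sequence[float],
                w: float) -> List[float]:
    """Weighted sum of per-class scores (assumed fusion rule, not the paper's code)."""
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie in [0, 1]")
    if len(image_scores) != len(text_scores):
        raise ValueError("both classifiers must score the same classes")
    # w weights the text-based classifier; (1 - w) weights the image-based one.
    return [(1.0 - w) * s_img + w * s_txt
            for s_img, s_txt in zip(image_scores, text_scores)]


def predict(scores: Sequence[float]) -> int:
    """Return the index of the highest-scoring class."""
    return max(range(len(scores)), key=lambda i: scores[i])


# Hypothetical class scores over three classes (illustrative values only).
image_scores = [0.5, 0.3, 0.2]
text_scores = [0.2, 0.7, 0.1]

fused = fuse_scores(image_scores, text_scores, w=0.5)
predicted_class = predict(fused)
```

Sweeping `w` from 0 to 1 on held-out data and recording accuracy at each value would produce a curve of the kind the caption describes, under the same assumption about the fusion rule.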
