Image-to-Text Translation for Interactive Image Recognition: A Comparative User Study with Non-Expert Users

Wataru Kawabe; Yusuke Sugano

TL;DR

This study investigates whether image-to-text translation can overcome the task-definition limitations of conventional classification-centric interactive machine learning (IML) for non-experts. It implements two prototypes, a CNN+Transformer-based image-to-text system and an equivalently structured image classification baseline, and compares them in a multi-task user study with non-expert users. The findings indicate that text output enables richer, and sometimes more abstract, task descriptions and can yield finer-grained annotations, while usability remains comparable to the classification approach; however, semantic understanding and annotation efficiency remain significant challenges. The work highlights the potential of natural language as a flexible interface for IML and points to future work on more efficient natural-language-based interactions and on backend architectures capable of handling diverse recognition tasks.

Abstract

Interactive machine learning (IML) allows users to build their custom machine learning models without expert knowledge. While most existing IML systems are designed with classification algorithms, they sometimes oversimplify the capabilities of machine learning algorithms and restrict the user's task definition. On the other hand, as recent large-scale language models have shown, natural language representation has the potential to enable more flexible and generic task descriptions. Models that take images as input and output text have the potential to represent a variety of tasks by providing appropriate text labels for training. However, the effect of introducing text labels to IML system design has never been investigated. In this work, we aim to investigate the difference between image-to-text translation and image classification for IML systems. Using our prototype systems, we conducted a comparative user study with non-expert users, where participants solved various tasks. Our results demonstrate the underlying difficulty for users in properly defining image recognition tasks while highlighting the potential and challenges of interactive image-to-text translation systems.

Paper Structure (19 sections, 10 figures, 1 table)

Introduction
Related Work
Interactive Machine Learning
Image-to-Text Translation
Design of Interactive Image Recognition Systems
Interactive Image-to-text Translation
Implementation Details
Interactive Image Classification Baseline
User Study
Image Recognition Tasks
Procedure
Results
Discussions
Key Findings
Challenges in Classification-based Design
...and 4 more sections

Figures (10)

Figure 1: The goal of this work is to investigate a design of interactive image recognition systems based on text output. Using a novel interactive image-to-text translation framework, we analyze whether such a design can address the limitations of classification-based systems.
Figure 2: GUI overview of our interactive image-to-text translation system. 1) Users upload images via the upload button (A), and the images are displayed in the right panel (B). 2) Users enter sentences into the text boxes (C) below selected images and click the training button (D) to update the model. 3) The inference button (E) shows inference results on all images. The topic words panel (F) shows the words that appear frequently in the annotations or inference results. Users can delete all uploaded images with the reset button (G), and individual images with the delete button (H).
Figure 3: The architecture of the image-to-text translation model consists of a CNN-based image encoder and a transformer-based text decoder. The encoder module takes the input image and encodes its content into a feature tensor. The decoder module works recursively to decode a sentence from the feature tensor one word at a time.
Figure 4: Overview of the image classification-based system.
Figure 5: The proportion of task trials in the detection category that introduced abstract categories in the annotations.
...and 5 more figures
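
The word-by-word decoding described in the Figure 3 caption can be illustrated with a minimal sketch of a greedy decoding loop. This is not the authors' implementation: the names greedy_decode and toy_next_word, the <bos>/<eos> markers, and the max_len cap are assumptions made for illustration, and the toy predictor simply replays a fixed sentence where a real system would call a trained transformer decoder on the encoder's feature tensor.

```python
from typing import Callable, List, Sequence

# Hypothetical special tokens marking sentence start and end.
BOS, EOS = "<bos>", "<eos>"

# A predictor maps (image features, words generated so far) to the next word.
NextWord = Callable[[Sequence[float], List[str]], str]


def greedy_decode(features: Sequence[float],
                  next_word: NextWord,
                  max_len: int = 20) -> List[str]:
    """Decode a sentence one word at a time, feeding each output back in."""
    words = [BOS]
    for _ in range(max_len):
        word = next_word(features, words)
        if word == EOS:
            break
        words.append(word)
    return words[1:]  # drop the start marker


def toy_next_word(features: Sequence[float], prefix: List[str]) -> str:
    """Stand-in for a trained decoder (assumption): replays a fixed sentence.

    It ignores the image features entirely; a real decoder would attend to them.
    """
    sentence = ["two", "dogs", "on", "grass", EOS]
    position = len(prefix) - 1  # number of words generated so far
    return sentence[position] if position < len(sentence) else EOS


if __name__ == "__main__":
    print(" ".join(greedy_decode([0.1, 0.2], toy_next_word)))
```

The loop stops either when the predictor emits the end marker or when max_len words have been produced, which keeps decoding bounded even if the model never predicts an end of sentence.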
