@Bench: Benchmarking Vision-Language Models for Human-centered Assistive Technology

Xin Jiang; Junwei Zheng; Ruiping Liu; Jiahang Li; Jiaming Zhang; Sven Matthiesen; Rainer Stiefelhagen

TL;DR

This work proposes a novel AT model (@Model) that addresses all tasks simultaneously and can be extended to further assistive functions for People with Visual Impairments (PVIs). By integrating multi-modal information, it performs strongly across tasks and offers PVIs more comprehensive assistance.

Abstract

As Vision-Language Models (VLMs) advance, human-centered Assistive Technologies (ATs) for helping People with Visual Impairments (PVIs) are evolving into generalists, capable of performing multiple tasks simultaneously. However, benchmarking VLMs for ATs remains under-explored. To bridge this gap, we first create a novel AT benchmark (@Bench). Guided by a pre-design user study with PVIs, our benchmark includes the five most crucial vision-language tasks: Panoptic Segmentation, Depth Estimation, Optical Character Recognition (OCR), Image Captioning, and Visual Question Answering (VQA). In addition, we propose a novel AT model (@Model) that addresses all tasks simultaneously and can be extended to further assistive functions for helping PVIs. By integrating multi-modal information, our framework performs strongly across tasks and offers PVIs more comprehensive assistance. Extensive experiments demonstrate the effectiveness and generalizability of our framework.

Paper Structure (38 sections, 7 equations, 9 figures, 7 tables)

Introduction
Related Work
Assistive Technologies for the Blind
Generalist Vision-Language Models
Benchmarks for Vision-Language Models
@Bench: Assistive Technology Benchmark
User-centered Study
Assistive Tasks
Efficiency-Performance Trade-off
@Model: Assistive Technology Model
Experiments
Comparison with Existing Generalist Models
Comparison with Specialized SoTA Models
Multi-task Training vs. Single-task Training
Efficiency-Performance Trade-off
...and 23 more sections

Figures (9)

Figure 1: Overview of our Assistive Technology Model (@Model) and Benchmark (@Bench). @Model can perform vision-language tasks all at once, including Panoptic Segmentation, Depth Estimation, Image Captioning, Optical Character Recognition, and Visual Question Answering. All tasks of @Bench are selected by People with Visual Impairments (PVIs) to evaluate VLMs for AT.
Figure 2: Overall architecture of @Model. We propose task-based prompts to unify inputs and perform different tasks all at once.
Figure 3: Paradigms of multi-task methods. Our @Model incorporates task-specific prompts that unify all tasks at once with almost no additional parameters.
Figure 4: Examples of multi-task training results on 5 tasks. Given one image as input, our @Model can output all predictions.
Figure 5: Single-task and multi-task training performance (relative) against the specialized SoTA models on different tasks.
...and 4 more figures
