@Bench: Benchmarking Vision-Language Models for Human-centered Assistive Technology

Xin Jiang, Junwei Zheng, Ruiping Liu, Jiahang Li, Jiaming Zhang, Sven Matthiesen, Rainer Stiefelhagen

TL;DR

This work proposes a novel Assistive Technology (AT) model (@Model) that addresses all tasks simultaneously and can be expanded to more assistive functions for helping People with Visual Impairments (PVIs). By integrating multi-modal information, it exhibits outstanding performance across tasks and offers PVIs more comprehensive assistance.

Abstract

As Vision-Language Models (VLMs) advance, human-centered Assistive Technologies (ATs) for helping People with Visual Impairments (PVIs) are evolving into generalists, capable of performing multiple tasks simultaneously. However, benchmarking VLMs for ATs remains under-explored. To bridge this gap, we first create a novel AT benchmark (@Bench). Guided by a pre-design user study with PVIs, our benchmark includes the five most crucial vision-language tasks: Panoptic Segmentation, Depth Estimation, Optical Character Recognition (OCR), Image Captioning, and Visual Question Answering (VQA). In addition, we propose a novel AT model (@Model) that addresses all tasks simultaneously and can be expanded to more assistive functions for helping PVIs. By integrating multi-modal information, our framework exhibits outstanding performance across tasks and offers PVIs more comprehensive assistance. Extensive experiments demonstrate the effectiveness and generalizability of our framework.

Paper Structure

This paper contains 38 sections, 7 equations, 9 figures, and 7 tables.

Figures (9)

  • Figure 1: Overview of our Assistive Technology Model (@Model) and Benchmark (@Bench). @Model can perform vision-language tasks all at once, including Panoptic Segmentation, Depth Estimation, Image Captioning, Optical Character Recognition, and Visual Question Answering. All tasks of @Bench are selected by People with Visual Impairments (PVIs) to evaluate VLMs for AT.
  • Figure 2: Overall architecture of @Model. We propose task-based prompts to unify inputs and perform different tasks all at once.
  • Figure 3: Paradigms of multi-task methods. Our @Model incorporates task-specific prompts that unify all tasks at once with almost no additional parameters.
  • Figure 4: Examples of multi-task training results on 5 tasks. Given one image as input, our @Model can output all predictions.
  • Figure 5: Single-task and multi-task training performance (relative) against the specialized SoTA models on different tasks.
  • ...and 4 more figures