Hospitality-VQA: Decision-Oriented Informativeness Evaluation for Vision-Language Models

Jeongwoo Lee, Baek Duhyeong, Eungyeol Han, Soyeon Shin, Gukin Han, Seungduk Kim, Jaehyun Jeon, Taewoo Jeong

TL;DR

This work investigates how well VLMs can perform visual question answering (VQA) about hotel and facility images that are central to consumer decision-making, and introduces Informativeness as a formal framework to quantify how much hospitality-relevant information an image-question pair provides.

Abstract

Recent advances in Vision-Language Models (VLMs) have demonstrated impressive multimodal understanding in general domains. However, their applicability to decision-oriented domains such as hospitality remains largely unexplored. In this work, we investigate how well VLMs can perform visual question answering (VQA) about hotel and facility images that are central to consumer decision-making. While many existing VQA benchmarks focus on factual correctness, they rarely capture what information users actually find useful. To address this, we first introduce Informativeness as a formal framework to quantify how much hospitality-relevant information an image-question pair provides. Guided by this framework, we construct a new hospitality-specific VQA dataset that covers various facility types, where questions are specifically designed to reflect key user information needs. Using this benchmark, we conduct experiments with several state-of-the-art VLMs, revealing that VLMs are not intrinsically decision-aware: key visual signals remain underutilized, and reliable informativeness reasoning emerges only after modest domain-specific finetuning.

Paper Structure (33 sections, 7 figures, 5 tables)

Figures (7)

  • Figure 1: Comparison between general VQA (top) and decision-oriented Hospitality-VQA (bottom).
  • Figure 2: Bad vs. Good examples for each informativeness dimension. Bad images lack decision-relevant visual cues, resulting in low spatial legibility, weak activity affordance, obstructed or unbalanced contextual openness, or poor geometric completeness. Good images exhibit high spatial legibility, clear activity affordances, well-balanced contextual openness, and strong geometric completeness, enabling more reliable assessment of hospitality informativeness.
  • Figure 3: The formal annotation schema used in Hospitality-VQA. We record hierarchical facility labels and quantify visual utility across the four informativeness dimensions.
  • Figure 4: Dataset statistics of Hospitality-VQA. (a) Distribution of main facility categories. (b-e) Distributions of the four informativeness axes, reflecting characteristic properties of professionally curated hospitality listing images.
  • Figure 5: General structure of instruction-answer templates shared across all evaluation tasks.
  • ...and 2 more figures