Can Vision-Language Models Answer Face to Face Questions in the Real-World?

Reza Pourreza, Rishit Dagli, Apratim Bhattacharyya, Sunny Panchal, Guillaume Berger, Roland Memisevic

TL;DR

This work addresses the challenge of real-time, face-to-face QA about live scenes by introducing the Qualcomm IVD dataset and benchmark for online situated audio-visual reasoning. It proposes a streaming baseline that combines ASR-based timing of when to answer with a Video-LMM backbone to generate responses from streaming inputs. Experiments across multiple LMMs reveal substantial gaps to human performance, with notable improvements from fine-tuning and audio-visual integration, though temporal reasoning and deictic references remain hard. The dataset and findings aim to spur development of true online AI assistants capable of natural, real-time interactions in the wild.

Abstract

AI models have made significant strides in recent years in their ability to describe and answer questions about real-world images. They have also made progress in the ability to converse with users in real time using audio input. This raises the question: have we reached the point where AI models, connected to a camera and microphone, can converse with users in real time about scenes and events that are unfolding live in front of the camera? This has been a long-standing goal in AI and is a prerequisite for real-world AI assistants and humanoid robots to interact with humans in everyday situations. In this work, we introduce a new dataset and benchmark, the Qualcomm Interactive Video Dataset (IVD), which allows us to assess the extent to which existing models can support these abilities, and to what degree these capabilities can be instilled through fine-tuning. The dataset is based on a simple question-answering setup, where users ask questions that the system has to answer, in real time, based on the camera and audio input. We show that existing models fall far behind human performance on this task, and we identify the main sources of the performance gap. However, we also show that for many of the required perceptual skills, fine-tuning on this form of data can significantly reduce this gap.

Paper Structure

This paper contains 16 sections, 6 figures, 7 tables.

Figures (6)

  • Figure 1: Temporal relationship between the end of the video and optimal answer timing. The horizontal axis represents seconds from the optimal time to answer to the end of the video.
  • Figure 2: Evaluations of the public and fine-tuned VideoLLaMA2.1-7B-AV [damonlpsg2024videollama2] in vision+audio and vision-only settings.
  • Figure 3: Comparing correctness of selected baseline LMMs across individual categories of Qualcomm IVD.
  • Figure A.1: Each image showcases a different video from our collection, demonstrating the substantial variation in visual scenarios captured within the dataset. These examples highlight the diversity of environments (indoor and outdoor settings), participants, objects, actions, lighting conditions, camera angles, and compositional elements present across the dataset.
  • Figure B.1: Examples of questions that GPT-4o refused to answer due to ResponsibleAIPolicyViolation.
  • ...and 1 more figure