Open-Ended Multi-Modal Relational Reasoning for Video Question Answering

Haozheng Luo, Ruiyang Qin, Chenwei Xu, Guo Ye, Zening Luo

TL;DR

The paper presents a multi-modal robotic agent for video question answering designed to assist visually impaired users by combining action-centric video recognition with transformer-based language reasoning. It employs a multi-model pipeline that integrates R(4+1)D-based action detection, Whisper for speech, and a Relation-Aware Self-Attention module to encode relational cues, all within a ROS-based Kobuki platform. Empirical results show a 2%–3% improvement over baselines and reveal a positive link between user trust and interaction efficiency in human-robot interaction. The work demonstrates a path toward practical, language-driven assistance in dynamic environments and outlines avenues for future enhancements such as robotic manipulation and advanced attention mechanisms.

Abstract

In this paper, we introduce a robotic agent specifically designed to analyze external environments and address participants' questions. The primary focus of this agent is to assist individuals using language-based interactions within video-based scenes. Our proposed method integrates video recognition technology and natural language processing models within the robotic agent. We investigate the crucial factors affecting human-robot interactions by examining pertinent issues arising between participants and robot agents. Methodologically, our experimental findings reveal a positive relationship between trust and interaction efficiency. Furthermore, our model demonstrates a 2% to 3% performance enhancement in comparison to other benchmark methods.

Paper Structure

This paper contains 17 sections, 3 equations, 8 figures, 1 table.

Figures (8)

  • Figure 1: The Kobuki robot used in the paper
  • Figure 2: Example of bounding box detection in the CATER dataset
  • Figure 3: 5D convolution (a) vs. (4+1)D convolution (b) vs. R(4+1)D (c). A 5D convolution employs a filter of dimensions $t \times c \times h \times h \times w \times l$. In contrast, a (4+1)D convolutional block separates the computation into a 4D spatial convolution followed by a 1D temporal convolution. We select the $M_i$ 4D filters to ensure that the (4+1)D block corresponds to the complete 5D convolution block. For R(4+1)D, we utilize ResNet during the spatial 4D convolution phase of (4+1)D to obtain the object relationships.
  • Figure 4: Examples of bounding box detection of objects and their actions by the robot. Our robot identifies three stationary toys positioned on the ground.
  • Figure 5: Right: the robot's camera capture interface. Left: an example of the robot capturing video of two small robots moving.
  • ...and 3 more figures