Open-Ended Multi-Modal Relational Reasoning for Video Question Answering
Haozheng Luo, Ruiyang Qin, Chenwei Xu, Guo Ye, Zening Luo
TL;DR
The paper presents a multi-modal robotic agent for video question answering, designed to assist visually impaired users by combining action-centric video recognition with transformer-based language reasoning. Its multi-model pipeline integrates R(2+1)D-based action detection, Whisper for speech recognition, and a Relation-Aware Self-Attention module that encodes relational cues, all running on a ROS-based Kobuki platform. Empirical results show a 2% to 3% improvement over baselines and a positive relationship between user trust and interaction efficiency. The work points toward practical, language-driven assistance in dynamic environments and identifies directions for future work, such as robotic manipulation and more advanced attention mechanisms.
Abstract
In this paper, we introduce a robotic agent designed to analyze its external environment and answer participants' questions. The agent's primary purpose is to assist individuals through language-based interaction in video-based scenes. Our method integrates video recognition and natural language processing models within the robotic agent. We investigate the crucial factors affecting human-robot interaction by examining issues that arise between participants and the robotic agent. Our experimental findings reveal a positive relationship between trust and interaction efficiency. Furthermore, our model achieves a 2% to 3% performance improvement over other benchmark methods.
