Navigating Hallucinations for Reasoning of Unintentional Activities
Shresth Grover, Vibhav Vineet, Yogesh S Rawat
TL;DR
The paper tackles zero-shot reasoning about why an activity in a video transitions from intentional to unintentional, a task on which existing large multimodal models often hallucinate. It introduces the Dream of Thoughts (DoT) prompting framework, which generates multiple candidate descriptions, goals, and reasoning steps (Dream of Paths) and uses an MCQ-based Path Selection step to navigate toward accurate explanations. Three evaluation metrics ($rm_{MCQ}$, $rm_{FIB}$, and $rm_{LLM}$) assess high- and low-level reasoning against ground-truth annotations on the OOPs and UCF-Crimes datasets. Empirical results show that DoT outperforms standard prompting and Chain-of-Thought approaches while reducing hallucinations, yielding more context-grounded reasoning in zero-shot multimodal scenarios. The work provides a framework for reasoning about real-world unintentional events, with implications for safety, surveillance, and autonomous systems.
Abstract
In this work, we present a novel task of understanding unintentional human activities in videos. We formalize this problem as a reasoning task in a zero-shot setting: given a video of an unintentional activity, we want to know why it transitioned from intentional to unintentional. We first evaluate the effectiveness of current state-of-the-art Large Multimodal Models on this reasoning task and observe that they suffer from hallucination. We then propose a novel prompting technique, termed Dream of Thoughts (DoT), which allows the model to navigate through hallucinated thoughts to achieve better reasoning. To evaluate performance on this task, we also introduce three specialized metrics designed to quantify the model's reasoning capability. We perform our experiments on two datasets, OOPs and UCF-Crimes, and our findings show that the DoT prompting technique outperforms standard prompting while minimizing hallucinations.
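To make the generate-then-select idea concrete, the following is a minimal Python sketch of the pipeline as the TL;DR describes it: several candidates are sampled at each stage (description, goal, reasoning), and a multiple-choice question is used to pick one before moving on. It is an illustration under stated assumptions, not the authors' implementation. The `Model` callable, the function names, the prompt wording, and the fallback to the first option are all hypothetical, and a real large multimodal model would also receive the video frames, which are omitted here.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical interface: a text prompt goes in, the model's text reply comes out.
# A real multimodal model would also take video frames as input.
Model = Callable[[str], str]

LETTERS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def dream_candidates(model: Model, prompt: str, n: int) -> List[str]:
    """Sample n candidate answers for the same prompt."""
    return [model(f"{prompt}\n(candidate {i + 1})") for i in range(n)]


def select_by_mcq(model: Model, question: str, options: Sequence[str]) -> str:
    """Pose the candidates as a multiple-choice question and return the pick."""
    if not options or len(options) > len(LETTERS):
        raise ValueError("need between 1 and 26 options")
    listing = "\n".join(f"{LETTERS[i]}. {opt}" for i, opt in enumerate(options))
    reply = model(f"{question}\n{listing}\nAnswer with one letter.")
    choice = reply.strip()[:1].upper()
    index = LETTERS.find(choice) if choice else -1
    # Assumed fallback: keep the first option if the reply is not a valid letter.
    return options[index] if 0 <= index < len(options) else options[0]


def dream_of_thoughts(model: Model, n: int = 3) -> Dict[str, str]:
    """Run the three stages, selecting one candidate per stage."""
    description = select_by_mcq(
        model,
        "Which description best matches the video?",
        dream_candidates(model, "Describe what happens in the video.", n),
    )
    goal = select_by_mcq(
        model,
        f"Description: {description}\nWhich goal was being pursued?",
        dream_candidates(model, f"Description: {description}\nState the intended goal.", n),
    )
    reasoning = select_by_mcq(
        model,
        f"Description: {description}\nGoal: {goal}\nWhich explanation is best supported?",
        dream_candidates(
            model,
            f"Description: {description}\nGoal: {goal}\n"
            "Explain why the activity became unintentional.",
            n,
        ),
    )
    return {"description": description, "goal": goal, "reasoning": reasoning}
```

Each stage conditions on the choice made in the previous one, so an implausible candidate can be discarded before it influences later steps. Any callable that maps a prompt string to a reply string can be passed as `model`, which makes the sketch easy to try with a stub before wiring in a real model.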
