Navigating Hallucinations for Reasoning of Unintentional Activities

Shresth Grover, Vibhav Vineet, Yogesh S Rawat

TL;DR

The paper tackles zero-shot reasoning about the transition from intentional to unintentional actions in videos, a task where existing large multimodal models often hallucinate. It introduces the Dream of Thoughts (DoT) prompting framework, which generates multiple candidate descriptions, goals, and reasoning steps (Dream of Paths) and uses an MCQ-based Path Selection step to navigate toward accurate explanations. Three evaluation metrics ($rm_{MCQ}$, $rm_{FIB}$, and $rm_{LLM}$) assess high- and low-level reasoning against ground-truth annotations on the OOPs and UCF-Crimes datasets. Empirical results show that DoT outperforms standard prompting and Chain-of-Thought approaches while reducing hallucinations. The work provides a framework for reasoning about real-world unintentional events, with implications for safety, surveillance, and autonomous systems.

Abstract

In this work we present a novel task of understanding unintentional human activities in videos. We formalize this problem as a reasoning task in a zero-shot scenario, where, given a video of an unintentional activity, we want to know why it transitioned from intentional to unintentional. We first evaluate the effectiveness of current state-of-the-art Large Multimodal Models on this reasoning task and observe that they suffer from hallucination. We further propose a novel prompting technique, termed Dream of Thoughts (DoT), which allows the model to navigate through hallucinated thoughts to achieve better reasoning. To evaluate performance on this task, we also introduce three specialized metrics designed to quantify the model's reasoning capability. We perform our experiments on two datasets, OOPs and UCF-Crimes, and our findings show that the DoT prompting technique outperforms standard prompting while minimizing hallucinations.
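
The generate-then-select idea summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the callables `generate` and `select`, the prompt wording, and the default `k` are all hypothetical placeholders standing in for calls to a video-language model. The sketch only assumes what the summary states, namely that several candidates are produced for each of three stages (description, goal, reasoning) and that one candidate per stage is picked through a multiple-choice query.

```python
from typing import Callable, Dict, List

# Hypothetical interfaces, not taken from the paper:
#   generate(prompt, k)       -> k candidate texts sampled from a video-language model
#   select(question, options) -> index of the option the model picks in an MCQ query
Generate = Callable[[str, int], List[str]]
Select = Callable[[str, List[str]], int]

# The three stages named in the summary.
STAGES = ("description", "goal", "reasoning")


def dream_of_thoughts(generate: Generate, select: Select, k: int = 3) -> Dict[str, str]:
    """Pick one candidate per stage, feeding earlier choices into later prompts."""
    context = ""
    chosen: Dict[str, str] = {}
    for stage in STAGES:
        # Sample several candidate thoughts for this stage.
        prompt = f"{context}Propose a {stage} for the video."
        candidates = generate(prompt, k)
        # Present the candidates as MCQ options and keep the selected one.
        idx = select(f"Which {stage} best matches the video?", candidates)
        if not 0 <= idx < len(candidates):
            raise ValueError(f"select returned invalid option index {idx}")
        chosen[stage] = candidates[idx]
        # Carry the selected thought forward as context for the next stage.
        context += f"{stage}: {candidates[idx]}\n"
    return chosen
```

Passing the model calls in as functions keeps the sketch independent of any particular model API; in practice both callables would wrap the same multimodal model together with the input video.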

Paper Structure (23 sections, 2 equations, 9 figures, 4 tables, 1 algorithm)

Figures (9)

  • Figure 1: Overview of the proposed Dream of Thoughts framework: The left figure shows an overview of the three-step process with all the possible paths generated by the Large Video Language Model using the video and provided prompts. The right figure describes the Dream of Paths mechanism for generating thoughts to cover the most probable options and the Path Selection mechanism for navigating through the best possible options.
  • Figure 2: Qualitative evaluations: We show some samples for qualitative analysis of the proposed DoT prompting compared with CoT and standard prompting. The first row shows examples from the OOPs dataset and the second row shows examples from the UCF-Crimes dataset.
  • Figure 3: Distribution of cosine similarity between the ground truth and the outputs of DoT and of the basic prompt.
  • Figure 4: Effect of number of options: Variation of $p(x=\mathrm{ans} \mid O)$ on the reasoning task posed as an MCQ-style query, with a varying number of options present in the MCQ, where $p(x=\mathrm{ans} \mid O) = 1$ if $rm_{MCQ} \geq 0.8$ and $p(x=\mathrm{ans} \mid O) = 0$ otherwise. Here $O$ refers to the options presented in the MCQ.
  • Figure 5: Analyzing number of trials: Variation of $p(\mathrm{ans} \in x \mid n)$ on the reasoning task posed as an MCQ-style query, where $n$ is the number of times the prompt has been evaluated using the LMM and $x$ is the set of $n$ outputs obtained from the LMM.
  • ...and 4 more figures
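
The thresholded quantity in the Figure 4 caption can be made concrete with a short sketch. Only the 0.8 threshold on the MCQ metric comes from the caption. Averaging the indicator over videos, and treating a set of trials as successful when any single output passes the same threshold (one reading of the Figure 5 quantity), are assumptions made here for illustration. The function names are hypothetical.

```python
from typing import Sequence

# Threshold on the MCQ metric, as stated in the Figure 4 caption.
THRESHOLD = 0.8


def is_correct(rm_mcq: float, threshold: float = THRESHOLD) -> int:
    """Indicator from the Figure 4 caption: 1 if the score reaches the threshold, else 0."""
    return 1 if rm_mcq >= threshold else 0


def mean_correct(scores: Sequence[float]) -> float:
    """Assumption: aggregate the indicator by averaging it over a set of videos."""
    if not scores:
        raise ValueError("scores must not be empty")
    return sum(is_correct(s) for s in scores) / len(scores)


def any_correct(trial_scores: Sequence[float]) -> int:
    """Assumption: n trials count as a hit if at least one output passes the threshold."""
    return 1 if any(is_correct(s) for s in trial_scores) else 0
```

Under this reading, running more trials can only keep or raise `any_correct` for a given video, because each extra output adds another chance to pass the threshold.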