HumanOmniV2: From Understanding to Omni-Modal Reasoning with Context

Qize Yang, Shimin Yao, Weixuan Chen, Shenghao Fu, Detao Bai, Jiaxing Zhao, Boyuan Sun, Bowen Yin, Xihan Wei, Jingren Zhou

TL;DR

HumanOmniV2 addresses the challenge of omni-modal reasoning by requiring the model to state its understanding of the global context before it reasons, guided by context and logical rewards judged by LLMs. It introduces IntentBench, a challenging audio-visual benchmark for human intentions and emotions, and a 24K video-audio training corpus to support cold-start and RL training. The approach performs strongly among open-source models on Daily-Omni, WorldSense, and IntentBench, suggesting that context-aware reasoning and structured output formats improve the integration of multimodal cues. Overall, the work advances context-driven omni-modal reasoning with a practical training pipeline and an evaluation suite tailored to complex social understanding tasks.

Abstract

With the rapid evolution of multimodal large language models, the capacity to deeply understand and interpret human intentions has emerged as a critical capability, which demands detailed and thoughtful reasoning. In recent studies, Reinforcement Learning (RL) has demonstrated potential in enhancing the reasoning capabilities of Large Language Models (LLMs). Nonetheless, the challenges associated with adapting RL to multimodal data and formats remain largely unaddressed. In this paper, we identify two issues in existing multimodal reasoning models: insufficient global context understanding and shortcut problems. Insufficient context understanding can happen when a model misinterprets multimodal context, resulting in incorrect answers. The shortcut problem occurs when the model overlooks crucial clues in multimodal inputs, directly addressing the query without considering the multimodal information. To tackle these issues, we emphasize the necessity for the model to reason with a clear understanding of the global context within multimodal inputs. This global context understanding can effectively prevent the model from overlooking key multimodal cues and ensure a thorough reasoning process. To ensure the accurate interpretation of multimodal context information, we implement a context reward judged by a large language model, alongside format and accuracy rewards. Additionally, to improve complex reasoning capability, we employ the LLM to assess the logical reward, determining whether the reasoning process successfully integrates multimodal information with logical methods. We also introduce a reasoning omni-modal benchmark, IntentBench, aimed at evaluating models in understanding complex human intentions and emotions. Our proposed method demonstrates advanced performance across multiple omni-modal benchmarks compared to other open-source omni-modal models.

Paper Structure

This paper contains 21 sections, 4 equations, 14 figures, 3 tables.

Figures (14)

  • Figure 1: Visualizations of the vanilla GRPO method applied in multimodal tasks. When the model is overconfident on questions, it tends to answer questions directly without considering the global context (left) or may overlook key multimodal inputs (right).
  • Figure 2: (a), (b), and (c) are examples from Social-IQ 2.0, MDPE, and EMER, respectively. (d) shows statistics of the curated test set from Social-IQ 2.0.
  • Figure 3: The reasoning path of our model on an example from Social-IQ 2.0. The model first clearly understands the context information of the video clip in the multi-person talking scenario; then it starts reasoning with the multimodal clues to precisely answer the question.
  • Figure 4: Illustration of our method. We use Qwen2.5-Omni-Thinker [xu2025qwen2] as our base model. For each training sample, we generate 8 completions and compute format and accuracy rewards with verifiable labels. Additionally, we assess reasoning-logical and context rewards by using an LLM as the judge, applying each of these rewards only to its corresponding tokens.
  • Figure 5: Visualization result of our method on IntentBench.
  • ...and 9 more figures
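
The Figure 4 caption describes sampling 8 completions per training sample and scoring each with format, accuracy, context, and logical rewards. A minimal sketch of how such scores could feed a GRPO-style group-relative advantage is given below. It is an illustration under stated assumptions, not the authors' implementation: the unweighted sum of the four rewards, the population standard deviation, the `eps` constant, and all score values are hypothetical, and the per-token reward masking mentioned in the caption is not modeled.

```python
from statistics import mean, pstdev

GROUP_SIZE = 8  # completions per training sample, as stated in the Figure 4 caption


def total_reward(format_r, accuracy_r, context_r, logical_r):
    """Combine the four reward signals into one scalar.

    Assumption: an unweighted sum. The source does not state how the
    rewards are weighted or combined.
    """
    return format_r + accuracy_r + context_r + logical_r


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of completions.

    Each completion's advantage is its reward minus the group mean,
    divided by the group standard deviation (population std is an
    assumption; eps avoids division by zero when all rewards match).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical scores for one group: (format, accuracy, context, logical).
# In the paper's setup, format and accuracy come from verifiable labels,
# while context and logical scores come from an LLM judge.
completions = [
    (1.0, 1.0, 0.9, 0.8),
    (1.0, 0.0, 0.4, 0.3),
    (1.0, 1.0, 0.6, 0.7),
    (0.0, 0.0, 0.2, 0.1),
    (1.0, 1.0, 0.8, 0.6),
    (1.0, 0.0, 0.7, 0.5),
    (1.0, 1.0, 0.5, 0.4),
    (0.0, 1.0, 0.3, 0.2),
]

rewards = [total_reward(*scores) for scores in completions]
advantages = group_relative_advantages(rewards)

for reward, advantage in zip(rewards, advantages):
    print(f"reward={reward:.2f}  advantage={advantage:+.3f}")
```

Completions that score above the group mean receive a positive advantage and those below receive a negative one, so a completion that answers correctly while ignoring the context would rank lower than one that answers correctly and also earns high context and logical scores.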