HumanOmni-Speaker: Identifying Who said What and When

Detao Bai, Shimin Yao, Weixuan Chen, Xihan Wei, Zhiheng Ma

Abstract

While Omni-modal Large Language Models have made strides in joint sensory processing, they fundamentally struggle with a cornerstone of human interaction: deciphering complex, multi-person conversational dynamics to accurately answer "Who said what and when." Current models suffer from an "illusion of competence": they exploit visual biases in conventional benchmarks to bypass genuine cross-modal alignment, while relying on sparse, low-frame-rate visual sampling that destroys crucial high-frequency dynamics like lip movements. To shatter this illusion, we introduce Visual-Registered Speaker Diarization and Recognition (VR-SDR) and the HumanOmni-Speaker Benchmark. By strictly eliminating visual shortcuts, this rigorous paradigm demands true end-to-end spatio-temporal identity binding using only natural language queries. To overcome the underlying architectural perception gap, we propose HumanOmni-Speaker, powered by a Visual Delta Encoder. By sampling raw video at 25 fps and explicitly compressing inter-frame motion residuals into just 6 tokens per frame, it captures fine-grained visemes and speaker trajectories without triggering a catastrophic token explosion. Ultimately, HumanOmni-Speaker demonstrates strong multimodal synergy, natively enabling end-to-end lip-reading and high-precision spatial localization without intrusive cropping, and achieving superior performance across a wide spectrum of speaker-centric tasks.
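
To make the token budget in the abstract concrete, the short sketch below works out how many visual tokens the 25 fps stream would contribute at 6 tokens per frame. The function names and the 196-tokens-per-frame comparison figure (a 14x14 patch grid) are illustrative assumptions, not values taken from the paper; only the 25 fps rate and the 6 tokens per frame come from the abstract.

```python
def delta_tokens(duration_s: float, fps: int = 25, tokens_per_frame: int = 6) -> int:
    """Visual tokens for a clip sampled at `fps` with a fixed per-frame budget.

    The defaults (25 fps, 6 tokens per frame) are the figures quoted in the abstract.
    """
    frames = int(duration_s * fps)
    return frames * tokens_per_frame


def dense_tokens(duration_s: float, fps: int = 25, tokens_per_frame: int = 196) -> int:
    """Same count for a hypothetical dense encoder.

    196 tokens per frame (a 14x14 patch grid) is an assumption chosen
    for comparison only; it is not a number reported in the paper.
    """
    frames = int(duration_s * fps)
    return frames * tokens_per_frame


if __name__ == "__main__":
    clip_seconds = 10
    print(delta_tokens(clip_seconds))   # 250 frames * 6 tokens = 1500
    print(dense_tokens(clip_seconds))   # 250 frames * 196 tokens = 49000
```

Under these assumptions, one second of video costs 150 tokens in the compressed stream, which is why sampling at the full 25 fps need not overwhelm the language model's context.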

Paper Structure

This paper contains 17 sections, 2 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Speaker-centric examples in the HumanOmni-Speaker benchmark. (top) Visual-Registered Speaker Diarization and Recognition (VR-SDR); (bottom) Four atomic subtasks: Speech Recognition (SR), Speaker Verification (SV), Speaker Identification (SI), and Speaker Localization (SL). Our model unifies visual, audio, and text cues in a single framework.
  • Figure 2: HumanOmni-Speaker benchmark sample characteristics and statistics. (a) Samples with strong visual biases, which provide "recognition shortcuts". (b) Samples without visual biases, which require both acoustic and visual information for speaker identification. (c) Benchmark sample statistics overview.
  • Figure 3: Overview of the HumanOmni-Speaker architecture for human-centric speaking scenarios. It integrates text, audio, and visual inputs through a Text Tokenizer, an Audio Encoder, a Visual Base Encoder (1-2 fps), and a Visual Delta Encoder (25 fps). The Visual Delta Encoder (right) employs a ResNet-18 backbone, a Spatio-Temporal Vision Transformer (SVT), and a Transformer encoder, producing only 6 structured tokens per frame. All modality-specific tokens are aligned in a shared LLM decoder, enabling the model to reason about "Who", "When", and "What" in complex interactions.
  • Figure 4: The attention maps generated by Grad-CAM show that the Visual Delta Encoder successfully localizes and tracks the speaker's mouth.
  • Figure 5: The Progressive Training Pipeline of HumanOmni-Speaker.
  • ...and 1 more figures