A Probabilistic Jump-Diffusion Framework for Open-World Egocentric Activity Recognition

Sanjoy Kundu, Shanmukha Vellamcheti, Sathyanarayanan N. Aakur

TL;DR

This paper tackles open-world egocentric activity recognition by proposing Probabilistic Residual Search (ProbRes), a jump-diffusion framework that navigates a structured search space with commonsense priors and adaptive VLM-based refinement. ProbRes operates in Exploration, Exploitation, and Residual Refinement phases, combining $P_{\text{prior}}$, $P_{\text{likelihood}}$, and component scores to compute $a^* = \arg\max_{a \in \mathcal{S}} [P_{\text{search}}(a) + \lambda_a S_a + \lambda_o S_o]$, while progressively narrowing the candidate set and re-ranking via semantic alignment $v^T \phi(\cdot)$. The approach leverages ConceptNet-derived priors, VLM embeddings, and a structured search space to achieve state-of-the-art results on GTEA Gaze, GTEA Gaze+, EK100, and Charades-Ego, with substantial reductions in VLM query counts. The work introduces a taxonomy for openness levels (L0–L3), demonstrates efficiency gains, and outlines practical directions for scalable, interpretable open-world egocentric understanding in AI agents and robotics.
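The selection rule in the summary above can be illustrated with a small sketch. It assumes that $S_a$ and $S_o$ are action and object component scores for a candidate label, and that $P_{\text{search}}$, $S_a$, and $S_o$ have already been computed for each candidate in $\mathcal{S}$. The `Candidate` class, the `select_activity` function, the default weights, and every number below are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One activity label in the narrowed candidate set S (hypothetical container)."""
    label: str
    p_search: float   # P_search(a): score carried over from the search phases
    s_action: float   # S_a: assumed action-component score
    s_object: float   # S_o: assumed object-component score


def combined_score(c: Candidate, lambda_a: float, lambda_o: float) -> float:
    """P_search(a) + lambda_a * S_a + lambda_o * S_o for a single candidate."""
    return c.p_search + lambda_a * c.s_action + lambda_o * c.s_object


def select_activity(candidates, lambda_a: float = 0.5, lambda_o: float = 0.5) -> Candidate:
    """Return a* = argmax over the candidate set of the combined score."""
    if not candidates:
        raise ValueError("candidate set is empty")
    return max(candidates, key=lambda c: combined_score(c, lambda_a, lambda_o))


# Illustrative values only; they are not results from the paper.
CANDIDATES = [
    Candidate("pour milk", p_search=0.40, s_action=0.30, s_object=0.60),
    Candidate("take cup", p_search=0.50, s_action=0.20, s_object=0.20),
    Candidate("open fridge", p_search=0.45, s_action=0.10, s_object=0.10),
]

best = select_activity(CANDIDATES)
print(best.label)
```

With these made-up numbers, "take cup" has the highest search score on its own, but the weighted component scores move "pour milk" to the top, which is the kind of re-ranking effect the combined objective is meant to allow.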

Abstract

Open-world egocentric activity recognition poses a fundamental challenge due to its unconstrained nature, requiring models to infer unseen activities from an expansive, partially observed search space. We introduce ProbRes, a Probabilistic Residual search framework based on jump-diffusion that efficiently navigates this space by balancing prior-guided exploration with likelihood-driven exploitation. Our approach integrates structured commonsense priors to construct a semantically coherent search space, adaptively refines predictions using Vision-Language Models (VLMs), and employs a stochastic search mechanism to efficiently locate high-likelihood activity labels while minimizing exhaustive enumeration. We systematically evaluate ProbRes across multiple openness levels (L0–L3), demonstrating its adaptability to increasing search space complexity. In addition to achieving state-of-the-art performance on benchmark datasets (GTEA Gaze, GTEA Gaze+, EPIC-Kitchens, and Charades-Ego), we establish a clear taxonomy for open-world recognition, delineating the challenges and methodological advancements necessary for egocentric activity understanding.

Paper Structure

This paper contains 8 sections, 4 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Taxonomy of Openness in Egocentric Activity Recognition. We define four levels of openness based on search space constraints: L0 (fixed activity set), L1 (known atomic concepts, unknown compositions), L2 (known domain, inferred activities), and L3 (fully unconstrained search).
  • Figure 2: (a) ProbRes framework for open-world egocentric activity recognition. The search space is structured using ConceptNet priors, enabling guided exploration. The model iteratively refines candidates via likelihood estimation, balancing exploration and exploitation, followed by local refinement and re-ranking. (b) Search trajectory visualization, showing how ProbRes navigates likelihood regions to reach high-confidence predictions near the ground truth.
  • Figure 3: Qualitative visualization of the ProbRes search trajectory across its exploration, exploitation, and final refinement phases.