GraphEnet: Event-driven Human Pose Estimation with a Graph Neural Network

Gaurvi Goyal, Pham Cong Thuong, Arren Glover, Masayoshi Mizuno, Chiara Bartolozzi

TL;DR

GraphEnet introduces a novel graph neural network framework for 2D human pose estimation from asynchronous event camera data. It builds a sparse input graph using a line-based intermediate representation (SCARF) and processes it with stacked SplineConv layers, culminating in a confidence-weighted pooling mechanism to predict joint positions. The approach achieves real-time performance (≈250 Hz, ~4 ms latency) with a modest accuracy trade-off relative to RGB-based methods, validated on eH36M and DHP19 datasets, and supported by extensive ablations. The work demonstrates the potential of sparse event-based graphs to deliver high-frequency pose estimation with reduced computational cost and energy, highlighting avenues for future hierarchical or asynchronous graph updates.

Abstract

Human Pose Estimation is a crucial module in human-machine interaction applications and, especially since the rise of deep learning, robust methods are available to consumers using RGB cameras and commercial GPUs. Event-based cameras, on the other hand, have gained popularity in the vision research community for their low latency and low energy consumption, which make them ideal for applications where those resources are constrained, such as portable electronics and mobile robots. In this work we propose a Graph Neural Network, GraphEnet, that leverages the sparse nature of event camera output, with an intermediate line-based event representation, to estimate the 2D Human Pose of a single person at a high frequency. The architecture incorporates a novel offset vector learning paradigm with confidence-based pooling to estimate the human pose. This is the first work that applies Graph Neural Networks to event data for Human Pose Estimation. The code is open-source at https://github.com/event-driven-robotics/GraphEnet-NeVi-ICCV2025.
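
As a rough illustration of the offset-vector idea with confidence-based pooling mentioned in the abstract, the sketch below lets every graph node cast a vote for each joint (node position plus a predicted offset) and then averages the votes with per-node confidence weights. This is a minimal sketch and not the authors' implementation: the function name `pool_joints`, the array shapes, and the softmax normalisation of the confidences over nodes are all assumptions made for illustration.

```python
import numpy as np

def pool_joints(node_xy, offsets, conf_logits):
    """Confidence-weighted pooling of per-node joint votes (illustrative only).

    node_xy:     (N, 2) node positions in pixels
    offsets:     (N, J, 2) predicted offset from each node to each joint
    conf_logits: (N, J) unnormalised confidence of each node for each joint
    returns:     (J, 2) pooled joint positions
    """
    node_xy = np.asarray(node_xy, dtype=float)
    offsets = np.asarray(offsets, dtype=float)
    conf_logits = np.asarray(conf_logits, dtype=float)

    # Each node votes for every joint: its own position plus the offset.
    votes = node_xy[:, None, :] + offsets                  # (N, J, 2)

    # Assumption: confidences are normalised with a softmax over nodes,
    # so the weights for each joint sum to one.
    shifted = conf_logits - conf_logits.max(axis=0, keepdims=True)
    weights = np.exp(shifted)
    weights /= weights.sum(axis=0, keepdims=True)          # (N, J)

    # Weighted average of the votes for each joint.
    return (weights[:, :, None] * votes).sum(axis=0)       # (J, 2)
```

With this formulation, nodes that lie far from a joint can still contribute, but a low confidence keeps an unreliable vote from pulling the estimate away.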

Paper Structure

This paper contains 25 sections, 6 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: The proposed GraphEnet is an initial investigation into using GNNs for human pose estimation with event cameras; it runs 2.5$\times$ faster than the previous state of the art with a minor performance drop. The chance baseline is calculated assuming a random pixel in the image is chosen for each joint position.
  • Figure 2: Pipeline of GraphEnet. The input is the continuous, asynchronous stream of events from an event camera. Line detection is performed on a velocity-invariant event accumulation, such that line segments are detected in each grid placed over the surface. The input graph is formed by connecting nearby detected line end-points. The GNN processes the graph, aggregating node features to represent each joint appearance. The final pooling layer extracts joint positions given learnt vector offsets from each node.
  • Figure 3: Algorithm overview to extract line segment features from raw events in real time [ikura2025iccv-nevi]. (a) Lattice structure consists of active and (overlapping) inactive regions. (b) Each block stores active/inactive events together based on a FIFO principle. (c) Image-like representations can be generated by adding pixel intensities to all active events. (d) Line-fit score is calculated with (e) line occupancy ratio and (f) effective event ratio.
  • Figure 4: Graph building and joint regression: (a) missing edges in the graph are added if the end-points are nearby (Equation \ref{eq:augment}), and (b) offset vectors are learned that point from node positions to expected joint positions, given a learned confidence weighting.
  • Figure 5: Comparison with state-of-the-art using a variation on PCK thresholds on the (a) eH36M dataset and (b) DHP19 dataset. The eH36M results are clustered by body-part type in (c).
  • ...and 2 more figures
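
The graph-building step described in the captions of Figures 2 and 4 (connecting nearby line end-points) can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the paper's exact rule: the referenced augmentation equation is not reproduced on this page, so a plain Euclidean distance threshold (`radius`) is assumed here, as is the choice to treat each end-point as a node and each detected segment as an edge. The function name `build_graph` is hypothetical.

```python
import numpy as np

def build_graph(segments, radius):
    """Build an undirected graph from detected line segments (illustrative only).

    segments: (S, 4) array of rows [x1, y1, x2, y2]
    radius:   assumed distance threshold for linking nearby end-points
    returns:  nodes (2S, 2) and a sorted list of edges (i, j) with i < j
    """
    segments = np.asarray(segments, dtype=float)
    # End-points 2k and 2k+1 belong to segment k.
    nodes = segments.reshape(-1, 2)

    # Assumption: every detected segment contributes an edge between
    # its own two end-points.
    edges = {(2 * k, 2 * k + 1) for k in range(len(segments))}

    # Add the missing edges between end-points that lie close together.
    dist = np.linalg.norm(nodes[:, None, :] - nodes[None, :, :], axis=-1)
    near_i, near_j = np.nonzero(np.triu(dist <= radius, k=1))
    edges.update(zip(near_i.tolist(), near_j.tolist()))

    return nodes, sorted(edges)
```

For example, two nearly touching segments such as `[0, 0, 10, 0]` and `[11, 0, 20, 0]` would be joined through their adjacent end-points when `radius` is 2, while an isolated segment elsewhere in the image keeps only its own edge.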