QORT-Former: Query-optimized Real-time Transformer for Understanding Two Hands Manipulating Objects

Elkhan Ismayilzada, MD Khalequzzaman Chowdhury Sayem, Yihalem Yimolal Tiruneh, Mubarrat Tajoar Chowdhury, Muhammadjon Boboev, Seungryul Baek

TL;DR

QORT-Former introduces a real-time Transformer framework for 3D pose estimation of two hands and an object, addressing the computational bottlenecks of prior methods by constraining the model to 108 queries and a single decoder. It semantically divides queries into left-hand, right-hand, and object queries and enriches them with contact-map features, while a three-step decoder update co-optimizes image and query features to maintain high accuracy. The approach achieves state-of-the-art pose and interaction-recognition performance on the H2O and FPHA datasets while delivering real-time speed (53.5 FPS on an RTX 3090TI). This combination of efficiency and accuracy makes hand-object pose estimation more practical for AR/VR and HCI applications, and ablation studies support its design choices.

Abstract

Significant advancements have been achieved in understanding the poses and interactions of two hands manipulating an object. The emergence of augmented reality (AR) and virtual reality (VR) technologies has heightened the demand for real-time performance in these applications. However, current state-of-the-art models often achieve promising results at the expense of substantial computational overhead. In this paper, we present a query-optimized real-time Transformer (QORT-Former), the first Transformer-based real-time framework for 3D pose estimation of two hands and an object. We first limit the number of queries and decoders to meet the efficiency requirement. Given a limited number of queries and decoders, we propose to optimize the queries that are taken as input to the Transformer decoder to secure better accuracy: (1) we divide queries into three types (a left hand query, a right hand query and an object query), and we enhance query features (2) by using the contact information between hands and an object and (3) by using a three-step update of enhanced image and query features with respect to one another. With the proposed methods, we achieve real-time pose estimation performance using just 108 queries and 1 decoder (53.5 FPS on an RTX 3090TI GPU). Our method also excels in accuracy, surpassing state-of-the-art results on the H2O dataset by 17.6% (left hand), 22.8% (right hand), and 27.2% (object), as well as on the FPHA dataset by 5.3% (right hand) and 10.4% (object). Additionally, it sets the state of the art in interaction recognition, maintaining real-time efficiency with an off-the-shelf action recognition module.
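To put the reported throughput in perspective, the short sketch below converts frames per second into an average per-frame time budget. The 53.5 FPS figure comes from the abstract; the 30 FPS reference threshold and the function name `frame_budget_ms` are illustrative assumptions, not values or code from the paper.

```python
def frame_budget_ms(fps: float) -> float:
    """Return the average time available per frame, in milliseconds."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


reported_fps = 53.5   # from the abstract (measured on an RTX 3090TI)
reference_fps = 30.0  # assumed real-time reference rate, not from the paper

reported_ms = frame_budget_ms(reported_fps)
reference_ms = frame_budget_ms(reference_fps)

print(f"{reported_fps} FPS -> {reported_ms:.1f} ms per frame")
print(f"{reference_fps} FPS -> {reference_ms:.1f} ms per frame")
```

Under these assumptions, the reported speed corresponds to roughly 18.7 ms per frame, compared with about 33.3 ms at the assumed 30 FPS reference.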

Paper Structure

This paper contains 11 sections, 4 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Comparison with competitive state-of-the-art algorithms [cho2023transformer, tekin2019h+, aboukhadra2022thor, wang2023interacting, hasson2020leveraging] on the task of pose estimation for two hands and an object, measured on an RTX 3090TI GPU. Even with the Transformer architecture, we achieved the fastest speed (53.5 FPS) while obtaining the best accuracy among the methods.
  • Figure 2: Our architecture begins by extracting a multi-scale feature $\mathbf{f}$ from an image using ResNet-50 [he2016deep], which is then refined into $\mathbf{f}'$ by our feature decoder. We propose queries aligned with hand and object locations, incorporating contact map features, while auxiliary queries capture background details. In the QORT Transformer decoder, the enhanced and query features undergo three steps: 1) in the Enhanced Feature Update Block, cross-attention updates the enhanced feature based on the integrated query features; 2) the Location-based Feature Extraction module adds feature maps of $3\times3$ patches around coarse 2D hand and object keypoints to the enhanced feature; and 3) in the Query Feature Update Block, cross- and self-attention layers update the integrated query features based on the updated enhanced features. Finally, the heads estimate poses for both hands and the object.
  • Figure 3: Query location visualization: (a) Left: query locations of H2OTR [cho2023transformer], which employs 300 queries. Notably, a substantial number of queries are distributed in the background. (b) Middle: Our hand-object query locations without the query division block. Due to feature similarities between the two hands, considerably more queries concentrate on the left hand than on the right hand, which reduces the accuracy for the right hand. (c) Right: Our hand-object query locations. Queries for the left and right hands are highlighted in red and blue, respectively. Queries for objects are shown in green. The query proposal loss ensures that each query concentrates on its specific region of interest.
  • Figure 4: Examples of estimated 3D poses on the H2O dataset. For a separate example in each row, the figure shows (a) the input RGB image, (b) our hand-object queries, (c) the ground-truth contact map, (d) the predicted contact map, and (e) the final 3D pose estimation results.
  • Figure 5: Examples of estimated 3D poses on the FPHA dataset. For a separate example in each row, the figure shows (a) the input RGB image, (b) our hand-object queries, (c) the ground-truth contact map, (d) the predicted contact map, and (e) the final 3D pose estimation results.
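
The Figure 2 caption mentions gathering $3\times3$ patches around coarse 2D keypoints. The sketch below illustrates only that gather step on a toy single-channel feature map. It is a minimal illustration and not the authors' implementation: the function name `extract_patch`, the single-channel list-of-lists layout, and the border-clamping behavior are all assumptions, and the way the paper merges the patches back into the enhanced feature is not shown.

```python
from typing import List, Tuple

FeatureMap = List[List[float]]  # H x W, single channel for brevity (assumption)


def extract_patch(feature: FeatureMap, keypoint: Tuple[int, int], size: int = 3) -> FeatureMap:
    """Gather a size x size patch centred on (row, col).

    Indices falling outside the map are clamped to the nearest border
    cell; this border handling is an assumption, not taken from the paper.
    """
    height, width = len(feature), len(feature[0])
    row, col = keypoint
    half = size // 2
    patch = []
    for dr in range(-half, half + 1):
        r = min(max(row + dr, 0), height - 1)
        patch_row = []
        for dc in range(-half, half + 1):
            c = min(max(col + dc, 0), width - 1)
            patch_row.append(feature[r][c])
        patch.append(patch_row)
    return patch


# Toy 5 x 5 feature map whose value at (r, c) is r * 5 + c.
feature = [[float(r * 5 + c) for c in range(5)] for r in range(5)]

center_patch = extract_patch(feature, (2, 2))  # interior keypoint
corner_patch = extract_patch(feature, (0, 0))  # keypoint on the border

for name, patch in (("center", center_patch), ("corner", corner_patch)):
    print(name, patch)
```

In a real model the same gather would run over all channels and all predicted keypoints at once, typically with a tensor library rather than nested lists.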