SFHand: A Streaming Framework for Language-guided 3D Hand Forecasting and Embodied Manipulation

Ruicong Liu, Yifei Huang, Liangyang Ouyang, Caixin Kang, Yoichi Sato

TL;DR

SFHand presents a streaming, language-conditioned framework for real-time 3D hand forecasting, guided by an ROI-enhanced memory that maintains temporal context from streaming video and instructions. It introduces EgoHaFL, a large-scale multimodal dataset with synchronized 3D hand poses and natural language descriptions, enabling robust multimodal learning for forecasting. The approach achieves state-of-the-art forecasting accuracy and demonstrates strong transfer to embodied manipulation benchmarks, highlighting the practical impact for AR and robotics. Together, SFHand and EgoHaFL establish a foundation for instruction-aware hand motion understanding and its application to real-world control tasks.

Abstract

Real-time 3D hand forecasting is a critical component for fluid human-computer interaction in applications like AR and assistive robotics. However, existing methods are ill-suited for these scenarios, as they typically require offline access to accumulated video sequences and cannot incorporate language guidance that conveys task intent. To overcome these limitations, we introduce SFHand, the first streaming framework for language-guided 3D hand forecasting. SFHand autoregressively predicts a comprehensive set of future 3D hand states, including hand type, 2D bounding box, 3D pose, and trajectory, from a continuous stream of video and language instructions. Our framework combines a streaming autoregressive architecture with an ROI-enhanced memory layer, capturing temporal context while focusing on salient hand-centric regions. To enable this research, we also introduce EgoHaFL, the first large-scale dataset featuring synchronized 3D hand poses and language instructions. We demonstrate that SFHand achieves new state-of-the-art results in 3D hand forecasting, outperforming prior work by a significant margin of up to 35.8%. Furthermore, we show the practical utility of our learned representations by transferring them to downstream embodied manipulation tasks, improving task success rates by up to 13.4% on multiple benchmarks. Dataset page: https://huggingface.co/datasets/ut-vision/EgoHaFL, project page: https://github.com/ut-vision/SFHand.
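
The contrast the abstract draws between offline and streaming operation can be illustrated with a toy loop. This is a minimal sketch and not the SFHand implementation: the class name `StreamingForecaster`, the additive fusion of frame and instruction features, the mean-pooled context, and the feedback of each forecast as the next hand state are all assumptions made for illustration. The point is only that each step consumes one frame, keeps a bounded memory, and never revisits the full video.

```python
from collections import deque


class StreamingForecaster:
    """Toy streaming forecaster with a fixed-size memory (hypothetical)."""

    def __init__(self, memory_size=4):
        # Bounded queue: the oldest embedding is dropped automatically.
        self.memory = deque(maxlen=memory_size)

    def step(self, frame_feat, instruction_feat, hand_state):
        # Assumed fusion: element-wise sum of frame and instruction features.
        embedding = [f + i for f, i in zip(frame_feat, instruction_feat)]
        # Assumed temporal context: mean over remembered and current embeddings.
        history = list(self.memory) + [embedding]
        context = [sum(col) / len(history) for col in zip(*history)]
        self.memory.append(embedding)
        # Assumed decoder: offset the current hand state by the context.
        return [h + c for h, c in zip(hand_state, context)]


forecaster = StreamingForecaster(memory_size=2)
state = [0.0, 0.0]
for _ in range(5):
    # One frame arrives at a time; the forecast is fed back as the next state.
    state = forecaster.step([1.0, 1.0], [0.0, 0.0], state)
print(len(forecaster.memory), state)
```

Memory use stays constant regardless of stream length, which is the property that distinguishes a streaming design from methods that require the accumulated sequence.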

Paper Structure

This paper contains 19 sections, 3 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Comparison between previous hand forecasting methods and our proposed approach. (a) Prior 3D hand forecasting models rely on accumulated video sequences and lack streaming input or language guidance. (b) Our method, SFHand, introduces an autoregressive framework for language-guided 3D hand forecasting. Its streaming and instruction-aware design makes it well-suited for real-time applications such as AR and embodied manipulation.
  • Figure 2: The overview of our method. Given a streaming egocentric video, language instruction, and the current 3D hand state, our model autoregressively forecasts future 3D hand motions. The ROI-enhanced memory layer maintains a key-value queue of past embeddings, enabling temporal reasoning over streaming inputs. The ROI mask generates an attention bias that drives hand-region queries to attend more strongly to historical embeddings. The memory-augmented embeddings are then decoded to predict future hand states.
  • Figure 3: Illustrations of various tasks in the Franka Kitchen [gupta2020relay] and Adroit [rajeswaran2017learning] simulated environments.
  • Figure 4: Function of memory layer. All hands are forecasted from previous video frames and hand states.
  • Figure 5: Qualitative comparison between our method and HaMeR. "HaMeR + In." indicates HaMeR incorporating language instructions for forecasting. Red rectangles highlight incorrect hand positions or hand poses. All hands are forecasted from previous video frames and hand states.
  • ...and 1 more figure
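
The mechanism described in the Figure 2 caption, where an ROI mask produces an attention bias that pushes hand-region queries toward historical embeddings, can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the function name `roi_biased_memory_attention`, the constant additive bias `roi_bias`, and the choice to attend jointly over current-frame and memory keys are all hypothetical.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def roi_biased_memory_attention(queries, roi_mask, current_kv, memory_kv,
                                roi_bias=2.0):
    """Attend over current and past (key, value) pairs.

    queries:    token vectors of the current frame
    roi_mask:   one bool per query, True if the token lies in a hand region
    current_kv: (key, value) pairs from the current frame
    memory_kv:  (key, value) pairs queued from past frames
    roi_bias:   assumed constant added to memory scores for ROI queries
    """
    keys = [k for k, _ in current_kv] + [k for k, _ in memory_kv]
    values = [v for _, v in current_kv] + [v for _, v in memory_kv]
    n_current = len(current_kv)
    scale = 1.0 / math.sqrt(len(keys[0]))

    outputs, weights = [], []
    for q, in_roi in zip(queries, roi_mask):
        scores = []
        for j, k in enumerate(keys):
            s = dot(q, k) * scale
            # Bias only hand-region queries, and only toward historical keys.
            if in_roi and j >= n_current:
                s += roi_bias
            scores.append(s)
        w = softmax(scores)
        out = [sum(wi * v[d] for wi, v in zip(w, values))
               for d in range(len(values[0]))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


# Two query tokens: the first is inside the hand ROI, the second is not.
queries = [[0.0, 0.0], [0.0, 0.0]]
current = [([1.0, 0.0], [1.0, 0.0])]
memory = [([0.0, 1.0], [0.0, 1.0])]
_, w = roi_biased_memory_attention(queries, [True, False], current, memory)
print(w)
```

The sketch attends over current-frame keys as well as memory keys because an additive bias applied uniformly to every key would cancel under softmax; the bias only shifts attention mass when some keys are left unbiased.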