SFHand: A Streaming Framework for Language-guided 3D Hand Forecasting and Embodied Manipulation
Ruicong Liu, Yifei Huang, Liangyang Ouyang, Caixin Kang, Yoichi Sato
TL;DR
SFHand is a streaming, language-conditioned framework for real-time 3D hand forecasting. Its ROI-enhanced memory maintains temporal context across streaming video and instructions while focusing on hand-centric regions. The paper also introduces EgoHaFL, a large-scale multimodal dataset with synchronized 3D hand poses and natural language descriptions, which supports multimodal learning for forecasting. SFHand achieves state-of-the-art forecasting accuracy, and its learned representations transfer well to embodied manipulation benchmarks, indicating practical value for AR and robotics. Together, SFHand and EgoHaFL provide a foundation for instruction-aware hand motion understanding and its application to real-world control tasks.
Abstract
Real-time 3D hand forecasting is a critical component for fluid human-computer interaction in applications like AR and assistive robotics. However, existing methods are ill-suited for these scenarios, as they typically require offline access to accumulated video sequences and cannot incorporate language guidance that conveys task intent. To overcome these limitations, we introduce SFHand, the first streaming framework for language-guided 3D hand forecasting. SFHand autoregressively predicts a comprehensive set of future 3D hand states, including hand type, 2D bounding box, 3D pose, and trajectory, from a continuous stream of video and language instructions. Our framework combines a streaming autoregressive architecture with an ROI-enhanced memory layer, capturing temporal context while focusing on salient hand-centric regions. To enable this research, we also introduce EgoHaFL, the first large-scale dataset featuring synchronized 3D hand poses and language instructions. We demonstrate that SFHand achieves new state-of-the-art results in 3D hand forecasting, outperforming prior work by a significant margin of up to 35.8%. Furthermore, we show the practical utility of our learned representations by transferring them to downstream embodied manipulation tasks, improving task success rates by up to 13.4% on multiple benchmarks. Dataset page: https://huggingface.co/datasets/ut-vision/EgoHaFL, project page: https://github.com/ut-vision/SFHand.
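To make the streaming, autoregressive setting concrete, the following is a minimal sketch of the general pattern only, not SFHand's model. It keeps a bounded memory of recent observations, updates it one frame at a time, and rolls out future 3D positions by feeding each prediction back in as the next input. The class name `StreamingForecaster`, its methods, and the constant-velocity predictor are all hypothetical stand-ins chosen for illustration; language conditioning, ROI features, and the learned network described in the abstract are not modeled here.

```python
from collections import deque
from typing import Deque, List, Tuple

Point3D = Tuple[float, float, float]


class StreamingForecaster:
    """Toy streaming forecaster (hypothetical, not the SFHand architecture).

    A fixed-size memory stands in for the temporal context; a
    constant-velocity rule stands in for the learned predictor.
    """

    def __init__(self, memory_size: int = 8) -> None:
        # Bounded memory: the oldest observation is dropped automatically,
        # so the full video never has to be accumulated offline.
        self.memory: Deque[Point3D] = deque(maxlen=memory_size)

    def observe(self, position: Point3D) -> None:
        """Consume one new observation from the stream."""
        self.memory.append(position)

    def forecast(self, horizon: int) -> List[Point3D]:
        """Autoregressively roll out `horizon` future positions."""
        if not self.memory:
            raise ValueError("no observations yet")
        current = self.memory[-1]
        if len(self.memory) >= 2:
            previous = self.memory[-2]
            velocity = tuple(c - p for c, p in zip(current, previous))
        else:
            velocity = (0.0, 0.0, 0.0)
        predictions: List[Point3D] = []
        for _ in range(horizon):
            # Each predicted step becomes the input for the next step.
            current = tuple(c + v for c, v in zip(current, velocity))
            predictions.append(current)
        return predictions


forecaster = StreamingForecaster(memory_size=2)
for wrist in [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]:
    forecaster.observe(wrist)
future = forecaster.forecast(horizon=2)
```

In a real system the constant-velocity rule would be replaced by a learned model, and the memory would hold feature vectors rather than raw positions; the per-frame `observe` and on-demand `forecast` split is the part this sketch is meant to show.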
