LocLLM: Exploiting Generalizable Human Keypoint Localization via Large Language Model

Dongkai Wang; Shiyu Xuan; Shiliang Zhang

LocLLM: Exploiting Generalizable Human Keypoint Localization via Large Language Model

Dongkai Wang, Shiyu Xuan, Shiliang Zhang

TL;DR

The paper tackles generalization gaps in human keypoint localization by introducing LocLLM, an LLM-based model that locates keypoints from an image guided by natural-language descriptions. It frames localization as a multimodal reasoning task, integrating a visual encoder, a projection layer, and a decoder-only LLM to produce coordinates via $\mathcal{K}=\Phi_L(\Phi_P(\Phi_V(\mathcal{I})), \mathcal{T})$, and employs localization-based instruction conversations with parameter-efficient LoRA tuning. Experiments on 2D/3D benchmarks (COCO, MPII, HumanArt, Human3.6M) show competitive 2D AP, strong 3D performance, and notably superior cross-dataset and novel-keypoint generalization. The work demonstrates the potential of language-guided localization to extend keypoint detection beyond fixed training priors, enabling flexible, generalizable visual understanding with large-language reasoning.

Abstract

The capacity of existing human keypoint localization models is limited by keypoint priors provided by the training data. To alleviate this restriction and pursue more general model, this work studies keypoint localization from a different perspective by reasoning locations based on keypiont clues in text descriptions. We propose LocLLM, the first Large-Language Model (LLM) based keypoint localization model that takes images and text instructions as inputs and outputs the desired keypoint coordinates. LocLLM leverages the strong reasoning capability of LLM and clues of keypoint type, location, and relationship in textual descriptions for keypoint localization. To effectively tune LocLLM, we construct localization-based instruction conversations to connect keypoint description with corresponding coordinates in input image, and fine-tune the whole model in a parameter-efficient training pipeline. LocLLM shows remarkable performance on standard 2D/3D keypoint localization benchmarks. Moreover, incorporating language clues into the localization makes LocLLM show superior flexibility and generalizable capability in cross dataset keypoint localization, and even detecting novel type of keypoints unseen during training.

LocLLM: Exploiting Generalizable Human Keypoint Localization via Large Language Model

TL;DR

, and employs localization-based instruction conversations with parameter-efficient LoRA tuning. Experiments on 2D/3D benchmarks (COCO, MPII, HumanArt, Human3.6M) show competitive 2D AP, strong 3D performance, and notably superior cross-dataset and novel-keypoint generalization. The work demonstrates the potential of language-guided localization to extend keypoint detection beyond fixed training priors, enabling flexible, generalizable visual understanding with large-language reasoning.

Abstract

Paper Structure (21 sections, 8 equations, 4 figures, 9 tables)

This paper contains 21 sections, 8 equations, 4 figures, 9 tables.

Introduction
Related Work
Human Keypoint Localization
Multi-modal Large Language Model
Large Language Model for Vision Tasks
Method
Overview
Localization-based Instruction Conversation
Parameter-Efficient Tuning
Baseline: CLIP-based Keypoint Localization
Experiments
Datasets and Evaluation Metrics
Implementation Details
Ablation Study
2D/3D Human Keypoint Localization
...and 6 more sections

Figures (4)

Figure 1: Upper: The conventional keypoint localization methods xiao2018simplexu2022vitposeli2021human encodes keypoint prior provided by the training set into model architecture and refers to encoded prior for keypoint localization. Bottom: The proposed LLM-based keypoint localization method refers to keypoint type, location, and relationship descriptions, and utilizes pre-trained powerful LLM zhu2023minigptliu2023visual to predict keypoint coordinates. Our method is more general to locate novel keypoints cross datasets, as textual descriptions can be provided flexibly.
Figure 2: The proposed LocLLM for keypoint localization via large language model. LocLLM takes image and text instruction as input and contains three parts: a visual encoder, a projector and a decoder-only LLM. The image input is processed by visual encoder and projector to extract image tokens. The LLM takes the image tokens and text tokens as input and output corresponding keypoint coordinates. During training, we freeze the visual encoder and LLM and only update a small set of learnable parameters with projector, therefore relieving the training cost.
Figure 3: Illustration of the CLIP-based keypoint localization.
Figure 4: Localization results of three novel keypoints which are not seen during training (denoted by blue star in the first column image). It can be observed that CLIP baseline matches each novel keypoint to a similar keypoint in the training set, e.g., it locates the left knee to right knee, pelvis to right hip, and neck to nose. In contrast, our LocLLM can locate novel keypoint accurately.

LocLLM: Exploiting Generalizable Human Keypoint Localization via Large Language Model

TL;DR

Abstract

LocLLM: Exploiting Generalizable Human Keypoint Localization via Large Language Model

Authors

TL;DR

Abstract

Table of Contents

Figures (4)