PointT2I: LLM-based text-to-image generation via keypoints
Taekyung Lee, Donggyu Lee, Myungjoo Kang
TL;DR
PointT2I addresses the challenge of generating pose-accurate images from text prompts by using an LLM to infer 3D human pose keypoints directly from prompts. The framework couples a keypoint generator, an image generator conditioned on 2D pose guidance, and a two-module LLM-based feedback system to refine both keypoints and resulting images, all without fine-tuning. It demonstrates superior pose fidelity across yoga, acrobatic, and common poses, validated through quantitative metrics (VQAScore, CLIPScore) and a yoga-specific classifier, and shows compatibility with multiple image backbones and LLMs. The work highlights the potential of semantic-to-structural translation to unlock pose-aware rendering in diffusion-based T2I, while also acknowledging challenges in multi-person scenes, computation, and non-human poses, pointing to future integration and efficiency improvements. Overall, PointT2I provides a robust, architecture-agnostic pathway to controllable, pose-consistent image synthesis from purely textual prompts.
Abstract
Text-to-image (T2I) generation models have made significant advancements, producing high-quality images aligned with an input prompt. However, despite their ability to generate fine-grained images, they still struggle to generate accurate images when the input prompt contains complex concepts, especially human poses. In this paper, we propose PointT2I, a framework that uses a large language model (LLM) to generate images that accurately correspond to the human pose described in the prompt. PointT2I consists of three components: keypoint generation, image generation, and a feedback system. The keypoint generation stage uses an LLM to directly generate keypoints corresponding to a human pose, based solely on the input prompt and without external references. The image generation stage then produces images conditioned on both the text prompt and the generated keypoints so that they accurately reflect the target pose. To refine the outputs of the preceding stages, we incorporate an LLM-based feedback system that assesses the semantic consistency between the generated contents and the given prompt. Our framework is the first approach to leverage an LLM for keypoint-guided image generation without any fine-tuning, producing accurate pose-aligned images based solely on textual prompts.
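The three stages described in the abstract can be pictured as a simple control loop. The sketch below is only an illustration under stated assumptions and is not the authors' implementation: the names `run_pipeline`, `Verdict`, `propose_keypoints`, `check_keypoints`, `render_image`, and `check_image` are hypothetical, the keypoints are assumed to be 2D `(x, y)` pairs, and the retry policy (`max_rounds`) is invented for the example. The LLM and the image generator are passed in as plain callables, so no particular model or API is implied.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple

# Assumed representation: a list of 2D (x, y) points. The abstract does not
# specify the keypoint format, so this is a placeholder choice.
Keypoints = List[Tuple[float, float]]


@dataclass
class Verdict:
    """Result of one feedback check (hypothetical structure)."""
    ok: bool
    feedback: str = ""


def run_pipeline(
    prompt: str,
    propose_keypoints: Callable[[str, str], Keypoints],   # (prompt, feedback) -> keypoints
    check_keypoints: Callable[[str, Keypoints], Verdict],  # feedback on keypoints
    render_image: Callable[[str, Keypoints], Any],         # (prompt, keypoints) -> image
    check_image: Callable[[str, Any], Verdict],            # feedback on the image
    max_rounds: int = 3,                                   # assumed retry budget
) -> Tuple[Keypoints, Any]:
    # Stage 1: keypoint generation from the prompt alone.
    keypoints = propose_keypoints(prompt, "")

    # Feedback on keypoints: check, and re-propose with the critique on failure.
    for _ in range(max_rounds):
        verdict = check_keypoints(prompt, keypoints)
        if verdict.ok:
            break
        keypoints = propose_keypoints(prompt, verdict.feedback)

    # Stage 2: image generation conditioned on both prompt and keypoints.
    image = render_image(prompt, keypoints)

    # Feedback on the image: check, and regenerate on failure.
    # (How the real system uses image feedback is not stated in the abstract;
    # simple regeneration is an assumption.)
    for _ in range(max_rounds):
        verdict = check_image(prompt, image)
        if verdict.ok:
            break
        image = render_image(prompt, keypoints)

    return keypoints, image
```

Passing the models in as callables keeps the sketch independent of any specific LLM or image backbone, which mirrors the claim that the framework needs no fine-tuning; any function with the matching signature can stand in for a stage.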
