ChatHuman: Chatting about 3D Humans with Tools

Jing Lin, Yao Feng, Weiyang Liu, Michael J. Black

TL;DR

ChatHuman addresses the fragmentation of 3D-human analysis tools by introducing a language-driven agent that autonomously selects, applies, and interprets a suite of 26 tools. It uses a paper-based Retrieval-Augmented Generation mechanism to acquire domain knowledge from tool publications, and a graph-based invocation strategy to compose tools. Tool outputs are converted into LLM-friendly representations, and when several plausible tools produce competing results, the agent discriminates among them and keeps the best one as the final response, all while supporting interactive dialogue. Empirical results show substantial gains in tool-use accuracy and in performance across 3D human tasks, outperforming prior LLM-based and task-specific baselines, with strong generalization to unseen tools.
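The retrieval step can be pictured with a minimal sketch. It is not ChatHuman's implementation: the tool names, the one-line tool descriptions, and the bag-of-words cosine similarity are all assumptions chosen for illustration (the paper builds its tool documents from publications with GPT-4, and its retriever is not specified here). The sketch only shows the general idea of picking the most relevant tool document for a user query and placing it in the prompt as an in-context example.

```python
import math
import re
from collections import Counter

# Hypothetical tool documents (illustrative only, not taken from the paper).
TOOL_DOCS = {
    "pose_estimator": "estimate 3d body pose and shape of a person from an image",
    "contact_detector": "detect contact regions between a human body and objects in a scene",
    "emotion_recognizer": "recognize facial emotion and expression of a person",
}


def tokenize(text):
    """Lowercase bag-of-words representation of a string."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_example(query, docs=TOOL_DOCS):
    """Return the name of the tool whose document best matches the query."""
    q = tokenize(query)
    return max(docs, key=lambda name: cosine(q, tokenize(docs[name])))


def build_prompt(query, docs=TOOL_DOCS):
    """Prepend the retrieved tool document to the query as an in-context example."""
    name = retrieve_example(query, docs)
    return f"Example tool: {name}\nDescription: {docs[name]}\nUser query: {query}"
```

Calling `build_prompt("What emotion does the person show?")` would place the hypothetical `emotion_recognizer` document ahead of the query. A real system would likely swap the bag-of-words similarity for learned embeddings.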

Abstract

Numerous methods have been proposed to detect, estimate, and analyze properties of people in images, including 3D pose, shape, contact, human-object interaction, and emotion. While widely applicable in vision and other areas, such methods require expert knowledge to select, use, and interpret the results. To address this, we introduce ChatHuman, a language-driven system that integrates the capabilities of specialized methods into a unified framework. ChatHuman functions as an assistant proficient in utilizing, analyzing, and interacting with tools specific to 3D human tasks, adeptly discussing and resolving related challenges. Built on a Large Language Model (LLM) framework, ChatHuman is trained to autonomously select, apply, and interpret a diverse set of tools in response to user inputs. Our approach overcomes significant hurdles in adapting LLMs to 3D human tasks, including the need for domain-specific knowledge and the ability to interpret complex 3D outputs. The innovations of ChatHuman include leveraging academic publications to instruct the LLM on tool usage, employing a retrieval-augmented generation model to create in-context learning examples for managing new tools, and effectively discriminating between and integrating tool results by transforming specialized 3D outputs into comprehensible formats. Experiments demonstrate that ChatHuman surpasses existing models in both tool selection accuracy and overall performance across various 3D human tasks, and it supports interactive chatting with users. ChatHuman represents a significant step toward consolidating diverse analytical methods into a unified, robust system for 3D human tasks.

Paper Structure (32 sections, 3 equations, 15 figures, 22 tables)

Figures (15)

  • Figure 1: ChatHuman is an LLM-based agent that uses a multimodal LLM to exploit and combine tools, discriminate their results, and integrate the results to solve tasks related to 3D humans.
  • Figure 2: Paper-based Retrieval-Augmented Tool Use. We feed academic papers describing each tool to GPT-4 to build a document for each tool. During inference, given a user query, a relevant sample is retrieved from the documents and provided to the LLM-based agent as an in-context example to improve tool use accuracy.
  • Figure 3: Examples of tool use graphs. Tool use patterns include: a Node, which uses a single tool; a Chain, which requires sequential tool execution; and a DAG, which combines multiple tools.
  • Figure 4: Illustration of tool result integration. After tool invocation, results are transformed into VLM-compatible representations. The VLM then incorporates these transformed outputs to generate accurate, informative responses to user queries.
  • Figure 5: Illustration of tool results discrimination. When multiple plausible tools exist for a task, ChatHuman discriminates and chooses the best result as the final response.
  • ...and 10 more figures
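
The Node, Chain, and DAG patterns named in the Figure 3 caption can be illustrated with a small sketch. It is an assumption-laden illustration rather than the paper's invocation code: the tool names (`detect`, `pose`, `contact`, `describe`), the string outputs, and the `run_tool_graph` helper are hypothetical. Each tool runs only after the tools it depends on, using Python's standard `graphlib.TopologicalSorter` to order the calls.

```python
from graphlib import TopologicalSorter


def run_tool_graph(tools, deps, inputs):
    """Run tools in dependency order.

    tools:  name -> callable taking the outputs of its prerequisites
    deps:   name -> tuple of prerequisite names (tools or raw inputs)
    inputs: name -> value for raw inputs such as the image
    """
    results = dict(inputs)
    # static_order() yields every node after all of its prerequisites
    # and raises graphlib.CycleError if the graph is not a DAG.
    for name in TopologicalSorter(deps).static_order():
        if name in results:
            continue  # raw input, nothing to compute
        args = [results[d] for d in deps.get(name, ())]
        results[name] = tools[name](*args)
    return results


# Hypothetical stand-in tools that just record how they were called.
EXAMPLE_TOOLS = {
    "detect": lambda image: f"box({image})",
    "pose": lambda box: f"pose({box})",
    "contact": lambda box: f"contact({box})",
    "describe": lambda pose, contact: f"{pose}+{contact}",
}

# "detect" alone is a Node, detect -> pose is a Chain,
# and "describe" merging two branches makes the whole graph a DAG.
EXAMPLE_DEPS = {
    "detect": ("image",),
    "pose": ("detect",),
    "contact": ("detect",),
    "describe": ("pose", "contact"),
}

if __name__ == "__main__":
    out = run_tool_graph(EXAMPLE_TOOLS, EXAMPLE_DEPS, {"image": "img"})
    print(out["describe"])
```

Prerequisites are stored as tuples rather than sets so that the argument order passed to each tool is deterministic.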