FunctionChat-Bench: Comprehensive Evaluation of Language Models' Generative Capabilities in Korean Tool-use Dialogs

Shinbok Lee, Gaeun Seo, Daniel Lee, Byeongil Ko, Sunghee Jung, Myeongcheol Shin

TL;DR

The capabilities required for function calling extend beyond generating tool call messages; models must also generate conversational messages that effectively engage the user.

Abstract

This study investigates language models' generative capabilities in tool-use dialogs. We categorize the models' outputs in tool-use dialogs into four distinct types: Tool Call, Answer Completion, Slot Question, and Relevance Detection, which serve as aspects for evaluation. We introduce FunctionChat-Bench, comprising 700 evaluation items and automated assessment programs. Using this benchmark, we evaluate several language models that support function calling. Our findings indicate that while language models may exhibit high accuracy in single-turn Tool Call scenarios, this does not necessarily translate to superior generative performance in multi-turn environments. We argue that the capabilities required for function calling extend beyond generating tool call messages; models must also generate conversational messages that effectively engage the user.

Paper Structure

This paper contains 17 sections, 5 figures, and 5 tables.

Figures (5)

  • Figure 1: A Classification of Language Models' Outputs in Tool-use Dialogs
  • Figure 2: A simplified sample of the content of the evaluation report file that is generated in the final stage of the evaluation program. This sample was created based on the actual results of the FunctionChat-Singlecall test of the functionary model.
  • Figure 3: Examples of Errors in "Answer Completion"
  • Figure 4: Examples of Errors in "Slot Question"
  • Figure 5: An Example of Errors in "Relevance Detection"