RoCar: A Relationship Network-based Evaluation Method for Large Language Models

Ming Wang; Wenfang Wu; Chongyun Gao; Daling Wang; Shi Feng; Yifei Zhang

TL;DR

RoCar uses predefined basic schemas to randomly construct a task graph, then generates natural language evaluation tasks from that graph to evaluate the reasoning and memory abilities of LLMs separately.

Abstract

Large language models (LLMs) have received increasing attention. However, because their capabilities are complex, how to evaluate them rationally remains an open problem. We propose the RoCar method, which uses predefined basic schemas to randomly construct a task graph and generates natural language evaluation tasks from that graph to evaluate the reasoning and memory abilities of LLMs separately. Because the task construction process is highly random, it is possible to ensure that none of the LLMs under test has directly learned the evaluation tasks, guaranteeing the fairness of the evaluation method.
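The construction idea in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not taken from the paper: the relation list `RELATIONS`, the name pool `NAMES`, the helpers `build_task_graph` and `make_task`, and the sentence templates. The sketch grows a graph one randomly chosen relation at a time, records the insertion order, and turns the result into a natural language context plus one recall-style question.

```python
import random

# Hypothetical relation schemas and names, assumed for illustration only.
RELATIONS = ["parent", "sibling", "friend", "colleague"]
NAMES = ["Alice", "Bob", "Carol", "David", "Eve", "Frank"]


def build_task_graph(num_edges, rng):
    """Grow a graph by attaching one new person per step via a random relation.

    Returns a list of (order, head, relation, tail) tuples, where `order`
    records the step at which the edge was inserted.
    """
    if num_edges > len(NAMES) - 1:
        raise ValueError("not enough names for the requested number of edges")
    nodes = [NAMES[0]]          # people already in the graph
    unused = NAMES[1:]          # people not yet inserted (a fresh list copy)
    edges = []
    for order in range(1, num_edges + 1):
        head = rng.choice(nodes)                  # random attachment point
        relation = rng.choice(RELATIONS)          # random relation schema
        tail = unused.pop(rng.randrange(len(unused)))
        nodes.append(tail)
        edges.append((order, head, relation, tail))
    return edges


def make_task(edges, rng):
    """Render the graph as statements and ask about one randomly chosen edge."""
    context = [f"{head} is a {rel} of {tail}." for _, head, rel, tail in edges]
    _, head, rel, tail = rng.choice(edges)
    return {
        "context": context,
        "question": f"What is {head}'s relationship to {tail}?",
        "answer": rel,
    }


if __name__ == "__main__":
    rng = random.Random()       # unseeded, so each run yields a different task
    task = make_task(build_task_graph(4, rng), rng)
    print("\n".join(task["context"]))
    print(task["question"])
```

Because both the attachment point and the relation are drawn at random at every step, two runs rarely produce the same task, which is the property the abstract relies on for fairness. A reasoning-style question would instead ask about two people who are not directly connected; that requires rules for composing relations, which this sketch does not attempt.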

Paper Structure (12 sections, 2 equations, 2 figures, 7 tables)


Introduction
Methodology
Abstracting Basic Graph Schema
Template Definition and Randomised Social Network Graph Generation
Evaluation Tasks Construction
Experiments
Design of Experiments
Reasoning Capability
Memory Capability
Analysis of Results
Conclusion
Future Work

Figures (2)

Figure 1: The process of constructing a task graph. The figure contains two columns: the first column shows the set of basic schemas decreasing as each schema is inserted into the graph, while the second column shows the dynamic process of constructing an evaluation task graph from the basic schemas. The green nodes indicate newly inserted basic schemas and the red nodes indicate the task graph constructed so far. The green relationships and splicing methods indicate the available options, and the red boxes indicate the relationships or splicing methods that were randomly selected.
Figure 2: Task graph. Red nodes indicate females, blue nodes indicate males, arrows indicate the relationship between the two connected nodes, and ordinal numbers indicate the order in which the task graph was constructed.
