I Think, Therefore I am: Benchmarking Awareness of Large Language Models Using AwareBench

Yuan Li, Yue Huang, Yuli Lin, Siyuan Wu, Yao Wan, Lichao Sun

TL;DR

The paper introduces AwareBench, a psychology-informed benchmark for assessing awareness in LLMs across five dimensions, captured in the AwareEval dataset. It delineates introspective and social awareness components, and evaluates 13 LLMs to reveal weak capability and mission awareness but comparatively strong social understanding. The study highlights implications for AI alignment and safety and provides a human-AI collaborative data-generation pipeline with open resources. Overall, awareness in LLMs appears uneven and closely tied to underlying model capabilities, signaling avenues for improvement and responsible deployment.

Abstract

Do large language models (LLMs) exhibit any forms of awareness similar to humans? In this paper, we introduce AwareBench, a benchmark designed to evaluate awareness in LLMs. Drawing from theories in psychology and philosophy, we define awareness in LLMs as the ability to understand themselves as AI models and to exhibit social intelligence. Subsequently, we categorize awareness in LLMs into five dimensions, including capability, mission, emotion, culture, and perspective. Based on this taxonomy, we create a dataset called AwareEval, which contains binary, multiple-choice, and open-ended questions to assess LLMs' understandings of specific awareness dimensions. Our experiments, conducted on 13 LLMs, reveal that the majority of them struggle to fully recognize their capabilities and missions while demonstrating decent social intelligence. We conclude by connecting awareness of LLMs with AI alignment and safety, emphasizing its significance to the trustworthy and ethical development of LLMs. Our dataset and code are available at https://github.com/HowieHwong/Awareness-in-LLM.

Paper Structure (27 sections, 11 figures, 14 tables)

Figures (11)

  • Figure 1: The architecture of AwareBench. We first propose a unified taxonomy to define awareness in LLMs. We then construct an evaluation dataset through human-AI collaboration. Finally, we assess 13 popular LLMs and draw conclusions from the results.
  • Figure 2: Dataset construction pipeline for AwareEval. It includes three stages: seed curation, query generation, and quality validation.
  • Figure 3: Model performance distribution on different tasks. Ex. means explicit, Im. means implicit, and Open. means open-ended.
  • Figure 4: Average performance on the AwareEval dataset.
  • Figure 5: The human annotation interface.
  • ...and 6 more figures