COINBench: Moving Beyond Individual Perspectives to Collective Intent Understanding

Xiaozhe Li, Tianyi Lyu, Siyi Yang, Yizhao Yang, Yuxi Gong, Jinxuan Huang, Ligao Zhang, Zhuoyi Huang, Qingwen Liu

Abstract

Understanding human intent is a high-level cognitive challenge for Large Language Models (LLMs), requiring sophisticated reasoning over noisy, conflicting, and non-linear discourse. While LLMs excel at following individual instructions, their ability to distill Collective Intent - the process of extracting consensus, resolving contradictions, and inferring latent trends from multi-source public discussions - remains largely unexplored. To bridge this gap, we introduce COIN-BENCH, a dynamic, real-world, live-updating benchmark specifically designed to evaluate LLMs on collective intent understanding within the consumer domain. Unlike traditional benchmarks that focus on transactional outcomes, COIN-BENCH operationalizes intent as a hierarchical cognitive structure, ranging from explicit scenarios to deep causal reasoning. We implement a robust evaluation pipeline that combines a rule-based method with an LLM-as-the-Judge approach. This framework incorporates COIN-TREE for hierarchical cognitive structuring and retrieval-augmented verification (COIN-RAG) to ensure expert-level precision in analyzing raw, collective human discussions. An extensive evaluation of 20 state-of-the-art LLMs across four dimensions - depth, breadth, informativeness, and correctness - reveals that while current models can handle surface-level aggregation, they still struggle with the analytical depth required for complex intent synthesis. COIN-BENCH establishes a new standard for advancing LLMs from passive instruction followers to expert-level analytical agents capable of deciphering the collective voice of the real world. See our project page on COIN-BENCH.

Paper Structure (37 sections, 4 equations, 5 figures, 5 tables)

Figures (5)

  • Figure 1: COIN-Bench Pipeline: From Raw Discourse to Structured Evaluation. Top (Curation): An automated pipeline that consolidates 200k+ public opinions into a consumer database via hybrid keyword–semantic retrieval and dual-stage rule/LLM denoising. Middle (Active Synthesis): The Active Probing stage, where LLMs act as meta-analysts to convert fragmented discourse into structured questionnaires. Bottom (Evaluation): Performance is assessed along four dimensions: Depth and Breadth via CoIn-Tree (a five-level hierarchical intent map), Correctness via CoIn-RAG (grounded corpus verification), and Informativeness (lexical and semantic richness).
  • Figure 2: Overview of COIN-Bench: It includes over 200k product-level discussions across 9 major domains, 54 sub-domains, and more than 1,400 products.
  • Figure 3: Lighted Tree from GPT-5 on Google Nest Smart Speaker.
  • Figure 4: Lighted Tree from GPT-o3 on Google Nest Smart Speaker.
  • Figure 5: Lighted Tree from Qwen3-30B-A3B on Google Nest Smart Speaker.
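
The "lighted tree" figures and the Depth and Breadth dimensions suggest scoring a model by which nodes of a hierarchical intent map its answer covers. The exact metric definitions are not given in this excerpt, so the following is only a minimal sketch under stated assumptions: `IntentNode`, `lit_depth`, and `lit_breadth` are hypothetical names, the node labels are toy data, and "depth" and "breadth" are taken to mean the deepest covered level and the fraction of covered nodes, which may differ from the paper's formulas.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class IntentNode:
    """One node of a hypothetical intent tree; `lit` marks coverage by a model answer."""
    label: str
    lit: bool = False
    children: List["IntentNode"] = field(default_factory=list)


def lit_depth(node: IntentNode, level: int = 1) -> int:
    """Return the deepest 1-based level containing a lit node, or 0 if none is lit."""
    best = level if node.lit else 0
    for child in node.children:
        best = max(best, lit_depth(child, level + 1))
    return best


def lit_breadth(node: IntentNode) -> float:
    """Return the fraction of all nodes in the tree that are lit (assumed definition)."""
    def count(n: IntentNode) -> Tuple[int, int]:
        total, lit = 1, int(n.lit)
        for c in n.children:
            t, l = count(c)
            total += t
            lit += l
        return total, lit

    total, lit = count(node)
    return lit / total


# Toy three-level tree with illustrative labels (not taken from the benchmark).
tree = IntentNode("smart speaker", lit=True, children=[
    IntentNode("audio quality", lit=True, children=[
        IntentNode("bass at high volume", lit=True),
        IntentNode("voice clarity", lit=False),
    ]),
    IntentNode("privacy", lit=False, children=[
        IntentNode("always-on microphone", lit=False),
    ]),
])

print(lit_depth(tree), lit_breadth(tree))
```

In this toy tree, three of six nodes are lit and the deepest lit node sits at level three. A five-level map as described in Figure 1 would use the same traversal with deeper nesting.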