DMind Benchmark: Toward a Holistic Assessment of LLM Capabilities across the Web3 Domain
Enhao Huang, Pengyu Sun, Zixin Lin, Alex Chen, Joey Ouyang, Haobo Wang, Kaichun Hu, James Yi, Frank Li, Zhiyu Zhang, Tianxiang Xu, Gang Zhao, Ziang Ling, Lowes Yang
TL;DR
DMind Benchmark introduces a holistic Web3-focused evaluation suite for large language models, spanning nine subfields and combining objective questions with domain-specific subjective tasks. By curating 3,543 items from a provenance-tracked, license-aware data pipeline and evaluating 31 prominent LLMs, the study reveals strong performance in fundamentals and infrastructure but notable gaps in token economics, meme concepts, and security auditing. The findings show that model size alone does not guarantee Web3 competence and highlight a Pareto-efficient cost-accuracy landscape across model families. Open-sourcing the dataset and evaluation pipeline, DMind provides a reusable platform to advance domain-adapted reasoning and trustworthy AI for decentralized ecosystems.
Abstract
Large Language Models (LLMs) have achieved impressive performance in diverse natural language processing tasks, but specialized domains such as Web3 present new challenges and require more tailored evaluation. Despite the significant user base and capital flows in Web3, encompassing smart contracts, decentralized finance (DeFi), non-fungible tokens (NFTs), decentralized autonomous organizations (DAOs), on-chain governance, and novel token-economics, no comprehensive benchmark has systematically assessed LLM performance in this domain. To address this gap, we introduce the DMind Benchmark, a holistic Web3-oriented evaluation suite covering nine critical subfields: fundamental blockchain concepts, blockchain infrastructure, smart contract, DeFi mechanisms, DAOs, NFTs, token economics, meme concept, and security vulnerabilities. Beyond multiple-choice questions, DMind Benchmark features domain-specific tasks such as contract debugging and on-chain numeric reasoning, mirroring real-world scenarios. We evaluated 26 models, including ChatGPT, Claude, DeepSeek, Gemini, Grok, and Qwen, uncovering notable performance gaps in specialized areas like token economics and security-critical contract analysis. While some models excel in blockchain infrastructure tasks, advanced subfields remain challenging. Our benchmark dataset and evaluation pipeline are open-sourced on https://huggingface.co/datasets/DMindAI/DMind_Benchmark, reaching number one in Hugging Face's trending dataset charts within a week of release.
