ElectriQ: A Benchmark for Assessing the Response Capability of Large Language Models in Power Marketing

Jinzhi Wang; Qingke Peng; Haozhou Li; Zeyuan Zeng; Jiangbo Zhang; Kaixuan Yang; Ningyong Wu; Qinfeng Song; Ruimeng Li; Biyi Zhou

TL;DR

ElectriQ addresses the need for domain-specific evaluation of LLMs in electric power marketing (EPM) by introducing a large-scale, auditable benchmark with more than 550k dialogues across six service domains and 24 sub-scenarios. The framework combines human assessments, automatic metrics, and two compliance stress tests, Statutory Citation Correctness (SCC) and Long-Dialogue Consistency (LDC), and adds SEEK-RAG, a retrieval-augmented approach that injects policy and domain knowledge during fine-tuning and inference. Experiments across 13 LLMs show that domain-aligned 7B models with SEEK-RAG can match or exceed much larger models at lower computational cost, highlighting the value of domain grounding for sustainability-focused tasks such as time-of-use (TOU) tariffs and demand response (DR), distributed energy resource (DER) interconnection, and outage communication. ElectriQ thus provides a reproducible, regulation-aware path to deploying trustworthy EPM assistants that support demand-side management, renewable integration, and grid resilience.

Abstract

As power systems decarbonise and digitalise, high penetrations of distributed energy resources and flexible tariffs make electric power marketing (EPM) a key interface between regulation, system operation and sustainable-energy deployment. Many utilities still rely on human agents and rule- or intent-based chatbots with fragmented knowledge bases; these struggle with long, cross-scenario dialogues and fall short of requirements for compliant, verifiable and demand-response-ready (DR-ready) interactions. Meanwhile, frontier large language models (LLMs) show strong conversational ability but are evaluated on generic benchmarks that underweight sector-specific terminology, regulatory reasoning and multi-turn process stability. To address this gap, we present ElectriQ, a large-scale benchmark and evaluation framework for LLMs in EPM. ElectriQ contains over 550k dialogues across six service domains and 24 sub-scenarios and defines a unified protocol that combines human ratings, automatic metrics and two compliance stress tests: Statutory Citation Correctness (SCC) and Long-Dialogue Consistency (LDC). Building on ElectriQ, we propose SEEK-RAG, a retrieval-augmented method that injects policy and domain knowledge during fine-tuning and inference. Experiments on 13 LLMs show that domain-aligned 7B models with SEEK-RAG match or surpass much larger models while reducing computational cost, providing an auditable, regulation-aware basis for deploying LLM-based EPM assistants that support demand-side management, renewable integration and resilient grid operation.
