ElectriQ: A Benchmark for Assessing the Response Capability of Large Language Models in Power Marketing

Jinzhi Wang, Qingke Peng, Haozhou Li, Zeyuan Zeng, Jiangbo Zhang, Kaixuan Yang, Ningyong Wu, Qinfeng Song, Ruimeng Li, Biyi Zhou

TL;DR

ElectriQ addresses the need for domain-specific evaluation of LLMs in electric power marketing (EPM) by introducing a large-scale, auditable benchmark with 550k+ dialogues across six service domains and 24 sub-scenarios. The framework combines human assessments, automatic metrics, and two compliance stress tests: Statutory Citation Correctness (SCC) and Long-Dialogue Consistency (LDC). It also introduces SEEK-RAG, a retrieval-augmented method that injects policy knowledge during fine-tuning and inference. Experiments across 13 LLMs show that domain-aligned 7B models with SEEK-RAG can match or exceed larger models at a fraction of the compute, highlighting the value of domain grounding for sustainability-focused tasks such as time-of-use (TOU) tariffs and demand response (DR), distributed energy resource (DER) interconnection, and outage communication. ElectriQ thus provides a reproducible, regulation-aware path to deploying trustworthy EPM assistants that support demand-side management, renewable integration, and grid resilience.

Abstract

As power systems decarbonise and digitalise, high penetrations of distributed energy resources and flexible tariffs make electric power marketing (EPM) a key interface between regulation, system operation and sustainable-energy deployment. Many utilities still rely on human agents and rule- or intent-based chatbots with fragmented knowledge bases that struggle with long, cross-scenario dialogues and fall short of requirements for compliant, verifiable and demand-response-ready (DR-ready) interactions. Meanwhile, frontier large language models (LLMs) show strong conversational ability but are evaluated on generic benchmarks that underweight sector-specific terminology, regulatory reasoning and multi-turn process stability. To address this gap, we present ElectriQ, a large-scale benchmark and evaluation framework for LLMs in EPM. ElectriQ contains over 550k dialogues across six service domains and 24 sub-scenarios and defines a unified protocol that combines human ratings, automatic metrics and two compliance stress tests: Statutory Citation Correctness (SCC) and Long-Dialogue Consistency (LDC). Building on ElectriQ, we propose SEEK-RAG, a retrieval-augmented method that injects policy and domain knowledge during fine-tuning and inference. Experiments on 13 LLMs show that domain-aligned 7B models with SEEK-RAG match or surpass much larger models while reducing computational cost, providing an auditable, regulation-aware basis for deploying LLM-based EPM assistants that support demand-side management, renewable integration and resilient grid operation.

Paper Structure

This paper contains 34 sections, 8 equations, 9 figures, 6 tables.

Figures (9)

  • Figure 1: General customer service vs. power-marketing dialogue. EPM dialogues must integrate pricing policies, consumption patterns, and regulatory constraints to produce actionable, compliant responses, motivating a dedicated benchmark.
  • Figure 2: Overview of the ElectriQ benchmark and evaluation. ElectriQ's dataset composition and scale, covered service scenarios and model spectrum, and unified evaluation with human and automatic metrics for EPM dialogues.
  • Figure 3: Construction of the three ElectriQ subsets. (A) 1.2M call logs $\rightarrow$ denoising, ASR, normalisation, de-identification $\rightarrow$ clean real multi-turn dialogues. (B) Scenario-guided knowledge retrieval and prompted generation on aligned tickets and knowledge bases (training only, with the same de-ID and QA pipeline). (C) Stratified sampling by complexity, urgency and scenario, followed by model- and human-based filtering with quota balancing, to obtain $\sim$2,400 preference-aligned samples.
  • Figure 4: ElectriQ taxonomy and per-category distributions. Sunburst view of the ElectriQ taxonomy (six primary categories from >60 labels) and per-category dialogue-length statistics.
  • Figure 5: Two-stage training of ElectriQ-LLM. Stage 1 injects domain knowledge using large-scale EPM dialogues; Stage 2 aligns responses to human preferences on a 2,400-sample subset to improve professionalism and user experience.
  • ...and 4 more figures
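
The Figure 3 caption (panel C) mentions stratified sampling by complexity, urgency and scenario with quota balancing. The sketch below illustrates that general idea only; it is not the authors' pipeline. The field names, the equal per-stratum quota, the helper name `stratified_sample` and the toy pool are all assumptions made for illustration, and the model- and human-based filtering step is omitted.

```python
import random
from collections import defaultdict


def stratified_sample(dialogues, keys, total, seed=0):
    """Draw roughly `total` items, spreading the quota evenly across strata.

    Hypothetical helper: the equal-quota rule is an assumption, not a
    detail taken from the paper.
    """
    rng = random.Random(seed)

    # Group dialogues by their combination of stratification attributes.
    strata = defaultdict(list)
    for d in dialogues:
        strata[tuple(d[k] for k in keys)].append(d)
    if not strata:
        return []

    # Equal quota per stratum, at least one item each.
    quota = max(1, total // len(strata))

    sample = []
    for key in sorted(strata):
        items = strata[key]
        rng.shuffle(items)
        # Small strata contribute everything they have.
        sample.extend(items[:quota])
    return sample


# Toy pool: 2 x 2 x 3 = 12 strata with 10 dialogues each (made-up labels).
combos = [
    (c, u, s)
    for c in ("low", "high")
    for u in ("routine", "urgent")
    for s in ("billing", "outage", "tariff")
    for _ in range(10)
]
pool = [
    {"id": i, "complexity": c, "urgency": u, "scenario": s}
    for i, (c, u, s) in enumerate(combos)
]

picked = stratified_sample(
    pool, ("complexity", "urgency", "scenario"), total=24
)
print(len(picked))
```

With a fixed per-stratum quota, rare combinations (for example urgent, high-complexity outage calls) are represented as often as common ones, which is the usual motivation for balancing quotas before a preference-annotation pass.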