TimeSeries2Report prompting enables adaptive large language model management of lithium-ion batteries

Jiayang Yang, Chunhui Zhao, Martin Guay, Zhixing Cao

TL;DR

The paper presents TimeSeries2Report (TS2R), a prompting-based bridge that converts multivariate lithium-ion battery (LIB) time series into semantically rich textual reports that off-the-shelf LLMs can reason over. TS2R segments each time series, annotates short-term dynamics with semantic descriptors, translates the descriptors into standardized expressions, and prompts an LLM to generate interpretable operation and maintenance (O&M) reports without any model retraining. In benchmarks on the lab-scale MIT and TJU datasets and the real-world ZJU-o dataset, TS2R improves report factuality, state-of-charge (SOC) prediction accuracy, anomaly detection reliability, and charging/discharging management decisions across multiple LLM backbones. The framework enables scalable, explainable, and model-agnostic LLM-driven battery management and comes with open-source datasets and tooling to accelerate adoption and further research.

Abstract

Large language models (LLMs) offer promising capabilities for interpreting multivariate time-series data, yet their application to real-world battery energy storage system (BESS) operation and maintenance remains largely unexplored. Here, we present TimeSeries2Report (TS2R), a prompting framework that converts raw lithium-ion battery operational time-series into structured, semantically enriched reports, enabling LLMs to reason, predict, and make decisions in BESS management scenarios. TS2R encodes short-term temporal dynamics into natural language through a combination of segmentation, semantic abstraction, and rule-based interpretation, effectively bridging low-level sensor signals with high-level contextual insights. We benchmark TS2R across both lab-scale and real-world datasets, evaluating report quality and downstream task performance in anomaly detection, state-of-charge prediction, and charging/discharging management. Compared with vision-, embedding-, and text-based prompting baselines, report-based prompting via TS2R consistently improves LLM performance across accuracy, robustness, and explainability metrics. Notably, TS2R-integrated LLMs achieve expert-level decision quality and predictive consistency without retraining or architecture modification, establishing a practical path for adaptive, LLM-driven battery intelligence.
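
The segmentation, semantic abstraction, and rule-based interpretation steps named above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the slice length, the tolerance `tol`, the net-change rule used to pick a descriptor, and the sentence template are all assumptions made for illustration. Only the overall flow (uniform slices, one descriptor per slice, merging of adjacent identical descriptors, translation into text) follows the description in this summary.

```python
from itertools import groupby


def describe_slice(values, tol=0.01):
    """Label one slice by its net change (hypothetical rule and threshold)."""
    delta = values[-1] - values[0]
    if delta > tol:
        return "increasing"
    if delta < -tol:
        return "decreasing"
    return "stable"


def series_to_expressions(series, slice_len=5, tol=0.01, name="voltage"):
    """Turn a numeric series into short textual expressions (illustrative only)."""
    # 1. Uniform segmentation into short slices (the last slice may be shorter).
    slices = [(start, series[start:start + slice_len])
              for start in range(0, len(series), slice_len)]

    # 2. Semantic abstraction: one descriptor per slice, with its index range.
    labelled = [(start, start + len(chunk) - 1, describe_slice(chunk, tol))
                for start, chunk in slices]

    # 3. Merge adjacent slices that share the same descriptor.
    merged = []
    for label, group in groupby(labelled, key=lambda item: item[2]):
        group = list(group)
        merged.append((group[0][0], group[-1][1], label))

    # 4. Translate merged descriptors into a fixed sentence template.
    return [f"From step {first} to step {last}, {name} is {label}."
            for first, last, label in merged]


# Toy charging curve: a steady rise followed by a plateau (made-up values).
toy_voltage = [3.0, 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7,
               3.8, 3.9, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0]
for sentence in series_to_expressions(toy_voltage, slice_len=4):
    print(sentence)
```

With these toy values the three rising slices collapse into one "increasing" expression and the final slice yields a "stable" one, which is the kind of compact text an LLM would then receive in place of the raw numbers. A real descriptor set would also need attributes such as fluctuation and transition, which this sketch omits.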

Paper Structure

This paper contains 18 sections, 3 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Overview of the TimeSeries2Report framework. a, Data acquisition and system-level statistics. Operational time-series data are collected from all LIB cells arranged in series and/or parallel configurations. Cross-cell statistics, i.e., mean (avg.), standard deviation (std.), and entropy, are computed at each time stamp to capture system-level dynamics. b, Time-series to expression. Each time series is uniformly segmented into short temporal slices, and each slice is evaluated using multiple attributes (e.g., trend, fluctuation, transition) to generate local semantic descriptors such as increasing, decreasing, or stable. Adjacent slices with identical descriptors are merged to reduce redundancy, and the resulting consolidated descriptors are translated into standardized expressions. c, Expression to report. The LLM receives (i) the derived expressions, and (ii) instructions specifying the report’s desired style and output structure. It integrates these inputs to synthesize a coherent, evidence-grounded report describing the operational behavior of the BESS. d, Downstream applications. The generated report serves as input for LLM-driven O&M tasks, demonstrated here for three representative examples: SOC prediction, operational monitoring with anomaly detection, and charging/discharging management with interpretable decision support.
  • Figure 2: Quantitative evaluation of TS2R on public datasets of single, standalone LIB cells. a, Top: Representative voltage time series from the MIT dataset showing a LIB cell undergoing CC-CV charging. Bottom: Comparison of reports generated by Qwen3-14B prompted with (left) raw numerical time series, (middle) TS2R-parsed textual input, and (right) an expert-written reference. The TS2R-enhanced report correctly identifies the CC-CV phase transition at the $\sim$55$^\text{th}$ time stamp, whereas the baseline fails to do so. b, Schematic of the FactScore computation pipeline, which quantifies the factual consistency between generated reports and expert annotations. c, FactScore comparison for LLMs prompted with raw numerical time series (–) versus TS2R-parsed text (+), evaluated on the MIT dataset. d, Same as (c), evaluated on the TJU dataset. TS2R consistently improves performance across all models (Qwen3-14B, Llama3.1-8B, DeepSeek-v3.2, ChatGLM-4.5) and conditions. DeepSeek-v3.2 (+) and ChatGLM-4.5 (+) achieve the highest overall performance. Bars represent mean values; error bars indicate 95% confidence intervals (CI); overlaid black dots show individual FactScore values. *: $p < 0.1$, **: $p < 0.01$, ****: $p < 0.0001$; $p$-values computed using a one-sided paired Wilcoxon test.
  • Figure 3: Performance evaluation of TS2R on a real-world BESS dataset. a, Distribution of discharging currents from the MIT dataset (lab-scale) and the ZJU-o dataset (real-world). The ZJU-o distribution exhibits a markedly broader range, highlighting the greater variability and complexity of LIB operating conditions in field deployments. b, Schematic illustration of the computation of system-level statistical time series, including the average and standard deviation of voltage across 16 cells per module. c–d, Example system-level reports generated by Qwen3-14B with raw numerical input (left), TS2R-processed semantic input (middle), and expert-written references (right), based on the voltage mean (c) and standard deviation (d) time series shown in (b). With raw numerical input, Qwen3-14B misinterprets trends and fails to detect key transitions, whereas TS2R enables accurate identification of a sharp voltage drop and the corresponding transition point. e, FactScore evaluation on the ZJU-t dataset assessing system-level report quality across four attributes (trend, initial/final values, temporal mean and standard deviation, and amplitude level) and four LLM backbones (Qwen3-14B, Llama3.1-8B, DeepSeek-v3.2, ChatGLM-4.5). Across all settings, LLMs prompted with TS2R-derived semantic input substantially outperform those given raw numerical data. Horizontal lines indicate mean FactScore; error bars represent 95% CI. **: $p < 0.01$; ****: $p < 0.0001$ (one-sided paired Wilcoxon test).
  • Figure 4: TS2R improves the accuracy of SOC prediction. a, Illustration of four categories of LLM-based methods: (i) Text-based LLMs, which take raw numerical time-series data as text input; (ii) Vision-based LLMs, which process time-series plots as images; (iii) Time-series embedding models, which encode time-series data into deep vector embeddings aligned with language tokens; and (iv) Report-based LLMs, where TS2R converts time-series into natural language reports used as prompts. b, Evaluation pipeline for SOC prediction. Using the ZJU-t dataset, each method is prompted to predict the system-level average SOC at minute 100, based on time-series input up to minute 90. Predictions are compared against ground truth, and performance is quantified by RMSE and MAE. c, SOC prediction accuracy of seven LLM-based methods, shown as bar plots for RMSE and MAE. Report-based LLMs (TS2R+Qwen3-14B, TS2R+ChatGLM-4.5) consistently achieve the lowest prediction errors, demonstrating the effectiveness of TS2R in enhancing temporal reasoning.
  • Figure 5: TS2R enables accurate abnormality detection in LIB operation monitoring. a, Quantitative evaluation pipeline for anomaly detection using the seven LLM-based methods across four categories described in Fig. 4. Each method is prompted to classify whether a time-series sample is normal or abnormal. If labeled as abnormal, the model is further queried to identify the type of anomaly (e.g., voltage or temperature-related). Performance is assessed using two metrics: Acc and FAR. b, Comparison of the seven methods on the ZJU-ta dataset, which includes three curated abnormal scenarios and reflects the real-world rarity of anomalies. Report-based methods (TS2R+ChatGLM-4.5 and TS2R+Qwen3-14B) consistently outperform other approaches, achieving the highest accuracy and lowest FAR within their respective LLM families. TS2R+ChatGLM-4.5 achieves the best overall performance, demonstrating its suitability for real-time LIB operation monitoring.
  • ...and 1 more figure
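
The captions for Figures 4 and 5 name four evaluation metrics: RMSE and MAE for SOC prediction, and Acc and FAR for anomaly detection. The sketch below shows how such metrics are commonly computed. It is an assumption that Acc denotes classification accuracy and FAR denotes the false alarm rate (normal samples flagged as abnormal, divided by all normal samples); this summary does not define either term, and the function names and toy values are hypothetical.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between ground truth and predictions."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error between ground truth and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def accuracy(labels, preds):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(l == p for l, p in zip(labels, preds)) / len(labels)


def false_alarm_rate(labels, preds, normal="normal"):
    """Assumed definition: share of truly normal samples flagged as abnormal."""
    preds_on_normal = [p for l, p in zip(labels, preds) if l == normal]
    if not preds_on_normal:
        return 0.0  # no normal samples, so no false alarm is possible
    return sum(p != normal for p in preds_on_normal) / len(preds_on_normal)


# Toy SOC values in percent and toy anomaly labels (made-up data).
soc_true, soc_pred = [50.0, 60.0], [53.0, 56.0]
labels = ["normal", "normal", "voltage", "normal"]
preds = ["normal", "voltage", "voltage", "normal"]

print(rmse(soc_true, soc_pred), mae(soc_true, soc_pred))
print(accuracy(labels, preds), false_alarm_rate(labels, preds))
```

Because anomalies are rare in field data, accuracy alone can look high for a model that labels everything as normal, which is one reason to report a false alarm rate alongside it.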