XForecast: Evaluating Natural Language Explanations for Time Series Forecasting

Taha Aksu, Chenghao Liu, Amrita Saha, Sarah Tan, Caiming Xiong, Doyen Sahoo

TL;DR

The paper introduces two new performance metrics based on simulatability, which assess how well a human surrogate can predict model forecasts from the explanations, and finds that numerical reasoning, rather than model size, is the main factor influencing explanation quality.

Abstract

Time series forecasting aids decision-making, especially for stakeholders who rely on accurate predictions, making it very important to understand and explain these models to ensure informed decisions. Traditional explainable AI (XAI) methods, which underline feature or temporal importance, often require expert knowledge. In contrast, natural language explanations (NLEs) are more accessible to laypeople. However, evaluating forecast NLEs is difficult due to the complex causal relationships in time series data. To address this, we introduce two new performance metrics based on simulatability, assessing how well a human surrogate can predict model forecasts using the explanations. Experiments show these metrics differentiate good from poor explanations and align with human judgments. Utilizing these metrics, we further evaluate the ability of state-of-the-art large language models (LLMs) to generate explanations for time series data, finding that numerical reasoning, rather than model size, is the main factor influencing explanation quality.

Paper Structure

This paper contains 44 sections, 1 equation, 10 figures, 14 tables, 2 algorithms.

Figures (10)

  • Figure 1: Example natural language explanation (NLE) for a time series forecast. While the raw forecast might be challenging for a layperson to interpret, the NLE provided by the LLM helps clarify the causal relationship.
  • Figure 2: Depiction of newly proposed metrics for evaluating explanations of black box forecasting models. The left-hand side depicts the direct simulatability metric, which quantifies the distance between the ground truth and the simulated forecast on the original input. The right-hand side depicts the synthetic simulatability metric, which conducts simulations on a newly generated time series.
  • Figure 3: The original and synthetic time series history and forecast pairs, generated using the pipeline in Figure 2 (right), both align with the explanation, though they differ in scale and noise. The explanation is generated by the explainer using the original history and forecast.
  • Figure 4: Sample forecasting explanation generated by our pipeline. The colored snippets refer to the four salient points identified by Warner (1998) for explaining time series.
  • Figure 5: Qualitative examples showing the effect of the explanation on the forecast. The subfigures show examples for direct and synthetic simulatability, respectively. Historical context is shown in blue, the black-box model forecast in orange, the LLMTime prediction in red, and the LLMTime_E prediction in green.
  • ...and 5 more figures
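
The Figure 2 caption describes direct simulatability as a distance between a reference forecast and the forecast a surrogate simulates from the history plus the explanation. The following is a minimal sketch of that idea only, not the paper's implementation. The names `simulatability_distance` and `naive_surrogate`, the surrogate signature, and the choice of mean squared error as the distance are all assumptions made for illustration.

```python
from typing import Callable, Sequence

Series = Sequence[float]
# Hypothetical surrogate signature: (history, explanation) -> simulated forecast.
Surrogate = Callable[[Series, str], Series]


def mean_squared_error(reference: Series, simulated: Series) -> float:
    """Assumed distance; the source only says 'distance'."""
    if len(reference) == 0 or len(reference) != len(simulated):
        raise ValueError("forecasts must be non-empty and share one horizon")
    return sum((r - s) ** 2 for r, s in zip(reference, simulated)) / len(reference)


def simulatability_distance(
    history: Series,
    explanation: str,
    reference_forecast: Series,
    surrogate: Surrogate,
    distance: Callable[[Series, Series], float] = mean_squared_error,
) -> float:
    """Distance between the reference forecast and the surrogate's simulation.

    Lower values mean the explanation let the surrogate reproduce the
    reference forecast more closely.
    """
    simulated = surrogate(history, explanation)
    return distance(reference_forecast, simulated)


def naive_surrogate(history: Series, explanation: str) -> Series:
    """Toy stand-in for a human or LLM surrogate.

    It ignores the explanation and repeats the last observed value
    for a fixed three-step horizon.
    """
    return [history[-1]] * 3


score = simulatability_distance(
    history=[1.0, 2.0, 3.0],
    explanation="The series follows a steady upward trend.",
    reference_forecast=[3.0, 4.0, 5.0],
    surrogate=naive_surrogate,
)
```

Under the same assumptions, the synthetic variant described in the caption would apply the same distance to a newly generated series rather than the original input; how that series is generated is not specified here, so it is left out of the sketch.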