DOCTOR: Dynamic On-Chip Temporal Variation Remediation Toward Self-Corrected Photonic Tensor Accelerators

Haotian Lu, Sanmitra Banerjee, Jiaqi Gu

TL;DR

Photonic accelerators offer speed and energy benefits but suffer from reliability issues caused by temporally drifting thermal variations. DOCTOR provides a dynamic, on-chip remediation workflow that combines salience-aware sparse calibration, variation-aware tile remapping, and an adaptive remediation controller to recover accuracy without training data or backpropagation. Key contributions include rigorous thermal variation modeling for MRR-based devices, training-free calibration via batched block-wise regression, and a LAP-based remapping strategy with adaptive remediation timing. Together they limit the average accuracy drop to about 1%-2.5% at 0.1%-5.1% cycle overhead, outperforming prior on-chip training by ~34% in accuracy and 2–3 orders of magnitude in efficiency. These results support reliable, self-corrected photonic tensor accelerators suitable for dynamic, edge-focused AI workloads, with open-source code provided.

Abstract

Photonic computing has emerged as a promising solution for accelerating computation-intensive artificial intelligence (AI) workloads, offering unparalleled speed and energy efficiency, especially in resource-limited, latency-sensitive edge computing environments. However, the deployment of analog photonic tensor accelerators encounters reliability challenges due to hardware noise and environmental variations. While off-chip noise-aware training and on-chip training have been proposed to enhance the variation tolerance of optical neural accelerators with moderate, static noise, we observe a notable performance degradation over time due to temporally drifting variations, which requires a real-time, in-situ calibration mechanism. To tackle these challenging reliability issues, for the first time, we propose a lightweight dynamic on-chip remediation framework, dubbed DOCTOR, providing adaptive, in-situ accuracy recovery against temporally drifting noise. The DOCTOR framework intelligently monitors the chip status using adaptive probing and performs fast in-situ training-free calibration to restore accuracy when necessary. Recognizing nonuniform spatial variation distributions across devices and tensor cores, we also propose a variation-aware architectural remapping strategy to avoid executing critical tasks on noisy devices. Extensive experiments show that our proposed framework can guarantee sustained performance under drifting variations with 34% higher accuracy and 2-3 orders-of-magnitude lower overhead compared to state-of-the-art on-chip training methods. Our code is open-sourced at https://github.com/ScopeX-ASU/DOCTOR.

Paper Structure

This paper contains 21 sections, 10 equations, 14 figures, 4 tables.

Figures (14)

  • Figure 1: (a) An MRR-based photonic accelerator is sensitive to temperature drift. (b) Drifting noises, including phase variation (PV), temperature drift (TD), and thermal crosstalk (CT), cause a severe accuracy drop over time. "Acc" represents the accuracy. (c) Noise distributions across devices are nonuniform.
  • Figure 2: Our proposed dynamic remediation framework, DOCTOR, can counter the accuracy degradation due to temporally drifting hardware variations.
  • Figure 3: Architecture settings of a multi-core photonic tensor accelerator. (a) The accelerator can map an $Rk\times Rk$ matrix-vector multiplication at each cycle. Note that we draw $R=C=3$ and $k=4$ as an example for illustration; this is not the actual architecture setting. (b) The photonic accelerator includes $R$ tiles, and each tile includes $C$ photonic tensor cores (PTCs). Each PTC is of size $k \times k$. Partial sum accumulation is performed by photocurrent accumulation across the $C$ cores within one tile. The same input vector chunks are broadcast to the $R$ tiles (vertically) using photonic on-chip interconnects. (c) An ideal add-drop micro-ring resonator has a tunable through-port transmission $a$ and a corresponding drop-port transmission $1-a$. (d) As a case study, each $k\times k$ PTC is assumed to be a multiple-wavelength add-drop MRR weight bank with local buffers and electronic local control units.
  • Figure 4: Random phase variations on MRRs lead to large weight errors. Different devices and cores have distinct noise distributions.
  • Figure 5: Illustration of temporally drifting phase noise distributions. We control the mean and std of the distribution. At every timestep, the model samples a new noise std map from the scheduled distribution and smoothly evolves toward the new distribution via a damping factor $\beta_{\sigma}$.
  • ...and 9 more figures
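
The tiling described in the Figure 3 caption can be sketched in a few lines of Python. This is a minimal, noise-free illustration, not the authors' implementation: the function names `ptc_mvm` and `tiled_mvm` are hypothetical, weights are ideal real numbers rather than MRR transmissions, and the weight matrix is taken to be $(Rk)\times(Ck)$, which coincides with the caption's $Rk\times Rk$ when $R=C$.

```python
# Illustrative sketch only: ptc_mvm and tiled_mvm are hypothetical names,
# and ideal, noise-free weights are assumed.

def ptc_mvm(block, chunk):
    """One k x k PTC: multiply a k x k weight block by a length-k input chunk."""
    return [sum(w * v for w, v in zip(row, chunk)) for row in block]


def tiled_mvm(weights, x, R, C, k):
    """Compute y = W x for an (R*k) x (C*k) matrix using R tiles of C PTCs."""
    assert len(weights) == R * k and all(len(row) == C * k for row in weights)
    assert len(x) == C * k
    # Split the input into C chunks; every tile receives the same chunks (broadcast).
    chunks = [x[c * k:(c + 1) * k] for c in range(C)]
    y = []
    for r in range(R):
        # acc stands in for photocurrent accumulation across the C cores of tile r.
        acc = [0.0] * k
        for c in range(C):
            block = [row[c * k:(c + 1) * k] for row in weights[r * k:(r + 1) * k]]
            partial = ptc_mvm(block, chunks[c])
            acc = [a + p for a, p in zip(acc, partial)]
        y.extend(acc)
    return y


# Usage with the illustrative setting from the caption: R = C = 3, k = 4.
R, C, k = 3, 3, 4
W = [[float((i + j) % 5) for j in range(C * k)] for i in range(R * k)]
x = [1.0] * (C * k)
y = tiled_mvm(W, x, R, C, k)
```

Each tile produces $k$ output elements, so the $R$ tiles together yield an output vector of length $Rk$ per cycle. Per-core noise, which the later figures discuss, would enter this sketch as a perturbation of each `block` before `ptc_mvm` is called.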