PRISM: Breaking the O(n) Memory Wall in Long-Context LLM Inference via O(1) Photonic Block Selection

Hyoseok Park; Yeonsang Park

Abstract

Long-context LLM inference is bottlenecked not by compute but by the O(n) memory bandwidth cost of scanning the KV cache at every decode step -- a wall that no amount of arithmetic scaling can break. Recent photonic accelerators have demonstrated impressive throughput for dense attention computation; however, these approaches inherit the same O(n) memory scaling as electronic attention when applied to long contexts. We observe that the real leverage point is the coarse block-selection step: a memory-bound similarity search that determines which KV blocks to fetch. We identify, for the first time, that this task is structurally matched to the photonic broadcast-and-weight paradigm -- the query fans out to all candidates via passive splitting, signatures are quasi-static (matching electro-optic MRR programming), and only rank order matters (relaxing precision to 4-6 bits). Crucially, the photonic advantage grows with context length: as N increases, the electronic scan cost rises linearly while the photonic evaluation remains O(1). We instantiate this insight in PRISM (Photonic Ranking via Inner-product Similarity with Microring weights), a thin-film lithium niobate (TFLN) similarity engine. Hardware-impaired needle-in-a-haystack evaluation on Qwen2.5-7B confirms 100% accuracy from 4K through 64K tokens at k=32, with 16x traffic reduction at 64K context. PRISM achieves a four-order-of-magnitude energy advantage over GPU baselines at practical context lengths (n >= 4K).
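The block-selection step summarized above can be illustrated numerically. The following is a minimal sketch under stated assumptions, not the authors' implementation: it uses mean-key block signatures (one of the encodings named in the paper outline), a uniform 5-bit quantizer as a stand-in for limited analog weight precision, and synthetic Gaussian keys with one planted "needle" block. The function names, the 128-token block size, and the signature dimension are hypothetical; the block size is chosen only so that N/k = 16 at 64K tokens with k = 32, since this excerpt does not state it.

```python
import numpy as np

def block_signatures(keys, block_size):
    """Mean key per block: (n_tokens, d) -> (n_blocks, d)."""
    n_blocks = keys.shape[0] // block_size
    trimmed = keys[: n_blocks * block_size]
    return trimmed.reshape(n_blocks, block_size, -1).mean(axis=1)

def quantize(x, bits):
    """Uniform symmetric quantizer (assumed model of low-precision weights)."""
    levels = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(x))
    if scale == 0:
        return x.copy()
    return np.round(x / scale * levels) / levels * scale

def select_blocks(query, signatures, k, bits=5):
    """Score all blocks by inner product, then return the top-k block indices."""
    scores = quantize(signatures, bits) @ quantize(query, bits)
    return np.argsort(scores)[::-1][:k]

# Synthetic setup (all sizes are illustrative assumptions).
rng = np.random.default_rng(0)
d, block_size, n_blocks, k = 32, 128, 512, 32   # 512 * 128 = 64K tokens
keys = rng.normal(size=(n_blocks * block_size, d))
query = rng.normal(size=d)

# Plant a needle: shift one block's keys toward the query direction.
needle_block = 137
start = needle_block * block_size
keys[start:start + block_size] += 2.0 * query

sigs = block_signatures(keys, block_size)
selected = select_blocks(query, sigs, k)

# Only k of the N blocks are fetched, so KV traffic drops by N/k.
traffic_reduction = n_blocks / k
print(needle_block in selected, traffic_reduction)
```

Because only the rank order of the scores is used, the coarse quantizer does not need to preserve score magnitudes, which is the property the abstract relies on when relaxing precision to 4-6 bits.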

Paper Structure (76 sections, 33 equations, 23 figures, 10 tables)

Introduction
Background
KV Cache in Transformer Inference
Retrieval Heads and Selective Attention
Photonic Similarity Engine
Broadcast-and-weight architecture.
WDM spectral encoding.
Comparison with other photonic paradigms.
MRR weight banks.
WDM-based matrix--vector multiplication.
Photonic Retrieval Architecture
System Overview
Signature Encoding
Mean key.
PCA projection.
...and 61 more sections

Figures (23)

Figure 1: Conceptual comparison of KV cache access strategies. Left: Electronic GPU full scan---the processor sequentially reads all $N$ KV blocks from HBM to compute attention, bottlenecked by memory bandwidth. Right: Prism photonic block selection---the query is broadcast optically to all $N$ signature channels in parallel; only the top-$k$ highest-scoring blocks are fetched from memory, reducing traffic by a factor of $N/k$.
Figure 2: Prism system architecture (five-stage pipeline). Stage 1 (Query Encoding): The GPU/ASIC computes the query sketch $\mathbf{q} = [q_1, \ldots, q_d]$ and encodes each component onto a WDM wavelength via DAC-driven modulators, producing a WDM query signal where $P(\lambda_j) = q_j$. Stage 2 (Broadcast): A $1 \times N$ optical splitter distributes identical copies of the $d$-wavelength signal to all $N$ signature channels (splitting loss: $-10\log_{10}N$ dB). Stage 3 (Signature Weighting): Each channel passes through a row of $d$ MRRs on the TFLN photonic chip; the transmission $t_{ij} = s_{ij}$ of each MRR is electro-optically programmed via DC bias electrodes to encode the block signature weight, performing wavelength-selective multiplication $P_{\text{out}}(\lambda_j) = q_j \times s_{ij}$. Stage 4 (Summation): Broadband photodetectors integrate all wavelengths, yielding photocurrents $I_i = \mathcal{R} \cdot \sum_j (q_j \cdot s_{ij})$ that are proportional to the inner product $\mathbf{q} \cdot \mathbf{s}_i$. Stage 5 (Top-$k$ Selection): ADCs digitize the $N$ photocurrents, a digital top-$k$ selector identifies the $k$ highest-scoring block indices, and a memory controller fetches only those KV blocks from HBM/flash storage.
Figure 3: Prism photonic chip layout for an $8 \times 8$ configuration ($d = 8$ WDM channels, $N = 8$ signature rows). Left: the WDM query input ($\lambda_1$--$\lambda_8$) enters and is split by cascaded $1 \times 2$ Y-junctions. Center: each row contains $d$ MRRs coupled to a bus waveguide with a coupling gap of ${\sim}$200--300 nm; EO DC bias electrodes program the MRR resonances to encode signature weights via the Pockels effect. Right: through-port and drop-port outputs route to balanced Ge-on-Si PD pairs (or, optionally, on-chip integrated photodetectors). Scale bar: 100 µm. The layout scales to $d = 32$, $N = 256$ by increasing the splitter tree depth and the number of rows.
Figure 4: X-cut TFLN rib waveguide cross-section. The rib is etched 500 nm into a 600 nm LN film on SiO$_2$, leaving a 100 nm slab. Lateral Au electrodes apply DC bias for electro-optic (Pockels) tuning of the MRR resonance wavelength. Waveguide width: 1.4 µm.
Figure 5: Optical power budget analysis. (a) Per-detector received power vs. bank size $N$ for three laser powers. The horizontal dashed line indicates the minimum detectable power ($-20$ dBm). (b) Electrical SNR at the photodetector vs. signature dimension $d$ for $N = 256$ and $N = 1024$. The shaded region marks SNR $> 20$ dB, sufficient for reliable top-$k$ ranking.
...and 18 more figures
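
The splitting loss quoted in the Figure 2 caption and the detection floor in Figure 5(a) can be combined into a rough link-budget check. The sketch below is illustrative only: it accounts for the ideal $1 \times N$ splitting loss of $10\log_{10}N$ dB and nothing else unless an excess loss is supplied. The 10 dBm laser power, the `excess_loss_db` parameter, and the function names are assumptions, not values from the paper.

```python
import math

def splitting_loss_db(n_channels):
    """Ideal 1xN splitter loss in dB, expressed as a positive number."""
    return 10.0 * math.log10(n_channels)

def received_power_dbm(laser_dbm, n_channels, excess_loss_db=0.0):
    """Per-channel power after the splitter; excess_loss_db is a hypothetical lump term."""
    return laser_dbm - splitting_loss_db(n_channels) - excess_loss_db

def max_channels(laser_dbm, sensitivity_dbm=-20.0, excess_loss_db=0.0):
    """Largest N whose received power stays at or above the detection floor."""
    margin_db = laser_dbm - excess_loss_db - sensitivity_dbm
    return int(math.floor(10 ** (margin_db / 10.0) + 1e-9))

laser_dbm = 10.0  # assumed laser power, for illustration only
for n in (256, 1024):
    print(n, round(splitting_loss_db(n), 2), round(received_power_dbm(laser_dbm, n), 2))
print(max_channels(laser_dbm))
```

Under these assumptions a bank of $N = 256$ stays above the $-20$ dBm floor while $N = 1024$ falls marginally below it. Real budgets would also include modulator, waveguide, MRR insertion, and coupling losses, which this sketch folds into the single optional `excess_loss_db` term.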
