PRISM: Breaking the O(n) Memory Wall in Long-Context LLM Inference via O(1) Photonic Block Selection

Hyoseok Park, Yeonsang Park

Abstract

Long-context LLM inference is bottlenecked not by compute but by the O(n) memory bandwidth cost of scanning the KV cache at every decode step -- a wall that no amount of arithmetic scaling can break. Recent photonic accelerators have demonstrated impressive throughput for dense attention computation; however, these approaches inherit the same O(n) memory scaling as electronic attention when applied to long contexts. We observe that the real leverage point is the coarse block-selection step: a memory-bound similarity search that determines which KV blocks to fetch. We identify, for the first time, that this task is structurally matched to the photonic broadcast-and-weight paradigm -- the query fans out to all candidates via passive splitting, signatures are quasi-static (matching electro-optic MRR programming), and only rank order matters (relaxing precision to 4-6 bits). Crucially, the photonic advantage grows with context length: as N increases, the electronic scan cost rises linearly while the photonic evaluation remains O(1). We instantiate this insight in PRISM (Photonic Ranking via Inner-product Similarity with Microring weights), a thin-film lithium niobate (TFLN) similarity engine. Hardware-impaired needle-in-a-haystack evaluation on Qwen2.5-7B confirms 100% accuracy from 4K through 64K tokens at k=32, with 16x traffic reduction at 64K context. PRISM achieves a four-order-of-magnitude energy advantage over GPU baselines at practical context lengths (n >= 4K).
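The selection step described in the abstract can be illustrated with a small electronic emulation. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names (`quantize`, `select_blocks`, `traffic_reduction`), the restriction of values to [0, 1], and the uniform quantizer are all hypothetical choices made for illustration. It shows a query being scored against every block signature at reduced precision, with only the top-k block indices kept.

```python
import heapq

def quantize(x, bits, lo=0.0, hi=1.0):
    """Clamp x to [lo, hi] and round to one of 2**bits uniform levels.

    Assumption: a uniform quantizer stands in for limited analog precision.
    """
    levels = (1 << bits) - 1
    x = min(max(x, lo), hi)
    step = (hi - lo) / levels
    return lo + round((x - lo) / step) * step

def select_blocks(query, signatures, k, bits=5):
    """Return the indices of the k signatures with the largest inner product.

    Every signature is scored against the same query, mirroring the broadcast;
    only the rank order of the scores is used.
    """
    q = [quantize(v, bits) for v in query]
    scores = []
    for sig in signatures:
        s = [quantize(v, bits) for v in sig]
        scores.append(sum(a * b for a, b in zip(q, s)))
    return heapq.nlargest(k, range(len(signatures)), key=scores.__getitem__)

def traffic_reduction(n_blocks, k):
    """Ratio of blocks scanned by a full read to blocks fetched after selection."""
    return n_blocks / k

# Toy example: 16 blocks of dimension 8, with block 7 matching the query.
d = 8
query = [1.0] * d
signatures = [[0.1] * d for _ in range(16)]
signatures[7] = [0.9] * d
top = select_blocks(query, signatures, k=2, bits=5)
```

If the reported 16x reduction at 64K context is read as N/k with k=32, it corresponds to N=512 blocks, which would imply 128 tokens per block; that block size is inferred here and is not stated in this excerpt.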

Paper Structure (76 sections, 33 equations, 23 figures, 10 tables)


Figures (23)

  • Figure 1: Conceptual comparison of KV cache access strategies. Left (electronic GPU full scan): the processor sequentially reads all $N$ KV blocks from HBM to compute attention, bottlenecked by memory bandwidth. Right (Prism photonic block selection): the query is broadcast optically to all $N$ signature channels in parallel; only the top-$k$ highest-scoring blocks are fetched from memory, reducing traffic by a factor of $N/k$.
  • Figure 2: Prism system architecture (five-stage pipeline). Stage 1 (Query Encoding): The GPU/ASIC computes the query sketch $\mathbf{q} = [q_1, \ldots, q_d]$ and encodes each component onto a WDM wavelength via DAC-driven modulators, producing a WDM query signal where $P(\lambda_j) = q_j$. Stage 2 (Broadcast): A $1 \times N$ optical splitter distributes identical copies of the $d$-wavelength signal to all $N$ signature channels (splitting loss: $-10\log_{10}N$ dB). Stage 3 (Signature Weighting): Each channel passes through a row of $d$ MRRs on the TFLN photonic chip; the transmission $t_{ij} = s_{ij}$ of each MRR is electro-optically programmed via DC bias electrodes to encode the block signature weight, performing wavelength-selective multiplication $P_{\text{out}}(\lambda_j) = q_j \times s_{ij}$. Stage 4 (Summation): Broadband photodetectors integrate all wavelengths, yielding photocurrents $I_i = \mathcal{R} \cdot \sum_j (q_j \cdot s_{ij})$ that are proportional to the inner product $\mathbf{q} \cdot \mathbf{s}_i$. Stage 5 (Top-$k$ Selection): ADCs digitize the $N$ photocurrents, a digital top-$k$ selector identifies the $k$ highest-scoring block indices, and a memory controller fetches only those KV blocks from HBM/flash storage.
  • Figure 3: Prism photonic chip layout for an $8 \times 8$ configuration ($d = 8$ WDM channels, $N = 8$ signature rows). Left: the WDM query input ($\lambda_1$--$\lambda_8$) enters and is split by cascaded $1 \times 2$ Y-junctions. Center: each row contains $d$ MRRs coupled to a bus waveguide with coupling gap of ${\sim}$200--300 nm; EO DC bias electrodes program the MRR resonances to encode signature weights via the Pockels effect. Right: through-port and drop-port outputs route to balanced Ge-on-Si PD pairs (or optionally on-chip integrated photodetectors). Scale bar: 100 µm. The layout scales to $d = 32$, $N = 256$ by increasing the splitter tree depth and the number of rows.
  • Figure 4: X-cut TFLN rib waveguide cross-section. The rib is etched 500 nm into a 600 nm LN film on SiO$_2$, leaving a 100 nm slab. Lateral Au electrodes apply DC bias for electro-optic (Pockels) tuning of the MRR resonance wavelength. Waveguide width: 1.4 µm.
  • Figure 5: Optical power budget analysis. (a) Per-detector received power vs. bank size $N$ for three laser powers. The horizontal dashed line indicates the minimum detectable power ($-20$ dBm). (b) Electrical SNR at the photodetector vs. signature dimension $d$ for $N = 256$ and $N = 1024$. The shaded region marks SNR $> 20$ dB, sufficient for reliable top-$k$ ranking.
  • ...and 18 more figures
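
The power-budget reasoning in the captions for Figures 2 and 5 can be sketched numerically. This is an illustrative calculation under stated assumptions: it models only the ideal $1 \times N$ splitting loss of $10\log_{10}N$ dB plus a lumped `excess_loss_db` term, and the laser powers used in the example are hypothetical values, not the ones plotted in the paper. The $-20$ dBm floor is taken from the Figure 5(a) caption.

```python
import math

MIN_DETECTABLE_DBM = -20.0  # detection floor quoted in the Figure 5(a) caption

def splitting_loss_db(n_channels):
    """Ideal 1xN splitter: each output receives 1/N of the input power."""
    return 10.0 * math.log10(n_channels)

def received_power_dbm(laser_dbm, n_channels, excess_loss_db=0.0):
    """Per-detector power after splitting and a lumped excess loss.

    Assumption: excess_loss_db is a hypothetical catch-all for coupling,
    propagation, and ring insertion losses, which this excerpt does not list.
    """
    return laser_dbm - splitting_loss_db(n_channels) - excess_loss_db

def max_bank_size(laser_dbm, excess_loss_db=0.0):
    """Largest N whose per-detector power stays at or above the floor."""
    margin_db = laser_dbm - excess_loss_db - MIN_DETECTABLE_DBM
    return int(10.0 ** (margin_db / 10.0))

# Hypothetical laser powers, chosen only to exercise the functions.
p_256 = received_power_dbm(laser_dbm=10.0, n_channels=256)
p_1024 = received_power_dbm(laser_dbm=0.0, n_channels=1024)
```

Under these assumptions, a 10 dBm source split 256 ways stays above the floor, while a 0 dBm source split 1024 ways does not, which is the kind of trade-off Figure 5(a) plots against bank size.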