Cooperative Edge Caching with Large Language Model in Wireless Networks

Ning Yang, Wentao Wang, Lingtao Ouyang, Haijun Zhang

TL;DR

This paper proposes a Large Language Model (LLM)-based multi-Base Station (BS) orchestrator that approaches exhaustive-search performance, outperforms classical baselines, and shows robust zero-shot transfer across varying cache capacities, library sizes, and user densities.

Abstract

Cooperative edge caching in overlapping zones creates intricate coupling among Base Station (BS) decisions, making content replacement highly sensitive to topology and temporal reuse. While heuristics are often myopic and Deep Reinforcement Learning lacks robustness under dynamics, this paper proposes a Large Language Model (LLM)-based multi-BS orchestrator. The LLM acts as the sole autonomous engine, interacting with the environment via a validated text-to-action interface. Each time slot, the system renders environmental states (including cache inventories and frequency statistics) into prompts, parsing LLM-generated decisions against strict feasibility constraints. We align the model through a two-stage paradigm: Supervised Fine-Tuning on oracle trajectories for syntax and initialization, followed by Group Relative Policy Optimization. The latter employs an "opportunity-aware" reward that prioritizes multi-step cooperative gains relative to a No-Operation baseline. Evaluated on identical request traces, the orchestrator approaches exhaustive-search performance (0.610 vs. 0.617 in a 5-BS scenario), outperforms classical baselines (e.g., +4.1% over least-frequently used), and demonstrates robust zero-shot transfer across varying cache capacities, library sizes, and user densities.
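
The validated text-to-action interface described in the abstract can be pictured with a minimal sketch. The action grammar (`BS<b>: NOOP` or `BS<b>: EVICT <i> ADMIT <j>`), the function names `parse_actions` and `apply_actions`, and the particular feasibility checks are assumptions made for illustration; the abstract does not specify the paper's actual output format or constraint set.

```python
import re

# Hypothetical action grammar (an assumption, not the paper's format):
#   "BS<b>: NOOP"  or  "BS<b>: EVICT <i> ADMIT <j>"
_LINE = re.compile(r"BS(\d+): (?:NOOP|EVICT (\d+) ADMIT (\d+))")


def parse_actions(text, caches, library_size):
    """Parse LLM output into per-BS cache updates.

    Returns None if any line is malformed or infeasible, so nothing is
    ever partially executed (strict validation).
    """
    actions = {}
    for line in text.strip().splitlines():
        m = _LINE.fullmatch(line.strip())
        if m is None:
            return None  # syntax error
        bs = int(m.group(1))
        if bs not in caches or bs in actions:
            return None  # unknown BS or duplicate decision
        if m.group(2) is None:
            actions[bs] = None  # No-Operation
            continue
        evict, admit = int(m.group(2)), int(m.group(3))
        feasible = (
            evict in caches[bs]            # can only evict a cached item
            and admit not in caches[bs]    # admitted item must be new
            and 0 <= admit < library_size  # item must exist in the library
        )
        if not feasible:
            return None
        actions[bs] = (evict, admit)
    # Every BS must receive exactly one decision.
    return actions if actions.keys() == caches.keys() else None


def apply_actions(caches, actions):
    """Return new cache inventories; one-for-one swaps preserve capacity."""
    new = {bs: set(items) for bs, items in caches.items()}
    for bs, act in actions.items():
        if act is not None:
            evict, admit = act
            new[bs].remove(evict)
            new[bs].add(admit)
    return new
```

With `caches = {0: {1, 2}, 1: {2, 3}}` and a library of 10 items, the text `"BS0: EVICT 1 ADMIT 5\nBS1: NOOP"` parses to `{0: (1, 5), 1: None}`, while an output that evicts an item the BS does not hold, or that omits a BS, is rejected as a whole.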

Paper Structure

This paper contains 47 sections, 2 theorems, 26 equations, 8 figures, 5 tables, 1 algorithm.

Key Result

Theorem 1

Define the state potential function as $\Phi(\bar{s}^{(t)}) \triangleq \mathcal{H}_H(\mathbf{X}^{(t)}; \mathcal{Q}^{(t+1:t+H)})$. The reward structure satisfies:

Figures (8)

  • Figure 1: Multi-cell cooperative edge caching architecture with overlapping BS coverage.
  • Figure 2: Two-stage alignment pipeline. In Stage I (SFT), the LLM is trained on oracle trajectories to produce strictly executable cache-update actions. In Stage II (GRPO), the policy is fine-tuned with a shaped multi-step gain reward together with explicit format and opportunity penalties, while actions are validated and executed through the same strict parser used at deployment.
  • Figure 3: Representative GRPO training trajectories under strict executability (normalized reward vs. training step) for two-BS and five-BS settings.
  • Figure 4: Two-BS horizon comparison across three frozen tasks: training with $H{=}10$ consistently improves cooperative hit rate over $H{=}5$.
  • Figure 5: Five-BS robustness sweep: cooperative hit rate vs. cache capacity $C_b$ (zero-shot, no retraining).
  • ...and 3 more figures
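
The Stage II description in the Figure 2 caption (a shaped multi-step gain reward with format and opportunity penalties, optimized with GRPO) can be sketched roughly as follows. The function `shaped_reward`, its penalty values, and the rule used to trigger the opportunity penalty are assumptions for illustration only; the caption does not give the paper's reward formula. The group-wise standardization in `group_relative_advantages` follows the usual GRPO recipe, with the population standard deviation chosen here for simplicity.

```python
from statistics import mean, pstdev


def shaped_reward(gain_vs_noop, parsed_ok, chose_noop, best_gain,
                  format_penalty=1.0, opportunity_penalty=0.5):
    """Hypothetical reward: multi-step gain over No-Op, with penalties.

    gain_vs_noop: H-step hit gain of the sampled action minus No-Op's.
    best_gain:    best achievable H-step gain in this state (assumed to
                  come from some oracle or search procedure).
    """
    if not parsed_ok:
        return -format_penalty  # output was not executable
    if chose_noop and best_gain > 0:
        return -opportunity_penalty  # stayed idle despite an available gain
    return gain_vs_noop


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Given a group of completions sampled for the same prompt, each completion is scored with `shaped_reward` and the resulting list is passed to `group_relative_advantages`; completions that beat the group average receive positive advantages, and a group with identical rewards yields all-zero advantages.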

Theorems & Definitions (2)

  • Theorem 1: PBRS Invariance and Opportunity Correction
  • Proposition 1: Exponential growth of the joint action space
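
To make the exponential growth named in Proposition 1 concrete, here is a small counting sketch under a simplified action model that is an assumption, not the paper's definition: in each slot a BS either does nothing or swaps one cached item for one uncached item, and BSs choose independently. The values for capacity (3) and library size (20) are illustrative only.

```python
def local_actions(cache_capacity, library_size):
    """No-Op plus every (evict, admit) single-item swap at one BS."""
    return 1 + cache_capacity * (library_size - cache_capacity)


def joint_actions(num_bs, cache_capacity, library_size):
    """Joint action count when each BS chooses independently."""
    return local_actions(cache_capacity, library_size) ** num_bs


# Illustrative sizes only: capacity 3, library of 20 items.
for b in (1, 2, 5):
    print(b, joint_actions(b, cache_capacity=3, library_size=20))
```

Under these assumptions a single BS has 52 options, two BSs have 2,704 joint options, and five BSs have 380,204,032, which illustrates why exhaustive search over joint decisions becomes costly as the number of BSs grows.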