When Routers, Switches and Interconnects Compute: A processing-in-interconnect Paradigm for Scalable Neuromorphic AI

Madhuvanthi Srivatsav, Chiranjib Bhattacharyya, Shantanu Chakrabartty, Chetan Singh Thakur

TL;DR

The paper proposes Processing-in-Interconnect ($\pi^2$), a neuromorphic paradigm that reinterprets routing and switching interconnect primitives (delays, causality, time-outs, drops, and broadcasts) as computational operations. By mapping neural computations onto interconnect behaviors via credit-based and asynchronous traffic-shaping protocols, $\pi^2$ enables in-network computation whose energy efficiency improves as interconnect bandwidth grows; knowledge distillation further allows existing neural network topologies to be trained onto $\pi^2$ with minimal loss in generalization. Analytical and simulation results suggest that the energy utilization factor ($\eta$) approaches unity as bandwidth advances, and that brain-scale inference may be achievable at hundreds of watts, leveraging Ethernet/TSN hardware for scalable, distributed neuromorphic processing. The work also explores trade-offs between delay-based computation, spiking sparsity, and hardware constraints, and demonstrates through OMNeT++ and GPU experiments that $\pi^2$ can approximate MAC-based networks and support scalable visual recognition tasks when complemented by distillation and hardware-aware training. Overall, $\pi^2$ offers a practical path to scalable neuromorphic AI by turning interconnects into active computational substrates, potentially transforming data-movement energy into productive computation as interconnects evolve.

Abstract

Routing, switching, and the interconnect fabric are essential for large-scale neuromorphic computing. While this fabric only plays a supporting role in the process of computing, for large AI workloads it ultimately determines energy consumption and speed. In this paper, we address this bottleneck by asking: (a) What computing paradigms are inherent in existing routing, switching, and interconnect systems, and how can they be used to implement a processing-in-interconnect ($\pi^2$) computing paradigm? and (b) leveraging current and future interconnect trends, how will a $\pi^2$ system's performance scale compared to other neuromorphic architectures? For (a), we show that operations required for typical AI workloads can be mapped onto delays, causality, time-outs, packet drop, and broadcast operations -- primitives already implemented in packet-switching and packet-routing hardware. We show that existing buffering and traffic-shaping embedded algorithms can be leveraged to implement neuron models and synaptic operations. Additionally, a knowledge-distillation framework can train and cross-map well-established neural network topologies onto $\pi^2$ without degrading generalization performance. For (b), analytical modeling shows that, unlike other neuromorphic platforms, the energy scaling of $\pi^2$ improves with interconnect bandwidth and energy efficiency. We predict that by leveraging trends in interconnect technology, a $\pi^2$ architecture can be more easily scaled to execute brain-scale AI inference workloads with power consumption levels in the range of hundreds of watts.

Paper Structure

This paper contains 26 sections, 50 equations, 18 figures, 5 tables, 1 algorithm.

Figures (18)

  • Figure 1: A) Standard von Neumann architecture, where data is moved between different levels of the memory hierarchy (labeled $M_1$ and $M_2$) and the compute unit (labeled C). B) In Compute-in-Memory (CIM) architectures, some compute operations are performed directly within the memory arrays, significantly reducing data movement and the associated latency. C) Distributed neuromorphic architecture with multiple compute and memory cores communicating over the interconnect fabric (routers and switches). D) Projected energy utilization factor $\eta$ for conventional (Conv) and $\pi^2$ architectures as CMOS technology advances. E) Processing-in-interconnect ($\pi^2$) paradigm, where interconnects serve as memory, compute, and communication units. F) $\pi^2$ compute primitives built from fundamental interconnect operations: sorting of events (equivalent to ADD), delay of events (equivalent to MULTIPLY and memory), and event time-outs and drops (equivalent to non-linear activation).
  • Figure 2: A) Illustration of traditional Credit-Based Shaper (CBS) operation, where each traffic class (e.g., $G_1$, $G_2$) is associated with an individual queue and a CBS module. Events are dropped if the queue overflows. Transmission occurs only when the associated credit is non-negative and the channel is free. In ii), the credit dynamics for two traffic classes, $G_1$ (red) and $G_2$ (blue), are shown. AE and TE represent the arrival and transmission of events, respectively. Initially, the channel is occupied (black region), so the arriving event of class $G_1$ has to be queued and its credit starts increasing. Once the channel is free, the packet corresponding to $G_1$ is transmitted, and its credit decreases with a fixed slope. Similar dynamics are plotted for $G_2$. B) i) The $\pi^2_{K}$ neuron integrates the first $K$ temporal input events and generates a single output event based on a time-to-first-spike encoding scheme. ii) As input events arrive at times $T_1$ to $T_4$, the membrane potential increases with a slope proportional to the number of inputs received. After receiving $K$ inputs, the slope remains constant. When the membrane potential crosses a predefined threshold $M$, the neuron emits an output event at time $T$. The dynamics of a $\pi^2_{3}$ neuron are plotted here. iii) $\pi^2$ neuron: operation of the proposed modified CBS protocol, which aligns credit-based traffic shaping with the dynamics of the $\pi^2_{K}$ neuron model. The modified credit dynamics of a $\pi^2_{2}$ neuron for the traffic classes $G_1$, $G_2$ are highlighted here. The bottom panels contrast traditional credit-based dynamics with the proposed $\pi^2_{K}$ dynamics. C) Traditional Asynchronous Traffic Shaping (ATS) mechanism: incoming events are assigned to $2^P$ shaped queues based on their Priority Code Point (PCP) values. The ATS shaper delays each packet until its Transmission Eligibility Time (TET), usually calculated as a function of the traffic characteristics, is reached. Once eligible, events are forwarded in time-sorted order to a shared queue (which is controlled by the modified CBS protocol to emulate the $\pi^2_{K}$ neuron dynamics) for transmission. The ATS protocol is re-envisioned to model the $\pi^2$ synapse: for the $i^{\text{th}}$ event, $TET_{i}$ is computed as $T_{i} + W_{i}$, where $W_{i}$ is the synaptic delay (PCP code) associated with it. D) Illustration of a fully connected ($n \times n$) $\pi^2$-NN architecture implemented using $\pi^2_{K}$ neurons, where synaptic weights are encoded as interconnect delays ($W_1$, $W_2$, $W_3$, …).
  • Figure 3: A) Realization of a $\pi^2$-NN architecture using a hierarchical routing table and modified traffic shaping protocols. Incoming events (e.g., an event generated by the $i^{\text{th}}$ transmitter neuron at time $T_i$) are tagged with their source address and routed to the appropriate destination using a hierarchical routing lookup. The routing table resolves the destination address and assigns a $p$-bit Priority Code Point (PCP) value that encodes the synaptic delay. For a connection to the $j^{\text{th}}$ destination neuron, this delay is denoted as $W_{ij}$. The synaptic delay $W_{ij}$ is composed of two components: a memory access delay ($d_v$, the delay for the $v^{\text{th}}$ value), determined by the depth of the hierarchical routing structure, and a queuing delay ($W_{ij}'$), enforced by the ATS protocol. Once the eligibility time is reached, the ATS protocol forwards the event to a shared queue governed by the modified CBS protocol, as described in Fig. 2C.
  • Figure 4: A) Scatter plot showing the relationship between the weights of the $\pi^2$ and MLP networks (784x50x10 architecture) trained on the MNIST dataset. We chose this network due to the memory constraints of running the simulator on our system. When the trained MLP weights are directly mapped to the $\pi^2$ network weights, there is a one-to-one relationship. After retraining the $\pi^2$ network, a high correlation still exists. B) t-SNE plot of the hidden layer representations of the ten classes extracted from the trained MAC-based MLP network. The different classes form distinct clusters. C) When the $\pi^2$ network is initialized with the trained MLP weights, the separation between the clusters is preserved because the network approximates the MAC operation. However, an approximation error exists. This error propagates through the subsequent layers, and the classification accuracy drops from 97.2% to 88.7%. Each color represents a distinct class in these plots. D,E) The $\pi^2$ network has to be trained further to compensate for the drop in classification accuracy. The effect of the scaling parameter $\alpha$ on the test accuracy and training loss is shown here. A higher $\alpha$ improves the separation between the classes and yields more robust classification. F) The 3-layer $\pi^2$ network with 50 nodes ($\pi^2(50)$) is retrained to achieve a classification accuracy of 97.34%. The weights of the trained $\pi^2$ network are quantized to 3 bits ($Q\pi^2$) and simulated using Ethernet switches in the OMNeT++ software. The simulator supports 3-bit PCP codes and a network with 50 nodes, so we simulate accordingly. The simulation results (sim) exactly match the quantized network's outputs, giving a test accuracy of 96.67%. By increasing the number of hidden nodes to 800, we can achieve the baseline accuracy reported in stanojevic2024 ($\pi^2(800)$).
  • Figure 5: A,B) The spatio-temporal patterns emerging from the layers (input (Layer 1), hidden (Layer 2), and output (Layer 3)) of the trained $\pi^2$ network are illustrated as raster plots for different configurations of $K$. The blue and green dots represent the differential event generation times of the nodes in a layer. The population response to the input digit “4” is shown for (A) $K=[1,16]$ and (B) $K=[10,16]$ for the hidden and output layers, respectively. C,D) The corresponding evolution of neuronal activity for the same input is depicted for (C) $K=[1,16]$ and (D) $K=[10,16]$ (hidden and output layers). Reducing the value of $K$ to 1 leads to computation that is sparser (in terms of neuron and synaptic activity) and faster. Differential event activity (based on Eqs. \ref{k88} and \ref{k77}) is selectively traced for classes “4” and “1” in these plots.
  • ...and 13 more figures
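
The synapse and neuron behaviour described in the Figure 2 caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the unit slope contributed by each input event, and the example numbers are all hypothetical, and queue overflow, credit slopes, and channel occupancy are not modelled. The first function applies the ATS-style rule $TET_i = T_i + W_i$ and returns events in time-sorted order; the second integrates the first $K$ events with a slope equal to the number of events received so far and returns the time at which the potential reaches the threshold $M$.

```python
from typing import Optional, Sequence


def transmission_eligibility_times(
    arrival_times: Sequence[float], synaptic_delays: Sequence[float]
) -> list[float]:
    """Synapse sketch: delay each event by its weight, TET_i = T_i + W_i,
    and release the events in time-sorted order."""
    if len(arrival_times) != len(synaptic_delays):
        raise ValueError("each event needs exactly one synaptic delay")
    return sorted(t + w for t, w in zip(arrival_times, synaptic_delays))


def pi2_k_fire_time(
    event_times: Sequence[float],
    k: int,
    threshold: float,
    slope_per_event: float = 1.0,  # assumption: every input adds the same slope
) -> Optional[float]:
    """Neuron sketch: only the first k events are integrated. The potential
    rises with slope (number of events received so far) * slope_per_event,
    and the neuron emits one output event when it reaches the threshold."""
    if k < 1 or threshold <= 0 or slope_per_event <= 0:
        raise ValueError("k, threshold and slope_per_event must be positive")
    times = sorted(event_times)[:k]  # later events are ignored
    if not times:
        return None  # no input, so the neuron never fires
    potential = 0.0
    for received, t in enumerate(times, start=1):
        slope = received * slope_per_event
        next_t = times[received] if received < len(times) else float("inf")
        # Time at which the threshold would be reached at the current slope.
        t_cross = t + (threshold - potential) / slope
        if t_cross <= next_t:
            return t_cross
        # Otherwise integrate up to the next event and increase the slope.
        potential += slope * (next_t - t)
    return None  # not reached: the last segment always crosses the threshold


if __name__ == "__main__":
    tet = transmission_eligibility_times([0.0, 1.0, 2.0, 5.0], [1.0, 1.0, 2.0, 0.0])
    print(tet)
    for k in (1, 2, 3):
        print(k, pi2_k_fire_time(tet, k=k, threshold=8.0))
```

In this toy setting, a smaller $K$ makes the neuron ignore more of its late inputs, while a larger $K$ lets later events steepen the slope and pull the output event earlier.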