V-Seek: Accelerating LLM Reasoning on Open-hardware Server-class RISC-V Platforms

Javier J. Poveda Rodrigo, Mohamed Amine Ahmdi, Alessio Burrello, Daniele Jahier Pagliari, Luca Benini

TL;DR

The paper tackles the challenge of running LLM inference and reasoning on open-hardware, vendor-neutral RISC-V platforms. It introduces a three-fold optimization approach for the Sophon SG2042, a 64-core, vector-enabled CPU: quantized kernel development, careful compiler/toolchain selection, and NUMA-aware model mapping. The approach is validated on open-source LLMs: DeepSeek R1 Distill Llama 8B, DeepSeek R1 Distill QWEN 14B, and Llama 7B. The results show substantial throughput gains (up to about 3x over the baseline) and competitive energy efficiency relative to x86 CPUs, highlighting the practicality of CPU-based LLM inference on open RISC-V hardware. Overall, the work advances software and microarchitectural support for scalable LLM inference on open, many-core RISC-V systems, with meaningful implications for on-premise and edge deployments.

Abstract

The recent exponential growth of Large Language Models (LLMs) has relied on GPU-based systems. However, CPUs are emerging as a flexible and lower-cost alternative, especially when targeting inference and reasoning workloads. RISC-V is rapidly gaining traction in this area, given its open and vendor-neutral ISA. However, the RISC-V hardware for LLM workloads and the corresponding software ecosystem are not fully mature and streamlined, given the requirement of domain-specific tuning. This paper aims at filling this gap, focusing on optimizing LLM inference on the Sophon SG2042, the first commercially available many-core RISC-V CPU with vector processing capabilities. On two recent state-of-the-art LLMs optimized for reasoning, DeepSeek R1 Distill Llama 8B and DeepSeek R1 Distill QWEN 14B, we achieve 4.32/2.29 token/s for token generation and 6.54/3.68 token/s for prompt processing, with a speedup of up to 2.9x/3.0x compared to our baseline.
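The abstract does not describe the quantized kernels in detail. As a rough illustration of the general idea only, the sketch below shows a block-wise, weight-only int8 matrix-vector product in plain Python. The block size of 32, the symmetric int8 range, and the names `quantize_row` and `quantized_matvec` are assumptions made for this example; this is not the paper's kernel, which targets the SG2042's vector unit.

```python
from typing import List, Tuple

BLOCK = 32  # assumed block size, chosen only for illustration

QuantRow = List[Tuple[float, List[int]]]  # per block: (scale, int8 values)


def quantize_row(row: List[float], block: int = BLOCK) -> QuantRow:
    """Split a weight row into blocks; keep one float scale and int8 values per block."""
    blocks: QuantRow = []
    for start in range(0, len(row), block):
        chunk = row[start:start + block]
        amax = max(abs(v) for v in chunk)
        # Symmetric quantization: map [-amax, amax] onto [-127, 127].
        scale = amax / 127.0 if amax > 0 else 1.0
        q = [max(-127, min(127, round(v / scale))) for v in chunk]
        blocks.append((scale, q))
    return blocks


def quantized_matvec(qmatrix: List[QuantRow], x: List[float],
                     block: int = BLOCK) -> List[float]:
    """Multiply a block-quantized matrix by a float vector."""
    out: List[float] = []
    for qrow in qmatrix:
        acc = 0.0
        for b, (scale, q) in enumerate(qrow):
            xs = x[b * block:(b + 1) * block]
            # Inner product over the block, rescaled once per block.
            acc += scale * sum(qi * xi for qi, xi in zip(q, xs))
        out.append(acc)
    return out


if __name__ == "__main__":
    W = [[((i * 7 + j * 3) % 11 - 5) / 5.0 for j in range(64)] for i in range(4)]
    x = [((j * 5) % 7 - 3) / 3.0 for j in range(64)]
    approx = quantized_matvec([quantize_row(r) for r in W], x)
    exact = [sum(w * v for w, v in zip(r, x)) for r in W]
    for a, e in zip(approx, exact):
        print(f"quantized={a:.4f}  float={e:.4f}")
```

Storing one scale per block keeps the rounding error of each weight below half a quantization step for that block, while the weights themselves take one byte each instead of four. Smaller weights mean less memory traffic, which is usually the limiting factor for token generation on CPUs.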

Paper Structure

This paper contains 3 sections, 4 figures.

Figures (4)

  • Figure 1: From left: optimization flow and contributions; SG2042 block diagram; pseudocode of the proposed kernel.
  • Figure 2: Matrix-vector multiplication size scalability test.
  • Figure 3: Compiler comparison when scaling the number of threads for DeepSeek's 8B model. Token generation shown with bars, prefill with lines.
  • Figure 4: NUMA policy exploration on DeepSeek's 8B model. Token generation shown with bars, prefill with lines.