Safe-NEureka: a Hybrid Modular Redundant DNN Accelerator for On-board Satellite AI Processing
Riccardo Tedeschi, Luigi Ghionda, Alessandro Nadalini, Yvan Tortorella, Arpan Suravi Prasad, Luca Benini, Davide Rossi, Francesco Conti
TL;DR
Safe-NEureka addresses the need for reliable on-board satellite AI with a runtime-reconfigurable DNN accelerator that switches between a high-throughput performance mode and a fault-tolerant redundancy mode. It combines Hybrid Modular Redundancy with hardware-assisted rollback, ECC-protected memory, and a TMR-protected controller to achieve end-to-end resilience in a heterogeneous RISC-V cluster. The GlobalFoundries $12\,\mathrm{nm}$ implementation shows a $\sim15\%$ area overhead and a $\sim96\%$ reduction in faulty executions in redundancy mode, while performance mode stays close to the baseline, with a $5\%$ throughput and $11\%$ efficiency reduction on 3×3 dense convolutions. This mixed-criticality approach lets space missions confine the high redundancy overheads to critical tasks while maintaining real-time AI processing. The authors also release the full RTL as open source for broader adoption.
Abstract
Low Earth Orbit (LEO) constellations are revolutionizing the space sector, with on-board Artificial Intelligence (AI) becoming pivotal for next-generation satellites. AI acceleration is essential for safety-critical functions such as autonomous Guidance, Navigation, and Control (GNC), where errors cannot be tolerated, and performance-critical processing of high-bandwidth sensor data, where occasional errors are tolerable. Consequently, AI accelerators for satellites must combine robust protection against radiation-induced faults with high throughput. This paper presents Safe-NEureka, a Hybrid Modular Redundant Deep Neural Network (DNN) accelerator for heterogeneous RISC-V systems. It operates in two modes: a redundancy mode utilizing Dual Modular Redundancy (DMR) with hardware-based recovery, and a performance mode repurposing redundant datapaths to maximize parallel throughput. Furthermore, its memory interface is protected by Error Correction Codes (ECCs), and the controller by Triple Modular Redundancy (TMR). Implementation in GlobalFoundries 12 nm technology shows a 96% reduction in faulty executions in redundancy mode, with a manageable 15% area overhead. In performance mode, the architecture achieves near-baseline speeds on 3×3 dense convolutions with a 5% throughput and 11% efficiency reduction, compared to 48% and 53% in redundancy mode. This flexibility ensures high overheads are limited to critical tasks, establishing Safe-NEureka as a versatile solution for space applications.
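
To make the two redundancy schemes named in the abstract more concrete, the following is a minimal software sketch of the general ideas only, not the authors' RTL. DMR can detect a fault by comparing two replicas but cannot tell which one is wrong, so it needs a recovery step (modeled here as a simple re-execution), whereas TMR can mask a single faulty replica with a majority vote. The names `tmr_vote`, `dmr_execute`, and `max_retries` are hypothetical and chosen for illustration.

```python
from typing import Callable, Tuple


def tmr_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority vote over three replicas (TMR).

    Each output bit takes the value held by at least two of the inputs,
    so a fault confined to one replica is masked.
    """
    return (a & b) | (a & c) | (b & c)


def dmr_execute(compute: Callable[[int], int], x: int,
                max_retries: int = 3) -> Tuple[int, int]:
    """Run `compute` on two replicas and compare the results (DMR).

    A mismatch only signals that *some* replica is faulty, so the work is
    rolled back and re-executed. This retry loop is an assumed stand-in
    for hardware-based recovery, not the mechanism used in the paper.
    Returns (result, number_of_rollbacks).
    """
    for rollbacks in range(max_retries + 1):
        r0 = compute(x)  # replica 0
        r1 = compute(x)  # replica 1
        if r0 == r1:
            return r0, rollbacks
    raise RuntimeError("mismatch persisted after all retries")


def make_flaky_square() -> Callable[[int], int]:
    """Return a squaring function whose first call suffers one bit flip."""
    state = {"calls": 0}

    def flaky_square(x: int) -> int:
        state["calls"] += 1
        result = x * x
        if state["calls"] == 1:
            result ^= 0b100  # emulate a single transient bit flip
        return result

    return flaky_square
```

In this sketch, a transient fault in one DMR replica costs one rollback, while the TMR voter absorbs a single corrupted replica without re-execution at the price of a third copy. This is consistent with applying TMR to a small block such as the controller and DMR plus recovery to the larger datapath, as the abstract describes.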
