A GPU-Accelerated RAG-Based Telegram Assistant for Supporting Parallel Processing Students
Guy Tel-Zur
TL;DR
This work presents a GPU-accelerated, privacy-preserving Retrieval-Augmented Generation (RAG) system, implemented as a Telegram bot, to support an undergraduate Parallel Processing course. It combines a quantized Mistral-7B-Instruct model with a local knowledge base built from course materials. all-MiniLM-L6-v2 embeddings and a FAISS vector store provide fast semantic retrieval, and inference runs on consumer GPUs via CUDA offloading in llama.cpp. Benchmarks on an RTX 4060 laptop show generation speeds of around 16 tokens/s with sub-second start-up times, yielding end-to-end latencies suitable for real-time tutoring; detailed measurements are reported in the Appendix. The solution is containerized for portability and reproducibility, and the author outlines a roadmap for scaling to more users and courses on a departmental server using open-source tooling and latency models for 4–5 sentence responses. The approach demonstrates a practical, open-source path to private AI tutoring for HPC education that can scale within campus resources.
Abstract
This project addresses a critical pedagogical need: offering students continuous, on-demand academic assistance beyond conventional office hours. I present a domain-specific Retrieval-Augmented Generation (RAG) system powered by a quantized Mistral-7B-Instruct model and deployed as a Telegram bot. The assistant enhances learning by delivering real-time, personalized responses aligned with the "Introduction to Parallel Processing" course materials. GPU acceleration significantly reduces inference latency, making deployment on consumer hardware practical. This approach demonstrates how consumer GPUs can enable affordable, private, and effective AI tutoring for HPC education.
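To make the retrieval step of such a RAG pipeline concrete, the following is a minimal sketch and not the system's actual code. It is written in pure Python so that it runs without dependencies: the hashed bag-of-words `embed` function is a toy stand-in for a real sentence-embedding model, and the brute-force cosine search in `retrieve` is a stand-in for a vector store. The names `embed`, `retrieve`, `build_prompt`, `CHUNKS`, and `DIM`, the prompt wording, and the sample course snippets are all hypothetical.

```python
import math
import re
import zlib

DIM = 256  # toy embedding size; an assumption, not the real model's dimension


def embed(text):
    """Toy stand-in for a sentence embedding: hashed bag-of-words, L2-normalized."""
    vec = [0.0] * DIM
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        # crc32 is used instead of hash() so results are stable across runs
        vec[zlib.crc32(token.encode("utf-8")) % DIM] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


def retrieve(query, chunks, k=2):
    """Return the k chunks with the highest cosine similarity to the query."""
    q = embed(query)
    # Vectors are unit length, so the inner product equals cosine similarity
    scored = [(sum(a * b for a, b in zip(q, embed(c))), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:k]]


def build_prompt(query, context_chunks):
    """Assemble the retrieved context and the question into one prompt string."""
    context = "\n".join(f"- {c}" for c in context_chunks)
    return (
        "Answer using only the course context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )


# Hypothetical course snippets standing in for the local knowledge base
CHUNKS = [
    "MPI_Send and MPI_Recv perform blocking point-to-point communication between ranks.",
    "OpenMP uses pragma directives to parallelize loops on shared-memory machines.",
    "Amdahl's law bounds the speedup by the serial fraction of a program.",
]

if __name__ == "__main__":
    question = "How does blocking point-to-point communication work in MPI?"
    print(build_prompt(question, retrieve(question, CHUNKS, k=1)))
```

In a real deployment the prompt produced by `build_prompt` would be passed to the language model, and the toy `embed` and `retrieve` functions would be replaced by an embedding model and a vector index.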
