NinjaLLM: Fast, Scalable and Cost-effective RAG using Amazon SageMaker and AWS Trainium and Inferentia2

Tengfei Xue, Xuefeng Li, Roman Smirnov, Tahir Azim, Arash Sadrieh, Babak Pahlavan

TL;DR

The paper addresses the high cost and rigidity of traditional RAG systems by fine-tuning a Llama3-Instruct-70B model and serving it on AWS Trainium and Inferentia2 chips via SageMaker, with a focus on elasticity, cost-effectiveness, and safe, citation-enabled responses. Its enhancements include tool integration, multi-hop reasoning, and hallucination mitigation, and the model is fine-tuned with a LIMA-style regime on 32 TRN1 instances at a cost under $30k. Deployment uses the vLLM engine with memory-efficient techniques (PagedAttention, block-level memory management) plus multi-bucketing and continuous batching to reduce time to first token (TTFT) and increase throughput. On Natural Questions Open and HotPotQA, NinjaLLM reaches 62.22% and 58.84% accuracy, respectively, which is better than DBRX and Mixtral Instruct but still below GPT-4 Turbo. The results indicate strong, scalable performance with practical hosting and safeguards for real-time RAG applications.
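
To make two of the serving techniques named above more concrete, here is a minimal Python sketch of block-level cache allocation (the idea behind PagedAttention) and of multi-bucketing. It is an illustration under stated assumptions, not the paper's or vLLM's implementation: `BlockAllocator`, `pick_bucket`, the block size of 16 tokens, and the bucket lengths are all hypothetical names and values chosen for the example.

```python
import bisect

BLOCK_SIZE = 16  # tokens per cache block (illustrative value, not from the paper)


class BlockAllocator:
    """Toy block-level cache allocator in the spirit of PagedAttention (hypothetical)."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))  # ids of unused physical blocks
        self.tables = {}                     # sequence id -> list of physical block ids
        self.lengths = {}                    # sequence id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve space for one more token, taking a new block only when needed."""
        n = self.lengths.get(seq_id, 0)
        if n % BLOCK_SIZE == 0:  # sequence is new or its last block is full
            if not self.free:
                raise MemoryError("no free cache blocks")
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def release(self, seq_id):
        """Return a finished sequence's blocks to the pool for waiting requests."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


def pick_bucket(prompt_len, buckets=(128, 512, 2048, 8192)):
    """Multi-bucketing: choose the smallest precompiled length that fits the prompt."""
    i = bisect.bisect_left(buckets, prompt_len)
    if i == len(buckets):
        raise ValueError("prompt longer than the largest bucket")
    return buckets[i]


alloc = BlockAllocator(num_blocks=4)
for _ in range(17):          # 17 tokens need two 16-token blocks
    alloc.append_token("request-a")
print(alloc.tables["request-a"], pick_bucket(129))
alloc.release("request-a")   # blocks become available to other requests
```

Because memory is claimed one block at a time and returned as soon as a sequence finishes, a scheduler can admit new requests into a running batch, which is what continuous batching relies on. Padding prompts only up to the nearest bucket, rather than to the maximum length, is one way to cut the work done before the first token is produced.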

Abstract

Retrieval-augmented generation (RAG) techniques are widely used today to retrieve and present information in a conversational format. This paper presents a set of enhancements to traditional RAG techniques, focusing on large language models (LLMs) fine-tuned and hosted on AWS Trainium and Inferentia2 AI chips via SageMaker. These chips are characterized by their elasticity, affordability, and efficient performance for AI compute tasks. Besides enabling deployment on these chips, this work aims to improve tool usage, add citation capabilities, and mitigate the risks of hallucinations and unsafe responses due to context bias. We benchmark our RAG system's performance on the Natural Questions and HotPotQA datasets, achieving accuracies of 62% and 59%, respectively, exceeding other models such as DBRX and Mixtral Instruct.

Paper Structure (7 sections, 1 table)