Why RAG exists
LLMs are brilliant pattern machines — but they don’t “know” anything outside their training data.
Retrieval-Augmented Generation (RAG) fixes that by injecting live, external knowledge into the model’s prompt at runtime.
Instead of fine-tuning, you store domain data (docs, FAQs, code, etc.) in a vector database (ingestion is sketched below). When a user asks a question, you:
- Convert their query into an embedding vector
- Find the most relevant chunks in the vector store
- Append those chunks to the LLM prompt
- Let the model answer using the retrieved context
This approach is cheap, explainable, and instantly updatable — no retraining required.
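Getting the domain data into the vector store is the other half of the pipeline. Here is a minimal ingestion sketch, assuming pre-chunked text, OpenAI's text-embedding-3-small model, and the same docs-index Pinecone index used in the query example below; the model choice, index name, and ID scheme are illustrative assumptions, and chunking strategy is out of scope.

```ts
import OpenAI from "openai";
import { Pinecone } from "@pinecone-database/pinecone";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
const pinecone = new Pinecone({ apiKey: process.env.PINECONE_API_KEY! });
const index = pinecone.index("docs-index"); // assumed index name, matching the example below

// Embed pre-chunked document text and upsert it with the raw text as metadata,
// so the query path can pull the text straight back into the prompt.
export async function ingestChunks(chunks: string[]) {
  const { data } = await openai.embeddings.create({
    model: "text-embedding-3-small", // assumed embedding model
    input: chunks,
  });

  await index.upsert(
    data.map((d, i) => ({
      id: `chunk-${i}`, // hypothetical ID scheme; use stable document-derived IDs in practice
      values: d.embedding,
      metadata: { text: chunks[i] },
    }))
  );
}
```

Storing the raw chunk text as metadata is what lets the query path reassemble the context without a second lookup against the original documents.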
A minimal example (Node + OpenAI + Pinecone)
```ts
import OpenAI from "openai";
import { Pinecone } from "@pinecone-database/pinecone";
import { embedTexts } from "./embed-utils";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
const pinecone = new Pinecone({ apiKey: process.env.PINECONE_API_KEY! });
const index = pinecone.index("docs-index");

export async function askLLM(question: string) {
  // 1. Embed the query
  const queryEmbedding = await embedTexts([question]);

  // 2. Retrieve the top 3 most relevant chunks
  const results = await index.query({
    vector: queryEmbedding[0],
    topK: 3,
    includeMetadata: true,
  });

  // 3. Build the contextual prompt
  const context = results.matches
    .map((m) => m.metadata?.text ?? "")
    .join("\n---\n");

  const prompt = `You are a technical assistant. Use the context below to answer the question accurately.

Context:
${context}

Question:
${question}`;

  // 4. Generate the response
  const completion = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [{ role: "user", content: prompt }],
  });

  return completion.choices[0].message.content;
}
```
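The embedTexts helper imported from ./embed-utils isn't shown above. A minimal sketch, assuming it is a thin wrapper around OpenAI's embeddings endpoint (the text-embedding-3-small model is an assumption, not part of the original example):

```ts
// embed-utils.ts
import OpenAI from "openai";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// Turn an array of texts into an array of embedding vectors, preserving input order.
export async function embedTexts(texts: string[]): Promise<number[][]> {
  const { data } = await openai.embeddings.create({
    model: "text-embedding-3-small", // assumed model; any embedding model works
    input: texts,
  });
  return data.map((d) => d.embedding);
}
```

Whatever embedding model you choose, it has to match the one used at ingestion time; otherwise the query vectors and the stored vectors won't live in the same embedding space and retrieval quality collapses.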