
Understanding RAG: Retrieval-Augmented Generation in Practice

March 21, 2025
1 min read

Why RAG exists

LLMs are brilliant pattern machines — but they don’t “know” anything outside their training data.
Retrieval-Augmented Generation (RAG) fixes that by injecting live, external knowledge into the model’s prompt at runtime.

Instead of fine-tuning, you store domain data (docs, FAQs, code, etc.) in a vector database. When a user asks a question, you:

  1. Convert their query into an embedding vector
  2. Find the most relevant chunks in the vector store
  3. Append those chunks to the LLM prompt
  4. Let the model answer using the retrieved context

This approach is cheap, explainable, and instantly updatable — no retraining required.
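
The query-time flow shown below assumes your documents are already sitting in the vector store. Getting them there is just chunking, embedding, and upserting. Here's a rough sketch of that ingestion step; the file name, chunk size, and ID scheme are illustrative choices, and it reuses the same embedTexts helper the query example imports (one possible implementation is sketched at the end of the post).

ingest-docs.ts
import { Pinecone } from "@pinecone-database/pinecone";
import { embedTexts } from "./embed-utils";

const pinecone = new Pinecone({ apiKey: process.env.PINECONE_API_KEY! });
const index = pinecone.index("docs-index");

// Naive fixed-size chunking; real pipelines usually split on headings or sentences instead.
function chunkText(text: string, size = 1000): string[] {
  const chunks: string[] = [];
  for (let i = 0; i < text.length; i += size) {
    chunks.push(text.slice(i, i + size));
  }
  return chunks;
}

export async function ingestDocument(docId: string, text: string) {
  const chunks = chunkText(text);

  // Embed all chunks in one batch
  const embeddings = await embedTexts(chunks);

  // Upsert vectors with the raw chunk text stored as metadata,
  // so it can be pulled back into the prompt at query time.
  await index.upsert(
    chunks.map((chunk, i) => ({
      id: `${docId}-${i}`,
      values: embeddings[i],
      metadata: { text: chunk },
    }))
  );
}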


A minimal example (Node + OpenAI + Pinecone)

rag-example.ts
import OpenAI from "openai";
import { Pinecone } from "@pinecone-database/pinecone";
import { embedTexts } from "./embed-utils";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
const pinecone = new Pinecone({ apiKey: process.env.PINECONE_API_KEY! });
const index = pinecone.index("docs-index");

export async function askLLM(question: string) {
  // 1. Embed the query
  const queryEmbedding = await embedTexts([question]);

  // 2. Retrieve the top 3 most relevant chunks
  const results = await index.query({
    vector: queryEmbedding[0],
    topK: 3,
    includeMetadata: true,
  });

  // 3. Build a contextual prompt from the retrieved chunks
  const context = results.matches
    .map((m) => m.metadata?.text)
    .filter(Boolean)
    .join("\n---\n");

  const prompt = `
You are a technical assistant. Use the context below to answer the question accurately.

Context:
${context}

Question:
${question}
`;

  // 4. Generate a response grounded in the retrieved context
  const completion = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [{ role: "user", content: prompt }],
  });

  return completion.choices[0].message.content;
}
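
The example imports embedTexts from a local embed-utils module that isn't shown. Here is one possible sketch of that helper; the model name is an assumption, and the only hard requirement is that the embedding dimension matches the docs-index Pinecone index.

embed-utils.ts
import OpenAI from "openai";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// Embed a batch of strings and return one vector per input,
// in the same order as the inputs.
export async function embedTexts(texts: string[]): Promise<number[][]> {
  const response = await openai.embeddings.create({
    model: "text-embedding-3-small", // assumption: use whatever model your index was built with
    input: texts,
  });
  return response.data.map((d) => d.embedding);
}

With that in place, a call like askLLM("How do I rotate an API key?") embeds the question, pulls back the three closest chunks, and lets gpt-4o-mini answer with those chunks in its prompt.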