Why RAG exists
LLMs are brilliant pattern machines — but they don’t “know” anything outside their training data.
Retrieval-Augmented Generation (RAG) fixes that by injecting live, external knowledge into the model’s prompt at runtime.
Instead of fine-tuning, you store domain data (docs, FAQs, code, etc.) in a vector database (ingestion is sketched below). When a user asks a question, you:
- Convert their query into an embedding vector
- Find the most relevant chunks in the vector store
- Append those chunks to the LLM prompt
- Let the model answer using the retrieved context
This approach is cheap, explainable, and instantly updatable — no retraining required.
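Getting the domain data into the vector store is the other half of the pipeline. Here is a minimal ingestion sketch, assuming pre-chunked text, OpenAI's text-embedding-3-small model, and the same docs-index Pinecone index used in the query example below; the model choice, index name, and ID scheme are illustrative assumptions, and chunking strategy is out of scope.

```ts
import OpenAI from "openai";
import { Pinecone } from "@pinecone-database/pinecone";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
const pinecone = new Pinecone({ apiKey: process.env.PINECONE_API_KEY! });
const index = pinecone.index("docs-index"); // assumed index name, matching the example below

// Embed pre-chunked document text and upsert it with the raw text as metadata,
// so the query path can pull the text straight back into the prompt.
export async function ingestChunks(chunks: string[]) {
  const { data } = await openai.embeddings.create({
    model: "text-embedding-3-small", // assumed embedding model
    input: chunks,
  });

  await index.upsert(
    data.map((d, i) => ({
      id: `chunk-${i}`, // hypothetical ID scheme; use stable document-derived IDs in practice
      values: d.embedding,
      metadata: { text: chunks[i] },
    }))
  );
}
```

Storing the raw chunk text as metadata is what lets the query path reassemble the context without a second lookup against the original documents.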
A minimal example (Node + OpenAI + Pinecone)
```ts
import OpenAI from "openai";
import { Pinecone } from "@pinecone-database/pinecone";
import { embedTexts } from "./embed-utils";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
const pinecone = new Pinecone({ apiKey: process.env.PINECONE_API_KEY! });
const index = pinecone.index("docs-index");

export async function askLLM(question: string) {
  // 1. Embed the query
  const queryEmbedding = await embedTexts([question]);

  // 2. Retrieve the top 3 most relevant chunks
  const results = await index.query({
    vector: queryEmbedding[0],
    topK: 3,
    includeMetadata: true,
  });

  // 3. Build the contextual prompt
  const context = results.matches
    .map((m) => m.metadata?.text ?? "")
    .join("\n---\n");

  const prompt = `You are a technical assistant. Use the context below to answer the question accurately.

Context:
${context}

Question:
${question}`;

  // 4. Generate the response
  const completion = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [{ role: "user", content: prompt }],
  });

  return completion.choices[0].message.content;
}
```
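The embedTexts helper imported from ./embed-utils isn't shown above. A minimal sketch, assuming it is a thin wrapper around OpenAI's embeddings endpoint (the text-embedding-3-small model is an assumption, not part of the original example):

```ts
// embed-utils.ts
import OpenAI from "openai";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// Turn an array of texts into an array of embedding vectors, preserving input order.
export async function embedTexts(texts: string[]): Promise<number[][]> {
  const { data } = await openai.embeddings.create({
    model: "text-embedding-3-small", // assumed model; any embedding model works
    input: texts,
  });
  return data.map((d) => d.embedding);
}
```

Whatever embedding model you choose, it has to match the one used at ingestion time; otherwise the query vectors and the stored vectors won't live in the same embedding space and retrieval quality collapses.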