{"id":"2065068729591874000","url":"https://x.com/h100envy/status/2065068729591874000","text":"","author":{"name":"h100envy","username":"h100envy","avatarUrl":"https://pbs.twimg.com/profile_images/2047054153990516736/UqUBF2QO_200x200.jpg"},"createdAt":"Thu Jun 11 13:48:46 +0000 2026","engagement":{"replies":7,"retweets":12,"likes":81,"views":56630},"article":{"title":"How to Build RAG That Actually Works","previewText":"Step by step, from corpus to production. Why most RAG systems are bad, and what specifically to do at each step so yours is not.\nRAG is simple as a diagram and hard in practice. The diagram: put","coverImageUrl":"https://pbs.twimg.com/media/HKiVFMAXAAAswcW.jpg","content":"Step by step, from corpus to production. Why most RAG systems are bad, and what specifically to do at each step so yours is not.\n\nRAG is simple as a diagram and hard in practice. The diagram: put documents in a database, find the ones relevant to a query, give them to the model as context. It sounds like \"grab LlamaIndex and you're done.\" But half of RAG systems in production are bad, and the cause is almost always in the details the diagram does not show. This article is about the details.\n\nWe will go the whole way step by step. At each step, two things: what to do and why it decides quality. RAG quality is not one big thing, it is the sum of the right small things.\n\n## Step 0: Understand That RAG Is About Retrieval, Not Generation\n\nThe main misconception at the start. People think the model decides RAG quality. So they grab a stronger model and are surprised the answers are still bad.\n\nRetrieval decides RAG quality, not generation. If you pulled the wrong chunks, no model will save you, it will honestly answer based on the garbage you gave it. If you pulled the right chunks, even a mediocre model answers well.\n\nThis shifts the focus. Most of the work in RAG is not the prompt and not the model choice, it is finding the right chunks. 
Everything below is about that.\n\n## Step 1: Chunking, Where Everything Breaks First\n\nChunking is slicing documents into pieces for indexing. It is the first place RAG breaks, and the most underrated.\n\nThe naive approach: cut every N characters. Take the text, chop every 500 characters. This is bad because it cuts a thought in half. A chunk ends on \"the company posted growth of\" and the number \"40%\" lands in the next chunk. Retrieval finds one of them, and the answer is incomplete.\n\nWhat to do better:\n\nCut by structure, not by length. Paragraphs, sections, and headings are natural boundaries of meaning. Preserve them.\n\nUse overlap. Adjacent chunks should partially overlap, usually by 10-20%. Then a thought at a boundary lands whole in at least one chunk.\n\nKeep metadata. Attach to each chunk where it came from: document, section, date. You will need this both for filtering and for showing the source in the answer.\n\nChunk size is a tradeoff. Small chunks are more precise in meaning but lose context. Large ones carry context but dilute relevance at search time. 600-1000 characters is a working starting point, but the optimum depends on your data, so tune it by eval (see Step 6).\n\n## Step 2: Embeddings, Where Quality Matters More Than the Generation Model\n\nAn embedding is a vector that represents the meaning of a chunk. Search in RAG works like this: the query becomes a vector, and chunks whose vectors are closest are retrieved. The quality of the embedding model decides whether you find the right thing.\n\nWhat to understand:\n\nThe embedding model matters more than the generative one. It sounds counterintuitive, but if the embeddings are bad, you pull the wrong chunks, and after that it does not matter which model writes the answer. Invest in the embedding choice.\n\nTake a modern embedding model, not the first one you find. OpenAI's text-embedding-3 family (text-embedding-3-small or text-embedding-3-large) is a working default. 
Among open ones, models at the top of the MTEB benchmark are good, but verify on your data: the benchmark leader is not always best on your domain.\n\nDomain matters. Embeddings trained on general text work worse on narrow domains (medicine, law, specific technical jargon). If the domain is narrow, look toward domain models or fine-tuning.\n\nA vector database stores these vectors and finds the nearest ones quickly. ChromaDB to start, Qdrant or pgvector for production. The database choice is about scale and operations, not quality: it barely affects retrieval quality.\n\n## Step 3: Hybrid Search, Because Vectors Are Not a Silver Bullet\n\nThis is where most \"decent\" RAG breaks. Teams do only vector search and think it is enough. It is not.\n\nVector search (by meaning) is good at one thing and bad at another. It catches semantic closeness well: \"how to return an item\" finds a chunk about \"the purchase return procedure.\" But it catches exact matches poorly: error codes, names, SKUs, specific terms. For the query \"error E-404\", vectors may miss the matching chunk, because \"E-404\" is almost noise to an embedding.\n\nThe solution is hybrid search: vectors plus keywords (BM25).\n\nBM25 is classic full-text search by words. It catches exactly what vectors miss: exact terms, codes, names. Together they cover each other's weaknesses.\n\nReciprocal rank fusion is a simple way to merge two result lists into one by their ranks, without tuning weights: each chunk's score is the sum of 1/(k + rank) over the lists it appears in, with k commonly set to 60. A working default for merging.\n\nHybrid is not optional: it is the difference between a RAG that works on real queries and one that works only on \"convenient\" ones. Most bad RAG is purely vector RAG.\n\n## Step 4: Reranker, the Cheap Way to Sharply Raise Quality\n\nAfter search you have, say, 10-20 candidates. The problem: they are sorted by rough closeness, and the truly best answer may be in 7th place, not 1st. 
If you give the model the top 3, you lose it.\n\nA reranker is a second, more precise model that re-sorts the candidates. It is slower and more expensive per item, but it works not on the whole base, only on 10-20 candidates, so the total cost is small.\n\nHow it works: the first search (vectors plus BM25) pulls 20 candidates fast and rough. The reranker looks at the pair (query, chunk) more carefully and gives a precise score. You take the top 3-5 after reranking.\n\nCohere Rerank or open cross-encoder models (bge-reranker) are working options. A reranker is often the cheapest way to noticeably raise quality: one component, small cost, tangible gain in top precision.\n\n## Step 5: Building the Context and the Prompt\n\nNow you have the best chunks. What is left is giving them to the model correctly.\n\nShow the source. Add to each chunk in the context where it came from. This both lowers hallucinations and lets you show the user a link to the source.\n\nConstrain the model to the context. In the prompt, explicitly: answer only from the provided chunks, if the answer is not there, say so, do not make it up. This is what separates a working RAG from one that confidently lies.\n\nDo not overload the context. The temptation to stuff the top 20 chunks \"just in case\" hurts. Extra chunks dilute the relevant ones and confuse the model. Better 3-5 precise chunks after reranking than 20 mediocre ones.\n\n## Step 6: Evaluation, Without Which You Do Not Know If It Works\n\nThis is the step people skip, and that is why they do not know their RAG is bad. \"Asked a few things, seems to answer\" is not evaluation.\n\nBuild an eval set. 30-50 real questions about your data, with a known correct answer or a known correct source chunk. This is your measuring instrument.\n\nMeasure retrieval separately from generation. Two different metrics:\n\nRetrieval: did the right chunk land in the top K (recall@k). This checks your search, chunking, embeddings, hybrid, reranker. 
If recall is low, the problem is here, not in the model.\n\nGeneration: is the final answer correct given that the chunks are correct. This checks the prompt and the model.\n\nThe separation is critical. If answers are bad, you must know where it went wrong: the search pulled the wrong chunks (retrieval), or it pulled the right ones and the model answered badly (generation). These are different fixes.\n\nRun the eval on every change. Change the chunk size, swap the embedding model, or add a reranker, then run the eval and see whether it got better or worse. Without this you turn knobs blind. With it you are an engineer.\n\n## Good RAG Checklist\n\n1. Structural chunking with overlap, not by raw length.\n2. A modern embedding model, verified on your data.\n3. Hybrid search (vectors plus BM25), not vectors only.\n4. A reranker over the candidates.\n5. A prompt with a hard \"only from context\" and source display.\n6. 3-5 precise chunks in context, not 20 mediocre ones.\n7. An eval set with separate retrieval and generation metrics.\n\nMost bad RAG fails points 3, 4, and 7: vectors only, no reranker, no evaluation. Close those three and you are already above most.\n\n## Final\n\nGood RAG is not about the model or the framework. It is about retrieval, assembled from the right details: structural chunking, good embeddings, hybrid search, a reranker, and an eval that tells you the truth. Bad RAG is bad not because of a weak model, but because at each of these steps the naive option was chosen.\n\nThe path to good RAG is to go through all the steps deliberately and measure the result at each. Not \"grabbed a framework and done,\" but \"assembled, measured, improved.\" That is the whole difference."}}