From zero to a RAG system: successes and failures

244 points - last Tuesday at 6:53 AM

diarmuidc today at 1:16 PM
>After several weeks, between 2 and 3, the indexing process finished without failures. ... we could finally shut down the virtual machine. The cost was 184 euros on Hetzner, not cheap.

€184 is loose change after spending three man-weeks working on the process!

maxperience today at 5:02 PM
This article is interesting because of its scale, but it doesn't touch on RAG best practices. We wrote up this blog post on how to actually build a smart enterprise AI RAG based on the latest research, if it's of interest to anyone: https://bytevagabond.com/post/how-to-build-enterprise-ai-rag...

It's based on different chunking strategies that scale cheaply, plus advanced retrieval.

shepherdjerred today at 8:12 PM
Is there a 'sqlite equivalent' for RAG? e.g. something I could give Claude w/o a backend and say use command X to add a document, command Y to search, all in a flat file?
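
Something in that spirit already exists in the standard toolchain: SQLite's built-in FTS5 gives you keyword search over a single flat file, and extensions like sqlite-vec add the vector half. A minimal keyword-only sketch, stdlib only (the table and function names here are made up for illustration):

```python
import sqlite3

# Flat-file "RAG lite": SQLite's FTS5 full-text index in a single .db file.
# Keyword search only -- a vector extension (e.g. sqlite-vec) would be needed
# for semantic search, but the single-file, no-backend shape is the same.
db = sqlite3.connect("docs.db")  # the entire index is this one file
db.execute("CREATE VIRTUAL TABLE IF NOT EXISTS docs USING fts5(title, body)")

def add_document(title, body):
    db.execute("INSERT INTO docs(title, body) VALUES (?, ?)", (title, body))
    db.commit()

def search(query, k=5):
    # bm25() ranks matches; in SQLite's convention, lower scores are better
    rows = db.execute(
        "SELECT title FROM docs WHERE docs MATCH ? ORDER BY bm25(docs) LIMIT ?",
        (query, k),
    )
    return [r[0] for r in rows]

add_document("intro", "retrieval augmented generation combines search and LLMs")
add_document("ops", "shutting down the virtual machine saved money")
print(search("retrieval"))
```

Here "command X" is `add_document` and "command Y" is `search`, and the whole index lives in one `docs.db` file you can hand to anything.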
_the_inflator today at 4:22 PM
I implemented many RAGs and feel sorry for anyone proclaiming "RAG is dead". These folks have never implemented one; at most they followed a tutorial and installed a "Hello World!" project.

I don't want to go into detail, but I 100% agree with the author's conclusion: data is key. Data ingestion, to be precise. Simply using docling to transform PDFs to markdown and having a vector database do the rest is ridiculous.

For example, for a high-precision RAG that had to be 100% accurate on the pricing information it provided, I took a week to build an ETL for a 20-page PDF document, splitting the information between SQL and a graph database.

And this was a small step compared to all the tweaking that lay ahead to ensure exceptional results.

Which search algorithm, or how many? Which embeddings, at what quality? Semantics, how and which exactly?

Believe me, RAG is one of the finest technical crafts there is. I have so much respect for the folks at OpenAI and Anthropic for the ingestion processes and tools they use, because they operate on a level I will never touch with my RAG implementations.

RAG is really something you should try for yourself if you love solving tricky fundamental problems that can, in the end, provide a lot of value to you or your customers.

Simply don't believe the hype and ignore all "install and embed" solutions. They are crap, sorry to say so.

JKCalhoun today at 11:52 AM
And some have been saying that RAGs are obsolete—that the context window of a modern LLM is adequate (preferable?). The example I recently read was that the contexts are large enough for the entire "The Lord of the Rings" books.

That may be, but then there's an entire law library, the entirety of Wikipedia (and the example in this article of 451 GB). Surely those are at least an order of magnitude larger than Tolkien's prose and might still benefit from a RAG.

maxperience today at 5:04 PM
If you want to build a prod-ready RAG architecture with decent benchmark scores, I can recommend this blog post based on our experience of which techniques actually work: https://bytevagabond.com/post/how-to-build-enterprise-ai-rag...
z02d today at 11:15 AM
Maybe a bit off-topic: for my PhD, I wanted to leverage LLMs and AI to speed up the literature review process*. Due to time constraints, this never really took off for me. When I last checked (about 6 months ago), several tools supporting literature review were already available (NotebookLM, Anara, Connected Papers, ZotAI, Litmaps, Consensus, Research Rabbit). They all have pros and cons (and different scopes), but my biggest requirement would be to run this on my Zotero bibliographic collection (available offline as PDF/ePub).

ZotAI can use LMStudio (for embeddings and LLM models), but at that time, ZotAI was super slow and buggy.

Instead of going through the valley of sorrows (as threatofrain shared in the blog post - thanks for that), is there a more or less out-of-the-box solution (paid or free) for this need (RAG for local literature review support)?

*If I am honest, it was rather a procrastination exercise, but this is for sure relatable for readers of HN :-D

mettamage today at 11:22 AM
51 visitors in real-time.

I love those site features!

A submission from a few days ago had something similar.

I love it when a website gives a hint to the old web :)

whakim today at 2:41 PM
I'd argue the author missed a trick here by using a fancy embedding model without any re-ranking. One of the benefits of a re-ranker (or even a series of re-rankers!) is that you can embed your documents using a really small and cheap model (this also often means smaller embeddings).
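
For anyone unfamiliar with the pattern, the two-stage shape is: over-retrieve with the cheap scorer, then let the expensive one reorder only the candidates. A toy sketch where token-overlap functions stand in for the small embedding model and the re-ranker (all names and scoring here are illustrative):

```python
# Two-stage retrieval: a cheap scorer over the whole corpus, then a more
# expensive re-ranker over only the top candidates. The scorers are toy
# stand-ins (token overlap) for a small embedding model and a cross-encoder.

def cheap_score(query, doc):
    # stand-in for cosine similarity of small, cheap embeddings
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d)

def rerank_score(query, doc):
    # stand-in for a cross-encoder: overlap normalized by document length
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(d), 1)

def retrieve(query, corpus, k_first=50, k_final=5):
    # Stage 1: cheap scoring over everything, keep a generous candidate set
    candidates = sorted(corpus, key=lambda d: cheap_score(query, d), reverse=True)[:k_first]
    # Stage 2: expensive re-ranking over the candidates only
    return sorted(candidates, key=lambda d: rerank_score(query, d), reverse=True)[:k_final]

corpus = [
    "the indexing process finished without failures",
    "indexing large corpora needs checkpointing and retries",
    "a re-ranker lets you use a smaller embedding model",
]
print(retrieve("re-ranker embedding model", corpus, k_final=1))
```

The point is that only `k_first` documents ever see the expensive scorer, which is what makes the small first-stage embeddings affordable.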
throw831 today at 7:49 PM
Can anyone suggest a RAG pipeline that is production ready?

Also I wonder if it's now better to use Claude Agent SDK instead of RAG. If anyone has tried this, I would be interested in hearing more.

abd7894 today at 1:00 PM
What ended up being the main bottleneck in your pipeline—embedding throughput, cost, or something else? Did you explore parallelizing vectorization (e.g., multiple workers) or did that not help much in practice?
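
On the parallelization half of the question: embedding via an API is I/O-bound, so a thread pool typically helps until rate limits bite. A sketch where a local stand-in `embed` function takes the place of the real API call:

```python
from concurrent.futures import ThreadPoolExecutor

def embed(chunk):
    # stand-in for a network call to an embedding API (the I/O-bound part)
    return [float(len(chunk)), float(sum(map(ord, chunk)) % 97)]

def embed_all(chunks, workers=8):
    # Threads overlap the network waits; pool.map preserves input order,
    # so vectors line up with their chunks without extra bookkeeping.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(embed, chunks))

vectors = embed_all(["first chunk", "second chunk", "third chunk"])
print(len(vectors))  # one vector per chunk
```

For a CPU-bound local model, processes (or batching on the GPU) would be the analogous move, since threads won't help there.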
pussyjuice today at 5:34 PM
After a couple of years of proving out multi-modal LLM products, I now consider RAG to be essentially "AI Lite", or just AI-inspired vector search.

It isn't really "AI" in the way ongoing LLM conversations are. The context is effectively controlled by deterministic information, and as LLMs continue to improve through various context-related techniques like re-prompting, running multiple models, etc., that deterministic "re-basing" of context will stifle the output.

So I say over time it will be treated as less and less "AI" and more "AI adjacent".

The significance is that right now RAG is largely considered an "AI pipeline strategy" in its own right, compared to others that involve pure context engineering.

But when the context size of LLMs grows much larger (with integrity), when a model can, say, accurately hold thousands and thousands of lines of code in context without having to use RAG to search and find, it will be doing a lot more for us. We will get the agentic automation they are promising but not delivering (due to this current limitation).

trgn today at 1:11 PM
Odd to me that Elasticsearch isn't finding a second breath in these new ecosystems. It basically is that now, a RAG engine with model integration.
civeng today at 1:05 PM
Great write-up. Thank you! I’m contemplating a similar RAG architecture for my engineering firm, but we’re dealing with roughly 20x the data volume (estimating around 9TB of project files, specs, and PDFs). I've been reading about Google's new STATIC framework (sparse matrix constrained decoding) and am really curious about the shift toward generative retrieval for massive speedups well beyond this approach. For those who have scaled RAG into the multi-terabyte range: is it actually worth exploring generative retrieval approaches like STATIC to bypass standard dense vector search, or is a traditional sharded vector DB (Milvus, Pinecone, etc.) still the most practical path at this scale?

I would guess the ingestion pain is still the same.

This new world is astounding.

lucfranken today at 1:13 PM
Cool work! I'd be so interested in what would happen if you put the data, and the plan/features you wanted, into a Claude Code instance and let it go. You did careful thinking, but those models now also go really far and deep, and I'd love to see what one comes up with. For that kind of data, getting something like a Mac mini or whatever (no, not with OpenClaw) would be damn interesting, to see how fast and far you can go.
Horatius77 last Tuesday at 7:36 AM
Great writeup but ... pretty sure ChromaDB is open source and not "Google's database"?
alansaber today at 1:01 PM
I think that's the first time I've seen someone write about checkpointing. Definitely worth doing for similar projects.
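
For anyone who hasn't seen it written down, the checkpointing shape is just: persist each finished batch ID, skip it on restart. A minimal sketch with a JSON checkpoint file (the file name and batch logic are illustrative):

```python
import json, os

CHECKPOINT = "checkpoint.json"

def load_done():
    # Set of batch IDs already indexed on previous runs
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return set(json.load(f))
    return set()

def mark_done(done, batch_id):
    done.add(batch_id)
    with open(CHECKPOINT, "w") as f:
        json.dump(sorted(done), f)

def index_batch(batch_id):
    pass  # stand-in for the expensive embed-and-upsert work

def run(batches):
    done = load_done()
    for batch_id in batches:
        if batch_id in done:
            continue  # already indexed before a crash or shutdown
        index_batch(batch_id)
        mark_done(done, batch_id)  # persist progress after each batch

run(["batch-000", "batch-001", "batch-002"])
```

If the VM dies mid-run, the next invocation pays only for the batches that never made it into the checkpoint file.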
supermooka today at 1:53 PM
Thanks for an interesting read! Are you monitoring usage, and what kind of user feedback have you received? I'm always curious whether these projects end up being used, because even with perfect tech, if the data is low quality nobody is going to bother.
aledevv today at 11:32 AM
I made something similar in my project. My most difficult task was choosing the right approach to chunking long documents. I used both structural and semantic chunking approaches; the semantic one helped produce better vectors for the vector DB. I used Qdrant and an OpenAI embedding model.
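
One common way to sketch the semantic variant: embed consecutive sentences and start a new chunk wherever the similarity between neighbors drops below a threshold. Here a deterministic hashed bag-of-words vector stands in for the real embedding model, and the threshold is arbitrary:

```python
import math

def toy_embed(sentence):
    # stand-in for a real embedding model: deterministic hashed bag-of-words
    vec = [0.0] * 16
    for word in sentence.lower().split():
        vec[sum(map(ord, word)) % 16] += 1.0
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def semantic_chunks(sentences, threshold=0.2):
    # Start a new chunk whenever adjacent sentences are dissimilar enough.
    chunks, current = [], [sentences[0]]
    for prev, cur in zip(sentences, sentences[1:]):
        if cosine(toy_embed(prev), toy_embed(cur)) < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(cur)
    chunks.append(" ".join(current))
    return chunks

sents = [
    "qdrant stores vectors efficiently",
    "qdrant stores payloads alongside vectors",
    "the pricing table lists costs per page",
]
print(len(semantic_chunks(sents)))  # the two qdrant sentences merge; pricing splits off
```

A real pipeline would use actual sentence embeddings and tune the threshold on held-out documents; the structural approach (split on headings/paragraphs) is usually run first, with this pass refining the long sections.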
brcmthrowaway today at 4:02 PM
What was the system prompt?
smrtinsert today at 1:51 PM
What would it look like to regularly react to source data changes? That seems like a big missing piece. Event-based? A regular cadence? Curious what people choose. Great post though.
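
One cadence-based answer is to keep a content hash per source document and re-embed only what changed on each sync. A sketch (the `reindex` call is a stand-in for chunk + embed + upsert, and the state-file name is illustrative):

```python
import hashlib, json, os

STATE = "index_state.json"

def fingerprint(text):
    return hashlib.sha256(text.encode()).hexdigest()

def reindex(doc_id, text):
    pass  # stand-in for chunk + embed + upsert into the vector DB

def sync(documents):
    """documents: {doc_id: text}. Returns the doc_ids that were re-indexed."""
    state = {}
    if os.path.exists(STATE):
        with open(STATE) as f:
            state = json.load(f)
    changed = []
    for doc_id, text in documents.items():
        h = fingerprint(text)
        if state.get(doc_id) != h:  # new, or modified since the last sync
            reindex(doc_id, text)
            state[doc_id] = h
            changed.append(doc_id)
    with open(STATE, "w") as f:
        json.dump(state, f)
    return changed

first = sync({"a.pdf": "v1", "b.pdf": "v1"})
second = sync({"a.pdf": "v2", "b.pdf": "v1"})  # only a.pdf is re-indexed
```

Run that from cron for a cadence, or trigger it from filesystem/webhook events for the event-based version; deletions would additionally need a pass over state keys missing from `documents`.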
KPGv2 today at 2:29 PM
This article came just in the nick of time. I'm in fandoms that lean heavily into fanfiction, and there's a LOT out there on Ao3. Ao3 has the worst search (and you can't even search your account's history!), so I've been wanting to create something like this as a tool for the fandom, where we can query "what was the fic about XYZ where ABC happened?" and get hopefully helpful responses. I'm very tired of not being able to do this, and it would be a fun learning experience.

I've already got the data mostly structured because I did some research on the fandom last year, charting trends and such, so I don't even need to massage the data. I've got authors, dates, chapters, reader comments, and full text already in a local SQLite db.

redwood today at 1:12 PM
Cool to see Nomic embeddings mentioned. Though I'm surprised you didn't land on Voyage.

Did you look at Turbopuffer btw?
