Private RAG Chatbot
A deep dive into building an internal RAG chatbot, including the unconventional 2FA workaround, embedding strategies, and lessons learned.
Getting data out of our internal wiki meant dealing with corporate SSO, session tokens, and 2FA codes that rotated every 30 seconds. I couldn’t disable it, so I built something… unconventional.
I mirrored my phone’s screen to the server using scrcpy and wrote a script that watched my 2FA app in real-time, extracted the codes as they appeared, and fed them straight into the scraper. It worked, but I wouldn’t recommend it for production—it’s fragile and honestly a security shortcut I wouldn’t take again. For a real system, I’d request a service account exemption from IT instead.
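For the curious, the core of that script looked roughly like the sketch below. I'm using adb screenshots and pytesseract OCR here as stand-ins (the real setup watched scrcpy's mirrored window), and the regex and polling interval are illustrative rather than the exact values I used.

```python
import re
import subprocess
import time
from io import BytesIO

from PIL import Image
import pytesseract

CODE_RE = re.compile(r"\b(\d{6})\b")  # TOTP codes are six digits

def grab_screen() -> Image.Image:
    # Pull a PNG screenshot of the phone over adb.
    raw = subprocess.run(
        ["adb", "exec-out", "screencap", "-p"],
        capture_output=True, check=True,
    ).stdout
    return Image.open(BytesIO(raw))

def read_totp_code() -> str | None:
    # OCR the screenshot and look for a six-digit code on screen.
    text = pytesseract.image_to_string(grab_screen())
    match = CODE_RE.search(text.replace(" ", ""))
    return match.group(1) if match else None

if __name__ == "__main__":
    last = None
    while True:
        code = read_totp_code()
        if code and code != last:
            print(f"new 2FA code: {code}")  # the real script fed this to the scraper
            last = code
        time.sleep(5)  # codes rotate every 30 seconds, so poll well under that
```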
The scraping itself was straightforward: 7,000 documents, ~24 hours for the initial pull, then incremental updates every 30 minutes. I compared file hashes to only re-download what changed.
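The change detection was nothing more than a content-hash manifest. A minimal sketch, with a hypothetical manifest path:

```python
import hashlib
import json
from pathlib import Path

MANIFEST = Path("hashes.json")  # hypothetical manifest location

def sha256_bytes(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def changed_documents(docs: dict[str, bytes]) -> list[str]:
    """Return the doc IDs whose content hash differs from the last run."""
    old = json.loads(MANIFEST.read_text()) if MANIFEST.exists() else {}
    new = {doc_id: sha256_bytes(content) for doc_id, content in docs.items()}
    MANIFEST.write_text(json.dumps(new, indent=2))
    return [doc_id for doc_id, digest in new.items() if old.get(doc_id) != digest]
```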
Keeping It Simple with Embeddings
I used Ollama to generate embeddings locally (probably nomic-embed-text—honestly, I don’t remember exactly). The key was keeping it on CPU so the GPU stayed free for the main language model.
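Generating an embedding is a single call to the local Ollama API. A sketch under those assumptions (the model name, as noted above, is from memory, and `num_gpu: 0` is the Ollama option I'd use to keep the embedding model on CPU):

```python
import requests

OLLAMA_URL = "http://localhost:11434/api/embeddings"

def embed(text: str, model: str = "nomic-embed-text") -> list[float]:
    # num_gpu=0 keeps this model on CPU so the GPU stays free
    # for the chat model.
    resp = requests.post(OLLAMA_URL, json={
        "model": model,
        "prompt": text,
        "options": {"num_gpu": 0},
    }, timeout=60)
    resp.raise_for_status()
    return resp.json()["embedding"]
```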
For chunking, I went with fixed 512-token chunks with 50-token overlap. Simple to implement, worked well enough. The tradeoff? Sometimes answers that spanned multiple chunks came out incomplete or disjointed. A smarter semantic chunker would’ve fixed this, but I didn’t have time.
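The chunker itself fits in a dozen lines. Here's a sketch that uses a whitespace split as a stand-in for real tokenization:

```python
def chunk_tokens(text: str, size: int = 512, overlap: int = 50) -> list[str]:
    """Fixed-size chunks with overlap; whitespace tokens stand in for real tokens."""
    tokens = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + size]
        if window:
            chunks.append(" ".join(window))
        if start + size >= len(tokens):
            break
    return chunks
```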
I also skipped images, tables, and diagrams entirely. The system extracted raw text only. If your documentation is heavy on visuals, this approach won't work for you; our users simply learned to ask text-based questions instead.
The Language Model
I tested Llama, Mistral, and a few others before landing on Gemma 2B with 4-bit quantization. It fit comfortably in my RTX 3060’s 12GB VRAM with room to spare, and the quality-to-speed ratio was the best I found.
Responses were fast (2-5 seconds most of the time) but limited. The bot excelled at finding exact wording in documentation—semantic grep, basically. It couldn’t synthesize new insights or connect ideas across documents. That’s partly the small model size, partly the chunking approach. It was fine for what we needed.
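OpenWebUI handled the retrieval-augmented prompting for me in practice, but the underlying step is simple enough to show. A hand-rolled sketch against Ollama's generate endpoint (prompt template and model tag are illustrative, not what shipped):

```python
import requests

def answer(question: str, chunks: list[str], model: str = "gemma:2b") -> str:
    # Paste the retrieved chunks into a plain "answer from the context" prompt.
    context = "\n\n---\n\n".join(chunks)
    prompt = (
        "Answer the question using only the context below. "
        "If the answer isn't in the context, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    resp = requests.post("http://localhost:11434/api/generate", json={
        "model": model,
        "prompt": prompt,
        "stream": False,
    }, timeout=120)
    resp.raise_for_status()
    return resp.json()["response"]
```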
The Backend and Frontend
Instead of building a backend from scratch, I deployed OpenWebUI in an LXC container. It’s open-source, works with Ollama out of the box, and saved me weeks of development. For authentication, I just put Nginx in front with a simple email allowlist tied to our internal SSO.
OpenWebUI’s interface was good enough too. Dark mode, markdown rendering, citation support showing which chunks matched. I didn’t customize anything. Shipping beats perfection.
Where I Cut Corners
Everything ran on a home server in my closet with Cloudflare Tunnel for external access. No monitoring, no fancy infrastructure—I just checked logs when someone complained.
Here’s where I made a choice that would’ve bitten me at scale:
I never implemented document deletion or embedding model upgrades. This meant removed documents stayed in the vector database forever, taking up space and potentially polluting search results. And if I’d wanted to upgrade the embedding model even slightly? I would’ve had to re-embed all 150,000 chunks from scratch and rebuild the entire database.
It wasn’t a problem in practice because I never actually changed models or had significant document churn. But for a production system, you’d need proper versioning—tagging embeddings with their model version, or maintaining separate stores for each generation.
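If I were doing it again, I'd tag every vector with the embedding model and version at write time. A sketch of the idea, using Chroma purely for illustration (my setup didn't necessarily use it):

```python
import chromadb

EMBED_MODEL = "nomic-embed-text"
EMBED_VERSION = "v1"  # bump when the embedding model changes

client = chromadb.PersistentClient(path="./vectordb")
collection = client.get_or_create_collection("wiki")

def index_chunk(chunk_id: str, text: str, vector: list[float]) -> None:
    # Store the model name and version alongside each vector so stale
    # generations can be filtered out or re-embedded later.
    collection.add(
        ids=[chunk_id],
        embeddings=[vector],
        documents=[text],
        metadatas=[{"embed_model": EMBED_MODEL, "embed_version": EMBED_VERSION}],
    )

def search(query_vector: list[float], k: int = 5):
    # Only match vectors produced by the current embedding generation.
    return collection.query(
        query_embeddings=[query_vector],
        n_results=k,
        where={"embed_version": EMBED_VERSION},
    )
```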
Similarly, OpenWebUI showed raw chunk text in citations (200-300 character excerpts), not links back to original documents. Good enough to verify answers came from somewhere real, but not ideal for deeper research. With better metadata, I could’ve added document titles or page numbers.
What Actually Worked
Zero data left the network. Fast enough for our needs. Shipped in days, not months. Cost nothing beyond hardware.
The gaps were real though: fixed-size chunking caused context fragmentation, no support for diagrams or tables, users couldn’t filter searches by department, and the 2FA hack was brittle.
Next time: proper document parsing with table extraction, metadata enrichment, monitoring from day one, and definitely not the phone-mirroring thing.