Building Production AI Systems That Don’t Break at 3AM

Our RAG System Didn’t Crash. It Just Got Slower. That Was Worse.
October 2024, 2:47 am. Latency monitoring lit up—p95 response time went from 2.5s to 45s over 20 minutes. No errors. No alerts. Just… slow.
The embedding API upstream had introduced silent latency (8-12s per call). Our retry logic worked perfectly—every request eventually succeeded. That was the problem. The queue backed up because we’d built for failure, not for slow success.
We fixed it in 15 minutes: aggressive 5s timeout on embeddings, cached fallback to stale vectors, async background refresh. The system recovered. The lesson stuck: retry logic handles failures. Timeouts handle slow success. You need both.
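A minimal sketch of that fix, assuming a hypothetical async `embed_text` client and an in-memory cache keyed by text hash (swap in Redis or similar in practice); the names are illustrative, not our production code:

```python
import asyncio
import hashlib

# Hypothetical async embedding call; stands in for whatever client you use.
async def embed_text(text: str) -> list[float]:
    ...

_cache: dict[str, list[float]] = {}  # stale-but-usable vectors, keyed by text hash

async def embed_with_fallback(text: str, timeout: float = 5.0) -> list[float]:
    key = hashlib.sha256(text.encode()).hexdigest()
    try:
        # Hard timeout: slow success gets treated like failure.
        vector = await asyncio.wait_for(embed_text(text), timeout=timeout)
        _cache[key] = vector
        return vector
    except asyncio.TimeoutError:
        if key in _cache:
            # Serve the stale vector now, refresh it in the background.
            asyncio.create_task(_refresh(key, text))
            return _cache[key]
        raise  # no fallback available; let the caller decide

async def _refresh(key: str, text: str) -> None:
    try:
        _cache[key] = await embed_text(text)
    except Exception:
        pass  # background refresh is best-effort
```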
TL;DR: Production AI fails quietly—slow APIs, silent hallucinations, memory leaks in long conversations. Systems that survive have three things: clear workflow architecture (RAG/code-gen/vision), reliability layers (timeouts + retries + thresholds), and enough observability to debug at 3 am. Most failures trace to missing error handling, not model quality.

Three Workflow Patterns (Pick One)
Before choosing tools, understand which pattern fits your problem:
RAG Pipeline: Document Q&A, knowledge bases. Breaks on poor chunk boundaries—splitting mid-sentence loses context. Fix: overlap chunks 10-15% (see the sketch after this list), test retrieval on actual queries before production.
Code Generation: Dev tools, PR reviews, test writing. Main risk: unsafe execution. Fix: Docker sandboxes with no network, 512MB RAM limit, 30s timeout, whitelisted imports only.
Vision + Trigger: Object detection, monitoring. Struggles with lighting changes and partial occlusion. Fix: confidence threshold >0.7, cooldown windows to prevent alert spam.
Start with one. Mixing patterns before understanding failure modes compounds debugging pain.
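The chunk-overlap fix above is simple enough to sketch. This is a character-based version with ~12% overlap; it assumes plain text and ignores sentence boundaries, which a production splitter should respect:

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap_ratio: float = 0.12) -> list[str]:
    """Split text into fixed-size chunks that overlap by ~10-15%.

    Character-based for simplicity; swap in a tokenizer for production.
    """
    overlap = int(chunk_size * overlap_ratio)
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk.strip():
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break
    return chunks
```

Whatever splitter you use, run it against a handful of real queries before shipping; overlap hides mid-sentence cuts but won't rescue chunks that are too small to answer the question.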
Stack Rules (Principles Over Brands)
Choose models for reasoning depth vs speed. Multi-turn workflows need stronger models. Simple routing uses faster options. Don’t mix families mid-workflow—output consistency matters for debugging.
Embeddings: higher dimensions (1024+) for mixed content, lower (384-768) for pure text. Test on your data—benchmarks lie.
Vector stores: managed services for <1M vectors, self-hosted past 5M, or when privacy restricts cloud.
Current snapshot (early 2025, will age): Claude Sonnet for reasoning, GPT-4o for speed, Llama 4 for batch. OpenAI embeddings. Pinecone or Qdrant. LangSmith for LLM tracing.

Architecture: Three Layers
Ingest: Validate early. File type, size limits (PDF <10MB prevents memory spikes), schema checks. Reject bad inputs before they hit models.
Agent Logic: Loop: receive input → select tools → call model → validate → return or retry. State: in-memory for sessions, Redis for multi-turn, Postgres for history.
Guardrails: Confidence thresholds (RAG <0.7 = reject), retry with backoff (1s, 2s, 4s), output validation (verify citations, syntax-check code), rate limits (10/min per user).
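A sketch of how the retry-with-backoff and confidence gate might compose, assuming a `retrieve` callable that returns `(chunk, score)` pairs; the thresholds mirror the numbers above, and the broad exception handling should be narrowed to transient errors in practice:

```python
import time

def retrieve_with_guardrails(query: str, retrieve, min_score: float = 0.7,
                             backoffs: tuple = (1, 2, 4)) -> list:
    """Retry transient retrieval failures with 1s/2s/4s backoff,
    then reject low-confidence results instead of passing them to the model."""
    last_error = None
    for delay in (0, *backoffs):
        if delay:
            time.sleep(delay)
        try:
            chunks = retrieve(query)   # expected: [(chunk_text, score), ...]
            break
        except Exception as exc:       # narrow this to transient errors in practice
            last_error = exc
    else:
        raise RuntimeError(f"Retrieval failed after retries: {last_error}")

    confident = [(c, s) for c, s in chunks if s >= min_score]
    if not confident:
        raise ValueError(f"Retrieval failed: no chunks >={min_score} threshold")
    return confident
```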
What Actually Breaks (And Fixes)
Silent Hallucinations: Model returns unsupported answers. Fix: require citations—if chunks don’t support claims, return “no relevant info” with closest matches.
Timeout Cascades: Slow APIs block queues. Fix: async processing, aggressive timeouts (5s embeddings, 10s generation), cached fallbacks.
Context Window Growth: Long conversations exceed limits. Fix: sliding window (last 10 turns), summarize every 15-20 turns, keep full history in the DB and send only recent turns to inference (sketch after this list).
Cost Spikes: Heavy users drive unexpected spend. Fix: per-user quotas (500 req/day), caching (24h TTL), monitor cost per user.
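For the context-window fix, a minimal sliding-window-plus-summary sketch; `summarize` stands in for whatever model call folds old turns into a running summary, and durable history still belongs in the database:

```python
class ConversationWindow:
    """Keep the last N turns in the prompt; fold older turns into a running summary.

    `summarize` is a stand-in for the model call that produces the summary;
    the full transcript should still be written to durable storage separately.
    """

    def __init__(self, summarize, window: int = 10, summarize_every: int = 20):
        self.summarize = summarize
        self.window = window
        self.summarize_every = summarize_every
        self.turns: list[dict] = []
        self.summary: str = ""
        self._since_summary = 0

    def add(self, role: str, content: str) -> None:
        self.turns.append({"role": role, "content": content})
        self._since_summary += 1
        if self._since_summary >= self.summarize_every:
            older = self.turns[:-self.window]
            if older:
                self.summary = self.summarize(self.summary, older)
                self.turns = self.turns[-self.window:]
            self._since_summary = 0

    def prompt_messages(self) -> list[dict]:
        prefix = [{"role": "system", "content": f"Summary so far: {self.summary}"}] if self.summary else []
        return prefix + self.turns[-self.window:]
```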
Monitoring: Measure Hallucinations
Beyond latency and errors:
Citation checking: Parse output for claims, verify chunks support them. Flag responses with <50% coverage. Misses subtle errors but catches obvious hallucinations.
User feedback: Track the “helpful” ratio. Hallucinated responses tend to sit below 30% helpful versus above 70% for accurate ones. Manually review low-rated outputs.
Golden dataset: Run 20-30 known queries daily (sketch below). If the match rate drops below 85%, investigate drift.
Alert on: p95 latency >10s for 5min, error rate >5%, cost >2x daily average. Log: timestamp, user_id, input hash, models/tools, latency breakdown, errors.
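A sketch of the golden-dataset check, assuming a hypothetical `answer` function that runs the full pipeline and expected answers expressed as substrings; real checks often use fuzzy or model-graded matching instead:

```python
def golden_dataset_check(answer, cases: list[dict], min_match_rate: float = 0.85) -> dict:
    """Run known queries through the pipeline and compare against expected substrings.

    `cases` look like: {"query": "...", "expected": "refund window is 30 days"}.
    Substring matching is crude; swap in fuzzy or model-graded comparison as needed.
    """
    passed = 0
    failures = []
    for case in cases:
        output = answer(case["query"])
        if case["expected"].lower() in output.lower():
            passed += 1
        else:
            failures.append({"query": case["query"], "got": output[:200]})
    rate = passed / len(cases) if cases else 0.0
    return {"match_rate": rate, "drifted": rate < min_match_rate, "failures": failures}
```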

The Debuggability Test
Can someone unfamiliar with this code debug production issues using only logs and docs?
Needs: clear errors (“Retrieval failed: no chunks >0.7 threshold”), logged tool calls with inputs/outputs (example below), basic runbooks (“latency spike → check Redis pool → upstream API”).
Systems that only one person can debug create operational risk.
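As a concrete example of the “logged tool calls” requirement, one JSON line per call with the fields listed in the monitoring section is usually enough; the field names here are illustrative:

```python
import hashlib
import json
import logging
import time

logger = logging.getLogger("agent")

def log_tool_call(user_id: str, user_input: str, tool: str, args: dict,
                  output: str, latency_ms: float, error: str | None = None) -> None:
    """Emit one JSON line per tool call: enough for someone else to replay the request."""
    logger.info(json.dumps({
        "ts": time.time(),
        "user_id": user_id,
        "input_hash": hashlib.sha256(user_input.encode()).hexdigest()[:16],
        "tool": tool,
        "args": args,
        "output_preview": output[:500],
        "latency_ms": round(latency_ms, 1),
        "error": error,
    }))
```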
FAQ
Framework choice? LangChain for rapid prototyping. LlamaIndex for retrieval quality. Build from scratch when frameworks fight your logic.
Prevent unsafe code execution? Docker sandboxes: no network, 512MB RAM, 30s timeout, whitelisted imports. Log all attempts.
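A sketch of that container invocation, assuming the generated code has already passed an import-allowlist scan and been written to an absolute path on the host; the base image and limits are illustrative:

```python
import subprocess

def run_in_sandbox(code_path: str, timeout_s: int = 30) -> subprocess.CompletedProcess:
    """Execute generated code in a locked-down container: no network, capped memory
    and CPU, read-only filesystem, hard wall-clock timeout enforced by the caller."""
    cmd = [
        "docker", "run", "--rm",
        "--network", "none",        # no outbound calls
        "--memory", "512m",         # cap RAM
        "--cpus", "1",              # cap CPU
        "--read-only",              # container filesystem is read-only
        "--tmpfs", "/tmp",          # writable scratch space only
        "-v", f"{code_path}:/app/main.py:ro",
        "python:3.11-slim",
        "python", "/app/main.py",
    ]
    return subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
```

Note that the subprocess timeout kills the local docker client, not necessarily the container, so pair it with explicit container cleanup.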
Minimum monitoring for RAG? Latency (p95) and retrieval match scores. Alert if p95 >8s or scores <0.6 over 1hr. User feedback catches drift.
Fine-tune vs RAG? Start with RAG. Fine-tune when quality plateaus, you have >1000 examples, and iteration speed matters less than runtime performance.
Production RAG costs? ~$500-800/month for 1000 req/day. Caching cuts that by 40-60% after month one. Varies by doc size, queries, and models.
Author: 18 months building document-processing and dev-tooling systems with LLM orchestration for B2B SaaS. Limited consumer, real-time, and recommendation experience. Tool recommendations reflect this sample and will age; the principles should last longer.



