Case Study
Building Resilient AI Pipelines at Ampwise
Designing multi-provider LLM infrastructure with circuit breakers, exactly-once delivery, and 99.9% uptime.
The Challenge
At Ampwise, we're building AI infrastructure for B2B sales automation. The platform processes quotes, deals, and supplier communications using multiple LLM providers. The challenge: LLM APIs are unreliable in ways that traditional API integrations aren't prepared for.
Problems we faced:
- Rate limits that vary by time of day
- Latency spikes from 100ms to 30+ seconds
- Provider outages that last minutes to hours
- Cost variations of 10x between providers
- Inconsistent response formats despite identical prompts
Target: 99.9% uptime for our document processing pipeline
Architecture Overview
Our system has three main components:
- Ingestion Layer: Receives documents, validates, queues for processing
- Processing Pipeline: Extracts and validates information using LLMs
- Delivery Layer: Routes results to appropriate downstream systems
The key insight: each component needs different reliability patterns.
Multi-Provider Strategy
Provider Abstraction
We built an abstraction layer over LLM providers (OpenAI, Anthropic, Google). The system treats them as interchangeable backends, routing requests based on availability, cost, and performance.
Circuit Breaker Pattern: Each provider has a circuit breaker that monitors failure rates. When a provider starts failing, the circuit breaker "opens" and routes traffic elsewhere.
This prevents cascading failures. If OpenAI is having issues, we automatically fail over to Anthropic without manual intervention.
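A minimal sketch of the idea, not our production code: the class names and the uniform `complete()` client interface below are illustrative assumptions, and the default thresholds are the values we eventually settled on (see Lessons Learned).

```python
import time
from typing import Optional, Protocol


class LLMClient(Protocol):
    # Hypothetical uniform interface the abstraction layer exposes per provider.
    def complete(self, prompt: str) -> str: ...


class CircuitBreaker:
    """Per-provider breaker: opens after N consecutive failures, retries after a cooldown."""

    def __init__(self, failure_threshold: int = 10, recovery_seconds: float = 60.0):
        self.failure_threshold = failure_threshold
        self.recovery_seconds = recovery_seconds
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True  # closed: requests flow normally
        # open: allow a trial request once the cooldown has elapsed (half-open)
        return time.monotonic() - self.opened_at >= self.recovery_seconds

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()


def complete_with_failover(providers: dict[str, LLMClient],
                           breakers: dict[str, CircuitBreaker],
                           prompt: str) -> str:
    """Try providers in priority order (dict order), skipping any whose breaker is open."""
    for name, client in providers.items():
        breaker = breakers[name]
        if not breaker.allow():
            continue
        try:
            result = client.complete(prompt)
            breaker.record_success()
            return result
        except Exception:
            breaker.record_failure()
    raise RuntimeError("All providers unavailable")
```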
Dynamic Routing
Provider priority is dynamic and based on the following factors (a routing sketch follows the list):
- Cost: Cheaper providers preferred for non-critical tasks
- Latency: Real-time processing uses fastest available
- Capability: Some tasks require specific model capabilities
- Health: Circuit breaker state influences routing
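Conceptually, the routing decision looks roughly like this; the `ProviderStats` fields, the capability tags, and the scoring order are illustrative assumptions rather than our exact scoring logic.

```python
from dataclasses import dataclass


@dataclass
class ProviderStats:
    name: str
    cost_per_1k_tokens: float  # rolling average (illustrative)
    p95_latency_ms: float      # rolling window (illustrative)
    breaker_open: bool
    capabilities: set[str]


def pick_provider(stats: list[ProviderStats], required: set[str], latency_sensitive: bool) -> str:
    """Rank healthy providers that satisfy the task's capability requirements."""
    candidates = [s for s in stats if not s.breaker_open and required <= s.capabilities]
    if not candidates:
        raise RuntimeError("No healthy provider supports this task")
    if latency_sensitive:
        # Real-time paths: fastest first, cost as the tie-breaker.
        best = min(candidates, key=lambda s: (s.p95_latency_ms, s.cost_per_1k_tokens))
    else:
        # Batch paths: cheapest first, latency as the tie-breaker.
        best = min(candidates, key=lambda s: (s.cost_per_1k_tokens, s.p95_latency_ms))
    return best.name
```

A bulk job would call something like `pick_provider(stats, required={"structured_output"}, latency_sensitive=False)` to land on the cheapest capable backend (the capability tag here is hypothetical).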
Exactly-Once Processing
Document processing must happen exactly once. Duplicate processing means duplicate downstream effects (emails, notifications, CRM updates).
The Outbox Pattern
We use the transactional outbox pattern: process the document and write the result in a single database transaction. A separate worker reads from the outbox and publishes events to RabbitMQ.
This guarantees that if processing succeeds, the event will eventually be published. If the system crashes mid-processing, the transaction rolls back and neither the result nor the event is written.
Key benefit: Atomic operations without distributed transactions.
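A condensed sketch of the write side, assuming a DB-API connection (e.g. psycopg2) with autocommit off; the table and column names are illustrative, and the relay worker that reads unpublished outbox rows and publishes them to RabbitMQ is omitted.

```python
import json
import uuid


def record_result_and_event(conn, document_id: str, extraction: dict) -> None:
    """Write the processing result and its outbox event in one database transaction."""
    with conn.cursor() as cur:
        cur.execute(
            "INSERT INTO extraction_results (document_id, payload) VALUES (%s, %s)",
            (document_id, json.dumps(extraction)),
        )
        cur.execute(
            "INSERT INTO outbox (id, event_type, payload, published)"
            " VALUES (%s, %s, %s, FALSE)",
            (str(uuid.uuid4()), "document.processed", json.dumps(extraction)),
        )
    conn.commit()  # both rows commit together, or neither does
```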
Idempotent Consumers
Despite our best efforts, consumers might still receive duplicates (network retries, broker redeliveries after a missed acknowledgment). Every consumer tracks message IDs to detect and ignore duplicates.
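A minimal sketch of the dedup check, assuming PostgreSQL with a unique constraint on `message_id`; the table name and the injected side-effect handler are illustrative.

```python
from typing import Callable


def handle_event(conn, event: dict, apply_side_effects: Callable[[dict], None]) -> bool:
    """Process an event at most once; returns False if the message ID was already seen."""
    with conn.cursor() as cur:
        cur.execute(
            "INSERT INTO processed_messages (message_id) VALUES (%s) ON CONFLICT DO NOTHING",
            (event["message_id"],),
        )
        if cur.rowcount == 0:   # duplicate delivery: ack upstream, do nothing here
            conn.commit()
            return False
    apply_side_effects(event)    # e.g. send email, update CRM
    conn.commit()                # records the ID so a redelivery is ignored next time
    return True
```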
RAG for Document Processing
We use Retrieval-Augmented Generation for processing documents:
- Chunk the document into semantically meaningful sections
- Generate embeddings for each chunk
- Store in vector database for efficient retrieval
- Retrieve relevant chunks when building extraction prompts
- Generate structured extraction with LLM using relevant context
This improves accuracy significantly compared to the naive approach of sending the entire document to the LLM in one prompt.
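A condensed sketch of the retrieval step using classic LangChain components; import paths and class names vary across LangChain versions, and the chunk sizes and `k` value here are illustrative rather than our tuned settings.

```python
# Illustrative retrieval step; import paths follow classic LangChain releases
# and may differ in newer versions.
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS


def build_extraction_context(document_text: str, query: str, k: int = 4) -> str:
    # 1-2. Chunk the document into overlapping sections; embeddings are computed per chunk.
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
    chunks = splitter.split_text(document_text)

    # 3. Index the chunks in a vector store (in-memory here; production would persist it).
    store = FAISS.from_texts(chunks, OpenAIEmbeddings())

    # 4. Retrieve the chunks most relevant to the extraction query.
    relevant = store.similarity_search(query, k=k)

    # 5. Join the retrieved chunks into the context block of the extraction prompt.
    return "\n\n".join(doc.page_content for doc in relevant)
```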
Validation with Pydantic
LLM outputs are unreliable. We validate everything with Pydantic schemas. If an LLM returns invalid data (wrong format, missing fields, out-of-range values), we retry with error feedback.
This validation reduced manual review time by 90%.
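A simplified sketch of the validate-and-retry loop, assuming Pydantic v2; the schema fields and the injected `call_llm` function are illustrative, not our actual extraction schema.

```python
from pydantic import BaseModel, Field, ValidationError


class QuoteExtraction(BaseModel):
    # Illustrative schema; field names and constraints are assumptions.
    supplier_name: str
    total_amount: float = Field(gt=0)
    currency: str = Field(pattern=r"^[A-Z]{3}$")
    line_item_count: int = Field(ge=0)


def extract_with_retries(call_llm, prompt: str, max_attempts: int = 3) -> QuoteExtraction:
    """Validate LLM output against the schema; on failure, retry with the errors fed back."""
    last_error = ""
    for _ in range(max_attempts):
        raw = call_llm(prompt + last_error)
        try:
            return QuoteExtraction.model_validate_json(raw)  # Pydantic v2 API
        except ValidationError as exc:
            # Feed the validation errors back so the model can correct itself.
            last_error = f"\n\nYour previous answer was invalid:\n{exc}\nReturn corrected JSON only."
    raise ValueError("LLM failed to produce valid output after retries")
```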
Observability
We track the following (an instrumentation sketch follows these lists):
Per-Request Metrics:
- Provider used
- Latency
- Token counts (input/output)
- Cost
- Success/failure
System Health:
- Circuit breaker states
- Queue depths
- Processing latency percentiles
- Error rates by type
Business Metrics:
- Documents processed per hour
- Extraction accuracy (sampled)
- Manual review rate
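A minimal sketch of how the per-request metrics might be captured as structured log records; the response attributes (`input_tokens`, `cost_usd`, etc.) assume a uniform client wrapper and are illustrative.

```python
import json
import logging
import time
from dataclasses import asdict, dataclass

logger = logging.getLogger("llm_metrics")


@dataclass
class LLMRequestMetrics:
    provider: str
    latency_ms: float
    input_tokens: int
    output_tokens: int
    cost_usd: float
    success: bool


def timed_completion(client, provider: str, prompt: str):
    """Wrap an LLM call and emit one structured metrics record per request."""
    start = time.monotonic()
    try:
        response = client.complete(prompt)  # hypothetical uniform client wrapper
        record = LLMRequestMetrics(
            provider=provider,
            latency_ms=(time.monotonic() - start) * 1000,
            input_tokens=response.input_tokens,    # assumed response fields
            output_tokens=response.output_tokens,
            cost_usd=response.cost_usd,
            success=True,
        )
        return response
    except Exception:
        record = LLMRequestMetrics(provider, (time.monotonic() - start) * 1000, 0, 0, 0.0, False)
        raise
    finally:
        logger.info(json.dumps(asdict(record)))  # one JSON log line per request
```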
Results
After six months:
- Uptime: 99.92% (exceeded target)
- Latency: P99 under 10 seconds
- Cost: 40% reduction through smart routing
- Manual Review: Reduced from 60% to 6% of documents
Lessons Learned
LLMs Fail Differently Than Traditional APIs: Rate limits vary by time of day. Latency can spike from 100ms to 30s without it being a "failure". We had to rethink what constitutes a timeout versus normal operation.
Validation Saved Us From Disasters: Early versions trusted LLM outputs. Bad idea. LLMs confidently return invalid JSON, wrong data types, hallucinated values. Pydantic schemas with strict validation caught these before they reached production.
Circuit Breakers Need Tuning: Default settings (5 failures, 30s timeout) didn't work. LLM APIs need higher thresholds (10 failures) and longer recovery (60s). Each provider needs different settings.
Multi-Provider Complexity Is Worth It: Building an abstraction layer for three providers felt like overkill initially. But when OpenAI had a 4-hour outage, we automatically failed over to Anthropic. Zero downtime. The abstraction paid for itself in that single incident.
Cost Optimization Requires Routing Intelligence: Running all requests through GPT-4 would have cost 10x more. Routing simple extractions to cheaper models while reserving expensive models for complex tasks reduced costs by 40%.
Exactly-Once Processing Is Hard But Essential: First implementation had race conditions causing duplicate processing. Transactional outbox pattern was more complex but eliminated duplicate downstream effects (double emails, double CRM entries).
Observability From Day One: Built comprehensive metrics and logging before scale. When issues appeared at 1000 documents/hour, we had the data to debug. Would have been impossible to add after the fact.
RAG Accuracy vs Latency Trade-offs: More context chunks improved accuracy but increased latency and cost. Found sweet spot at 3-5 relevant chunks. Diminishing returns beyond that.
Technologies
- Python (backend services)
- NestJS (API layer)
- PostgreSQL (data + outbox)
- RabbitMQ (message queue)
- LangChain (LLM orchestration)
- Multiple LLM providers (OpenAI, Anthropic, Google)
This project reinforced that building reliable AI systems requires treating LLMs as unreliable infrastructure and adding appropriate resilience patterns. The patterns that work for traditional APIs (retry, timeout, circuit breaker) need adjustment for the unique failure modes of AI services.