Case Study
Building Resilient AI Pipelines at Ampwise
Designing multi-provider LLM infrastructure with circuit breakers, exactly-once delivery, and 99.9% uptime.
The Challenge
At Ampwise, we're building AI infrastructure for B2B sales automation. The platform processes quotes, deals, and supplier communications using multiple LLM providers. The challenge: LLM APIs are unreliable in ways that traditional API integrations aren't prepared for.
Problems we faced:
- Rate limits that vary by time of day
- Latency spikes from 100ms to 30+ seconds
- Provider outages that last minutes to hours
- Cost variations of 10x between providers
- Inconsistent response formats despite identical prompts
Target: 99.9% uptime for our document processing pipeline
Architecture Overview
Our system has three main components:
- Ingestion Layer: Receives documents, validates, queues for processing
- Processing Pipeline: Extracts and validates information using LLMs
- Delivery Layer: Routes results to appropriate downstream systems
The key insight: each component needs different reliability patterns.
Multi-Provider Strategy
Provider Abstraction
We built an abstraction layer over LLM providers (OpenAI, Anthropic, Google). The system treats them as interchangeable backends, routing requests based on availability, cost, and performance.
Circuit Breaker Pattern: Each provider has a circuit breaker that monitors failure rates. When a provider starts failing, the circuit breaker "opens" and routes traffic elsewhere.
This prevents cascading failures. If OpenAI is having issues, we automatically fail over to Anthropic without manual intervention.
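A minimal sketch of the idea, not our production code: the class names and the uniform `complete()` client interface below are illustrative assumptions, and the default thresholds are the values we eventually settled on (see Lessons Learned).

```python
import time
from typing import Optional, Protocol


class LLMClient(Protocol):
    # Hypothetical uniform interface the abstraction layer exposes per provider.
    def complete(self, prompt: str) -> str: ...


class CircuitBreaker:
    """Per-provider breaker: opens after N consecutive failures, retries after a cooldown."""

    def __init__(self, failure_threshold: int = 10, recovery_seconds: float = 60.0):
        self.failure_threshold = failure_threshold
        self.recovery_seconds = recovery_seconds
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True  # closed: requests flow normally
        # open: allow a trial request once the cooldown has elapsed (half-open)
        return time.monotonic() - self.opened_at >= self.recovery_seconds

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()


def complete_with_failover(providers: dict[str, LLMClient],
                           breakers: dict[str, CircuitBreaker],
                           prompt: str) -> str:
    """Try providers in priority order (dict order), skipping any whose breaker is open."""
    for name, client in providers.items():
        breaker = breakers[name]
        if not breaker.allow():
            continue
        try:
            result = client.complete(prompt)
            breaker.record_success()
            return result
        except Exception:
            breaker.record_failure()
    raise RuntimeError("All providers unavailable")
```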
Dynamic Routing
Provider priority is dynamic and based on the following factors (a routing sketch follows the list):
- Cost: Cheaper providers preferred for non-critical tasks
- Latency: Real-time processing uses fastest available
- Capability: Some tasks require specific model capabilities
- Health: Circuit breaker state influences routing
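Conceptually, the routing decision looks roughly like this; the `ProviderStats` fields, the capability tags, and the scoring order are illustrative assumptions rather than our exact scoring logic.

```python
from dataclasses import dataclass


@dataclass
class ProviderStats:
    name: str
    cost_per_1k_tokens: float  # rolling average (illustrative)
    p95_latency_ms: float      # rolling window (illustrative)
    breaker_open: bool
    capabilities: set[str]


def pick_provider(stats: list[ProviderStats], required: set[str], latency_sensitive: bool) -> str:
    """Rank healthy providers that satisfy the task's capability requirements."""
    candidates = [s for s in stats if not s.breaker_open and required <= s.capabilities]
    if not candidates:
        raise RuntimeError("No healthy provider supports this task")
    if latency_sensitive:
        # Real-time paths: fastest first, cost as the tie-breaker.
        best = min(candidates, key=lambda s: (s.p95_latency_ms, s.cost_per_1k_tokens))
    else:
        # Batch paths: cheapest first, latency as the tie-breaker.
        best = min(candidates, key=lambda s: (s.cost_per_1k_tokens, s.p95_latency_ms))
    return best.name
```

A bulk job would call something like `pick_provider(stats, required={"structured_output"}, latency_sensitive=False)` to land on the cheapest capable backend (the capability tag here is hypothetical).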
Exactly-Once Processing
Document processing must happen exactly once. Duplicate processing means duplicate downstream effects (emails, notifications, CRM updates).
The Outbox Pattern
We use the transactional outbox pattern: process the document and write the result in a single database transaction. A separate worker reads from the outbox and publishes events to RabbitMQ.
This guarantees that if processing succeeds, the event will eventually be published. If the system crashes mid-processing, the transaction rolls back and neither the result nor the event is written.
Key benefit: Atomic operations without distributed transactions.
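A condensed sketch of the write side, assuming a DB-API connection (e.g. psycopg2) with autocommit off; the table and column names are illustrative, and the relay worker that reads unpublished outbox rows and publishes them to RabbitMQ is omitted.

```python
import json
import uuid


def record_result_and_event(conn, document_id: str, extraction: dict) -> None:
    """Write the processing result and its outbox event in one database transaction."""
    with conn.cursor() as cur:
        cur.execute(
            "INSERT INTO extraction_results (document_id, payload) VALUES (%s, %s)",
            (document_id, json.dumps(extraction)),
        )
        cur.execute(
            "INSERT INTO outbox (id, event_type, payload, published)"
            " VALUES (%s, %s, %s, FALSE)",
            (str(uuid.uuid4()), "document.processed", json.dumps(extraction)),
        )
    conn.commit()  # both rows commit together, or neither does
```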
Idempotent Consumers
Despite our best efforts, consumers might still receive duplicates (network retries, broker redeliveries after a missed acknowledgment). Every consumer tracks message IDs to detect and ignore duplicates.
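A minimal sketch of the dedup check, assuming PostgreSQL with a unique constraint on `message_id`; the table name and the injected side-effect handler are illustrative.

```python
from typing import Callable


def handle_event(conn, event: dict, apply_side_effects: Callable[[dict], None]) -> bool:
    """Process an event at most once; returns False if the message ID was already seen."""
    with conn.cursor() as cur:
        cur.execute(
            "INSERT INTO processed_messages (message_id) VALUES (%s) ON CONFLICT DO NOTHING",
            (event["message_id"],),
        )
        if cur.rowcount == 0:   # duplicate delivery: ack upstream, do nothing here
            conn.commit()
            return False
    apply_side_effects(event)    # e.g. send email, update CRM
    conn.commit()                # records the ID so a redelivery is ignored next time
    return True
```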
RAG for Document Processing
We use Retrieval-Augmented Generation for processing documents:
- Chunk the document into semantically meaningful sections
- Generate embeddings for each chunk
- Store in vector database for efficient retrieval
- Retrieve relevant chunks when building extraction prompts
- Generate structured extraction with LLM using relevant context
This improves accuracy significantly compared to the naive approach of sending the entire document to the LLM in one prompt.
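A condensed sketch of the retrieval step using classic LangChain components; import paths and class names vary across LangChain versions, and the chunk sizes and `k` value here are illustrative rather than our tuned settings.

```python
# Illustrative retrieval step; import paths follow classic LangChain releases
# and may differ in newer versions.
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS


def build_extraction_context(document_text: str, query: str, k: int = 4) -> str:
    # 1-2. Chunk the document into overlapping sections; embeddings are computed per chunk.
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
    chunks = splitter.split_text(document_text)

    # 3. Index the chunks in a vector store (in-memory here; production would persist it).
    store = FAISS.from_texts(chunks, OpenAIEmbeddings())

    # 4. Retrieve the chunks most relevant to the extraction query.
    relevant = store.similarity_search(query, k=k)

    # 5. Join the retrieved chunks into the context block of the extraction prompt.
    return "\n\n".join(doc.page_content for doc in relevant)
```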
Validation with Pydantic
LLM outputs are unreliable. We validate everything with Pydantic schemas. If an LLM returns invalid data (wrong format, missing fields, out-of-range values), we retry with error feedback.
This validation reduced manual review time by 90%.
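A simplified sketch of the validate-and-retry loop, assuming Pydantic v2; the schema fields and the injected `call_llm` function are illustrative, not our actual extraction schema.

```python
from pydantic import BaseModel, Field, ValidationError


class QuoteExtraction(BaseModel):
    # Illustrative schema; field names and constraints are assumptions.
    supplier_name: str
    total_amount: float = Field(gt=0)
    currency: str = Field(pattern=r"^[A-Z]{3}$")
    line_item_count: int = Field(ge=0)


def extract_with_retries(call_llm, prompt: str, max_attempts: int = 3) -> QuoteExtraction:
    """Validate LLM output against the schema; on failure, retry with the errors fed back."""
    last_error = ""
    for _ in range(max_attempts):
        raw = call_llm(prompt + last_error)
        try:
            return QuoteExtraction.model_validate_json(raw)  # Pydantic v2 API
        except ValidationError as exc:
            # Feed the validation errors back so the model can correct itself.
            last_error = f"\n\nYour previous answer was invalid:\n{exc}\nReturn corrected JSON only."
    raise ValueError("LLM failed to produce valid output after retries")
```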
Observability
We track the following (an instrumentation sketch follows these lists):
Per-Request Metrics:
- Provider used
- Latency
- Token counts (input/output)
- Cost
- Success/failure
System Health:
- Circuit breaker states
- Queue depths
- Processing latency percentiles
- Error rates by type
Business Metrics:
- Documents processed per hour
- Extraction accuracy (sampled)
- Manual review rate
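A minimal sketch of how the per-request metrics might be captured as structured log records; the response attributes (`input_tokens`, `cost_usd`, etc.) assume a uniform client wrapper and are illustrative.

```python
import json
import logging
import time
from dataclasses import asdict, dataclass

logger = logging.getLogger("llm_metrics")


@dataclass
class LLMRequestMetrics:
    provider: str
    latency_ms: float
    input_tokens: int
    output_tokens: int
    cost_usd: float
    success: bool


def timed_completion(client, provider: str, prompt: str):
    """Wrap an LLM call and emit one structured metrics record per request."""
    start = time.monotonic()
    try:
        response = client.complete(prompt)  # hypothetical uniform client wrapper
        record = LLMRequestMetrics(
            provider=provider,
            latency_ms=(time.monotonic() - start) * 1000,
            input_tokens=response.input_tokens,    # assumed response fields
            output_tokens=response.output_tokens,
            cost_usd=response.cost_usd,
            success=True,
        )
        return response
    except Exception:
        record = LLMRequestMetrics(provider, (time.monotonic() - start) * 1000, 0, 0, 0.0, False)
        raise
    finally:
        logger.info(json.dumps(asdict(record)))  # one JSON log line per request
```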
Results
After six months:
- Uptime: 99.92% (exceeded target)
- Latency: P99 under 10 seconds
- Cost: 40% reduction through smart routing
- Manual Review: Reduced from 60% to 6% of documents
Lessons Learned
LLMs Fail Differently Than Traditional APIs: Rate limits vary by time of day. Latency can spike from 100ms to 30s without it being a "failure". We had to rethink what constitutes a timeout versus normal operation.
Validation Saved Us From Disasters: Early versions trusted LLM outputs. Bad idea. LLMs confidently return invalid JSON, wrong data types, hallucinated values. Pydantic schemas with strict validation caught these before they reached production.
Circuit Breakers Need Tuning: Default settings (5 failures, 30s timeout) didn't work. LLM APIs need higher thresholds (10 failures) and longer recovery (60s). Each provider needs different settings.
Multi-Provider Complexity Is Worth It: Building an abstraction layer for three providers felt like overkill initially. But when OpenAI had a 4-hour outage, we automatically failed over to Anthropic. Zero downtime. The abstraction paid for itself in that single incident.
Cost Optimization Requires Routing Intelligence: Running all requests through GPT-4 would have cost 10x more. Routing simple extractions to cheaper models while reserving expensive models for complex tasks reduced costs by 40%.
Exactly-Once Processing Is Hard But Essential: First implementation had race conditions causing duplicate processing. Transactional outbox pattern was more complex but eliminated duplicate downstream effects (double emails, double CRM entries).
Observability From Day One: Built comprehensive metrics and logging before scale. When issues appeared at 1000 documents/hour, we had the data to debug. Would have been impossible to add after the fact.
RAG Accuracy vs Latency Trade-offs: More context chunks improved accuracy but increased latency and cost. Found sweet spot at 3-5 relevant chunks. Diminishing returns beyond that.
Technologies
- Python (backend services)
- NestJS (API layer)
- PostgreSQL (data + outbox)
- RabbitMQ (message queue)
- LangChain (LLM orchestration)
- Multiple LLM providers (OpenAI, Anthropic, Google)
This project reinforced that building reliable AI systems requires treating LLMs as unreliable infrastructure and adding appropriate resilience patterns. The patterns that work for traditional APIs (retry, timeout, circuit breaker) need adjustment for the unique failure modes of AI services.