Building Resilient AI Pipelines at Ampwise

Designing multi-provider LLM infrastructure with circuit breakers, exactly-once delivery, and 99.9% uptime.

The Challenge

At Ampwise, we're building AI infrastructure for B2B sales automation. The platform processes quotes, deals, and supplier communications using multiple LLM providers. The challenge: LLM APIs are unreliable in ways that traditional API integrations aren't prepared for.

Problems we faced:

  • Rate limits that vary by time of day
  • Latency spikes from 100ms to 30+ seconds
  • Provider outages that last minutes to hours
  • Cost variations of 10x between providers
  • Inconsistent response formats despite identical prompts

Target: 99.9% uptime for our document processing pipeline

Architecture Overview

Our system has three main components:

  1. Ingestion Layer: Receives documents, validates, queues for processing
  2. Processing Pipeline: Extracts and validates information using LLMs
  3. Delivery Layer: Routes results to appropriate downstream systems

The key insight: each component needs different reliability patterns.

Multi-Provider Strategy

Provider Abstraction

We built an abstraction layer over LLM providers (OpenAI, Anthropic, Google). The system treats them as interchangeable backends, routing requests based on availability, cost, and performance.
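
A minimal sketch of what that abstraction can look like (the class and method names are illustrative, not our exact interface):

  from abc import ABC, abstractmethod
  from dataclasses import dataclass


  @dataclass
  class CompletionResult:
      text: str
      input_tokens: int
      output_tokens: int
      cost_usd: float


  class ProviderError(Exception):
      """Raised for rate limits, timeouts, and malformed responses alike,
      so callers can treat every provider failure uniformly."""


  class LLMProvider(ABC):
      """Common interface so the pipeline never depends on a specific vendor SDK."""

      name: str

      @abstractmethod
      def complete(self, prompt: str, *, timeout_s: float = 30.0) -> CompletionResult:
          """Send a prompt and return a normalized result, or raise ProviderError."""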

Circuit Breaker Pattern: Each provider has a circuit breaker that monitors failure rates. When a provider starts failing, the circuit breaker "opens" and routes traffic elsewhere.

This prevents cascading failures. If OpenAI is having issues, we automatically fail over to Anthropic without manual intervention.
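
A simplified sketch of the per-provider circuit breaker; the state handling is trimmed down, and the default thresholds mirror the tuned values discussed under Lessons Learned:

  import time


  class CircuitBreaker:
      """Tracks consecutive failures for one provider and opens after a threshold."""

      def __init__(self, failure_threshold: int = 10, recovery_timeout_s: float = 60.0):
          self.failure_threshold = failure_threshold
          self.recovery_timeout_s = recovery_timeout_s
          self.failures = 0
          self.opened_at: float | None = None

      def allow_request(self) -> bool:
          if self.opened_at is None:
              return True   # closed: traffic flows normally
          if time.monotonic() - self.opened_at >= self.recovery_timeout_s:
              return True   # half-open: let a probe request through
          return False      # open: route traffic to another provider

      def record_success(self) -> None:
          self.failures = 0
          self.opened_at = None

      def record_failure(self) -> None:
          self.failures += 1
          if self.failures >= self.failure_threshold:
              self.opened_at = time.monotonic()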

Dynamic Routing

Provider priority is dynamic, based on the following factors (a routing sketch follows the list):

  • Cost: Cheaper providers preferred for non-critical tasks
  • Latency: Real-time processing uses fastest available
  • Capability: Some tasks require specific model capabilities
  • Health: Circuit breaker state influences routing
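
A rough sketch of how a router can combine those signals; the ProviderStats fields and the selection rule are illustrative assumptions, not our exact scoring logic:

  from dataclasses import dataclass


  @dataclass
  class ProviderStats:
      name: str
      cost_per_1k_tokens: float   # rolling average, USD
      p95_latency_s: float        # rolling window
      supports_task: bool         # capability check for this task type
      circuit_open: bool          # current circuit breaker state


  def pick_provider(candidates: list[ProviderStats], latency_sensitive: bool) -> ProviderStats:
      """Filter out unhealthy or incapable providers, then rank the rest."""
      eligible = [p for p in candidates if p.supports_task and not p.circuit_open]
      if not eligible:
          raise RuntimeError("no healthy provider available for this task")
      # Latency-sensitive tasks rank by speed; everything else ranks by cost.
      key = (lambda p: p.p95_latency_s) if latency_sensitive else (lambda p: p.cost_per_1k_tokens)
      return min(eligible, key=key)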

Exactly-Once Processing

Document processing must happen exactly once. Duplicate processing means duplicate downstream effects (emails, notifications, CRM updates).

The Outbox Pattern

We use the transactional outbox pattern: process the document and write the result in a single database transaction. A separate worker reads from the outbox and publishes events to RabbitMQ.

This guarantees that if processing succeeds, the event will eventually be published. If the system crashes mid-processing, the transaction rolls back and nothing happens.

Key benefit: Atomic operations without distributed transactions.
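
A condensed sketch of the outbox write, assuming a psycopg2 connection and illustrative table names; the relay worker that reads outbox rows and publishes them to RabbitMQ is omitted:

  import json


  def save_result_with_outbox(conn, document_id: str, extraction: dict) -> None:
      """Persist the processing result and its outbox event in one transaction:
      either both rows commit or neither does."""
      with conn:  # psycopg2 commits on success, rolls back on exception
          with conn.cursor() as cur:
              cur.execute(
                  "INSERT INTO extraction_results (document_id, payload) VALUES (%s, %s)",
                  (document_id, json.dumps(extraction)),
              )
              cur.execute(
                  "INSERT INTO outbox (aggregate_id, event_type, payload) VALUES (%s, %s, %s)",
                  (document_id, "document.processed", json.dumps(extraction)),
              )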

Idempotent Consumers

Despite the outbox pattern, consumers can still receive duplicate messages (broker redeliveries, network retries). Every consumer therefore tracks processed message IDs to detect and ignore duplicates.
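
A minimal sketch of consumer-side deduplication, assuming PostgreSQL and illustrative table and helper names:

  def handle_message(conn, message_id: str, body: dict) -> None:
      """Record the message ID and apply side effects in one transaction,
      so a redelivered message becomes a no-op."""
      with conn:  # psycopg2-style transaction scope
          with conn.cursor() as cur:
              cur.execute(
                  "INSERT INTO processed_messages (message_id) VALUES (%s) "
                  "ON CONFLICT (message_id) DO NOTHING",
                  (message_id,),
              )
              if cur.rowcount == 0:
                  return  # duplicate: already handled, acknowledge and drop
              apply_side_effects(body)  # hypothetical helper: emails, CRM updates, etc.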

RAG for Document Processing

We use Retrieval-Augmented Generation for processing documents:

  1. Chunk the document into semantically meaningful sections
  2. Generate embeddings for each chunk
  3. Store in vector database for efficient retrieval
  4. Retrieve relevant chunks when building extraction prompts
  5. Generate structured extraction with LLM using relevant context
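
A compressed sketch of the retrieval and prompt-assembly steps; the embed argument stands in for whatever embedding call is used, and in production the chunk embeddings are precomputed and stored in the vector database rather than recomputed per query:

  import numpy as np


  def retrieve_relevant_chunks(query: str, chunks: list[str], embed, top_k: int = 5) -> list[str]:
      """Rank chunks by cosine similarity to the query and keep the best few."""
      query_vec = np.asarray(embed(query))
      scored = []
      for chunk in chunks:
          vec = np.asarray(embed(chunk))
          sim = float(query_vec @ vec / (np.linalg.norm(query_vec) * np.linalg.norm(vec)))
          scored.append((sim, chunk))
      return [chunk for _, chunk in sorted(scored, reverse=True)[:top_k]]


  def build_extraction_prompt(task: str, relevant_chunks: list[str]) -> str:
      """Assemble the extraction prompt from only the retrieved context."""
      context = "\n\n".join(relevant_chunks)
      return f"Using only the context below, {task}\n\nContext:\n{context}"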

This approach improves accuracy significantly compared to naive "send entire document to LLM" approaches.

Validation with Pydantic

LLM outputs are unreliable. We validate everything with Pydantic schemas. If an LLM returns invalid data (wrong format, missing fields, out-of-range values), we retry with error feedback.

This validation reduced manual review time by 90%.
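
A small sketch of schema validation with error feedback on retry, assuming Pydantic v2; the schema fields and the call_llm helper are illustrative:

  from pydantic import BaseModel, Field, ValidationError


  class QuoteExtraction(BaseModel):
      supplier_name: str
      total_amount: float = Field(gt=0)
      currency: str = Field(pattern=r"^[A-Z]{3}$")
      line_item_count: int = Field(ge=0)


  def extract_with_validation(prompt: str, call_llm, max_attempts: int = 3) -> QuoteExtraction:
      """Validate the LLM's JSON against the schema; on failure, retry with the errors appended."""
      current_prompt = prompt
      for _ in range(max_attempts):
          raw = call_llm(current_prompt)  # hypothetical helper returning a JSON string
          try:
              return QuoteExtraction.model_validate_json(raw)
          except ValidationError as err:
              # Feed the validation errors back so the model can correct itself.
              current_prompt = (
                  f"{prompt}\n\nYour previous answer failed validation:\n{err}\n"
                  "Return corrected JSON only."
              )
      raise ValueError("LLM output failed validation after retries")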

Observability

We track:

Per-Request Metrics:

  • Provider used
  • Latency
  • Token counts (input/output)
  • Cost
  • Success/failure
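
A sketch of the structured record emitted for each LLM call (field names are illustrative):

  import json
  import logging
  from dataclasses import asdict, dataclass

  logger = logging.getLogger("llm.metrics")


  @dataclass
  class RequestMetrics:
      provider: str
      latency_ms: float
      input_tokens: int
      output_tokens: int
      cost_usd: float
      success: bool


  def emit(metrics: RequestMetrics) -> None:
      # One structured log line per request; dashboards and alerts aggregate from these.
      logger.info(json.dumps(asdict(metrics)))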

System Health:

  • Circuit breaker states
  • Queue depths
  • Processing latency percentiles
  • Error rates by type

Business Metrics:

  • Documents processed per hour
  • Extraction accuracy (sampled)
  • Manual review rate

Results

After six months:

  • Uptime: 99.92% (exceeded target)
  • Latency: P99 under 10 seconds
  • Cost: 40% reduction through smart routing
  • Manual Review: Reduced from 60% to 6% of documents

Lessons Learned

LLMs Fail Differently Than Traditional APIs: Rate limits vary by time of day. Latency can spike from 100ms to 30s without it being a "failure". Had to rethink what constitutes a timeout vs normal operation.

Validation Saved Us From Disasters: Early versions trusted LLM outputs. Bad idea. LLMs confidently return invalid JSON, wrong data types, hallucinated values. Pydantic schemas with strict validation caught these before they reached production.

Circuit Breakers Need Tuning: Default settings (5 failures, 30s timeout) didn't work. LLM APIs need higher thresholds (10 failures) and longer recovery (60s). Each provider needs different settings.

Multi-Provider Complexity Is Worth It: Building abstraction for 3 providers felt like overkill initially. But when OpenAI had a 4-hour outage, we automatically failed over to Anthropic. Zero downtime. Abstraction paid for itself on day one of that incident.

Cost Optimization Requires Routing Intelligence: Running all requests through GPT-4 would have cost 10x more. Routing simple extractions to cheaper models while reserving expensive models for complex tasks reduced costs by 40%.

Exactly-Once Processing Is Hard But Essential: First implementation had race conditions causing duplicate processing. Transactional outbox pattern was more complex but eliminated duplicate downstream effects (double emails, double CRM entries).

Observability From Day One: Built comprehensive metrics and logging before scale. When issues appeared at 1000 documents/hour, we had the data to debug. Would have been impossible to add after the fact.

RAG Accuracy vs Latency Trade-offs: More context chunks improved accuracy but increased latency and cost. Found sweet spot at 3-5 relevant chunks. Diminishing returns beyond that.

Technologies

  • Python (backend services)
  • NestJS (API layer)
  • PostgreSQL (data + outbox)
  • RabbitMQ (message queue)
  • LangChain (LLM orchestration)
  • Multiple LLM providers (OpenAI, Anthropic, Google)

This project reinforced that building reliable AI systems requires treating LLMs as unreliable infrastructure and adding appropriate resilience patterns. The patterns that work for traditional APIs (retry, timeout, circuit breaker) need adjustment for the unique failure modes of AI services.