Engineering · Jan 2026 · 12 min read

Building Resilient Jobs for Unreliable APIs

Standard queues handle the happy path beautifully, but fail when dealing with unreliable APIs. Learn how to build a resilient job system using a database and state machines.


We had a queue. A good one. RabbitMQ, properly configured, durable queues, the works. Push a "send email" message, a worker picks it up, calls Microsoft Graph API, done. Textbook stuff.

Except sometimes Microsoft would return a 429. Sometimes our OAuth token expired mid-flight. Sometimes the worker crashed right after sending but before acknowledging the message. And sometimes our RabbitMQ node itself would go down for maintenance.

The queue wasn't the problem. The queue was doing exactly what queues do. The problem was that we were treating an unreliable operation like a reliable one.


Can't queues handle this?

Before I go further, let me address the obvious question: can't you just configure your queue to handle retries, backoff, and rate limiting?

Yes. You can.

RabbitMQ has dead letter exchanges with TTL for delayed retries. BullMQ has built-in exponential backoff. SQS has visibility timeouts and redrive policies. You can implement rate limiting on the consumer side, check circuit breaker state before processing, use separate queues per resource.

So why didn't we?

A few reasons specific to our situation:

We needed to query jobs. "Show me all failed email jobs for this user in the last hour." "How many jobs are stuck on this OAuth connection?" With a database, that's a SQL query. With a queue, messages are opaque until consumed. You'd need to build a separate tracking system.

We needed to modify jobs after enqueueing. Cancel a scheduled email because the user deleted it. Update priority because something became urgent. Change the payload because upstream data changed. With queues, once a message is in, it's in. With a database, it's just an update.

We had very long delays. Some retry scenarios meant waiting days before trying again. Queues with multi-day delays get awkward. A runAt timestamp in a database is trivial.

We wanted one place to look. No separate monitoring for queue depth, no message broker to manage, no RabbitMQ dashboard plus application logs plus database queries to piece together what happened. Everything in one table.

The tradeoff is real. Database polling is slower than queue consumption. At very high throughput, you'd feel it. But for external API integrations, the API itself is usually the bottleneck, not your job processing speed. We weren't trying to process 10,000 jobs per second. We were trying to process a few hundred per minute reliably, with full visibility into what was happening.

If you're doing high-throughput internal microservice communication, use a queue. That's what they're built for. If you're integrating with external APIs where you need observability, modifiability, and long delays, a database-backed approach might be simpler.

Now, here's how we built it.


The database is your friend

Instead of treating jobs as ephemeral messages that exist only in transit, we persist them. Every job becomes a row in a table with a status, an attempt count, a scheduled execution time, and a resourceKey that identifies which external resource it's talking to.

model ResilientJob {
  id           String    @id @default(uuid())
  type         String    // 'send-email', 'sync-data', whatever
  resourceKey  String    // The specific API account/connection this job uses
  payload      Json
  
  status       JobStatus @default(PENDING)
  attempts     Int       @default(0)
  maxAttempts  Int       @default(3)
  runAt        DateTime  @default(now())
  lockedAt     DateTime? // When a worker claimed this job
  errorMessage String?
  createdAt    DateTime  @default(now())

  @@index([status, runAt])
}

That resourceKey field is doing more work than it appears to. It's the key to everything that follows.

When you're sending emails through someone's connected Outlook account, you're not just hitting "Microsoft's API". You're hitting it through a specific OAuth connection with its own rate limits, its own token expiry, its own health. If User A's connection is broken, that shouldn't block User B's emails. If one API endpoint is rate-limited, jobs targeting other endpoints should keep flowing.

We track the health of each resource separately:

model ResourceRateLimit {
  resourceKey         String    @id
  nextAvailableAt     DateTime  @default(now())
  consecutiveFailures Int       @default(0)
  lastErrorAt         DateTime?
}

This gives us per-resource rate limiting and per-resource circuit breaking. The granularity matters.
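
As for what actually goes in resourceKey: it's just a string convention identifying the connection a job runs through. Ours looked roughly like the sketch below; the helper name and key format are illustrative, not the exact code we shipped.

function buildResourceKey(provider: string, connectionId: string): string {
  // One key per provider + OAuth connection, so rate limits and circuit
  // breaker state are tracked per connection rather than per API.
  return `${provider}:${connectionId}`;
}

// Enqueueing an email job against a specific connection:
await db.resilientJob.create({
  data: {
    type: 'send-email',
    resourceKey: buildResourceKey('outlook', connection.id),
    payload: { accountId: connection.accountId, email: draft },
  },
});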


The processing loop

The main processing loop runs on a simple cron. Every few seconds, it wakes up and looks for work. The interesting bits are in how it decides what to do.

const LOCK_TIMEOUT_MS = 5 * 60 * 1000; // 5 minutes
const MAX_ATTEMPTS = 3;

while (true) {
  const now = new Date();
  const staleThreshold = new Date(now.getTime() - LOCK_TIMEOUT_MS);

  // Find the next job that's ready to run
  const job = await db.resilientJob.findFirst({
    where: {
      OR: [
        // Ready to run
        {
          runAt: { lte: now },
          status: { in: ['PENDING', 'FAILED'] },
          attempts: { lt: MAX_ATTEMPTS }
        },
        // Stale lock - worker probably crashed
        {
          status: 'RUNNING',
          lockedAt: { lt: staleThreshold }
        }
      ]
    },
    orderBy: { runAt: 'asc' }
  });

  if (!job) break;

  // Optimistic lock - try to claim this job. Matching both the status and
  // the lockedAt we read means only one worker wins the race, even when
  // reclaiming a stale RUNNING job.
  const claimed = await db.resilientJob.updateMany({
    where: { id: job.id, status: job.status, lockedAt: job.lockedAt },
    data: { status: 'RUNNING', lockedAt: now }
  });

  if (claimed.count === 0) continue; // Someone else got it

  // Check if this resource is healthy before we even try
  const healthy = await checkResourceHealth(job.resourceKey);
  if (!healthy) {
    await deferJob(job.id, getCircuitBreakerDelay());
    continue;
  }

  try {
    await executeJob(job);
    await markComplete(job);
    await resetResourceHealth(job.resourceKey);
  } catch (error) {
    await handleFailure(job, error);
  }
}

Big Word Alert: Optimistic Lock - Instead of locking the row upfront and making everyone wait, you just try to update it with a condition: "Set status to RUNNING, but only if status is still PENDING." If another worker already changed it, your update affects zero rows. You know you lost the race and move on. No distributed locks needed. No Redis. Just SQL doing what SQL does well.

(I wrote a deep dive on this here: Choosing the Right Lock)

The lockedAt timestamp handles a failure mode that's easy to miss: what if a worker crashes mid-execution? Without it, the job stays in RUNNING forever. With it, we can reclaim jobs where the lock has gone stale. If a job has been RUNNING for more than 5 minutes, something probably went wrong and another worker can pick it up.
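
For completeness, the helpers the loop calls are thin wrappers around single-row updates. These sketches are roughly what ours looked like, assuming the schema above (the COMPLETED status value is an assumption; use whatever terminal state your JobStatus enum defines):

// Push a job into the future without consuming an attempt. Used when the
// resource is rate-limited or its circuit is open.
async function deferJob(jobId: string, delayMs: number): Promise<void> {
  await db.resilientJob.update({
    where: { id: jobId },
    data: {
      status: 'PENDING',
      lockedAt: null,
      runAt: new Date(Date.now() + delayMs),
    },
  });
}

// Mark a job as done and release its lock.
async function markComplete(job: { id: string }): Promise<void> {
  await db.resilientJob.update({
    where: { id: job.id },
    data: { status: 'COMPLETED', lockedAt: null },
  });
}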


Aside on horizontal scaling: This pattern works cleanly across multiple instances because the database is the coordination point. Each worker is stateless. It just polls, claims, executes, updates. You can run ten of them without any special configuration. The only thing to watch is the polling interval: too aggressive and you'll burn database connections; too slow and you'll add latency. Every few seconds is usually the sweet spot.
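
To make the "simple cron" part concrete, here's a minimal polling wrapper, assuming the while-loop above lives in a processJobs() function (the name and interval are illustrative):

const POLL_INTERVAL_MS = 5_000;

async function runWorker(): Promise<void> {
  // Run the claim-and-execute loop, then sleep. processJobs() exits on its
  // own when no job is ready, so each tick does a bounded amount of work.
  while (true) {
    try {
      await processJobs();
    } catch (err) {
      console.error('Job loop failed, will retry on next tick', err);
    }
    await new Promise((resolve) => setTimeout(resolve, POLL_INTERVAL_MS));
  }
}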


Getting backoff right

The naive approach to retry is exponential backoff. Wait 1 second, then 2, then 4, then 8. It's fine. It's also insufficient for real-world API integrations.

Different failures need different responses. A 500 error probably means "try again soon, the server is having a moment." A 429 means "you're going too fast, here's exactly how long to wait." A 401 means "your credentials are wrong", but in OAuth land, that might just mean your token expired and you need to refresh it.

function computeBackoff(error: any, attempt: number): number {
  const status = error.response?.status;
  const headers = error.response?.headers;

  // If they told us exactly when to retry, listen to them
  if (status === 429) {
    const retryAfter = parseRetryAfterHeader(headers);
    if (retryAfter) {
      return retryAfter * 1000 * 1.2; // Add 20% buffer
    }
  }

  // Standard exponential backoff with jitter
  const base = Math.min(2 ** (attempt - 1), 60);
  const jitter = 0.8 + Math.random() * 0.4;
  return base * 1000 * jitter;
}

That jitter isn't optional. Without it, if you have 50 jobs that all failed at the same moment, they'll all retry at the same moment, fail again, and retry together again.


Big Word Alert: Thundering Herd - When a bunch of requests hit a service at exactly the same time. Usually happens when a service recovers from an outage and all the waiting clients rush in at once. Or when your retry logic doesn't have randomness and all failed jobs retry on the same schedule. The random jitter spreads retries across a window instead of a single instant.

The Retry-After header parsing is also worth getting right. Microsoft's Graph API uses it. So does Google. When they tell you to wait 47 seconds, they mean it. Ignoring it and hammering them again in 4 seconds just gets you blocked faster.

function parseRetryAfterHeader(headers: Record<string, string>): number | null {
  // Standard header (seconds or HTTP date)
  const retryAfter = headers['retry-after'];
  if (retryAfter) {
    const seconds = Number(retryAfter);
    if (!isNaN(seconds)) return seconds;
    
    const date = Date.parse(retryAfter);
    if (!isNaN(date)) return Math.max(0, (date - Date.now()) / 1000);
  }

  // Microsoft-specific (milliseconds)
  const msHeader = headers['x-ms-retry-after-ms'];
  if (msHeader) {
    const ms = Number(msHeader);
    if (!isNaN(ms)) return ms / 1000;
  }

  return null;
}

Circuit breakers

The circuit breaker pattern gets talked about a lot but implemented poorly almost as often.

The idea is simple: if a resource keeps failing, stop trying for a while. Don't burn through your retry budget, don't spam a broken endpoint, don't waste everyone's time.

async function checkResourceHealth(resourceKey: string): Promise<boolean> {
  const limit = await db.resourceRateLimit.findUnique({
    where: { resourceKey }
  });

  if (!limit) return true;

  // Circuit breaker: too many consecutive failures
  if (limit.consecutiveFailures >= FAILURE_THRESHOLD) {
    if (limit.nextAvailableAt > new Date()) {
      return false; // Circuit is open, don't even try
    }
    // Circuit half-open: allow one attempt to test
  }

  // Rate limiting: respect the cooldown
  return limit.nextAvailableAt <= new Date();
}

When a job fails, we increment the consecutive failure count. When it succeeds, we reset it to zero. Once failures hit a threshold (we used 3), the circuit opens and all jobs for that resource get automatically deferred by a longer period (we used 5 minutes).

This protects you and the API provider. If someone's OAuth connection is broken, you're not going to fix it by retrying every 10 seconds. You're just going to rack up rate limit violations and make your logs unreadable.
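
Wiring that together, handleFailure is where the backoff calculation and the failure counter meet, and resetResourceHealth is its mirror image on success. A simplified sketch, assuming the models above (threshold of 3 and a 5-minute open circuit, as mentioned):

const FAILURE_THRESHOLD = 3;
const CIRCUIT_OPEN_MS = 5 * 60 * 1000; // 5 minutes

async function handleFailure(job: ResilientJob, error: unknown): Promise<void> {
  const attempts = job.attempts + 1;

  // Record the outcome on the job; the loop's attempts check stops it from
  // being retried once maxAttempts is reached.
  await db.resilientJob.update({
    where: { id: job.id },
    data: {
      status: 'FAILED',
      attempts,
      lockedAt: null,
      runAt: new Date(Date.now() + computeBackoff(error, attempts)),
      errorMessage: error instanceof Error ? error.message : String(error),
    },
  });

  // Bump the per-resource failure count. Once it crosses the threshold,
  // checkResourceHealth() treats the circuit as open until nextAvailableAt.
  const health = await db.resourceRateLimit.upsert({
    where: { resourceKey: job.resourceKey },
    create: { resourceKey: job.resourceKey, consecutiveFailures: 1, lastErrorAt: new Date() },
    update: { consecutiveFailures: { increment: 1 }, lastErrorAt: new Date() },
  });

  if (health.consecutiveFailures >= FAILURE_THRESHOLD) {
    await db.resourceRateLimit.update({
      where: { resourceKey: job.resourceKey },
      data: { nextAvailableAt: new Date(Date.now() + CIRCUIT_OPEN_MS) },
    });
  }
}

// On success, close the circuit again.
async function resetResourceHealth(resourceKey: string): Promise<void> {
  await db.resourceRateLimit.updateMany({
    where: { resourceKey },
    data: { consecutiveFailures: 0 },
  });
}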


Handling 401s in OAuth land

There's one failure mode that deserves special handling: the 401.

In most contexts, a 401 Unauthorized means "you don't have permission, go away." In OAuth contexts, it often just means "your access token expired." And if you have a refresh token, you can fix it.

async function executeEmailJob(payload: EmailPayload): Promise<void> {
  const account = await db.getAccount(payload.accountId);

  try {
    await emailProvider.send(account.accessToken, payload.email);
  } catch (error) {
    if (error.response?.status === 401) {
      // Token expired - try to refresh
      const newTokens = await emailProvider.refreshToken(account.refreshToken);
      
      // Persist new tokens for future jobs
      await db.updateAccountTokens(account.id, newTokens);
      
      // Retry immediately with fresh token
      await emailProvider.send(newTokens.accessToken, payload.email);
      return;
    }
    
    throw error;
  }
}

This is self-healing code. The token expired, we fixed it, the job succeeded, the user never knew anything was wrong. This is the kind of resilience that actually matters. Not just "we'll try again later" but "we'll try to solve the problem right now."


Making other infrastructure reliable

One pattern that emerged later was using this system to make other infrastructure more reliable.

We use RabbitMQ for event broadcasting. When an email comes in, we publish an event so other services can react. But RabbitMQ can go down. Network partitions happen. What do you do?

You don't publish directly. You write the event to your database in the same transaction as your business logic, then create a resilient job to publish it.

async function handleIncomingEmail(email: Email): Promise<void> {
  // Both operations in same transaction
  await db.$transaction(async (tx) => {
    // 1. Save to database (source of truth)
    const saved = await tx.receivedEmail.create({ data: email });

    // 2. Create a job to publish to RabbitMQ
    await tx.resilientJob.create({
      data: {
        type: 'publish-email-event',
        resourceKey: 'rabbitmq-main',
        payload: { emailId: saved.id }
      }
    });
  });
}

// The job processor:
async function publishEmailEvent(payload: { emailId: string }): Promise<void> {
  const email = await db.receivedEmail.findUnique({
    where: { id: payload.emailId }
  });

  // Idempotency check (also guards against the email row having been deleted)
  if (!email || email.isPublished) return;

  await rabbitMQ.publish('email-events', transformToEvent(email));

  await db.receivedEmail.update({
    where: { id: email.id },
    data: { isPublished: true }
  });
}

If RabbitMQ is down, the job fails and retries later. If the job succeeds but we crash before updating the database, the idempotency check prevents duplicate publishes. The email is never lost.


Aside: Transactional Outbox Pattern - This is a known pattern. The database is your source of truth and your buffer. The message broker is just a notification mechanism. The key is that the job creation happens in the same transaction as the business logic. Either both succeed or both fail. You never have a state where the email was saved but the job wasn't created.


When to use what

Queues are optimized for throughput. They want to move messages as fast as possible. That's good for many use cases. But when you're dealing with external APIs that have their own opinions about how fast you should go, throughput isn't the goal. Reliability is.

A database-backed job system is slower than a pure message queue. You're writing to disk, you're polling, you're doing SQL queries. But you gain observability (you can query your jobs), durability (nothing is lost), modifiability (you can change jobs after creation), and intelligence (you can make decisions based on history).
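
To make the observability point concrete: the "failed email jobs for this user in the last hour" question from earlier really is one query. Something like this sketch, assuming the payload carries a userId (ours did; yours may not):

// "Show me all failed email jobs for this user in the last hour."
const failedEmailJobs = await db.resilientJob.findMany({
  where: {
    type: 'send-email',
    status: 'FAILED',
    createdAt: { gte: new Date(Date.now() - 60 * 60 * 1000) },
    payload: { path: ['userId'], equals: userId }, // Postgres-style JSON filter
  },
  orderBy: { createdAt: 'desc' },
});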

Use queues when:

  • High throughput internal communication
  • Simple fire-and-forget messages
  • You don't need to query or modify messages after enqueueing
  • You have existing queue infrastructure and expertise

Consider database-backed jobs when:

  • External API integrations with rate limits and auth flows
  • You need to query job status, history, and failures
  • You need to cancel or modify jobs after creation
  • Long delays (hours, days) between attempts
  • Observability is more important than raw speed

After we deployed this, the difference was obvious. Jobs that failed were automatically retried with appropriate backoff. OAuth tokens that expired overnight were refreshed transparently. When a user's connection had a persistent issue, the circuit breaker opened, their jobs were deferred, and there was a clean log entry explaining exactly what was wrong.

It wasn't magic. It was just software that remembered what happened and made reasonable decisions about what to do next.

That's all resilience really is.