The Unreasonable Effectiveness of SKIP LOCKED in PostgreSQL

Using SKIP LOCKED to distribute work across machines in a distributed system is a popular pattern. We explore why, the alternatives, and a real world example from our own codebase.

Nadeesha Cabral on 17-12-2024

When building distributed systems that need to process jobs or tasks concurrently, one of the most challenging problems is ensuring that work is distributed efficiently without duplication. PostgreSQL's SELECT FOR UPDATE SKIP LOCKED feature provides an elegant and performant solution to this problem that almost feels too good to be true. We're big believers in the just use postgres ethos, as our source code might tell you.

The Problem: Distributed Job Processing

Imagine you have a queue of jobs that need to be processed by multiple workers. The core challenge is: how do you ensure that each job is processed exactly once, while maximizing throughput and minimizing contention?

The naive approach might look something like this:

BEGIN;
SELECT * FROM jobs WHERE status = 'pending' LIMIT 1;
-- Process job
UPDATE jobs SET status = 'completed' WHERE id = ?;
COMMIT;

This has an obvious race condition - two workers could select the same job before either marks it as completed. Here's what that looks like:

sequenceDiagram
    participant W1 as Worker 1
    participant DB as Database
    participant W2 as Worker 2

    W1->>DB: SELECT pending job
    W2->>DB: SELECT pending job
    Note over W1,W2: Both workers get same job
    W1->>DB: UPDATE job as completed
    W2->>DB: UPDATE job as completed
    Note over W1,W2: Job processed twice!

You might try to fix this with SELECT FOR UPDATE:

BEGIN;
SELECT * FROM jobs WHERE status = 'pending' FOR UPDATE LIMIT 1;
-- Process job
UPDATE jobs SET status = 'completed' WHERE id = ?;
COMMIT;

With FOR UPDATE, the behavior changes but creates a bottleneck:

sequenceDiagram
    participant W1 as Worker 1
    participant DB as Database
    participant W2 as Worker 2

    W1->>DB: SELECT FOR UPDATE
    Note over DB: Row locked
    W2->>DB: SELECT FOR UPDATE
    Note over W2: Waits for lock...
    W1->>DB: UPDATE completed
    Note over DB: Lock released
    W2-->>DB: Finally gets lock
    W2->>DB: UPDATE completed

Enter SKIP LOCKED

This is where SKIP LOCKED comes in. By adding this clause, we tell PostgreSQL to skip any rows that are already locked by other transactions, rather than waiting for the locks to be released:

BEGIN;
SELECT * FROM jobs WHERE status = 'pending' FOR UPDATE SKIP LOCKED LIMIT 1;
-- Process job
UPDATE jobs SET status = 'completed' WHERE id = ?;
COMMIT;

Here's how SKIP LOCKED allows workers to process jobs concurrently:

sequenceDiagram
    participant W1 as Worker 1
    participant DB as Database
    participant W2 as Worker 2

    W1->>DB: SELECT FOR UPDATE SKIP LOCKED
    Note over DB: Job 1 locked
    W2->>DB: SELECT FOR UPDATE SKIP LOCKED
    Note over DB: Gets different job
    par Process in parallel
        W1->>DB: UPDATE Job 1 completed
    and
        W2->>DB: UPDATE Job 2 completed
    end

Now, transactions can claim jobs without ever blocking each other. This works for us because our goal is to get through the queue as quickly as possible.

Real World Usage

Postgres-based Job Queues

Solid Queue by 37signals showcases the power of this pattern. Solid Queue is a PostgreSQL-backed job queue for Ruby on Rails that relies heavily on SKIP LOCKED to handle job distribution.

In fact many other postgres backed job queues use this pattern, including PG Boss.

Inferable's Job Processing System

At Inferable, we use SKIP LOCKED in our job processing system to distribute work across machines that long poll the control plane.

When the agent runtime determines that an action needs to be peformed (a function/tool call), it will insert a new job into the jobs table with a status of pending. Then we wait until a suitable machine (worker) polls the jobs table for a job. Since at a given point, there could be hundreds of mahines competing for thousands of jobs concurrently, we use SKIP LOCKED to ensure that we don't block each other.

Here's a snippet from our codebase that shows how we claim jobs:

const results = await db.execute(sql`
  UPDATE jobs SET 
    status = 'running',
    remaining_attempts = remaining_attempts - 1,
    last_retrieved_at = now(),
    executing_machine_id = ${machineId}
  WHERE id IN (
    SELECT id FROM jobs 
    WHERE status = 'pending' 
      AND cluster_id = ${clusterId}
      AND service = ${service}
    LIMIT ${limit}
    FOR UPDATE SKIP LOCKED
  )
  RETURNING id, target_fn, target_args...`);

Technical Details

When using SELECT FOR UPDATE SKIP LOCKED, there are a few important technical considerations from the PostgreSQL documentation:

The SKIP LOCKED clause only applies to row-level locks - the required ROW SHARE table-level lock is still taken normally.
Using SKIP LOCKED provides an inconsistent view of the data by design. This is why it's perfect for queue-like tables where we want to distribute work, but not suitable for general purpose work where consistency is required.
SKIP LOCKED can be combined with LIMIT to grab batches of jobs. The locking will stop once enough rows have been returned to satisfy the limit. In fact, this is how we implement our long polling logic.