Back to Basics: Why We Chose Long Polling Over WebSockets

Learn how we implemented real-time updates using Node.js, TypeScript, and PostgreSQL with HTTP long polling. A practical guide to building scalable real-time systems without WebSockets.

Nadeesha Cabral

Like many teams building real-time systems with Node.js and TypeScript, we've been exploring ways to handle real-time updates at scale. Our system handles hundreds of worker nodes constantly polling our PostgreSQL-backed control plane for new jobs (tool calls issued by agents), while agents themselves continuously poll for execution and chat state updates. What started as an exploration into WebSockets led us to a surprisingly effective "old-school" solution: HTTP long polling with Postgres.

The Challenge: Real-time Updates at Scale

Our Node.js/TypeScript backend faced two main challenges:

  1. Worker Node Updates: Hundreds of worker nodes running our Node.js / Golang / C# SDKs needed to know about new jobs as soon as they were available, requiring a querying strategy that didn't bring down our Postgres database.
  2. Agent State Synchronization: Agents required real-time updates about execution and chat state, which we needed to stream efficiently.

Long Polling vs WebSockets: A Refresher

How Long Polling Works

sequenceDiagram
    participant Client
    participant Server
    participant Database
    
    Client->>Server: Request new data
    
    alt Data available immediately
        Server->>Database: Check for data
        Database-->>Server: Return data
        Server-->>Client: Return response immediately
    else No data available
        Server->>Database: Check for data
        Database-->>Server: No data
        Note over Server: Hold connection
        loop Check periodically
            Server->>Database: Poll for new data
            Database-->>Server: New data arrives
        end
        Server-->>Client: Return response
    end
    Client->>Server: Next request begins

The key difference between approaches can be understood with a simple train analogy:

Short polling is like a train that departs strictly according to a timetable - it leaves the station at fixed intervals regardless of whether there are passengers or not. WebSockets, on the other hand, are like having a dedicated train line always ready to transport passengers.

Long polling? It's like a train that waits at the station until at least one passenger boards before departing. If no passengers show up within a certain time (the TTL), it departs empty. This approach gives you the best of both worlds - immediate departure when there's data (passengers) and efficient resource usage when there isn't.

In technical terms:

  1. With short polling, the server responds immediately whether there's data or not
  2. With long polling, the server holds the connection open until either:
    • New data becomes available
    • A timeout is reached (TTL)
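
To make the contrast concrete, here's a minimal sketch of both patterns as Express-style route handlers. Express, the route paths, and the checkForData helper are assumptions for illustration only; our actual endpoints and data access look different (see the implementation below).

import express from "express";

const app = express();

// Hypothetical data source for this sketch; in practice this is a database query.
async function checkForData(id: string): Promise<unknown | null> {
  return null; // stub
}

// Short polling: respond immediately, whether or not there's data.
app.get("/short/:id", async (req, res) => {
  res.json({ data: await checkForData(req.params.id) }); // may well be { data: null }
});

// Long polling: hold the request open until data arrives or the TTL is reached.
app.get("/long/:id", async (req, res) => {
  const ttl = 25_000; // keep this under every timeout in the infrastructure chain
  const start = Date.now();

  while (Date.now() - start < ttl) {
    const data = await checkForData(req.params.id);
    if (data) {
      res.json({ data });
      return;
    }
    await new Promise(resolve => setTimeout(resolve, 500)); // pause between checks
  }

  res.status(204).end(); // TTL expired with nothing new; the client simply asks again
});

app.listen(3000);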

Our Implementation Deep Dive

Let's break down our Node.js implementation:

export const getJobStatusSync = async ({
  jobId,
  owner,
  ttl = 60_000,
}: {
  jobId: string;
  owner: { clusterId: string };
  ttl?: number;
}) => {
  let jobResult: {
    service: string;
    status: "pending" | "running" | "success" | "failure" | "stalled";
    result: string | null;
    resultType: ResultType | null;
  } | undefined;

  const start = Date.now();

The function accepts:

  • jobId: Unique identifier for the job we're tracking
  • owner.clusterId: Cluster identifier for multi-tenancy
  • ttl: Time-to-live in milliseconds (defaults to 60 seconds)

The Polling Loop

  do {
    const [job] = await data.db
      .select({
        service: data.jobs.service,
        status: data.jobs.status,
        result: data.jobs.result,
        resultType: data.jobs.result_type,
      })
      .from(data.jobs)
      .where(and(eq(data.jobs.id, jobId), eq(data.jobs.cluster_id, owner.clusterId)));

    if (!job) {
      throw new NotFoundError(`Job ${jobId} not found`);
    }

    if (job.status === "success" || job.status === "failure") {
      jobResult = job;
    } else {
      await new Promise(resolve => setTimeout(resolve, 500));
    }
  } while (!jobResult && Date.now() - start < ttl);

Key aspects:

  1. The loop continues until either:
    • We get a final status (success or failure)
    • We hit the TTL timeout
  2. We use a 500ms delay between checks to prevent hammering the database
  3. The database query is optimized with proper indexes on id and cluster_id

Error Handling and Response

  if (jobResult) {
    return jobResult;
  } else {
    throw new JobPollTimeoutError(`Call did not resolve within ${ttl}ms`);
  }

The function concludes by:

  1. Returning the job result if one was found
  2. Throwing a timeout error if the TTL was exceeded
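
Putting the pieces together, a caller simply awaits the whole long poll in one call. The ids below are illustrative; the error classes are the ones from the snippets above.

try {
  const job = await getJobStatusSync({
    jobId: "job_123",                    // illustrative id
    owner: { clusterId: "cluster_abc" }, // illustrative cluster
    ttl: 30_000,                         // hold the poll open for at most 30 seconds
  });
  console.log(job.status, job.result);
} catch (error) {
  if (error instanceof JobPollTimeoutError) {
    // No terminal state within the TTL; the client can simply issue another request.
  } else {
    throw error; // e.g. NotFoundError for an unknown job id
  }
}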

Database Optimization

For this pattern to work efficiently, the polling query needs the right Postgres indexes:

CREATE INDEX idx_jobs_lookup ON jobs(id, cluster_id);
CREATE INDEX idx_jobs_status ON jobs(status) WHERE status IN ('success', 'failure');

This ensures our frequent polling queries are fast and don't put unnecessary load on the database.

The Hidden Benefits of Long Polling

One of the most compelling aspects of long polling is what you don't have to build. Here's what we avoided:

Observability Remains Unchanged

One of the biggest wins is that we don't need to modify our observability stack for WebSockets. All our standard HTTP metrics just work out of the box, and our existing logging patterns do exactly what we need. There's no need to figure out new ways to monitor persistent connections or implement additional logging for WebSocket state.

Authentication Simplicity

We completely avoid the headache of implementing a new authentication mechanism for incoming WebSocket connections. We just keep using our standard HTTP authentication that we already have in place. All our existing security patterns continue to work exactly as they always have.

When we implemented WebSockets earlier, this became extremely gnarly because of the RBAC restrictions we had to honor. We had to be very careful about what data we pushed to each connected client, and about the privilege escalation that can happen when a client moves from one cluster to another.

Infrastructure Compatibility

Another worry was corporate firewalls blocking WebSocket connections. Some of our users sit behind restrictive firewalls, and we didn't want the IT headache of asking them to open up WebSockets.

With long polling, that's simply not our problem. We don't need any special proxy configuration or complex infrastructure setup. Our standard load balancer configuration works fine without modification, and the entire stack just keeps humming along as it always has.

Operational Simplicity

We never have to worry about server restarts dropping WebSocket connections. There's no connection state to manage or maintain. When something goes wrong (and something always goes wrong), it's much easier to debug and troubleshoot because we're just dealing with standard HTTP requests and responses.

We use Cloudflare at our edge, and our existing configuration rules and DDoS protection didn't need to change.

Client Implementation

The client-side code stays remarkably simple. It works with any HTTP client, no special WebSocket libraries needed. Even better, reconnection handling comes for free with basic retry logic. The entire client implementation can often be just a few lines of code.
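
As an illustration, here's roughly what that client loop can look like with plain fetch and basic retry logic. The endpoint, query parameter, and response shape are assumptions for the sketch, not our actual API.

interface JobResult {
  status: "success" | "failure";
  result: string | null;
}

// Minimal long-polling client: one request in flight at a time, retry on timeout or error.
async function waitForJob(jobId: string): Promise<JobResult> {
  while (true) {
    try {
      // The server holds this request open for up to its TTL before responding.
      const response = await fetch(`/jobs/${jobId}?waitMs=25000`);

      if (response.status === 200) {
        return (await response.json()) as JobResult; // terminal state reached
      }
      // 204 or a timeout response: nothing yet, immediately start the next poll.
    } catch {
      // Network hiccup: back off briefly before retrying.
      await new Promise(resolve => setTimeout(resolve, 1_000));
    }
  }
}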

Why Not ElectricSQL?

While exploring solutions, we looked at ElectricSQL, which synchronizes Postgres data to the frontend. They make an interesting case for long polling over WebSockets:

"Switching to an HTTP protocol may at first seem like a regression or a strange fit. Web sockets are built on top of HTTP specifically to serve the kind of realtime data stream that Electric provides. However, they are also more stateful and harder to cache."

In fact, we recommend ElectricSQL if you don't need extreme control or low-level constructs for handling real-time updates. It's a solid, battle-tested solution that handles many edge cases and provides a great developer experience.

Why We Chose Raw Long Polling

The message delivery mechanism is a core part of our product - it's not just an implementation detail, it's central to what we do. We can't afford to have something as fundamental as message delivery abstracted away in a third-party library, no matter how good that library might be.

Our specific use case required:

  1. Core Product Control: Full control over our message delivery mechanism - it's not just infrastructure, it's our product
  2. Zero External Dependencies: We needed our stack to be as simple as possible for self-hosting
  3. Close to the Metal: Direct control over the polling mechanism and connection handling
  4. Maximum Control: Ability to fine-tune every aspect of the implementation, including implementing dynamic polling intervals
  5. Simplicity: Making it easy for users to understand and modify the codebase

For us, staying close to the metal with a simple HTTP long polling implementation was the right choice. But if you don't need this level of control, ElectricSQL provides a more feature-rich solution that could save you significant development time.

Application Layer Best Practices

When implementing long polling, there are several critical practices to follow to ensure reliable operation:

Mandatory TTL Implementation

You must implement a Time-To-Live (TTL) for your HTTP connections. Without this, you'll inevitably run into connection reset errors. Your polling logic should always return within this TTL, no matter what.

const getJobStatus = async (jobId: string, ttl = 60_000) => {
  const start = Date.now();

  // Always bound the loop with the TTL so the request is guaranteed to return
  while (Date.now() - start < ttl) {
    const result = await checkJob(jobId); // e.g. the database lookup shown earlier
    if (result) return result;

    // Pause between checks (see "Sensible Database Polling Intervals" below)
    await new Promise(resolve => setTimeout(resolve, 500));
  }

  return null; // TTL exceeded
}

Client-Configurable TTL with Server Limits

While clients should be able to specify their desired TTL, the server must enforce a maximum limit:

const MAX_TTL = 120_000; // 2 minutes

const getJobStatus = async (jobId: string, clientTtl: number) => {
  const ttl = Math.min(clientTtl, MAX_TTL);
  // ... polling logic
}

Infrastructure-Aware TTL Settings

Your maximum TTL must stay under the minimum HTTP connection timeout across your entire infrastructure stack:

  • Application server timeouts
  • Client timeouts
  • Load balancer timeouts
  • Edge server timeouts
  • Proxy timeouts

For example, if your edge server has a 30-second timeout, your max TTL should be comfortably under this, say 25 seconds.
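
As a sketch with illustrative numbers, the server-side cap can be derived directly from the tightest timeout in that chain (the two-minute MAX_TTL above is only safe if every hop allows it):

// Illustrative numbers: the tightest timeout in the chain wins.
const EDGE_TIMEOUT_MS = 30_000;   // e.g. edge/proxy idle timeout
const SAFETY_MARGIN_MS = 5_000;   // headroom for the final query and response
const MAX_TTL = EDGE_TIMEOUT_MS - SAFETY_MARGIN_MS; // 25 seconds

const effectiveTtl = (clientTtl: number) => Math.min(clientTtl, MAX_TTL);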

Sensible Database Polling Intervals

As shown in our implementation, include a reasonable wait time between database polls. We use a 500ms interval:

await new Promise(resolve => setTimeout(resolve, 500));

This prevents hammering your database while still providing reasonably quick updates.

Optional: Exponential Backoff

While not implemented in our current system, you can implement exponential backoff for more efficient resource usage:

const getJobStatus = async (jobId: string, ttl = 60_000) => {
  const start = Date.now();
  let waitTime = 100; // Start with 100ms
  
  while (Date.now() - start < ttl) {
    const result = await checkJob(jobId);
    
    if (result) return result;
    
    // Exponential backoff with max of 2 seconds
    waitTime = Math.min(waitTime * 2, 2000);
    await new Promise(resolve => setTimeout(resolve, waitTime));
  }
  
  return null;
}

This approach means:

  • Active requests (those likely to get data soon) terminate quickly
  • Inactive requests gradually increase their polling interval
  • System resources are used more efficiently

A Case for WebSockets: The Other Side of the Story

While we've found long polling to be a great solution for our needs, it's not the only option. WebSockets are not inherently bad. They just require a lot of love and attention.

The challenges we mentioned aren't insurmountable - they just require proper engineering attention:

  • Observability: WebSockets are more stateful, so you need to implement additional logging and monitoring for persistent connections.

  • Authentication: You need to implement a new authentication mechanism for incoming WebSocket connections.

  • Infrastructure: You need to configure your infrastructure to support WebSockets, including load balancers and firewalls.

  • Operations: You need to manage WebSocket connections and reconnections, including handling connection timeouts and errors.

  • Client Implementation: You need a client-side WebSocket library and logic for reconnections and state management.