Back to Basics: Why We Chose Long Polling Over WebSockets
Learn how we implemented real-time updates using Node.js, TypeScript, and PostgreSQL with HTTP long polling. A practical guide to building scalable real-time systems without WebSockets.
Like many teams building real-time systems with Node.js and TypeScript, we've been exploring ways to handle real-time updates at scale. Our system handles hundreds of worker nodes constantly polling our PostgreSQL-backed control plane for new jobs (tool calls issued by agents), while agents themselves continuously poll for execution and chat state updates. What started as an exploration into WebSockets led us to a surprisingly effective "old-school" solution: HTTP long polling with Postgres.
The Challenge: Real-time Updates at Scale
Our Node.js/TypeScript backend faced two main challenges:
- Worker Node Updates: Hundreds of worker nodes running our Node.js / Golang / C# SDKs needed to know about new jobs as soon as they were available, requiring a polling strategy that wouldn't bring down our Postgres database.
- Agent State Synchronization: Agents required real-time updates about execution and chat state, which we needed to stream efficiently.
Long Polling vs WebSockets: A Refresher
How Long Polling Works
sequenceDiagram
    participant Client
    participant Server
    participant Database
    Client->>Server: Request new data
    alt Data available immediately
        Server->>Database: Check for data
        Database-->>Server: Return data
        Server-->>Client: Return response immediately
    else No data available
        Server->>Database: Check for data
        Database-->>Server: No data
        Note over Server: Hold connection
        loop Check periodically
            Server->>Database: Poll for new data
            Database-->>Server: New data arrives
        end
        Server-->>Client: Return response
    end
    Client->>Server: Next request begins
The key difference between approaches can be understood with a simple train analogy:
Short polling is like a train that departs strictly according to a timetable - it leaves the station at fixed intervals regardless of whether there are passengers or not. WebSockets, on the other hand, are like having a dedicated train line always ready to transport passengers.
Long polling? It's like a train that waits at the station until at least one passenger boards before departing. If no passengers show up within a certain time (TTL), only then does it leave empty. This approach gives you the best of both worlds - immediate departure when there's data (passengers) and efficient resource usage when there's not.
In technical terms:
- With short polling, the server responds immediately whether there's data or not
- With long polling, the server holds the connection open until either:
- New data becomes available
- A timeout is reached (TTL)
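To make the difference concrete, here is a minimal server-side sketch of both behaviours in TypeScript. The checkForData helper and the 500ms re-check interval are illustrative assumptions for this example, not part of our actual implementation.

// Placeholder data source for illustration only; a real handler would query Postgres here.
async function checkForData(): Promise<string | null> {
  return null;
}

const sleep = (ms: number) => new Promise<void>(resolve => setTimeout(resolve, ms));

// Short polling: respond immediately, whether or not data exists.
async function shortPollHandler(): Promise<string | null> {
  return checkForData();
}

// Long polling: keep re-checking until data arrives or the TTL expires, and only then respond.
async function longPollHandler(ttl = 60_000): Promise<string | null> {
  const start = Date.now();
  while (Date.now() - start < ttl) {
    const data = await checkForData();
    if (data) return data; // respond as soon as data is available
    await sleep(500); // wait briefly before checking again
  }
  return null; // TTL reached with no data
}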
Our Implementation Deep Dive
Let's break down our Node.js implementation:
export const getJobStatusSync = async ({
  jobId,
  owner,
  ttl = 60_000,
}: {
  jobId: string;
  owner: { clusterId: string };
  ttl?: number;
}) => {
  let jobResult: {
    service: string;
    status: "pending" | "running" | "success" | "failure" | "stalled";
    result: string | null;
    resultType: ResultType | null;
  } | undefined;

  const start = Date.now();
The function accepts:
- jobId: Unique identifier for the job we're tracking
- owner.clusterId: Cluster identifier for multi-tenancy
- ttl: Time-to-live in milliseconds (defaults to 60 seconds)
The Polling Loop
do {
  const [job] = await data.db
    .select({
      service: data.jobs.service,
      status: data.jobs.status,
      result: data.jobs.result,
      resultType: data.jobs.result_type,
    })
    .from(data.jobs)
    .where(and(eq(data.jobs.id, jobId), eq(data.jobs.cluster_id, owner.clusterId)));

  if (!job) {
    throw new NotFoundError(`Job ${jobId} not found`);
  }

  if (job.status === "success" || job.status === "failure") {
    jobResult = job;
  } else {
    await new Promise(resolve => setTimeout(resolve, 500));
  }
} while (!jobResult && Date.now() - start < ttl);
Key aspects:
- The loop continues until either:
  - We get a final status (success or failure)
  - We hit the TTL timeout
- We use a 500ms delay between checks to prevent hammering the database
- The database query is optimized with proper indexes on id and cluster_id
Error Handling and Response
if (jobResult) {
  return jobResult;
} else {
  throw new JobPollTimeoutError(`Call did not resolve within ${ttl}ms`);
}
The function concludes by:
- Returning the job result if we got a final status
- Throwing a JobPollTimeoutError if no result arrived within the TTL
Database Optimization
For this pattern to work efficiently, proper Postgres indexing needs to be implemented:
CREATE INDEX idx_jobs_lookup ON jobs(id, cluster_id);
CREATE INDEX idx_jobs_status ON jobs(status) WHERE status IN ('success', 'failure');
This ensures our frequent polling queries are fast and don't put unnecessary load on the database.
The Hidden Benefits of Long Polling
One of the most compelling aspects of long polling is what you don't have to build. Here's what we avoided:
Observability Remains Unchanged
One of the biggest wins is that we don't need to modify our observability stack for WebSockets. All our standard HTTP metrics just work out of the box, and our existing logging patterns do exactly what we need. There's no need to figure out new ways to monitor persistent connections or implement additional logging for WebSocket state.
Authentication Simplicity
We completely avoid the headache of implementing a new authentication mechanism for incoming WebSocket connections. We just keep using our standard HTTP authentication that we already have in place. All our existing security patterns continue to work exactly as they always have.
When we implemented WebSockets earlier, this became extremely gnarly due to the RBAC restrictions we had to honor. We had to be very careful about what data we pushed to connected clients, and about the privilege escalation risk when a client moves from one cluster to another.
Infrastructure Compatibility
Another worry was corporate firewalls blocking WebSocket connections. Some of our users are behind firewalls, and we don't need the IT headache of getting them to open up WebSockets.
Not our problem. We don't need any special proxy configurations or complex infrastructure setups. Our standard load balancer configuration works fine without any modifications. The entire stack just keeps humming along as it always has.
Operational Simplicity
We never have to worry about server restarts dropping WebSocket connections. There's no connection state to manage or maintain. When something goes wrong (and something always goes wrong), it's much easier to debug and troubleshoot because we're just dealing with standard HTTP requests and responses.
We use Cloudflare at our edge, which means our existing configuration rules and DDoS protection didn't need to change.
Client Implementation
The client-side code stays remarkably simple. It works with any HTTP client, no special WebSocket libraries needed. Even better, reconnection handling comes for free with basic retry logic. The entire client implementation can often be just a few lines of code.
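For illustration, a complete client can be as small as the sketch below. The /jobs/:jobId/status endpoint, the ttl query parameter, and the response shape are assumptions made for this example, not our actual API.

// Minimal long-polling client using the built-in fetch (Node 18+ or browsers).
type JobStatus = { status: "pending" | "running" | "success" | "failure" };

async function waitForJob(baseUrl: string, jobId: string): Promise<JobStatus> {
  // Each attempt is a plain HTTP GET; the server holds the request open
  // until the job resolves or its TTL expires, so the client simply loops.
  for (;;) {
    const response = await fetch(`${baseUrl}/jobs/${jobId}/status?ttl=60000`);

    if (response.ok) {
      const job = (await response.json()) as JobStatus;
      if (job.status === "success" || job.status === "failure") {
        return job;
      }
      continue; // server returned before the job finished; poll again
    }

    // Basic retry logic doubles as "reconnection" handling.
    await new Promise(resolve => setTimeout(resolve, 1_000));
  }
}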
Why Not ElectricSQL?
While exploring solutions, we looked at ElectricSQL, which synchronizes Postgres data to the frontend. They make an interesting case for long polling over WebSockets:
"Switching to an HTTP protocol may at first seem like a regression or a strange fit. Web sockets are built on top of HTTP specifically to serve the kind of realtime data stream that Electric provides. However, they are also more stateful and harder to cache."
In fact, we actually recommend ElectricSQL if you don't need extreme control or low-level constructs to handle real-time updates. It's a solid, battle-tested solution that handles many edge cases and provides a great developer experience.
Why We Chose Raw Long Polling
The message delivery mechanism is a core part of our product - it's not just an implementation detail, it's central to what we do. We can't afford to have something that fundamental abstracted away in a third-party library, no matter how good that library might be.
Our specific use case required:
- Core Product Control: Full control over our message delivery mechanism - it's not just infrastructure, it's our product
- Zero External Dependencies: We needed our stack to be as simple as possible for self-hosting
- Close to the Metal: Direct control over the polling mechanism and connection handling
- Maximum Control: Ability to fine-tune every aspect of the implementation, including implementing dynamic polling intervals
- Simplicity: Making it easy for users to understand and modify the codebase
For us, staying close to the metal with a simple HTTP long polling implementation was the right choice. But if you don't need this level of control, ElectricSQL provides a more feature-rich solution that could save you significant development time.
Application Layer Best Practices
When implementing long polling, there are several critical practices to follow to ensure reliable operation:
Mandatory TTL Implementation
You must implement a Time-To-Live (TTL) for your HTTP connections. Without this, you'll inevitably run into connection reset errors. Your polling logic should always return within this TTL, no matter what.
const getJobStatus = async (jobId: string, ttl = 60_000) => {
  const start = Date.now();

  // Always check if we've exceeded TTL
  while (Date.now() - start < ttl) {
    // polling logic here
  }

  return null; // TTL exceeded
};
Client-Configurable TTL with Server Limits
While clients should be able to specify their desired TTL, the server must enforce a maximum limit:
const MAX_TTL = 120_000; // 2 minutes

const getJobStatus = async (jobId: string, clientTtl: number) => {
  const ttl = Math.min(clientTtl, MAX_TTL);
  // ... polling logic
};
Infrastructure-Aware TTL Settings
Your maximum TTL must stay under the minimum HTTP connection timeout across your entire infrastructure stack:
- Application server timeouts
- Client timeouts
- Load balancer timeouts
- Edge server timeouts
- Proxy timeouts
For example, if your edge server has a 30-second timeout, your max TTL should be comfortably under this, say 25 seconds.
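One way to keep this constraint visible in code is to derive the maximum TTL from the tightest timeout in the stack. The timeout values and names below are illustrative assumptions, not our real configuration.

// Illustrative timeouts; substitute the real limits from your own stack.
const INFRA_TIMEOUTS_MS = {
  applicationServer: 60_000,
  client: 60_000,
  loadBalancer: 60_000,
  edge: 30_000, // e.g. an edge server with a 30-second timeout
  proxy: 60_000,
};

// Stay comfortably under the tightest timeout in the stack.
const SAFETY_MARGIN_MS = 5_000;
const MAX_TTL_MS =
  Math.min(...Object.values(INFRA_TIMEOUTS_MS)) - SAFETY_MARGIN_MS; // 25_000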
Sensible Database Polling Intervals
As shown in our implementation, include a reasonable wait time between database polls. We use a 500ms interval:
await new Promise(resolve => setTimeout(resolve, 500));
This prevents hammering your database while still providing reasonably quick updates.
Optional: Exponential Backoff
While not implemented in our current system, you can implement exponential backoff for more efficient resource usage:
const getJobStatus = async (jobId: string, ttl = 60_000) => {
  const start = Date.now();
  let waitTime = 100; // Start with 100ms

  while (Date.now() - start < ttl) {
    const result = await checkJob(jobId);
    if (result) return result;

    // Exponential backoff with max of 2 seconds
    waitTime = Math.min(waitTime * 2, 2000);
    await new Promise(resolve => setTimeout(resolve, waitTime));
  }

  return null;
};
This approach means:
- Active requests (those likely to get data soon) terminate quickly
- Inactive requests gradually increase their polling interval
- System resources are used more efficiently
A Case for WebSockets: The Other Side of the Story
While we've found long polling to be a great solution for our needs, it's not the only option. WebSockets are not inherently bad. They just require a lot of love and attention.
The challenges we mentioned aren't insurmountable - they just require proper engineering attention:
- Observability: WebSockets are more stateful, so you need to implement additional logging and monitoring for persistent connections.
- Authentication: You need to implement a new authentication mechanism for incoming WebSocket connections.
- Infrastructure: You need to configure your infrastructure to support WebSockets, including load balancers and firewalls.
- Operations: You need to manage WebSocket connections and reconnections, including handling connection timeouts and errors.
- Client Implementation: You need to implement a client-side WebSocket library, including handling reconnections and state management.