The Hidden Costs of Synchronous Tool Calls
Learn why traditional synchronous tool calls in AI agents come with high security, operational, and compliance costs, and discover alternative approaches for better scalability.
When an agent determines it needs to invoke a function (commonly called a "tool" nowadays), the conventional approach is to invoke that function within the same process as the agent.
This is a good starting point for simple tools written in the same language as the agent orchestration, but as systems grow they tend to involve multiple services or other constraints that make this infeasible.
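As a minimal sketch (the function names here are illustrative, not from any particular SDK), the in-process approach is just the orchestration loop calling a local function with the arguments the model produced:
// Illustrative only: a "tool" is just a local function the agent loop can call.
async function getWeather(args: { city: string }): Promise<string> {
  // ...query an internal API or database...
  return `Sunny in ${args.city}`;
}

// Hypothetical orchestration step: the model asked for "getWeather",
// so we invoke it in-process and feed the result back into the conversation.
async function handleToolCall(name: string, args: unknown): Promise<string> {
  if (name === "getWeather") {
    return getWeather(args as { city: string });
  }
  throw new Error(`Unknown tool: ${name}`);
}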
A natural next step is to make an API call to the other service. However, this approach creates a few challenges:
- Internal tools become exposed to the public internet
- Operations work is required to enable ingress to the tool
- Authentication mechanisms must be implemented to prevent malicious access
- Network ingress needs to handle load balancing between multiple tool instances
- HTTP timeouts cap execution time, so the agent can't invoke a tool that runs longer than the timeout
- Working around the timeout means implementing your own internal job queue
- The HTTP server needs to be scaled to handle the tool's peak concurrency
...and I could go on. So, no - in our humble opinion - that's not a good enough way to build production-grade agentic workflows.
sequenceDiagram
participant Agent as AI Agent
participant LB as Load Balancer
participant Tool1 as Tool Instance 1
participant Tool2 as Tool Instance 2
Agent->>LB: 1. Request exposed to internet
activate LB
LB->>Tool1: 2. Route request
activate Tool1
Note over Tool1: 3. Instance overloaded<br/>due to peak traffic
Tool1--xLB: 4. Server error (503)
deactivate Tool1
LB->>Tool2: 5. Retry on different instance
activate Tool2
Note over Tool2: 6. Long-running task<br/>exceeds HTTP timeout
Tool2--xLB: 7. Server error (503)
deactivate Tool2
LB--xAgent: 8. Gateway timeout (504)
deactivate LB
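To make the failure mode concrete, the synchronous approach boils down to something like this on the agent's side (the endpoint and the 30-second timeout are hypothetical). Any tool that runs longer than the client timeout simply can't be expressed this way:
// Illustrative only: invoking an internal tool service over HTTP with a hard client timeout.
async function callUserServiceTool(input: { userId: string }): Promise<unknown> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), 30_000); // hypothetical 30s timeout

  try {
    const res = await fetch("https://tools.internal.example.com/getUser", {
      method: "POST",
      headers: { "content-type": "application/json" },
      body: JSON.stringify(input),
      signal: controller.signal, // long-running tools abort here
    });
    if (!res.ok) throw new Error(`Tool call failed with ${res.status}`);
    return await res.json();
  } finally {
    clearTimeout(timer);
  }
}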
In environments where security is paramount and all infrastructure is deployed in private subnets, exposing an internal service to the internet ranges from "impossible" to "very hard". Additionally, many legacy systems remain secure primarily because they're isolated from the public internet.
If you need convincing, a 2-minute chat with a security and compliance team that manages a mid-to-large-sized stack will do it.
What if we could run our tools without exposing them to the public internet?
This reality - and our own experience of building distributed systems - drove our early architectural decision to build Inferable in a way that requires:
- Zero network ingress to a VPC
- No additional custom authentication to make an Inferable service operational
Enter (long) polling.
sequenceDiagram
participant SDK as Inferable SDK
participant CP as Control Plane
participant DB as Postgres Queue
loop Every few seconds
SDK->>CP: "I'm alive, any jobs for me?"
CP->>DB: Check for pending jobs
alt Jobs available
DB-->>CP: Return pending jobs
CP-->>SDK: Here are jobs to execute
SDK->>SDK: Execute jobs
SDK->>CP: Send results
CP->>DB: Update job status
else No jobs
DB-->>CP: No pending jobs
CP-->>SDK: No jobs available
end
end
Essentially, when a function gets registered (either in the codebase directly or via a proxy service), rather than keeping a port open, the SDK periodically sends a heartbeat to the control plane saying: "Hey, I'm alive, and I'm capable of running these functions. Do you have anything for me?"
The control plane then responds with either:
- Nothing for now, keep ticking.
- Yes, execute these functions for me, and these are the input params. When you're done, call me back on this address.
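In code, that exchange might look something like the following. This is a sketch of the idea only, not the real SDK; the /poll and /jobs/:id/result endpoints, the payload shapes, and the 2-second interval are all assumptions:
// Illustrative sketch of the long-polling loop; not the actual Inferable SDK.
type Job = { id: string; function: string; input: unknown };

// Functions this process has registered and is able to run.
const registeredFunctions: Record<string, (input: any) => Promise<unknown>> = {
  getUser: async (input) => ({ id: input.id, name: "Jane Doe" }),
};

async function pollLoop(controlPlaneUrl: string, apiSecret: string) {
  while (true) {
    // "I'm alive, and I can run these functions. Do you have anything for me?"
    const res = await fetch(`${controlPlaneUrl}/poll`, {
      method: "POST",
      headers: { authorization: `Bearer ${apiSecret}`, "content-type": "application/json" },
      body: JSON.stringify({ functions: Object.keys(registeredFunctions) }),
    });
    const jobs: Job[] = await res.json();

    for (const job of jobs) {
      // The control plane only hands us jobs for functions we advertised.
      const result = await registeredFunctions[job.function](job.input);
      // "When you're done, call me back on this address."
      await fetch(`${controlPlaneUrl}/jobs/${job.id}/result`, {
        method: "POST",
        headers: { authorization: `Bearer ${apiSecret}`, "content-type": "application/json" },
        body: JSON.stringify({ result }),
      });
    }

    // Outbound-only traffic: no ingress, no open ports, no public endpoint.
    await new Promise((resolve) => setTimeout(resolve, 2_000));
  }
}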
Distributed System Guarantees
The Inferable execution engine implements a distributed job queue (something like SQS, but purpose-built for these workloads) which guarantees:
- Message Visibility: A locking mechanism ensures a job cannot be processed by multiple machines simultaneously
- At-least-once Processing: A job will be processed at least once as long as a capable machine is available
- Timeout Management: If a machine exceeds timeout or stalls, the job is automatically rerouted to a different replica
- Exclusive Processing: Only one machine can process a job at any time via a time-limited lease
- Result Authentication: Only the machine holding the lease can submit results (sketched below)
- Load Distribution: Workload is balanced across multiple replicas running the same service configuration
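As an example of how the lease and result authentication fit together, the result write can be conditioned on the machine still holding the lease. A minimal sketch, assuming a postgres.js client and a result column (the other column names mirror the queries later in this post):
// Sketch: only the machine still holding the lease can persist a result.
// The postgres.js client and the `result` column are assumptions.
import postgres from "postgres";

const sql = postgres(process.env.DATABASE_URL!);

async function persistResult(jobId: string, machineId: string, result: unknown) {
  const updated = await sql`
    UPDATE jobs SET
      status = 'success',
      result = ${JSON.stringify(result)}
    WHERE
      id = ${jobId}
      AND executing_machine_id = ${machineId} -- must still hold the lease
      AND status = 'running'
  `;

  // Zero affected rows means the lease was lost (timeout or reassignment),
  // so the submission is rejected instead of overwriting another machine's work.
  if (updated.count === 0) {
    throw new Error(`Machine ${machineId} no longer holds the lease for job ${jobId}`);
  }
}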
Implementation: PostgreSQL-Based Job Queue
Rather than using a traditional message queue, we've implemented our distributed job queue using PostgreSQL. We did this because - well, why not? All you need is Postgres, and you get:
- Transactional guarantees
- Operational characteristics most engineering teams are already familiar with, which matters for self-hosting
Here's a look at the core polling logic:
UPDATE jobs SET
status = $1,
remaining_attempts = remaining_attempts - 1,
last_retrieved_at = now(),
executing_machine_id = $2
WHERE id IN (
SELECT id FROM jobs
WHERE
status = 'pending'
AND cluster_id = $3
AND service = $4
LIMIT $5 -- limit of jobs to claim
FOR UPDATE SKIP LOCKED
)
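In practice, this UPDATE would typically end with a RETURNING clause so the claimed rows come back to the polling machine in the same round trip. A sketch of how that might be wrapped, again assuming a postgres.js client (the 'running' status value and returned columns are assumptions):
// Sketch: claim up to `limit` pending jobs for one machine and return them in one statement.
import postgres from "postgres";

const sql = postgres(process.env.DATABASE_URL!);

async function claimJobs(machineId: string, clusterId: string, service: string, limit: number) {
  return sql`
    UPDATE jobs SET
      status = 'running',
      remaining_attempts = remaining_attempts - 1,
      last_retrieved_at = now(),
      executing_machine_id = ${machineId}
    WHERE id IN (
      SELECT id FROM jobs
      WHERE
        status = 'pending'
        AND cluster_id = ${clusterId}
        AND service = ${service}
      LIMIT ${limit}
      FOR UPDATE SKIP LOCKED
    )
    RETURNING * -- whichever columns describe the function to run and its input
  `;
}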
And here's how a job moves between states:
stateDiagram-v2
[*] --> pending
pending --> running: Machine claims job
running --> success: Job completes
running --> failure: Job fails
running --> stalled: Timeout/Machine failure
stalled --> pending: Retry available
stalled --> failure: No retries left
success --> [*]
failure --> [*]
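In TypeScript terms, the lifecycle boils down to a small state machine (a sketch; the real implementation may model this differently):
// Sketch of the job lifecycle as data; the real implementation may model it differently.
type JobStatus = "pending" | "running" | "success" | "failure" | "stalled";

const transitions: Record<JobStatus, JobStatus[]> = {
  pending: ["running"],                        // a machine claims the job
  running: ["success", "failure", "stalled"],  // completes, fails, or times out
  stalled: ["pending", "failure"],             // retried if attempts remain, else failed
  success: [],                                 // terminal
  failure: [],                                 // terminal
};

function canTransition(from: JobStatus, to: JobStatus): boolean {
  return transitions[from].includes(to);
}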
Our job queue implementation handles all of the above scenarios, and a few more that we need to get right in order to make this work at scale.
1. Job Claiming
We use PostgreSQL's SELECT FOR UPDATE SKIP LOCKED pattern to ensure atomic job claiming:
// Inside pollJobs function
SELECT id FROM jobs
WHERE
status = 'pending'
AND cluster_id = ${clusterId}
AND service = ${service}
LIMIT ${limit}
FOR UPDATE SKIP LOCKED
2. Approval Workflows
Some functions require explicit approval before execution. Human approval happens out of band, so we need to be able to pause the job in-queue while it waits. See human in the loop for more details.
This is not something we can implement easily via synchronous HTTP calls, as approvals may exceed the HTTP timeout.
export async function submitApproval({
call,
clusterId,
approved,
}: {
call: NonNullable<Awaited<ReturnType<typeof getJob>>>;
clusterId: string;
approved: boolean;
}) {
if (approved) {
  // Re-queue the job: clear the stale lease and restore an attempt so any
  // capable machine can claim it and run the approved call.
await data.db
.update(data.jobs)
.set({
approved: true,
status: "pending",
executing_machine_id: null,
remaining_attempts: sql`remaining_attempts + 1`,
})
.where(/* conditions */);
} else {
// Handle rejection...
}
}
3. Job Recovery
The system automatically recovers from machine failures:
-- run periodically
UPDATE jobs SET
status = 'pending',
executing_machine_id = null
WHERE
status = 'stalled'
AND cluster_id = $1
AND service = $2
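The state diagram above also implies a companion sweep that marks timed-out running jobs as stalled in the first place. A sketch of what that could look like, assuming a postgres.js client and a fixed lease timeout (the 5-minute value is an assumption; last_retrieved_at is the column from the claim query earlier):
// Sketch: mark running jobs whose lease has expired as 'stalled' so the
// recovery query above can re-queue them. Run periodically alongside it.
import postgres from "postgres";

const sql = postgres(process.env.DATABASE_URL!);

async function markStalledJobs(clusterId: string, service: string) {
  await sql`
    UPDATE jobs SET
      status = 'stalled'
    WHERE
      status = 'running'
      AND cluster_id = ${clusterId}
      AND service = ${service}
      AND last_retrieved_at < now() - interval '5 minutes' -- assumed fixed lease timeout
  `;
}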
How does this work in practice?
Consider running a Kubernetes cluster with a user-service comprising 20 pods, each registering a getUser() function. If we receive 100 getUser() requests, our system ensures:
- Under normal conditions, exactly 100 executions occur across the 20 pods
- In failure scenarios, all requests are processed at least once (at least 100 executions)
- Requests are distributed to avoid overwhelming any single pod
- If a pod fails, pending requests are automatically reassigned
sequenceDiagram
participant CP as Control Plane
participant PG as Postgres Queue
participant P1 as Pod 1
participant P2 as Pod 2
participant P3 as Pod 3
Note over CP,P3: Normal Operation: Balanced Distribution
CP->>PG: Enqueue 100 jobs
loop Normal Processing
P1->>PG: Poll for jobs
PG-->>P1: Take 5 jobs (FOR UPDATE SKIP LOCKED)
P2->>PG: Poll for jobs
PG-->>P2: Take 5 jobs (FOR UPDATE SKIP LOCKED)
P3->>PG: Poll for jobs
PG-->>P3: Take 5 jobs (FOR UPDATE SKIP LOCKED)
end
Note over CP,P3: ⚠️ Failure Scenario
rect rgb(40, 12, 12)
Note over P2: Pod 2 crashes
P2--xPG: Connection lost
Note over PG: Jobs from P2 stall
PG->>PG: Self-heal check (every 5s)
Note over PG: Reset stalled jobs to 'pending'
P1->>PG: Regular poll
PG-->>P1: Gets some of P2's jobs
P3->>PG: Regular poll
PG-->>P3: Gets remaining P2's jobs
end
Note over CP,P3: All 100 jobs complete despite failure
The beauty of this approach is that it maintains security through isolation while ensuring reliable function execution, all without requiring developers to implement complex distributed systems patterns or security measures themselves.
By using PostgreSQL as our job queue, we get the benefits of a battle-tested database while maintaining the flexibility to implement complex workflows. The SELECT FOR UPDATE SKIP LOCKED pattern ensures that our distributed system maintains consistency even under heavy load with multiple consumers.
All of this is what a developer gets for free when they register a function like this:
service.register({
func: getUser,
});
If you find this interesting, all of our source code is open source and available on GitHub. And as long as you can provision a Postgres instance, you can self-host Inferable.