The Hidden Costs of Synchronous Tool Calls
Learn why traditional synchronous tool calls in AI agents come with high security, operational, and compliance costs, and discover alternative approaches for better scalability.
When an agent determines it needs to invoke a function (commonly called a "tool" these days), the conventional approach is to execute that function within the same process as the agent.
This is a good starting point for simple tools written in the same language as the agent orchestration, but as systems grow they tend to involve multiple services or other constraints that make this infeasible.
An initial approach might be to make an API call to the other service. However, this creates a few challenges:
- Internal tools become exposed to the public internet
- Operations work is required to enable ingress to the tool
- Authentication mechanisms must be implemented to prevent malicious access
- Network ingress needs to handle load balancing between multiple tool instances
- HTTP timeouts mean that the agent can't execute a tool that takes longer than the HTTP timeout
- Avoiding the HTTP timeout means that you have to implement your own internal job queue
- HTTP server needs to be scaled to handle the peak concurrency of the tool
...and I could go on. So, no - in our humble opinion - we don't think that's a good enough way to build production-grade agentic workflows.
sequenceDiagram
    participant Agent as AI Agent
    participant LB as Load Balancer
    participant Tool1 as Tool Instance 1
    participant Tool2 as Tool Instance 2
    Agent->>LB: 1. Request exposed to internet
    activate LB
    LB->>Tool1: 2. Route request
    activate Tool1
    Note over Tool1: 3. Instance overloaded<br/>due to peak traffic
    Tool1--xLB: 4. Server error (503)
    deactivate Tool1
    LB->>Tool2: 5. Retry on different instance
    activate Tool2
    Note over Tool2: 6. Long-running task<br/>exceeds HTTP timeout
    Tool2--xLB: 7. Server error (503)
    deactivate Tool2
    LB--xAgent: 8. Gateway timeout (504)
    deactivate LB
In environments where security is paramount and all infrastructure is deployed in private subnets, exposing an internal service to the internet ranges from "impossible" to "very hard". Additionally, many legacy systems remain secure primarily because they're isolated from the public internet.
If you need to be convinced of this, all you need to do is have a 2-minute chat with a security and compliance team that manages a mid-to-large-sized stack.
What if we could run our tools without exposing them to the public internet?
This reality - and our own experience of building distributed systems - drove our early architectural decision to build Inferable in a way that requires:
- Zero network ingress to a VPC
- No additional custom authentication to make an Inferable service operational
Enter (long) polling.
sequenceDiagram
    participant SDK as Inferable SDK
    participant CP as Control Plane
    participant DB as Postgres Queue
    loop Every few seconds
        SDK->>CP: "I'm alive, any jobs for me?"
        CP->>DB: Check for pending jobs
        alt Jobs available
            DB-->>CP: Return pending jobs
            CP-->>SDK: Here are jobs to execute
            SDK->>SDK: Execute jobs
            SDK->>CP: Send results
            CP->>DB: Update job status
        else No jobs
            DB-->>CP: No pending jobs
            CP-->>SDK: No jobs available
        end
    end
Essentially, when a function gets registered (either in the codebase directly or via a proxy service), rather than keeping a port open, the SDK periodically sends a heartbeat to the control plane saying: "Hey, I'm alive, and I'm capable of running these functions. Do you have anything for me?"
The control plane then responds with either:
- Nothing for now, keep ticking.
- Yes, execute these functions for me, and these are the input params. When you're done, call me back on this address.
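To make the shape of that loop concrete, here's a stripped-down sketch in TypeScript. The endpoint paths, payload shapes, and auth header are assumptions for illustration, not the real Inferable API; the actual SDK also handles retries, backoff, and long-poll holds.

// Illustrative long-polling loop. Endpoint paths, payload shapes, and the
// auth header are assumptions, not the real Inferable SDK.
const tools: Record<string, (input: any) => Promise<unknown>> = {
  getUser: async (input) => ({ id: input.id, name: "Ada" }),
};

export async function pollLoop(controlPlaneUrl: string, apiKey: string) {
  while (true) {
    // "Hey, I'm alive, and I can run these functions. Anything for me?"
    const res = await fetch(`${controlPlaneUrl}/poll`, {
      method: "POST",
      headers: {
        authorization: `Bearer ${apiKey}`,
        "content-type": "application/json",
      },
      body: JSON.stringify({ functions: Object.keys(tools) }),
    });

    const { jobs = [] } = (await res.json()) as {
      jobs?: { id: string; function: string; input: any }[];
    };

    for (const job of jobs) {
      // "Yes, execute these functions... call me back on this address."
      const result = await tools[job.function](job.input);
      await fetch(`${controlPlaneUrl}/jobs/${job.id}/result`, {
        method: "POST",
        headers: {
          authorization: `Bearer ${apiKey}`,
          "content-type": "application/json",
        },
        body: JSON.stringify({ result }),
      });
    }

    // "Nothing for now, keep ticking."
    if (jobs.length === 0) {
      await new Promise((resolve) => setTimeout(resolve, 5000));
    }
  }
}

No inbound connection is ever made to the SDK; everything is an outbound request from inside the private network.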
Distributed System Guarantees
The Inferable execution engine implements a distributed job queue (something like SQS, but designed specifically for these workloads) which guarantees:
- Message Visibility: Our locking mechanism ensures a job cannot be processed by multiple machines simultaneously
- At-least-once Processing: A job will be processed at least once as long as a capable machine is available
- Timeout Management: If a machine exceeds timeout or stalls, the job is automatically rerouted to a different replica
- Exclusive Processing: Only one machine can process a job at any time via a time-limited lease
- Result Authentication: Only the machine with the acquired lease can submit results
- Load Distribution: Workload is balanced across multiple replicas running the same service configuration
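These guarantees all hang off a single jobs table. Pieced together from the queries in the rest of this post, a row in that table looks roughly like the sketch below; the input and result fields are assumptions, the other columns appear in the SQL that follows.

// Rough shape of a job row, inferred from the queries later in this post.
type JobStatus = "pending" | "running" | "success" | "failure" | "stalled";

interface Job {
  id: string;
  cluster_id: string;
  service: string;                     // e.g. "user-service"
  status: JobStatus;
  executing_machine_id: string | null; // holder of the current lease
  remaining_attempts: number;          // decremented each time the job is claimed
  last_retrieved_at: Date | null;      // updated when a machine claims the job
  approved: boolean | null;            // used by approval workflows
  input?: unknown;                     // assumed
  result?: unknown;                    // assumed
}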
Implementation: PostgreSQL-Based Job Queue
Rather than using a traditional message queue, we've implemented our distributed job queue using PostgreSQL. We did this because - well, why not? All you need is Postgres, and you get:
- Transactional guarantees
- Operational characteristics that most engineering teams already know, which makes self-hosting easier
Here's a look at the core polling logic:
UPDATE jobs SET
  status = $1,
  remaining_attempts = remaining_attempts - 1,
  last_retrieved_at = now(),
  executing_machine_id = $2
WHERE id IN (
  SELECT id FROM jobs
  WHERE
    status = 'pending'
    AND cluster_id = $3
    AND service = $4
  LIMIT $5 -- limit of jobs to claim
  FOR UPDATE SKIP LOCKED
)
And here's how a job moves between states:
stateDiagram-v2
    [*] --> pending
    pending --> running: Machine claims job
    running --> success: Job completes
    running --> failure: Job fails
    running --> stalled: Timeout/Machine failure
    stalled --> pending: Retry available
    stalled --> failure: No retries left
    success --> [*]
    failure --> [*]
Our job queue implementation handles all of the above scenarios, and a few more that we need to get right in order to make this work at scale.
1. Job Claiming
We use PostgreSQL's SELECT FOR UPDATE SKIP LOCKED pattern to ensure atomic job claiming: concurrent pollers skip over rows that another machine has already locked instead of blocking on them, so each job is handed to exactly one poller.
// Inside pollJobs function
SELECT id FROM jobs
WHERE
status = 'pending'
AND cluster_id = ${clusterId}
AND service = ${service}
LIMIT ${limit}
FOR UPDATE SKIP LOCKED
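If you want to wire the full claim query up yourself, a minimal sketch with node-postgres might look like this. The pg pool, the RETURNING clause, and passing "running" as $1 are our assumptions for the sketch, not lifted from the Inferable codebase.

import { Pool } from "pg";

// Connection details come from the standard PG* environment variables.
const pool = new Pool();

// Atomically claim up to `limit` pending jobs for this machine.
// RETURNING * is added here so the caller receives the claimed rows.
export async function claimJobs(
  machineId: string,
  clusterId: string,
  service: string,
  limit: number,
) {
  const { rows } = await pool.query(
    `UPDATE jobs SET
       status = $1,
       remaining_attempts = remaining_attempts - 1,
       last_retrieved_at = now(),
       executing_machine_id = $2
     WHERE id IN (
       SELECT id FROM jobs
       WHERE status = 'pending'
         AND cluster_id = $3
         AND service = $4
       LIMIT $5
       FOR UPDATE SKIP LOCKED
     )
     RETURNING *`,
    ["running", machineId, clusterId, service, limit],
  );
  return rows;
}

Because the claim and the bookkeeping (status, attempts, lease holder) happen in a single statement, there is no window where two machines can both believe they own the same job.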
2. Approval Workflows
Some functions require explicit approval before execution. Human approval happens out of band, so we need to be able to pause the job in-queue while it waits for a decision. See human in the loop for more details.
This is not something we can implement easily via synchronous HTTP calls, as approvals may exceed the HTTP timeout.
export async function submitApproval({
call,
clusterId,
approved,
}: {
call: NonNullable<Awaited<ReturnType<typeof getJob>>>;
clusterId: string;
approved: boolean;
}) {
if (approved) {
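// Approved: return the job to the pending queue, release the lease, and
// restore the attempt it consumed while waiting for approval.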
await data.db
.update(data.jobs)
.set({
approved: true,
status: "pending",
executing_machine_id: null,
remaining_attempts: sql`remaining_attempts + 1`,
})
.where(/* conditions */);
} else {
// Handle rejection...
}
}
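A caller, say an HTTP handler behind your approval UI, would look the job up and pass the human's decision through. Hypothetically (the argument shape of getJob is assumed here):

// Hypothetical glue code: fetch the job, then record the human's decision.
export async function handleApprovalDecision(
  clusterId: string,
  jobId: string,
  approved: boolean,
) {
  const job = await getJob({ clusterId, jobId }); // argument shape assumed

  if (!job) {
    throw new Error(`Job ${jobId} not found in cluster ${clusterId}`);
  }

  await submitApproval({ call: job, clusterId, approved });
}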
3. Job Recovery
The system automatically recovers from machine failures:
-- run periodically
UPDATE jobs SET
  status = 'pending',
  executing_machine_id = null
WHERE
  status = 'stalled'
  AND cluster_id = $1
  AND service = $2
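The recovery query above handles the stalled → pending transition; something also needs to move jobs from running to stalled in the first place. That query isn't shown here, but a periodic sweep based on the last_retrieved_at column set during claiming could look something like the sketch below (entirely an assumption, reusing the pool from the claiming sketch earlier).

// Hypothetical stall-detection sweep (not shown in the post): mark running
// jobs as stalled once their machine has gone quiet past the timeout, so the
// recovery query can hand them to another replica.
export async function markStalledJobs(
  clusterId: string,
  service: string,
  timeoutSeconds: number,
) {
  await pool.query(
    `UPDATE jobs SET status = 'stalled'
     WHERE status = 'running'
       AND cluster_id = $1
       AND service = $2
       AND last_retrieved_at < now() - interval '1 second' * $3`,
    [clusterId, service, timeoutSeconds],
  );
}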
How does this work in practice?
Consider running a Kubernetes cluster with a user-service comprising 20 pods, each registering a getUser() function. If we receive 100 getUser() requests, our system ensures:
- Under normal conditions, exactly 100 executions occur across the 20 pods
- In failure scenarios, all requests are processed at least once (at least 100 executions)
- Requests are distributed to avoid overwhelming any single pod
- If a pod fails, pending requests are automatically reassigned
sequenceDiagram
    participant CP as Control Plane
    participant PG as Postgres Queue
    participant P1 as Pod 1
    participant P2 as Pod 2
    participant P3 as Pod 3
    Note over CP,P3: Normal Operation: Balanced Distribution
    CP->>PG: Enqueue 100 jobs
    loop Normal Processing
        P1->>PG: Poll for jobs
        PG-->>P1: Take 5 jobs (FOR UPDATE SKIP LOCKED)
        P2->>PG: Poll for jobs
        PG-->>P2: Take 5 jobs (FOR UPDATE SKIP LOCKED)
        P3->>PG: Poll for jobs
        PG-->>P3: Take 5 jobs (FOR UPDATE SKIP LOCKED)
    end
    Note over CP,P3: ⚠️ Failure Scenario
    rect rgb(40, 12, 12)
        Note over P2: Pod 2 crashes
        P2--xPG: Connection lost
        Note over PG: Jobs from P2 stall
        PG->>PG: Self-heal check (every 5s)
        Note over PG: Reset stalled jobs to 'pending'
        P1->>PG: Regular poll
        PG-->>P1: Gets some of P2's jobs
        P3->>PG: Regular poll
        PG-->>P3: Gets remaining P2's jobs
    end
    Note over CP,P3: All 100 jobs complete despite failure
The beauty of this approach is that it maintains security through isolation while ensuring reliable function execution, all without requiring developers to implement complex distributed systems patterns or security measures themselves.
By using PostgreSQL as our job queue, we get the benefits of a battle-tested database while maintaining the flexibility to implement complex workflows. The SELECT FOR UPDATE SKIP LOCKED pattern ensures that our distributed system maintains consistency even under heavy load with multiple consumers.
All of this is what a developer gets for free when they register a function like this:
service.register({
func: getUser,
});
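And getUser itself stays ordinary application code. Illustratively (the in-memory store below is a stand-in for whatever internal database or legacy system you already call):

// Illustrative only: the registered function is plain application code that
// talks to internal systems which never need public ingress.
const internalUserStore = new Map<string, { id: string; name: string }>([
  ["1", { id: "1", name: "Ada" }],
]);

async function getUser({ id }: { id: string }) {
  return internalUserStore.get(id) ?? null;
}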
If you find this interesting, all of our source code is open source and available on GitHub. Also, as long as you can provision a Postgres instance, you can self-host Inferable.