The Hidden Costs of Synchronous Tool Calls

Learn why traditional synchronous tool calls in AI agents come with high security, operational, and compliance costs, and discover alternative approaches for better scalability.

Nadeesha Cabral on 13-12-2024

When an agent determines it needs to invoke a function (commonly called "tools" nowadays), the conventional approach is to invoke a function within the same process as the agent.

This is a good starting point for simple tools written in the same language as the agent orchestration, but as systems grow they tend to involve multiple services or other constraints that make this infeasible.

An initial approach may be making an API call to the other service. However, this method creates a few challenges:

  1. Internal tools become exposed to the public internet
  2. Operations work is required to enable ingress to the tool
  3. Authentication mechanisms must be implemented to prevent malicious access
  4. Network ingress needs to handle load balancing between multiple tool instances
  5. HTTP timeouts mean the agent can't execute any tool that takes longer than the timeout
  6. Avoiding the HTTP timeout means you have to implement your own internal job queue
  7. The HTTP server needs to be scaled to handle the tool's peak concurrency

...and I could go on. So, no - in our humble opinion - we don't think that's a good enough way to build production-grade agentic workflows.
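
To make the pain concrete, here is a minimal sketch of the synchronous approach, assuming a hypothetical internal tool endpoint and hand-rolled bearer auth (neither of these comes from Inferable): the agent process blocks on an HTTP request and inherits every failure mode listed above.

// Hypothetical synchronous tool call over HTTP. The endpoint, auth header,
// and payload shape are illustrative assumptions, not Inferable APIs.
async function callTool(name: string, input: unknown): Promise<unknown> {
  const res = await fetch(`https://internal-tools.example.com/tools/${name}`, {
    method: "POST",
    headers: {
      "content-type": "application/json",
      // Roll-your-own authentication (challenge #3 above).
      authorization: `Bearer ${process.env.TOOL_API_KEY}`,
    },
    body: JSON.stringify(input),
    // Anything slower than this window fails, even if the tool would
    // eventually have succeeded (challenges #5 and #6 above).
    signal: AbortSignal.timeout(30_000),
  });

  if (!res.ok) {
    throw new Error(`Tool ${name} failed with status ${res.status}`);
  }
  return res.json();
}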

sequenceDiagram
    participant Agent as AI Agent
    participant LB as Load Balancer
    participant Tool1 as Tool Instance 1
    participant Tool2 as Tool Instance 2

    Agent->>LB: 1. Request exposed to internet
    activate LB

    LB->>Tool1: 2. Route request
    activate Tool1
    Note over Tool1: 3. Instance overloaded<br/>due to peak traffic
    Tool1--xLB: 4. Server error (503)
    deactivate Tool1

    LB->>Tool2: 5. Retry on different instance
    activate Tool2
    Note over Tool2: 6. Long-running task<br/>exceeds HTTP timeout
    Tool2--xLB: 7. Request times out
    deactivate Tool2

    LB--xAgent: 8. Gateway timeout (504)
    deactivate LB

In environments where security is paramount and all infrastructure is deployed in private subnets, exposing an internal service to the internet ranges from "impossible" to "very hard". Additionally, many legacy systems remain secure primarily because they're isolated from the public internet.

If you need to be convinced of this, all you need to do is have a two-minute chat with a security and compliance team that manages a mid-to-large-sized stack.

What if we could run our tools without exposing them to the public internet?

This reality - and our own experience of building distributed systems - drove our early architectural decision to build Inferable in a way that requires:

  1. Zero network ingress to a VPC
  2. No additional custom authentication to make an Inferable service operational

Enter (long) polling.

sequenceDiagram
    participant SDK as Inferable SDK
    participant CP as Control Plane
    participant DB as Postgres Queue

    loop Every few seconds
        SDK->>CP: "I'm alive, any jobs for me?"
        CP->>DB: Check for pending jobs
        alt Jobs available
            DB-->>CP: Return pending jobs
            CP-->>SDK: Here are jobs to execute
            SDK->>SDK: Execute jobs
            SDK->>CP: Send results
            CP->>DB: Update job status
        else No jobs
            DB-->>CP: No pending jobs
            CP-->>SDK: No jobs available
        end
    end

Essentially, when a function gets registered (either in the codebase directly or via a proxy service), rather than keeping a port open, the SDK periodically sends a heartbeat to the control plane saying: "Hey, I'm alive, and I'm capable of running these functions. Do you have anything for me?"

The control plane then responds with either:

  1. Nothing for now, keep ticking.
  2. Yes, execute these functions for me, and these are the input params. When you're done, call me back on this address.
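
Concretely, the SDK side of that exchange might look something like this sketch. The endpoint paths, payload shapes, and registered-function map are assumptions for illustration; the real SDK also handles retries, backoff, and schema validation on top of this.

type Job = { id: string; function: string; input: unknown };

// Hypothetical map of registered functions; in the real SDK these come from
// service.register(...) calls.
const functions: Record<string, (input: any) => Promise<unknown>> = {
  getUser: async (input) => ({ id: input.id, name: "…" }),
};

async function pollLoop(controlPlaneUrl: string, machineId: string) {
  while (true) {
    // "I'm alive, and I can run these functions. Anything for me?"
    const res = await fetch(`${controlPlaneUrl}/poll`, {
      method: "POST",
      headers: { "content-type": "application/json" },
      body: JSON.stringify({ machineId, functions: Object.keys(functions) }),
    });
    const jobs: Job[] = await res.json();

    for (const job of jobs) {
      const result = await functions[job.function](job.input);
      // "When you're done, call me back on this address."
      await fetch(`${controlPlaneUrl}/jobs/${job.id}/result`, {
        method: "POST",
        headers: { "content-type": "application/json" },
        body: JSON.stringify({ machineId, result }),
      });
    }

    // Keep ticking: wait a few seconds before asking again.
    await new Promise((resolve) => setTimeout(resolve, 5000));
  }
}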

Distributed System Guarantees

The Inferable execution engine implements a distributed job queue (something like SQS, but designed specifically for these workloads) that ensures:

  1. Message Visibility: Jobs cannot be processed by multiple machines simultaneously through our locking mechanism
  2. At-least-once Processing: A job will be processed at least once as long as a capable machine is available
  3. Timeout Management: If a machine exceeds timeout or stalls, the job is automatically rerouted to a different replica
  4. Exclusive Processing: Only one machine can process a job at any time via a time-limited lease
  5. Result Authentication: Only the machine with the acquired lease can submit results
  6. Load Distribution: Workload is balanced across multiple replicas running the same service configuration
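
Guarantees 4 and 5 fall out of the lease. Anticipating the Postgres-backed implementation described below, a result submission might be guarded roughly like this sketch (the result column and exact status values are assumptions; executing_machine_id appears in the claim query later on):

import { Pool } from "pg";

const pool = new Pool();

// Sketch: a result is only accepted if the submitting machine still holds
// the lease. Zero rows updated means the lease was lost (e.g. the job
// stalled and was handed to another replica), so the stale result is
// rejected rather than overwriting the new attempt.
async function submitResult(jobId: string, machineId: string, result: unknown) {
  const { rowCount } = await pool.query(
    `UPDATE jobs
     SET status = 'success', result = $1
     WHERE id = $2
       AND executing_machine_id = $3
       AND status = 'running'`,
    [JSON.stringify(result), jobId, machineId]
  );
  return rowCount === 1;
}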

Implementation: PostgreSQL-Based Job Queue

Rather than using a traditional message queue, we've implemented our distributed job queue using PostgreSQL. We did this because - well, why not? All you need is Postgres, and you get:

  1. Transactional guarantees
  2. Operational characteristics that are already familiar to most engineering teams that self-host

Here's a look at the core polling logic:

UPDATE jobs SET
  status = $1,
  remaining_attempts = remaining_attempts - 1,
  last_retrieved_at = now(),
  executing_machine_id = $2
WHERE id IN (
  SELECT id FROM jobs
  WHERE
    status = 'pending'
    AND cluster_id = $3
    AND service = $4
  LIMIT $5 -- limit of jobs to claim
  FOR UPDATE SKIP LOCKED
)

And here's how a job moves between states:

stateDiagram-v2
    [*] --> pending
    pending --> running: Machine claims job
    running --> success: Job completes
    running --> failure: Job fails
    running --> stalled: Timeout/Machine failure
    stalled --> pending: Retry available
    stalled --> failure: No retries left
    success --> [*]
    failure --> [*]

Our job queue implementation handles all of the above scenarios, and a few more that we need to get right in order to make this work at scale.
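
For reference, the lifecycle above can be written down as a small type and transition table. This mirrors the state names in the diagram and is purely illustrative, not the actual Inferable source:

type JobStatus = "pending" | "running" | "success" | "failure" | "stalled";

// Allowed transitions, matching the state diagram above.
const transitions: Record<JobStatus, JobStatus[]> = {
  pending: ["running"],                        // machine claims job
  running: ["success", "failure", "stalled"],  // completes, fails, or times out
  stalled: ["pending", "failure"],             // retry available / no retries left
  success: [],
  failure: [],
};

function canTransition(from: JobStatus, to: JobStatus): boolean {
  return transitions[from].includes(to);
}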

1. Job Claiming

We use PostgreSQL's SELECT FOR UPDATE SKIP LOCKED pattern to ensure atomic job claiming:

// Inside pollJobs function
SELECT id FROM jobs
WHERE
  status = 'pending'
  AND cluster_id = ${clusterId}
  AND service = ${service}
LIMIT ${limit}
FOR UPDATE SKIP LOCKED
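
To see why SKIP LOCKED matters, consider two pollers running the claim at the same time. A hedged sketch using node-postgres (the RETURNING column list is an assumption): rows locked by one poller's subquery are skipped, not waited on, by the other, so the two claims always come back disjoint.

import { Pool } from "pg";

const pool = new Pool();

// Issue the claim as one poller. The UPDATE ... WHERE id IN (SELECT ... FOR
// UPDATE SKIP LOCKED) runs as a single statement, so the selected rows stay
// locked until it commits.
async function claimJobs(machineId: string, clusterId: string, service: string, limit: number) {
  const { rows } = await pool.query(
    `UPDATE jobs SET
       status = 'running',
       remaining_attempts = remaining_attempts - 1,
       last_retrieved_at = now(),
       executing_machine_id = $1
     WHERE id IN (
       SELECT id FROM jobs
       WHERE status = 'pending'
         AND cluster_id = $2
         AND service = $3
       LIMIT $4
       FOR UPDATE SKIP LOCKED
     )
     RETURNING id`,
    [machineId, clusterId, service, limit]
  );
  return rows.map((row) => row.id as string);
}

// Two machines polling concurrently never receive the same job id.
const [a, b] = await Promise.all([
  claimJobs("machine-a", "cluster-1", "user-service", 5),
  claimJobs("machine-b", "cluster-1", "user-service", 5),
]);
// `a` and `b` are disjoint sets of claimed job ids.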

2. Approval Workflows

Some functions require explicit approval before execution. Human approval happens out of band, so we need to be able to pause the job in-queue while it waits for a decision. See human in the loop for more details.

This is not something we can implement easily via synchronous HTTP calls, as approvals may exceed the HTTP timeout.

export async function submitApproval({
  call,
  clusterId,
  approved,
}: {
  call: NonNullable<Awaited<ReturnType<typeof getJob>>>;
  clusterId: string;
  approved: boolean;
}) {
  if (approved) {
    await data.db
      .update(data.jobs)
      .set({
        approved: true,
        status: "pending",
        executing_machine_id: null,
        remaining_attempts: sql`remaining_attempts + 1`,
      })
      .where(/* conditions */);
  } else {
    // Handle rejection...
  }
}
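
As a usage sketch, a human's decision might reach submitApproval through an HTTP handler like the one below. The route shape and the getJob arguments are assumptions; the important part is that the decision arrives out of band, long after the original claim, which a synchronous call could never wait for.

// Hypothetical handler wiring a human's out-of-band decision to submitApproval.
async function handleApprovalDecision(req: {
  params: { clusterId: string; jobId: string };
  body: { approved: boolean };
}) {
  // getJob's argument shape is assumed here for illustration.
  const call = await getJob({ clusterId: req.params.clusterId, jobId: req.params.jobId });
  if (!call) {
    return { status: 404 };
  }

  await submitApproval({
    call,
    clusterId: req.params.clusterId,
    approved: req.body.approved,
  });

  // On approval the job re-enters 'pending' with its attempt restored, so it
  // will be picked up on a subsequent poll.
  return { status: 204 };
}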

3. Job Recovery

The system automatically recovers from machine failures:

-- run periodically
UPDATE jobs SET
  status = 'pending',
  executing_machine_id = null
WHERE
  status = 'stalled'
  AND cluster_id = $1
  AND service = $2
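
For this to fire, something has to mark running jobs as stalled in the first place. A plausible sketch, assuming the stall check keys off last_retrieved_at and a fixed lease window (the actual criteria and timeout source in Inferable may differ):

import { Pool } from "pg";

const pool = new Pool();

// Sketch: flag 'running' jobs whose lease looks expired as 'stalled'.
// last_retrieved_at is set by the claim query above; the 30-second window
// here is an assumption for illustration.
async function markStalledJobs(clusterId: string, service: string) {
  await pool.query(
    `UPDATE jobs SET
       status = 'stalled'
     WHERE status = 'running'
       AND cluster_id = $1
       AND service = $2
       AND last_retrieved_at < now() - interval '30 seconds'`,
    [clusterId, service]
  );
}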

How does this work in practice?

Consider running a Kubernetes cluster with a user-service comprising 20 pods, each registering a getUser() function. If we receive 100 getUser() requests, our system ensures:

  1. Under normal conditions, exactly 100 executions occur across the 20 pods
  2. In failure scenarios, all requests are processed at least once (at least 100 executions)
  3. Requests are distributed to avoid overwhelming any single pod
  4. If a pod fails, pending requests are automatically reassigned

sequenceDiagram
    participant CP as Control Plane
    participant PG as Postgres Queue
    participant P1 as Pod 1
    participant P2 as Pod 2
    participant P3 as Pod 3

    Note over CP,P3: Normal Operation: Balanced Distribution

    CP->>PG: Enqueue 100 jobs

    loop Normal Processing
        P1->>PG: Poll for jobs
        PG-->>P1: Take 5 jobs (FOR UPDATE SKIP LOCKED)
        P2->>PG: Poll for jobs
        PG-->>P2: Take 5 jobs (FOR UPDATE SKIP LOCKED)
        P3->>PG: Poll for jobs
        PG-->>P3: Take 5 jobs (FOR UPDATE SKIP LOCKED)
    end

    Note over CP,P3: ⚠️ Failure Scenario

    rect rgb(40, 12, 12)
        Note over P2: Pod 2 crashes
        P2--xPG: Connection lost

        Note over PG: Jobs from P2 stall

        PG->>PG: Self-heal check (every 5s)
        Note over PG: Reset stalled jobs to 'pending'

        P1->>PG: Regular poll
        PG-->>P1: Gets some of P2's jobs
        P3->>PG: Regular poll
        PG-->>P3: Gets remaining P2's jobs
    end

    Note over CP,P3: All 100 jobs complete despite failure

The beauty of this approach is that it maintains security through isolation while ensuring reliable function execution, all without requiring developers to implement complex distributed systems patterns or security measures themselves.

By using PostgreSQL as our job queue, we get the benefits of a battle-tested database while maintaining the flexibility to implement complex workflows. The SELECT FOR UPDATE SKIP LOCKED pattern ensures that our distributed system maintains consistency even under heavy load with multiple consumers.

All of this is what a developer gets for free when they register a function like this:

service.register({
  func: getUser,
});

If you find this interesting, all of our source code is open source and available on GitHub. Also, as long as you can provision a Postgres instance, you can self-host Inferable.
