Stop Treating LLMs Like Databases: The Financial Case for Event-Driven AI
Event-driven architecture is not about technical purity. For AI-heavy systems, it is the difference between a product that scales and one that destroys your unit economics.
Your database queries complete in 50 milliseconds. Your Redis cache hits return in under 1 millisecond. Your LLM API call takes 15 to 30 seconds.
That is not a performance gap. It is two to four orders of magnitude. And every architectural assumption your backend runs on (thread pool sizing, connection pool limits, timeout configurations, retry semantics) was built for the first two numbers, not the third.
Most teams treat LLM integration as an API swap. Add the client library, call the endpoint, handle the response. Same pattern they use for Stripe or Twilio. It works fine in development. In production, under real load, it starts bleeding cash and dropping users in ways that take months to trace back to the architecture choice made on day one.
Here is what actually happens, and why event-driven architecture is the structural fix. Not because it is architecturally elegant — it adds real operational complexity — but because the economics of LLM workloads leave you very little choice.
The Mental Model That Gets You Into Trouble
When you call a database, the cost is roughly fixed per query. An indexed read on ten rows costs more or less the same as one on a hundred. Infrastructure costs are predictable, and you build your pricing assumptions on predictability.
LLM API pricing does not work like that.
You pay per token — both input and output. A simple classification task might consume 500 tokens. The same user asking for a detailed analysis, or an agent tool-calling its way through three reasoning steps, might consume 15,000. You cannot know ahead of time how many tokens a given request will use, because output length depends on the nature of the question and the model’s reasoning path.
OpenAI’s pricing for GPT-4o sits at roughly $2.50 per million input tokens and $10 per million output tokens. Those numbers sound small. Run an agent that does five tool calls per user session, each call consuming 3,000 tokens of context, and you are at 15,000 tokens per session. At modest scale, say ten thousand sessions per day, that is 150 million tokens. Even if every one of them were billed at the input rate, that is $375 per day, and each output token costs four times as much as an input token. It adds up fast, and it adds up differently every single day depending on what users actually ask.
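To see how sensitive the bill is to the input/output mix, here is a rough sketch of the daily cost for that scenario. The prices are the approximate GPT-4o figures quoted above; the output shares are illustrative assumptions, since real traffic will not split this neatly, and `daily_cost` is a made-up helper name.

```python
# Rough daily cost estimate for the agent scenario described above.
# Prices are the approximate GPT-4o figures quoted in the text;
# the output shares tried below are illustrative assumptions.

INPUT_PRICE_PER_M = 2.50    # USD per million input tokens
OUTPUT_PRICE_PER_M = 10.00  # USD per million output tokens


def daily_cost(sessions_per_day: int, tokens_per_session: int,
               output_share: float) -> float:
    """Return the estimated daily spend in USD."""
    total_tokens = sessions_per_day * tokens_per_session
    output_tokens = total_tokens * output_share
    input_tokens = total_tokens - output_tokens
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000


if __name__ == "__main__":
    # 10,000 sessions x 15,000 tokens = 150 million tokens per day.
    for share in (0.0, 0.2, 0.5):
        cost = daily_cost(10_000, 15_000, share)
        print(f"output share {share:.0%}: ${cost:,.2f}/day")
```

Under these assumptions, moving from an all-input mix to a 20% output share takes the same 150 million tokens from about $375 to about $600 per day. The token count did not change; only what users asked for did.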
The core issue: you cannot build stable unit economics on a variable you cannot bound. And with a synchronous architecture, you have very little leverage to impose bounds.
The Latency Trap Nobody Talks About in Architecture Reviews
Here is a number worth knowing: a typical LLM response takes between 5 and 30 seconds to complete, depending on the model, output length, and server load. Not 30 milliseconds. Seconds.
Your database queries return in 10–50ms. Your Redis calls return in under 1ms. You have built your entire backend mental model around sub-100ms operations, and now you have inserted a dependency that can take 20,000ms.
This matters structurally. Apache Tomcat, which backs most Spring Boot deployments, defaults to 200 worker threads. Apply Little’s Law (the number of requests in flight equals arrival rate multiplied by the time each request spends in the system) and the math gets uncomfortable quickly. With every thread blocked for 20 seconds per LLM request, your 200-thread pool tops out at 10 requests per second before new requests start queuing. At 100ms per request, the same pool sustains 2,000 requests per second.
You have reduced your system’s throughput capacity by two orders of magnitude by making a single synchronous external call.
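The arithmetic behind that claim is short enough to write down. This is a back-of-the-envelope sketch, assuming every request blocks one thread for its full duration; `max_throughput` is a name made up for illustration.

```python
# Little's Law: L = lambda * W
#   L      = average number of requests in the system (capped by thread count)
#   lambda = throughput in requests per second
#   W      = average time each request spends in the system, in seconds
# Rearranged: maximum sustainable throughput = threads / service time.


def max_throughput(threads: int, service_time_s: float) -> float:
    """Requests per second a fully blocking thread pool can sustain."""
    return threads / service_time_s


if __name__ == "__main__":
    pool = 200  # Tomcat's default thread count, as cited in the text
    fast = max_throughput(pool, 0.1)   # 100 ms requests
    slow = max_throughput(pool, 20.0)  # 20 s LLM calls
    print(f"100 ms requests: {fast:.0f} req/s")
    print(f"20 s requests:   {slow:.0f} req/s")
    print(f"capacity ratio:  {fast / slow:.0f}x")
```

The ratio depends only on the two service times, so adding threads raises both numbers without closing the gap between them.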
The standard objection is: “We will just use async/await or reactive programming.” This is partially correct. Non-blocking I/O helps. But it does not solve the user experience problem. The user is still staring at a spinner for 20 seconds. And research on user behavior is not subtle here: Google found that 53% of mobile site visits are abandoned when a page takes longer than 3 seconds to load. An AI feature that makes users wait 20 seconds does not get a pass because it is “AI.”
This is where products lose retention without ever seeing it in their AI metrics. Users do not report “the AI was slow.” They just stop using the product.
The Cascade That Takes Everything Down
Slow is one problem. What happens under load is another.
When your LLM API starts returning errors due to rate limits, provider outages, or network issues, a synchronous backend fails in the most expensive possible way. Threads holding open connections start accumulating. Your connection pool is exhausted. Now your entire application is down, not just the AI feature.
OpenAI and Anthropic both publish rate limits. At the time of writing, Tier 1 OpenAI accounts are rate-limited at roughly 500 requests per minute and 30,000 tokens per minute for GPT-4o. Hit those limits and you start seeing 429 errors. In a synchronous architecture, those 429s surface directly to users, and the retry logic, if you wrote any, is occupying threads that other requests need.
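It helps to translate those limits into how many calls can actually be in flight at once. The sketch below assumes the Tier 1 figures quoted above, the 3,000-token calls from the earlier example, and a 20-second call latency; `max_in_flight` is a hypothetical helper, not part of any provider SDK.

```python
def max_in_flight(rpm: int, tpm: int, tokens_per_call: int,
                  latency_s: float) -> float:
    """Upper bound on concurrent calls that stays inside both rate limits."""
    # Whichever limit allows fewer calls per minute is the binding one.
    calls_per_min = min(rpm, tpm / tokens_per_call)
    # Little's Law: calls in flight = calls per second * seconds per call.
    return calls_per_min / 60.0 * latency_s


if __name__ == "__main__":
    limit = max_in_flight(rpm=500, tpm=30_000, tokens_per_call=3_000,
                          latency_s=20.0)
    print(f"sustainable concurrent calls: {limit:.1f}")
```

Under these assumptions the token limit, not the request limit, is the binding one: roughly ten calls per minute, or about three calls in flight at a time. Any burst of traffic beyond that turns straight into 429s.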
Worse: every LLM provider has periodic incidents. When Anthropic or OpenAI has a degraded service window, your synchronous architecture propagates that degradation directly into your product. There is no buffer.
The cascade looks like this: provider slowdown → requests take longer → thread pool fills up → database connection pool fills up → your entire application appears down. A third-party AI API has taken your whole system offline.
Why Event-Driven Architecture Is the Structural Fix
The pattern maps directly onto what LLM workloads actually need. Instead of blocking on the API call, you publish an event — SQS, Kafka, RabbitMQ, whatever fits your stack — and return a job ID to the user immediately. Workers subscribe to the queue, make the LLM call, and push the result back when ready via SSE or a polling endpoint.
What makes this more than just “async” is that every specific failure mode described above has a direct fix in this model:
Cost control through worker concurrency. As long as each worker makes one call at a time, the number of simultaneous LLM calls can never exceed your worker count. You control that number. You can set it based on your API tier limits and your budget. You cannot overshoot. In a synchronous architecture, the number of concurrent LLM calls is however many requests hit your server at the same time, which you do not control.
Provider failures stop at the queue boundary. When OpenAI has a degraded period, your workers slow down and events accumulate in the queue. Your application keeps accepting requests and returning job IDs. Users see “processing” instead of “error.” The queue drains when the provider recovers. Your database pool and application threads are completely unaffected.
You get retries for free, with backoff. Dead letter queues handle the failure cases that would otherwise require complex try-catch logic spread across your synchronous call chain. A failed LLM call becomes an event that gets retried with exponential backoff. You can also inspect dead-lettered events to understand failure patterns.
Output tokens become a lever, not a liability. You can attach a token budget per event type. A “quick answer” event submits to the API with max_tokens: 500. A “detailed analysis” event gets max_tokens: 2000. You decide what each event type is allowed to cost before the request is ever processed.
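Two of these fixes, per-event token budgets and retries that end in a dead letter queue, can be sketched together. This is an in-process imitation under stated assumptions: in a real deployment the broker usually handles redelivery and dead-lettering for you. The `Event` shape, the `process` function, and the injected `call_llm` callable are all hypothetical; only the two `max_tokens` values come from the text.

```python
from __future__ import annotations

import random
import time
from dataclasses import dataclass
from typing import Callable

# Per-event-type output budgets; the two values are the ones from the text.
TOKEN_BUDGETS = {"quick_answer": 500, "detailed_analysis": 2000}


@dataclass
class Event:
    job_id: str
    event_type: str
    prompt: str
    attempts: int = 0


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Exponential backoff with full jitter: uniform in [0, min(cap, base * 2^attempt)]."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


def process(event: Event,
            call_llm: Callable[[str, int], str],
            dead_letters: list[Event],
            max_attempts: int = 3,
            sleep: Callable[[float], None] = time.sleep) -> str | None:
    """Call the LLM with the event's token budget; retry, then dead-letter."""
    max_tokens = TOKEN_BUDGETS[event.event_type]  # cost ceiling set up front
    while event.attempts < max_attempts:
        try:
            return call_llm(event.prompt, max_tokens)
        except Exception:
            event.attempts += 1
            if event.attempts < max_attempts:
                sleep(backoff_delay(event.attempts))
    dead_letters.append(event)  # keep the failed event for later inspection
    return None
```

Passing `call_llm` and `sleep` in as parameters keeps the sketch testable without a provider account. The jitter in `backoff_delay` is a design choice, not something the text prescribes: it stops a fleet of workers from retrying in lockstep when the provider recovers.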
What This Actually Looks Like
The core pattern is a three-component system:
The API layer accepts user requests, validates them, enqueues a job event with a unique ID, and returns that ID to the client in under 50ms. The user is not waiting for the LLM. They are waiting for the acknowledgment that their request was received.
The worker pool subscribes to the queue. Each worker pulls an event, calls the LLM API synchronously (from the worker’s perspective, this is fine — the worker is not holding a user connection open), stores the result, and publishes a “completed” event. You scale worker count based on your API rate limits, not user traffic.
The delivery layer gets the result to the user. The simplest approach is polling: the client hits /jobs/{id} every few seconds. Better is Server-Sent Events: the client opens an SSE connection and the server pushes chunks as the LLM streams them. Klarna’s AI assistant, which reportedly handled 2.3 million customer conversations in its first month after launch in January 2024, uses a push-based notification model precisely because polling introduces its own latency and server load.
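The three components can be sketched in a single process. This is a toy under stated assumptions: an in-memory `asyncio.Queue` stands in for SQS, Kafka, or RabbitMQ, a dict stands in for the job store, and `fake_llm` stands in for the provider call. Every name here is hypothetical, and only the polling variant of the delivery layer is shown.

```python
from __future__ import annotations

import asyncio
import uuid

# In-memory stand-in for a persistent job store.
jobs: dict[str, dict] = {}


async def fake_llm(prompt: str) -> str:
    """Placeholder for the slow provider call."""
    await asyncio.sleep(0.05)
    return f"answer to: {prompt}"


def submit(queue: asyncio.Queue, prompt: str) -> str:
    """API layer: record the job, enqueue it, return the ID immediately."""
    job_id = str(uuid.uuid4())
    jobs[job_id] = {"status": "pending", "result": None}
    queue.put_nowait((job_id, prompt))
    return job_id


async def worker(queue: asyncio.Queue) -> None:
    """Worker: one LLM call at a time, so worker count caps concurrency."""
    while True:
        job_id, prompt = await queue.get()
        jobs[job_id]["status"] = "processing"
        try:
            jobs[job_id]["result"] = await fake_llm(prompt)
            jobs[job_id]["status"] = "completed"
        except Exception:
            jobs[job_id]["status"] = "failed"
        finally:
            queue.task_done()


def get_job(job_id: str) -> dict:
    """Delivery layer (polling variant): what GET /jobs/{id} would return."""
    return jobs[job_id]


async def main(worker_count: int = 2) -> list[str]:
    queue: asyncio.Queue = asyncio.Queue()
    workers = [asyncio.create_task(worker(queue)) for _ in range(worker_count)]
    ids = [submit(queue, f"question {i}") for i in range(5)]
    await queue.join()   # wait until every job has been processed
    for w in workers:
        w.cancel()       # workers loop forever; stop them explicitly
    await asyncio.gather(*workers, return_exceptions=True)
    return ids


if __name__ == "__main__":
    for job_id in asyncio.run(main()):
        print(job_id, get_job(job_id)["status"])
```

Note that `submit` never awaits the LLM: it returns as soon as the job is recorded and enqueued. With `worker_count=2`, at most two provider calls are ever in flight, no matter how many jobs are submitted.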
For agent systems, where one LLM call kicks off tool executions that feed into more LLM calls, the event bus becomes the coordination mechanism. Each tool execution publishes an event with its result. The orchestrator subscribes, assembles context, and publishes the next LLM call event. You get natural observability — every step in the agent’s reasoning is an event log entry — and you can pause, replay, or inspect any point in the workflow.
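A stripped-down version of that coordination loop looks something like the following. It is a single-process sketch, assuming a `deque` as the event bus and canned stand-ins for the model and the tool; the event kinds and all function names are invented for illustration.

```python
from __future__ import annotations

from collections import deque
from dataclasses import dataclass


@dataclass
class AgentEvent:
    kind: str      # "llm_call", "tool_call", "tool_result", or "final"
    payload: dict


def fake_llm_step(context: list[str]) -> AgentEvent:
    """Stand-in for the model: request one tool call, then finish."""
    if not context:
        return AgentEvent("tool_call", {"tool": "lookup", "arg": "order 42"})
    return AgentEvent("final", {"answer": f"based on {context[-1]}"})


def fake_tool(tool: str, arg: str) -> str:
    """Stand-in for a tool execution."""
    return f"{tool}({arg}) -> shipped"


def run(question: str) -> tuple[str, list[AgentEvent]]:
    """Orchestrator: consume events, publish follow-ups, log every step."""
    bus = deque([AgentEvent("llm_call", {"question": question})])
    log: list[AgentEvent] = []
    context: list[str] = []
    while bus:
        event = bus.popleft()
        log.append(event)  # every step becomes an event log entry
        if event.kind == "llm_call":
            bus.append(fake_llm_step(context))
        elif event.kind == "tool_call":
            result = fake_tool(event.payload["tool"], event.payload["arg"])
            bus.append(AgentEvent("tool_result", {"result": result}))
        elif event.kind == "tool_result":
            context.append(event.payload["result"])
            bus.append(AgentEvent("llm_call", {"question": question}))
        elif event.kind == "final":
            return event.payload["answer"], log
    raise RuntimeError("bus drained without a final event")


if __name__ == "__main__":
    answer, log = run("where is my order?")
    print(answer)
    print([e.kind for e in log])
```

The returned `log` is the observability payoff: it records each LLM call, tool call, and tool result in order, so a run can be inspected step by step after the fact.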
OpenAI Already Told You This Was the Right Model
In 2024, OpenAI launched their Batch API, which processes asynchronous requests within 24 hours at a 50% discount compared to their standard synchronous endpoint. Their engineering rationale was explicit: batching requests allows them to use otherwise idle compute capacity.
The same logic applies to your system. Async workloads are cheaper to run than synchronous ones because they can be scheduled, throttled, and batched. The LLM providers know this. They are already giving you a discount to not be synchronous.
GitHub Copilot, which runs inference for millions of developers simultaneously, does not block the IDE while it waits for a completion. The suggestion appears when it is ready, or not at all if you have already typed past it. The architecture treats inference as a background process, not a blocking call — which is why the editor remains responsive even during inference.
Where This Gets Complicated
Event-driven architecture is not free. It introduces operational overhead that a synchronous system does not have.
You now have a queue to operate, monitor, and tune. Dead letter queue handling needs thought: what do you do with a job that fails three times? You need to design a status model: pending, processing, completed, failed. Your client needs to handle all those states gracefully.
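One way to keep that status model honest is to write the allowed transitions down explicitly. A minimal sketch follows; the four states come from the text, while the transition table, including the requeue edge for retries, is an assumption.

```python
from enum import Enum


class JobStatus(str, Enum):
    PENDING = "pending"
    PROCESSING = "processing"
    COMPLETED = "completed"
    FAILED = "failed"


# Allowed transitions. PROCESSING -> PENDING models a retry being requeued;
# that edge is an assumption, not something the text prescribes.
TRANSITIONS = {
    JobStatus.PENDING: {JobStatus.PROCESSING},
    JobStatus.PROCESSING: {JobStatus.COMPLETED, JobStatus.FAILED,
                           JobStatus.PENDING},
    JobStatus.COMPLETED: set(),  # terminal
    JobStatus.FAILED: set(),     # terminal
}


def advance(current: JobStatus, new: JobStatus) -> JobStatus:
    """Return the new status, or raise if the transition is not allowed."""
    if new not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current.value} -> {new.value}")
    return new
```

Rejecting illegal transitions in one place means a late or duplicated worker message cannot flip a completed job back to processing, which is exactly the kind of bug that mocks tend to hide.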
Testing gets harder. End-to-end tests now involve multiple components. Local development requires either running a real queue or mocking one, and mocks tend to paper over the real failure modes.
The right answer on complexity is not “therefore avoid EDA.” It is “be deliberate about where you introduce it.” Not every LLM call needs event-driven handling. A classification call that takes 2 seconds and powers a simple tag suggestion can stay synchronous. The economics of EDA make sense when you are dealing with calls that take 10+ seconds, touch multiple tool executions, need cost bounding, or need to stay resilient through provider instability. Agent workflows almost always qualify.
The Business Case, Stated Plainly
If your product has AI features, you have three choices on architecture:
One: stay synchronous, accept the thread exhaustion, the variable costs, and the user experience degradation at scale. This works fine at low traffic. It does not work as a growth plan.
Two: try to solve this with more clever synchronous code — connection pools, async frameworks, complex retry logic. You will spend engineering time on problems that have already been solved by queue-based systems. And you still will not get the cost control lever.
Three: adopt event-driven patterns for your LLM workloads. Accept the operational complexity upfront. Get cost predictability, fault isolation, and a user experience that does not degrade under load.
The businesses that are currently making money on AI features — not burning it — are running option three. They have decoupled their user-facing response time from their LLM processing time. They have put a hard ceiling on concurrent API calls. They treat LLM calls as jobs, not queries.
The businesses still running option one are the ones whose CFOs are looking at unexplained API bill spikes and waiting for the engineering team to explain them.

