The AI Agent Problem Nobody's Talking About

June 5, 2026

Everyone is racing to deploy AI agents into IT operations. Bigger models, longer context windows, faster chips. The pitch decks are perfect. The demos run clean, and then you put the thing in front of real workloads and it starts falling apart.

ITBench-AA — the first benchmark specifically designed for agentic tasks — shows frontier models scoring below 50%. Not legacy models. Not fine-tuned experiments. The best models available today, tested against the kind of tasks enterprise IT teams actually need them to handle. Below half. That number should be making more noise than it is.

The part that surprised me wasn’t the score itself. It was how predictable the failure modes are once you’ve seen them in production. These aren’t edge cases. They’re structural.

Here’s what actually happens when you try to run multi-user agent workloads at scale: continuous batching, the efficiency mechanism that makes LLM serving economically viable, starts breaking down. Batching only pays off while the KV cache for every in-flight request fits in accelerator memory, and agent sessions carry long, growing contexts, so far fewer of them fit at once. Server capacity constraints hit faster than your infrastructure team projected, and suddenly you’re not getting the throughput the benchmark promised. The gap between single-user demo performance and multi-user production performance is wide enough to drive a truck through, and most teams don’t discover it until they’re already committed.
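
A rough way to see where the capacity goes is to estimate how many sequences fit in KV-cache memory at once. This is back-of-envelope arithmetic, not a measurement: the model dimensions and the 40 GiB cache figure are hypothetical, and it ignores fragmentation, prefix sharing, and anything else a real serving stack does.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    """Hypothetical transformer dimensions, chosen only for illustration."""
    num_layers: int
    num_kv_heads: int
    head_dim: int
    bytes_per_value: int = 2  # fp16 / bf16


def kv_bytes_per_token(shape: ModelShape) -> int:
    # Each layer stores one key vector and one value vector per KV head.
    return (2 * shape.num_layers * shape.num_kv_heads
            * shape.head_dim * shape.bytes_per_value)


def max_concurrent_sequences(shape: ModelShape,
                             kv_cache_bytes: int,
                             tokens_per_sequence: int) -> int:
    # Upper bound on how many sequences can be batched together.
    per_sequence = kv_bytes_per_token(shape) * tokens_per_sequence
    return kv_cache_bytes // per_sequence


shape = ModelShape(num_layers=32, num_kv_heads=8, head_dim=128)
cache = 40 * 1024**3  # assumed memory left for KV cache: 40 GiB

demo_sessions = max_concurrent_sequences(shape, cache, 2_000)    # short demo prompts
agent_sessions = max_concurrent_sequences(shape, cache, 32_000)  # long agent sessions
print(demo_sessions, agent_sessions)  # 163 10
```

Under these assumptions, the same hardware that batches 163 short demo requests holds only 10 long-running agent sessions. The exact numbers don't matter; the ratio is the gap between the demo and production.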

Then there’s the tokenization problem, which is underappreciated. As context windows expand, tokenization failures start degrading multilingual pipelines in ways that are hard to catch before they’re in front of real users.

A tokenizer that handles English cleanly can quietly mangle inputs in other languages — and the model doesn’t always know it’s working with corrupted input. It just produces confident, wrong output. I’ve seen this cause cascading failures in automation workflows where the agent is supposed to be parsing structured logs. The logs look fine. The tokenization doesn’t.
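
One cheap guard is to inspect text before it ever reaches the tokenizer. The sketch below uses only the Python standard library and does not call any real tokenizer; `inspect_text` and `decode_log_line` are hypothetical names, and the three checks are examples of what to look for, not a complete list.

```python
import unicodedata


def inspect_text(text: str) -> list[str]:
    """Return warnings about text that is likely to tokenize badly."""
    warnings = []
    # U+FFFD usually means bytes were decoded with the wrong encoding.
    if "\ufffd" in text:
        warnings.append("replacement-character")
    # Decomposed and precomposed forms look identical but tokenize differently.
    if unicodedata.normalize("NFC", text) != text:
        warnings.append("not-nfc-normalized")
    # Control characters other than tab/newline/carriage return are suspicious in logs.
    if any(unicodedata.category(ch) == "Cc" and ch not in "\t\n\r" for ch in text):
        warnings.append("control-character")
    return warnings


def decode_log_line(raw: bytes, encoding: str = "utf-8") -> tuple[str, list[str]]:
    """Decode a raw log line and report anything worth flagging."""
    text = raw.decode(encoding, errors="replace")
    return text, inspect_text(text)


# A Latin-1 encoded German log line read as UTF-8: it decodes without an
# exception, but the umlaut has silently become U+FFFD.
text, warnings = decode_log_line("Störung auf db1".encode("latin-1"))
print(warnings)  # ['replacement-character']
```

The point of returning warnings instead of raising is that the agent pipeline can decide what to do: quarantine the line, re-decode it, or at least record that the model was handed damaged input.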

The industry’s response to all of this has been to reach for custom fine-tunes. Which makes sense on paper. If plug-and-play frontier models underperform on your specific domain, you tune them on your data. The problem is that fine-tuned models still underperform on agent-specific tasks — particularly tool use and IT automation — and now you’ve added a maintenance burden on top of a model that still can’t reliably execute a multi-step remediation workflow without hallucinating a step somewhere in the middle.

Long-running agents introduce another layer of operational complexity that doesn’t show up in benchmarks: context bloat. As the agent accumulates conversation history, tool call results, and intermediate reasoning, you eventually need context pruning pipelines just to keep inference efficient. That’s not a feature you bolt on later. It’s an architectural decision you need to make before you deploy, and most teams aren’t making it because they’re still focused on getting the model to work at all.
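
A minimal sketch of what a pruning step can look like, assuming a simple policy: always keep system messages, then keep the most recent turns that fit a token budget. The `Message` type, the policy, and the four-characters-per-token estimate are all assumptions for illustration; a real pipeline would count tokens with the model's own tokenizer and would probably summarize old turns instead of dropping them.

```python
from dataclasses import dataclass


@dataclass
class Message:
    role: str      # "system", "user", "assistant", or "tool"
    content: str


def estimate_tokens(message: Message) -> int:
    # Crude stand-in (about 4 characters per token); swap in a real tokenizer.
    return max(1, len(message.content) // 4)


def prune_history(messages: list[Message], budget: int) -> list[Message]:
    """Keep every system message, then the newest other messages that fit."""
    system = [m for m in messages if m.role == "system"]
    remaining = budget - sum(estimate_tokens(m) for m in system)

    kept_newest_first = []
    for message in reversed([m for m in messages if m.role != "system"]):
        cost = estimate_tokens(message)
        if cost > remaining:
            break  # stop at the first message that no longer fits
        kept_newest_first.append(message)
        remaining -= cost

    # Restore chronological order; system messages are moved to the front.
    return system + kept_newest_first[::-1]


history = [
    Message("system", "s" * 40),      # ~10 tokens
    Message("user", "u" * 400),       # ~100 tokens
    Message("tool", "t" * 800),       # ~200 tokens, a large tool result
    Message("assistant", "a" * 80),   # ~20 tokens
    Message("user", "q" * 40),        # ~10 tokens
]
pruned = prune_history(history, budget=50)
print([m.role for m in pruned])  # ['system', 'assistant', 'user']
```

Even this toy version forces the architectural questions: what is never dropped, what counts as a token, and what happens to a tool result that is bigger than the whole budget.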

What I keep noticing is that the teams struggling most aren’t the ones with bad models or bad data. They’re the ones who designed their agent architecture around the assumption that the model would be the reliable part. The model is not the reliable part. The model is the part that scores below 50% on the benchmark designed to measure it. The part whose reliability you actually control, for better or worse depending on how you’ve built it, is everything around the model: the tokenization pipeline, the batching strategy, the context management, the tool call validation layer.
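
To make the last item concrete, here is a sketch of a tool call validation layer: nothing the model emits gets executed until it parses, names a known tool, and carries exactly the expected arguments. The JSON shape (`tool` and `args` keys) and the two tool names are hypothetical, not any particular framework's format.

```python
import json

# Hypothetical tool registry: tool name -> required argument names and types.
TOOL_SPECS = {
    "restart_service": {"host": str, "service": str},
    "get_disk_usage": {"host": str},
}


def validate_tool_call(raw: str) -> tuple[bool, str]:
    """Check a model-emitted tool call before anything is executed."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return False, "not valid JSON"
    if not isinstance(call, dict):
        return False, "call is not an object"

    name = call.get("tool")
    if not isinstance(name, str) or name not in TOOL_SPECS:
        return False, f"unknown tool: {name!r}"

    args = call.get("args")
    if not isinstance(args, dict):
        return False, "args is not an object"

    spec = TOOL_SPECS[name]
    missing = sorted(set(spec) - set(args))
    if missing:
        return False, f"missing args: {missing}"
    unexpected = sorted(set(args) - set(spec))
    if unexpected:
        return False, f"unexpected args: {unexpected}"
    for key, expected_type in spec.items():
        if not isinstance(args[key], expected_type):
            return False, f"wrong type for {key!r}"
    return True, "ok"


print(validate_tool_call('{"tool": "get_disk_usage", "args": {}}'))
# (False, "missing args: ['host']")
```

A rejected call is also a useful signal in its own right: it is a model failure caught at a known boundary, not a mystery three steps later.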

The obsession with model capability is real and understandable. Capability is measurable. It shows up in demos. It’s easy to communicate to stakeholders. What’s harder to communicate is that your serving infrastructure will saturate before your model hits its capability ceiling, or that a tokenization mismatch in your multilingual pipeline will produce failures that look like model errors but aren’t.

I’m not convinced the “bigger model” instinct is wrong, exactly. But I’m starting to think it’s solving for the wrong bottleneck. The score on ITBench-AA isn’t going to double because the next model has a longer context window. The operational friction lives somewhere else.

Most teams I’ve spoken to can’t cleanly separate model failures from infrastructure failures in their production logs. That seems like the more urgent problem.
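
One way to start is to tag every failure with the layer it came from at the moment it is logged. The sketch below is a minimal version of that idea: the layer names, the two custom exception classes, and the record fields are hypothetical, and the mapping from exception type to layer is an assumption that each team would need to adapt to its own stack.

```python
import json
from enum import Enum


class FailureLayer(Enum):
    INFRASTRUCTURE = "infrastructure"   # timeouts, dropped connections, out of memory
    INPUT_PIPELINE = "input_pipeline"   # decoding or tokenization problems
    MODEL_OUTPUT = "model_output"       # rejected tool calls, invalid plans
    UNCLASSIFIED = "unclassified"


class InputPipelineError(Exception):
    """Hypothetical: raised when input checks before tokenization fail."""


class ModelOutputError(Exception):
    """Hypothetical: raised when the model's output fails validation."""


def classify_failure(exc: Exception) -> FailureLayer:
    if isinstance(exc, (TimeoutError, ConnectionError, MemoryError)):
        return FailureLayer.INFRASTRUCTURE
    if isinstance(exc, (UnicodeError, InputPipelineError)):
        return FailureLayer.INPUT_PIPELINE
    if isinstance(exc, ModelOutputError):
        return FailureLayer.MODEL_OUTPUT
    return FailureLayer.UNCLASSIFIED


def failure_record(request_id: str, exc: Exception) -> str:
    """Build one structured log line that can be filtered by layer."""
    return json.dumps({
        "request_id": request_id,
        "layer": classify_failure(exc).value,
        "error_type": type(exc).__name__,
        "detail": str(exc),
    }, sort_keys=True)


print(failure_record("req-1", TimeoutError("queue wait exceeded")))
print(failure_record("req-2", ModelOutputError("unknown tool")))
```

With a `layer` field on every record, "how many of last week's failures were the model?" becomes a filter on the logs instead of an argument.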