The Production Reality of AI Agents: What Datadog, T-Mobile, and CrewAI Are Actually Shipping

Reviewing AI-generated code is now harder than writing it from scratch. That’s the inversion Datadog’s Chief Scientist Ameet Talwalkar dropped during the opening keynote at this week’s AI Agent Conference in New York, and it captures the tension running through the entire enterprise agent conversation: shipping is easy; trusting what you shipped is the actual job. Across the keynotes and exhibit floor, the message from Datadog, T-Mobile, CrewAI, Akamai, and RingCentral was consistent: autonomy is no longer the headline. Validation, simulation, and human supervision are.

Why Vibe-Coded Software Broke the Review Process

Talwalkar told the audience that “one of the hardest things for humans to do is no longer building production systems. It’s actually reviewing the vibe-coded software that gets shipped into production.” Datadog is responding by extending its observability product line to model real-world systems and predict production issues caused by AI agents before they happen.

The entire DevOps stack was built on the assumption that humans wrote most of the code and machines monitored the running result. When agents write the code and other agents run it, the human review layer becomes the bottleneck — not the keyboard. Observability vendors that can flag agent-generated regressions before they hit traffic are going to own a category that didn’t exist eighteen months ago.

If you’re a platform team that just enabled Claude or Copilot for everyone, the practical implication is that your existing pull-request review SLA is about to look ridiculous. The volume of generated diffs will outpace your reviewers within a quarter. Expect to see “agent-aware” linting, runtime guardrails, and pre-merge simulation become standard line items in 2026 budgets.
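To make the review-capacity concern concrete, here is a minimal back-of-the-envelope sketch in Python. Every number in it (100 pull requests a week, 5% weekly growth, capacity for 150 reviews a week) is a made-up assumption for illustration, not a figure from Datadog or the conference, and the function name is hypothetical.

```python
def weeks_until_overload(prs_per_week, weekly_growth, review_capacity, horizon_weeks=52):
    """Return the first week in which incoming PRs exceed what reviewers
    can clear, or None if that never happens within the horizon."""
    incoming = prs_per_week
    for week in range(1, horizon_weeks + 1):
        if incoming > review_capacity:
            return week
        # Assumption: agent-generated diff volume compounds week over week.
        incoming *= 1 + weekly_growth
    return None


# Hypothetical inputs: replace with your own team's numbers.
first_overloaded_week = weeks_until_overload(
    prs_per_week=100, weekly_growth=0.05, review_capacity=150
)
print(first_overloaded_week)
```

With these illustrative inputs the queue overtakes reviewers in week 10, inside a single quarter. The point is the shape of the curve, not the specific values: any compounding growth in diff volume eventually crosses a flat review capacity.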

How T-Mobile Scaled to 200,000 Agent Conversations a Day

Julianne Roberson, Director of AI Engineering at T-Mobile, said the carrier now uses AI agents to handle 200,000 customer conversations a day — and it took roughly a year to get there. Customer service remains the most popular enterprise application for agents, and T-Mobile’s number is the kind of figure boards actually understand.

What’s instructive isn’t the scale; it’s the timeline. A year is not the “deploy in an afternoon” story that demo videos sell. Real enterprise rollout means red-teaming, escalation paths, compliance review, and integration with legacy CRM and billing systems. The companies winning here are the ones that treated the agent as one component inside a larger workflow, not as a magical replacement for the call center. Teams weighing AI agents versus traditional automation should read the T-Mobile timeline as the realistic floor, not the ceiling.

If your roadmap assumes a customer-facing agent ships in a quarter, halve your scope or double your timeline. The teams now handling six-figure daily conversation volumes are the ones that started early.

The Simulation Layer Nobody Was Selling Last Year

Zhou Yu, co-founder and CEO of ArklexAI, was blunt with The New Stack: “You can use Claude Code to build an agent in five minutes, but you don’t know what it will do when it goes into production, especially when you have a large group of customers.” The company’s new ArkSim product simulates AI-agent interactions with synthetic users and collects data to improve quality, because agentic interactions aren’t deterministic.

The market underestimated this layer. Traditional QA assumes a fixed input produces a fixed output. Agents don’t work that way — the same chatbot can give different answers to the same question, as Akamai CTO Bobby Blumofe pointed out in his own keynote. Simulation tools sit in the gap between unit tests and production traffic, and they’re going to become as standard as load testing was for web apps.
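As a rough illustration of what sits between unit tests and production traffic, the sketch below replays the same questions many times against an agent and reports how consistent the answers are. It is not ArkSim's API or method: `stub_agent`, `simulate`, the questions, the answer probabilities, and the 95% threshold are all hypothetical stand-ins.

```python
import random
from collections import Counter


def stub_agent(question, rng):
    """Hypothetical stand-in for a real agent call; answers vary run to run."""
    answers = {
        "refund window?": (["30 days", "14 days"], [0.8, 0.2]),
        "store hours?": (["9 to 5"], [1.0]),
    }
    options, weights = answers[question]
    return rng.choices(options, weights=weights, k=1)[0]


def simulate(agent, questions, runs_per_question, seed=0):
    """Ask each question repeatedly and summarize how stable the answers are."""
    rng = random.Random(seed)
    report = {}
    for question in questions:
        counts = Counter(agent(question, rng) for _ in range(runs_per_question))
        top_answer, top_count = counts.most_common(1)[0]
        report[question] = {
            "top_answer": top_answer,
            "consistency": top_count / runs_per_question,
            "distinct_answers": len(counts),
        }
    return report


report = simulate(stub_agent, ["refund window?", "store hours?"], runs_per_question=500)

# Assumed policy: flag any question whose most common answer
# appears in fewer than 95% of runs.
flagged = [q for q, r in report.items() if r["consistency"] < 0.95]
print(flagged)
```

A traditional assertion such as "the answer equals 30 days" would pass or fail depending on which run you happened to catch. Measuring the distribution of answers over many runs is what surfaces the inconsistency.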

Yu also noted that agent frameworks themselves have commoditized — Walmart still uses Arklex’s original framework, but the company pivoted to simulation because the framework layer no longer differentiates. CrewAI’s Joe Moura made a similar point from the keynote stage: “Initially, it was all about building and deploying agents. But now it’s all about security and enterprise adoption.” Moura said CrewAI’s next bet is on “entangled agents” — agents that adapt automatically to what each customer is doing with them, becoming unique to that company over time.

Prediction: by the end of 2026, simulation-as-a-service will be a line item every serious agent team budgets for, and the framework wars everyone fought in 2024 will look as dated as the JavaScript framework wars of 2016.

Knowledge Graphs, Context, and the Hallucination Tax

Blumofe was direct about the underlying problem: “As you all probably know, most chatbots, when they sample from an LLM, sample probabilistically. The same chatbot can give you different answers at different times.” Agents that rely solely on an LLM, he said, are unlikely to produce accurate results. Pulling information from web search into the context window — and increasingly, from structured knowledge graphs — is the fix the industry has converged on.
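The probabilistic sampling Blumofe describes can be shown with a toy example. The sketch below draws a "next token" from a softmax over scores, with a temperature parameter; the candidate answers and their scores are invented for illustration and do not come from any real model.

```python
import math
import random


def sample_token(logits, temperature, rng):
    """Pick one token from a dict mapping token -> score (logit)."""
    if temperature == 0:
        # Greedy decoding: always return the highest-scoring token.
        return max(logits, key=logits.get)
    scaled = {tok: score / temperature for tok, score in logits.items()}
    peak = max(scaled.values())
    # Subtract the peak before exponentiating for numerical stability.
    weights = {tok: math.exp(s - peak) for tok, s in scaled.items()}
    tokens = list(weights)
    return rng.choices(tokens, weights=[weights[t] for t in tokens], k=1)[0]


# Hypothetical scores for three candidate answers to the same question.
LOGITS = {"30 days": 2.0, "14 days": 1.5, "60 days": 0.5}

rng = random.Random(7)
greedy_answers = {sample_token(LOGITS, 0, rng) for _ in range(200)}
sampled_answers = {sample_token(LOGITS, 1.0, rng) for _ in range(200)}

print(greedy_answers)   # a single answer every time
print(sampled_answers)  # more than one distinct answer
```

Greedy decoding removes the variation but does not make the top-scoring answer correct, which is why grounding the context with retrieved facts matters more than turning the temperature down.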

Chang She, founder and CEO of LanceDB, told The New Stack that LanceDB has been adopted as a storage plug-in for OpenClaw and now ships a new Lance Graph project so teams can store knowledge graphs alongside voice, video, text, structured, and unstructured data in the same format. Unifying modalities in one store is what makes RAG and graph-augmented retrieval actually maintainable at scale.

The practical scenario: if your team is debating whether to bolt a vector store onto an existing data lake or move to a unified multi-modal format, the answer is increasingly the latter. The cost of stitching three retrieval systems together exceeds the cost of migrating once. That’s also why the custom-vs-SaaS AI decision keeps pushing serious teams toward purpose-built stacks — off-the-shelf chatbots can’t expose the retrieval internals you need to debug a hallucination at 2 a.m.

RingCentral’s Tim Dreyer offered the cleanest framing of where this lands operationally. After shipping an AI Conversation Expert that analyzes call recordings for coaching insights, the company added an AI Receptionist agent. “Our goal isn’t to eliminate a live agent,” Dreyer said. “If we can offload fifty or sixty percent of the tedious stuff that they have to do, that leaves them more time for strategic work.” That framing — assistive offload, not replacement — is exactly how successful AI automation deployments are scoped today.

Why Human Supervision Quietly Won the Debate

When Bill Gates wrote about AI agents in 2023, autonomy was the headline feature. At this conference, almost no speaker treated autonomy as the driving force for adoption. They treated it as a long-term destination, contingent on resolving errors. The replace-vs-supercharge debate has landed on supercharge — every speaker in the enterprise track said human supervision stays, regardless of the task.

That’s a meaningful retreat from the 2024 narrative. It’s also more honest. The teams shipping at T-Mobile-scale didn’t get there by removing humans; they got there by giving humans better tools to oversee a much larger surface area.

FAQ

Q: What is the vibe-coded software problem? A: It refers to code generated quickly by AI agents (often via tools like Claude Code or Copilot) and shipped to production without the rigorous review that traditional code receives. Datadog’s Talwalkar argues the bottleneck has shifted from writing code to reviewing the AI-generated output before it causes incidents.

Q: Why are companies investing in agent simulation tools? A: Because agent behavior is non-deterministic — the same input can produce different outputs. Tools like ArklexAI’s ArkSim create synthetic user populations to stress-test agents before they reach real customers, surfacing failure modes that traditional QA misses.

Q: Are AI agents replacing customer service teams? A: Based on the conference, no. RingCentral framed its agents as offloading 50–60% of the tedious work so that human agents have more time for strategic work. No speaker pitched full replacement as a near-term goal.

Key Takeaways

Budget for code review and observability tooling to scale with your AI-generated code volume — the review queue, not the build pipeline, is your next bottleneck.
If your enterprise agent roadmap promises production-scale customer deployment in under six months, plan for the T-Mobile reality: roughly a year of integration, governance, and validation work.
Agent simulation will become a standard pre-production layer; teams that treat it as optional will discover regressions in customer traffic instead of in test environments.
Unify your retrieval stack — vector, graph, and multi-modal — into one format before you scale, or pay a compounding integration tax later.
Frame agent projects as offload-and-coach rather than replace-and-eliminate; the deployments getting renewed are the ones that made human operators measurably more effective.
