460,000 AI Calls in 9 Days: Why Token Leaderboards Are Quietly Wrecking Enterprise AI ROI

One Disney employee fired off 460,000 interactions with Claude AI in nine days, according to Business Insider. That isn’t productivity — that’s a leaderboard doing exactly what leaderboards do: turning a tool into a contest, and a contest into a cost center. Across Amazon, JP Morgan, Meta, and Disney, internal AI usage rankings are now standard practice, and the industry has even coined a name for the behavior they’re producing: tokenmaxxing. For the executives signing the AI invoices, this may be the most expensive measurement mistake of the decade.
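To put that figure in perspective, a quick back-of-the-envelope calculation helps. The sketch below assumes the interactions were spread evenly around the clock across all nine days; that even spread is an illustrative assumption, since the report does not describe how the usage was actually distributed.

```python
# Back-of-the-envelope rate for 460,000 interactions in 9 days.
# Assumption: usage is spread evenly over every hour of every day.
INTERACTIONS = 460_000
DAYS = 9

per_day = INTERACTIONS / DAYS
per_hour = per_day / 24
per_minute = per_hour / 60

print(f"{per_day:,.0f} per day")        # 51,111 per day
print(f"{per_hour:,.0f} per hour")      # 2,130 per hour
print(f"{per_minute:.1f} per minute")   # 35.5 per minute
```

At roughly 35 calls a minute with no breaks, the volume suggests scripted or automated use rather than a person typing prompts.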

Why Token Counts Became the Default Enterprise AI Metric

Tracking AI adoption inside large organizations creates a real dilemma for IT leaders. The ultimate scoreboard is ROI, but you can’t get to ROI unless employees actually use the tools the company has paid for. Token consumption became the proxy of choice because it’s the easiest signal to collect — and because, as Pendo CEO Todd Olson points out, someone using zero tokens is by definition extracting zero value.

That logic works for exactly one phase: the cold start. Olson calls it the “zero-to-one problem” — the initial inertia of getting people to break old habits and try something new. Once adoption clears that bar, token counts stop telling you anything useful about whether the spending is producing business outcomes. The metric that helped you launch becomes the metric that misleads you.

If you’re a CIO rolling out an enterprise copilot to 10,000 employees, tokens are a fine week-one signal. By month three, you need to know whether the marketing team is shipping campaigns faster and whether the engineers are merging more code — not who hit the API the most times. The companies that don’t graduate to outcome metrics are the ones about to be surprised by their cloud bill.

The Tokenmaxxing Effect on Real Budgets

The behavior that gamified leaderboards reward is exactly the behavior finance doesn’t want. Trevor Stewart, senior vice president at Harness, said in the original report that token leaderboards started from good intentions: companies wanted visibility into who was adopting the tools. The unintended consequence is that employees stop thinking about cost entirely and start routing trivial tasks through the most expensive frontier models available.

Stewart’s analogy: it’s like using a complex power tool for a job a screwdriver could finish. Multiply that across a workforce competing for the top spot, and the original report notes that some companies have seen top users rack up costs in the millions of dollars. Logan Wolfe, a partner at Kyndryl, adds that the metric is also trivially easy to game — and warns that with energy costs putting upward pressure on per-token and per-inference pricing, the unit economics of AI projects are more likely to get worse before they get better.

Imagine you’re a 5,000-person financial services firm where every analyst has access to a premium reasoning model. A quarterly leaderboard launches. Within weeks, analysts are running multi-step reasoning chains to summarize one-paragraph emails, because the prompt that would have cost a fraction of a cent on a smaller model now costs orders of magnitude more — and racks up leaderboard points faster. The CFO will notice. Within the next 18 months, expect at least one publicly named enterprise to disclose an AI cost overrun explicitly tied to internal gamification, and expect a wave of vendors to pitch “token efficiency” dashboards as the fix.
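The cost gap in that scenario can be sketched with simple arithmetic. Every number below is a hypothetical placeholder: the per-million-token prices are not real vendor pricing, and the token counts are assumed purely to illustrate how a longer reasoning chain on a pricier model multiplies the bill.

```python
# Hypothetical per-million-token prices in USD; NOT real vendor pricing.
SMALL_MODEL_PRICE = 0.20      # assumed price, small model
PREMIUM_MODEL_PRICE = 20.00   # assumed price, premium reasoning model


def request_cost(tokens: int, price_per_million: float) -> float:
    """Cost in USD of one request that consumes `tokens` tokens."""
    return tokens * price_per_million / 1_000_000


# Assumed token counts: a short summary on the small model versus a
# multi-step reasoning chain over the same one-paragraph email.
small = request_cost(tokens=500, price_per_million=SMALL_MODEL_PRICE)
premium = request_cost(tokens=5_000, price_per_million=PREMIUM_MODEL_PRICE)

print(f"small model:   ${small:.4f}")
print(f"premium model: ${premium:.4f}")
print(f"ratio: {premium / small:.0f}x")
```

Under these assumed inputs the premium request costs about a thousand times more than the small one. The real ratio depends entirely on actual pricing and prompt length, but the shape of the problem is the same: both the price and the token count rise at once.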

Measuring the Wrong Thing in the Wrong Place

Qodo CEO Itamar Friedman offers an analogy: tracking token usage alone is like tracking how far you walk every day while ignoring how many calories you eat. Walk two miles, eat 5,000 calories, and your health isn’t improving, no matter how proud the pedometer is.

The problem cuts deepest in software engineering, where some companies are now tracking developer token consumption directly. Friedman warns that if developers are pushed to maximize AI-generated code output without proportional investment in review and security validation, the result is code shipped with serious bugs and vulnerabilities baked in. Wolfe makes the same point with a different analogy: rewarding the developer who writes the most lines of code produces bloated applications, not better ones. The moment token usage becomes a KPI, raw output crowds out efficiency, quality, and risk reduction. Teams picking AI metrics should read “AI agents vs AI automation” first.

If you’re a SaaS engineering org pushing your developers to use an AI coding assistant, the leaderboard you want isn’t tokens consumed — it’s code that survived review and reached production. Everything else is theater. And theater compounds: AI-generated code that gets rejected in review is double-billed waste — you paid for the tokens, and you paid for the human time to throw the output away.

What a Sane Enterprise AI Scorecard Looks Like

Stewart’s recommendation is to design gamification around the behaviors and incentives the business actually cares about. At Harness, he says, what matters is the output delivered, not the tokens consumed. Productivity metrics will differ by function: for developers using AI coding tools, the meaningful number isn’t lines written, it’s lines deployed to production.

He proposes a four-part framework: optimizable cost, wasted cost, tokens consumed, and actual output. Tracking all four together is what separates an AI program that compounds value from one that compounds spend. A wasted-cost line item alone — capturing the dollars spent on code that got rejected, reverted, or never shipped — would change how most enterprises evaluate their copilots overnight. Organizations building AI-integrated software for regulated industries already have to think this way, because audit trails leave no room for vanity metrics.
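A minimal sketch of how the four numbers might sit side by side in one report follows. The record fields, the `scorecard` function, and the sample data are all hypothetical. The definition of optimizable cost used here (overspend on shipped work relative to the cheapest adequate model) is an assumption, since the source names the category without defining it.

```python
from dataclasses import dataclass


@dataclass
class AiTask:
    """One AI-assisted unit of work. All fields are hypothetical."""
    tokens: int          # tokens consumed
    cost_usd: float      # what was actually paid
    cheapest_usd: float  # assumed cost on the cheapest adequate model
    shipped: bool        # True if the result reached production


def scorecard(tasks: list[AiTask]) -> dict[str, float]:
    """Summarize the four dimensions for a batch of tasks."""
    shipped = [t for t in tasks if t.shipped]
    return {
        "tokens_consumed": sum(t.tokens for t in tasks),
        # Spend on output that was rejected, reverted, or never shipped.
        "wasted_cost": sum(t.cost_usd for t in tasks if not t.shipped),
        # Assumed definition: overspend on shipped work vs. a cheaper model.
        "optimizable_cost": sum(t.cost_usd - t.cheapest_usd for t in shipped),
        # Count of tasks whose result reached production.
        "actual_output": len(shipped),
    }


# Made-up sample data for illustration only.
tasks = [
    AiTask(tokens=12_000, cost_usd=0.50, cheapest_usd=0.25, shipped=True),
    AiTask(tokens=40_000, cost_usd=2.00, cheapest_usd=0.50, shipped=False),
    AiTask(tokens=8_000, cost_usd=0.25, cheapest_usd=0.25, shipped=True),
]
print(scorecard(tasks))
```

In this made-up sample, the single unshipped task accounts for two thirds of the tokens and most of the spend, which is exactly the kind of pattern a tokens-only leaderboard would reward rather than flag.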

FAQ

Q: What is tokenmaxxing? A: Tokenmaxxing is the workplace behavior of inflating AI tool usage to climb internal leaderboards or hit consumption targets, regardless of whether the work produces value. It typically manifests as employees routing simple tasks through expensive frontier models, generating excess prompts, or using AI where a simpler tool would suffice.

Q: Why is tracking AI token usage as a KPI risky? A: Token usage is easy to collect and easy to manipulate, which makes it a poor proxy for productivity. Logan Wolfe of Kyndryl warns that the moment tokens become a core KPI, raw output is prioritized over efficiency, quality, and risk reduction — and with per-token costs unlikely to fall soon, the financial damage compounds.

Q: What should enterprises measure instead of token usage? A: According to Harness’s Trevor Stewart, organizations should track four dimensions together: optimizable cost, wasted cost, tokens consumed, and actual output, such as whether AI-assisted code reached production. The goal is to measure outputs, not inputs.

Key Takeaways

Token leaderboards are useful only during the zero-to-one adoption phase; keeping them as your primary metric after that actively destroys ROI.
Expect a wave of “token efficiency” tooling and audit dashboards to emerge as enterprises confront their first AI budget overruns tied to gamification.
Engineering organizations should replace token-based developer metrics with deployment-based ones — code in production beats code in prompts.
Build a wasted-cost line item into your AI reporting; the dollars spent on AI output that never ships are the fastest way to expose a broken incentive system.
Procurement and finance need to sit at the AI metrics table now, not after the first invoice shock, because per-inference pricing pressure isn’t easing anytime soon.
