Evals: the strategic IP that will define the next era of AI

We've spoken to hundreds of execs in the past few months, and we're hearing a clear refrain: "AI isn't delivering ROI yet, but we're all in, so we need to figure it out."

Execs know there's no going back. But at most large companies, AI programs are stalling at the pilot stage because of inconsistent output quality, too little confidence to hand over real work, uncertainty about security risks, and token cost spikes. Put another way: how many business leaders can actually quantify how accurate their AI programs are?

Everyone is coming to the same realization: if you want production-quality agents that can actually do the work, it starts with evals.

Satya Nadella is the latest leader to zero in on evals as strategic IP. He makes the case eloquently and forcefully: “Companies need to turn their workflows, domain knowledge, and accumulated judgment into AI systems that improve with each use. Private evals should capture whether a model is actually improving against outcomes that matter to the business (not just external benchmarks!)” (https://x.com/satyanadella/status/2066182223213293753).

So what are evals? Short for "evaluations," they're a comprehensive, rigorous framework to systematically measure and improve an AI system. We're not talking thumbs up/down or even human review of agent outputs. A strong evaluation suite captures the nuances of judgment, tone, and taste; assesses agentic use of tools; breaks down tasks into specific, scorable dimensions (a "rubric"); and is typically deployed inside a simulation or reinforcement learning environment, where agents can be run repeatedly and trained to improve performance over time.
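To make "specific, scorable dimensions" concrete, here is a minimal sketch of a weighted rubric in Python. Everything in it is an assumption for illustration: the Dimension class, the score_output function, the refund scenario, and the simple string checks are hypothetical stand-ins. A real suite would typically replace the string checks with code-level assertions or model-graded criteria.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Dimension:
    """One scorable dimension of a rubric (hypothetical structure)."""
    name: str
    weight: float
    check: Callable[[str], float]  # returns a score between 0.0 and 1.0


def score_output(output: str, rubric: list[Dimension]) -> dict[str, float]:
    """Score one agent output per dimension, plus a weighted overall score."""
    scores = {d.name: d.check(output) for d in rubric}
    total_weight = sum(d.weight for d in rubric)
    scores["overall"] = sum(d.weight * scores[d.name] for d in rubric) / total_weight
    return scores


# Hypothetical rubric for a customer-refund reply; the checks are deliberately simple.
REFUND_RUBRIC = [
    Dimension("states_refund_amount", 2.0, lambda out: 1.0 if "$42.00" in out else 0.0),
    Dimension("apologizes", 1.0, lambda out: 1.0 if "sorry" in out.lower() else 0.0),
    Dimension("concise", 1.0, lambda out: 1.0 if len(out.split()) <= 40 else 0.0),
]

good = score_output(
    "We're sorry for the trouble. Your refund of $42.00 is on its way.",
    REFUND_RUBRIC,
)
weak = score_output("Refund issued.", REFUND_RUBRIC)
print(good["overall"], weak["overall"])
```

The point of the per-dimension breakdown is that a low overall score tells you which part of "good" the agent missed, not just that it missed.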

The best companies treat agentic evals as a core quality, reliability, and governance layer, far beyond the ad hoc testing or pre-launch checks most teams rely on today.

Over the past two years we ‘refounded’ Handshake as an AI company. Today we're a leading provider of evals to both frontier LLM labs and Fortune 500 enterprises. Our Handshake AI research team is pioneering research on verifiers, and we're working with visionary leaders at the world's biggest enterprises to shape their AI strategy. A few themes are becoming clear.

Evals must be a cornerstone of a comprehensive approach to drive business impact from AI. Here are the five pillars we're seeing, which I'll expand on in future posts:

1. It all starts with evals. AI performance is entirely defined by the evaluation suite used to measure it: you can only track performance to the extent you've accurately defined what "good" looks like. Leading organizations now build evaluations into a simulation to improve AI in a controlled environment before deployment in the real world. Domain experts curate historical data and plant deliberate edge cases (corrupted text, contradictory instructions) to pressure-test the model. The simulation then scores every update against objective rubrics, whether exact-match string parsing, code-level assertions, or LLM-as-judge criteria, turning AI development from a guessing game into a predictable engineering discipline.

2. Each function needs a distinct AI strategy. A complex enterprise requires a segmented approach: where to build, buy, optimize, or train, by business unit. A mid-sized insurer should probably buy a coding agent off the shelf and pay for frontier tokens, while also building proprietary agents that encode its unique underwriting decisions as a sovereign IP asset. In customer service, vertical solutions optimized for RAG often make more sense, but they still require real setup, maintenance, and ongoing evals. In the world of agents, performance management is evals.

3. Don't overlook safety and security. Many leaders assume their cyber risk is handled because they secured cloud infrastructure and apps during the SaaS era. The agentic AI era introduces new vulnerabilities: standard firewalls don't stop prompt injection attacks or prevent proprietary data from leaking into public training loops. Securing a mid-sized enterprise means deploying data-scrubbing pipelines to strip identifiers before queries leave the network, and input-validation layers to neutralize malicious prompts before they reach your models.

4. Optimized model routing is the new salary banding. You wouldn't pay an executive salary for data entry, yet most enterprises route simple tasks to expensive frontier models. A routing layer that matches model cost to task complexity is essential, but it only works if you have the evals to know whether a cheaper model can actually deliver. We've seen companies over-optimize for cost and pay for it in quality. You get what you pay for in LLMs; the discipline is spending tokens where the task is genuinely complex.

5. Fine-tuning is back in the enterprise playbook. At meaningful scale, the most cost-effective strategy often isn't agent iteration or routing alone, but tailoring smaller open-weight models to specific tasks. Fine-tuning shouldn't teach a model new information (that's what RAG is for), but it can standardize workflows, communication style, and tool-calling. The real value comes from treating the resulting model like any software asset: regression testing and feedback loops to catch drift. Discipline and data quality matter more than compute budget.
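The routing idea in pillar 4 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a production router: the ModelOption class, the route function, the model names, the prices, the pass rates, and the 0.95 threshold are all hypothetical. The sketch picks the cheapest model whose measured pass rate on your own evals clears the bar for a given task type, and falls back to the best-scoring model when none does.

```python
from dataclasses import dataclass


@dataclass
class ModelOption:
    """A candidate model with hypothetical cost and eval results."""
    name: str
    cost_per_1k_tokens: float
    eval_pass_rate: dict[str, float]  # task type -> pass rate on a private eval suite


def route(task_type: str, models: list[ModelOption], min_pass_rate: float = 0.95) -> ModelOption:
    """Pick the cheapest model that clears the eval bar for this task type."""
    qualified = [m for m in models if m.eval_pass_rate.get(task_type, 0.0) >= min_pass_rate]
    if not qualified:
        # No model clears the bar: fall back to the best-scoring one, not the cheapest.
        return max(models, key=lambda m: m.eval_pass_rate.get(task_type, 0.0))
    return min(qualified, key=lambda m: m.cost_per_1k_tokens)


# Hypothetical numbers, for illustration only.
MODELS = [
    ModelOption("small-open-weight", 0.0002,
                {"data_entry": 0.98, "underwriting": 0.71, "legal_review": 0.50}),
    ModelOption("frontier", 0.0150,
                {"data_entry": 0.99, "underwriting": 0.96, "legal_review": 0.90}),
]

print(route("data_entry", MODELS).name)
print(route("underwriting", MODELS).name)
```

Note that the cheaper model wins only where the eval data says it can do the job. Without per-task pass rates, the router has nothing to decide on except price.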

This shift to an eval-first mentality isn't just technical plumbing. It's a change in how we define success for AI: moving from "let's see what it does" to "let's measure precisely what it should do, and improve it until it does." The organizations that figure this out now will turn AI from a cost center into a durable, compounding asset.

Our work improving frontier models has given us a front-row seat to this discipline. Our shared goal with enterprise partners is closing the gap between "it works in the lab" and "it does real work for tangible value."

If you're working through this transition, or trying to scale your AI programs beyond pilot, I'd love to hear how you're framing the challenge. It's the most important problem we're solving in 2026.
