kishan

@kishan_dahya

Enough About Harnesses, Your Org Needs Its Own Coding Agent

Elite engineering orgs like Stripe, Ramp, and Coinbase are building their own internal coding agents. These agents run as Slackbots, CLIs, web apps, and Chrome extensions, meeting engineers where they already work.

They're connected to internal systems with the right context, permissioning, and safety boundaries to operate with minimal or no human approvals.

And because they're so useful, they're spreading beyond engineering as product managers, GTM, and other non-technical team members see their success in Slack and begin using them too.

Background: The best engineering orgs are AI-native already. It's not coming, it's here:

Stripe runs hundreds of millions of lines of Ruby with Sorbet typing — a stack most LLMs struggle with — while processing over a trillion dollars in payments annually. Their agents now produce over 1,300 merged pull requests per week.

Ramp needed agents that could verify their own work across both frontend and backend, with full engineer-level context. They built a multi-client agent platform in-house that spans Slack, web, and a Chrome extension.

Coinbase had financial and crypto security requirements that blocked third-party background agents entirely, so they built their own. They have compressed PR cycle time from 150 hours to 15 hours, and are now targeting 5 minutes.

And as @rywalker's survey of in-house coding agents shows, they're far from alone — this is becoming a pattern across the industry.

What follows is a practical guide to the decisions you'll face if you go down this path, showing how Stripe, Ramp, and Coinbase approached each one differently and what you can learn from each.

Sources: This article draws from Stripe's two-part Minions blog series on stripe.dev, Ramp's "Why We Built Our Background Agent", Chintan Turakhia's appearance on the How I AI podcast, and original research comparing open-source agent harness implementations.

1. The Agent Harness

The first decision is what harness your agent will run in.

Stripe forked. They took Block's open-source goose coding agent and customized it with opinionated orchestration that interleaves agent loops with deterministic code for git operations, linters, and testing. Forking gave them a head start on the core agent loop while letting them impose strict control over how that loop interacts with Stripe's infrastructure.

Ramp composed. They built on top of OpenCode as the underlying agent, choosing it for its server-first architecture and typed SDK. A practical bonus: the agent can read its own source code, which helps it understand its own capabilities. Composing on an existing agent gives you an upgrade path — you can pull in improvements from upstream — but couples you to that project's architectural decisions.

Coinbase built from scratch. Their agent, Cloudbot, is built in-house and is multi-model, not locked to any single provider. Security requirements for a financial platform handling crypto drove this decision. Building from scratch gives you total control but carries the highest implementation cost.

The tradeoffs are straightforward. Forking gives you speed but ties you to upstream decisions you may not agree with. Composing gives you an upgrade path but couples you to a framework's center of gravity. Building gives you full control but means you own every bug.

2. The Sandbox: Where Agents Run Code

You could have engineers run agents locally, but once agents are writing and executing code autonomously, uncontrolled local execution gets risky fast. All three companies converge on cloud-based sandboxes, whether ephemeral VMs or containers. The sandbox is part of your safety model, which makes your agent execution environment one of the most consequential decisions you'll make.

Stripe: Cloud VMs as Cattle

Stripe runs agents on devboxes — AWS EC2 instances that serve as standardized cloud developer environments. They're treated as cattle, not pets: easily replaceable, spun up from a proactively warmed pool with 10-second readiness.

Each devbox comes pre-loaded with everything an engineer (or agent) needs:

Pre-cloned git repositories (gigabytes of source code)

Warmed Bazel and type-checking caches

Running code generation services

Checked out to a recent copy of master

The isolation model is what makes this work at a payments company. Devboxes run in a QA environment with no real user data, no access to production Stripe services, and no arbitrary network egress. Because the blast radius of any mistake is fully contained, agents can run with full permissions and no confirmation prompts.

Stripe's insight here is: a development environment that is safe for humans has proven to be just as useful for agents. You don't need to invent new security primitives — you need to make your existing ones fast enough for agents to use.
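To make the "warmed pool" idea concrete, here is a minimal sketch of a pool that always keeps a fixed number of pre-provisioned environments ready and never reuses one. The class names, the `target_size` parameter, and the synchronous refill are all assumptions for illustration; Stripe's posts do not describe their implementation at this level, and a real pool would provision in the background.

```python
from collections import deque
from dataclasses import dataclass
from itertools import count


@dataclass
class Devbox:
    box_id: int
    warmed: bool = True  # stands in for "repos cloned, caches warm"


class WarmPool:
    """Keeps `target_size` boxes ready; each box is handed out once (cattle, not pets)."""

    def __init__(self, target_size):
        self.target_size = target_size
        self._ids = count(1)
        self._ready = deque()
        self.refill()

    def _provision(self):
        # Hypothetical slow step: a real system would boot a VM or container here.
        return Devbox(box_id=next(self._ids))

    def refill(self):
        # Top the pool back up; done inline here, in the background in practice.
        while len(self._ready) < self.target_size:
            self._ready.append(self._provision())

    def acquire(self):
        # Hand out a warm box if one exists, otherwise provision on demand.
        box = self._ready.popleft() if self._ready else self._provision()
        self.refill()
        return box

    def ready_count(self):
        return len(self._ready)
```

The design choice worth noticing is that the caller only ever pays the cold-start cost when the pool is empty, which is why pool sizing ends up being a latency decision.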

Ramp: Container Platform with Pre-Warming

Ramp uses Modal for isolated development environments. Pre-built images and snapshots keep repositories current within a 30-minute window — fresh enough for most work, fast enough for on-demand spin-up.

Ramp optimizes for speed. They pre-warm sandboxes while the user is still typing their prompt. By the time the user hits enter, the sandbox is ready. They also do early file reads before sync is fully complete and batch repository-level build steps to minimize startup latency. Agents can also spawn child sessions for parallel work — a sandbox-within-a-sandbox model that lets one agent fan out across multiple tasks.

The result: "Inspect sessions are fast to start and effectively free to run...There's no limit to how many sessions you can have running concurrently, and your laptop doesn't need to be involved at all." Engineers can start a session from anywhere, the moment inspiration hits.

Coinbase: Security-Driven In-House

Coinbase built their sandbox in-house, driven by security requirements specific to handling financial and crypto infrastructure. The specifics aren't public, but the motivation is clear: when you're a regulated financial institution, the sandbox isn't just a developer convenience — it's a compliance boundary.

The Pattern

All three converge on the same principle: isolate first, then give full permissions inside the boundary. The sandbox is what makes unattended agent execution safe. If you try to make agents safe through permission prompts and approval gates instead of isolation, you'll end up with an agent that's too slow to be useful or too permissive to be safe.

3. Tools and Context: What Agents Can See and Do

Deciding how many tools your LLM can handle and what context to give it is more art than science. Here's how each company approaches it:

Tool Infrastructure

All three companies give their agents access to internal tools via structured interfaces, but at very different scales.

Stripe built an internal MCP server called Toolshed hosting nearly 500 tools spanning internal systems and SaaS platforms. But the critical design decision isn't the number of tools — it's the curation. Agents receive an intentionally small default subset of tools, not unrestricted access to all 500. Each agent instance gets a curated toolset, with per-user customizability and thematic tool grouping. Security controls prevent destructive actions.

The insight: tool curation matters more than tool quantity. Giving an agent access to 500 tools doesn't make it more capable — it makes it more confused and wastes tokens on tool selection. Constraining the toolset per agent type produces better results.
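A rough sketch of what curation can look like in code: start from a large registry, expose only a small default set of thematic groups, let a user opt into extras, and never hand over destructive tools. The tool names, group names, and the `max_tools` cap are hypothetical; this is not Toolshed's actual interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    group: str  # thematic group, e.g. "code-search"
    destructive: bool = False


# Hypothetical registry standing in for a large internal tool catalog.
REGISTRY = [
    Tool("search_code", "code-search"),
    Tool("read_ticket", "ticketing"),
    Tool("query_metrics", "observability"),
    Tool("drop_table", "databases", destructive=True),
    Tool("read_docs", "docs"),
]

DEFAULT_GROUPS = {"code-search", "ticketing"}


def curate(registry, extra_groups=frozenset(), max_tools=10):
    """Return a small toolset: default groups plus per-user extras, minus destructive tools."""
    allowed = DEFAULT_GROUPS | set(extra_groups)
    selected = [t for t in registry if t.group in allowed and not t.destructive]
    return selected[:max_tools]
```

Calling `curate(REGISTRY)` yields only the default groups; passing `extra_groups={"docs"}` widens the set for one user without changing what every other agent sees.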

Coinbase takes a different approach to tool breadth. Cloudbot connects to MCPs for Datadog, Sentry, Amplitude, and internal Snowflake databases, plus custom Skills layered on top. It can work across multiple codebases. The emphasis is less on a unified tool platform and more on connecting the specific observability and data sources that matter for debugging and implementation.

Ramp builds on OpenCode's built-in tool system at the SDK level, extending it with their own integrations.

Context Engineering

This is where the real sophistication lives. Getting the right information into the agent's context — not too much, not too little — is the difference between an agent that produces useful PRs and one that hallucinates.

Stripe's rule files use Cursor's format with directory and pattern scoping. Rules automatically attach as the agent traverses the filesystem, and they're synced across three platforms: Minions, Cursor, and Claude Code. This means the same institutional knowledge that helps a human engineer in Cursor also helps an unattended agent. Almost all rules are conditionally applied based on subdirectories — necessary for a codebase with hundreds of millions of lines where different regions have radically different conventions.
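The scoping idea can be sketched with plain glob matching: each rule carries a path pattern, and only the rules whose pattern matches the file being touched get attached to the agent's context. The patterns and rule text below are invented for illustration, and this is not Cursor's actual rule file format.

```python
from fnmatch import fnmatchcase

# Hypothetical rules as (glob pattern, guidance text) pairs.
RULES = [
    ("*", "Follow the repo-wide style guide."),
    ("payments/*", "All money amounts are integers in minor units."),
    ("payments/*.rb", "Add Sorbet signatures to new methods."),
    ("web/*", "Use the shared component library."),
]


def rules_for(path):
    """Return guidance for every rule whose pattern matches the file path."""
    # fnmatchcase treats "*" as matching any characters, including "/".
    return [text for pattern, text in RULES if fnmatchcase(path, pattern)]
```

A file under `payments/` picks up the payments rules and the repo-wide rule but never the `web/` guidance, which keeps the context small in a very large codebase.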

Stripe also does context pre-hydration: before a Minion run even starts, the orchestrator scans the Slack thread for links, deterministically pulls Jira tickets, documentation, and Sourcegraph code search results, and runs relevant MCP tools over likely-looking links. The agent starts its work with a rich, pre-assembled context rather than having to discover everything through tool calls.
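Pre-hydration is deterministic code, not an agent loop, so it is easy to sketch: pull every link out of the thread, route each one to a fetcher based on its hostname, and hand the results to the agent up front. The hostnames and fetcher functions here are placeholders, assumed for the example; real fetchers would call internal APIs or MCP tools.

```python
import re
from urllib.parse import urlparse

URL_RE = re.compile(r"https?://[^\s>|]+")


# Hypothetical fetchers; each returns a context snippet for one link.
def fetch_ticket(url):
    return f"[ticket] {url}"


def fetch_doc(url):
    return f"[doc] {url}"


FETCHERS = {
    "jira.example.com": fetch_ticket,
    "docs.example.com": fetch_doc,
}


def prehydrate(thread_messages):
    """Collect context for every recognized link in a thread, in order, without duplicates."""
    context = []
    seen = set()
    for message in thread_messages:
        for url in URL_RE.findall(message):
            url = url.rstrip(".,)")  # drop trailing sentence punctuation
            fetcher = FETCHERS.get(urlparse(url).hostname)
            if fetcher and url not in seen:
                seen.add(url)
                context.append(fetcher(url))
    return context
```

Links to unrecognized hosts are simply skipped, so the agent can still discover them later through tool calls if it needs to.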

Coinbase uses Linear as a single context source. All context gets captured in Linear tickets first — the structured bug report, the relevant user journey, the attached files. Then Cloudbot pulls from Linear and fans out into MCPs for additional context. This creates a clean separation: humans curate context into Linear, agents consume it from Linear plus everything else.

As Turakhia put it on the How I AI podcast, context is the most important thing, so they funnel everything into Linear first, then let Cloudbot fan out from there.

4. Orchestration: How Agents Think and Act

The fourth decision is how you structure the agent's execution — the loop between receiving a task and producing a pull request.

Stripe's Blueprints

This is Stripe's most distinctive architectural contribution. Blueprints are a hybrid pattern that combines the determinism of workflows with the flexibility of agents, implemented as a state machine that alternates between two types of nodes.

Deterministic nodes always execute the same way: run linters, format code, push to git, execute pre-push hooks. These are the steps that must happen and that LLMs are bad at remembering to do consistently.

Agentic subtask nodes give the LLM creative freedom within a bounded scope: "implement the task described in this ticket," "fix CI failures from the previous run." The LLM can use whatever tools and reasoning it needs, but only within that subtask boundary.

The power of Blueprints is composability. Teams create team-specific custom blueprints for specialized workflows. A team that owns a particular service can encode their deployment conventions, testing requirements, and code review standards into a blueprint — and every agent that runs against their code automatically follows those conventions.

The key principle: putting LLMs into contained boxes compounds reliability. Each deterministic node you add is one fewer thing the LLM can get wrong, which saves tokens, saves CI costs, and makes the overall pipeline more predictable.
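Here is one way to picture a blueprint, simplified to a linear sequence of nodes rather than a full state machine. Agentic nodes delegate to a model, deterministic nodes are ordinary functions, and the runner records which kind ran. Every name below is hypothetical, and the "LLM" is a stub; Stripe's posts describe the pattern, not this code.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Node:
    name: str
    run: Callable[[dict], dict]
    agentic: bool = False  # True when the step delegates to an LLM


def fake_llm_implement(state):
    # Placeholder for an agent loop; a real node would call a model with tools.
    return {**state, "diff": f"patch for {state['task']}"}


def run_linters(state):
    # Deterministic step: always runs, regardless of what the model "remembers".
    return {**state, "lint_ok": True}


def push_branch(state):
    # Deterministic step: only push when linting succeeded.
    return {**state, "pushed": state.get("lint_ok", False)}


BLUEPRINT = [
    Node("implement", fake_llm_implement, agentic=True),
    Node("lint", run_linters),
    Node("push", push_branch),
]


def run_blueprint(nodes, state):
    """Run nodes in order and keep a trace of what ran and which kind of node it was."""
    trace = []
    for node in nodes:
        state = node.run(state)
        trace.append((node.name, "agentic" if node.agentic else "deterministic"))
    return state, trace
```

A team-specific blueprint in this sketch is just a different list of nodes, which is what makes the pattern composable.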

Ramp's Session Model

Ramp's orchestration centers on sessions — long-running agent contexts that support follow-up prompts, stopping mechanisms, and multiplayer collaboration.

The session model introduces a design decision Stripe doesn't face with one-shot agents: when a user sends a follow-up prompt, do you queue it or execute immediately? Ramp handles both cases. They also support child sessions, where an agent can spawn sub-agents for parallel work while maintaining a parent context.

Multiplayer is a feature Ramp frames as mission-critical. Multiple team members can collaborate on a single agent session with individual authorship tracking. Use cases include teaching workflows (a senior engineer guides a junior through an agent-assisted task) and QA workflows (a reviewer joins an active session to inspect the agent's work in progress).
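The queue-or-execute decision and per-author tracking can be sketched with a small session object. This is an assumption-heavy illustration: the `interrupt` flag, the log format, and the method names are invented here, and Ramp's post does not publish its session API.

```python
from collections import deque


class Session:
    """A long-running session that either queues follow-ups or lets them interrupt."""

    def __init__(self):
        self.current = None     # (prompt, author) being worked on
        self.pending = deque()  # queued follow-ups
        self.log = []           # (event, prompt, author) for authorship tracking

    def _start(self, prompt, author):
        self.current = (prompt, author)
        self.log.append(("started", prompt, author))

    def submit(self, prompt, author, interrupt=False):
        if self.current is None:
            self._start(prompt, author)
        elif interrupt:
            # Stop the running work and switch to the new prompt immediately.
            self.log.append(("stopped", *self.current))
            self._start(prompt, author)
        else:
            self.pending.append((prompt, author))

    def finish(self):
        # Mark the current prompt done and pick up the next queued one, if any.
        self.log.append(("finished", *self.current))
        self.current = None
        if self.pending:
            self._start(*self.pending.popleft())
```

Because every log entry carries an author, two people steering the same session remain distinguishable after the fact.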

Coinbase's Three-Mode Model

Cloudbot offers three distinct modes, each optimized for a different interaction pattern:

Create PR: Takes a Linear ticket and generates a full pull request with code changes.

Plan: Like Cursor's plan mode — generates an implementation plan and writes it back to the Linear ticket for human review before any code is written.

Explain: Debug mode — answers questions about why something isn't working, pulling context from MCPs (Datadog, Sentry, etc.) to diagnose issues.

When a PR is complete, Cloudbot responds in Slack with a link to the Cursor branch using Cursor's deep link format, plus a QR code so the engineer can scan it on their phone and immediately test a mobile build. This close-the-loop output design — from Slack invocation to mobile testing in one flow — reflects Coinbase's focus on compressing the entire feedback cycle.
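A minimal sketch of routing a Slack message to one of the three modes follows. The command names (`create-pr`, `plan`, `explain`), the ticket argument, and the handler bodies are assumptions; the source only says Cloudbot is invoked with a command, not what the commands look like.

```python
# Hypothetical handlers, one per mode; real ones would call the agent.
def create_pr(ticket):
    return f"opened PR for {ticket}"


def plan(ticket):
    return f"wrote plan to {ticket}"


def explain(ticket):
    return f"diagnosis for {ticket}"


MODES = {"create-pr": create_pr, "plan": plan, "explain": explain}


def handle_message(text):
    """Route 'cloudbot <command> <ticket>' to a mode; ignore unrelated messages."""
    parts = text.split()
    if len(parts) != 3 or parts[0] != "cloudbot":
        return None  # not addressed to the bot
    _, command, ticket = parts
    handler = MODES.get(command)
    if handler is None:
        return f"unknown command: {command}"
    return handler(ticket)
```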

5. Testing and Validation: How Agents Prove Their Work

The fifth decision is how you validate what the agent produces. This is where the three companies diverge most in philosophy, forming a spectrum from conservative to radical.

Stripe: Shift-Left, Max Two CI Runs

Stripe's testing strategy has three layers:

Local: Automated lint heuristics run within 5 seconds per git push via pre-push hooks with cached results. This catches the trivial stuff before it ever touches CI.

CI: Selective test execution drawn from over 3 million total tests. Many CI failures include autofixes that are automatically applied without human intervention.

Agent retry: If CI fails and there's no autofix, the agent gets exactly one additional attempt.

After that second CI run, if failures persist, it goes to human review. No third attempt, no retry loop. The philosophy is explicit: avoid diminishing returns from excessive LLM iteration. Every additional retry burns tokens and CI capacity with a declining probability of success. Better to hand it to a human than to let the agent spin.
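The two-run cap is simple to express as a bounded loop. In this sketch, `run_ci`, `apply_autofix`, and `agent_fix` are caller-supplied callables and the result dictionary shape is an assumption; the point is only that the loop cannot run CI a third time.

```python
def run_with_bounded_ci(run_ci, apply_autofix, agent_fix, max_ci_runs=2):
    """Run CI at most `max_ci_runs` times, then escalate to a human."""
    for attempt in range(1, max_ci_runs + 1):
        result = run_ci()
        if result["passed"]:
            return {"status": "passed", "ci_runs": attempt}
        if attempt == max_ci_runs:
            break  # no further fixes once the budget is spent
        # Prefer a deterministic autofix; otherwise give the agent one attempt.
        if result.get("autofix"):
            apply_autofix(result["autofix"])
        else:
            agent_fix(result)
    return {"status": "escalated_to_human", "ci_runs": max_ci_runs}
```

Making the budget an explicit parameter keeps the policy visible in one place instead of being an emergent property of the agent's behavior.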

Ramp: Visual Verification

Ramp's distinctive contribution to validation is visual verification via their Chrome extension. The extension is React-aware and operates on DOM trees rather than screenshots — more reliable for detecting actual UI state versus pixel-level appearance. Streamed desktop views let both the agent and human reviewers see the frontend output as it renders.

This matters because a lot of agent-generated code can pass CI while producing visually broken output. Unit tests don't catch that a modal is rendering behind the page overlay or that a button is the right color but in the wrong position. DOM-based verification does.

Coinbase: Agent Councils and Auto-Merge

Coinbase is the most aggressive of the three when they judge PRs to be low risk. Their approach has evolved rapidly.

Where they started: Average PR cycle time of roughly 150 hours, most of that in review queues. They built internal review tools and implemented auto-merge by risk — low-risk changes (copy, minor bug fixes) merge automatically, higher-risk changes go through review.

Where they are now: PR cycle time down to roughly 15 hours. They use Greptile for automated code reviews and have introduced agent councils — groups of AI agents that do first-pass code review. According to Chintan Turakhia, these agent councils produce reviews that are "95%+ better than human" reviews.

Where they're going: The target is to go from 15 hours to 5 minutes. They're rethinking where reviews belong in the development cycle entirely, including whether the traditional PR review model even makes sense when agents can validate at the point of creation rather than after the fact.
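Auto-merge by risk comes down to a routing function. The sketch below uses two made-up signals, sensitive path prefixes and a line-count threshold; Coinbase has not published its actual risk criteria, so treat every constant here as a placeholder.

```python
from dataclasses import dataclass

# Hypothetical path prefixes treated as high risk.
HIGH_RISK_PREFIXES = ("payments/", "auth/", "migrations/")


@dataclass
class PullRequest:
    changed_files: list
    lines_changed: int


def route(pr, max_auto_lines=20):
    """Send small changes outside sensitive paths to auto-merge, everything else to review."""
    touches_high_risk = any(f.startswith(HIGH_RISK_PREFIXES) for f in pr.changed_files)
    if touches_high_risk or pr.lines_changed > max_auto_lines:
        return "human_review"
    return "auto_merge"
```

Defaulting to review whenever any signal fires keeps the auto-merge lane narrow, which is the safer direction to be wrong in.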

The Spectrum

These three approaches form a clear spectrum:

Conservative (Stripe): Bounded retries, human review as the backstop, explicitly designed to avoid throwing compute at diminishing returns.

Moderate (Ramp): Automated visual verification augments human review but doesn't replace it.

Radical (Coinbase): Agent councils doing first-pass reviews, auto-merge by risk level, working toward removing human review from the critical path entirely.

Where you land on this spectrum depends on your risk tolerance, domain, and your engineers' own preferences.

6. Invocation: How Engineers Access Agents

The sixth decision is the interface. How do your engineers actually talk to the agent?

Slack Is the Universal Layer

All three companies converge on Slack as the primary invocation surface. This isn't a coincidence.

Stripe's agents are invoked most commonly through Slack, though they also support CLI, a web interface, and embedded buttons in internal systems (docs platform, feature flag UI, ticketing). Ramp offers Slack alongside a polished web interface with hosted VS Code and a Chrome extension. Coinbase's Cloudbot is Slack-native — it lives in Slack channels and is invoked via cloudbot <command>.

Chintan Turakhia articulated why Slack wins more clearly than anyone: the cost of writing something in Slack is zero, but the cost of answering something in Slack is enormous. Most of what flows through Slack is humans pretending to be systems — answering questions that could be automated, triaging requests that could be routed, summarizing context that could be assembled. An agent in Slack turns that dynamic inside out: the cost of answering drops to zero too.

There's a second reason, and it's about adoption: Slack is how things go viral within your company. If you hide the agent behind a separate tool, behind a new URL, behind a new login — it doesn't spread. When agent results show up in channels that everyone's already watching, people see what's possible without having to opt in.

Beyond Slack

The interesting divergence is in what each company builds beyond Slack.

Stripe embeds invocation buttons directly into internal platforms. Their docs platform, feature flag UI, and ticketing system all have built-in "run a minion" buttons. This is the logical end state of Slack-first: once the agent proves useful, you push invocation to the point of need rather than making people context-switch to Slack.

Ramp builds three distinct client surfaces, each optimized for different workflows. The web interface includes hosted VS Code integration and organization-wide analytics dashboards. The Chrome extension operates on DOM trees for visual verification. The Slack interface handles chat-based interaction with automatic repository classification and status updates via Block Kit. This also lets non-technical team members adopt Inspect right where they already work, such as in the browser.

Coinbase keeps Slack as the center but adds creative output touches: when Cloudbot finishes a PR, it responds with both a Cursor deep link (so you can jump straight into the branch in your editor) and a QR code (so you can scan with your phone and immediately test a mobile build). Small details, but they close the loop from invocation to validation in a single flow.

7. Adoption: Making It Stick

Building the agent is one thing. Getting a thousand engineers to actually use it is another.

None of these companies forced adoption — they all let the product spread organically, but through different mechanisms.

Ramp: Let the Product Talk

Ramp's approach was explicit: "We didn't force anyone to use Inspect over their own tools. We built to people's needs, created virality loops through letting it work in public spaces, and let the product do the talking." Within a couple of months, ~30% of all pull requests merged to their frontend and backend repos were written by Inspect — and usage continues to grow. They track a real-time metric they call "humans prompting" (users who've sent a prompt in the last 5 minutes) as their adoption pulse. Ramp is also expanding beyond engineering, teaching non-engineering builders like product managers and designers how to use the agent for their own work.
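The "humans prompting" metric is a distinct-user count over a trailing window. A minimal sketch, assuming prompt events arrive as `(user, timestamp)` pairs, which is a format chosen here for illustration:

```python
from datetime import datetime, timedelta


def humans_prompting(events, now, window=timedelta(minutes=5)):
    """Count distinct users with at least one prompt inside the trailing window."""
    cutoff = now - window
    return len({user for user, sent_at in events if cutoff <= sent_at <= now})
```

Counting distinct users rather than prompts means one very active engineer does not inflate the number.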

Stripe: Make It Impossible to Avoid

Stripe doesn't have a dedicated adoption playbook — they make the agent impossible to avoid. Invocation is embedded everywhere: the docs platform, the feature flag UI, the ticketing system all have built-in "run a minion" buttons. Rule files are synced across Minions, Cursor, and Claude Code, so institutional knowledge works the same regardless of tool. They've also built a no-code internal agent builder for non-engineering teams, spreading agent usage beyond the engineering org. The result: over 1,300 fully agent-produced PRs merged per week, with a fleet of hundreds of different agents running across the company.

Coinbase: Social Proof at Scale

Coinbase drove adoption through social proof and collective events across their 1,000+ engineer org. They created a Slack channel called "Cursor Wins and Losses" — the "losses" part turned it into a self-reinforcing learning loop where failures got fixed publicly. They ran PR speedruns — time-boxed events where everyone picks a trivial task and ships a PR. The first one produced 70 PRs in 15 minutes; a company-wide follow-up with 800 engineers produced 300-400 PRs in 30 minutes. They also invented a dedicated "Super Builder" role — someone whose full-time job is making everyone else faster with agents.

The Pattern

All three converge on the same lesson: don't mandate, demonstrate. Put the agent where people already work (Slack, internal tools), make results visible, and let adoption compound from there.

Conclusion: The Decision Matrix

Here's how the three companies stack up across every decision.

If You're Starting Today

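Pick your agent harness wisely. Fork for speed, compose for an upgrade path, or build from scratch for full control, and weigh each option against your security requirements and how much of the stack you are prepared to own.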

Build an environment where agents can make mistakes without consequences. This is the single biggest unlock. Stripe's devboxes have no production access, no real user data, no network egress, so agents run with full permissions and zero confirmation prompts. Invest in isolation and pre-warming early; sandbox startup time is agent latency.

Curate tools, don't accumulate them. More tools = more confusion. Give each agent type a small, focused toolset.

Pick your spot on the validation spectrum. Understand which parts of your codebase and PRs are high-risk vs. low-risk, then review accordingly. Don't be afraid to auto-merge the safe stuff and gate the dangerous stuff.

Let the agent work in public. Don't mandate usage; let agent results show up in Slack channels and shared spaces where people can see what's possible without opting in. Adoption follows visibility.

Sources:

Stripe: "Minions: Stripe's one-shot, end-to-end coding agents" and Part 2

Ramp: "Why We Built Our Background Agent"

Coinbase: Chintan Turakhia on the "How I AI" podcast; @chintanturakhia tweet, Mar 2 2026

Further reading: Ry Walker's survey of in-house coding agents tracks the broader landscape of companies building their own

Harness research: Original comparison of coding harness architectures
