Suryansh Tiwari

@Suryanshti777

4mo ago

"Claude Code + Karpathy's Repo + One MacBook = A 400B AI Running Without the Cloud"

An AI Hired an AI to Run an AI on a Laptop. It Worked. And the Entire Industry Should Pay Attention.

There’s a specific kind of quiet that happens right before something changes everything.

Not a bang. Not a press conference. Not a $10 billion funding round with a glossy announcement deck.

Just a developer, a laptop, a terminal window — and an AI that spent two and a half hours working while its human went to do something else.

This past week, Dan Woods (@danveloper) posted a thread on X that most people scrolled past. The ones who stopped and read it carefully understood immediately: something had shifted. Not in the way AI hype shifts every other Tuesday, but in the deep, structural, this-changes-the-math kind of way that only happens a few times a decade.

Dan ran a nearly 400 billion parameter AI model on his MacBook Pro.

No cloud. No GPU cluster. No data center. No API calls to some server farm humming in a warehouse somewhere.

Just a laptop, an SSD, and an AI agent that figured out how to make the impossible work.

Before we get into what happened, meet the three forces that collided to make this possible.

@karpathy. Co-founder of OpenAI. Former Director of AI at Tesla. One of the most respected deep learning researchers alive. Karpathy has spent years quietly building tools that democratize AI research: infrastructure that lets a single developer run the kind of systematic experiments that once required institutional resources and a full team. His autoresearch repo is one of those tools: an automated research framework that designs experiments, runs them, evaluates results, and iterates without constant human babysitting.

Apple’s “LLM in a Flash” paper. A research paper from Apple’s ML team that proposed something conceptually radical: you don’t need an entire model in RAM to run it. If you stream only the parts you’re actively using directly from fast SSD storage, you can run models far larger than your available memory. The math works. The question was whether someone would actually build it.

Claude Code. Anthropic’s agentic AI coding tool. Not a chatbot. Not an autocomplete engine. An agent — something that takes a goal, breaks it into tasks, writes code, runs experiments, reads results, and iterates autonomously. Think of it less like a search engine and more like a brilliant, tireless junior engineer who never sleeps, never gets frustrated, and can hold an entire complex codebase in its head simultaneously.

Dan handed Claude Code the Karpathy autoresearch repo and Apple’s flash paper and said, in effect: figure this out.

What happened next is why we’re here.

This is the part most articles about this story are going to miss, so let’s slow down and be precise.

Dan did not write the optimization code himself. He did not manually design the experiments. He did not sit at his computer tweaking parameters and rerunning tests for hours.

He gave Claude Code a mission and stepped back.

Claude Code read the Apple research paper. It understood the architecture of the Qwen3.5-397B model. It read Karpathy’s autoresearch framework and understood how to use it as a scaffolding for systematic experimentation. Then it got to work.

For 2 hours and 24 minutes, Claude Code ran autonomously. It designed and executed 11 separate experiments. It kept 7 configurations that showed promise. It discarded 4 that didn’t. It crashed zero times. It maintained an internal research log, evaluated each experiment’s results against the goal, and used those results to inform the next experiment — exactly the way a good researcher would.
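To make the keep-or-discard rhythm concrete, here is a minimal sketch of that kind of loop. It is not the autoresearch API and not Dan's code: `ResearchLog`, `run_experiments`, `toy_measure`, and the `cache_slots` setting are all hypothetical names invented for illustration, and the list of configurations is fixed up front, whereas the agent described above chose each next experiment from the previous results.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ResearchLog:
    """Hypothetical log of which configurations were kept or discarded."""
    kept: list = field(default_factory=list)
    discarded: list = field(default_factory=list)


def run_experiments(configs, measure: Callable[[dict], float]) -> ResearchLog:
    """Try each config in turn; keep it only if it beats the best score so far."""
    log = ResearchLog()
    best = float("-inf")
    for config in configs:
        score = measure(config)  # e.g. tokens per second in a real run
        if score > best:
            best = score
            log.kept.append((config, score))
        else:
            log.discarded.append((config, score))
    return log


def toy_measure(config: dict) -> float:
    """Stand-in benchmark: pretends more cache slots always means more speed."""
    return config["cache_slots"] * 0.01


configs = [{"cache_slots": n} for n in (10, 5, 20, 15, 40)]
log = run_experiments(configs, toy_measure)
print(len(log.kept), "kept,", len(log.discarded), "discarded")  # 3 kept, 2 discarded
```

A real run would replace `toy_measure` with an actual benchmark of the inference engine.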

At the end of those two and a half hours, Qwen3.5-397B — a model stored in 209 gigabytes on disk, with nearly 400 billion parameters — was generating coherent, correct text on a MacBook Pro with 48GB of RAM.

The speed was 0.96 tokens per second. Slow, but alive.

And this is the part that deserves a moment of reflection: an AI agent just conducted original systems research. It didn’t copy a Stack Overflow answer. It didn’t regurgitate a tutorial. It read primary research papers, understood the underlying principles, designed experiments to test implementations of those principles, and iterated toward a working solution autonomously.

That is not a party trick. That is a preview of how software is going to be built.

To feel the weight of what Claude Code accomplished, you need to understand what it was working with.

Qwen3.5-397B is a model from Alibaba’s AI research lab. The “397B” means approximately 397 billion parameters — the numerical weights that encode everything the model knows, every pattern it’s learned, every capability it has. For context, GPT-3 had 175 billion parameters. GPT-4 is estimated to be substantially larger, but it runs on infrastructure that costs tens of millions of dollars.

Qwen3.5-397B uses a Mixture of Experts (MoE) architecture. Rather than activating the entire network for every token it generates, a MoE model routes each computation through a small subset of specialized “expert” sub-networks. In Qwen’s case, about 10 out of 512 experts activate per layer per token. This makes inference far more efficient than a dense model of equivalent parameter count.
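The routing idea fits in a few lines. This is only a sketch of generic top-k expert selection: the function name `route_top_k` and the toy scores are made up, and the real router in Qwen is a learned layer whose details are not covered here. The 512 and 10 come from the figures above.

```python
import heapq

NUM_EXPERTS = 512       # experts per layer, per the figures above
ACTIVE_PER_TOKEN = 10   # "about 10" activate per layer per token


def route_top_k(router_scores, k):
    """Return the indices of the k highest-scoring experts, best first."""
    return heapq.nlargest(k, range(len(router_scores)), key=router_scores.__getitem__)


# Toy example with 4 experts instead of 512.
scores = [0.1, 0.9, 0.5, 0.7]
print(route_top_k(scores, 2))  # [1, 3]

print(f"{ACTIVE_PER_TOKEN / NUM_EXPERTS:.1%} of experts active per layer")  # 2.0%
```

Only the selected experts' weights are needed for that token, which is what makes streaming them from disk worth considering at all.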

It also creates a specific engineering challenge.

Because the router picks experts on the fly, you can’t predict exactly which ones will be needed, and you can’t lay them out on disk in the order they’ll be read. Every token requires scattered SSD reads. And scattered reads are the enemy of speed.

Apple’s “LLM in a Flash” approach solved this in a clever way.

Only the experts needed for the next step are loaded into memory. Everything else stays on disk. A prediction system anticipates which experts will be needed and pre-fetches them just in time. Non-expert weights — roughly 5GB — stay pinned in memory.
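A toy version of "load only what you need" looks like this. It assumes a made-up file layout in which every expert is a fixed-size blob stored back to back. `EXPERT_BYTES`, `write_toy_model`, and `load_experts` are hypothetical names, and the sketch leaves out both the prediction and pre-fetch step and the pinned non-expert weights.

```python
import os
import tempfile

EXPERT_BYTES = 4  # toy size; real experts are vastly larger


def write_toy_model(path, num_experts):
    """Write fixed-size blobs back to back so expert i starts at i * EXPERT_BYTES."""
    with open(path, "wb") as f:
        for i in range(num_experts):
            f.write(bytes([i]) * EXPERT_BYTES)


def load_experts(path, expert_ids):
    """Read only the requested experts; everything else stays on disk."""
    loaded = {}
    with open(path, "rb") as f:
        for i in expert_ids:
            f.seek(i * EXPERT_BYTES)          # jump straight to this expert
            loaded[i] = f.read(EXPERT_BYTES)  # one small scattered read
    return loaded


with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "experts.bin")
    write_toy_model(model_path, num_experts=16)
    total_bytes = os.path.getsize(model_path)
    active = load_experts(model_path, [3, 11])

bytes_read = sum(len(blob) for blob in active.values())
print(bytes_read, "of", total_bytes, "bytes read")  # 8 of 64 bytes read
```

Each `seek` followed by a `read` is one of the scattered reads described earlier, which is why the number of active experts per token matters so much.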

In Dan’s implementation, about 1.8GB of SSD data is read per token at roughly 1.4 GB/s bandwidth. Each token becomes a tightly orchestrated cycle of storage, memory, and compute.
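Those two numbers give a quick sanity check. The sketch below assumes, as a simplification, that a token's time is dominated entirely by SSD reads; `seconds_per_token` is a hypothetical helper and the inputs are the rounded figures quoted above.

```python
def seconds_per_token(gb_per_token, gb_per_second):
    """Time to stream one token's worth of weights at a given bandwidth."""
    return gb_per_token / gb_per_second


t = seconds_per_token(gb_per_token=1.8, gb_per_second=1.4)
print(f"{t:.2f} s/token, {1 / t:.2f} tokens/s")  # 1.29 s/token, 0.78 tokens/s
```

That lands in the same ballpark as the reported 0.96 tokens per second. Both inputs are approximate, so an exact match shouldn't be expected.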

An LRU (Least Recently Used) expert cache improves performance over time. After just 20 tokens, cache hit rate reached 44% and kept rising.
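For readers who haven't built one, an LRU cache with hit-rate tracking can be sketched like this. It is an illustration of the general technique, not Dan's implementation: `ExpertCache`, `load_fn`, and the tiny capacity are all assumptions made for the example.

```python
from collections import OrderedDict


class ExpertCache:
    """Keeps the most recently used experts in memory, evicting the oldest."""

    def __init__(self, capacity, load_fn):
        self.capacity = capacity
        self.load_fn = load_fn        # called on a miss, e.g. an SSD read
        self.entries = OrderedDict()  # ordered oldest -> newest
        self.hits = 0
        self.misses = 0

    def get(self, expert_id):
        if expert_id in self.entries:
            self.hits += 1
            self.entries.move_to_end(expert_id)  # mark as most recently used
            return self.entries[expert_id]
        self.misses += 1
        weights = self.load_fn(expert_id)
        self.entries[expert_id] = weights
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)     # evict least recently used
        return weights

    @property
    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


cache = ExpertCache(capacity=2, load_fn=lambda i: f"weights-{i}")
for expert_id in [1, 2, 1, 3, 2, 1]:
    cache.get(expert_id)
print(cache.hits, "hits,", cache.misses, "misses")  # 1 hits, 5 misses
```

With only two slots the toy cache thrashes. A cache large enough to hold the experts a conversation keeps returning to is what would let the hit rate climb the way it did here.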

The system gets faster the more you use it.

First result: 0.96 tokens per second.

Then came the bottleneck: Python.

In standard CPython builds, the Global Interpreter Lock (GIL) lets only one thread execute Python bytecode at a time. For a system juggling SSD streaming, caching, and compute, this becomes a hard ceiling.

The solution was simple in idea, difficult in execution: remove Python.

Claude Code helped rebuild the inference engine using Apple Metal — low-level GPU programming that runs directly on hardware.

The result:

Model load time dropped to 0.1 seconds.

Speed increased to around 6 tokens per second.

Memory usage stayed between 6 and 10GB.

And importantly, the hardware was still not fully utilized.

Six tokens per second might not sound fast, but it is close to the pace at which most people read.

The model generates text at roughly the same speed you consume it.
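A rough check on that comparison, using two assumptions that are not from Dan's thread: a silent reading speed of about 250 words per minute and a rule of thumb of about 1.3 tokens per English word. Both vary by reader, text, and tokenizer.

```python
WORDS_PER_MINUTE = 250  # assumption: typical adult silent reading speed
TOKENS_PER_WORD = 1.3   # assumption: common rule of thumb for English text


def reading_speed_tokens_per_second(wpm, tokens_per_word):
    """Convert a reading speed in words per minute to tokens per second."""
    return wpm / 60 * tokens_per_word


speed = reading_speed_tokens_per_second(WORDS_PER_MINUTE, TOKENS_PER_WORD)
print(f"{speed:.1f} tokens/s")  # 5.4 tokens/s
```

Under those assumptions, a reader consumes a little over 5 tokens per second, so 6 tokens per second keeps just ahead of them.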

And it’s running locally.

No cloud. No cost per token. No latency from servers.

A 400B parameter model, on a laptop.

This is where Karpathy’s vision becomes real.

AI doesn’t have to be centralized. The gap between consumer hardware and data center capability is shrinking faster than most people realize.

The tools exist. The research exists. The hardware exists.

This experiment connected them.

There are still limitations.

MoE models are worst-case for this setup due to unpredictable access patterns.

Dense models could run up to 4x faster.

Prefill caching is missing, so each new turn has to reprocess the conversation so far, which makes multi-turn conversations inefficient.

And SSD bandwidth is not fully saturated yet.

There is still significant headroom.

Now consider what this unlocks.

True privacy. Your data never leaves your machine.

Zero inference cost after hardware purchase.

Offline capability anywhere.

Air-gapped deployments for sensitive industries.

Global accessibility independent of internet quality.

And resilience — no outages, no rate limits, no API changes.

Dan plans to open-source the code.

That matters.

Because once it’s public, this stops being one experiment and becomes infrastructure.

Others will improve it. Optimize it. Extend it.

This is how ecosystems start.

The deeper story is not just about running a massive model locally.

It’s about how it happened.

An AI read research papers, designed experiments, ran them, and built a working system.

Then it helped redesign the system at a lower level when it hit a bottleneck.

That is a shift.

We are moving from AI as tool to AI as collaborator.

From assistant to builder.

The line is blurring.

The AI industry focuses on spectacle.

But the real shifts often look like this.

A developer. A laptop. A terminal window.

And an AI working quietly for two and a half hours.

The laptop is becoming the data center.

AI doesn’t have to live in the cloud.

And one of the most important AI breakthroughs this week didn’t come from a lab or a press release.

It came from a MacBook, at night, in a terminal window.
