{"id":"2034663049584693489","url":"https://x.com/Suryanshti777/status/2034663049584693489","text":"","author":{"name":"Suryansh Tiwari","username":"Suryanshti777","avatarUrl":"https://pbs.twimg.com/profile_images/2003864246833319936/d5GQLwdV_200x200.jpg"},"createdAt":"Thu Mar 19 16:07:27 +0000 2026","engagement":{"replies":30,"retweets":40,"likes":215,"views":221053},"article":{"title":"\"Claude Code + Karpathy's Repo + One MacBook = A 400B AI Running Without the Cloud\"","previewText":"An AI Hired an AI to Run an AI on a Laptop. It Worked. And the Entire Industry Should Pay Attention.\n\nThere’s a specific kind of quiet that happens right before something changes everything.\n\nNot a","coverImageUrl":"https://pbs.twimg.com/media/HDySC7CbkAAQkSB.jpg","content":"An AI Hired an AI to Run an AI on a Laptop. It Worked. And the Entire Industry Should Pay Attention.\n\nThere’s a specific kind of quiet that happens right before something changes everything.\n\nNot a bang. Not a press conference. Not a $10 billion funding round with a glossy announcement deck.\n\nJust a developer, a laptop, a terminal window — and an AI that spent two and a half hours working while its human went to do something else.\n\nThis past week, Dan Woods (@danveloper) posted a thread on X that most people scrolled past. The ones who stopped and read it carefully understood immediately: something had shifted. Not in the way AI hype shifts every other Tuesday — but in the deep, structural, this-changes-the-math kind of way that only happens a few times a decade.\n\nDan ran a nearly 400 billion parameter AI model on his MacBook Pro.\n\nNo cloud. No GPU cluster. No data center. No API calls to some server farm humming in a warehouse somewhere.\n\nJust a laptop, an SSD, and an AI agent that figured out how to make the impossible work.\n\nBefore we get into what happened, meet the three forces that collided to make this possible.\n\n@karpathy. Co-founder of OpenAI. Former Director of AI at Tesla. 
One of the most respected deep learning researchers alive. Karpathy has spent years quietly building tools that democratize AI research — infrastructure that lets a single developer run the kind of systematic experiments that once required institutional resources and a full team. His autoresearch repo is one of those tools: an automated research framework that designs experiments, runs them, evaluates results, and iterates without constant human babysitting.\n\nApple’s “LLM in a Flash” paper. A research paper from Apple’s ML team that proposed something conceptually radical: you don’t need an entire model in RAM to run it. If you stream only the parts you’re actively using directly from fast SSD storage, you can run models far larger than your available memory. The math works. The question was whether someone would actually build it.\n\nClaude Code. Anthropic’s agentic AI coding tool. Not a chatbot. Not an autocomplete engine. An agent — something that takes a goal, breaks it into tasks, writes code, runs experiments, reads results, and iterates autonomously. Think of it less like a search engine and more like a brilliant, tireless junior engineer who never sleeps, never gets frustrated, and can hold an entire complex codebase in its head simultaneously.\n\nDan handed Claude Code the Karpathy autoresearch repo and Apple’s flash paper and said, in effect: figure this out.\n\nWhat happened next is why we’re here.\n\nThis is the part most articles about this story are going to miss, so let’s slow down and be precise.\n\nDan did not write the optimization code himself. He did not manually design the experiments. He did not sit at his computer tweaking parameters and rerunning tests for hours.\n\nHe gave Claude Code a mission and stepped back.\n\nClaude Code read the Apple research paper. It understood the architecture of the Qwen3.5-397B model. It read Karpathy’s autoresearch framework and understood how to use it as a scaffolding for systematic experimentation. 
Then it got to work.\n\nFor 2 hours and 24 minutes, Claude Code ran autonomously. It designed and executed 11 separate experiments. It kept 7 configurations that showed promise. It discarded 4 that didn’t. It crashed zero times. It maintained an internal research log, evaluated each experiment’s results against the goal, and used those results to inform the next experiment — exactly the way a good researcher would.\n\nAt the end of those two and a half hours, Qwen3.5-397B — a model stored in 209 gigabytes on disk, with nearly 400 billion parameters — was generating coherent, correct text on a MacBook Pro with 48GB of RAM.\n\nThe speed was 0.96 tokens per second. Slow, but alive.\n\nAnd this is the part that deserves a moment of reflection: an AI agent just conducted original systems research. It didn’t copy a Stack Overflow answer. It didn’t regurgitate a tutorial. It read primary research papers, understood the underlying principles, designed experiments to test implementations of those principles, and iterated toward a working solution autonomously.\n\nThat is not a party trick. That is a preview of how software is going to be built.\n\nTo feel the weight of what Claude Code accomplished, you need to understand what it was working with.\n\nQwen3.5-397B is a model from Alibaba’s AI research lab. The “397B” means approximately 397 billion parameters — the numerical weights that encode everything the model knows, every pattern it’s learned, every capability it has. For context, GPT-3 had 175 billion parameters. GPT-4 is estimated to be substantially larger, but it runs on infrastructure that costs tens of millions of dollars.\n\nQwen3.5-397B uses a Mixture of Experts (MoE) architecture. Rather than activating the entire network for every token it generates, a MoE model routes each computation through a small subset of specialized “expert” sub-networks. In Qwen’s case, about 10 out of 512 experts activate per layer per token. 
This makes inference far more efficient than a dense model of equivalent parameter count.\n\nIt also creates a specific engineering challenge.\n\nBecause those experts are stored non-contiguously across disk, you can’t predict exactly which ones will be needed, and you can’t arrange them sequentially. Every token requires scattered SSD reads. And scattered reads are the enemy of speed.\n\nApple’s “LLM in a Flash” approach solved this in a clever way.\n\nOnly the experts needed for the next step are loaded into memory. Everything else stays on disk. A prediction system anticipates which experts will be needed and pre-fetches them just in time. Non-expert weights — roughly 5GB — stay pinned in memory.\n\nIn Dan’s implementation, about 1.8GB of SSD data is read per token at roughly 1.4 GB/s bandwidth. Each token becomes a tightly orchestrated cycle of storage, memory, and compute.\n\nAn LRU (Least Recently Used) expert cache improves performance over time. After just 20 tokens, cache hit rate reached 44% and kept rising.\n\nThe system gets faster the more you use it.\n\nFirst result: 0.96 tokens per second.\n\nThen came the bottleneck: Python.\n\nPython’s Global Interpreter Lock prevents true parallel execution. For a system juggling SSD streaming, caching, and compute, this becomes a hard ceiling.\n\nThe solution was simple in idea, difficult in execution: remove Python.\n\nClaude Code helped rebuild the inference engine using Apple Metal — low-level GPU programming that runs directly on hardware.\n\nThe result:\n\nModel load time dropped to 0.1 seconds.\n\nSpeed increased to around 6 tokens per second.\n\nMemory usage stayed between 6 and 10GB.\n\nAnd importantly, the hardware was still not fully utilized.\n\nSix tokens per second might not sound fast, but it matches human reading speed.\n\nThe model generates text at roughly the same speed you consume it.\n\nAnd it’s running locally.\n\nNo cloud. No cost per token. 
No latency from servers.\n\nA 400B parameter model, on a laptop.\n\nThis is where Karpathy’s vision becomes real.\n\nAI doesn’t have to be centralized. The gap between consumer hardware and data center capability is shrinking faster than most people realize.\n\nThe tools exist. The research exists. The hardware exists.\n\nThis experiment connected them.\n\nThere are still limitations.\n\nMoE models are worst-case for this setup due to unpredictable access patterns.\n\nDense models could run up to 4x faster.\n\nPrefill caching is missing, meaning multi-turn conversations are inefficient.\n\nAnd SSD bandwidth is not fully saturated yet.\n\nThere is still significant headroom.\n\nNow consider what this unlocks.\n\nTrue privacy. Your data never leaves your machine.\n\nZero inference cost after hardware purchase.\n\nOffline capability anywhere.\n\nAir-gapped deployments for sensitive industries.\n\nGlobal accessibility independent of internet quality.\n\nAnd resilience — no outages, no rate limits, no API changes.\n\nDan plans to open-source the code.\n\nThat matters.\n\nBecause once it’s public, this stops being one experiment and becomes infrastructure.\n\nOthers will improve it. Optimize it. Extend it.\n\nThis is how ecosystems start.\n\nThe deeper story is not just about running a massive model locally.\n\nIt’s about how it happened.\n\nAn AI read research papers, designed experiments, ran them, and built a working system.\n\nThen it helped redesign the system at a lower level when it hit a bottleneck.\n\nThat is a shift.\n\nWe are moving from AI as tool to AI as collaborator.\n\nFrom assistant to builder.\n\nThe line is blurring.\n\nThe AI industry focuses on spectacle.\n\nBut the real shifts often look like this.\n\nA developer. A laptop. 
A terminal window.\n\nAnd an AI working quietly for two and a half hours.\n\nThe laptop is becoming the data center.\n\nAI doesn’t have to live in the cloud.\n\nAnd one of the most important AI breakthroughs this week didn’t come from a lab or a press release.\n\nIt came from a MacBook, at night, in a terminal window."},"adhxContext":{"savedByCount":1,"publicTags":[],"previewUrl":"https://adhx.com/Suryanshti777/status/2034663049584693489"}}