Trevin Chow

@trevin

RTK, Headroom, and the right way to save tokens in coding agents

I saw that @Teknium posted a useful Hermes Agent report on Headroom, a repo that promises big token savings for agents. I asked my Hermes to run a similar analysis on RTK (Rust Token Killer), a popular CLI proxy that promises to reduce token usage for LLMs. This was its report:

Hi Trevin, I looked at the @Teknium report and the RTK repo.

The important part of that report was not “compression good” or “compression bad.” It was the mechanism-level analysis.

Headroom's big advertised numbers mostly come from CCR: replace content with a marker, cache the original locally, retrieve it later if needed.

That sounds reasonable until you put it inside a coding agent.

The agent often reads tool output because it needs the tool output. If you replace the output with a marker, the agent retrieves it. Now the context has both the marker and the retrieved blob.

You did not save tokens. You added an errand.
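A minimal sketch of that arithmetic, assuming made-up token counts. The function names and the numbers are illustrative only; nothing here is taken from Headroom's code.

```python
def context_tokens(blob: int, marker: int, retrieved: bool,
                   retrieval_overhead: int = 0) -> int:
    """Tokens one tool output costs in context under marker-and-retrieve."""
    if not retrieved:
        return marker
    # The marker stays in context; the retrieval call and the blob join it.
    return marker + retrieval_overhead + blob


def break_even_retrieval_rate(blob: int, marker: int,
                              retrieval_overhead: int = 0) -> float:
    """Retrieval probability above which the scheme costs more than
    simply sending the blob in the first place."""
    return (blob - marker) / (blob + retrieval_overhead)


# Hypothetical numbers: a 2,000-token output, a 30-token marker,
# and 40 tokens of overhead for the retrieval tool call.
print(context_tokens(2000, 30, retrieved=False))                        # 30
print(context_tokens(2000, 30, retrieved=True, retrieval_overhead=40))  # 2070
print(break_even_retrieval_rate(2000, 30, 40))
```

With these assumed numbers the scheme only wins if the agent skips retrieval almost never less than a few percent of the time, which is the opposite of how a coding agent treats output it just asked for.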

Teknium's conclusion was basically: the generic remove-and-retrieve path is a bad fit for live Hermes tool output, but the evaluation found one real free win. `search_files` output could be densified losslessly inside Hermes itself.

That is the right shape of analysis: do not argue about the marketing number. Inspect the mechanism, run it against real agent traffic, and ship the small native win if that is what survives.

So I looked at another token-savings repo: `rtk-ai/rtk`.

RTK is a different beast.

It is not trying to compress arbitrary agent context after the fact. It is a command-aware CLI proxy.

Instead of:

RTK tries to do:

Same for a lot of common dev commands:

git status / diff / log / commit / push

gh pr / issue / run

cargo test / pytest / go test / jest / vitest

rg / grep / find / ls / cat / head / tail

docker / kubectl / aws / package managers

That difference matters.

For coding agents, command-aware output shaping is much more plausible than generic compression. The useful output of `cargo test` is not shaped like the useful output of `git diff`. The useful output of `gh pr view` is not shaped like a log file.

RTK's basic idea is right: make the command return the thing the agent probably needed in the first place.

I cloned the repo and inspected the current `develop` branch.

Some quick facts:

~74k Rust LOC under `src/`

62 command-module files

74 rewrite rules

58 built-in TOML filters

2,213 Rust `#[test]` annotations

Hermes integration exists via a `pre_tool_call` plugin

This is not just a README with a shell alias.

I also did a small safe evaluation. No changes to my active Hermes install, no gateway restart, no global RTK install.

I downloaded the RTK `v0.42.4` macOS ARM release into `/tmp`, verified the SHA256 against the release checksum, put the binary on a temporary `PATH`, and ran it with a temporary `HOME`/`XDG_DATA_HOME`. I did not run `rtk init` except in dry-run mode.
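The checksum step can be sketched like this. It is a generic SHA-256 comparison using Python's standard library, not the exact commands I ran; the file path and expected digest are supplied by the caller.

```python
import hashlib
import hmac
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hex SHA-256 of a file, read in chunks so large binaries are fine."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path: Path, expected_hex: str) -> bool:
    """True only if the file's digest equals the published checksum."""
    expected = expected_hex.strip().lower()
    return hmac.compare_digest(sha256_of(path), expected)
```

The point of doing this before putting the binary on any `PATH` is that a mismatch stops the evaluation before anything executes.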

Then I copied RTK's Hermes plugin into the sandbox and smoke-tested it with a fake Hermes hook context.

The plugin did what the source suggested:

registers `pre_tool_call`

rewrites `terminal` commands

leaves non-terminal tools alone

Example:

That boundary is important.

RTK's Hermes integration only touches Hermes `terminal` calls. It does not touch Hermes-native tools like `read_file`, `search_files`, `skill_view`, `web_extract`, browser snapshots, or LCM/context compression.

So RTK may save a lot of tokens on supported shell commands. That does not mean it saves 60-90% of a full Hermes session.
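A rough sketch of that boundary, under loud assumptions: the hook signature, the `command` argument key, and the `shaper` wrapper name are all hypothetical stand-ins, not the real Hermes plugin API or RTK's rewrite rules.

```python
from typing import Callable, Optional


def make_pre_tool_call(rewrite: Callable[[str], Optional[str]]):
    """Build a hook that rewrites terminal commands and nothing else.

    `rewrite` returns a new command string, or None to pass through.
    """
    def pre_tool_call(tool_name: str, args: dict) -> dict:
        if tool_name != "terminal":
            return args  # read_file, search_files, etc. are untouched
        command = args.get("command")
        if not isinstance(command, str):
            return args
        rewritten = rewrite(command)
        if rewritten is None:
            return args  # unsupported command: pass through unchanged
        return {**args, "command": rewritten}
    return pre_tool_call


def toy_rewrite(command: str) -> Optional[str]:
    # Placeholder rule set: only `git status` is "supported" here.
    if command.strip() == "git status":
        return f"shaper {command.strip()}"
    return None


hook = make_pre_tool_call(toy_rewrite)
print(hook("terminal", {"command": "git status"}))
print(hook("read_file", {"path": "notes.md"}))
```

Everything that is not a supported terminal command falls through untouched, which is why the per-command savings figure cannot be read as a per-session figure.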

To get a rough real-world signal, I sampled recent Hermes terminal tool calls from the local session DB in read-only mode. I did not execute historical commands. I only passed the command strings to `rtk rewrite`.

Results from 818 recent terminal commands:

108 were rewritten by RTK

710 passed through unchanged

rewrite hit rate: 13.2%

median rewrite latency: 9.8ms

p95 rewrite latency: 13.2ms

That is not a universal benchmark. It is one user's Hermes usage pattern.

But it matters because the command mix was very Hermes-realistic: a lot of shell scripts, Python snippets, bespoke local CLIs, `gbrain`, `hermes`, `x-twitter-pp-cli`, and other orchestration commands. RTK's strongest surface is common developer CLI output. If your agent spends most of its time in custom shell glue, the rewrite hit rate will be lower.
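For anyone repeating this on their own traffic, the summary numbers are simple to compute. This sketch uses the counts reported above for the hit rate; the latency list is hypothetical sample data, and the p95 uses the nearest-rank method, which may differ slightly from other percentile definitions.

```python
import statistics


def hit_rate(rewritten: int, total: int) -> float:
    """Fraction of sampled commands the rewriter changed."""
    return rewritten / total if total else 0.0


def p95_nearest_rank(samples: list[float]) -> float:
    """95th percentile by nearest rank (integer math avoids float drift)."""
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n)
    return ordered[rank - 1]


# Counts from the sample above.
print(f"{hit_rate(108, 818):.1%}")  # 13.2%

# Hypothetical per-command latencies in milliseconds.
latencies_ms = [9.1, 9.5, 9.8, 10.2, 13.4]
print(statistics.median(latencies_ms))  # 9.8
print(p95_nearest_rank(latencies_ms))   # 13.4
```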

I also ran a small controlled before/after benchmark in the RTK repo clone. These are character counts, not tokenizer-accurate token counts, but they are enough to see the shape.

This is the key point: RTK can be very good when the command/filter pair is good. It is not automatically good just because the command is technically supported.

`git status` and `find` compressed well. `git log` and `git show --stat` did not move in this case. `grep` was slightly worse.
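The per-command comparison is just a ratio of lengths. A minimal sketch, with placeholder strings rather than real command output:

```python
def char_savings(raw: str, shaped: str) -> float:
    """Fraction of characters removed; negative if the output grew."""
    if not raw:
        return 0.0
    return 1 - len(shaped) / len(raw)


# Placeholder outputs, not captured from a real run.
print(char_savings("x" * 100, "x" * 25))  # 0.75
print(char_savings("abcd", "abcde") < 0)  # True
```

Allowing a negative result matters: a wrapper that adds headers or summaries to already-short output shows up as a regression instead of being silently clamped to zero.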

That does not make RTK bad. It makes the real claim narrower and more useful.

Compared with Headroom's CCR path, RTK avoids the biggest structural problem: there is no marker that the model has to retrieve back into context. The compact output is the output.

Different tradeoff though: RTK is lossy.

For many commands, that is fine.

Passing tests do not need 1,000 lines of green checkmarks. Install logs do not need every “downloaded package” line. `git status` does not need a paragraph when a compact file list works.

But lossy command wrappers can also hide the one line that matters.
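One way to keep a lossy filter honest is to drop only lines that match a known-safe pattern and keep everything else. This is a sketch of that idea, not RTK's filter; it assumes `cargo test` style lines ending in `... ok`.

```python
def summarize_test_output(lines: list[str]) -> list[str]:
    """Collapse passing-test lines into a count; keep all other lines."""
    kept: list[str] = []
    passed = 0
    for line in lines:
        if line.rstrip().endswith("... ok"):
            passed += 1  # known-safe to drop
        else:
            kept.append(line)  # failures, warnings, anything unrecognized
    if passed:
        kept.append(f"({passed} passing tests omitted)")
    return kept


sample = [
    "test parser::empty ... ok",
    "test parser::nested ... FAILED",
    "test lexer::utf8 ... ok",
]
print(summarize_test_output(sample))
```

The design choice is an allowlist for removal rather than an allowlist for retention: an unrecognized line survives by default, so the failure mode is verbosity, not a missing error.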

That is where the repo still needs more proof.

A few concerns from inspection:

The README's 30-minute Claude Code savings table is presented as an estimate, not a reproducible benchmark over real sessions.

The repo description says “single Rust binary, zero dependencies,” but the source build has 21 Cargo dependencies. If they mean no runtime service dependency, fine. If they mean no dependencies at all, no.

Open issue #2468 says `rtk gain` can over-count savings after a huge-file read failure/OOM path. That matters because the savings dashboard is part of the trust story.

Open issue #2462 reports `rtk grep` silently returning `0 files` on macOS when ripgrep is missing because BSD grep does not behave like GNU grep for the delimiter RTK expects. Silent false negatives are exactly the kind of failure agents are bad at noticing.

Open issue #2469 notes `rtk find` does not support compound predicates/actions like `-not` and `-exec`. That is not fatal, but rewrite layers need to be conservative around shell semantics.
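What "conservative around shell semantics" could look like in practice: refuse to rewrite anything the wrapper is known not to handle, and pass the original command through. This is a hypothetical guard, not RTK's logic; the unsupported set contains only the two predicates named in the issue.

```python
import shlex

# Only the predicates/actions named above; a real list would be longer.
UNSUPPORTED_FIND_TOKENS = {"-not", "-exec"}


def should_rewrite_find(command: str) -> bool:
    """True only for `find` invocations with no known-unsupported tokens."""
    try:
        tokens = shlex.split(command)
    except ValueError:
        return False  # unbalanced quotes: do not guess, pass through
    if not tokens or tokens[0] != "find":
        return False
    return not any(tok in UNSUPPORTED_FIND_TOKENS for tok in tokens[1:])


print(should_rewrite_find("find . -name '*.rs'"))          # True
print(should_rewrite_find("find . -not -name '*.rs'"))     # False
print(should_rewrite_find("find . -exec rm {} ;"))         # False
```

A pass-through costs some tokens. A silently wrong rewrite costs the agent a correct view of the filesystem.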

My read:

RTK is promising because it is solving the right problem at the right layer for shell commands.

But the public number needs the same treatment Teknium gave Headroom.

Do not ask “does RTK save 80% on examples where RTK is used?”

Ask:

Across real Hermes sessions, after unsupported commands, native tool calls, reruns, fallbacks, and correctness checks, how many net input tokens did RTK save?

In my small sample, the honest answer is: RTK rewrote 13.2% of recent Hermes terminal commands, and it produced large savings on some controlled commands but zero or negative savings on others.
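To make the difference between the headline number and the net number concrete, here is a small accounting sketch. The `Call` record and the token counts are hypothetical; the point is only that pass-through and native-tool calls stay in the denominator.

```python
from dataclasses import dataclass


@dataclass
class Call:
    tool: str
    raw_tokens: int     # tokens the output would cost without any shaping
    shaped_tokens: int  # tokens it actually cost (equals raw on pass-through)


def net_savings(calls: list[Call]) -> float:
    """Fraction of total tool-output tokens saved across the whole session."""
    raw = sum(c.raw_tokens for c in calls)
    shaped = sum(c.shaped_tokens for c in calls)
    return (raw - shaped) / raw if raw else 0.0


# Hypothetical session: one rewritten command, one pass-through, one native tool.
session = [
    Call("terminal", 1000, 200),    # 80% saved on this command
    Call("terminal", 1000, 1000),   # unsupported command, unchanged
    Call("read_file", 3000, 3000),  # native tool, never touched
]
print(net_savings(session))  # 0.16
```

In this made-up session an 80% win on the supported command becomes a 16% net win, and reruns or fallbacks would push it lower.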

That is still useful. It is just not the README headline.

The ideal outcome is probably both:

RTK-style command-aware filtering for shell commands

Hermes-native densification for Hermes-native tools

That is the path that actually compounds.

Make the common outputs smaller at the source. Keep the details recoverable when they matter. Measure net savings on real traffic, not marketing examples.

That is the bar.
