Fable 5 as the Driver: When to Let One AI Instruct Another

Fable is a scalpel. I kept picking it up as a gavel.

Claude Fable 5 shipped this week, pitched as Anthropic’s best model yet for long, autonomous work: the kind of task that runs for hours without a human watching. Within days I was using it for something I’d never handed a model before. Not writing code: instructing the model that writes it. Fable in the strategist seat, Opus at the keyboard.

I ran that setup on five projects in one week, about as different as projects get. A client’s reporting dashboard. An almost-autonomous rebuild of an analytics tool. A trading strategy. A backyard deck I was designing to meet building code. Same shape every time. One model drives. One model builds.

One rule came out of all five, and I paid for it in real mistakes: Fable is a brilliant strategist and a dangerous judge. Ask it to shape a problem and it’s worth every token. Ask it to rule on a result, and it hands you a guess dressed up as a verdict.

Here is where the line sits.

The lineup, and what each costs

Putting one model in charge of another only makes sense because they’re priced in tiers. Fable is the sharpest of the four and the most expensive by a wide margin. You don’t want it typing out every line. You want it making the calls that a cheaper model then carries out.

Model	ID	Context	Max output	Price (in / out, per Mtok)	Seat I put it in
Fable 5	`claude-fable-5`	1M	128K	$10 / $50	Strategist (the driver)
Opus 4.8	`claude-opus-4-8`	1M	128K	$5 / $25	Builder (at the keyboard)
Sonnet 4.6	`claude-sonnet-4-6`	1M	64K	$3 / $15	Fast, cheaper implementation
Haiku 4.5	`claude-haiku-4-5`	200K	64K	$1 / $5	Bulk, parallel grunt work

Two numbers hide inside that table. Fable costs twice as much as Opus per token ($10/$50 against $5/$25), and its new tokenizer counts about 30% more tokens for the same text. Put those together and the same prompt costs roughly 2.6 times as much on Fable (2 × 1.3), before you count its longer, more deliberate answers. That math is the whole reason to use it as a driver and not a typist.
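Here is that arithmetic as a minimal sketch. The prices come from the table. The 1.30 inflation factor assumes the "about 30% more tokens" figure is measured against Opus's token count, and the 20,000 / 2,000 token call is a made-up example, not one of my real consults.

```python
# Prices per million tokens (input, output), taken from the table above.
PRICES = {
    "fable": (10.00, 50.00),
    "opus": (5.00, 25.00),
}

# Assumption: Fable's tokenizer yields ~30% more tokens than Opus's for the same text.
FABLE_TOKEN_INFLATION = 1.30


def cost_usd(model: str, input_tokens: float, output_tokens: float) -> float:
    """Cost in dollars for one call at list price."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


def fable_vs_opus_ratio(opus_input_tokens: float, opus_output_tokens: float) -> float:
    """How many times more the same text costs on Fable than on Opus."""
    opus = cost_usd("opus", opus_input_tokens, opus_output_tokens)
    fable = cost_usd(
        "fable",
        opus_input_tokens * FABLE_TOKEN_INFLATION,
        opus_output_tokens * FABLE_TOKEN_INFLATION,
    )
    return fable / opus


# Hypothetical call: 20,000 input and 2,000 output tokens as Opus counts them.
ratio = fable_vs_opus_ratio(20_000, 2_000)
print(f"Fable costs about {ratio:.1f}x Opus for the same text")
```

The ratio does not depend on the input/output mix, because both prices double and both token counts inflate by the same factor. Longer answers from Fable would push the real number higher.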

What it owns: the shape

The most impressive thing Fable did all week doesn’t show up on any feature list. It helped plan the analytics rebuild, a job that then ran mostly on its own: hundreds of automated edits over a few days. The code never drifted off course. At that volume, drift is usually the default. This is the long-horizon work Anthropic built Fable 5 for, so part of the win is the model. Only part.

It held together because the load-bearing decisions got made up front, before most of the code existed. The production contracts were frozen as fixtures, so every later change was checked against what production actually returned, not what anyone remembered it returning. The clean parts of the old engine were copied over untouched instead of rewritten. The project’s state lived in a STATUS.md file in the repo, so it survived the model running out of context. And one data layer fed both the dashboard and the agent, with a test asserting the two could never disagree. Fable supplied the long-horizon thinking to design all of that. The structure is what made it safe to leave running.
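Two of those decisions are easy to picture in code. What follows is a sketch of the idea, not the project's actual tests: the fixture contents, the field names, and every function name are hypothetical. The first half checks a candidate response against the shape of a frozen production response. The second half shows why a single data layer makes the dashboard and the agent agree by construction.

```python
import json

# Hypothetical frozen fixture: a response captured from production and committed to the repo.
FROZEN_FIXTURE = json.loads(
    '{"report_id": "r-1", "rows": [{"region": "west", "revenue": 120.5}], "total": 120.5}'
)


def contract_mismatches(fixture, candidate, path="$"):
    """Recursively compare structure and types (not values) against the fixture."""
    if type(fixture) is not type(candidate):
        return [f"{path}: expected {type(fixture).__name__}, got {type(candidate).__name__}"]
    problems = []
    if isinstance(fixture, dict):
        for key in fixture:
            if key not in candidate:
                problems.append(f"{path}.{key}: missing")
            else:
                problems += contract_mismatches(fixture[key], candidate[key], f"{path}.{key}")
    elif isinstance(fixture, list) and fixture and candidate:
        # Use the first fixture element as the expected shape of every candidate element.
        for i, item in enumerate(candidate):
            problems += contract_mismatches(fixture[0], item, f"{path}[{i}]")
    return problems


def data_layer(rows):
    """Single source of truth that both consumers read from."""
    return {"rows": rows, "total": sum(r["revenue"] for r in rows)}


def dashboard_total(rows):
    return data_layer(rows)["total"]


def agent_total(rows):
    return data_layer(rows)["total"]


# A candidate response with the wrong id type, a missing field, and no total.
drifted = {"report_id": 7, "rows": [{"region": "east"}]}
for problem in contract_mismatches(FROZEN_FIXTURE, drifted):
    print(problem)

# The agreement check: both consumers must report the same number.
assert dashboard_total(FROZEN_FIXTURE["rows"]) == agent_total(FROZEN_FIXTURE["rows"])
```

The type check is deliberately strict, so an integer where the fixture has a float counts as a mismatch. Whether that strictness is right depends on the contract.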

A capable model without scaffolding produces fast drift. Scaffolding without the model produces nothing. The plan is where they meet.

The same strength showed up on smaller jobs. Before I’d written a line of a trading backtest, I asked Fable to poke holes in the plan. It found four, and all four were real. The clearest: I was about to price each trade at the instant I spotted the signal, not the instant I could actually act on it. Left alone, that one flaw would have made the whole backtest look profitable when it wasn’t. Later, a single sentence from Fable reframed the problem in a way that unstuck it, and that reframe led straight to the result I’d been chasing.
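The fill-timing flaw is easier to see with numbers. This is a toy sketch, not my backtest: the `Bar` type, the one-bar holding period, and the prices are all invented for illustration. The only point is the difference between buying at the close that produced the signal and buying at the next open, which is the first moment you could act.

```python
from dataclasses import dataclass


@dataclass
class Bar:
    open: float
    close: float


def one_bar_return(bars: list[Bar], signal_index: int, realistic_fill: bool) -> float:
    """Return of a long trade held for one bar after a signal on bars[signal_index].

    The signal is only known once its bar has closed. The flawed version buys at
    that same close; the realistic version buys at the next bar's open.
    Both versions exit at the next bar's close.
    """
    next_bar = bars[signal_index + 1]
    entry = next_bar.open if realistic_fill else bars[signal_index].close
    return next_bar.close / entry - 1.0


# Hypothetical prices: the market gaps up right after the signal fires.
bars = [Bar(open=100.0, close=100.0), Bar(open=104.0, close=103.0)]

flawed = one_bar_return(bars, 0, realistic_fill=False)    # enters at 100.0
realistic = one_bar_return(bars, 0, realistic_fill=True)  # enters at 104.0
print(f"flawed: {flawed:+.2%}, realistic: {realistic:+.2%}")
```

On these made-up prices the same trade flips from a gain to a loss once the entry moves to a price that was actually available.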

Every time I asked Fable “what am I missing?”, it found something I’d stopped being able to see. That’s the work it’s good at. It isn’t attached to my idea the way I am, so it catches what I’ve talked myself past.

Where it’s dangerous: the verdict

Now the part that cost me.

One of those projects scores a long list of candidates. The scan came back: zero out of 250 passed. I pasted that into Fable and asked the obvious question. Should I kill this approach?

Fable didn’t hesitate. Three versions of the idea had now hit the same wall, it said, and three failures pointing the same way isn’t bad luck, it’s a finding. Walk away. It sounded right. I wrote up the post-mortem and moved on.

Then I actually looked at the scan. Of those 250 candidates, only about four had enough data to score at all. The other 246 were noise. My “0 of 250” was really “0 of 4,” which proves nothing. Fable never saw the underlying data, so it couldn’t have known that. It took my framing and handed it back to me with more confidence than I’d put into it.
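A scan summary that reports its own denominator would have stopped me before I ever pasted the number anywhere. A minimal sketch, assuming each candidate records how many observations it was scored on. The `Candidate` fields and the threshold of 30 are hypothetical, not what my scanner uses.

```python
from dataclasses import dataclass

# Hypothetical threshold: below this many observations, a score is treated as noise.
MIN_OBSERVATIONS = 30


@dataclass
class Candidate:
    name: str
    observations: int
    passed: bool


def summarize_scan(candidates: list[Candidate], min_observations: int = MIN_OBSERVATIONS) -> dict:
    """Count passes only among candidates that had enough data to score."""
    scorable = [c for c in candidates if c.observations >= min_observations]
    passed = sum(1 for c in scorable if c.passed)
    return {
        "scanned": len(candidates),
        "scorable": len(scorable),
        "passed": passed,
        "headline": f"{passed} of {len(scorable)} scorable ({len(candidates)} scanned)",
    }


# Recreate the shape of the scan: 246 thin candidates, 4 with real data, no passes.
scan = [Candidate(f"thin-{i}", observations=5, passed=False) for i in range(246)]
scan += [Candidate(f"full-{i}", observations=100, passed=False) for i in range(4)]

print(summarize_scan(scan)["headline"])
```

A headline built this way puts the 4 in front of the 250, so the framing that reaches the model already carries the doubt.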

I ignored the verdict and kept going. Within the hour, the approach turned up two clear winners. When I asked Fable again later that same session, it argued the opposite case just as fluently.

That’s the trap, and it’s built in. Fable is told to be decisive. It has no access to your data and no memory of the last thing it told you. So whatever you hand it, you get back a clean, confident answer shaped almost entirely by how you asked. The problem isn’t that it’s sometimes wrong. It’s that it’s more sure of itself than your doubt was, so the doubt loses an argument it should have won.

The mechanism that separates the two

The deck build made the line impossible to miss.

I ran Fable as the safety check on an 18-foot pool deck, three revisions in a row. Each time it came back with `ship_safe: true`. Then I handed it the actual span-table PDF and asked for a real review. The same plan it had just blessed used 4×4 posts on deck blocks where code wants 6×6 on frost footings, a 2×6 guard rail where code wants 2×8 or heavier, and no pool gate at all. That last one is a drowning risk, and Fable had waved it through three times.

The model didn’t get smarter between blessing the plan and tearing it apart. It got a PDF.

Ungrounded: ask the model to check the work from memory. It will confidently approve things that are wrong or missing.

Grounded: hand it the real document and make it choose. Now it catches the mistakes, including its own.

That changes how you should use it. “Have Fable check this for correctness” is the wrong job. “Have Fable reason over a document I put in front of it” is the right one. It’s the same idea as the Ground Truth Rule: the real value beats the remembered one, every single time.
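One way to make grounding the default is to refuse to build a review prompt at all unless a source document comes with it. This is a sketch of that guard, not the prompt I used: the function name, the tag layout, and the PASS / FAIL / NOT COVERED wording are all assumptions. It only assembles a string and calls no model API.

```python
def build_grounded_review_prompt(plan: str, source_name: str, source_text: str) -> str:
    """Build a review prompt that forces every finding to point at the source."""
    if not source_text.strip():
        # No document means the model would be judging from memory.
        raise ValueError(f"no text supplied for {source_name!r}; refusing to ask for a verdict")
    return "\n\n".join([
        f"You are reviewing a plan against {source_name}. Use ONLY the document below.",
        f'<document name="{source_name}">\n{source_text}\n</document>',
        f"<plan>\n{plan}\n</plan>",
        "For each requirement in the document, answer PASS, FAIL, or NOT COVERED, "
        "and quote the passage you relied on. If the document does not address "
        "something, say NOT COVERED instead of guessing.",
    ])


# Hypothetical inputs standing in for the deck plan and the extracted span-table text.
prompt = build_grounded_review_prompt(
    plan="4x4 posts on deck blocks, 2x6 guard rail",
    source_name="span-table.pdf",
    source_text="Posts: 6x6 minimum on frost footings.",
)
print(prompt)
```

The guard does not make the model right. It only guarantees the question being asked is "reason over this document" and never "check this from memory".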

Fake independence, and the bias mirror

Two traps hide inside “I’ll just have a second model check it.”

The first is fake independence. Having a second AI check the first feels like a real second opinion. It isn’t. They were trained on the same material, so they tend to invent the same wrong answer with the same confidence. A real check has to come from outside both of them: the actual document, the running code, the real number.

The second is the bias mirror. Over roughly ten questions in one session, I slowly stopped treating Fable as a sounding board and started treating it as the decider. With only one voice, I had no counterweight, so when Fable got pessimistic, I got pessimistic with it. The best moment all week was the one time I put the same question to two tools. Fable said to cut a data source entirely. The other said to keep two sources, since one was biased on its own. Their disagreement told me more than either verdict could have. One confident advisor doesn’t sharpen your judgment. It quietly becomes your judgment.

Match the cost to the call

Fable is heavy. Each consult ran around 23k tokens and a wait. A throwaway test call cost $0.43 in setup tokens alone. A real review ran past five minutes. When I wired one into a tight build loop, a single check stalled the whole pipeline, which sent me down a wasted detour before I landed on the obvious fix: a plain rules-based checker that did the same job in 73 milliseconds for free.

Per consult: ~23k tokens, a wait, $0.43 for even a trivial “ping”

Per token: twice Opus pricing, plus ~30% more tokens for the same text under the new tokenizer

In a loop: 5+ minutes per check, against 73ms for a deterministic checker that did the same job for $0

Slow, expensive models belong at rare decision points. Never inside a loop. If a check runs more than a handful of times, it should be code, not a model.
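For a sense of what "code, not a model" looks like, here is a sketch of a rules-based checker built around the deck example. It is not the checker from my pipeline. The `DeckPlan` fields are hypothetical, the thresholds are lifted from the review described earlier and are not a statement of any real building code, and comparing lumber by nominal cross-section is a crude stand-in for a proper span-table lookup.

```python
from dataclasses import dataclass


@dataclass
class DeckPlan:
    post_size: str    # nominal lumber size, e.g. "4x4"
    footing: str      # "deck_block" or "frost_footing"
    guard_rail: str   # nominal lumber size, e.g. "2x6"
    has_pool_gate: bool


def nominal_area(size: str) -> int:
    """Cross-section of a nominal size string: "2x8" -> 16."""
    width, depth = size.lower().split("x")
    return int(width) * int(depth)


def check_plan(plan: DeckPlan) -> list[str]:
    """Return every rule the plan breaks; an empty list means it passed."""
    failures = []
    if nominal_area(plan.post_size) < nominal_area("6x6"):
        failures.append(f"posts are {plan.post_size}, need 6x6 or larger")
    if plan.footing != "frost_footing":
        failures.append(f"footing is {plan.footing}, need frost_footing")
    if nominal_area(plan.guard_rail) < nominal_area("2x8"):
        failures.append(f"guard rail is {plan.guard_rail}, need 2x8 or heavier")
    if not plan.has_pool_gate:
        failures.append("no pool gate")
    return failures


# The plan that was blessed three times, expressed as data.
blessed_plan = DeckPlan("4x4", "deck_block", "2x6", has_pool_gate=False)
for failure in check_plan(blessed_plan):
    print("FAIL:", failure)
```

A check like this runs the same way on the thousandth call as on the first, which is exactly the property a build loop needs and a model consult cannot offer.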

The tell, and the honest caveat

The clearest sign I’m using it wrong: I’m reaching for Fable and I already know what I want it to say. That isn’t analysis. That’s shopping for permission.

When to call the instructor

Reframe a stuck problem: “what am I missing?”
Stress-test a plan before you build it
Produce a concrete directive from a real artifact you hand it
Pair it with a different adversary when the stakes are real

Don’t

Ask it to decide whether to give up
Ask it to validate a result; that’s what the forward data is for
Call it inside a loop

One caveat I won’t skip. I never ran a clean head-to-head against Opus doing the same driver work, and Opus makes the same mistakes when it judges from memory. My gut says Fable’s directions came back a notch sharper and more committed, but that’s a gut read, not a measurement. The real finding is simpler: grounding the instructor mattered far more than which model sat in the seat.

It’s the same lesson seventeen retrospectives keep teaching, seen from a new angle, and it’s why the hard enforcement in AI Change Control holds weight a confident advisor never will. Let the model shape the work. Check the facts against something real. And never let a confident answer stand in for a true one.

Related:

AI Trusts Your Docs. That’s the Problem.: the Ground Truth Rule, which grounding the instructor is a special case of
What 17 Retrospectives Taught Me About Coding With AI: the retro habit this lesson came out of
AI Change Control: the structural verification a model verdict can’t replace