Almost Timely News: 🗞️ 4 Angles on Local AI (2026-06-14)

The Big Plug

👉 My new course, GEO 201 on competitive GEO measurement, is now for sale.

Content Authenticity Statement

100% of this week’s newsletter was made by me, the human, and boy does it show. Learn why this kind of disclosure is a good idea and might be required for anyone doing business in any capacity with the EU in the near future.

What’s On My Mind: 4 Angles on Local AI

This week’s been one of those weeks where so much has happened and it’s so messy that there isn’t a theme, so I’m going to foam at the keyboard, write down my thoughts, and see what comes out on the other side.

Part 1: A Fable of Fable

“A fable is a short fictitious story designed to teach a specific moral lesson.”

This week, Anthropic’s fifth generation model family, Fable, became available for a short period of time. It debuted on Tuesday and was blocked by the US government on Friday for ambiguous reasons without clear evidence, despite the fact that its larger, more dangerous version, Mythos, has been available to large corporations for a few weeks now.

In my tests of Fable while it was still available, it was excellent at what it did, very, very expensive, and had clear use cases. What I found interesting was the level of amazement people had at it - folks were raving over its ability to discern intent from ambiguous, poorly written prompts and turn them into real results. While that’s an admirable capability, it says more about people’s unwillingness to take the time to plan and think things through.
Opus 4.8 on xHigh or Ultra settings can accomplish most of what Fable can do - if you prompt it well and let it iterate and think and review.

Here’s my general process for this, something Katie Robbert taught me. First, never just go off and do something. Take the time to build out the requirements, ideally with a framework like the 5P Framework by Trust Insights™. Once you’ve got requirements nailed down, build a specification. This can be a design spec, a writing spec, a code spec, something that says “here’s what we’re doing”. Then build a workplan from your spec. Finally, and only after you have requirements, spec, and plan, have AI go off and do the thing.

If you follow this general recipe, not only do you get great results, you also don’t have to use the biggest, heaviest, most costly AI model to do it. By the time you’ve reached the workplan and reconciled it with the requirements and spec, you’ll have anticipated the majority of things that could go wrong.

Where I found Fable quite powerful was in review. I had it delegate lighter tasks to smaller sub-models while it acted as a master review agent at the beginning and end of a process, uncovering lots of bugs and issues that previous QA runs hadn’t picked up or seen as a coherent picture. In that regard, it impressed the heck out of me.

But the fable here is something I wrote about earlier this year - if you want guaranteed access to AI, you absolutely must have a private, local version running somehow. It can be on your own hardware if you have a bespoke machine like an Asus GX10 or NVIDIA DGX Spark, or a well-appointed Mac. It can be on your company’s hardware. It can be on a bespoke hosting service, ideally outside your local jurisdiction. No matter what solution you pursue, you had better have a fallback.
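For the technically inclined, here is a minimal sketch of what "have a fallback" could look like in code. Everything in it is an assumption for illustration: `generate_with_fallback`, `cloud_model`, and `local_model` are hypothetical names, and the two model functions are stand-ins rather than real API calls. The idea is simply to try providers in order and use the first one that answers.

```python
from typing import Callable, Sequence

# Hypothetical provider shape: takes a prompt, returns text, raises on failure.
Provider = Callable[[str], str]


def generate_with_fallback(
    prompt: str, providers: Sequence[tuple[str, Provider]]
) -> tuple[str, str]:
    """Try each (name, provider) pair in order; return the first success."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # any failure means "try the next one"
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


def cloud_model(prompt: str) -> str:
    # Stand-in for a hosted service that has been blocked or is unreachable.
    raise ConnectionError("service unavailable")


def local_model(prompt: str) -> str:
    # Stand-in for a model running on your own hardware.
    return f"[local] {prompt}"


used, reply = generate_with_fallback(
    "Summarize this week's notes",
    [("cloud", cloud_model), ("local", local_model)],
)
print(used, reply)
```

In a real setup, each stand-in would wrap whatever client you actually use, and the order of the list is your priority order.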
The events of this week proved that any government can unilaterally cut you off from AI services, and because cloud providers must adhere to lawful requests, your access to cloud AI is contingent on your government’s approval. Having your own as a backup isn’t a nice-to-have, not if your business relies on generative AI now. The AI policy nerds call this AI sovereignty: every country should have its own AI so that no one country or government has control over it. More on this in a bit.

Part 2: Mini-token-maxxing

A couple of weeks ago, Minimax M3 came out and the company changed the billing for its token plan. Previously, for the M2.7 model, they billed by request, which was ideal for AI agents like OpenClaw and Hermes Agent. When M3 came out, they switched to token billing, allotting 1.7 billion tokens per month on their Plus plan. That sounds like a lot, but it’s really not.

How much not? When I was doing some work with my agent earlier this week, I kept hitting the 5 hour usage limit wall. Curious, I switched its operation from Minimax to Qwen 3.6 running on my MacBook (yay local AI). What I found shocked me - Hermes Agent was churning through about 13 million tokens an hour. Now, most of those were cached, meaning that the processing load was relatively small, but Minimax bills by token whether it’s cached or not.

If you haven’t worked with pay-as-you-go AI services (like Claude Enterprise, for example, or the many API versions of common AI tools), there’s usually a pricing difference between new tokens and cached tokens. Cached tokens usually cost significantly less than new ones, because the provider can reuse work it has already done instead of processing them from scratch.

Here’s a simple example. Imagine you commission a ghostwriter to write a blog post for you. The ghostwriter bills by the hour. After the first paragraph, you have a chat because you want the post to go in a different direction.
If the ghostwriter has to start from scratch, you’re going to pay full price for their output. If, on the other hand, the ghostwriter can reuse the paragraph they’ve already written - a cached version - then you don’t have to pay for that time and output again.

That’s what cached tokens mean - AI has already processed that text and can reuse the work. Normally, in API versions of AI, this comes at a discount, but under Minimax’s new plan, cached tokens count the same as regular ones. The ghostwriter charges you again for a paragraph they’ve already written and that you’ve agreed they can reuse.

So my little agent churning away at 13 million tokens an hour - when I plugged it into Qwen, I saw that almost 90% of the tokens were cached. If I were on a regular pay-as-you-go service, I would be spending very little. But instead, on the current token plan from Minimax, I’m using up my quota whether or not the tokens are new.

Why are they doing this? Because token plans - and by these I mean any flat rate plan, like Claude Max, ChatGPT, etc. - are money losers for AI companies, often substantial ones. If you look at Claude Max 20 and how much output you get for $200, you’re getting roughly $8,000 worth of tokens. That’s a 97.5% discount - and there’s no way that’s sustainable in the long term.

Minimax likely ran into token economics faster than Anthropic, so the new plan with the new model is more realistic in terms of what you pay versus what you get. Expect this to be the case across the industry in the coming months and years. No business can reliably sell its product at a 97.5% discount for long.

Minimax’s token plan used to be the best deal in AI; today, it’s a good deal but not a great one - and what you can run on your machine can match it.

Also in the news this week, Minimax M3 was released as open weights; anyone with enough hardware can download it and run it for free, but you need lots and lots of hardware. It’s a beefy model, 428 bill |