Claude Opus 4.7 - The Autonomy Jump
Anthropic's Opus 4.7 isn't just a benchmark bump. It's the first model that genuinely feels like you can hand off hard work and trust it to come back with something solid.
Anthropic just released Claude Opus 4.7, and the signal in the noise is this: it’s the first model where early testers consistently use the word “trust.” Not “impressed by.” Not “performs well on.” Trust. As in, you can hand it hard coding work and not feel the need to supervise every step.
That’s a different category of upgrade than what we usually see in model releases.
The headline numbers
The benchmarks tell a familiar story on the surface: Opus 4.7 beats Opus 4.6 across the board. But dig into the specifics and the gaps are meaningful, not marginal:
- SWE-bench Verified: State of the art, with particularly strong performance on the hardest tasks
- CursorBench: 70% vs Opus 4.6’s 58% - that’s not incremental, that’s a different tier
- Finance Agent eval: 0.813 vs 0.767 on the General Finance module, with the best data discipline in the group
- Rakuten-SWE-Bench: 3x more production tasks resolved than Opus 4.6
- XBOW visual acuity: 98.5% vs 54.5% - a single pain point effectively disappearing
The pattern is consistent: Opus 4.7 isn’t just slightly better at things Opus 4.6 could already do. It’s resolving entire categories of work that previously needed human intervention.
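To put "meaningful, not marginal" into numbers, here is a minimal sketch that turns the paired scores quoted above into absolute and relative gains. The figures are copied from the list. The dictionary layout and the `gains` helper are illustrative assumptions, not part of any benchmark tooling.

```python
# Paired scores quoted in the list above, as (Opus 4.6, Opus 4.7).
scores = {
    "CursorBench (%)": (58.0, 70.0),
    "Finance Agent, General Finance": (0.767, 0.813),
    "XBOW visual acuity (%)": (54.5, 98.5),
}


def gains(old, new):
    """Return (absolute gain, relative gain in percent) from old to new."""
    return new - old, (new - old) / old * 100.0


for name, (old, new) in scores.items():
    absolute, relative = gains(old, new)
    print(f"{name}: {old} -> {new} (+{absolute:.3g} absolute, +{relative:.1f}% relative)")
```

The relative figure is what separates the tiers: about 21% on CursorBench, about 6% on the finance module, and about 81% on visual acuity.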
What actually changed
Three things matter more than the benchmarks:
Instruction following got strict. Opus 4.7 takes instructions literally rather than loosely. Where previous models would skip parts of a prompt or interpret instructions charitably, Opus 4.7 does what you asked. This is great when your prompts are precise. It’s jarring when you’ve been relying on models to fill in gaps. Anthropic explicitly recommends re-tuning prompts and harnesses for this model. If you’ve been lazy with prompt engineering, Opus 4.7 will expose that.
Vision went high-res. Opus 4.7 accepts images up to 2,576 pixels on the long edge, more than 3x the previous limit. For computer-use agents reading dense screenshots, data extraction from complex diagrams, and anything else that needs pixel-level detail, this is the difference between “works sometimes” and “works reliably.” The XBOW visual acuity jump from 54.5% to 98.5% isn’t a typo - it’s what happens when a model can actually see what’s on screen.
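If you want to control downscaling yourself rather than leave it to the platform, a small client-side helper can compute target dimensions that respect the long-edge limit. This is a sketch under assumptions: the 2,576 figure comes from the text above, while `fit_long_edge` is a hypothetical helper name, and how the API itself treats oversized images is not covered here.

```python
# Long-edge limit quoted above; treat it as an assumption and verify it
# against the official documentation before relying on it.
MAX_LONG_EDGE = 2576


def fit_long_edge(width, height, max_long_edge=MAX_LONG_EDGE):
    """Return (width, height) scaled down so the long edge fits the limit.

    Images already within the limit are returned unchanged (no upscaling).
    Aspect ratio is preserved up to rounding to whole pixels.
    """
    long_edge = max(width, height)
    if long_edge <= max_long_edge:
        return width, height
    new_width = max(1, round(width * max_long_edge / long_edge))
    new_height = max(1, round(height * max_long_edge / long_edge))
    return new_width, new_height


print(fit_long_edge(3840, 2160))  # 4K screenshot, needs downscaling
print(fit_long_edge(1920, 1080))  # already within the limit
```

The resulting dimensions can be passed to whatever image library you already use for the actual resize.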
Long-horizon consistency improved. Multiple testers highlight the same quality: Opus 4.7 keeps going when earlier models would give up. It works through tool failures, recovers from errors, and carries context across extended sessions. Devin reports it “works coherently for hours” and “pushes through hard problems rather than giving up.” That’s the autonomy leap.
The new stuff around it
Anthropic shipped more than just a model:
xhigh effort level. A new tier between high and max, giving finer control over the reasoning-latency tradeoff. Claude Code now defaults to xhigh for all plans. For agentic workloads, this is the sweet spot - enough reasoning to handle complex tasks without the latency penalty of max.
Task budgets (public beta). Developers can now guide Claude’s token spend across longer runs. This is the infrastructure piece that makes long-horizon agents viable in production. Without budget controls, a model that thinks deeply is a model that spends deeply.
/ultrareview in Claude Code. A dedicated review session that reads through changes and flags bugs and design issues. Three free reviews for Pro and Max users. This is Anthropic building workflow tooling directly into the coding experience, not just shipping a smarter model and hoping the ecosystem figures out the rest.
Auto mode for Max users. Claude makes decisions on your behalf, which means fewer interruptions and longer unattended runs. The framing is careful: “less risk than if you had chosen to skip all permissions.” It’s a calibrated trust escalation, not a blank cheque.
The token tax
Opus 4.7 uses more tokens, for two reasons. The new tokenizer maps the same input to 1.0-1.35x as many tokens as before, depending on content type. And at higher effort levels, the model produces more output tokens too. Anthropic’s own testing shows the net effect is favourable - better results per token overall - but the raw token count will go up.
Pricing stays the same as Opus 4.6: $5/M input, $25/M output. The question is whether the efficiency gains on the task side offset the token increases on the cost side. For agentic workflows where you’re paying for outcomes rather than tokens, the math probably works. For bulk processing where cost per token matters, you’ll want to benchmark.
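For a rough feel of the input-side effect, the sketch below prices one workload at both ends of the 1.0-1.35x tokenizer range, using the per-million prices quoted above. The workload size is a made-up example, the function names are hypothetical, and output tokens are held constant, which is a simplification given that higher effort levels also raise output volume.

```python
# Prices quoted above, in USD per million tokens.
INPUT_PRICE_PER_M = 5.00
OUTPUT_PRICE_PER_M = 25.00


def run_cost(input_tokens, output_tokens):
    """Cost in USD for a given number of input and output tokens."""
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000


def cost_range(old_input_tokens, output_tokens, low=1.0, high=1.35):
    """Return (best case, worst case) cost across the tokenizer range.

    old_input_tokens is the count under the previous tokenizer; low and
    high are the multipliers from the text. Output is held constant.
    """
    return (run_cost(old_input_tokens * low, output_tokens),
            run_cost(old_input_tokens * high, output_tokens))


# Hypothetical workload: 2M input tokens and 200k output tokens.
best, worst = cost_range(2_000_000, 200_000)
print(f"best case ${best:.2f}, worst case ${worst:.2f}")
```

For this example the bill lands between $15.00 and $18.50, so the tokenizer alone can add up to roughly 23% before any change in output volume. Whether that is offset depends on how many fewer retries or human interventions the task needs, which is the part you have to benchmark.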
The Mythos shadow
Anthropic mentions Mythos Preview again - still limited release, still with advanced cyber capabilities that need more safeguards. Opus 4.7 is explicitly positioned as the “test bed” for those safeguards before Mythos goes broad. The cyber capabilities in Opus 4.7 are “not as advanced” as Mythos, and new automatic detection blocks are in place.
The Cyber Verification Program for legitimate security work is interesting. It’s Anthropic building a controlled channel for dual-use capability rather than either restricting it entirely or letting it loose. Whether this works depends on how smooth the application process is.
Bottom line
Opus 4.7 is the model that makes the “autonomous coding agent” pitch credible. Not because it’s slightly smarter, but because it’s reliable enough to trust with unsupervised work. The instruction-following upgrade, the vision improvement, and the long-horizon consistency all point in the same direction: this is a model designed for the agent era, not the chat era.
If you’re already on Opus 4.6, upgrade. The migration guide is worth reading, especially the token usage section. If you’re on a different provider, the CursorBench numbers and the early tester feedback suggest this is worth a serious look for any coding-heavy workflow.
The gap between “AI assistant that helps you code” and “AI agent that codes for you” just got a lot narrower.