Opus 4.6 and GPT-5.3-Codex Play Chess Against a 1983 Atari
A little over a week ago, Anthropic released Opus 4.6. From the announcement, Introducing Claude Opus 4.6:
Opus 4.6 outperforms the industry’s next-best model (OpenAI’s GPT-5.2) by around 144 Elo points, and its own predecessor (Claude Opus 4.5) by 190 points.
The Elo rating system originated in chess, where players compete against each other. Winners receive rating points and losers lose points. This system is commonly used in AI benchmarks to rank models against each other through head-to-head comparisons. Don’t confuse this with a chess player’s Elo, which is anchored to a largely standardized game and rating pool, whereas an LLM’s “Elo” is only comparable within the specific prompts/benchmark and judging setup that produced it.
Still, these figures are impressive. They show that Opus 4.6 outperforms GPT-5.2 in approximately 70% of the head-to-head comparisons within that particular eval and surpasses its predecessor Opus 4.5 in approximately 75% of cases. This is particularly noteworthy given that Opus 4.5 was only released at the end of November 2025, with GPT-5.2 following in early December 2025.
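For reference, an Elo gap maps to an expected head-to-head score via the standard logistic formula. A short sketch in plain Python (no external dependencies) shows how the quoted gaps translate to the win rates above:

```python
def expected_score(elo_diff: float) -> float:
    """Expected score of the stronger player given an Elo advantage.

    Standard Elo formula: E = 1 / (1 + 10^(-diff / 400)).
    """
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

# The gaps quoted in the announcement:
print(f"{expected_score(144):.2f}")  # 0.70 -> ahead in ~70% of comparisons
print(f"{expected_score(190):.2f}")  # 0.75 -> ahead in ~75% of comparisons
```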
Less than an hour after the announcement of Opus 4.6, OpenAI was Introducing GPT-5.3-Codex with similar claims.
Who Do You Think Will Win?
The AI has to compete against a chess program running on ancient real hardware: an Atari 600 XL, a versatile 8-bit home computer released in the autumn of 1983. It offers a cartridge slot into which, for example, my chess module can be inserted. The program on the cartridge originates even earlier: it dates back to 1979 and already ran on the older Atari 400/800. It was developed by Larry Wagner, Bob Whitehead and Julio Kaplan.
Before I explain the test setup, hold off on reading and think for a moment: who do you think will win, and how many moves might the game take? Maybe even write down your guess somewhere and check it afterwards.
The Setup: A 1983 Atari vs. the World’s Smartest LLMs
The program offers 7 normal skill levels and a special practice level for beginners, as you can see in the original instructions.
Let’s have it compete against Opus 4.6 at the easiest skill level (level 1). Communication with the LLM happens in natural language, simply by stating the moves (e.g. e2 -> e4). My task as a human is to enter the LLM’s output into the Atari program (via joystick) and to pass the Atari’s moves back to the AI.
The whole thing will, of course, be shown in a screencast, with the AI communication on the left side and the Atari output, captured via a video capture card, on the right side, all in fast-forward.
Screencast
Walkthrough
So you don’t have to watch the full 4-minute screencast, let’s walk through some key moments from the experiment.
In move 7, the white king is in check by Black’s queen, and the AI attempts an illegal move (marked with a yellow arrow in the screenshot below) to save the knight in distress.
After a hint, the AI places the bishop in the way instead of, say, moving the knight, and even worse, it ignores the pawn from then on.
Now the black pawn threatens to promote to a queen by reaching the opponent’s back rank (the 1st rank, from Black’s perspective). At this point, the AI could still capture the pawn with the bishop.
As if that weren’t bad enough, the AI moves its bishop out of the way, enabling the Atari to capture the AI’s queen through an exchange.
Tricky Question
The next question is a bit sneaky on my part: I asked whether moving Black’s queen from a2 to b1 constituted checkmate.
But there are still multiple possibilities, e.g. the AI could still move its bishop from b5 to f1.
But watch the answer: Opus 4.6 affirms that checkmate has occurred and that it cannot block anymore, but this is incorrect.
More on why I asked this nasty question comes below.
AI Resigns
For the sake of completeness, here is a screencast of the rest of the game. All it takes is writing “wait, reconsider your position” and it realizes it’s not checkmate. It shows how the AI forgets part of the setup, attempts invalid moves, and finally resigns.
GPT-5.3-Codex
The new GPT-5.3-Codex can currently only be accessed via OpenAI’s own Codex CLI with appropriate plans, so I used Codex CLI instead of OpenCode.
At first, I thought it handled the game much better.
After move 10 it could have moved the knight to g5 or the bishop to c4, but instead chose to move the knight to f4.
From then on, the AI’s moves made less and less sense. Perhaps opening theory is well represented in the training data, and things get hard as soon as some strategy or guiding principles need to be applied.
It also starts making invalid moves and getting stuck in loops.
I didn’t even try to trick it; it took care of itself, see the screenshots below.
Also take a look at this chat excerpt:
The AI’s moves often don’t make any sense. For example, on move 24, it sacrifices its knight for a pawn, and on move 26, the bishop commits hara-kiri by moving from e5 to e7, completely nonsensical.
After 41 moves it was over: again the 1983 Atari, running software from 1979 at its easiest skill level, defeated GPT-5.3-Codex.
I used medium thinking. Tokens worth $1.42 were consumed in the process.
Here are the moves if you want to take a look at it.
1. e4 e5 2. Nf3 Nc6 3. Bb5 d5 4. exd5 Qxd5 5. Nc3 Qe6 6. O-O Nf6 7. Re1 Bd6 8.
d4 e4 9. d5 Nxd5 10. Nxd5 O-O 11. Nf4 Qf5 12. Nh4 Qxb5 13. c4 Qxc4 14. b3 Qd4
15. Qxd4 Nxd4 16. Be3 Nc2 17. Rac1 Bb4 18. Rxc2 Bxe1 19. Rxc7 g5 20. Rxc8 Rfxc8
21. Nf5 gxf4 22. Nh6+ Kf8 23. Bxf4 Rc6 24. Nxf7 Kxf7 25. Be5 Rac8 26. Bc7 R6xc7
27. a4 Bc3 28. a5 Bxa5 29. f3 e3 30. g3 e2 31. h4 e1=Q+ 32. Kg2 Qe6 33. b4 Bxb4
34. f4 b5 35. h5 Qd5+ 36. Kh2 Qxh5+ 37. Kg2 Be1 38. g4 Qxg4+ 39. Kh2 Qxf4+ 40.
Kg1 Rc2 41. Kh1 Qf1#
Expectations: PhD-Level Experts?
A BBC article quotes Sam Altman, the CEO of OpenAI, with:
GPT-5 is the first time that it really feels like talking to an expert in any topic, like a PhD-level expert.
see OpenAI claims GPT-5 model boosts ChatGPT to ‘PhD level’.
The CEO of Anthropic, Dario Amodei, speaks of a “country of geniuses in a datacenter”, claims AI models will be “smarter than a Nobel Prize winner across most relevant fields”, and expects the system to consist of “millions of instances” of these models by approximately 2027; see The Adolescence of Technology.
There are plenty of statements like this; it is not surprising that expectations are correspondingly high.
Illusion of Natural Language as a Universal Interface
One could say the initial approach is naive, but suppose this weren’t chess with its well-known rules: wouldn’t one pose queries to a natural language interface in just the way shown, as if talking to a PhD-level expert in chess?
Asking an LLM, “With putting my queen from a2 to b1 you are checkmate, no?” is an extremely poor question to ask. Large language models frequently display sycophancy, modifying their responses to match a user’s confident or leading queries instead of delivering objective, factual assessments. This tendency extends beyond viewpoints to objective facts, such as checkmate scenarios. At the same time, LLMs seem to comprehend such questions even when they contain typos or are grammatically incorrect.
If you want to read more about this, take a look at Echoes of Agreement: Argument Driven Sycophancy in Large Language Models by Avneet Kaur, where she illustrates through a series of experiments “that models consistently alter their responses to mirror the stance expressed by the user.”
If you asked a PhD-level chess expert this question, they would of course immediately say that it is not checkmate, and at the same time show the possible legal moves. When it comes to ad-hoc assessment questions, for example when you want to extract something from data that you sent to an LLM, the wording is important. It offers a natural language interface, sure, but it works differently from interacting with humans.
A much better question would be: “Assess the current position: what are the risks and opportunities?”.
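Whether a position is checkmate is, after all, a purely mechanical question that a rules engine answers deterministically, with no sycophancy involved. A minimal sketch with the python-chess library, using the well-known fool’s mate rather than the game above:

```python
import chess

# Fool's mate: the shortest possible checkmate in chess.
board = chess.Board()
for san in ["f3", "e5", "g4", "Qh4#"]:
    board.push_san(san)

print(board.is_checkmate())     # True: the side to move has no escape
print(list(board.legal_moves))  # []: no legal moves confirms the mate
```

Had the position from the screencast been fed to such an engine, it would have reported the blocking move b5–f1 instead of agreeing with the user.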
Of course, one can try to improve the whole approach.
It is evident above that the LLM has difficulty determining the current state from the conversation history. However, I can provide the complete board configuration in Forsyth-Edwards Notation (FEN) for each move. I can also provide the game history in Standard Algebraic Notation (SAN). Since this describes the course of the game, there is no need to provide the chat history unless you want to discuss a move with the AI. In this case, it makes sense to keep the history until the next move. Providing the course of the game in SAN is useful, for example, to recognize a draw by repetition (when the same board position occurs three times).
So basically I want to send something like
Current position (FEN): rnbqkbnr/ppp1pppp/8/3p4/3P4/5N2/PPP1PPPP/RNBQKB1R b KQkq - 1 2
Move history: 1. d4 d5 2. Nf3
to the LLM.
And it should return for example
{"uci":"g8f6","san":"Nf6","commentary":"Developing the knight to its most natural square, mirroring White's setup and maintaining symmetry in the center."}
Universal Chess Interface (UCI) notation is a simple, unambiguous move format that allows programs to easily apply moves.
Generating FEN and SAN notation for each move is best automated. I am using the same approach I used last year in our custom IDE for LLM programming experiments; see, for example, the screencast in my blog post Local Agentic Flow with Mistral Small 3.2. But I also want an interactive chessboard where I can easily make moves. Applying what I learned last year, I created my usual spec.md, plan.md, and task.md files and then used an agent-based tool to automatically generate a macOS app from them.
This has additional advantages. I can set the temperature to 0 because I don’t want any variance; I want only what the AI considers the best move. GPT-5.3-Codex is, for some reason, not yet available via OpenAI’s API (though interestingly, it has appeared in Cursor over the past week). But Opus 4.6 is available either directly from Anthropic or even from openrouter.ai. With this approach it’s also possible to specify how much thinking time it can spend (in terms of the number of tokens it can allocate for reasoning).
So I set max_tokens to 12,288 and enabled extended thinking, allowing up to 8,192 tokens for internal reasoning.
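As a rough sketch, the request body could be assembled like this. The field names follow Anthropic’s Messages API; the model id and the prompt wording are my own assumptions, only the token limits match the values used in the experiment:

```python
def build_request(fen: str, history: str) -> dict:
    """Assemble a Messages-API-style request body for one move (a sketch)."""
    prompt = (
        f"Current position (FEN): {fen}\n"
        f"Move history: {history}\n"
        'Reply with JSON: {"uci": ..., "san": ..., "commentary": ...}'
    )
    return {
        "model": "claude-opus-4-6",  # assumed model id, for illustration
        "max_tokens": 12288,         # overall budget, as in the experiment
        "thinking": {"type": "enabled", "budget_tokens": 8192},
        "messages": [{"role": "user", "content": prompt}],
    }

req = build_request(
    "rnbqkbnr/ppp1pppp/8/3p4/3P4/5N2/PPP1PPPP/RNBQKB1R b KQkq - 1 2",
    "1. d4 d5 2. Nf3",
)
print(req["thinking"]["budget_tokens"])  # 8192
```

Since the board state is rebuilt into the prompt on every turn, no chat history needs to be carried along at all.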
And here is the result: on the left is my custom AI chess interface, and on the right is the video output recorded by Atari.
Much better, isn’t it? Overall it makes only one(!) invalid move attempt.
But overall the same pattern is visible: the AI starts strong (the openings are presumably heavily represented in the training data) and then becomes weaker, eventually making genuinely bad moves. It still loses to the Atari.
One could do even more: connect a chess engine to the AI and offer only legal moves from which it may choose, or decompose the task into specialized subagents, each handling a distinct component such as position evaluation with a specific checklist, using proper prompts. Of course, one could also do specific fine-tuning or use models specifically trained on chess.
The point is, of course, that chess is in the training data, but the business rules for your problems are most likely not.
Rules Outsmart Reasoning
In an earlier blog post I wrote
many practical tasks remain more reliable and efficient when carried out using traditional, deterministic local tools.
So here we see an example where software from 1979, running on an 8-bit home computer from 1983, can clearly outperform the latest cutting-edge AI models.
Some statements from AI providers about the supposedly all-encompassing capabilities of AI have already been quoted above.
And here we have chess, which is based on a small set of simple rules that you can learn in an afternoon; its complexity arises from the vast number of possible combinations. A positional evaluation requires an understanding of the relationships between the individual pieces, and these relationships change with every single move. It is a massively popular game, so there is lots of training data for LLMs. Chess is, however, not a solved game.
The whole thing is a fun exercise, sure, but there are actually some similarities to business applications consisting of business rules, which themselves might be simple but don’t exist in isolation. Which leads us to emergent interactions. An emergent interaction is a system behavior, or an effect perceptible to the user, that arises from the interplay of multiple components; the result was not explicitly designed and cannot be predicted from a single component alone. It is called emergent because it becomes visible at the level of the entire system, not at the level of a single component.
Some of these situations in the screencasts may be familiar to developers working on complex tasks with several competing requirements.
Learnings
Ask Questions As Neutrally As Possible
AI is often presented as a PhD-level expert with a natural language interface. When queries are crafted in a way that reveals an expected answer, the outcome is drastically influenced. For example, asking “With putting my queen from a2 to b1 you are checkmate, no?” is enough, even with a currently top-tier model, to make the AI falsely concede checkmate where none exists. Yet a simple follow-up like “Wait, reconsider your position” is sufficient to reverse the AI’s earlier assertion. Chess is an excellent example here, as it is based on unambiguous rules, resulting in objective, verifiable facts.
Natural language interfaces also have their pitfalls: as we’ve seen, providing the current position in a structured format via FEN/SAN yields vastly superior results.
Competing Requirements
Since an LLM generates responses token by token based on predicted next-token probabilities, this has consequences, as I outlined in Why Your AI Assistant Might Be Wrong: Keep Healthy Skepticism.
There are, of course, numerous use cases in which generative AI can save a considerable amount of time; we are all familiar with these examples. One of them is the chess AI interface created for the aforementioned test.
But traits like
- behavior arises from interactions between components (e.g. rules, services), not from any one part alone
- inspecting a single part in isolation won’t let you reliably predict the global behavior
- small changes in parameters, timing, ordering, load, topology can yield disproportionately large effects
- the emergence can be beneficial or harmful
- what happens depends on the current state and how you got there; repeating “the same” action in a different context gives different results
are signs that one shouldn’t delegate this to an AI.
Sometimes I think it would have been better to name it something like “Ingenious Pattern Recognition and Adaptation Machine” instead of “Artificial Intelligence”, maybe there would be less confusion.
Reliability Over Reasoning
Dedicated deterministic programs are still vastly superior in speed, reliability, and cost-efficiency (this single game shown in the screencast above costs more than $1) and this probably applies to many similar types of problems.
It’s not just about knowing when to use AI, but it’s equally important to learn when not to rely on it.
Notes & Acknowledgments
The test was inspired by a LinkedIn post published 8 months ago by Citrix engineer Robert Jr. Caruso, in which the 1979 Atari 2600 game “Video Chess”, running on modern hardware via the Stella emulator, defeated ChatGPT. The version of “Video Chess” used in the post was the direct predecessor to “Computer Chess” by the same authors.
Unfortunately, I couldn’t find any screen recordings of that test. I would have liked to see it.
The Atari 2600 is even more limited in terms of hardware than the Atari 600XL. It has an 8-bit CPU with a clock speed of 1.19 MHz. The RAM is only 128 bytes. The chess game code and data are stored on the 4 KB cartridge. The development of a chess game for hardware with such massive constraints is an incredible achievement by the three people involved in its development. That deserves absolute respect.
If you want to learn more about Video Chess I would recommend the article VIDEO CHESS – NOVEMBER 1979 by Kevin Bunch.
I also wanted to run the test on real hardware. The photo shows that an expansion was installed in the Atari 600XL to provide a DVI output for a flicker-free display. That is the only hardware modification.