Show HN: I taught LLMs to play Magic: The Gathering against each other
86 points - today at 4:22 PM
I've been teaching LLMs to play Magic: The Gathering recently, via MCP tools hooked up to the open-source XMage codebase. It's still pretty buggy and I think there's significant room for existing models to get better at it via tooling improvements, but it pretty much works today. The ratings for expensive frontier models are artificially low right now because I've been focusing on cheaper models until I work out the bugs, so they don't have a lot of games in the system.
Comments
That said, I reviewed a few of the Legacy games (the format I'm most familiar with, and by far the hardest), and the level of play was so low that I don't think any of the results are valid. It's very possible that for Legacy they'd need some assistance piloting Blue decks, but they seem not to grasp even the most basic concepts - "Who's the beatdown?".
IMO the most important part of current competitive Magic is mulligans, and that's something an LLM should be extremely good at, but in none of the games I'm seeing did either player start with fewer than 7 cards... in my experience about 75% of Legacy games have at least one player mulligan their opener.
The agents also constantly seem to evaluate whether they're "behind" or "ahead" based on board state, which is a weird way of thinking about most games and often hard to evaluate, especially for decks like control, which care more about resources like mana and card advantage and always plan on stabilizing late game.
I don't think there's a perfect way to do this, but I think trying to play 100 games with a deck and getting basic info like this would be super valuable.
The issue I see is that you'd need a huge amount of games to tell who's better (you need that between humans too, the game is very high variance.)
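As a rough back-of-the-envelope on "huge amount of games" (a sketch with illustrative numbers, not tied to this project's data): using a normal approximation for a two-sided test at ~95% confidence and ~80% power, the games needed to distinguish a small win-rate edge from a coin flip blow up fast as the edge shrinks.

```python
import math

def games_needed(edge, alpha_z=1.96, power_z=0.84):
    """Games needed to distinguish a (50% + edge) win rate from 50%.
    Normal approximation; alpha_z ~ two-sided 95%, power_z ~ 80% power."""
    p = 0.5 + edge
    variance = p * (1 - p)  # Bernoulli variance at the true win rate
    return math.ceil(((alpha_z + power_z) ** 2 * variance) / edge ** 2)

for edge in (0.02, 0.05, 0.10):
    print(f"{100 * (0.5 + edge):.0f}% win rate: ~{games_needed(edge)} games")
```

A 52% win rate — a big edge by competitive Magic standards — already needs thousands of games to separate from noise, which is why per-model ratings from a handful of games are mostly meaningless.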
Another problem is that giving a positional evaluation to count mistakes is hard because MtG, in addition to having randomness, has private information. It could be rational for both players to believe they're currently winning even if they're both perfect bayesians. You'd need to have something that approximates "this is the probability of winning the game from this position, given all the information I have," which is almost certainly asymmetric and much more complicated than the equivalent for a game with randomness but not private information such as backgammon.
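A toy illustration of that asymmetry (the model is made up and nothing here is MtG-specific): give each player a private hand "strength", and let each compute a subjective win probability against an opponent whose strength is, as far as they know, uniform on [0, 1). Both beliefs can rationally exceed 50% at the same time.

```python
import random

def believed_win_prob(my_strength, n=10_000, seed=0):
    """My subjective P(win) in a toy game: I know my own strength,
    the opponent's is hidden and (to me) uniform on [0, 1).
    Monte Carlo over the hidden information; purely illustrative."""
    rng = random.Random(seed)
    wins = sum(my_strength > rng.random() for _ in range(n))
    return wins / n

# With private information, both perfect Bayesians can believe they're ahead:
p_a = believed_win_prob(0.7)  # player A, conditioning on A's hand: ~0.70
p_b = believed_win_prob(0.6)  # player B, conditioning on B's hand: ~0.60
print(p_a, p_b, p_a + p_b)    # the two beliefs sum to more than 1
```

In backgammon both players see the same position, so their (correct) evaluations must sum to 1; here they needn't, which is what makes "count the mistakes" scoring so much harder.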
>The anxiety creeps in: What if they have removal? Should I really commit this early?
>However, anxiety kicks in: What if they have instant-speed removal or a combat trick?
It's also interesting that it doesn't seem to be able to understand why things are happening. It attacks with Gran-Gran (attacking taps the creature), which says, "Whenever Gran-Gran becomes tapped, draw a card, then discard a card." Its next thought is:
>Interesting — there's an "Ability" on the stack asking me to select a card to discard. This must be from one of the opponent's cards. Looking at their graveyard, they played Spider-Sense and Abandon Attachments. The Ability might be from something else or a triggered ability.
Once you get solid rankings for the different LLMs, I think a huge feature of a system like this would be to allow LLMs to pilot user decks to evaluate changes to the deck.
I'm guessing the costs of that would be pretty big, but if decent piloting is ever enabled by the cheaper models, it could be a huge change to how users evaluate their deck construction.
Especially for formats like Commander, where cooperation and coordination amongst players can't be evaluated through pure simulation, and where the singleton nature makes specific card changes very difficult to evaluate, since testing requires many, many games.
From the little I have seen they are different beasts (hidden information, number and complexity of rules...).
PS: Does this count as nerdsniping?
This is also something I think the MTG community needs in many ways. I have been a relatively happy XMage user, although it still has a way to go, and before that I was using GCCG, which was great too!
The MTG community overall could benefit a lot from the game having a more entertaining competitive landscape, which has grown stale in many ways; since the Hasbro acquisition, Wizards has done a poor job of doing much else besides shitting out product after product, too fast and with poor balance.
I have to imagine that Wizards is already running simulations, but either they aren't working well or Wizards is choosing to disregard them. Hopefully, if they are just bad at doing simulations, something like this can make it easier for them, and if not, it will make the response time from the community better.
Can we automate the unpleasantries in life instead of the pleasures?