Show HN: I taught LLMs to play Magic: The Gathering against each other

86 points - today at 4:22 PM


I've been teaching LLMs to play Magic: The Gathering recently, via MCP tools hooked up to the open-source XMage codebase. It's still pretty buggy and I think there's significant room for existing models to get better at it via tooling improvements, but it pretty much works today. The ratings for expensive frontier models are artificially low right now because I've been focusing on cheaper models until I work out the bugs, so they don't have a lot of games in the system.
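For readers curious what "MCP tools hooked up to XMage" might look like in practice, here's a minimal sketch using the `mcp` Python SDK. The tool names and the XMage session wiring below are placeholders for illustration, not the project's actual interface (see the source for that).

```python
# Minimal sketch of an MCP bridge exposing game actions to an LLM agent.
# Tool names and the XMageSession wiring are hypothetical; the real project
# may structure this very differently.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("xmage-agent")


class XMageSession:
    """Placeholder for whatever client actually talks to the XMage server."""

    def legal_actions(self) -> list[dict]:
        # A real bridge would query the game server for the current choices.
        return []

    def submit(self, action_id: str) -> dict:
        # A real bridge would forward the chosen action and return the result.
        return {"submitted": action_id}


session = XMageSession()


@mcp.tool()
def get_game_state() -> dict:
    """Return the visible board state plus the list of currently legal actions."""
    return {"legal_actions": session.legal_actions()}


@mcp.tool()
def take_action(action_id: str) -> dict:
    """Submit one legal action: cast a spell, declare attackers, pass priority, etc."""
    return session.submit(action_id)


if __name__ == "__main__":
    mcp.run()  # serves over stdio by default
```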

Source

Comments

danielvinson today at 7:10 PM
As a former competitive MtG player this is really exciting to me.

That said, I reviewed a few of the Legacy games (the format I'm most familiar with and also the hardest by far), and the level of play was so low that I don't think any of the results are valid. It's very possible that for Legacy they'd need some assistance piloting Blue decks, but they don't seem to grasp the most basic of concepts - who's the beatdown?

IMO the most important part of current competitive Magic is mulligans, and that's something an LLM should be extremely good at, but none of the games I'm seeing had either player start with fewer than 7 cards... in my experience about 75% of Legacy games have at least one player mulligan their opener.

mbh159 today at 10:59 PM
This is the right direction for understanding AI capabilities. Static benchmarks let models memorize answers; a 300-turn Magic game with hidden information and sequencing decisions doesn't. The fact that frontier model ratings are "artificially low" because of tooling bugs is itself useful data: raw capability ≠ practical performance under real constraints. Curious whether you're seeing consistent skill gaps between models in specific phases (opening mulligan decisions vs. late-game combat math), or if the rankings are uniform across game stages.
chc4 today at 5:58 PM
It's really funny reading the thought processes, where most of the time the agent doesn't actually remember trivial things about the cards they or their opponent are playing (thinking they have different mana costs or different effects, or mixing up their effect with another card's). The fact they're able to take game actions and win against other agents is cute, but it doesn't inspire much confidence.

The agents also constantly seem to evaluate whether they're "behind" or "ahead" based on board state, which is a weird way of thinking about most games and often hard to evaluate, especially for decks like control, which care more about resources like mana and card advantage and always plan on stabilizing late game.

benbayard today at 6:25 PM
I was working on a similar project. I wanted a way to goldfish my decks against many kinds of decks in a pod. It would never be perfect, but enough to get an idea of:

1. How many turns did it take on average to hit 2, 3, 4, 5, 6 mana?
2. How many threats did I remove?
3. How often did I not have enough card draw to keep my hand full?

I don't think there's a perfect way to do this, but I think trying to play 100 games with a deck and getting basic info like this would be super valuable.
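For what it's worth, the bookkeeping for this is simple once some engine can play the games out. A rough sketch follows; the `simulate_game` hook and the per-turn fields are placeholders for whatever an engine would actually expose.

```python
# Rough sketch of goldfish-style bookkeeping over many simulated games.
# simulate_game() is a placeholder for whatever engine actually plays a game;
# it's assumed to yield one dict per turn with a few simple fields.
from statistics import mean


def collect_stats(simulate_game, n_games: int = 100) -> dict:
    turns_to_four_mana, threats_removed, empty_hand_turns = [], [], []

    for _ in range(n_games):
        removed = 0
        empty = 0
        hit_four = None
        # example turn dict: {"turn": 3, "mana": 3, "threats_removed": 1, "cards_in_hand": 4}
        for turn in simulate_game():
            if hit_four is None and turn["mana"] >= 4:
                hit_four = turn["turn"]
            removed += turn.get("threats_removed", 0)
            if turn.get("cards_in_hand", 7) == 0:
                empty += 1
        if hit_four is not None:
            turns_to_four_mana.append(hit_four)
        threats_removed.append(removed)
        empty_hand_turns.append(empty)

    return {
        "avg_turn_to_hit_4_mana": mean(turns_to_four_mana) if turns_to_four_mana else None,
        "avg_threats_removed_per_game": mean(threats_removed),
        "avg_turns_with_empty_hand": mean(empty_hand_turns),
    }
```

Run over 100 games, even something this crude would answer the three questions above.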

qsort today at 5:54 PM
This is a fantastic idea. I used to play MtG competitively, and a strong artificial opponent is something I'd have loved.

The issue I see is that you'd need a huge number of games to tell who's better (you need that between humans too; the game is very high variance).

Another problem is that giving a positional evaluation to count mistakes is hard because MtG, in addition to having randomness, has private information. It could be rational for both players to believe they're currently winning even if they're both perfect Bayesians. You'd need something that approximates "this is the probability of winning the game from this position, given all the information I have," which is almost certainly asymmetric and much more complicated than the equivalent for a game with randomness but no private information, such as backgammon.
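To put a rough number on the variance point: the standard sample-size formula for testing a win rate against a 50% baseline already lands in the hundreds of games per matchup. Nothing below is specific to this project; it's just the usual normal-approximation calculation with conventional 95% confidence / 80% power defaults.

```python
# Back-of-envelope sample size for detecting a win-rate edge over a 50% baseline,
# using the usual normal-approximation formula with 95% confidence / 80% power.
from math import ceil, sqrt


def games_needed(true_winrate: float, z_alpha: float = 1.96, z_power: float = 0.84) -> int:
    p0, p1 = 0.5, true_winrate
    numer = z_alpha * sqrt(p0 * (1 - p0)) + z_power * sqrt(p1 * (1 - p1))
    return ceil((numer / abs(p1 - p0)) ** 2)


print(games_needed(0.55))  # ~780 games to reliably detect a 55% true win rate
print(games_needed(0.60))  # ~195 games for a 60% true win rate
```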

portly today at 6:20 PM
With the direction MtG is currently heading, I kind of want to break out and just play some community-made in-universe sets on a FOSS client. How nice would it be to just play the game in its original spirit.
Imnimo today at 9:43 PM
Apparently Haiku is a very anxious model.

>The anxiety creeps in: What if they have removal? Should I really commit this early?

>However, anxiety kicks in: What if they have instant-speed removal or a combat trick?

It's also interesting that it doesn't seem to be able to understand why things are happening. It attacks with Gran-Gran (attacking taps the creature), which says, "Whenever Gran-Gran becomes tapped, draw a card, then discard a card." Its next thought is:

>Interesting — there's an "Ability" on the stack asking me to select a card to discard. This must be from one of the opponent's cards. Looking at their graveyard, they played Spider-Sense and Abandon Attachments. The Ability might be from something else or a triggered ability.

jedberg today at 10:30 PM
The most interesting thing here to me is the leaderboard, because they actually included the estimated price per game. Gemini gets the highest score with a fairly reasonable cost (about 1/3 of the way down).
oflannabhra today at 5:54 PM
This is really cool! I really liked the architecture explanation.

Once you get solid rankings for the different LLMs, I think a huge feature of a system like this would be to allow LLMs to pilot user decks to evaluate changes to the deck.

I'm guessing the costs of that would be pretty big, but if decent piloting is ever enabled by the cheaper models, it could be a huge change to how users evaluate their deck construction.

Especially for formats like Commander where cooperation and coordination amongst players can't be evaluated through pure simulation, and the singleton nature makes specific card changes very difficult to evaluate as testing requires many, many games.

yomismoaqui today at 6:15 PM
I was curious if there is something equivalent to AlphaGo but for MTG.

From the little I have seen they are different beasts (hidden information, number and complexity of rules...).

PS: Does this count as nerdsniping?

HanClinto today at 8:10 PM
I've wondered about such things, and it feels like the 17 Lands dataset might be a good place to scrape play-by-play game data between human players. Feels like it could be adapted to a format usable by this structure, and used as a fine-tuning dataset.
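If such play-by-play records could be reduced to (visible state, chosen action) pairs, the conversion into a chat-style fine-tuning set would be mechanical. A sketch of that shape is below; the record field names are invented for illustration, and the actual 17 Lands data layout is not assumed here.

```python
# Hypothetical sketch: converting (visible state, chosen action) records into a
# chat-style fine-tuning JSONL file. The input field names are invented for
# illustration; the actual 17 Lands data layout is not assumed here.
import json


def to_finetune_example(record: dict) -> dict:
    state = record["visible_state"]    # hypothetical field: serialized board/hand state
    action = record["chosen_action"]   # hypothetical field: the human player's play
    return {
        "messages": [
            {"role": "system", "content": "You are playing Magic: The Gathering."},
            {"role": "user", "content": f"Game state:\n{state}\nWhat is your play?"},
            {"role": "assistant", "content": action},
        ]
    }


def write_jsonl(records: list[dict], path: str) -> None:
    with open(path, "w") as f:
        for rec in records:
            f.write(json.dumps(to_finetune_example(rec)) + "\n")
```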
hansy today at 6:37 PM
Insanely cool. I'm in the midst of building a web tabletop for Magic [1] that really only my friends and I use, but I'm wondering if there's a way I can contribute our game data to you (would that be helpful?).

[1] https://github.com/hansy/drawspell

kenforthewin today at 5:51 PM
Nice work. I think games are a great way to benchmark AI, especially games that involve long term strategy. I recently built an agent harness for NetHack - https://glyphbox.app/ - like you I suspect that there's a lot you can do at the harness / tool level to improve performance with existing models.
ramoz today at 6:51 PM
Something like this is how memory systems (context window hacks) should be evaluated. E.g. choose a format like Standard that continuously evolves with a shifting meta - presumably the best harness would be good at recognizing patterns and retrieving them in an efficient way.
tobadzistsini today at 7:20 PM
Did the LLMs form a polycule?
butlike today at 6:28 PM
I don't mean to come across as OVERLY negative (just a little negative), but what's the difference between all these toy approaches and applications of LLMs? You've seen one LLM play a game against another LLM, you've seen them all.
spelunker today at 6:00 PM
This is neat! What kind of steering or context did you provide to the LLMs? Super basic like "You are playing a card game called Magic: The Gathering", or more complex?
ddtaylor today at 6:45 PM
This is interesting. I will be contributing on GitHub, as this is a place where my knowledge and experience intersect, and I enjoy doing open source work.

This is also something I think the MTG community needs in many ways. I have been a relatively happy XMage user, although it has a bit of a way to go, and before that I was using GCCG, which was great too!

The MTG community overall could benefit a lot from the game having a more entertaining competitive landscape, which has grown stale in many ways; since the Hasbro acquisition, Wizards has done little besides shitting out product after product, too fast and with poor balance.

I have to imagine that Wizards is already running simulations, but they obviously aren't working well, or they are choosing to disregard them. Hopefully, if they are just bad at running simulations, something like this can make it easier for them, and if not, it will at least improve the community's response time.

aethrum today at 5:48 PM
I love Magic. Can these do politics or is it just board state?
jamilton today at 5:55 PM
Cool. How’d you pick decks?
steveBK123 today at 5:48 PM
Why are all these Show HN posts overloaded with "I taught AI how to do things I used to do for entertainment"?

Can we automate the unpleasantries in life instead of the pleasures?