GLM-5.1: Towards Long-Horizon Tasks

617 points - last Tuesday at 4:32 PM


gertlabs last Wednesday at 3:37 AM
We're still adding samples, but some early takeaways from benchmarking on https://gertlabs.com:

Contrary to the model card, its one-shot performance is more impressive than its agentic abilities. On both metrics, GLM 5.1 is competitive with frontier models.

But keeping in mind this is an open source model operating near the frontier, it's nothing short of incredible.

I suspect 2 issues with the model are keeping it from fully realizing its potential in agentic harnesses:

- Context rot (already a common complaint). We are still working on a metric to robustly test and visualize this on the site.
- The model was most likely overtrained on standardized toolsets and benchmarks, and isn't as adaptive in using arbitrary tooling in our custom harness simulations. We've decided to commit to measuring intelligence as the ability to use custom, changing tools, rather than the ability to use the specific tools a model was trained on (while still always providing a way to run local bash and other common tools). There are arguments to be made for either, but the former is more indicative of general intelligence. Regardless, it's a subtle difference and GLM 5.1 still performs well with tooling in our environments.
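To make the distinction concrete, here is a toy sketch of what "custom, changing tools" means in practice (the tool names and schemas are made up for illustration, not our actual harness): the same underlying capability is exposed under two different schemas, and the eval checks whether the model's tool calls match whichever schema is active, rather than the one it may have memorized in training.

```python
# Toy sketch: one capability ("read a file"), two different tool schemas.
# A harness can swap schemas between runs to test adaptability rather
# than memorization of a standardized toolset.

TOOLSET_A = {
    "read_file": {"params": ["path"]},
}
TOOLSET_B = {
    "fetch_document": {"params": ["location", "encoding"]},
}

def validate_call(toolset: dict, call: dict) -> bool:
    """Accept a model's tool call only if it fits the active schema."""
    spec = toolset.get(call.get("name"))
    if spec is None:
        return False  # unknown tool name for this run
    # every argument the model passed must exist in the active schema
    return set(call.get("args", {})) <= set(spec["params"])

# A call shaped for TOOLSET_A passes there but fails under TOOLSET_B:
call = {"name": "read_file", "args": {"path": "notes.txt"}}
print(validate_call(TOOLSET_A, call))  # True
print(validate_call(TOOLSET_B, call))  # False
```

A model that only ever saw `read_file`-style schemas in training will keep emitting them even when the harness advertises `fetch_document`, which is exactly the failure mode we're trying to measure.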

Crazy week for open source AI. Gemma 4 has shown that large model density is nowhere near optimized. Moats are shrinking.

If there are more representations of model performance you'd like to see, I'm actively reading your feedback and ideas.

simonw last Tuesday at 9:25 PM
Not only did this one draw me an excellent pelican... it also animated it! https://simonwillison.net/2026/Apr/7/glm-51/
Yukonv last Tuesday at 5:06 PM
Unsloth quantizations are available on release as well. [0] At 754B parameters, even the IQ4_XS quant is a massive 361 GB. This is definitely a model your average local LLM enthusiast is not going to be able to run, even with high-end hardware.

[0] https://huggingface.co/unsloth/GLM-5.1-GGUF

alex7o last Tuesday at 5:25 PM
To be honest I am a bit sad, as GLM 5.1 is producing much better TypeScript than Opus or Codex imo, but no matter what, it does sometimes go into schizo mode at some point over longer contexts. Not always though; I have had multiple sessions go over 200k and be fine.
dvt last Tuesday at 11:35 PM
Every single day, three things are becoming more and more clear:

    (1) OpenAI & Anthropic are absolutely cooked; it's obvious they have no moat
    (2) Local/private inference is the future of AI
    (3) There's *still* no killer product yet (so get to work!)
johnfn last Tuesday at 6:16 PM
GLM-5.0 is the real deal as far as open source models go. In our internal benchmarks it consistently outperforms other open source models, and was on par with things like GPT-5.2. Note that we don't use it for coding - we use it for more fuzzy tasks.
minimaxir last Tuesday at 6:26 PM
The focus on the speed of the agent generated code as a measure of model quality is unusual and interesting. I've been focusing on intentionally benchmaxxing agentic projects (e.g. "create benchmarks, get a baseline, then make the benchmarks 1.4x faster or better without cheating the benchmarks or causing any regression in output quality") and Opus 4.6 does it very well: in Rust, it can find enough low-level optimizations to make already-fast Rust code up to 6x faster while still passing all tests.

It's a fun way to quantify real-world performance differences between models, and it's more practical and actionable than most benchmarks.
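A toy version of that loop, with a small Python function standing in for the real (e.g. Rust) code under optimization; names and the 1.4x bar are illustrative. The key rule is that the "optimized" version is only accepted if it both matches the baseline's output exactly and clears the speedup target:

```python
# Toy benchmark-baseline loop: record a baseline, then accept an
# "optimized" version only if it is faster AND produces identical
# output (no cheating the benchmark, no output regression).
import time

def baseline(n):
    # naive: sum of squares via a Python loop
    total = 0
    for i in range(n):
        total += i * i
    return total

def optimized(n):
    # closed form for sum of i^2 for i in [0, n): (n-1)n(2n-1)/6
    return (n - 1) * n * (2 * n - 1) // 6

def bench(fn, n, reps=5):
    # best-of-reps wall-clock time, to reduce noise
    best = float("inf")
    for _ in range(reps):
        t0 = time.perf_counter()
        fn(n)
        best = min(best, time.perf_counter() - t0)
    return best

N = 200_000
assert optimized(N) == baseline(N)  # identical output, no regression
speedup = bench(baseline, N) / bench(optimized, N)
print(f"speedup: {speedup:.1f}x")   # must clear the agreed bar (e.g. 1.4x)
```

In a real agentic run the agent writes the benchmark, records the baseline numbers, then iterates on the implementation until both the correctness assertion and the speedup threshold pass.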

winterqt last Tuesday at 6:13 PM
Commenters here seem to be talking like they've used this model for longer than a few hours -- is that true, or are y'all just sharing your initial thoughts?
kamranjon last Tuesday at 7:42 PM
I'm crossing my fingers they release a flash version of this. GLM 4.7 Flash is the main model I use locally for agentic coding work, it's pretty incredible. Didn't find anything in the release about it - but hoping it's on the horizon.
XCSme last Wednesday at 12:10 AM
GLM 5.1 does worse than GLM 5 in my tests[0] (with both medium reasoning and no reasoning).

I think the model is now tuned more towards agentic use/coding than general intelligence.

[0]: https://aibenchy.com/compare/z-ai-glm-5-medium/z-ai-glm-5-1-...

DeathArrow last Tuesday at 6:08 PM
I am already subscribed to their GLM Coding Pro monthly plan and working with GLM 5.1 coupled with Open Code is such a pleasure! I will cancel my Cursor subscription.
clark1013 last Wednesday at 8:30 AM
I’ve been using GLM 5.1 instead of GPT 5.4 for a few days now, and it’s working smoothly.
RickHull last Tuesday at 5:17 PM
I am on their "Coding Lite" plan, which I got a lot of use out of for a few months, but it has been seriously gimped now. Obvious quantization issues: going in circles, flipping from X to !X, injecting Chinese characters. It is useless now for any serious coding work.
kirby88 last Tuesday at 5:58 PM
I wonder how that compares to harness methods like MAKER: https://www.cognizant.com/us/en/ai-lab/blog/maker
gavinray last Tuesday at 6:14 PM
I find the "8 hour Linux Desktop" bit disingenuous, in the fine print it's a browser page:

  > "build a Linux-style desktop environment as a web application"
They claim "50 applications from scratch", but "Browser" and a bunch of the other apps are likely all <iframe> elements.

We all know that building a spec-compliant browser alone is a herculean task.

epolanski last Tuesday at 8:09 PM
I was very satisfied with GLM5, I'm not gonna lie.

Excited to test this.

mark_l_watson last Tuesday at 8:19 PM
I can’t wait to try it. I set up a new system this morning with OpenClaw and GLM-5, and I like GLM-5 as the backend for Claude Code. Excellent results.
blazespin last Tuesday at 10:00 PM
Anthropic's reply? A model you can't use.
philipwhiuk last Wednesday at 12:05 AM
This is the flip side of the Project Glasswing stuff...

Everyone else isn't that far behind and they aren't all gonna just wall off their new model.

A reason that Anthropic will eventually give is "the competition can do what Glasswing can do, so what's the point of limiting it".

bigyabai last Tuesday at 5:02 PM
It's an okay model. My biggest issue using GLM 5.1 in OpenCode is that it loses coherency over longer contexts. When you crest 128k tokens, there's a high chance that the model will start spouting gibberish until you compact the history.

For short-term bugfixing and tweaks though, it does about what I'd expect from Sonnet for a pretty low price.

8dazo last Wednesday at 12:30 AM
Just saw the Claude Mythos post. Not sure when it's going public, but this feels like a real jump, not just incremental progress. Also waiting for the next GLM release, because the specs are looking kind of insane.
dryarzeg last Tuesday at 10:08 PM
A bit off-topic, but even though I don't use LLMs for my job, my hobbies, or daily life very often (and when I do, it's mostly some kind of "rubber duck brainstorm"), whenever I see open-weight releases like this one or the recent Gemma 4 (which is very good for a local model), there's always one song that comes to mind, and I simply can't get rid of it no matter how hard I try. The first time was with DeepSeek-R1, which, despite being blamed for "censorship", was heavily censored only via the DeepSeek API; the local model (the full-weight 685B, not the distilled ones) was pretty much unhinged regarding censorship on any topic.

"I am the storm that is approaching, provoking..." : )

bdeol22 last Wednesday at 9:14 AM
Long-horizon demos are fun; the real product test is still interrupted real life: can it pick up three days later without you re-teaching context?
tgtweak last Tuesday at 6:44 PM
Share the harness for that browser linux OS task :)
jaggs last Tuesday at 6:11 PM
How does it compare to Kimi 2.5 or Qwen 3.6 Plus?
maxdo last Tuesday at 7:28 PM
One of the benchmaxxed models. Every time I tried it, it was not on par even with other open source models.
EITB_2026 last Wednesday at 4:57 AM
Good one, though.
Ms-J last Wednesday at 5:01 AM
Z.ai and their GLM models are pretty low quality.

I've been testing it for a while now, since it seemed to have potential as a local model.

With this new update it still cannot parse simple test PDFs correctly. It inconsistently tells me that the value in the name field of the document is incorrect, or has the name reversed to put the last name first. Or that a date is wrong because it's in the past/future, when it is not. Tons of fundamental errors like that.

Even when looking at the thinking process there are issues:

I used a test website for it to analyze and it says that the sites copyright year states 2026 which is in the future and to investigate as it could be an attack, but right after prints today's correct date.

I'm in the process of trying to get it uncensored. Hopefully that will create some use out of z.ai

Edit: by the way, which is the best uncensored model at the moment?

dang last Tuesday at 4:55 PM
[stub for offtopicness]

[you guys, please don't post like this to HN - it will just irritate the community and get you flamed]
