Adam Niederer

Is GLM-4's Long Context Performance Enough? An Undereducated Investigation

Recently, the GLM-4-0414 series of language models was released to the public with quite a bit of fanfare on Reddit. Many people have praised its ability to "one-shot" demo-style scripts like those shown off in those threads. Another source of praise was its memory efficiency, allowing users to run higher quants or longer contexts than they would be able to with other models of similar size.

One interesting change to this series's architecture over its predecessor's is that it uses Grouped Query Attention (GQA). GQA is a compromise between the memory-efficient yet low-quality Multi-Query Attention (MQA), where all attention heads share one set of keys and values, and the traditional and full-quality Multi-Head Attention (MHA), where all attention heads have their own set. What does "quality" mean, here? Honestly, dunno, but this guy seems to.
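
To make that spectrum concrete, here's a tiny sketch - my own illustration, not anything from these models' codebases - of how the ratio of query heads to KV heads determines whether a configuration counts as MHA, GQA, or MQA:

# Illustrative only: classify an attention config by how many query heads
# share each key/value head. n_heads and n_kv_heads come from a model's config.
def describe_attention(n_heads: int, n_kv_heads: int) -> str:
    assert n_heads % n_kv_heads == 0, "query heads must divide evenly among KV heads"
    if n_kv_heads == n_heads:
        kind = "MHA"  # every query head gets its own K/V pair
    elif n_kv_heads == 1:
        kind = "MQA"  # every query head shares a single K/V pair
    else:
        kind = "GQA"  # groups of query heads share each K/V pair
    group = n_heads // n_kv_heads
    return f"{kind}: {n_heads} query heads, {n_kv_heads} KV heads ({group} query heads per KV head)"

print(describe_attention(32, 32))  # classic MHA
print(describe_attention(32, 8))   # Llama 3 8b / Qwen 3 8b style GQA
print(describe_attention(32, 2))   # GLM-4-0414 9b
print(describe_attention(32, 1))   # MQA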

Most models released today use GQA, but one thing I noticed is the ratio of attention heads to KV heads on this series seems to be an outlier among its peers: 32:2 and 48:2 for the 9b and 32b models respectively. For comparison, here are the head counts for it and some other common models:

Model            Attention Heads   KV Heads
GLM-4-0414 9b    32                2
Llama 3 8b       32                8
Qwen 3 8b        32                8
GLM-4-0414 32b   48                2
Qwen 3 32b       64                8
Gemma 3 27b      32                16

Because the size of the KV cache is 2 * # Layers * # KV Heads * Head Dimension * Bytes/Value per token of context, and the other values are similar to those of peer models, this comparative fourfold reduction in KV heads significantly lowers the memory footprint of the KV cache, allowing for a longer context in the same amount of memory. However, it stands to reason that this series of models is sacrificing some performance in order to achieve this reduction in KV heads compared to models with similar parameter counts. I'm not a machine learning researcher, so this is just a hunch, but I was curious whether this architectural change would specifically affect its ability to handle complex questions and long context lengths.
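
As a back-of-the-envelope check on that formula, here's a small sketch. The layer counts and head dimension below are my reading of the published configs (roughly 61 layers for GLM-4-0414 32b and 64 for Qwen 3 32b, a head dimension of 128, and 2-byte FP16/BF16 cache entries), so treat them as assumptions rather than verified figures:

# Rough KV cache sizing: 2 (keys and values) * layers * KV heads * head dim
# * bytes/value gives the per-token footprint; multiply by context length.
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_value: int = 2) -> int:
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return per_token * context_len

assumed_configs = {  # (layers, KV heads) - assumptions from my reading of the configs
    "GLM-4-0414 32b": (61, 2),
    "Qwen 3 32b": (64, 8),
}
for name, (layers, kv_heads) in assumed_configs.items():
    gib = kv_cache_bytes(layers, kv_heads, head_dim=128, context_len=32_768) / 2**30
    print(f"{name}: ~{gib:.1f} GiB of KV cache at 32k context")
# Under these assumptions: ~1.9 GiB for GLM-4 32b vs ~8.0 GiB for Qwen 3 32b.

That's roughly the fourfold gap you'd expect from the KV head counts alone, give or take the difference in layer count.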

My previous post on LLM benchmarking tests models with twelve difficult frontend questions, and in it GLM-4 was an outlier, getting zero out of twelve questions correct. Of course, this was against many non-reasoning peer models that scored a one or a two, so it's not outside the margin of error, but I found it interesting and worth investigating further.

NoLiMa

To test this, I decided to try it on my favorite benchmark ever, the one that I always have in my back pocket when an AI Bro comes around and tells me how language models are going to be able to grok an entire codebase and replace staff-level programmers: NoLiMa. In short, this benchmark aims to measure an LLM's ability to perform simple reasoning tasks over a long context. The prompt of the benchmark is roughly like this: a single "needle" fact, such as "Actually, Yuki lives next to the Semper Opera House," is buried in a long haystack of unrelated book snippets, and the model is then asked a question like "Which character has been to Dresden?"

This benchmark works around an LLM's strong performance at verbatim retrieval by introducing a logical hop in the question - one that an LLM should be able to solve very easily - but one that gets confounded by the mountain of irrelevancy that it needs to sort through.
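
Here's a toy sketch of how such a prompt is put together - my own paraphrase of the setup, not the NoLiMa harness, with made-up filler text and a needle/question pair loosely following the paper's flagship example:

# Illustrative only: a NoLiMa-style prompt. Answering requires knowing that the
# Semper Opera House is in Dresden - one associative hop, no literal keyword overlap.
filler = ["An unrelated snippet about other characters doing other things."] * 2000
needle = "Actually, Yuki lives next to the Semper Opera House."
question = "Which character has been to Dresden?"

haystack = list(filler)
haystack.insert(len(haystack) // 2, needle)  # plant the needle mid-context

prompt = "\n".join(haystack) + f"\n\nQuestion: {question}\nAnswer with the character's name."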

The findings of the NoLiMa paper are roughly as follows: models answer these needle questions nearly perfectly at short contexts, but performance degrades sharply as the context grows, with most of the tested models dropping below 50% of their short-context baseline by 32k tokens.

Methodology

These tests were presumably performed with the backing of somebody who has money to spend on GPU hours, as the full benchmark sweeps over a large grid of needle/question pairs, context lengths, and needle placements.

Multiply that together and you get ~33k queries per model. I simply don't have the time or money for that, so I'm cutting the benchmark down to a small subset of those questions, context lengths, and placements.

That's less than 2% of the data points, but it's the best I can do. First, I decided to do a bit of replication work to validate that my modifications hadn't majorly changed anything. My results for Gemma 3 4b, alongside theirs, are as follows:

Model Base 1k 2k 4k 8k 16k 32k
Gemma 3 4b (theirs) 73.6 50.3 35.3 16.4 7.5 2.3 0.9
Gemma 3 4b (mine) 70 39 4 4

They're using more questions, which are purportedly less hard on average, but given the error bars one introduces by doing only 2% of the testing, I'm feeling good about this replication.

Results

Benchmarking GLM-4 32b vs Qwen 3 32b

To start, I benchmarked GLM-4-0414 32b against Qwen 3 32b, a contemporary model of similar parameter count, with thinking turned off to match. Each model's performance at the tested context lengths, expressed as a ratio of long-context performance to base performance, is as follows:

Model        8k    16k   32k
GLM-4 32b    4%    0%    lol
Qwen 3 32b   52%   22%   13%

I didn't bother to benchmark GLM-4 at 32k, partially because the benchmark always overshoots the nominal token count of each prompt, which causes issues when running up against the model's maximum context length, and partially because I don't believe there's much hope of the score being much greater than zero. Although Qwen 3 in non-thinking mode also degrades substantially over a long context, GLM-4 bottoms out much earlier and much lower, achieving only 4% of its base performance at 8k context, compared to Qwen 3's 52%.

Benchmarking GLM-Z1 32b vs Qwen 3 32b

With these results in hand, I thought it reasonable to test GLM-Z1-0414 against Qwen 3 with reasoning turned on, as chain-of-thought prompting was found to fare somewhat better on NoLiMa. The performance ratios are as follows:

Model        8k    16k   32k
GLM-Z1 32b   52%   31%
Qwen 3 32b   85%   79%   45%

Although there is a substantial uplift for the reasoning models, GLM-Z1 still trails Qwen 3 significantly, often exhausting its token budget before producing an answer, or falling into reasoning loops. This, of course, raises the question: is this performance deficit because of the GLM-4 series's architecture, or is it because Qwen 3 is simply a better model due to its training data and techniques? Qwen 3 does generally advertise higher benchmark scores than GLM-Z1, after all. Or perhaps Qwen 3 32b specifically does unusually well, or GLM-4-0414 32b specifically does unusually poorly.

Benchmarking GLM-4 9b vs Qwen 3

To interrogate this, I benchmarked the smaller models in both series. The Qwen 3 series also has 8b, 4b, 1.7b, and 0.6b models with varying numbers of attention heads but the same number of KV heads. Here are the non-reasoning models:

Model         8k     16k
GLM-4 9b      0%     3%
Qwen 3 8b     8.5%   3%
Qwen 3 4b     22%    6%
Qwen 3 1.7b   15%    8%
Qwen 3 0.6b   0%     0%

There are a few false positives in the Qwen 3 4b results, which may artificially bump its stated performance a bit. I did do a bit of data cleaning to mitigate this, counting every response containing the word "none" or "cannot" as incorrect. I can't guarantee that this catches all of them, but it should be in the ballpark. Here's an example of one such false positive:

{
  "selected_character": "Diana",
  "correct": 1,
  "response": "The character who has been to South Africa is not explicitly mentioned in the provided book snippet. However, there is a mention of a character named Diana, an engineer living in Calvinia, but there is no information indicating she has been to South Africa. Therefore, based on the information provided, we cannot determine which character has been to South Africa."
}

The metric used for determining correctness is simply whether the response contains the name of the correct character, so accidents like this will crop up. I'm using the original prompts from the NoLiMa study with almost zero modification (only adding a /no_think or /think tag when necessary), but better prompting and use of JSON Schema output might reduce the likelihood of these errors - two improvements that I believe would also have benefited the original study.
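
For what it's worth, here's a minimal sketch of that scoring-plus-cleaning heuristic - a reconstruction of what my harness does, not code from the NoLiMa repo, with illustrative strings:

# A response counts as correct if it contains the expected character's name,
# unless it contains "none" or "cannot", which the cleaning pass treats as a refusal.
def is_correct(response: str, expected_name: str) -> bool:
    text = response.lower()
    if "none" in text or "cannot" in text:
        return False
    return expected_name.lower() in text

# The Diana response above trips the "cannot" filter and gets rescored as incorrect:
diana = ("However, there is a mention of a character named Diana... we cannot "
         "determine which character has been to South Africa.")
print(is_correct(diana, "Diana"))  # False with cleaning; True on the raw substring check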

Anyway, these numbers are all very close to zero, possibly biased, and not particularly indicative of anything beyond GLM-4 9b's long-context performance appearing to be similar to its larger cousin's, so let's see if adding reasoning can give us better data.

Benchmarking GLM-Z1 9b vs Qwen 3

Model         8k    16k
GLM-Z1 9b     25%   13%
Qwen 3 8b     55%   36%
Qwen 3 4b     35%   20%
Qwen 3 1.7b   15%   8%
Qwen 3 0.6b   30%   22%

Qwen 3 4b does not suffer from the same issues here as it did above. The big takeaway from this table is that a 9b model is being handily beaten at 8k and 16k context by a model less than half its size, and is nearly tying a model with around 6% of its parameter count. I dunno what's going on with 0.6b doing so well and 1.7b doing so poorly, but the results for both look legitimate.

This, however, does not entirely isolate architecture as a reason for GLM-Z1's rapid performance degradation on this benchmark: What if the Qwen 3 series is simply trained in a way that benefits its long-context performance? To further interrogate this, I decided to give one more model series a look.

Benchmarking the Gemma 3 Series

To finalize this investigation, I decided to look at Gemma 3. Unlike Qwen, it reduces the number of KV heads alongside the parameter count, yielding the following stats:

Model         Attention Heads   KV Heads
Gemma 3 27b   32                16
Gemma 3 12b   16                8
Gemma 3 4b    8                 4
Gemma 3 1b    4                 1

For these, I have shamelessly stolen the values achieved by the NoLiMa team in their repository, as I simply don't have the GPU horsepower to run Gemma 3 27b at a near-lossless quant locally, and the API providers I tried yielded extremely low performance across all model sizes compared to local runs and the NoLiMa repository. Keep in mind that their testing methodology is much different from, and much better than, mine, so these numbers are not comparable to anything above. Here are the performance ratios derived from their numbers:

Model         8k      16k     32k
Gemma 3 27b   36.9%   22.8%   10.7%
Gemma 3 12b   31.4%   19.2%   8.4%
Gemma 3 4b    10.2%   3.1%    1.2%

Finally, at the risk of putting incomparable numbers together, let's take a look at the ratio of performance ratios between small models and large models in the same series (a quick sketch of how these are computed follows the table):

Model Series             8k    16k   32k
Gemma 4b:27b             28%   14%   11%
Qwen 3 Thinking 4b:32b   41%   25%
Qwen 3 Instruct 4b:32b   42%   22%
GLM Z1 9b:32b            48%   41%
GLM 4 9b:32b             lol   lol
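
For transparency, here's roughly how the rows above are computed. Since every per-model table above is already expressed as a fraction of base performance, the ratio of ratios is just the small model's figure divided by the large model's at each context length - the Gemma numbers below are transcribed from the table two sections up:

# Ratio of performance ratios: (small model at length L / its base) divided by
# (large model at length L / its base). Both inputs are already % of base.
gemma_4b = {"8k": 10.2, "16k": 3.1, "32k": 1.2}
gemma_27b = {"8k": 36.9, "16k": 22.8, "32k": 10.7}
for length in ("8k", "16k", "32k"):
    ratio = 100 * gemma_4b[length] / gemma_27b[length]
    print(f"Gemma 4b:27b @ {length}: {ratio:.0f}%")  # prints 28%, 14%, 11%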

Unfortunately, I think there are too many confounding variables to draw a conclusion from this. Although the NoLiMa team's benchmarking of Gemma does appear to show faster degradation as size decreases compared to my benchmarks of Qwen 3, I don't believe that there are correlations in this data that one could confidently assign to architectural differences without more data or expertise.

Conclusions

So after all of that, what have I concretely demonstrated? I have shown that GLM-4 and GLM-Z1 appear to perform poorly across a long context on a subset of one benchmark compared to a peer model. However, I don't have enough resources or data points to show useful correlations between architectural features and this kind of performance characteristic, and I don't have the mathematical background to determine causality either.

So, is there anything you can take away from this, intrepid user of language models? Well, it appears that GLM-4-0414's performance on this benchmark may call into question one of the reasons to specifically choose this model: its memory efficiency over long context lengths. However, I will declare that more rigorous research is needed to validate this finding, and possibly explain why. Preferably by somebody with more than just a math minor and a last-gen consumer GPU.

Once again, this is an undereducated look at topics that I don't entirely understand, with some questionable data and methodology. So please take these results with a grain of salt. And don't trade on it, for god's sake.

Citations

I don't have a LaTeX renderer on this site. Hopefully this will do.

@misc{modarressi2025nolimalongcontextevaluationliteral,
  title         = {NoLiMa: Long-Context Evaluation Beyond Literal Matching},
  author        = {Ali Modarressi and Hanieh Deilamsalehy and Franck Dernoncourt and Trung Bui and Ryan A. Rossi and Seunghyun Yoon and Hinrich Schütze},
  year          = {2025},
  eprint        = {2502.05167},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url           = {https://arxiv.org/abs/2502.05167}
}

@article{syhya2025gqa,
  title   = {Attention Mechanisms in Transformers: Comparing MHA, MQA, and GQA},
  author  = {Yue Shui},
  journal = {syhya.github.io},
  year    = {2025},
  month   = {Jan},
  url     = {https://syhya.github.io/posts/2025-01-16-group-query-attention/}
}