A Small Frontend Benchmark for LLMs
Perhaps one of the most over-saturated fields of LLM knowledge is frontend development, because a lot of money lies in being able to accelerate writing the most common code in the world. Everything that isn't used by the Old Wizards, bankers, or the deep recesses of government has an HTML interface nowadays, and gorillions of man-hours are being poured into keeping them largely bug-free and sort-of performant every year.
That said, I'm pretty good at frontend development, and I wanted to see if these new-fangled robots could match the knowledge I've gained from nine years in the mines. So, I set out to do some quick benchmarking of a large subset of the leading LLMs - both proprietary and open-weight.
Benchmark Questions
For this benchmark, I devised twelve scored questions (plus an unscored warm-up, Question 0), each a five-answer multiple choice in the mold of a college exam, and each showcasing some edge case or quirk of our beloved web standards that has seared itself into my brain over the past while. The topic of each question is as follows:
- 0. A basic retrieval question to validate the format; not included in scoring
- 1. Proxies and inheritance
- 2. Default stacking orders
- 3. Containing blocks
- 4. How containing blocks influence elements' coordinate systems
- 5. CSS inheritance's interaction with Shadow DOMs and slots
- 6. Changing the mode of a shadow root
- 7. Property-attribute reflection
- 8. Composed events, bubbling, and the event capture phase
- 9. The separation of attributes and props
- 10. Stacking contexts
- 11. Element connection and disconnection
- 12. Declarative Shadow DOMs
There's a strong tilt towards web standards as they relate to custom elements, Shadow DOMs, and layout, as that's primarily what I've been working in and know the best. The questions range from the hard side of easy, something you'd expect a good midlevel to know, to very challenging - something a good senior/staff who's worked in the field for a while might forget if they haven't had their coffee yet.
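To give a flavor of the kind of quirk involved (this is just an illustration, not one of the actual benchmark questions), consider property-attribute reflection, the topic of Question 7: some DOM properties reflect to their attribute and some don't.

```ts
// Illustration only: property-attribute reflection is not symmetric.
const input = document.createElement("input");

// `id` is a reflected property: setting the property rewrites the attribute.
input.id = "name-field";
console.log(input.getAttribute("id")); // "name-field"

// `value` is not reflected back: the attribute only seeds the initial value.
input.setAttribute("value", "initial");
input.value = "typed by the user";
console.log(input.getAttribute("value")); // still "initial"
console.log(input.value);                 // "typed by the user"
```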
Methodology
My benchmarking methodology was as follows: I used OpenRouter's free tier for every open-weight model that I could, and then dropped a little under ten bucks to get the major API-only models, except for o3, because that needs a separate API key for some reason. I asked the same question three times, and chose the worst of the three answers - that is, if it didn't answer correctly three times in a row, I marked it as wrong. I feel that this is a fair scoring mechanism, because it reduces the possibility of randomly guessing a correct answer, and if an LLM hallucinates 33% of the time I probably wouldn't want to use it very much anyway.
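For the curious, here's a minimal sketch of that worst-of-three loop. It isn't the exact script I used; it assumes OpenRouter's OpenAI-compatible chat completions endpoint, an `OPENROUTER_API_KEY` environment variable, and a hypothetical `extractAnswer()` helper that pulls the chosen letter out of a completion.

```ts
// Minimal sketch of the worst-of-three scoring loop; not the exact scripts I used.
const API_URL = "https://openrouter.ai/api/v1/chat/completions";

async function askOnce(model: string, question: string): Promise<string> {
  const res = await fetch(API_URL, {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model,
      messages: [{ role: "user", content: question }],
    }),
  });
  const data = await res.json();
  return data.choices[0].message.content as string;
}

// A model only gets credit for a question if all three runs are correct.
async function scoreQuestion(
  model: string,
  question: string,
  correctLetter: string,
  extractAnswer: (raw: string) => string | null, // hypothetical helper
): Promise<boolean> {
  for (let run = 0; run < 3; run++) {
    const raw = await askOnce(model, question);
    if (extractAnswer(raw) !== correctLetter) return false;
  }
  return true;
}
```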
In addition, if an LLM provided a truly unusable answer - a blank response, garbage without any hint of one of the five answers, or an API timeout or error - I re-ran that specific answer until I got one that could be used, with the single exception of GLM-Z1 on one question due to provider flakiness. I'm giving the models some grace on these, since in my experience they largely don't behave this way locally.
Finally, this testing has some limitations. The answer key and questions were created by hand, so there's the possibility of errors creeping in during that process. Also, in the interest of not burning money on multiple full re-runs, and because the free models' endpoints are generally unreliable (rate limiting, timeouts, empty completions, provider-imposed token limits, etc.), the responses were assembled in a somewhat patchwork manner, with various transient modifications to the quizzing scripts to re-run subsets or exclude specific flaky or limited providers. There's therefore a nonzero chance of some clerical errors seeping into these scores. I feel pretty good about the questions, and the final rankings roughly agree with my personal experience with these models, but this was done quickly over the course of an evening or two. Grain of salt.
Results
With the aforementioned caveats out of the way, here are the results:
Model | Score |
---|---|
google/gemini-2.5-pro-preview-03-25 | 8/12 |
qwen/qwen3-32b | 8/12 |
qwen/qwq-32b | 7/12 |
google/gemini-2.5-flash-preview:thinking | 6/12 |
qwen/qwen3-235b-a22b | 6/12 |
anthropic/claude-3.7-sonnet:thinking | 5/12 |
deepseek/deepseek-r1 | 5/12* |
openai/gpt-4.5-preview | 5/12 |
thudm/glm-z1-32b | 5/12*** |
openai/gpt-4.1 | 4/12* |
openai/o4-mini | 4/12 |
qwen/qwen3-30b-a3b | 4/12 |
anthropic/claude-3.7-sonnet | 3/12 |
meta-llama/llama-4-scout | 3/12** |
nvidia/llama-3.1-nemotron-ultra-253b-v1 | 3/12** |
deepseek/deepseek-chat-v3-0324 | 2/12 |
google/gemma-3-27b-it | 2/12 |
meta-llama/llama-3.3-70b-instruct | 2/12 |
meta-llama/llama-4-maverick | 2/12** |
qwen/qwen-2.5-72b-instruct | 2/12 |
cohere/command-a | 1/12 |
mistralai/mistral-small-3.1-24b-instruct | 1/12 |
openai/gpt-4o-2024-11-20 | 1/12* |
thudm/glm-4-32b | 0/12 |
* Failed Question 0
** Required significant amounts of cleaning by hand due to failure to follow the format requirements
*** Significant numbers of API timeouts rendered me unable to get an answer for Question 10, but the two answers I did get disagreed, so it would not have counted anyway.
Gemini 2.5 Pro 03-25 and Qwen3 32b come out on top, followed by QwQ, Qwen3 235b-a22b, and Gemini 2.5 Flash. Honorable mention to the team at Tsinghua and Zhipu for tying the best of Anthropic, OpenAI, and DeepSeek on this benchmark with a 32b model.
Conclusions
I think there are a few things we can take away from these results:
First, many models posted unexpected scores, and I'm surprised by how low the score ceiling is; these questions appear to be harder than I had expected. Claude's and R1's performance seems unexpectedly low, whereas Qwen3, a 32B model, achieves 67% correctness. A 32B model matching Gemini 2.5 Pro was not on my list of predictions for this.
Second, it looks like my questions benefit heavily from reasoning, as evidenced by QwQ, Qwen3, Sonnet Thinking over Sonnet, R1 over V3, and the Gemini models' performance. I'm not particularly surprised by this: although my questions mostly fit into 200 tokens, they often require more than one distinct step of thinking, plus knowledge of a few different concepts at the same time, to solve correctly - not unlike the mathematics questions that famously benefit from reasoning.
Third, ouch. Claude Sonnet Thinking cost me much more than I was expecting - around $5.50 to run the suite of questions, plus some necessary re-tests as I discovered problems in the questions and scripts. Although it isn't the most expensive model per token, it generates an absolute ton of tokens - five figures for some questions. The next most expensive reasoning model, by comparison, was Gemini 2.5 Pro at $0.41. GPT-4.5 also deserves an honorable mention for costing $0.71 despite outputting a single-digit number of tokens per question.
Finally, and interestingly, none of the LLMs tested were able to solve Question 8. This isn't too surprising to me, as it's probably the hardest question in the dataset, and I wouldn't expect most frontend engineers to know how all of the concepts involved interact at the same time. I have triple-checked it, and the human-generated answer does appear to be correct.
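For a taste of why it's hard (this is only an illustration of the kind of interaction the question touches on, not the question itself): a composed, bubbling event that crosses a shadow boundary is retargeted to the host for any listener outside the shadow tree.

```ts
// Illustration only: event retargeting across a shadow boundary.
const host = document.createElement("div");
document.body.append(host);

const root = host.attachShadow({ mode: "open" });
const inner = document.createElement("button");
root.append(inner);

document.addEventListener("click", (e) => {
  // Outside the shadow tree, the target is retargeted to the host...
  console.log(e.target === host); // true
  // ...but composedPath() still starts at the real target (open root).
  console.log(e.composedPath()[0] === inner); // true
});

// `composed: true` lets the event escape the shadow root;
// `bubbles: true` lets it propagate up to document once it has.
inner.dispatchEvent(new MouseEvent("click", { bubbles: true, composed: true }));
```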
Once again, I would like to caveat that these are only twelve questions and I did this in a pretty patchwork manner, so clerical errors may have crept in. Please don't change your enterprise deployment or do anything financial based on these benchmarks.