OpenAI is the new king of vibe coding, according to a newly released benchmark from AI evaluation startup Vals AI.
In the benchmark, named Vibe Code Bench, OpenAI’s GPT-5.1 achieved the highest accuracy on a series of software engineering tasks, narrowly beating rival Anthropic’s Claude Sonnet 4.5. Surprisingly, Google’s brand-new flagship AI model, Gemini 3 Pro, came in fourth place out of 12.
Rayan Krishnan, Vals’ founder and CEO, says that there are several high-quality benchmarks for AI-generated coding, such as SWE-Bench, an evaluation that tasks AI models with fixing a long list of real-world bugs and issues—but none actually judge AI models on their ability to develop a fully functional web application from a single prompt. To remedy this, Krishnan and his team developed a series of tests to determine an AI model’s aptitude for taking a software engineering project from an idea to a simple, working app.
Vals has built several custom benchmarks to judge AI models across multiple sectors, including evaluations of their ability to answer tax questions and handle legal reasoning tasks.
Krishnan says Vals developed 100 unique specification sheets, each detailing the necessary features for a different kind of app. Vals gave these spec sheets to the AI models along with a detailed system prompt, placed the models in a development sandbox, and gave them tools that let them run code, browse the internet, and access internal databases.
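For readers who want a concrete picture of that setup, the sketch below shows roughly what packaging one spec sheet into a sandboxed task could look like. It is purely illustrative: Vals has not published its harness, and every name in it (build_task, SANDBOX_TOOLS, the spec-file format) is an assumption, not the company’s actual code.

```python
# Illustrative sketch only -- Vals has not released its benchmark harness.
# The tool names, spec format, and build_task() function are hypothetical.
import json

# Tools the article says the models were given inside the sandbox
SANDBOX_TOOLS = ["run_code", "browse_web", "query_database"]

def build_task(spec_path: str) -> dict:
    """Package one spec sheet into a task handed to a model in its sandbox."""
    with open(spec_path) as f:
        spec_sheet = json.load(f)  # e.g. {"app": "daily habits tracker", "features": [...]}
    return {
        "system_prompt": "Build a working web application that satisfies the spec below.",
        "spec": spec_sheet,
        "tools": SANDBOX_TOOLS,
        "time_limit_seconds": 5 * 60 * 60,  # up to five hours per app
    }
```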
Each model had up to five hours to work on each app. Some of the requested apps were off-brand replicas of popular consumer software, like a social media platform named Zeeter. Others were small apps that might be funded by a startup accelerator like Y Combinator, such as a daily habits tracker and a classroom management portal.
Once each model had finished building its app, Vals used a separate AI agent to evaluate the work. This evaluator agent would attempt to use the app just as a human would, and assign a score based on the number of features that worked as expected.
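As illustration only, the snippet below shows the basic arithmetic that scoring scheme implies: the share of requested features that behave as expected. The check_feature step is a hypothetical stand-in for the evaluator agent actually clicking through the app; Vals’ real evaluator is not public.

```python
# Illustrative sketch only -- the real evaluator is an AI agent that uses the app
# like a human would. check_feature() here is a hypothetical stand-in for that step.
def score_app(features: list[str], check_feature) -> float:
    """Return the share of requested features that worked as expected."""
    passed = sum(1 for feature in features if check_feature(feature))
    return passed / len(features) if features else 0.0

# Example: 3 of 5 requested features behave correctly, giving a score of 0.6
demo = score_app(
    ["sign up", "log in", "post", "follow", "search"],
    lambda feature: feature in {"sign up", "log in", "post"},
)
```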
The results reveal a hard truth: Today’s top AI models aren’t even close to perfect when it comes to vibe coding. The benchmark’s top model, GPT-5.1, built requested features correctly only 24.6 percent of the time. According to Vals’ report on the benchmark, “no model consistently delivers applications that pass every test on the first attempt, highlighting that reliable end-to-end application development remains an open challenge.”
Still, GPT-5.1 is the clear winner here. Not only did the model perform better than Claude Sonnet 4.5, it also cost less than half as much to run, averaging $2.57 per test versus Claude’s $6.66. According to Vals, GPT-5.1 isn’t just the best vibe coding model on the market, it’s also the most cost-effective. This is a major win for OpenAI, which for much of 2025 has been racing to catch up with Anthropic’s models in coding capability.
One of the biggest surprises to emerge from the testing, says Krishnan, was the length of time that Gemini 3 Pro took to complete tasks. On average, the model took 10,398 seconds per task, or over 173 minutes. In comparison, GPT-5.1 took 1,836 seconds, or just over 30 minutes. “It’s extremely slow,” Krishnan says of the Google model. Interestingly, Gemini 3 Pro is the top model on SWE-Bench.
One thing that Vals isn’t testing for? Design sensibility. Each model’s apps look wildly different from one another, even when starting from the same prompt. However, developing an objective evaluation of aesthetic taste is difficult; verifying whether a feature works as requested is much easier.
Krishnan was also surprised to find that Grok, the AI model from Elon Musk’s xAI, “didn’t have a lot of accountability.” He says that Grok 4 and Grok 4.1, the latest versions of the model, would quickly make mistakes within their own code repository. After identifying the mistake, according to Krishnan, Grok would say “I’ve observed a mistake in this repository. This is unrecoverable, I’m going to end early.” Because of Grok’s propensity to give up before really getting started, both models landed at the bottom of the leaderboard with 0 percent accuracy.
“I think there’s an element of persistence and recoverability that the models need to have,” says Krishnan, “otherwise they just get frustrated and go in these spirals.” That ability to spot errors, analyze them, and self-correct is the key difference that puts OpenAI and Anthropic’s models above the rest, he adds.
Ben Sherry