Tech companies are fiercely competing to build the best AI coding tools — and for xAI, the top rival to beat seems to be Anthropic.
Elon Musk’s AI company used contractors to train Grok on coding tasks with the goal of topping a popular AI leaderboard, and explicitly told them it wanted the model to outperform Anthropic’s Claude 3.7 Sonnet, documents obtained by Business Insider show.
The contractors, hired through Scale AI’s Outlier platform, were assigned a project to “hillclimb” Grok’s ranking on WebDev Arena, an influential leaderboard from LMArena that pits AI models against each other in web development challenges, with users voting for the winner.
“We want to make the in-task model the #1 model” for LMArena, reads one Scale AI onboarding doc that was active in early July, according to one contractor who worked on the project. Contractors were told to generate and refine front-end code for user interface prompts to “beat Sonnet 3.7 Extended,” a reference to Anthropic’s Claude model.
xAI did not reply to a BI request for comment.
In the absence of universally agreed-upon standards, leaderboard rankings and benchmark scores have become the AI industry’s unofficial scoreboard.
For labs like OpenAI and Anthropic, topping these rankings can help attract funding, new customers, lucrative contracts, and media attention.
Anthropic’s Claude family of models is considered a leader in AI coding and consistently ranks near the top of many leaderboards, often alongside models from Google and OpenAI.
Anthropic cofounder Ben Mann said on the “No Priors” podcast last month that other companies had declared “code reds” to try to match Claude’s coding abilities, and that he was surprised other models hadn’t caught up. Competitors like Meta are using Anthropic’s coding tools internally, BI previously reported.
The Scale AI dashboard and project instructions did not specify which version of Grok the project was training, though it was in use days before the newest model, Grok 4, came out on July 9.
On Tuesday, LMArena ranked Grok 4 in 12th place for web development. Models from Anthropic ranked in joint first, third, and fourth.
The day after Grok 4’s launch, Musk posted on X claiming that the new model “works better than Cursor” at fixing code, referring to the popular AI-assisted developer tool.
You can cut & paste your entire source code file into the query entry box and @Grok 4 will fix it for you!
This is what everyone @xAI does. Works better than Cursor.
— Elon Musk (@elonmusk) July 10, 2025
In a comment to BI, Scale AI said it does not overfit models by training them directly on a test set. The company said it never copies or reuses public benchmark data for large language model training and told BI it was engaging in a “standard data generation project using public signals to close known performance gaps.”
Anastasios Angelopoulos, the CEO of LMArena, told BI that while he wasn’t aware of the specific Scale project, hiring contractors to help AI models climb public leaderboards is standard industry practice.
“This is part of the standard workflow of model training. You need to collect data to improve your model,” Angelopoulos said, adding that it’s “not just to do well in web development, but in any benchmark.”
The race for leaderboard dominance
The industry’s focus on AI leaderboards can drive intense — and not always fair — competition.
Sara Hooker, the head of Cohere Labs and one of the authors of “The Leaderboard Illusion,” a paper published by researchers from universities including MIT and Stanford, told BI that “when a leaderboard is important to a whole ecosystem, the incentives are aligned for it to be gamed.”
In April, after Meta’s Llama 4 model shot up to second place on LMArena, developers noticed that the model variant Meta used for public benchmarking was different from the version released to the public. This sparked accusations from AI researchers that Meta was gaming the leaderboard.
Meta denied the claims, saying the variant in question was experimental and that evaluating multiple versions of a model is standard practice.
Although xAI’s project with Scale AI asked contractors to help “hillclimb” the LMArena rankings, there is no evidence that xAI or the contractors were gaming the leaderboard.
Leaderboard dominance doesn’t always translate into real-world ability. Shivalika Singh, another author of the paper, told BI that “doing well on the Arena doesn’t result in generally good performance” or guarantee strong results on other benchmarks.
Overall, Grok 4 ranked in the top three for LMArena’s core categories of math, coding, and “Hard Prompts.”
However, early data from Yupp, a new crowdsourced leaderboard and LMArena rival, showed Grok 4 ranking 66th out of more than 100 models, highlighting the variance between leaderboards.
Nate Jones, an AI strategist and product leader with a widely read newsletter, said he found Grok’s actual abilities often lagged behind its leaderboard hype.
“Grok 4 crushed some flashy benchmarks, but when the rubber met the road in my tests this week Grok 4 stumbled hard,” he wrote in his Substack on Monday. “The moment we set leaderboard dominance as the goal, we risk creating models that excel in trivial exercises and flounder when facing reality.”