AI agents are bad freelancers

Even the best artificial intelligence agents are fairly hopeless at online freelance work, according to an experiment that challenges the idea of artificial intelligence replacing office workers en masse.

The Remote Labor Index, a new benchmark developed by researchers at data annotation company Scale AI and the Center for AI Safety (CAIS), a nonprofit, measures the ability of frontier AI models to automate economically valuable work.

The researchers gave several leading AI agents a range of simulated freelance jobs and found that even the best could complete less than 3 percent of the work, earning $1,810 out of a possible $143,991. Of the tools they tested, the most capable was Manus, from a Chinese startup of the same name, followed by Grok from xAI, Claude from Anthropic, ChatGPT from OpenAI, and Gemini from Google.

“I hope this gives more accurate impressions of what’s going on in terms of AI capabilities,” says Dan Hendrycks, director of CAIS. He adds that while some agents have improved significantly over the past year or so, that doesn’t mean the improvement will continue at the same rate.

The remarkable progress in artificial intelligence has led to speculation that AI will soon surpass human intelligence and replace large numbers of workers. In March, Dario Amodei, CEO of Anthropic, suggested that 90 percent of coding work would be automated within months.

Previous waves of AI have inspired misplaced predictions about job displacement, for example the imminent replacement of radiologists by artificial intelligence algorithms.

The researchers created a pool of freelance tasks with the help of verified Upwork workers. The tasks cover a range of work, including graphic design, video editing, game development, and administrative tasks such as data mining. The researchers paired a description of each task with a directory of the files needed to perform the work and an example of a finished project produced by a human.

Hendrycks says that although AI models have improved at coding, mathematics, and logical reasoning in recent years, they still struggle to use different tools and to perform complex tasks that involve many steps. “They don’t have long-term memory storage, and they can’t continually learn from experiences. They can’t acquire skills on the job like humans,” he says.

The analysis provides a counterpoint to GDPval, a benchmark introduced by OpenAI in September that aims to measure economically valuable work. According to GDPval, leading AI models such as GPT-5 are approaching human capabilities in 220 tasks across a range of office jobs. OpenAI did not provide comment.
