
Have you ever thought about how AI compares to a human doctor in an emergency diagnostic environment? New research published Thursday might get you thinking about that question.
The study, published in Science, found that a state-of-the-art large language model outperforms human clinicians on a range of common clinical tasks. Using real emergency department data and hundreds of physician comparisons, the model matched or even exceeded human physician performance in diagnostic selection, emergency triage, and determining next steps in management.
The study’s authors said these findings do not mean that AI models are ready to replace human doctors. Instead, the findings suggest that industry professionals need faster and more rigorous evaluation standards and rules for using AI in medicine.
The researchers tested OpenAI’s o1-series large language model, which was released in 2024, across six experiments that mixed standardized clinical cases with a real-world sample of randomly selected emergency room patients at a medical center in Massachusetts.
The model’s advantage was most evident in early-stage screening, when decisions must be made with little information. Both the human doctors and the AI model improved as more data became available to them, but the study found that the LLM handled uncertainty much better, using fragmented or disorganized health data and observations more effectively.
These findings build on decades of using difficult diagnostic cases to evaluate medical computing systems. Previous LLMs have already outperformed older algorithmic approaches, but what sets this study apart is its scale and its direct comparison between human clinicians and an AI in real clinical scenarios.
The authors stressed that we should remain skeptical of these findings. Real clinical work in hospitals and emergency rooms often relies on visual and auditory cues — rather than text-based reasoning — which AI cannot fully and accurately interpret. “Future work is needed to evaluate how humans and machines can effectively collaborate in the use of non-textual cues,” the study notes.
When considering AI-assisted medical care, it is also important to evaluate whether it will be safe, equitable, and cost-effective, aspects that were not tested in this study.
“In short, the model outperformed our very large baseline of physicians. You’ll see that in detail, but that includes board-certified physicians who are actively practicing and real messy cases,” Arjun Manray, an assistant professor of biomedical informatics at Harvard Medical School, said during a virtual press conference.
“I don’t think our findings mean that AI is replacing doctors, despite what some companies might say and how they are likely to use these findings,” Manray said. “I think this means that we are seeing a really profound change in technology that will reshape medicine, and that we need to carefully evaluate this technology now, and conduct potential clinical trials.”
Regulators, hospitals, and healthcare providers must work together to thoroughly test these tools before deploying them to ensure safety and equity for all patients.
In a commentary also published Thursday in Science, Ashley M. Hopkins and Eric Cornelis, researchers at Flinders University in Australia, said that the study is a step towards better evaluation of artificial intelligence systems in health care, but this is a complex field that requires strict oversight to ensure that patients receive the best possible care.
“We do not allow doctors to practice without supervision and evaluation, and artificial intelligence should be held to comparable standards,” Cornelis said in a statement.