
A new study looks at how large language models perform in a variety of medical contexts, including real emergency room situations — where at least one model appears more accurate than human doctors.
The study, published this week in Science, comes from a research team led by doctors and computer scientists at Harvard Medical School and Beth Israel Deaconess Medical Center. The researchers said they conducted a variety of experiments to measure how OpenAI models compared to human doctors.
In one experiment, the researchers focused on 76 patients who came to the emergency room at Beth Israel and compared the diagnoses provided by two treating physicians with those generated by the OpenAI models o1 and 4o. These diagnoses were evaluated by two other doctors, who did not know which came from humans and which from artificial intelligence.
“At each diagnostic touchpoint, o1 performed nominally better than or on par with the two treating physicians and 4o,” the study said, adding that the differences “were particularly pronounced at the first diagnostic touchpoint (initial emergency triage), where the least information about the patient is available and the right decision is most urgent.”
In a Harvard Medical School press release about the study, the researchers stressed that they did not “pre-process the data at all”: the AI models were presented with the same information that was available in the electronic medical records at the time of each diagnosis.
Using this information, the o1 model was able to provide an “accurate or very close diagnosis” in 67% of triage cases, compared to one doctor getting an accurate or close diagnosis in 55% of cases, and the other doctor hitting the mark in 50% of cases.
“We tested the AI model against almost every benchmark, and it outperformed previous models and our clinical baselines,” Arjun Manray, who heads the AI Laboratory at Harvard Medical School and is one of the study’s lead authors, said in the press release.
To be clear, the study did not claim that AI is ready to make real life-or-death decisions in the emergency room. Instead, it said the results show “an urgent need for future trials to evaluate these technologies in real-world patient care settings.”
The researchers also noted that they only studied how the models performed when presented with text-based information, and that “existing studies suggest that current foundation models are more limited in considering non-textual input.”
Adam Rudman, MD, a Beth Israel physician and one of the study’s lead authors, told The Guardian that “there is currently no formal framework for accountability” around AI diagnoses, and that patients still “want humans to guide them through life-or-death decisions (and) to guide them through difficult treatment decisions.”