The 27-Year-Old Who Became a Covid-19 Data Superstar
In the contest over who could make the most accurate coronavirus forecast, it was global
institutions vs. a guy living with his parents in Santa Clara.
February 19, 2021, 11:00 AM GMT
Spring 2020 brought with it the arrival of the celebrity
statistical model. As the public tried to gauge how big a deal the coronavirus
might be in March and April, it was
pointed again and again to two forecasting systems: one built by Imperial
College London, the other by the Institute for Health
Metrics and Evaluation, or IHME, based in Seattle.
But the models yielded wildly
divergent predictions. Imperial warned that the U.S. might see as many as 2 million Covid-19 deaths by the
summer, while the IHME forecast was far more conservative, predicting about 60,000 deaths by August. Neither, it
turned out, was very close. The U.S. ultimately reached about 160,000 deaths by the start of
August.
The huge discrepancy in the forecasting figures that spring
caught the attention of a then 26-year-old
data scientist named Youyang Gu. The young man had a master’s degree in
electrical engineering and computer science from the Massachusetts Institute of
Technology and another degree in mathematics, but no formal training in a
pandemic-related area such as medicine or epidemiology. Still, he thought his
background dealing with data models could prove useful during the
pandemic.
In mid-April,
while he was living with his parents in Santa Clara, Calif., Gu spent a week building
his own Covid death predictor and a website to display the morbid
information. Before long, his model started producing more accurate results than
those cooked up by institutions with hundreds of millions of dollars in funding
and decades of experience.
“His model was the only one that seemed sane,” says
Jeremy Howard, a renowned data expert and research scientist at the University
of San Francisco. “The other models were shown to be nonsense time and again,
and yet there was no introspection from the people publishing the forecasts or
the journalists reporting on them. People’s lives were depending on these
things, and Youyang was the one person actually looking at the data and
doing it properly.”
The forecasting model that Gu built was, in some ways,
simple. He had first considered examining the relationship among
Covid tests, hospitalizations, and other factors but found that such data
was being reported inconsistently by
states and the federal government. The most
reliable figures appeared to be the daily death counts. “Other
models used more data sources, but I decided to rely on past deaths to predict
future deaths,” Gu says. “Having that as the only input helped filter
the signal from the noise.”
The novel, sophisticated twist of Gu’s model came from his use of machine learning algorithms to
hone his figures. After MIT, Gu spent a couple of years working in the
financial industry writing algorithms for high-frequency trading systems
in which his forecasts had to be accurate if he wanted to keep his job. When it
came to Covid, Gu kept comparing his predictions to the eventual reported death
totals and constantly tuned his machine
learning software so that it would lead to ever more precise
prognostications. Even though the work required the same hours as a demanding
full-time job, Gu volunteered his time and lived off his savings. He wanted his
data to be seen as free of any conflicts of interest or political bias.
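The loop described above (forecast from past deaths alone, compare against what was later reported, retune) can be illustrated with a toy sketch. This is not Gu's actual model, whose internals the article does not describe. The growth-ratio forecaster, the `window` parameter, the grid of candidate values, and the daily counts are all assumptions made up for illustration.

```python
import math


def forecast(daily_deaths, window, horizon):
    """Project daily deaths `horizon` days ahead using the geometric-mean
    day-over-day growth ratio observed in the last `window` days."""
    recent = daily_deaths[-(window + 1):]
    ratios = [b / a for a, b in zip(recent, recent[1:]) if a > 0 and b > 0]
    if ratios:
        growth = math.exp(sum(math.log(r) for r in ratios) / len(ratios))
    else:
        growth = 1.0  # no usable history: assume a flat trend
    projected, level = [], daily_deaths[-1]
    for _ in range(horizon):
        level *= growth
        projected.append(level)
    return projected


def mean_abs_error(predicted, actual):
    """Average absolute gap between forecast and reported values."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def tune_window(daily_deaths, candidates, holdout):
    """Grid search: hide the last `holdout` days, forecast them with each
    candidate window, and keep the window with the smallest error."""
    train, test = daily_deaths[:-holdout], daily_deaths[-holdout:]
    scores = {
        w: mean_abs_error(forecast(train, w, holdout), test)
        for w in candidates
    }
    return min(scores, key=scores.get), scores


# Synthetic daily death counts, invented for this example (not real data).
history = [100, 110, 121, 133, 146, 161, 177, 195, 214, 236, 259, 285, 314, 345]

best_window, scores = tune_window(history, candidates=[2, 4, 7], holdout=4)
next_week = forecast(history, best_window, horizon=7)
```

Rerunning `tune_window` each time new death counts arrive is the toy analogue of the constant retuning the article describes: the only input is past deaths, and the only feedback is how far earlier forecasts landed from the reported totals.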
While certainly not perfect, Gu’s model performed well from
the outset. In late April he predicted the U.S. would see 80,000 deaths by May
9. The actual death toll was 79,926. A similar late-April forecast from IHME
predicted that the U.S. would not surpass 80,000 deaths through all of 2020. Gu
also predicted 90,000 deaths on May 18 and 100,000 deaths on May 27, and once
again got the numbers right. Where IHME expected the virus to fade away as a
result of social distancing and other policies, Gu predicted there would be
a second, large wave of infections and deaths as many states reopened from
lockdowns.
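To put the May 9 forecast in perspective, the miss can be expressed as a percentage of the actual toll. The short sketch below uses only the two figures quoted above; the helper name `percent_error` is an illustrative choice, not something from Gu's work.

```python
def percent_error(predicted: float, actual: float) -> float:
    """Signed forecast error as a percentage of the actual value."""
    return (predicted - actual) / actual * 100.0


# Figures quoted in the article for May 9, 2020.
predicted_deaths = 80_000
actual_deaths = 79_926

miss = predicted_deaths - actual_deaths          # 74 deaths
miss_pct = percent_error(predicted_deaths, actual_deaths)
print(f"Off by {miss} deaths ({miss_pct:.2f}% of the actual toll)")
```

By this measure the forecast overshot by 74 deaths, or roughly a tenth of one percent.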
IHME faced some criticism in March and April, when its
numbers didn’t match what was happening. Still, the prestigious center, based
at the University of Washington and bolstered by more than $500 million in
funding from the Bill & Melinda Gates Foundation, was cited on an almost
daily basis during briefings by members of President Donald Trump’s
Administration. In April, U.S. infectious-disease chief Anthony Fauci told an
interviewer that Covid’s death toll “looks more like 60,000 than the 100,000 to
200,000” once expected—a prediction that reflected IHME forecasts. And on April
19, the same day Gu cautioned about a second wave, Trump pointed to IHME’s
60,000-death forecast as an indicator that the fight against the virus would
soon be over.
IHME officials also actively promoted their numbers. “You
had the IHME on all these news shows trying to tell people that deaths would go
to zero by July,” Gu says. “Anyone with common sense could see we would be at
1,000 to 1,500 daily deaths for a while. I thought it was very disingenuous for
them to do that.”
Christopher Murray, the director of IHME, says that once the
organization got a better handle on the virus after April, its forecasts
radically improved.
But that spring, week by week, more people started to pay
attention to Gu’s work. He flagged his model to reporters on Twitter and
e-mailed epidemiologists, asking them to check his numbers. Toward the end of
April, the prominent University of Washington biologist Carl Bergstrom tweeted
about Gu’s model, and not long after that the U.S. Centers for Disease Control
and Prevention included Gu’s numbers on its Covid forecasting website. As the pandemic progressed, Gu,
a Chinese immigrant who grew up in Illinois and California, found himself
taking part in regular meetings with the CDC and teams of professional modelers
and epidemiologists, as everyone tried to improve their forecasts.
Traffic to Gu’s website exploded, with millions of people
checking in daily to see what was happening in their states and the U.S.
overall. More often than not, his predicted figures ended up hugging the line
of actual death figures when they arrived a few weeks later.
With such intense interest around these forecasts, more
models began to appear through the spring and summer of 2020. Nicholas Reich,
an associate professor in the biostatistics and epidemiology department at the
University of Massachusetts, Amherst, collected the 50 or so models and
measured their accuracy over many months at the Covid-19 Forecast Hub. “Youyang’s
model was consistently among the top,” Reich says.
In November, Gu decided to wind down his death forecast
operation. Reich had been blending the various forecasts and found that the
most accurate predictions came from this “ensemble model,” or combined data.
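The idea of an ensemble can be sketched in a few lines. The article does not say how Reich's team combined the models, so the per-date median used here is an assumption chosen for simplicity, and the model names and numbers are invented for illustration.

```python
from statistics import median


def ensemble(forecasts):
    """Blend point forecasts from several models by taking, for each
    target date, the median of the models that forecast that date."""
    dates = sorted({d for model in forecasts.values() for d in model})
    return {
        d: median(model[d] for model in forecasts.values() if d in model)
        for d in dates
    }


# Hypothetical models and made-up weekly death forecasts (not real data).
forecasts = {
    "model_a": {"week_1": 1000, "week_2": 1100},
    "model_b": {"week_1": 1200, "week_2": 1500},
    "model_c": {"week_1": 1500, "week_2": 1300},
}

blended = ensemble(forecasts)
```

A median is less sensitive than a mean to a single model that is far off, which is one plausible reason a blend can beat its individual members. Other combination rules, such as a weighted average, would fit the same interface.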
“Youyang stepped back with a remarkable sense of humility,”
Reich says. “He saw the other models were doing well and his work here was
done.” A month before stopping the project, Gu had predicted that the U.S. would record 231,000
deaths on Nov. 1. When Nov. 1 arrived, the U.S. reported 230,995 deaths.
The IHME’s Murray has his own take on Gu’s exit. He says
Gu’s model would not have picked up on the seasonal nature of the coronavirus
and would have missed the winter surge in cases and deaths. “He had the
epidemic going away in the winter, and we had picked up that there was
seasonality as early as May,” Murray says.
The machine learning methods used by Gu work well at
short-range predictions, Murray says, but “are not very good at understanding
what is going on” in the bigger picture. The algorithms, based on the past,
can’t account for virus variants and how well vaccines may or may not work
against them, according to Murray. For its part, IHME called the early peak of
the virus correctly, then erred when it came to predicting a steep decline in
deaths until it adjusted its model to better reflect reality. “We got it wrong
the first of April,” Murray says. “Since then we are the only group that has
gotten it right consistently.”
Reich, who compiles the list of the major models, says that
IHME’s predictions later in the pandemic were passable. “Early on,
IHME’s model didn’t do what it advertised,” Reich says. “More recently, it has
been a reasonable model. I would not say it is one of the best, but it is
reasonable.”
Gu declined to address Murray’s remarks about his model.
Instead, he offers a data scientist’s version of a backhanded compliment. “I’m
very appreciative of Dr. Chris Murray and his team for the work they did,” Gu says.
“Without them, I would not be in the position I am today.”
To the extent that we can learn from this data story, Reich
asks that people not rush to place too much faith in early individual models
the next time a pandemic arrives. He also questions whether forecasts beyond
six to eight weeks out will ever be very accurate. Ideally, the CDC and others
will be quicker to combine models and distribute the blended data in the
future. “I hope we will invest the time, energy, and money into setting up a system
that is more ready to respond with a wider array of models closer to the
get-go,” Reich says. “We have to have people ready, instead of going around and
knocking on people’s doors.”
After taking a bit of a break, Gu, now 27 and living in a
New York apartment, did get back into the modeling game. This time, he’s
creating figures related to how many people in the U.S. have been infected by
Covid-19, how quickly vaccines are being rolled out, and when, if ever, the
country might reach herd immunity. His forecasts suggest that about 61% of the population should have some
form of immunity—either from the vaccine or past infection—by June.
Before the pandemic, Gu hoped to start a new venture,
possibly in sports analytics. Now he’s considering sticking to public health.
He wants to find a job where he can have a large impact while avoiding
politics, bias, and the baggage that sometimes comes with large institutions.
“There are a lot of shortcomings in the field that could be improved by people
with my background,” he says. “But I still don’t know quite how I would fit
in.”