Monday, Aug. 10

We are going to have some fun with statistical sampling today. We usually treat COVID-19 case counts as representative data, but they are biased: people get tested because they have symptoms or were in contact with someone who did. If we want to know the actual distribution of infections, we should use sampled data, which in epidemiology is called surveillance data.

Let’s look at sampling bias. In New York City, a surveillance sampling survey was completed on May 6, and it included age demographic data. I combined this data with the age demographics of the population of New York City to form a density of infections in New York, answering the question: what percentage of the New Yorkers with antibodies were in each age group?

Then I compared this to the same population as sampled by regular testing. The surveillance data is antibody data, and the case data comes from people who went through regular PCR testing, which detects current infection. You get this plot:

The confirmed cases show a severe age sampling bias: older people are confirmed more often. This is not surprising, because older people are vulnerable and testing was made available to them. The surprising thing is the under-18 group. Twenty-two percent of all positives were under 18. We can try to estimate how many more cases were not identified in each age group by comparing its ratio of antibody (surveillance) data to confirmed case data against the same ratio for the 65+ group. This next plot answers the question “How many more infections are we missing in the younger age groups than we are missing in the 65+ age group?”

It is a little frightening to me that we were missing 15 times more infections in the 18-44 age group than in the 65+ age group.

Antibody sampling is funny. Unlike the confirmed case data, the antibody age demographics really did not shift much when testing ramped up. You heard in the popular press that “younger people were catching novel coronavirus,” but what was mostly happening was that the sampling bias was shifting.
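
The multiplier from the New York City comparison above can be sketched in a few lines of Python. This is only my reading of the calculation, not necessarily the exact method used for the plot: for each age group, take the ratio of its share of antibody positives to its share of confirmed cases, then divide by the same ratio for the 65+ group. The function name `relative_missed_multiplier` is hypothetical, and the shares below are made-up placeholders, not the New York City figures.

```python
from typing import Dict


def relative_missed_multiplier(
    antibody_share: Dict[str, float],
    case_share: Dict[str, float],
    reference_group: str = "65+",
) -> Dict[str, float]:
    """Return, per age group, how many times more infections are missed
    than in the reference group.

    Assumed formula (an interpretation of the post, not a confirmed method):
        (antibody share / case share) for the group,
        divided by the same ratio for the reference group.
    """
    ref_ratio = antibody_share[reference_group] / case_share[reference_group]
    return {
        group: (antibody_share[group] / case_share[group]) / ref_ratio
        for group in antibody_share
    }


# Placeholder shares for illustration only; these are NOT the NYC data.
antibody_share = {"0-17": 0.20, "18-44": 0.40, "45-64": 0.30, "65+": 0.10}
case_share = {"0-17": 0.02, "18-44": 0.30, "45-64": 0.38, "65+": 0.30}

multipliers = relative_missed_multiplier(antibody_share, case_share)
for group, value in multipliers.items():
    print(f"{group}: {value:.1f}x")
```

By construction the 65+ group gets a multiplier of 1, so every other value reads as "times more infections missed than in the 65+ group." To reproduce the actual plot, the placeholder shares would need to be replaced with the survey and case shares from the data links at the end of the post.
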
Here are two antibody age demographics from on-demand antibody testing in Florida, from late May and early August. The later data includes only the new tests from that week:

You can see there is no prominent shift.

OK. Now let’s go back to the New York City data and look at the age demographics of new infections between June and August, a period in which testing was really high:

You can see that testing mostly fixed our problems with sampling. Now the distributions have almost the same shape. Let’s calculate that multiplier again:

You can see the multiplier is between 1 and 2 for all age groups except children. We are capturing roughly the same fraction of positives across all age groups except children. For children, we capture only 22 percent as many infections as we capture in older age groups.

That brings us to the title of today’s blog, “The problem with primary and secondary schools is no surveillance testing.”

With children back in school, we will see about four out of five infections being missed. Actually, it is worse than that, because we are already not capturing a lot of the infections in older people, so it is more like nine out of 10. We do not have surveillance testing to see this. By the time there are two to three kids in a classroom with symptoms, the entire class will have been infected.

Is this bad? We know symptomatic children shed virus at least as well as adults do. In adults, we know some asymptomatic cases also shed virus as well as symptomatic cases do.

We are performing a large-scale natural experiment in which we risk every student being infected, and most of the teachers too.

Data links:

Round Two, NYC
https://www.cdc.gov/coronavirus/2019-ncov/cases-updates/commercial-lab-surveys.html

NYC Age data
https://github.com/nychealth/coronavirus-data/commits/master/by-age.csv

Source for Florida data, see the files starting “serology”
http://ww11.doh.state.fl.us/comm/_partners/covid19_report_archive/?C=N;O=A

Contact Dave Blake.