I have three sisters. They all live within 20 minutes of the I-495 loop that surrounds Boston.
- My older sister had one hell of a nasty upper respiratory infection in late January and early February. She never went to a clinic and therefore was never diagnosed with anything.
- My middle sister had a nasty two-week flu in mid-February. She self-isolated at home. She went to her PCP and got a flu swab that came back positive. She recovered and was about to go back to the office and see her friends just before her concerned older brother suggested that physical distancing would be a good idea because of COVID-19.
- My youngest sister had a great winter and early spring regarding her health.
Why do I mention this?
Let us imagine that a researcher wants to conduct a serology study with a convenience sample to identify the prevalence of COVID-19 in Eastern Massachusetts and, more importantly, the prevalence of individuals who are now immune after having a low- to no-symptom infection. That is a damn good question, and we need good answers to it in order to inform policy responses. However, methodology matters.
If the researcher has a limited number of tests and wants a fast response on a small budget, they could put out ads announcing that COVID-19 immune/infection history tests are available. The researchers plan to use population weights to correct for observed demographic imbalances among the first 500 people who show up to get tested.
Is there a problem with this method?
My three sisters have very different probabilities of responding to that ad. My middle sister knows she has been physically isolated for almost two months now and that her notable winter illness was diagnosed as flu. My youngest sister is feeling great and has been physically isolated since the start of March.
However, my older sister had one hell of a nasty disease course this winter. There is a good probability that it was an undiagnosed flu. There is a decent probability it was neither flu nor COVID-19 but some other viral infection. There is a non-zero but fairly small probability that she had COVID-19 in late January.
Which sister is most likely to respond to an advertisement for free serology testing to see if they had COVID-19?
People like my older sister could very plausibly be far more responsive to an ad for free COVID-19 infection history/immunity/serology testing than either of my younger sisters. They would have a higher prior value on the new information that a good (albeit imperfect) test could give them.
A convenience sample in which the participants effectively self-recruit is highly likely to include lots of people who are systematically different from the general population. Self-selection means the generalizability of the results is extremely limited. A good researcher can say that whatever they saw holds for the self-selected sample but not for the general population. The result might establish a boundary of plausible estimates, but the point estimate is highly likely to be biased, and demographic weights cannot correct for self-selection on characteristics that were never observed.
If we want generalizability, we need either to test the complete population or to draw a random sample of it, so that each of my three sisters has the same probability of being selected for a test.
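To make the size of the problem concrete, here is a minimal back-of-the-envelope sketch in Python. Every number in it is made up for illustration: the three groups loosely mirror my three sisters, and the population shares, infection rates, and ad response rates are assumptions, not estimates from any study. It compares the true prevalence with what a self-selected sample would be expected to show, and with what a sample would show if everyone had the same chance of being tested.

```python
from dataclasses import dataclass


@dataclass
class Group:
    name: str
    population_share: float  # fraction of the general population (hypothetical)
    infection_rate: float    # true fraction with a past infection (hypothetical)
    response_rate: float     # probability of answering the ad (hypothetical)


# Illustrative groups only; none of these figures come from real data.
groups = [
    Group("unexplained nasty winter illness", 0.10, 0.20, 0.30),
    Group("diagnosed flu", 0.10, 0.02, 0.05),
    Group("healthy all winter", 0.80, 0.02, 0.05),
]


def true_prevalence(groups):
    """Prevalence in the whole population."""
    return sum(g.population_share * g.infection_rate for g in groups)


def sample_prevalence(groups, response_rates):
    """Expected prevalence among people who end up in the sample."""
    tested = sum(g.population_share * r for g, r in zip(groups, response_rates))
    infected = sum(
        g.population_share * r * g.infection_rate
        for g, r in zip(groups, response_rates)
    )
    return infected / tested


# Self-selection: each group answers the ad at its own rate.
self_selected = sample_prevalence(groups, [g.response_rate for g in groups])

# Random sampling: everyone has the same selection probability.
random_sample = sample_prevalence(groups, [0.05] * len(groups))

print(f"True prevalence:        {true_prevalence(groups):.1%}")
print(f"Self-selected estimate: {self_selected:.1%}")
print(f"Random-sample estimate: {random_sample:.1%}")
```

With these invented inputs, the self-selected sample overstates prevalence by more than a factor of two, while the equal-probability sample recovers the true figure. Because "had an unexplained illness this winter" is not something demographic weights on age, sex, or zip code would capture, reweighting the first 500 people through the door would not close that gap.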