Selection Bias
The Psychology Behind It
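To learn the truth about a group (e.g., "Do Americans like pizza?"), you need a sample that represents it, ideally a random one. If you only ask people inside a pizza restaurant, your data is garbage, because the people you reached differ systematically from the people you missed. That is selection bias.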
It happens because true randomness is hard. We naturally sample what is convenient, available, or willing. This creates a "distorted mirror" where we think we are seeing the world, but we are only seeing a specific slice of it.
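The pizza example can be made concrete with a small simulation. This is a minimal sketch, not real survey data: the 60% share of pizza fans, the assumption that fans are 10 times more likely to be found inside the restaurant, and the function name simulate_pizza_survey are all invented for illustration.

```python
import random


def share_true(values):
    """Fraction of True values in a list of booleans."""
    return sum(values) / len(values)


def simulate_pizza_survey(seed=0, population_size=100_000, sample_size=1_000):
    """Compare a random sample with a convenience sample from a pizza restaurant."""
    rng = random.Random(seed)

    # Hypothetical population: about 60% like pizza (invented figure).
    population = [rng.random() < 0.60 for _ in range(population_size)]

    # Random sample: every person is equally likely to be asked.
    random_sample = rng.sample(population, sample_size)

    # Convenience sample: pizza fans are ASSUMED to be 10x more likely
    # to be inside the restaurant than everyone else.
    weights = [10 if likes else 1 for likes in population]
    restaurant_sample = rng.choices(population, weights=weights, k=sample_size)

    return (
        share_true(population),
        share_true(random_sample),
        share_true(restaurant_sample),
    )


true_share, random_estimate, restaurant_estimate = simulate_pizza_survey()
print(f"true: {true_share:.2f}")
print(f"random sample: {random_estimate:.2f}")
print(f"restaurant sample: {restaurant_estimate:.2f}")
```

Under these assumptions the random sample should land close to the true share, while the restaurant sample should overshoot it by a wide margin, even though both samples are the same size.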
Real-World Examples
The 1936 Literary Digest Poll
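The magazine mailed roughly 10 million ballots, got about 2.4 million back, and predicted Alf Landon would comfortably beat FDR. Instead, FDR won in a landslide. Why? The mailing list was built from the magazine's own subscribers, automobile registrations, and telephone directories. In 1936, in the middle of the Great Depression, those lists skewed toward the better-off, who leaned against FDR, and they left out much of the poorer majority. On top of that, only people motivated enough to mail a ballot back were counted.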
Online Reviews
Product reviews are heavily biased. Who writes a review? Mostly people who loved the product (5 stars) or people who hated it (1 star). The vast middle ground of people who thought it was "okay" rarely bothers to write. Thus, reviews show a polarized world that doesn't match what buyers as a whole experienced.
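A rough simulation shows how self-selection alone can produce that polarization. Every number here is an assumption chosen for illustration: the spread of true opinions, the probability that a buyer with a given opinion posts a review, and the helper names simulate_reviews and extreme_share.

```python
import random


def simulate_reviews(seed=1, buyers=50_000):
    """Return (all buyers' true opinions, the subset that gets posted as reviews)."""
    rng = random.Random(seed)

    # Hypothetical true opinions: most buyers sit in the middle.
    stars = [1, 2, 3, 4, 5]
    opinion_weights = [0.05, 0.15, 0.40, 0.30, 0.10]

    # ASSUMED chance that a buyer with each opinion bothers to post a review.
    review_prob = {1: 0.30, 2: 0.05, 3: 0.02, 4: 0.05, 5: 0.30}

    opinions = rng.choices(stars, weights=opinion_weights, k=buyers)
    posted = [s for s in opinions if rng.random() < review_prob[s]]
    return opinions, posted


def extreme_share(ratings):
    """Fraction of ratings that are 1 star or 5 stars."""
    return sum(1 for s in ratings if s in (1, 5)) / len(ratings)


opinions, posted = simulate_reviews()
print(f"1- or 5-star share among all buyers: {extreme_share(opinions):.2f}")
print(f"1- or 5-star share among posted reviews: {extreme_share(posted):.2f}")
```

With these made-up rates, extreme ratings are a small minority of true opinions but a majority of what appears on the page.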
Medical Studies
If a study on a new weight-loss drug recruits volunteers, it gets people who are already motivated to lose weight. The drug might work for them, helped along by their own diet and exercise, but fail for the less motivated general population.
Consequences
Selection bias can lead to:
- False Conclusions: We believe things are true that are only true for a specific subgroup.
- Bad Policy: Laws are passed based on the loud voices of a self-selected few (lobbyists, activists) rather than the silent majority.
- Algorithm Bias: AI trained on biased data (e.g., resumes of mostly men) tends to learn and replicate that bias (e.g., favoring male candidates).
How to Mitigate It
Randomize, randomize, randomize.
- Random Sampling: Ensure every member of the population has an equal chance of being selected, or at least a known, nonzero chance.
- Check the Source: Ask, "Who is in this dataset? Who is excluded? Why?"
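- Weighting: If you know your sample is biased (e.g., too many men), reweight the responses so each group counts in proportion to its share of the true population. This only corrects for the traits you weight on; it cannot fix differences you never measured.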
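The weighting step can be sketched as follows. This is one simple form of weighting (post-stratification on a single trait), not the only one. The survey data, the 50/50 population split, and the function name poststratified_mean are hypothetical.

```python
def poststratified_mean(sample, population_shares):
    """Weighted mean in which each group counts by its population share.

    sample: list of (group, value) pairs.
    population_shares: dict mapping group -> share of the true population
    (shares are expected to sum to 1).
    """
    totals, counts = {}, {}
    for group, value in sample:
        totals[group] = totals.get(group, 0.0) + value
        counts[group] = counts.get(group, 0) + 1

    # A group with no respondents cannot be reweighted; fail loudly instead.
    missing = set(population_shares) - set(counts)
    if missing:
        raise ValueError(f"no respondents in groups: {sorted(missing)}")

    return sum(
        share * totals[group] / counts[group]
        for group, share in population_shares.items()
    )


# Hypothetical survey: 80 men and 20 women answer a yes/no question (1 = yes).
sample = (
    [("men", 1)] * 60 + [("men", 0)] * 20
    + [("women", 1)] * 5 + [("women", 0)] * 15
)

raw_mean = sum(value for _, value in sample) / len(sample)
weighted_mean = poststratified_mean(sample, {"men": 0.5, "women": 0.5})
print(f"raw: {raw_mean:.2f}  weighted: {weighted_mean:.2f}")
```

In this made-up survey the raw "yes" rate is 0.65 because men dominate the sample; once each group counts for half, as in the assumed population, the estimate drops to 0.50.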
Conclusion
Selection bias reminds us that "data" is not "truth." Data is only as good as the method used to collect it. If the net is flawed, the catch will be flawed.