“If you torture the data long enough, it will confess.” Ronald Coase, Nobel Prize Winner in Economics, 1991.
Imagine a below-average shooter (the original story has a Texan) randomly firing at the side of a barn, and then painting bull’s-eyes around the tightest clusters of bullet holes. He then showcases his ‘marksmanship’ to the world and is widely appreciated for his skill. This scenario is known as the Texas Sharpshooter Fallacy. In another version of the fallacy, the shooter first paints multiple bull’s-eyes on the side of the barn and then fires a single bullet; with so many targets, he is bound to hit one of them. He then erases all the other bull’s-eyes he had painted, and proudly displays the one with the bullet hole as proof of his sharpshooting skills.
If your first impression is that this scenario has nothing to do with the practice of Data Science, think again.
This fallacy is one of the most widely prevalent Data Science practitioner mistakes, and the problem is two-fold. The first version of the fallacy (also related to the Clustering Illusion) is about establishing specific patterns after weak or even negligible data analysis, and then ‘processing or transforming’ the available data and ‘structuring’ new theories to force-fit them into those patterns. The rationale for such behaviour is generally attributed to the absence of adequate data for analysis, behavioural biases, over-reliance on past results and experiences, and intellectual laziness. The second version of the fallacy (also known as P-hacking) pertains to the act of conducting multiple tests to prove or disprove certain hypotheses, but reporting the results of only those tests that yield favourable (low) p-values, while largely ignoring the rest. The rationale here is generally attributed to intellectual fraud, behavioural biases and, at times, honest errors.
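To make the second version concrete, here is a minimal Python sketch, using simulated data and the conventional 0.05 threshold (both illustrative assumptions, not drawn from any real study). It runs many tests on pure noise, so every ‘significant’ result is a false positive waiting to be cherry-picked, and a simple Bonferroni correction removes most of them.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Simulate 100 independent 'experiments' in which the null hypothesis is true:
# both groups are drawn from the same distribution, so any 'significant'
# difference is a false positive.
n_tests, n_samples, alpha = 100, 30, 0.05
p_values = []
for _ in range(n_tests):
    group_a = rng.normal(loc=0.0, scale=1.0, size=n_samples)
    group_b = rng.normal(loc=0.0, scale=1.0, size=n_samples)
    _, p = stats.ttest_ind(group_a, group_b)
    p_values.append(p)

p_values = np.array(p_values)

# Reporting only the tests below alpha is exactly the sharpshooter's trick.
print(f"'Significant' without correction: {(p_values < alpha).sum()} of {n_tests}")

# Bonferroni correction: compare each p-value against alpha / n_tests.
print(f"'Significant' after Bonferroni:   {(p_values < alpha / n_tests).sum()} of {n_tests}")
```

Pre-registering the hypotheses and applying a multiple-comparison correction of this kind are among the simplest safeguards against this version of the fallacy.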
Both new and experienced Data Scientists are susceptible to the two versions of the Texas Sharpshooter Fallacy, and must take adequate measures to guard against them. There has been a lot of research and innovation, particularly in this decade, aimed at addressing the effects of this fallacy. As the field of Data Science receives greater scrutiny over time (which, I think, is essential for both technological and process maturity), practitioners are bound to exercise more caution to prevent the adverse influence of this fallacy on their Data Science projects.
Advertisers often use this fallacy to their advantage by cherry-picking data clusters to suit their arguments, or by establishing patterns to fit existing perceptions.
A 1996 study on the effects of smoking on women revealed higher mortality rates for non-smokers, which is obviously counter-intuitive and extremely surprising. However, when the data was segregated into different age groups, the results revealed that smokers had higher mortality rates in all but one of the groups. The graphs are shown below.
The above study is a classic case of Simpson’s Paradox, a phenomenon in which a trend that appears in separate groups of data reverses when the groups are aggregated. An important implication of this paradox is that causal inferences drawn from observational data can lead to flawed analysis and insight generation. Hence, effective validation mechanisms must be enforced to rule out the presence of this paradox. I have observed that even experienced Data Scientists sometimes find this paradox challenging to grasp when addressing real business cases (even though they may understand the phenomenon in theory).
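To illustrate how the reversal can arise, here is a small pandas sketch with hypothetical counts (not the actual figures from the 1996 study): smokers have the higher mortality rate within each age group, yet non-smokers have the higher rate in aggregate, because non-smokers are concentrated in the older, higher-risk group.

```python
import pandas as pd

# Hypothetical counts chosen so that smokers fare worse WITHIN each age group,
# while non-smokers fare worse overall.
data = pd.DataFrame({
    "age_group": ["18-44", "18-44", "65+", "65+"],
    "smoker":    [True, False, True, False],
    "people":    [800, 200, 200, 800],
    "deaths":    [40, 8, 100, 360],
})

# Mortality rate within each age group: smokers are worse off in both groups.
by_group = data.assign(rate=data["deaths"] / data["people"])
print(by_group[["age_group", "smoker", "rate"]])

# Aggregated mortality rate: the trend reverses, and non-smokers look worse off.
aggregated = data.groupby("smoker")[["people", "deaths"]].sum()
aggregated["rate"] = aggregated["deaths"] / aggregated["people"]
print(aggregated)
```

The lurking variable here is age: once it is controlled for, the apparent aggregate advantage of smoking disappears, which is why stratifying on plausible confounders is a basic validation step.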
In 1969, the addition of a new road in Stuttgart, Germany had an adverse effect on the traffic situation of the area, and conditions returned to normal only after the new road was closed. In 1990, the closure of a street in New York City led to a reduction in traffic congestion in that area, and the same phenomenon was witnessed when a road was closed in Seoul, South Korea, around 2005. These observations are surprising because a new road is expected to de-congest the surrounding area, while the closure of an existing road is expected to increase congestion on the other roads. These cases reflect a phenomenon from Game Theory called Braess’ Paradox. While this paradox has traditionally influenced areas like electric power grids and traffic management, it also affects an emerging area of Data Science: Reinforcement Learning. Even experienced Data Scientists find it challenging to address this paradox while building multi-agent reinforcement learning models for autonomous systems.
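The mechanics of the paradox can be seen in the textbook Braess network. The sketch below uses illustrative parameters (4,000 drivers, congestion-dependent links costing T/100 minutes, fixed links costing 45 minutes), not data from the Stuttgart, New York or Seoul cases.

```python
# Classic Braess network (illustrative parameters):
# 4000 drivers travel Start -> End.
#   Start -> A : T/100 minutes when T drivers use the link
#   A -> End   : fixed 45 minutes
#   Start -> B : fixed 45 minutes
#   B -> End   : T/100 minutes when T drivers use the link
N = 4000

def route_times_without_shortcut(n_upper):
    """Route times when n_upper drivers take Start-A-End and the rest take Start-B-End."""
    upper = n_upper / 100 + 45
    lower = 45 + (N - n_upper) / 100
    return upper, lower

# Without the shortcut, the equilibrium splits traffic evenly: 65 minutes per driver.
print(route_times_without_shortcut(N // 2))  # (65.0, 65.0)

# Now add a zero-cost shortcut A -> B. The route Start-A-B-End costs T/100 + 0 + T/100.
# Every driver prefers the T/100 legs (at most 40 min) to the fixed 45-min legs, so at
# equilibrium all 4000 drivers crowd onto the shortcut route.
shortcut_time = N / 100 + 0 + N / 100
print(shortcut_time)  # 80.0 minutes -- worse for everyone than the original 65.
```

Each driver's selfish best response leads to a collectively worse equilibrium, which is precisely why independently trained agents in a multi-agent reinforcement learning system can end up degrading overall performance when an apparently helpful option is added.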
Another practitioner mistake is the use of misleading (or distorted) graphs that lead to flawed data representations and inaccurate conclusions. These distortions generally take the form of biased labels, improper scaling, excessive complexity, truncated data, and poor graph construction. The misrepresentation can be due to technical limitations in expressing the underlying data points as graphs, genuine errors, or even intentional distortion (for instance, false advertising). Furthermore, new visualization tools are sometimes adopted without adequately validating their fit with the specific requirements of the analyses.
An online search for misleading graphs will throw up multiple examples every year. Here is one from 2018, in which the low market share of a particular company is made to look larger than it actually is, a distortion that is difficult to detect unless the graph is studied minutely. Whether this was intentional or otherwise is beyond the scope of this article, but the fact remains that the graph gives a wrong first impression. More details can be found here.
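The truncated-axis effect in particular is easy to reproduce. The short matplotlib sketch below uses hypothetical market-share numbers (not the figures from the 2018 chart) to show how the same two values look dramatically different with a truncated y-axis and unremarkable with a zero baseline.

```python
import matplotlib.pyplot as plt

# Hypothetical market-share figures: the gap looks dramatic when the y-axis is
# truncated, and modest when the axis starts at zero.
companies = ["Company A", "Company B"]
share = [39.0, 41.5]  # percentages

fig, (ax_truncated, ax_honest) = plt.subplots(1, 2, figsize=(8, 3))

ax_truncated.bar(companies, share)
ax_truncated.set_ylim(38, 42)          # truncated axis exaggerates the difference
ax_truncated.set_title("Truncated y-axis")

ax_honest.bar(companies, share)
ax_honest.set_ylim(0, 100)             # zero baseline shows the true proportions
ax_honest.set_title("Zero-baseline y-axis")

plt.tight_layout()
plt.show()
```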
As decision-makers increase their reliance on visual data/visual analyses, the effects of misleading graphs can be enormous.
Another major practitioner mistake, particularly in recent years, is the application of algorithms without understanding their true boundaries. Algorithms, whether traditional or new-age, are generally built on certain assumptions and mathematical formulations, which means that their utility and scope are usually bounded. However, limited attention is increasingly paid to evaluating these assumptions and mathematical underpinnings before adopting the algorithms for model development. Moreover, with the accelerated pace at which new algorithms appear, it is not uncommon to see a frantic rush to deploy the ‘latest algorithms’ without adequately understanding their inherent limitations and mathematical boundaries. This increases the risk of incorrect algorithm selection, thereby leading to inefficiencies in Data Science design and execution. For instance, a search on any Machine Learning topic on Arxiv Sanity throws up tens of new innovations and developments (in the form of papers) each month. Companies need structured, institutionalized mechanisms to explore, qualify and test these innovations, adopt the ones that are found to be relevant, and discard the others. In reality, many Data Science teams approach this exercise in an ad-hoc, individual-driven and inefficient manner.
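As a small illustration of testing an algorithm’s assumptions before adopting it, the sketch below (using a synthetic dataset and hyperparameters chosen purely for illustration) shows k-means, which assumes roughly spherical and similarly sized clusters, struggling on two interleaving half-moons that a density-based method separates cleanly.

```python
from sklearn.cluster import KMeans, DBSCAN
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaving half-moon clusters: a shape that violates k-means' implicit
# assumption of convex, roughly spherical clusters.
X, y_true = make_moons(n_samples=500, noise=0.05, random_state=0)

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# Adjusted Rand Index against the true labels: 1.0 is a perfect recovery.
print("k-means ARI:", round(adjusted_rand_score(y_true, kmeans_labels), 3))
print("DBSCAN  ARI:", round(adjusted_rand_score(y_true, dbscan_labels), 3))
```

A quick assumption check of this kind, run on data that resembles the actual problem, is far cheaper than discovering the mismatch after a model has been deployed.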
To conclude, the practitioner mistakes highlighted above are only a partial list. Data Science is a complex and evolving field, and there are many other critical issues, including the significant problem of Cognitive Biases affecting it. It is essential for Data Scientists to be aware of these prevalent mistakes, especially as the field develops and more and more new professionals enter the space. It is equally important for companies and their leaders to build institutional structures that, while providing the space for experimentation, also enable the long-term remediation of current and future practitioner mistakes.