Data Dredging

Data dredging, also called data snooping or data fishing, is the practice of misusing data mining techniques to produce misleading scientific ‘research’.

Data dredging is usually practiced by a researcher who wants to ‘prove’ a point of view that might not hold, or that might not be supported by the actual data. There are a number of reasons for data snooping, and it is a matter of grave concern because it uses statistical principles to draw misleading and false conclusions.

How Data Snooping Works

Suppose there is a given data set and a huge number of hypotheses about it. Suppose also that the data are totally random, so that all of the hypotheses are actually false. Owing to the sheer number of hypotheses tested on a limited data set, some variables are still likely to appear highly correlated and statistically significant purely by chance. When such chance findings are reported as real effects, data dredging is said to have taken place. Industries that rely heavily on data mining are often involved in data dredging.
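The effect described above can be sketched with a small simulation. This is only an illustration under stated assumptions: the sample size (50 subjects), the number of hypotheses (200), the 5% threshold, and the function names are all chosen for the example, and the p-value uses the Fisher z approximation rather than an exact test. Every variable is pure random noise, so every "significant" correlation it finds is a false positive.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def approx_p_value(r, n):
    """Approximate two-sided p-value for r, via the Fisher z transformation."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return math.erfc(abs(z) / math.sqrt(2))


def count_false_positives(n_subjects=50, n_hypotheses=200, alpha=0.05, seed=0):
    """Test many random predictors against one random outcome.

    All values are independent noise, so each hypothesis is false
    and every 'significant' result is a false positive.
    """
    rng = random.Random(seed)
    outcome = [rng.gauss(0.0, 1.0) for _ in range(n_subjects)]
    hits = 0
    for _ in range(n_hypotheses):
        predictor = [rng.gauss(0.0, 1.0) for _ in range(n_subjects)]
        r = pearson_r(predictor, outcome)
        if approx_p_value(r, n_subjects) < alpha:
            hits += 1
    return hits


n_significant = count_false_positives()
print(f"{n_significant} of 200 random hypotheses look 'significant' at the 5% level")
```

With 200 tests at a 5% threshold, roughly 10 spurious "findings" are expected on average, although the exact count depends on the random seed.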

For example, a drug company might spend millions of dollars on a drug that does not show the results that were initially expected. However, the company needs to market the drug in order to make a profit from it. It therefore uses data snooping to make claims that are not actually true, even though the data appear to confirm them. This is done by taking a representative sample and collecting a huge number of parameters about the test subjects, so that the drug can be correlated with some benefit in one form or another.

Data fishing can also be done by narrowing down the sample to include only those results that bear out the intended hypothesis. For instance, the drug might be tested on 1000 patients without showing a statistically significant positive result for a given problem. However, by narrowing the study down to 500 people, with a selection bias towards those who showed favorable results while using the drug, the company can claim something that is not actually true.
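A minimal sketch of this kind of selection bias follows. The setup is hypothetical: the "improvement" scores are random noise centered on zero (a drug with no effect), the 1000 and 500 figures come from the example above, and the simple one-sample z statistic and the name `z_statistic` are assumptions made for illustration.

```python
import math
import random
import statistics


def z_statistic(values):
    """One-sample z statistic for the hypothesis that the mean is zero."""
    n = len(values)
    return statistics.mean(values) / (statistics.stdev(values) / math.sqrt(n))


rng = random.Random(1)

# Improvement scores for 1000 patients on a drug with no real effect.
all_patients = [rng.gauss(0.0, 1.0) for _ in range(1000)]

# Biased selection: keep only the 500 patients with the best outcomes.
selected = sorted(all_patients, reverse=True)[:500]

z_all = z_statistic(all_patients)
z_selected = z_statistic(selected)

print(f"all 1000 patients: mean = {statistics.mean(all_patients):.3f}, z = {z_all:.2f}")
print(f"selected 500:      mean = {statistics.mean(selected):.3f}, z = {z_selected:.2f}")
```

In the full sample the mean improvement stays close to zero, while the hand-picked half shows a large and apparently overwhelming effect, even though the drug does nothing in this simulation.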

If there is no effect between variables and your significance level is .05 (5%), then on average 1 in 20 tests will show an effect that is not real, purely because of random error.
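The arithmetic behind this figure can be sketched as follows, assuming the tests are independent of one another (an assumption that real studies often violate). The function names are made up for the example. With a threshold of alpha and m tests of hypotheses that are all false, the expected number of false positives is m × alpha, and the chance of seeing at least one is 1 − (1 − alpha)^m.

```python
def expected_false_positives(n_tests, alpha=0.05):
    """Expected number of false positives when no real effect exists."""
    return n_tests * alpha


def prob_at_least_one_false_positive(n_tests, alpha=0.05):
    """Chance of one or more false positives, assuming independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


for n_tests in (1, 20, 100):
    expected = expected_false_positives(n_tests)
    prob = prob_at_least_one_false_positive(n_tests)
    print(f"{n_tests:>3} tests: expected false positives = {expected:.2f}, "
          f"P(at least one) = {prob:.3f}")
```

For 20 tests this gives one expected false positive and a probability of about 0.64 of finding at least one, which is why testing many hypotheses on the same data almost guarantees a spurious result.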

However, not all data dredging is intentional. Many times, researchers are simply misled by the apparent correlations that they see. This happens most frequently when the researchers themselves are not sure what exactly they are looking for. It is therefore important to form a hypothesis before starting the experiment, in order to prevent accidental data dredging.

If not, the researchers might stumble upon a correlation that doesn’t actually exist but shows up strongly in their data. Researchers working in data mining need to be aware of this, because such findings can be seriously misleading and can divert valuable resources towards claims that are not really true.

Full reference: 

(Oct 16, 2010). Data Dredging. Retrieved Jun 16, 2024 from

You Are Allowed To Copy The Text

The text in this article is licensed under the Creative Commons-License Attribution 4.0 International (CC BY 4.0).

This means you're free to copy, share and adapt any parts (or all) of the text in the article, as long as you give appropriate credit and provide a link/reference to this page.

That is it. You don't need our permission to copy the article; just include a link/reference back to this page. You can use it freely (with some kind of link), and we're also okay with people reprinting in publications like books, blogs, newsletters, course-material, papers, wikipedia and presentations (with clear attribution).
