This section of the statistics tutorial is about understanding how data is acquired and used.
The results of a scientific investigation often contain much more data or information than the researcher needs. This unprocessed material is called raw data.
To be able to analyze the data sensibly, the raw data is processed into "output data". There are many methods to process the data, but basically the scientist organizes and summarizes the raw data into a more sensible chunk of data. Any type of organized information may be called a "data set".
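As a minimal sketch of turning raw data into a summarized data set, the example below counts hypothetical survey answers and converts the counts into proportions. The answers and the variable names are made up for illustration.

```python
from collections import Counter

# Hypothetical raw data: answers exactly as they were recorded.
raw_data = ["yes", "no", "yes", "yes", "unsure", "no", "yes"]

# Organize: count how often each answer occurs.
counts = Counter(raw_data)

# Summarize: express each count as a share of all answers.
total = len(raw_data)
data_set = {answer: count / total for answer, count in counts.items()}

for answer, share in data_set.items():
    print(f"{answer}: {counts[answer]} ({share:.0%})")
```

The raw list is kept untouched, so it is still possible to go back to it later.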
Then, researchers may apply different statistical methods to analyze and understand the data better (and more accurately). Depending on the research, the scientist may also want to use statistics descriptively or for exploratory research.
What is great about raw data is that you can go back and check things if you suspect something different is going on than you originally thought. This happens after you have analyzed the meaning of the results.
The raw data can give you ideas for new hypotheses, since you get a better view of what is going on. You can also control the variables which might influence the conclusion (e.g. third variables). In statistics, a parameter is any numerical quantity that characterizes a given population or some aspect of it.
Central Tendency and Normal Distribution
This part of the statistics tutorial will help you understand distribution, central tendency and how it relates to data sets.
The central tendency may give a fairly good idea about the nature of the data (the mean, median and mode each show a "middle value"), especially when combined with measurements of how the data is distributed. Scientists normally calculate the standard deviation to measure how spread out the data is.
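The measures named above can be computed with Python's standard `statistics` module. This is only an illustration; the sample values are hypothetical.

```python
import statistics

# Hypothetical sample of seven measurements.
sample = [12, 15, 15, 18, 20, 22, 24]

mean = statistics.mean(sample)      # arithmetic mean
median = statistics.median(sample)  # middle value of the sorted data
mode = statistics.mode(sample)      # most frequent value
spread = statistics.stdev(sample)   # sample standard deviation (divides by n - 1)

print(mean, median, mode, round(spread, 2))
```

Here the mean and median agree while the mode is lower, which already hints that the values are not perfectly symmetric.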
To create the graph of the normal distribution for something, you'll normally use the arithmetic mean of a "big enough sample" and you will have to calculate the standard deviation.
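A rough sketch of that procedure, assuming a small hypothetical sample (a real sample would need to be much larger): the mean and standard deviation define the curve, and `statistics.NormalDist` gives its height at any point.

```python
from statistics import NormalDist, mean, stdev

# Hypothetical sample; far too small in practice, used only for illustration.
sample = [160, 165, 170, 170, 175, 180]

mu = mean(sample)
sigma = stdev(sample)
dist = NormalDist(mu, sigma)

# Points of the bell curve from 3 standard deviations below to 3 above the mean.
curve = [(mu + k * sigma, dist.pdf(mu + k * sigma)) for k in range(-3, 4)]

# Share of the distribution within one standard deviation of the mean.
within_one_sd = dist.cdf(mu + sigma) - dist.cdf(mu - sigma)

for x, height in curve:
    print(f"{x:7.2f}  {height:.4f}")
print(f"within one standard deviation: {within_one_sd:.3f}")
```

Passing the `curve` points to any plotting tool would draw the familiar bell shape, with its peak at the mean.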
However, the data will not be normally distributed if the underlying distribution is naturally skewed or if outliers (often rare outcomes or measurement errors) distort it. One example of a distribution which is not normal is the F-distribution, which is skewed to the right.
So researchers often double check that their results are normally distributed, for example by looking at the range and by comparing the mean with the median and mode, which coincide in a normal distribution. If the distribution is not normal, this will influence which statistical test or method to choose for the analysis.
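One informal way to run such a check is sketched below: compare the mean with the median and compute a moment-based skewness value. The function name `skewness` and both data sets are hypothetical, and this is a quick screen rather than a formal normality test.

```python
from statistics import fmean, median

def skewness(values):
    """Moment-based skewness: 0 for symmetric data, positive for a right tail."""
    m = fmean(values)
    m2 = fmean([(v - m) ** 2 for v in values])
    m3 = fmean([(v - m) ** 3 for v in values])
    return m3 / m2 ** 1.5

symmetric = [1, 2, 3, 4, 5]
skewed = [1, 2, 2, 3, 3, 3, 4, 20]  # one large outlier pulls the mean upward

for name, data in (("symmetric", symmetric), ("skewed", skewed)):
    print(name, fmean(data), median(data), round(skewness(data), 2))
```

A mean well above the median, together with a clearly positive skewness, suggests a right-skewed distribution.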
A word of caution: data dredging, data snooping or fishing for data without later testing your hypothesis in a controlled experiment may lead you to conclude that there is cause and effect where no real relationship exists.
Depending on the hypothesis, you will have to choose between one-tailed and two-tailed tests.
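The difference between the two can be illustrated with a test statistic that follows the standard normal distribution. The helper `tail_p_values` is a hypothetical name, and the sketch assumes a z statistic has already been computed.

```python
from statistics import NormalDist

def tail_p_values(z):
    """Return lower-tail, upper-tail and two-tailed p-values for a z statistic."""
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    lower = standard_normal.cdf(z)  # H1: the true value is smaller
    upper = 1 - lower               # H1: the true value is larger
    two_tailed = 2 * min(lower, upper)  # H1: the true value differs either way
    return {"lower": lower, "upper": upper, "two_tailed": two_tailed}

p = tail_p_values(1.96)
print({name: round(value, 4) for name, value in p.items()})
```

For the same statistic, the two-tailed p-value is twice the smaller one-tailed value, which is why the choice has to follow from the hypothesis and not from the result.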
Publication bias often arises because results that support the alternative hypothesis are more likely to be published than a "null result", where the researcher concludes that the null hypothesis provides the best explanation.
If applied correctly, statistics can be used to understand cause and effect between research variables.
It may also help identify third variables, although statistics can also be used to manipulate and cover up third variables if the person presenting the numbers does not have honest intentions (or sufficient knowledge) with their results.
Here is another great statistics tutorial which integrates statistics and the scientific method.
Reliability and Experimental Error
Statistical tests make use of data from samples. These results are then generalized to the general population. How can we know that it reflects the correct conclusion?
Contrary to what some might believe, errors in research are an essential part of significance testing. Ironically, the possibility of a research error is what makes the research scientific in the first place. If a hypothesis cannot be falsified (e.g. the hypothesis has circular logic), it is not testable, and thus not scientific, by definition.
If a hypothesis is testable, it must be open to the possibility of being wrong. Statistically, this opens up the possibility of getting experimental errors in your results due to random errors or other problems with the research. Experimental errors may also be broken down into Type-I errors (false positives) and Type-II errors (false negatives). ROC curves plot the true positive rate (sensitivity) against the false positive rate to show the trade-off between the two.
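To make these error types concrete, the sketch below computes the rates from hypothetical counts of correct and incorrect decisions. The function name `error_rates` and the counts are assumptions for illustration; each pair of sensitivity and false positive rate corresponds to a single point on a ROC curve.

```python
def error_rates(tp, fn, fp, tn):
    """Compute rates from counts of true/false positives and negatives."""
    sensitivity = tp / (tp + fn)          # true positive rate
    false_positive_rate = fp / (fp + tn)  # share of Type-I errors among true negatives
    false_negative_rate = fn / (tp + fn)  # share of Type-II errors among true positives
    return sensitivity, false_positive_rate, false_negative_rate

# Hypothetical outcome counts for 100 decisions.
sensitivity, fpr, fnr = error_rates(tp=40, fn=10, fp=5, tn=45)
print(sensitivity, fpr, fnr)
```

Repeating the calculation for different decision thresholds and plotting sensitivity against the false positive rate would trace out the full curve.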
Replicating the research of others is also essential to understand if the results of the research were a result which can be generalized or just due to a random "outlier experiment". Replication can help identify both random errors and systematic errors (test validity).
What you often see if the results have outliers is a regression towards the mean, which can leave no statistically significant difference between the experimental and control group.
Here we will introduce a few commonly used statistics tests/methods, often used by researchers.
Relationship Between Variables
The relationship between variables is very important to scientists. This will help them to understand the nature of what they are studying. A linear relationship is when two variables vary proportionally, that is, if one variable goes up, the other variable will also go up with the same ratio. A non-linear relationship is when variables do not vary proportionally. Correlation is a way to express the relationship between two data sets or between two variables.
Measurement scales are used to classify, categorize and (if applicable) quantify variables.
The Pearson correlation coefficient (or Pearson Product-Moment Correlation) will only express the linear relationship between two variables. Spearman's rho is mostly used for monotonic relationships when dealing with ordinal (ranked) variables. Kendall's tau (τ) coefficient is also based on ranks and can be used to measure monotonic relationships that are not linear.
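The contrast between Pearson and Spearman can be sketched as follows. The helpers `pearson`, `ranks` and `spearman` are hypothetical names written for this illustration, with Spearman computed as the Pearson correlation of the ranks; the data are made up.

```python
from math import sqrt
from statistics import fmean

def pearson(x, y):
    """Pearson correlation: strength of the linear relationship."""
    mx, my = fmean(x), fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def ranks(values):
    """Rank values from 1 upward, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman's rho: Pearson correlation applied to the ranks."""
    return pearson(ranks(x), ranks(y))

x = [1, 2, 3, 4, 5]
y = [v ** 2 for v in x]  # rises steadily, but not in a straight line

print(round(pearson(x, y), 3), round(spearman(x, y), 3))
```

Because `y` always increases with `x`, the rank-based coefficient reaches 1, while Pearson stays slightly below 1 since the relationship is curved.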
A Z-Test is similar to a t-test, but will usually not be used on sample sizes below 30.
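A minimal one-sample z-test might look like the sketch below, assuming the population standard deviation is known. The function name `one_sample_z` and all numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def one_sample_z(sample_mean, population_mean, population_sd, n):
    """z statistic: distance of the sample mean from the population mean in standard errors."""
    standard_error = population_sd / sqrt(n)
    return (sample_mean - population_mean) / standard_error

# Hypothetical study: 36 observations with mean 52, tested against a population mean of 50.
z = one_sample_z(sample_mean=52, population_mean=50, population_sd=6, n=36)

# Critical value for a two-tailed test at the 5% significance level.
critical = NormalDist().inv_cdf(0.975)
reject_null = abs(z) > critical

print(z, round(critical, 2), reject_null)
```

With fewer than about 30 observations, or an unknown population standard deviation, a t-test would normally be used instead.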
A Chi-Square test can be used if the data is qualitative (categorical) rather than quantitative.
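As an illustration, the sketch below computes the chi-square statistic for a table of counts by comparing each observed count with the count expected if the rows and columns were independent. The function name `chi_square` and the table are hypothetical.

```python
def chi_square(table):
    """Return the chi-square statistic and degrees of freedom for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected

    degrees_of_freedom = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, degrees_of_freedom

# Hypothetical counts: two groups (rows) by two answer categories (columns).
observed = [[30, 10],
            [20, 40]]
statistic, dof = chi_square(observed)
print(round(statistic, 2), dof)
```

Turning the statistic into a p-value requires the chi-square distribution, which is available in tables or in a statistics library.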
Comparing More Than Two Groups
An ANOVA, or Analysis of Variance, is used to test whether the means of groups differ. It does so by comparing the variability between groups with the variability within groups. Analysis of Variance can also be applied to more than two groups. The F-distribution can be used to calculate p-values for the ANOVA.
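The F statistic of a one-way ANOVA can be sketched as the ratio of between-group to within-group variability. The function name `one_way_anova_f` and the three groups are hypothetical.

```python
from statistics import fmean

def one_way_anova_f(groups):
    """F statistic: mean square between groups divided by mean square within groups."""
    all_values = [v for group in groups for v in group]
    grand_mean = fmean(all_values)

    # Variability of the group means around the grand mean.
    ss_between = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups)
    # Variability of the observations around their own group mean.
    ss_within = sum((v - fmean(g)) ** 2 for g in groups for v in g)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Hypothetical measurements for three groups.
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_statistic = one_way_anova_f(groups)
print(round(f_statistic, 2))
```

A large F value means the group means differ more than the spread inside the groups would explain; the p-value then comes from the F-distribution with the two degrees of freedom used above.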