Null Hypothesis & Alternative Hypothesis
When looking at 2 or more groups that differ based on a treatment or risk factor, there are two possibilities:
- Null Hypothesis (Ho) = no difference between the groups. The different groups are the same with regard to what is being studied. There is no relationship between the risk factor/treatment and occurrence of the health outcome. By default you assume the null hypothesis is valid until you have enough evidence to support rejecting this hypothesis.
- Alternative Hypothesis (Ha) = there is a difference between groups. The groups are different with regard to what is being studied. There is a relationship between the risk factor/treatment and occurrence of the health outcome.
Obviously, the researcher wants the alternative hypothesis to be true. If the alternative hypothesis is true it means they discovered a treatment that improves patient outcomes or identified a risk factor that is important in the development of a health outcome. However, you never prove the alternative hypothesis is true. You can only reject a hypothesis (say it is false) or fail to reject a hypothesis (could be true but you can never be totally sure). So a researcher really wants to reject the null hypothesis, because that is as close as they can get to proving the alternative hypothesis is true. In other words you can’t prove a given treatment caused a change in outcomes, but you can support that conclusion by showing that your data would be highly improbable if the opposite hypothesis (the null hypothesis) were true.
Type 1 and Type 2 Error
Anytime you reject or fail to reject a hypothesis there is a chance you made a mistake. You may have rejected a hypothesis that is true or failed to reject a hypothesis that is false.
- Type 1 Error = incorrectly rejecting the null hypothesis. Researcher says there is a difference between the groups when there really isn’t. It can be thought of as a false positive study result. Type 1 Error is related to p-Value and alpha. You can remember this by thinking that α is the first letter of the Greek alphabet.
- Type 2 Error = failing to reject the null hypothesis when you should have rejected it. Researcher says there is no difference between the groups when there is a difference. It can be thought of as a false negative study result. The probability of making a Type 2 Error is called beta. You can remember this by thinking that β is the second letter of the Greek alphabet.
Usually we focus on the null hypothesis and type 1 error, because the researchers want to show a difference between groups. Any bias, intentional or unintentional, is more likely to exaggerate the differences between groups because of this desire.
Power & Beta
Power is the probability of finding a difference between groups if one truly exists. It is the percentage chance that you will be able to reject the null hypothesis if it is really false. Power can also be thought of as the probability of not making a type 2 error. In equation form, Power equals 1 minus beta.
Where power comes into play most often is while the study is being designed. Before you even start the study you may do power calculations based on projections. That way you can tweak the design of the study before you start it and potentially avoid performing an entire study that has really low power, since you are unlikely to learn anything from it.
Power increases as you increase sample size, because you have more data from which to make a conclusion. Power also increases as the effect size or actual difference between the groups increases. If you are trying to detect a huge difference between groups it is a lot easier than detecting a very small difference between groups. Increasing the precision (or decreasing standard deviation) of your results also increases power. If all of the results you have are very similar it is easier to come to a conclusion than if your results are all over the place.
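The three levers described above (sample size, effect size and standard deviation) can be seen in a rough power calculation. The sketch below is only an illustration, not a study design tool: it assumes a two-sided, two-sample z-test with equal group sizes and a known standard deviation, and the function name `power_two_sample` and all of the numbers are made up for the example.

```python
from statistics import NormalDist

def power_two_sample(effect, sd, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test (equal group sizes, known SD)."""
    z = NormalDist()
    se = sd * (2 / n_per_group) ** 0.5   # standard error of the difference in means
    z_crit = z.inv_cdf(1 - alpha / 2)    # cutoff the test statistic must exceed to reject
    shift = effect / se                  # true difference measured in standard errors
    # Chance the test statistic falls beyond either cutoff when the effect is real
    return (1 - z.cdf(z_crit - shift)) + z.cdf(-z_crit - shift)

# Hypothetical blood pressure study: 5 mm Hg true difference, SD of 15, 50 patients per group
baseline = power_two_sample(effect=5, sd=15, n_per_group=50)
more_patients = power_two_sample(effect=5, sd=15, n_per_group=200)  # larger sample
bigger_effect = power_two_sample(effect=10, sd=15, n_per_group=50)  # larger true difference
less_spread = power_two_sample(effect=5, sd=8, n_per_group=50)      # smaller standard deviation

beta = 1 - baseline  # probability of a type 2 error in the baseline design

for label, value in [("baseline", baseline), ("more patients", more_patients),
                     ("bigger effect", bigger_effect), ("less spread", less_spread)]:
    print(f"{label}: power = {value:.2f}")
```

Each of the three changes raises power above the baseline, which is why these are the usual things to adjust while a study is still being designed.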
The p-value is the probability of obtaining a result at least as extreme as the current one, assuming that the null hypothesis is true. Imagine we did a study comparing a placebo group to a group that received a new blood pressure medication and the mean blood pressure in the treatment group was 20 mm Hg lower than the placebo group. The p-value is the probability that, if the null hypothesis is correct and we repeated the study, the observed difference between the group averages would be at least 20 mm Hg.
Now you have probably picked up on the fact that I keep adding the caveat that this definition of the p-value only holds true if the null hypothesis is correct (AKA if there is no real difference between the groups). However, don’t let that throw you off. You just assume this is the case in order to perform this test because we have to start from somewhere. It is not as if you have to prove the null hypothesis is true before you utilize the p-value.
The p-value is a measurement that tells us how much the observed data disagrees with the null hypothesis. When the p-value is very small there is more disagreement between our data and the null hypothesis and we can begin to consider rejecting the null hypothesis (AKA saying there is a real difference between the groups being studied). In other words, when the p-value is very small our data would be surprising if the groups being studied were the same. Therefore, when the p-value is very low we treat our data as incompatible with the null hypothesis and we will reject the null hypothesis. When the p-value is high there is less disagreement between our data and the null hypothesis. In other words, when the p-value is high our data is about what we would expect if the groups being studied were the same. In this scenario we will likely fail to reject the null hypothesis.
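One way to make "at least as extreme, assuming the null hypothesis is true" concrete is a permutation test. If the treatment does nothing, the group labels are interchangeable, so you can shuffle the labels many times and count how often the shuffled difference is as large as the one you observed. The sketch below is an illustration under stated assumptions: the blood pressure readings are invented, the test is two-sided, and `permutation_p_value` is a hypothetical helper name.

```python
import random
from statistics import fmean

def permutation_p_value(group_a, group_b, n_shuffles=10_000, seed=0):
    """Estimate a two-sided p-value for the difference in means by shuffling group labels."""
    rng = random.Random(seed)
    observed = fmean(group_a) - fmean(group_b)
    pooled = list(group_a) + list(group_b)  # under the null, labels are interchangeable
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)
        diff = fmean(pooled[:n_a]) - fmean(pooled[n_a:])
        if abs(diff) >= abs(observed):
            at_least_as_extreme += 1
    # Adding 1 to both counts keeps the estimate from ever being exactly zero
    return observed, (at_least_as_extreme + 1) / (n_shuffles + 1)

# Invented systolic readings (mm Hg); the group means differ by 20
placebo = [152, 148, 160, 155, 149, 158, 151, 157]
treated = [131, 140, 128, 137, 133, 139, 130, 132]
observed_diff, p_treated = permutation_p_value(placebo, treated)

# Invented readings for two groups that look nearly identical
group_1 = [150, 142, 158, 147, 153, 160, 145, 155]
group_2 = [151, 144, 157, 149, 152, 159, 146, 154]
_, p_similar = permutation_p_value(group_1, group_2)

print(f"difference of {observed_diff:.1f} mm Hg, p = {p_treated:.4f}")
print(f"nearly identical groups, p = {p_similar:.2f}")
```

The first comparison gives a very small p-value because almost no shuffle reproduces a 20 mm Hg gap, while the second gives a large one because most shuffles look at least as different as the real groups.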
Using Alpha (α) to Determine Statistical Significance
You may be wondering what determines whether a p-value is “low” or “high.” That is where the selected “Level of Significance” or Alpha (α) comes in. Alpha is the probability of making a Type 1 Error (incorrectly rejecting the null hypothesis) that we are willing to accept. It is a selected cutoff point that determines whether we consider a p-value acceptably high or low. If our p-value is lower than alpha we conclude that there is a statistically significant difference between groups. When the p-value is higher than our significance level we conclude that the observed difference between groups is not statistically significant.
Alpha is arbitrarily defined. A 5% (0.05) level of significance is most commonly used in medicine based only on the consensus of researchers. Using a 5% alpha implies that having a 5% probability of incorrectly rejecting the null hypothesis is acceptable. When a higher or lower risk of a false positive is acceptable, other alphas such as 10% or 1% are used.
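To see what "a 5% probability of incorrectly rejecting the null hypothesis" looks like in practice, you can simulate many studies in which the null hypothesis really is true and count how often the p-value falls below alpha anyway. This is a minimal sketch under assumptions: both groups are drawn from the same normal distribution, a z-test with known standard deviation is used for simplicity, and the function name and all numbers are hypothetical.

```python
import random
from statistics import NormalDist, fmean

def false_positive_rate(alpha, n_studies=2000, n_per_group=30, sd=10.0, seed=1):
    """Fraction of simulated null studies (no real difference) that reach p < alpha."""
    rng = random.Random(seed)
    z = NormalDist()
    se = sd * (2 / n_per_group) ** 0.5  # standard error of the difference in means
    rejections = 0
    for _ in range(n_studies):
        # Both groups come from the same population, so the null hypothesis is true
        group_a = [rng.gauss(120, sd) for _ in range(n_per_group)]
        group_b = [rng.gauss(120, sd) for _ in range(n_per_group)]
        z_stat = (fmean(group_a) - fmean(group_b)) / se
        p_value = 2 * (1 - z.cdf(abs(z_stat)))  # two-sided p-value
        if p_value < alpha:
            rejections += 1  # a type 1 error: "significant" with no real difference
    return rejections / n_studies

rate_05 = false_positive_rate(alpha=0.05)
rate_01 = false_positive_rate(alpha=0.01)
print(f"alpha = 0.05: {rate_05:.3f} of null studies were 'significant'")
print(f"alpha = 0.01: {rate_01:.3f} of null studies were 'significant'")
```

The false positive rate tracks whatever alpha you pick, so choosing a stricter alpha lowers the chance of a type 1 error.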
Misconceptions About p-Value & Alpha
Statistical significance is not the same thing as clinical significance. Clinical Significance is the practical importance of the finding. There may be a statistically significant difference between 2 drugs, but the difference is so small that using one over the other is not a big deal. For example, you might show a new blood pressure medication is a statistically significant improvement over an older drug, but if the new drug only lowers blood pressure on average by 1 more mm Hg it won’t have a meaningful impact on the outcomes that are important to patients.
It is also often incorrectly stated (by students, researchers, review books etc.) that “p-Value is the probability that the observed difference between groups is due to chance (random sampling error).” In other words, “if my p-Value is less than alpha then there is less than a 5% probability that the null hypothesis is true.” While this may be easier to understand and perhaps may even be enough of an understanding to get test questions right it is a misinterpretation of p-value. For a number of reasons p-Value is a tool that can only help us determine the observed data’s level of agreement or disagreement with the null hypothesis and cannot necessarily be used for a bigger picture discussion about whether our results were caused by random error. The p-Value alone cannot answer these larger questions. In order to make larger conclusions about research results you need to also consider additional factors such as the design of the study and the results of other studies on similar topics. It is possible for a study to have a p-value of less than 0.05, but also be poorly designed and/or disagree with all of the available research on the topic. Statistics cannot be viewed in a vacuum when attempting to make conclusions and the results of a single study can only cast doubt on the null hypothesis if the assumptions made during the design of the study are true.
A simple way to illustrate this is to remember that by definition the p-value is calculated using the assumption that the null hypothesis is correct. Therefore, there is no way that the p-Value can be used to prove that the alternative hypothesis is true.
Another way to show the pitfalls of blindly applying p-Value is to imagine a situation where a researcher flips a coin 5 times and gets 5 heads in a row. If you performed a one-tailed test you would get a p-value of about 0.03 (0.5 multiplied by itself 5 times is 0.03125). Using the standard alpha of 0.05 this result would be deemed statistically significant and we would reject the null hypothesis. Based solely on this data our conclusion would be that there is at least a 95% chance on subsequent flips of the coin that heads will show up significantly more often than tails. However, we know this conclusion is incorrect, because the study’s sample size was too small and there is plenty of external data to suggest that coins are fair (given enough flips of the coin you will get heads about 50% of the time and tails about 50% of the time). In actuality the chance of the null hypothesis being true is not 3% like we calculated, but is very close to 100%.
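The coin flip p-value above is easy to check by hand or in code. The sketch below computes a one-tailed binomial p-value, assuming a fair coin under the null hypothesis; `one_tailed_binomial_p` is a hypothetical helper name.

```python
from math import comb

def one_tailed_binomial_p(heads, flips, p_heads=0.5):
    """P(at least `heads` heads in `flips` flips) if the coin lands heads with probability p_heads."""
    return sum(
        comb(flips, k) * p_heads**k * (1 - p_heads) ** (flips - k)
        for k in range(heads, flips + 1)
    )

alpha = 0.05
p_five_heads = one_tailed_binomial_p(heads=5, flips=5)  # 0.5**5 = 0.03125
print(f"p = {p_five_heads}, statistically significant: {p_five_heads < alpha}")
```

The result clears the 0.05 cutoff, yet it rests on only 5 flips, which is exactly why the p-value cannot be read on its own as the probability that the coin is fair.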
Statistical Hypothesis Tests:
Statistical hypothesis testing is how we test the null hypothesis. For the USMLE Step 1 Medical Board Exam all you need to know is when to use the different tests. You don’t need to know how to actually perform them.
Continuous (numerical) values:
- T Test = compares the means of 2 sets of numerical values
- ANOVA (Analysis of Variance) = compares the means of 3 or more sets of numerical values
Categorical (disease vs. no disease, exposed vs. not exposed) Values:
- Chi-Squared = compares the percentage of categorical data for 2 or more groups
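The selection rules in the lists above can be written down as a small lookup. This is just a memory aid built from those three bullets; the function name `choose_test` and the labels "continuous" and "categorical" are made up for the example.

```python
def choose_test(data_type, n_groups):
    """Pick a test from the kind of data and the number of groups being compared."""
    if n_groups < 2:
        raise ValueError("need at least 2 groups to compare")
    if data_type == "continuous":
        # Means of numerical values: 2 groups vs. 3 or more groups
        return "T Test" if n_groups == 2 else "ANOVA"
    if data_type == "categorical":
        # Percentages of categorical data for 2 or more groups
        return "Chi-Squared"
    raise ValueError("data_type must be 'continuous' or 'categorical'")

print(choose_test("continuous", 2))   # mean blood pressure, drug vs. placebo
print(choose_test("continuous", 3))   # mean blood pressure across 3 drugs
print(choose_test("categorical", 2))  # disease vs. no disease, exposed vs. not exposed
```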