There is a lot of confusion around the topic of Hypothesis Testing. A standard undergraduate class on this topic involves terms such as “null hypothesis”, “alternative hypothesis”, “power of the test”, “statistically significant”, “p<0.05”, etc., which can be misleading or misunderstood. Most of this is due to the fact that the Neyman-Pearson paradigm of Hypothesis Testing has been adopted in science and engineering as the standard mindset whenever data-based decisions have to be made.
This has to change. The use of this paradigm without justification has led to a lot of confusion not only in students but in practitioners. Why should scientific findings have to rely on “p<0.05”? What does “statistically significant” really mean? Do we need to accept the “null hypothesis”? The relevance of these questions led the American Statistical Association to publish (at least) two papers giving statements and best practices of Hypothesis Testing: 1) The ASA Statement on p-Values: Context, Process, and Purpose, and 2) Moving to a World Beyond “p < 0.05”.
In what follows, I’ll present a straightforward definition of a hypothesis test and p-values which will be reinforced with a simple example. After that, there will be a short review of the three major paradigms of Hypothesis Testing.
A very simple approach
Let me put the definition of a hypothesis test in a nutshell:
A hypothesis test is proof by contradiction with data.
What we have is a statement, or hypothesis, about the world (every time it is cloudy, it rains afterward) and data to contrast, or test, this hypothesis against (I’ve seen cloudy days without rain). With that, we can decide whether the statement is supported by the data (it does not rain every time it is cloudy).
Observe that the data is under no obligation to contradict the hypothesis (I haven’t seen a cloudy day without rain), but that doesn’t lead us to accept the hypothesis as true (I have to keep watching the sky; maybe there will be a cloudy day without rain in the future).
As you can see, a hypothesis test is as natural as the process of thinking: observe, hypothesize, and test your beliefs with data.
Unfortunately, there is no consensus on the formal mathematical definition of Hypothesis Testing. That is perhaps why the Neyman-Pearson paradigm has been taken as the default procedure: it has formal definitions and theorems (see, for example, the famous Neyman-Pearson Lemma).
Instead of following a formal definition, we can understand Hypothesis Testing by explaining its elements, as is done in “Statistical Inference in Science’’ by D.A. Sprott. As it’s mentioned in Chapter 6 of this book, the purpose of a hypothesis test, or test of significance, is to measure the strength of the evidence provided by the experimental data against a hypothesis H. The two ingredients for performing a test are:
- A non-negative discrepancy measure or statistic, that we denote by T.
- Assuming that H is true, the probability that T is greater than or equal to some interesting value t. This can be calculated exactly if we have the distribution of T under the hypothesis H; when that distribution is hard to derive, we can approximate this probability instead (for example, by simulation).
Note: In the second point, T is a theoretical random variable, and t is the value observed from T in one realization of an experiment.
The probability mentioned in the second point, that is,

P(T ≥ t | H is true),

is famously called the p-value of the test.
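When the distribution of T under H is not available in closed form, the p-value can be approximated by simulation, as mentioned above. Here is a minimal sketch using only the standard library; the toy discrepancy measure (distance of the average of 30 fair-die rolls from the expected value 3.5) is my own illustration, not from Sprott’s book:

```python
import random

def approx_p_value(simulate_t_under_h, t_observed, n_sims=100_000, seed=0):
    """Monte Carlo approximation of P(T >= t | H)."""
    rng = random.Random(seed)
    hits = sum(simulate_t_under_h(rng) >= t_observed for _ in range(n_sims))
    return hits / n_sims

# Toy example. H: a die is fair. Discrepancy: distance between the
# average of 30 rolls and 3.5, the expected value of one roll under H.
def simulate_t_under_h(rng):
    rolls = [rng.randint(1, 6) for _ in range(30)]
    return abs(sum(rolls) / 30 - 3.5)

p_value = approx_p_value(simulate_t_under_h, t_observed=0.5)
print(p_value)
```

The only ingredients are exactly the two listed above: a way to compute the discrepancy, and a way to simulate it assuming H is true.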
The idea here is to construct a statistic T, which is a function of the data collected in the experiment, in such a way that greater (positive) values represent a discrepancy with the hypothesis.
Moreover, if we know the distribution of T under the hypothesis of interest H, and t is the observed value of T in the experiment, then a p-value close to 0 indicates that the event “T greater than or equal to t” is very improbable under H.
Let me state a straightforward example. Imagine that we have a coin and we want to determine if it is fair, that is, we want to test whether p = ½, where p is the probability that the coin comes up heads whenever we flip it. In a nutshell, we want to test the hypothesis H: p = ½.
Let X be the result of one flip, so X = 1 if it comes up heads and X = 0 if it comes up tails. Denote by Y the number of heads in 10 independent flips, that is, the sum of the results:

Y = X1 + X2 + … + X10.
Assuming that H is true, after the 10 trials we expect the value of Y to be around 5 heads. So if we observe a value far from 5, we have some evidence that p is not equal to ½, i.e., H is not true, or more correctly, the data doesn’t support the hypothesis H. Therefore, we define the discrepancy measure T as

T = |Y − 5|.
Remember that T is a theoretical random variable, and t is the observed value of T in the experiment. Assuming that H is true, Y follows a Binomial(10, ½) distribution, so it is easy to show that, for t > 0,

P(T ≥ t | H) = P(Y ≥ 5 + t) + P(Y ≤ 5 − t).
Note that T is between 0 (when Y = 5) and 5 (when Y = 0 or Y = 10), so we can only evaluate values of t inside the interval [0, 5].
Then, if the result of the experiment is a sequence with 8 heads and 2 tails, we get that y = 8, so t = |8 − 5| = 3, and the p-value is

P(T ≥ 3 | H) = P(Y ≥ 8) + P(Y ≤ 2) = 112/1024 ≈ 0.109.
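This calculation can be reproduced with a few lines of Python using only the standard library (assuming, as an illustration, an observed count of 8 heads, so t = |8 − 5| = 3):

```python
from math import comb

n = 10                                                # number of flips
probs = {y: comb(n, y) / 2**n for y in range(n + 1)}  # P(Y = y) under H

def p_value(t):
    """P(T >= t | H), where T = |Y - 5| and Y ~ Binomial(10, 1/2)."""
    return sum(p for y, p in probs.items() if abs(y - 5) >= t)

y_observed = 8
t = abs(y_observed - 5)   # t = 3
print(t, p_value(t))      # prints: 3 0.109375
```

Note that the p-value here is just a sum of binomial probabilities over the outcomes at least as discrepant as the one observed.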
Since this p-value can be considered low, we have some evidence that the hypothesis H: p = ½ is not well supported by the data.
Observe that we are not “rejecting” or even worse “accepting” the hypothesis H, we only developed a methodology supported by a probability distribution to measure how much the data is aligned with hypothesis H.
In this sense, we should see a hypothesis testing procedure as the construction of a thermometer (the p-value) that gives us insight into the plausibility of a scientific hypothesis. It is up to the judgment of the person performing the test (or of anyone who wants to use the results) to decide whether the hypothesis can be considered true or false for future applications.
The three paradigms
In Ronald Christensen’s paper “Testing Fisher, Neyman, Pearson, and Bayes”, the author presents the three major paradigms of Hypothesis Testing: Fisher, Neyman-Pearson, and Bayes.
The approach presented in the previous section is the Fisher or Fisherian paradigm, developed by Ronald Fisher at the beginning of the 20th century.
The objective of this article is not to argue that the Fisher paradigm is better than the other two, but rather to emphasize that there is more than one procedure for performing a statistical test, and that the person performing it has to decide which procedure is best suited for the task at hand.
The reason for highlighting the Fisher paradigm is to show that Hypothesis Testing is often misunderstood and that the concept is actually simpler than it is usually presented.
To keep things short, here are the common scenarios in which it is more appropriate to use each paradigm:
Fisher. Use it whenever you have a scientific or business-related hypothesis for which there is no specific action tied to the conclusion of the testing procedure.
An example of this situation is the one presented previously. In that case, the conclusion about the fairness of the coin is a matter of scientific curiosity and is not tied to any action that has to be taken afterward.
Neyman-Pearson. This is the most famous approach and is the one that defines the concepts of “null hypothesis”, “alternative hypothesis”, “significance level”, “power of the test”, etc. Use the Neyman-Pearson paradigm when you have to make a decision and take action from the conclusion of the test. This is why we have the concept of “null” and “alternative”, since we need to decide between two competing hypotheses.
Also, the concept of “significance level” emerges because you can stipulate that one kind of mistake is worse than another. These mistakes are called type I and type II errors: rejecting the null hypothesis when it is true (a false positive) and accepting the null hypothesis when it is false (a false negative), respectively.
An example of this scenario is a manufacturing plant in which you have to decide whether the quality of the products is within the desired standards. If the products are good, you continue with the process; otherwise, you stop production. You need to stipulate what is worse here: stopping the process when production is good, or continuing when production is not good enough.
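A hypothetical sketch of the Neyman-Pearson mindset for this kind of quality check (the sample size, defect rates, and significance level below are illustrative, not from the article): inspect n = 20 items, let D be the number of defectives, and stop production when D reaches a threshold k chosen so that the type I error (stopping a good process, one with a 5% defect rate) is at most α = 0.05.

```python
from math import comb

def binom_tail(n, p, k):
    """P(D >= k) for D ~ Binomial(n, p)."""
    return sum(comb(n, d) * p**d * (1 - p)**(n - d) for d in range(k, n + 1))

n, p_good, alpha = 20, 0.05, 0.05

# Smallest threshold k with P(stop | good process) <= alpha.
k = next(k for k in range(n + 1) if binom_tail(n, p_good, k) <= alpha)
print("stop production when defectives >=", k)

# Power of the test: probability of (correctly) stopping
# when the true defect rate has drifted up to 20%.
power = binom_tail(n, 0.20, k)
print("power at 20% defect rate:", round(power, 3))
```

Unlike the Fisher approach, the output here is a decision rule fixed in advance, not a graded measure of evidence: every batch either triggers a stop or it doesn’t.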
Bayes. This approach is fundamentally different from the previous two in the sense that you use Bayesian Statistics to calculate the probability of a hypothesis being true. This is a paradigm shift not just in the hypothesis procedure but in the conceptualization of the statistics involved since you are assuming that the parameters of interest are themselves random variables. With that assumption, you can obtain a probability for the hypothesis rather than a p-value.
The Bayes approach has more interpretability and can be used whether or not you have to take action on a decision. When comparing two hypotheses, you simply obtain two probabilities and choose the hypothesis with the higher probability.
The only two requirements for applying the Bayes procedure are 1) coming up with a prior distribution for the parameters, and 2) obtaining (or simulating from) the posterior distribution. These two steps are not always straightforward, but once they are done, the whole analysis is easier to interpret.
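For the coin example, a conjugate Beta prior makes both steps explicit. A minimal sketch (the flat Beta(1, 1) prior and the illustrative data of 8 heads in 10 flips are my choices): the posterior is Beta(1 + 8, 1 + 2) = Beta(9, 3), and posterior probabilities can be approximated by sampling with the standard library alone.

```python
import random

rng = random.Random(42)

# Prior Beta(1, 1); data: 8 heads, 2 tails -> posterior Beta(1 + 8, 1 + 2).
a_post, b_post = 1 + 8, 1 + 2

# Draw from the posterior distribution of p.
samples = [rng.betavariate(a_post, b_post) for _ in range(100_000)]

# Posterior probability that the coin is biased toward heads (p > 1/2).
prob_biased = sum(s > 0.5 for s in samples) / len(samples)
print(round(prob_biased, 2))
```

Note the shift in interpretation: instead of a p-value, we get a direct probability statement about the hypothesis itself, which is exactly the paradigm change described above.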
Testing a hypothesis is like performing a proof by contradiction. You establish a hypothesis and obtain data to see if they are consistent with it. If you observe something unusual that contradicts the hypothesis, you have evidence that it is false. Conversely, if the data is consistent with the hypothesis, you don’t accept it right away, but you can conclude that there is no evidence of the falseness of the hypothesis.
To perform a testing procedure you need a test statistic and its distribution under the hypothesis of interest. With that, you can calculate a p-value, which is just a measure of the consistency of the data with the hypothesis.
There are three major paradigms of hypothesis testing: Fisher, Neyman-Pearson, and Bayes. You use each of them depending on the situation at hand and the information that you have about the problem.