Addressing researcher degrees of freedom through minP adjustment | BMC Medical Research Methodology


Researcher degrees of freedom as a multiple testing problem

In the remainder of this paper, we will focus on analyses that consist of statistical tests. We consider a researcher investigating a—possibly vaguely defined—research hypothesis such as “paO2 has an impact on post-operative complications”, as opposed to the null- and alternative hypotheses of a formal statistical test, which are precisely formulated in mathematical terms. From now on, we assume that the research hypothesis the researcher wants to establish corresponds to the formal alternative hypothesis of the performed tests.

In this context, the term “analysis strategy” refers to all steps performed prior to applying the statistical test as well as to the features of the test itself. The following aspects can be seen as referring to preprocessing uncertainty in the terminology of Hoffmann et al. [8]: transformation of continuous variables, handling of outliers and missing values, or merging of categories. Aspects related to the test itself refer to model and method uncertainty in the terminology of Hoffmann et al. [8]. They include, for example, the statistical model underlying the test, the formal null-hypothesis under consideration, or the test (variant) used to test this null-hypothesis.

In the context of testing, an analysis strategy can be viewed as a combination of such choices. Obviously, different analysis strategies will likely yield different p-values and possibly different test decisions (reject the null-hypothesis or not). Applying different analysis strategies successively to address the same research question amounts to performing multiple tests. From now on, we denote by m the number of analysis strategies considered by a researcher. The null-hypotheses tested through each of the m analyses are denoted as \(H_{0}^{i}\), \(i=1,\dots ,m\).

These null-hypotheses and the associated alternative hypotheses can be seen as—possibly different—mathematical formalizations of the vaguely defined research hypothesis—“paO2 has an impact on post-operative complications” in our example. One may decide to formalize this research hypothesis as “\(H_0:\) the mean paO2 is equal in the groups with and without post-operative complications versus \(H_1:\) the mean paO2 is not equal in these two groups”. But it would also be possible to formalize it as “\(H_0\): the post-operative complication rates are equal for patients with \(paO2 < 200\)mmHg and those with \(paO2\ge 200\)mmHg” versus “\(H_1:\) the post-operative complication rates are not equal for patients with \(paO2 < 200\)mmHg and those with \(paO2\ge 200\)mmHg”. Analysis strategies may thus differ in the exact definition of the considered null- and alternative hypotheses.

They may, however, also differ in other aspects, some of which were mentioned above (for example the handling of missing values or outliers). If two analysis strategies \(i_1\) and \(i_2\) (with \(1\le i_1 < i_2\le m\)) consider exactly the same null-hypothesis, we have \(H_0^{i_1}=H_0^{i_2}\). Of course, it may also happen that the research hypothesis is not vaguely defined but already formulated mathematically as null- and alternative hypotheses, and that the m analysis strategies thus only differ in other aspects such as the handling of missing values or outliers. In this case the m null-hypotheses would all be identical.

Regardless of whether the hypotheses \(H_0^{i}\) (\(i=1,\dots ,m\)) are (partly) distinct or all identical, a typical researcher who exploits these degrees of freedom by “fishing for significance” performs the m testing analyses successively. They hope that at least one of them will yield a significant result, i.e. that the smallest p-value, denoted as \(p_{(1)}\), is smaller than the significance level \(\alpha\). If it is, they typically report it as convincing evidence in favor of their vaguely defined research hypothesis. It must be noted that in this hypothetical setting the researcher is not interested in identifying the “best” model or analysis strategy but only in reporting the lowest p-value that supports the hypothesis at hand.
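
To make this scenario concrete, the following Python sketch mimics such a researcher: it applies a handful of analysis strategies to the same data set and reports only the smallest p-value. The data, the variable names (pao2, complication) and the particular set of strategies are purely illustrative assumptions, not choices made in the original study.

```python
# Minimal sketch of "fishing for significance": apply m analysis strategies
# to the same (hypothetical) data and report only the smallest p-value.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical data: paO2 values and a binary complication indicator,
# simulated here under the global null-hypothesis (no association).
pao2 = rng.normal(200, 40, size=100)
complication = rng.integers(0, 2, size=100)

def t_test(x, y):                 # strategy 1: compare means
    return stats.ttest_ind(x[y == 0], x[y == 1]).pvalue

def rank_test(x, y):              # strategy 2: rank-based comparison
    return stats.mannwhitneyu(x[y == 0], x[y == 1], alternative="two-sided").pvalue

def t_test_trimmed(x, y):         # strategy 3: discard outliers, then compare means
    keep = np.abs(x - x.mean()) < 2 * x.std()
    return stats.ttest_ind(x[keep & (y == 0)], x[keep & (y == 1)]).pvalue

def chi2_dichotomized(x, y):      # strategy 4: dichotomize paO2 at 200 mmHg
    table = [[np.sum((x < 200) & (y == g)), np.sum((x >= 200) & (y == g))]
             for g in (0, 1)]
    return stats.chi2_contingency(table)[1]   # p-value

strategies = [t_test, rank_test, t_test_trimmed, chi2_dichotomized]
p_values = [s(pao2, complication) for s in strategies]   # m = 4 tests
print("p-values:", np.round(p_values, 3))
print("reported p_(1):", min(p_values))
```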

Considering this scenario from the perspective of multiple testing, it is clear that the probability of thereby making at least one type 1 error, known as the family-wise error rate (FWER), is possibly strongly inflated. In particular, even if all tested null-hypotheses are true, the probability that the smallest p-value \(p_{(1)}\) is smaller than \(\alpha\) exceeds \(\alpha\); and this is precisely the result researchers engaged in fishing for significance will report. This problem can be seen as one of the explanations as to why the proportion of false positive test results among published results is substantially larger than the nominal significance level of the performed tests [5].
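
As a rough illustration (under the simplifying assumption, not made in this paper, that the m p-values are independent and uniformly distributed under the global null-hypothesis), the probability that at least one of the m tests is significant at level \(\alpha\) is

$$\begin{aligned} P\left( p_{(1)} \le \alpha \mid \cap _{i=1}^mH_0^{i}\right) = 1-(1-\alpha )^m, \end{aligned}$$

which already exceeds 0.40 for \(m=10\) and \(\alpha =0.05\); dependence between the analysis strategies reduces this inflation but does not remove it.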

A related concept that has often been discussed in the context of the replication crisis is “HARKing”, which stands for Hypothesizing After the Results are Known [38]. Researchers engaged in HARKing also perform multiple tests, but to test (potentially strongly) different hypotheses rather than several variants of a common vaguely defined hypothesis. While related to the concept of researcher degrees of freedom, HARKing is fundamentally different in that the rejection of these different null-hypotheses would have different (scientific, practical, organizational) consequences. In the remainder of this article, we consider sets of hypotheses that can be seen as variants of a single vaguely defined hypothesis, whose rejections would have the same consequences in a broad sense.

Controlling the Family-Wise Error Rate (FWER)

Following the formalization of researcher degrees of freedom as a multiple testing situation, we now consider the problem of adjusting for multiple testing in order to control the FWER. More precisely, we want to control the probability \(P(\text {Reject at least one true}\, H_0^{i})\) of making at least one type 1 error when testing \(H_0^{1},\dots ,H_0^{m}\), i.e. the FWER.

In fact, we primarily want to control the FWER in the case where all null-hypotheses are true. Imagine a case where some of the null-hypotheses are false and there is at least one false positive result. On the one hand, if \(p_{(1)}\) is not among the falsely significant p-values, the false positive test result(s) typically do(es) not affect the results ultimately reported by the researchers (who focus on \(p_{(1)}\)). This situation is not problematic.

On the other hand, if \(p_{(1)}\) is falsely significant, \(H_0^{(1)}\) is wrongly rejected, and strictly speaking a false positive result (“\(p_{(1)} < \alpha\)”) is reported. However, some of the \(m-1\) remaining null-hypotheses, which are closely related to \(H_0^{(1)}\) (because they formalize the same vaguely defined research hypothesis), are false. Thus, rejecting \(H_0^{(1)}\) is not fundamentally misleading in terms of the vaguely defined research hypothesis. As assumed at the end of the Researcher degrees of freedom as a multiple testing problem section, the rejection of \(H_0^{(1)}\) has the same consequences as the rejection of the hypotheses that are actually false.

For example, in a two-group setting when studying a biomarker B, we may consider the null-hypotheses “\(H_0^{1}\): the mean of B is the same in the two groups” and “\(H_0^{2}\): the median of B is the same in the two groups”. \(H_0^{1}\) and \(H_0^{2}\) are different, but both of them can be seen as variants of “there is no difference between the two groups with respect to biomarker B”, and rejecting them would have similar consequences in practice (say, further considering biomarker B in future research, or—in a clinical context—being vigilant when observing a high value of B in a patient).

If biomarker B features strong outliers, the result of the two-sample t-test (addressing \(H_0^{1}\)) and the result of the Mann-Whitney test (addressing \(H_0^{2}\)) may differ substantially. However, rejecting \(H_0^{2}\) if it is in fact true and only \(H_0^{1}\) is false would not be dramatic (and vice-versa). This is because, if \(H_0^{1}\) is false, there is a difference between the two groups, even if not in terms of medians. The practical consequences of a rejection of \(H_0^{1}\) and a rejection of \(H_0^{2}\) are typically the same (as opposed to the HARKing scenario).
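
A short simulation can illustrate how strongly the two tests may diverge in the presence of outliers; the group sizes, effect size, and outlier values below are arbitrary assumptions chosen only for illustration.

```python
# Illustration: with a few extreme outliers, the two-sample t-test (means)
# and the Mann-Whitney test (ranks) can return very different p-values.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
group1 = rng.normal(1.0, 1.0, size=40)
group2 = rng.normal(1.5, 1.0, size=40)
group2[:3] = [15.0, 18.0, 20.0]          # strong outliers in group 2 inflate its variance

print("t-test       p =", stats.ttest_ind(group1, group2).pvalue)
print("Mann-Whitney p =", stats.mannwhitneyu(group1, group2, alternative="two-sided").pvalue)
```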

To sum up, in the context of researcher degrees of freedom, false positives have to be avoided primarily in the case where all null-hypotheses are true. In other words, we need to control the probability \(P(\text {Reject at least one true}\, H_0^{i} | \cap _{i=1}^mH_0^{i})\) of obtaining at least one false positive result given that all null-hypotheses are true, i.e. we want to achieve weak control of the FWER. Various adjustment procedures exist to achieve strong or weak control of the FWER; see Dudoit et al. [39] for concise definitions of the most usual ones (including those mentioned in this section).

The best-known and simplest procedure is certainly the Bonferroni procedure. It achieves strong control of the FWER, i.e. it controls \(P(\text {Reject at least one true}\, H_0^{i})\) under any combination of true and false null-hypotheses. This procedure adjusts the significance level to \(\tilde{\alpha } = \alpha /m\); equivalently, it adjusts the p-values \(p_i\) (\(i=1,\dots ,m\)) to \(\tilde{p}_i = \min (mp_i,1)\). However, the Bonferroni procedure is known to yield low power for rejecting false null-hypotheses in the case of strong dependence between the tests. The so-called Holm stepwise procedure, which is directly derived from the Bonferroni procedure, has greater power. However, the Holm procedure adjusts the smallest p-value \(p_{(1)}\) to exactly the same value as the Bonferroni procedure. This implies that, if none of the m tests leads to rejection with the Bonferroni procedure, the same holds for the Holm procedure. The latter can thus not be seen as an improvement over Bonferroni in terms of power in our context, where the focus is on the smallest p-value \(p_{(1)}\).
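
For reference, a minimal sketch of both adjustments is given below (hand-coded rather than relying on a particular package, so that the formulas remain visible; the example p-values are made up). It also shows that the smallest adjusted p-value is identical under the two procedures.

```python
# Bonferroni and Holm adjustments for a vector of p-values.
import numpy as np

def bonferroni(pvals):
    p = np.asarray(pvals, dtype=float)
    return np.minimum(len(p) * p, 1.0)               # adjusted p_i = min(m * p_i, 1)

def holm(pvals):
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)                            # step-down: handle smallest p first
    adjusted = np.empty(m)
    running_max = 0.0
    for rank, idx in enumerate(order):
        running_max = max(running_max, (m - rank) * p[idx])
        adjusted[idx] = min(running_max, 1.0)        # enforce monotonicity, cap at 1
    return adjusted

pvals = [0.012, 0.030, 0.041, 0.20]
print("Bonferroni:", bonferroni(pvals))   # smallest adjusted p-value: 4 * 0.012 = 0.048
print("Holm      :", holm(pvals))         # same smallest adjusted p-value; here the others are smaller
```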

The minP-procedure

The permutation-based minP adjustment procedure for multiple testing [9] indirectly takes the dependence between tests into account by considering the distribution of the minimal p-value out of \(p_1,\dots ,p_m\). This increases its power in situations with high dependencies between the tests, and thus makes it a suitable adjustment procedure to be applied in the present context. In the general case it controls the FWER only weakly, but as outlined above we do not view this as a drawback in the present context.

The rest of this section briefly describes the single-step minP adjustment procedure based on the review article by Dudoit et al. [39]. The following description is not specific to the researcher degrees of freedom considered in this paper. However, for simplicity we continue to use the notation (\(p_i\), \(H_0^i\) for \(i=1,\dots ,m\)) already introduced in the Researcher degrees of freedom as a multiple testing problem section.

In the single-step minP procedure, the adjusted p-values \(\tilde{p}_i\), \(i=1,\dots ,m\) are defined as

$$\begin{aligned} \tilde{p}_i = P \left( \underset{1 \le \ell \le m}{\min } P_\ell \le p_i \mid \cap _{i=1}^mH_0^{i}\right) , \end{aligned} \qquad \text {(1)}$$

with \(P_\ell\) being the random variable representing the unadjusted p-value for the \(\ell ^{th}\) null-hypothesis \(H_0^\ell\) [39]. The adjusted p-values are thus defined based on the distribution of the minimal p-value out of \(p_1,\dots ,p_m\), hence the term “minP”. In the context of researcher degrees of freedom considered here, the focus is naturally on \(\tilde{p}_{(1)}= P \left( \min _{1 \le \ell \le m} P_\ell \le p_{(1)} \mid \cap _{i=1}^mH_0^{i}\right)\).

In many practical situations, including the one considered in this paper, the distribution of \(\min _{1 \le \ell \le m} P_\ell\) is unknown. The probability in Eq. (1) thus has to be approximated using permuted versions of the data that mimic the global null-hypothesis \(\cap _{i=1}^mH_0^{i}\). More precisely, the adjusted p-value \(\tilde{p}_i\) is approximated as the proportion of permutations for which the minimal p-value is lower than or equal to the p-value \(p_i\) observed in the original data set. Obviously, the number of permutations has to be large for this proportion to be estimated precisely. In the example described in the Motivating example section, which involves only two variables (paO2 and post-operative complications), permuted data sets are simply obtained by randomly shuffling one of the variables. More complex cases will be discussed in the Discussion section.
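
The following sketch illustrates this permutation-based approximation in the simple two-variable case. The hypothetical data, the number of permutations, and the two analysis strategies used here (a t-test and a Mann-Whitney test) are illustrative assumptions only, not prescriptions from the procedure itself.

```python
# Permutation-based single-step minP adjustment (two-variable case):
# shuffle one variable to mimic the global null-hypothesis, recompute all
# m p-values, and compare the permutation minima with the observed p-values.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
pao2 = rng.normal(200, 40, size=100)             # hypothetical data
complication = rng.integers(0, 2, size=100)

def strategy_pvalues(x, y):
    """m = 2 illustrative analysis strategies applied to the same data."""
    p1 = stats.ttest_ind(x[y == 0], x[y == 1]).pvalue
    p2 = stats.mannwhitneyu(x[y == 0], x[y == 1], alternative="two-sided").pvalue
    return np.array([p1, p2])

observed = strategy_pvalues(pao2, complication)

n_perm = 10_000                                  # must be large for a precise estimate
perm_min = np.empty(n_perm)
for b in range(n_perm):
    shuffled = rng.permutation(complication)     # shuffling breaks any association
    perm_min[b] = strategy_pvalues(pao2, shuffled).min()

# Adjusted p-value (Eq. 1): proportion of permutation minima <= observed p_i.
adjusted = np.array([np.mean(perm_min <= p) for p in observed])
print("unadjusted   :", np.round(observed, 4))
print("minP-adjusted:", np.round(adjusted, 4))
```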


