Contents: Introduction · Conceptualizing Hypothesis Testing via Bayes Factors · Empirical Example 1: Is a Coin Fair or Tail-Biased? · Empirical Example 2: Do Health Warnings for E-cigarettes Increase Worry About Health? · Conclusions · Declaration of Interests
Sabeeh A Baig, Bayesian Inference: An Introduction to Hypothesis Testing Using Bayes Factors, Nicotine & Tobacco Research, Volume 22, Issue 7, July 2020, Pages 1244–1246, https://doi.org/10.1093/ntr/ntz207
Monumental advances in computing power in recent decades have contributed to the rising popularity of Bayesian methods among applied researchers. This series of commentaries seeks to raise awareness among nicotine and tobacco researchers of Bayesian methods for analyzing experimental data. The current commentary introduces statistical inference via Bayes factors and demonstrates how they can be used to present evidence in favor of both alternative and null hypotheses.
Bayesian inference is a fully probabilistic framework for drawing scientific conclusions that resembles how we naturally think about the world. Often, we hold an a priori position on a given issue. On a daily basis, we are confronted with facts about that issue. We regularly update our position in light of those facts. Bayesian inference follows this exact updating process. Formally stated, given a research question, at least one unknown parameter of interest, and some relevant data, Bayesian inference follows three basic steps. The process begins by specifying a prior probability distribution on the unknown parameter that often reflects accumulated knowledge about the research question. Next, the prior distribution is updated by conditioning on the observed data, which are summarized using a likelihood function. Finally, the resulting posterior distribution represents an updated state of knowledge about the unknown parameter and, by extension, the research question. Simulating data many times from the posterior distribution will ideally yield representative samples of the unknown parameter that we can interpret to answer the research question.
In an experimental context, we are often interested in evaluating two competing positions or hypotheses in light of data and making a determination about which to accept. In the context of Bayesian inference, hypothesis testing can be framed as a special case of model comparison where a model refers to a likelihood function and a prior distribution. Given two competing hypotheses and some relevant data, Bayesian hypothesis testing begins by specifying separate prior distributions to quantitatively describe each hypothesis. The combination of the likelihood function for the observed data with each of the prior distributions yields hypothesis-specific models. For each of the hypothesis-specific models, averaging (ie, integrating) the likelihood with respect to the prior distribution across the entire parameter space yields the probability of the data under the model and, therefore, the corresponding hypothesis. This quantity is more commonly referred to as the marginal likelihood and represents the average fit of the model to the data. The ratio of the marginal likelihoods for both hypothesis-specific models is known as the Bayes factor.
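In symbols, writing \(y\) for the data, \(p(y \mid \theta)\) for the likelihood, and \(\pi_0\) and \(\pi_1\) for the hypothesis-specific prior distributions, the Bayes factor is the ratio of the two marginal likelihoods:

\[
\text{BF}_{10} = \frac{p(y \mid H_1)}{p(y \mid H_0)} = \frac{\int p(y \mid \theta)\,\pi_1(\theta)\,d\theta}{\int p(y \mid \theta)\,\pi_0(\theta)\,d\theta}.
\]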
The Bayes factor is a central quantity of interest in Bayesian hypothesis testing. A Bayes factor ranges from near 0 to infinity and quantifies the extent to which data support one hypothesis over another. Bayes factors can be interpreted continuously, so that a Bayes factor of 30 indicates that there is 30 times more support in the data for a given hypothesis than for the alternative. They can also be interpreted discretely, so that a Bayes factor of 3 or higher supports accepting a given hypothesis, 0.33 or lower supports accepting its alternative, and values in between are inconclusive. 1,2 Intuitively, the Bayes factor is the ratio of the odds of two competing hypotheses after examining relevant data (the posterior odds) to the odds of those hypotheses before examining the data (the prior odds). Therefore, the Bayes factor represents how we should update our knowledge about the hypotheses after examining data. We present two empirical examples with simulated data to demonstrate the computation and use of Bayes factors to test hypotheses.
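Expressed as an equation, examining the data multiplies the prior odds by the Bayes factor to yield the posterior odds:

\[
\frac{p(H_1 \mid y)}{p(H_0 \mid y)} = \text{BF}_{10} \times \frac{p(H_1)}{p(H_0)}.
\]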
Deciding whether a coin is fair or tail-biased is a simple, but useful example to illustrate hypothesis testing via Bayes factors. Let the null hypothesis be that the coin is fair, and let the alternative hypothesis be that the coin is tail-biased. We further intuit that coins, fair or not, can exhibit a considerable degree of variation in their head-tail biases depending on quality control issues during the minting process. Therefore, we use a Beta(5, 5) prior distribution to describe the null hypothesis. This distribution places the bulk of the probability density at or around 0.5 (ie, equal probability of heads or tails). Similarly, we use a Beta(3.8, 6.2) prior distribution to describe the alternative hypothesis. This skewed distribution places the bulk of the density at or around 0.35 (ie, lower probability of heads) and places less density on values greater than 0.4. The Beta prior is appropriate for describing hypotheses about a coin (and other binary variables) because it is defined continuously on the interval from 0 to 1, the same interval on which the bias of a coin is defined; its hyperparameters can be interpreted as counts of heads and tails; and it offers flexibility in describing hypotheses because it need not be symmetric.
To test these hypotheses, we conduct a simple experiment by flipping the coin 20 times, recording 5 heads and 15 tails. We summarize these data using a binomial likelihood function (5 heads out of 20 flips). After computing the marginal likelihoods of the models for both hypotheses, we find that the Bayes factor comparing the alternative hypothesis to the null is 2.65. This indicates that the data support the alternative hypothesis that the coin is tail-biased over the null hypothesis that it is fair only by a factor of 2 or so. We further note that the Bayes factor falls into the range of inconclusive values. Therefore, we conclude that we need more experimental data to determine whether the coin is fair or tail-biased with greater certainty.
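Because the Beta prior is conjugate to the binomial likelihood, the marginal likelihoods in this example have a closed form, and the Bayes factor can be computed in a few lines of R. The sketch below is illustrative rather than the authors' code, and it may not reproduce the published value of 2.65 exactly, because the result depends on details of the specification (eg, whether the alternative prior is truncated to tail-biased values).

```r
# Marginal likelihood of k heads in n flips under a Beta(a, b) prior:
# p(data) is proportional to B(a + k, b + n - k) / B(a, b); the binomial
# coefficient is common to both models and cancels in the ratio.
log_marg_lik <- function(a, b, k, n) {
  lbeta(a + k, b + n - k) - lbeta(a, b)
}

k <- 5; n <- 20
bf_alt_null <- exp(log_marg_lik(3.8, 6.2, k, n) -  # alternative: tail-biased
                   log_marg_lik(5.0, 5.0, k, n))   # null: (roughly) fair
bf_alt_null
```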
A more pertinent illustrative example of hypothesis testing via Bayes factors is deciding whether health warnings for e-cigarettes increase worry about one’s health. Let the null hypothesis be that health warnings have exactly no effect on worry. Let the first alternative hypothesis be one-sided that health warnings increase worry, and let the second alternative hypothesis also be one-sided that health warnings decrease worry. Bayes factors with the Jeffreys-Zellner-Siow (JZS) default prior can be used to evaluate these hypotheses. 3 In comparison to other priors, default priors have mathematical properties that simplify the computation of Bayes factors. The JZS default prior describes hypotheses in terms of possible effect sizes (ie, Cohen’s d). As such, under the null hypothesis that health warnings have exactly no effect on worry, the prior distribution places the entire density on an effect size of 0 (Figure 1). Given that effect sizes in behavioral research in tobacco control are usually small, 4–6 the prior distributions for the alternative hypotheses use a scale parameter of 1/2 to distribute the density mostly over small positive or negative effect sizes.
Figure 1. Prior distributions quantitatively describing competing hypotheses about the effect of e-cigarette health warnings on worry about one’s own health due to tobacco product use.
To test these hypotheses, we conduct a simple online experiment with 200 adults who vape every day or some days. The experiment randomizes participants to receive a stimulus depicting 1 of 5 e-cigarette devices (eg, vape pen) with or without a corresponding health warning. After viewing the stimulus for 10 seconds, participants complete a survey that includes an item on worry, “How worried are you about your health because of your e-cigarette use?”, 7 with a response scale of 1 (“not at all”) to 5 (“extremely”). Participants who receive a health warning report mean worry of 2.38 (SD = 0.87), and those who do not report mean worry of 2.33 (SD = 0.84). The Bayes factors comparing the first and second alternative hypotheses to the null hypothesis are 0.16 and 0.30, respectively. These Bayes factors indicate that there is more support in the data for the null hypothesis than for the alternative hypotheses. Taking the reciprocal of these Bayes factors indicates that there is approximately 3 to 6 times more support in the data for the null hypothesis that health warnings have no effect than for either alternative. Therefore, we conclude that health warnings for e-cigarettes do not appear to affect worry based on the experimental data.
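Analyses of this kind can be run with the BayesFactor package in R. The sketch below simulates data matching the summary statistics above and is only illustrative: the simulated responses, object names, and treatment of the 1-to-5 item as approximately continuous are assumptions, so the resulting Bayes factors will differ somewhat from the published 0.16 and 0.30.

```r
library(BayesFactor)

set.seed(1)
warning_grp <- rnorm(100, mean = 2.38, sd = 0.87)  # saw a health warning
control_grp <- rnorm(100, mean = 2.33, sd = 0.84)  # no health warning

# rscale = 0.5 matches the JZS scale parameter of 1/2 used in the text;
# nullInterval = c(0, Inf) requests one-sided alternatives.
ttestBF(x = warning_grp, y = control_grp,
        rscale = 0.5, nullInterval = c(0, Inf))
# Row [1]: effect size in (0, Inf) vs. the point null
#          (warnings increase worry)
# Row [2]: the complementary one-sided hypothesis vs. the point null
#          (warnings decrease worry)
```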
The hallmark of Bayesian model comparison (and other Bayesian approaches) is the incorporation of uncertainty at all stages of inference, particularly through the use of properly specified prior distributions. As a result, Bayesian model comparison has three practical advantages over conventional methods. First, Bayesian model comparison is not limited to tests of point null hypotheses. 8,9 In fact, the first empirical example essentially conceptualized the possibility of the coin being fair as an interval null hypothesis by permitting some variation in the coin’s head-tail bias around 0.5. Indeed, a great deal has already been written on how the use of point null hypotheses can lead to overstatements about the evidence for alternative hypotheses. 10 Second, Bayesian model comparison is flexible enough to permit tests of any meaningful hypotheses. 11 As a result, the second empirical example demonstrated tests of two one-sided hypotheses against the same null hypothesis. Third, Bayesian model comparison uses the marginal likelihood, which is a measure of the average fit of a model across the parameter space. 12 Doing so leads to more accurate characterizations of the evidence for competing hypotheses because it accounts for uncertainty in parameter values even after observing the data, instead of focusing only on the most likely values of those parameters.
Bayes factors specifically have three advantages over other inferential statistics. First, Bayes factors can provide direct evidence for the common null hypothesis of no difference. 13 Second, they can reveal when experimental data are insensitive to the null and alternative hypotheses, clearly suggesting that the researcher should withhold judgment. 13 Third, they can be interpreted continuously and thus provide an indication of the strength of the evidence for the null or alternative hypothesis. While Bayesian model comparison via Bayes factors leads to robust tests of competing hypotheses, this advantage is only realized when all hypotheses are quantitatively described using carefully chosen priors that are calibrated in light of accumulated knowledge. Furthermore, two analysts may choose different priors to describe the same hypothesis. This subjectivity in the choice of prior has prompted the development of a large class of Bayes factors for common analyses (eg, differences of means, as illustrated in the second empirical example) that use default priors. 14–16 Thus, the analyst only needs to choose values for important parameters, as in the second empirical example, without having to select the functional form of the prior (eg, a Beta prior) as in the first empirical example. Published Bayesian analyses will often list priors and justify why they were chosen for full transparency (see Baig et al. 17 for one succinct example). The next commentary will focus on informative hypotheses, prior specification when computing corresponding Bayes factors, and some Bayesian solutions for multiple testing. For the curious reader, the JASP package provides access to Bayes factors that use default priors for common analyses through a point-and-click interface similar to SPSS. 18
This work was supported by the Office of The Director, National Institutes of Health (award number DP5OD023064).
None declared.
1. Rouder JN, Morey RD, Verhagen J, Swagman AR, Wagenmakers EJ. Bayesian analysis of factorial designs. Psychol Methods. 2017;22(2):304–321.
2. Jeon M, De Boeck P. Decision qualities of Bayes factor and p value-based hypothesis testing. Psychol Methods. 2017;22(2):340–360.
3. Hoijtink H, van Kooten P, Hulsker K. Why Bayesian psychologists should change the way they use the Bayes factor. Multivariate Behav Res. 2016;51(1):2–10. doi:10.1080/00273171.2014.969364
4. Baig SA, Byron MJ, Boynton MH, Brewer NT, Ribisl KM. Communicating about cigarette smoke constituents: an experimental comparison of two messaging strategies. J Behav Med. 2017;40(2):352–359.
5. Brewer NT, Morgan JC, Baig SA, et al. Public understanding of cigarette smoke constituents: three US surveys. Tob Control. 2016;26(5):592–599.
6. Morgan JC, Byron MJ, Baig SA, Stepanov I, Brewer NT. How people think about the chemicals in cigarette smoke: a systematic review. J Behav Med. 2017;40(4):553–564. doi:10.1007/s10865-017-9823-5
7. Mendel JR, Hall MG, Baig SA, Jeong M, Brewer NT. Placing health warnings on e-cigarettes: a standardized protocol. Int J Environ Res Public Health. 2018;15(8):1578. doi:10.3390/ijerph15081578
8. Morey RD, Rouder JN. Bayes factor approaches for testing interval null hypotheses. Psychol Methods. 2011;16(4):406–419.
9. West R. Using Bayesian analysis for hypothesis testing in addiction science. Addiction. 2016;111(1):3–4. doi:10.1111/add.13053
10. Berger JO, Sellke T. Testing a point null hypothesis: the irreconcilability of p-values and evidence. J Am Stat Assoc. 1987;82(397):112–122. doi:10.1080/01621459.1987.10478397
11. Etz A, Haaf JM, Rouder JN, Vandekerckhove J. Bayesian inference and testing any hypothesis you can specify. Adv Methods Pract Psychol Sci. 2018;1(2):281–295. doi:10.1177/2515245918773087
12. Etz A. Introduction to the concept of likelihood and its applications. Adv Methods Pract Psychol Sci. 2018;1(1):60–69. doi:10.1177/2515245917744314
13. Dienes Z, Coulton S, Heather N. Using Bayes factors to evaluate evidence for no effect: examples from the SIPS project. Addiction. 2018;113(2):240–246.
14. Nuijten MB, Wetzels R, Matzke D, Dolan CV, Wagenmakers E-J. A default Bayesian hypothesis test for mediation. Behav Res Methods. 2014;47(1):85–97. doi:10.3758/s13428-014-0470-2
15. Ly A, Verhagen J, Wagenmakers E-J. Harold Jeffreys’s default Bayes factor hypothesis tests: explanation, extension, and application in psychology. J Math Psychol. 2016;72:19–32. doi:10.1016/j.jmp.2015.06.004
16. Rouder JN, Speckman PL, Sun D, Morey RD, Iverson G. Bayesian t tests for accepting and rejecting the null hypothesis. Psychon Bull Rev. 2009;16(2):225–237.
17. Baig SA, Byron MJ, Lazard AJ, Brewer NT. “Organic,” “natural,” and “additive-free” cigarettes: comparing the effects of advertising claims and disclaimers on perceptions of harm. Nicotine Tob Res. 2019;21(7):933–939.
18. Wagenmakers E-J, Love J, Marsman M, et al. Bayesian inference for psychology. Part II: example applications with JASP. Psychon Bull Rev. 2018;25(1):58–76.
Chapter 13: Bayesian Hypothesis Testing with Bayes Factors
In this chapter, we will discuss how to compute Bayes Factors for a variety of General Linear Models using the BayesFactor package (Morey and Rouder 2023). The package implements the “default” priors discussed in the SDAM book.
The BayesFactor package implements Bayesian model comparisons for General Linear Models (as well as some other models, e.g. for contingency tables and proportions) using JZS priors for the parameters, or fixing those parameters to 0. Because Bayes Factors are transitive, in the sense that a ratio of Bayes Factors is itself another Bayes factor: \[\begin{align} \text{BF}_{1,2} &= \frac{p(Y_1,\ldots,Y_n|\text{MODEL 1})}{p(Y_1,\ldots,Y_n|\text{MODEL 2})} \\ &= \frac{p(Y_1,\ldots,Y_n|\text{MODEL 1})/p(Y_1,\ldots,Y_n|\text{MODEL 0})}{p(Y_1,\ldots,Y_n|\text{MODEL 2})/p(Y_1,\ldots,Y_n|\text{MODEL 0})} \\ &= \frac{\text{BF}_{1,0}}{\text{BF}_{2,0}}, \end{align}\] you can compute many other Bayes Factors which might not be immediately provided by the package, by simply dividing the Bayes factors that the package does provide. This makes the procedure of model comparison very flexible.
If you haven’t installed the BayesFactor package yet, you need to do so first. Then you can load it as usual by:
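```r
# install.packages("BayesFactor")  # once, if not yet installed
library(BayesFactor)
```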
A Bayesian alternative to a \(t\)-test is provided via the ttestBF function. Similar to the base R t.test function of the stats package, this function allows computation of a Bayes factor for a one-sample t-test or a two-sample t-test (as well as a paired t-test, which we haven’t covered in the course). Let’s re-analyse the data we considered before, concerning participants’ judgements of the height of Mount Everest. The one-sample t-test we computed before, comparing the judgements to an assumed mean of \(\mu = 8848\), was:
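(A sketch; the vector of height judgements, here called everest_judgements, is an assumed name.)

```r
t.test(everest_judgements, mu = 8848)
```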
The syntax for the Bayesian alternative is very similar, namely:
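(Sketch, reusing the assumed everest_judgements vector; the result is stored as bf_anchor, the name used below.)

```r
bf_anchor <- ttestBF(everest_judgements, mu = 8848)
```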
This code provides a test of the following models:
\[\begin{align} H_0\!&: \mu = 8848 \\ H_1\!&: \frac{\mu - 8848}{\sigma_\epsilon} \sim \textbf{Cauchy}(r) \end{align}\]
After computing the Bayes factor and storing it in an object bf_anchor, we just see the print-out of the result by typing in the name of the object:
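(Sketch; the output below is reconstructed from the description that follows and may not match the real print-out character for character.)

```r
bf_anchor
# Bayes factor analysis
# --------------
# [1] Alt., r=0.707 : 46902934288 ±0%
#
# Against denominator:
#   Null, mu = 8848
# ---
# Bayes factor type: BFoneSample, JZS
```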
This output is quite sparse, which is by no means a bad thing. It shows a few important things. Under Alt. (which stands for the alternative hypothesis), we first see the scaling factor \(r\) used for the JZS prior distribution on the effect size. We then see the value of the Bayes Factor, which is “extreme” (>100), showing that the data increase the posterior odds ratio for the alternative model over the null model by a factor of 46,902,934,288. Quite clearly, the average judgements differed from the true height of Mount Everest! After the computed value of the Bayes factor, you will find a proportional error for the estimate of the Bayes factor. In general, the marginal likelihoods that constitute the numerator (“top model”) and denominator (“bottom model”) of the Bayes factor cannot be computed exactly, and have to be approximated by numerical integration routines or simulation. This results in some (hopefully small) error in computation, and the error estimate indicates the extent to which the true Bayes factor might differ from the computed one. In this case, the error is (proportionally) very small, and hence we can be assured that our conclusion is unlikely to be affected by error in the approximation.
As we didn’t set the scaling factor explicitly, the default value is used, which is the “medium” scale \(r = \frac{\sqrt{2}}{2} = 0.707\). Note that this is actually different from the default value of \(r=1\) proposed in Rouder et al. (2009), which first introduced this version of the Bayesian \(t\)-test to a psychological audience, and the one used to illustrate the method in the SDAM book. Whilst reducing the default value to \(r=0.707\) is probably reasonable given the effect sizes generally encountered in psychological studies, a change in the default prior highlights the subjective nature of the prior distribution in the Bayesian model comparison procedure. You should also realise that different analyses, such as t-tests, ANOVA, and regression models, use different default values for the scaling factor. As shown in the SDAM book, the value of the Bayes factor depends on the choice for the scaling factor. Although the default value may be deemed reasonable, the choice should really be based on a consideration of the magnitude of the effect sizes you (yes, you!) expect in a particular study. This is not always easy, but you should pick one (the default value, for instance, if you can’t think of a better one) before conducting the analysis. If you feel that makes the test too subjective, you may want to check the robustness of the result for different choices of the scaling factor. You can do this by computing the Bayes factor for a range of choices of the scaling factor, and then inspecting whether the strength of the evidence is stable over a reasonable range of values around your choice. The code below provides an example of this:
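(A sketch, again assuming the everest_judgements vector; the exact range of scaling factors to examine is a free choice.)

```r
rscales <- seq(0.1, 2, by = 0.1)
bfs <- sapply(rscales, function(r) {
  # extractBF() returns a data frame; the bf column holds the value
  extractBF(ttestBF(everest_judgements, mu = 8848, rscale = r))$bf
})
plot(rscales, bfs, type = "b", log = "y",
     xlab = "scaling factor r", ylab = "Bayes factor (log scale)")
```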
Given the scale of the \(y\)-axis (e.g., the first tick mark is at 1e+10 = 10,000,000,000), there is overwhelming evidence against the null-hypothesis for most choices of the scaling factor. Hence, the results seem rather robust to the exact choice of prior.
To compare the means of two groups, we can revisit the Tetris study, where we considered whether the number of memory intrusions is reduced after playing Tetris in combination with memory reactivation, compared to just memory reactivation by itself. The ttestBF function allows us to provide the data for one group as the x argument, and the data for the other group as the y argument, so we can perform our model comparison, by subsetting the dependent variable appropriately, as follows:
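(Sketch; the data frame tetris_data and its column and level names are assumptions.)

```r
bf_tetris <- ttestBF(
  x = tetris_data$intrusions[tetris_data$condition == "Reactivation+Tetris"],
  y = tetris_data$intrusions[tetris_data$condition == "Reactivation"]
)
bf_tetris
```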
This shows evidence for the alternative hypothesis over the null hypothesis that the means are identical (i.e. that the difference between the means is zero, \(\mu_1 - \mu_2 = 0\)): the data are 2.82 times more likely under the alternative model than under the null model, which sets the difference between the means to exactly \(\mu_1 - \mu_2 = 0\) rather than allowing different values of this difference through the prior distribution.
A two-sample t-test should really be identical to a two-group ANOVA model, as both concern the same General Linear Model (a model with a single contrast-coding predictor, with e.g. values of \(-\tfrac{1}{2}\) and \(\tfrac{1}{2}\)). Before fully discussing the way to perform an ANOVA-type analysis with the BayesFactor package, let’s just double-check this is indeed the case:
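(Sketch, with the same assumed tetris_data; anovaBF requires the grouping variable to be a factor.)

```r
tetris_data$condition <- factor(tetris_data$condition)
ttestBF(formula = intrusions ~ condition, data = tetris_data)
anovaBF(intrusions ~ condition, data = tetris_data)
```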
The results are indeed identical. Note that this is because both the ttestBF and anovaBF functions use the same prior distribution for the effect.
More general ANOVA-type models can be tested through the anovaBF function. This function takes the following important arguments:
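(A summary sketch of the main arguments; see ?anovaBF for the authoritative list and defaults.)

```r
# - formula:      model formula containing only factors, e.g. y ~ A*B
# - data:         a data.frame holding the variables in the formula
# - whichRandom:  character vector naming factors to treat as random effects
# - whichModels:  which models to compare ("withmain" by default;
#                 "top" and "all" are used below)
# - rscaleFixed / rscaleRandom: scales of the JZS priors on fixed and
#                 random effects
```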
The anovaBF function will (as far as I can gather) always use contr.sum() contrasts for the factors. So setting your own contrasts will have no effect on the results. The exact contrast should not really matter for omnibus tests, and sum-to-zero contrasts are a reasonable choice in general (contr.sum implements what we called effect-coding before). 3 While the anovaBF function always uses the JZS prior for any effects, it allows you to specify exactly which scaling factor to use for every effect, if so desired. One perhaps confusing thing is that effect sizes for ANOVA designs (as far as I can gather) are based on standardized treatment effects, whilst those for the t-test designs are based on Cohen’s \(d\) effect sizes. Hence, the values of the scaling factor \(r\) for “medium”, “wide”, and “ultrawide” are different for the Bayesian \(t\)-test and ANOVA models (whilst they provide the same results for models with two conditions).
Let’s see what happens when we use a Bayesian ANOVA-type analysis for the data on experimenter beliefs in social priming. First, let’s load the data, and turn the variables reflecting the experimental manipulations into factors:
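(Sketch; the file name and dependent variable are assumptions, while primeCond and experimenterBelief are the factor names used below.)

```r
sp_dat <- read.csv("social_priming.csv")          # assumed file name
sp_dat$primeCond          <- factor(sp_dat$primeCond)
sp_dat$experimenterBelief <- factor(sp_dat$experimenterBelief)
```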
We can now use the anovaBF function to compute the Bayes factors:
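(Sketch; the dependent variable name ApproachAdvantage is an assumption, and the result is stored as bf_expB, the name used below.)

```r
bf_expB <- anovaBF(ApproachAdvantage ~ primeCond * experimenterBelief,
                   data = sp_dat)
bf_expB
```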
A main thing to note here is that the comparisons of different versions of MODEL G are against the same MODEL R, which is an intercept-only model. We can see that all models which include experimenterBelief receive strong evidence against the intercept-only model, apart from the model which only includes primeCond, which has less evidence than the intercept-only model. Although this indicates that the primeCond effect might be ignorable, the comparisons are different from comparing reduced models to the general MODEL G with all effects included. We can obtain these Type 3 comparisons by setting the whichModels argument to "top":
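(Sketch, continuing with the assumed names above.)

```r
bf_expB_top <- anovaBF(ApproachAdvantage ~ primeCond * experimenterBelief,
                       data = sp_dat, whichModels = "top")
bf_expB_top
```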
It is very important to realise that the output now concerns the comparison of the reduced model (in the numerator, i.e. the “top model”) against the full model (in the denominator, i.e. the “bottom model”), as is stated in the Against denominator part of the output. So these are \(\text{BF}_{0,1}\) values, rather than \(\text{BF}_{1,0}\) values. That means that low values of the Bayes factor now indicate evidence for the alternative hypothesis that an effect is different from 0. As we find a very low \(\text{BF}_{0,1}\) value for the experimenterBelief effect, this thus shows strong evidence that this effect is different from 0. The \(\text{BF}_{0,1}\) values for the other effects are larger than 1, which indicate more support for the null hypothesis than for the alternative hypothesis.
We can change the output from a \(\text{BF}_{0,1}\) value to a \(\text{BF}_{1,0}\) value by simply inverting the Bayes factors, as follows:
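(Sketch: BayesFactor objects support arithmetic, so inversion is just division.)

```r
1 / bf_expB_top
```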
As we noted before, we again see strong evidence for the effect of experimenterBelief when we remove it from the full model, but not for the other effects.
The transitivity of the Bayes factor means that we can also obtain some of these results through a ratio of the Bayes factors obtained earlier. For instance, a Type 3 test of the experimenterBelief:primeCond interaction can be obtained by comparing a model with all effects included to a model without this interaction. In the analysis stored in bf_expB, we compared a number of the possible models to an intercept-only model. By comparing the Bayes factor of the model which excludes the interaction to that of a model which includes it, we can obtain the Bayes factor of that interaction as follows. In the output of bf_expB, the fourth element compares the full model to the intercept-only model, whilst in the third element, a model with only the main effects of experimenterBelief and primeCond is compared to an intercept-only model. The Type 3 test of the interaction can then be obtained through the ratio of these two Bayes factors:
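(Sketch, using the assumed bf_expB object from before.)

```r
# Element 4: full model vs. intercept-only; element 3: both main effects
# vs. intercept-only. Their ratio is the Type 3 test of the interaction.
bf_expB[4] / bf_expB[3]
```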
which indicates evidence for the null hypothesis that there is no moderation of the effect of experimenterBelief by primeCond, as the Bayes factor is well below 1. We cannot replicate all Type 3 analyses with the results obtained earlier, unless we ask the function to compare every possible model against the intercept-only model, by specifying whichModels = "all":
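(Sketch, with the same assumed names.)

```r
bf_expB_all <- anovaBF(ApproachAdvantage ~ primeCond * experimenterBelief,
                       data = sp_dat, whichModels = "all")
bf_expB_all
```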
For instance, we can now obtain a Type 3 test for experimenterBelief by comparing the full model (the 7th element in the output) to a model which just excludes this effect (i.e. the 6th element):
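```r
bf_expB_all[7] / bf_expB_all[6]
```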
which mostly reproduces the result we obtained by setting whichModels = "top" before.
Apart from different default values of the scaling factor \(r\) in the scaled-Cauchy distribution, the BayesFactor package works in the same way for models which include metric predictors. In a multiple regression model with only metric predictors, we can use the convenience function regressionBF. If you want to mix metric and categorical predictors, as in an ANCOVA model, you will have to use the generalTestBF function. All functions discussed so far are really just convenience interfaces to generalTestBF, which implements Bayes factors for the General Linear Model. These convenience functions are used to determine an appropriate scaling factor for the different terms in the model, but do little else of consequence, so you can replicate all the previous analyses through the generalTestBF function, if you’d like.
An analysis similar to a repeated-measures ANOVA can also be obtained. Just like the afex package, the BayesFactor package requires data in the long format. Let’s first prepare the data of the Cheerleader-effect experiment:
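(Sketch; apart from Participant and Item, which are referred to below, the object and variable names here are assumptions about this dataset.)

```r
cheer_dat <- read.csv("cheerleader.csv")      # assumed file name
cheer_dat$Participant  <- factor(cheer_dat$Participant)
cheer_dat$Item         <- factor(cheer_dat$Item)
cheer_dat$Presentation <- factor(cheer_dat$Presentation)  # assumed factor
```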
The way the BayesFactor package deals with repeated-measures designs is a little different from how we treated repeated-measures ANOVA. Rather than computing within-subjects composite variables, the package effectively deals with individual differences by adding random intercepts (like in a linear mixed-effects model). To do this, we add Participant as an additive effect, and then classify it as a random effect through the whichRandom argument. To obtain Type-3 comparisons, we again set whichModels to "top":
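(Sketch; Response and Presentation are assumed names for this dataset.)

```r
bf_cheer <- anovaBF(Response ~ Presentation * Item + Participant,
                    data = cheer_dat,
                    whichRandom = "Participant",
                    whichModels = "top")
bf_cheer
```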
In this case, the proportional errors of the results may be deemed too high. We can get more precise results by obtaining more samples (for these complex models, the estimation of the Bayes factor is done with a sampling-based approximation). We can do this, without the need to respecify the model, with the recompute function, where we can increase the number of sampling iterations from the default (10,000 iterations) to something higher:
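```r
bf_cheer <- recompute(bf_cheer, iterations = 100000)
bf_cheer
```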
This provides somewhat better results, although it would be better to increase the number of iterations even more.
As before, the Bayes Factors are for the reduced model compared to the full model, and we can get more easily interpretable results by computing the inverse values:
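```r
1 / bf_cheer
```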
We can see that we obtain “extreme” evidence for the main effect of Item. For the other effects, the evidence is more in favour of the null-hypothesis.
By default, the Bayes Factor objects just provide the values of the Bayes Factor. We don’t get estimates of the parameters.
To get (approximate) posterior distributions for the parameters, we can first estimate the general MODEL G with the lmBF function. This function is meant to compute a specific General Linear Model (rather than a set of such models). For example, for the Social Priming example, we can estimate the ANOVA model with lmBF as:
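(Sketch, again assuming the ApproachAdvantage dependent variable; all terms are spelled out explicitly for lmBF.)

```r
mod_G <- lmBF(ApproachAdvantage ~ primeCond + experimenterBelief +
                primeCond:experimenterBelief,
              data = sp_dat)
```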
We can then use this estimated model to obtain samples from the posterior distribution over the model parameters. This is done with the posterior function of the BayesFactor package. We can determine the number of samples through the iterations argument. This should generally be a high number, to get more reliable estimates:
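```r
post_samples <- posterior(mod_G, iterations = 10000)
```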
The post_samples object can be effectively treated as a matrix, with columns corresponding to the different parameters, and in the rows the samples. So we can obtain posterior means as the column-wise averages:
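```r
colMeans(post_samples)
```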
Here, mu corresponds to the “grand mean” (i.e. the average of averages), which is the intercept in a GLM with sum-to-zero contrasts. The next mean corresponds to the posterior mean of the treatment effect of the high-power prime condition (primeCond-HPP). I.e., this is the marginal mean of the high-power prime conditions, compared to the grand mean. The second effect is the posterior mean of the treatment effect of the low-power prime condition (primeCond-LPP). As there are only two power-prime conditions, this is exactly the negative value of the posterior mean of the high-power prime treatment effect (the grand mean is exactly halfway between these two treatment effects). We get similar treatment effects for the main effect of experimenter belief, and the interaction between power prime and experimenter belief. The posterior mean labelled sig2 is an estimate of the error variance. The columns after this are values related to the specification of the prior, which we will ignore for now.
We can do more than compute means for these samples from the posterior distribution. For instance, we can plot the (approximate) posterior distributions as well. For example, we can plot the posterior distribution of the high-power prime treatment effect as:
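(Sketch; the column name primeCond-HPP follows the labelling described above.)

```r
hist(as.matrix(post_samples)[, "primeCond-HPP"], breaks = 50,
     main = "", xlab = "primeCond-HPP treatment effect")
```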
A convenient way to obtain highest-density intervals is by using the hdi function from the HDInterval package. This function is defined for a variety of objects, including those returned by the BayesFactor::posterior() function. The function has, in addition to the object, one more argument called credMass, which specifies the width of the credible interval (credMass = .95 is the default). For example, 95% HDIs for the model parameters, including the two treatment effects discussed above, are obtained as follows:
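```r
library(HDInterval)
hdi(post_samples, credMass = .95)
```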
The output shows the lower and upper bound for each HDI. We see that the 95% HDI for power prime effect includes 0, whilst the 95% HDI for the experimenter belief effect does not. Again, this corresponds to what we observed earlier, that there is strong evidence for the experimenter belief effect, but not for the power prime effect.
The BayesFactor package computes Bayes Factors for a number of standard analyses, using the default priors set in the package. We may want to compute Bayes Factors to test hypotheses for more complex models. In general, we can compare any models by first computing the marginal likelihood for each model, and then computing the Bayes Factor as the ratio of these marginal likelihoods. Computing marginal likelihoods is not always straightforward, but a general procedure that often works reasonably well is called “bridge sampling” (Quentin F. Gronau et al. 2017), and has been implemented in the bridgesampling package (Quentin F. Gronau and Singmann 2021). Before discussing how to use this, we will first discuss a simpler way to compute Bayes Factors for particular comparisons within the context of a brms model.
For the following examples, we will start with the multiple regression model for the trump2016 data that we also estimated in the previous chapter, but now setting the prior distributions to more informative ones, as is advisable when computing Bayes Factors:
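(A sketch; the dependent variable percent_Trump_votes and the exact prior are assumptions. Note sample_prior = TRUE, which is required later by brms::hypothesis().)

```r
library(brms)
mod_regression <- brm(
  percent_Trump_votes ~ hate_groups_per_million +
    percent_bachelors_degree_or_higher,
  data = trump2016,
  prior = set_prior("normal(0, 1)", class = "b"),  # assumed informative prior
  sample_prior = TRUE
)
```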
The results of this model can be inspected with summary(mod_regression).
The brms::hypothesis() function can be used to test hypotheses about single parameters of brms models. This requires setting the argument sample_prior = TRUE in the brms::brm() function. The brms::hypothesis() function will not work properly without this. The brms::hypothesis() function has the following important arguments:
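(A summary sketch of the main arguments; see ?brms::hypothesis for the authoritative list.)

```r
# - x:          a fitted brms model (estimated with sample_prior = TRUE)
# - hypothesis: a character string describing the hypothesis, e.g.
#               "hate_groups_per_million = 0"
# - class:      the class of the parameter(s) involved (default "b",
#               i.e. regression coefficients)
# - alpha:      sets the width of the reported credible interval
#               (default 0.05, i.e. a 95% interval)
```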
The specification of the hypothesis argument is rather flexible. For example, we can test the (null) hypothesis that the slope of hate_groups_per_million equals 0 by specifying the hypothesis as "hate_groups_per_million = 0":
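```r
hypothesis(mod_regression, "hate_groups_per_million = 0")
```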
This compares mod_regression to an alternative model where the prior for the slope of hate_groups_per_million is set to a point-prior at 0 (i.e. only the value 0 is allowed). The output of the function repeats some values that are also provided in the output of the summary() function (the posterior mean, standard deviation, and lower- and upper-bound of the 95% HDI). In addition, we find values of Evid.Ratio and Post.Prob. The Evidence Ratio (Evid.Ratio) is the value of the Bayes Factor \(\text{BF}_{01}\) comparing the model specified in the hypothesis (MODEL R) to the less restrictive MODEL G (mod_regression). So values larger than 1 indicate that the data provide evidence for the tested hypothesis (MODEL R) over MODEL G. Conversely, values smaller than 1 indicate evidence for MODEL G over MODEL R. The value found here (\(\text{BF}_{01} = 0.1180559\)) can be considered “moderate” evidence for MODEL G, which allows hate_groups_per_million to have an effect, compared to MODEL R, which fixes the effect to 0. The Bayes Factor in this procedure is calculated via the so-called Savage-Dickey density ratio (Wagenmakers et al. 2010). The Posterior Probability is the posterior probability that the hypothesis is true. For this point-hypothesis, this is the posterior probability of the model with the point-prior (MODEL R), assuming equal prior probabilities for this model and MODEL G.
Directional hypotheses can also be tested. For example, we can test the hypothesis that the slope of hate_groups_per_million is larger than 0 by specifying the hypothesis as "hate_groups_per_million > 0":
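```r
hypothesis(mod_regression, "hate_groups_per_million > 0")
```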
Whilst the output is similar to before, for these directional tests, a different procedure is used to compute the “evidence ratio”. Here, the evidence ratio is the posterior probability that the parameter is larger than 0, divided by the posterior probability that the parameter value is smaller than 0, i.e.: \[\text{Evidence ratio} = \frac{p(\beta > 0|\text{data})}{p(\beta < 0|\text{data})}\] which is estimated simply by the proportion of posterior samples that are larger than 0 (which is also stated under Post.Prob), divided by the proportion of posterior samples smaller than 0 (which equals \(1-\) Post.Prob). You can also use the procedure to test some “wacky” hypotheses, such as that the slope of hate_groups_per_million is smaller than the slope of percent_bachelors_degree_or_higher:
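```r
hypothesis(mod_regression,
           "hate_groups_per_million < percent_bachelors_degree_or_higher")
```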
As the scales of hate_groups_per_million and percent_bachelors_degree_or_higher are quite different, this hypothesis does not necessarily make much sense from a scientific viewpoint. The example was mainly meant to show the flexibility of the procedure.
Bridge sampling (Bennett 1976; Meng and Wong 1996; Quentin F. Gronau et al. 2017) provides a general method to estimate the marginal likelihood from MCMC samples. We won’t go into the details of this algorithm here (Quentin F. Gronau et al. 2017 provide a relatively readable introduction), but note that it is a sampling-based approximation, and the accuracy of the method will depend on how many samples are used (and whether the MCMC algorithm has converged to sampling from the posterior distribution).
The implementation of bridge sampling in the bridgesampling package, for which the brms::bridge_sampler() function provides a simple wrapper, requires that all parameters of the model are sampled at each step. This can be requested by setting the option save_pars = save_pars(all = TRUE) in the call to brms::brm(). We did not do this before, so to be able to use the brms::bridge_sampler() function, we should first re-estimate the model with:
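(Sketch, reusing the assumed model specification from above.)

```r
mod_regression <- brm(
  percent_Trump_votes ~ hate_groups_per_million +
    percent_bachelors_degree_or_higher,
  data = trump2016,
  prior = set_prior("normal(0, 1)", class = "b"),
  sample_prior = TRUE,
  save_pars = save_pars(all = TRUE)  # keep all parameter draws
)
```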
Having set save_pars = save_pars(all = TRUE), we can then call the brms::bridge_sampler() function on the estimated model as:
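```r
ml_regression <- bridge_sampler(mod_regression)
ml_regression
```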
This function returns the approximate (natural) logarithm of the marginal likelihood, i.e. \[\widehat{\log p}(\text{data}|\text{MODEL})\] To compute a Bayes Factor, we will also need the (log) marginal likelihood for an alternative model. For example, we can set the prior for the slope of hate_groups_per_million to be a point-prior at 0 by specifying the prior distribution for that parameter as constant(0):
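(Sketch; the remaining slopes keep the assumed normal(0, 1) prior.)

```r
mod_regression_null <- brm(
  percent_Trump_votes ~ hate_groups_per_million +
    percent_bachelors_degree_or_higher,
  data = trump2016,
  prior = c(
    set_prior("normal(0, 1)", class = "b"),
    set_prior("constant(0)", class = "b",
              coef = "hate_groups_per_million")  # point-prior at 0
  ),
  sample_prior = TRUE,
  save_pars = save_pars(all = TRUE)
)
```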
We can now use the brms::bridge_sampler() function to compute the (log) marginal likelihood for this estimated model as:
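```r
ml_regression_null <- bridge_sampler(mod_regression_null)
ml_regression_null
```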
We now have two (log) marginal likelihoods. To compute the actual Bayes Factor, we can use the fact that: \[\begin{aligned} \log \left( \frac{p(\text{data}|\text{MODEL 1})}{p(\text{data}|\text{MODEL 2})} \right) &= \log p(\text{data}|\text{MODEL 1}) - \log p(\text{data}|\text{MODEL 2}) \\ \frac{p(\text{data}|\text{MODEL 1})}{p(\text{data}|\text{MODEL 2})} &= \exp \left( \log p(\text{data}|\text{MODEL 1}) - \log p(\text{data}|\text{MODEL 2}) \right) \end{aligned}\] So we can compute the Bayes factor by taking the difference between the log marginal likelihoods, and then exponentiating:
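```r
exp(ml_regression$logml - ml_regression_null$logml)
```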
where we have used that each object returned by the brms::bridge_sampler() is a list with the named element logml being equal to the (approximate) marginal log likelihood.
This explicit computation can be avoided by calling the bridgesampling::bf() function, which will provide the actual Bayes Factor from the log marginal likelihoods:
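```r
bridgesampling::bf(ml_regression, ml_regression_null)
```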
Curiously, a scaling factor of \(r = \tfrac{1}{2}\) in this case corresponds to a scaling factor of \(r = \tfrac{\sqrt{2}}{2}\), which is something I don’t immediately understand, and will require further investigation.
IMAGES
VIDEO
COMMENTS
To use and report a Bayesian hypothesis test, predicted effect sizes must be specified. The article will provide guidance in specifying effect sizes of interest (which also will be of relevance to those using frequentist statistics). First, if a minimally interesting effect size can be specified, a null interval is defined as the effects ...
The Bayes factor (sometimes abbreviated as BF) has a special place in the Bayesian hypothesis testing, because it serves a similar role to the p-value in orthodox hypothesis testing: it quantifies the strength of evidence provided by the data, and as such it is the Bayes factor that people tend to report when running a Bayesian hypothesis test ...
20.6.2.1 One-sided tests. We generally are less interested in testing against the null hypothesis of a specific point value (e.g. mean difference = 0) than we are in testing against a directional null hypothesis (e.g. that the difference is less than or equal to zero). We can also perform a directional (or one-sided) test using the results from ...
In Bayesian hypothesis testing, a one-sided hypothesis yields a more diagnostic test than a two-sided alternative (e.g., Jeffreys, 1961; Wetzels, Raaijmakers, Jakab, & Wagenmakers, 2009, p.283). ... Additionally, the robustness of the result to different prior distributions can be explored and included in the report. This is an important type ...
Bayesian hypothesis testing with Bayes factors is, at it's heart, a model comparison procedure. Bayesian models consist of a likelihood function and a prior distribution. A different prior distribution means a different model, and therefore a different result of the model comparison. ... Report the results. Make sure that you describe the ...
11. Bayesian hypothesis testing. This chapter introduces common Bayesian methods of testing what we could call statistical hypotheses . A statistical hypothesis is a hypothesis about a particular model parameter or a set of model parameters. Most often, such a hypothesis concerns one parameter, and the assumption in question is that this ...
Conceptualizing Hypothesis Testing via Bayes Factors. Bayesian inference is a fully probabilistic framework for drawing scientific conclusions that resembles how we naturally think about the world. Often, we hold an a priori position on a given issue. On a daily basis, we are confronted with facts about that issue.
If using model comparison or hypothesis testing as the basis for a decision, state and justify the decision threshold for the posterior model probability, and the minimum prior model probability that would make the posterior model probability exceed the decision threshold. ... How to use and report Bayesian hypothesis tests. Psychol. Conscious ...
Hypothesis Testing. Suppose we have univariate data y i iid ∼ N(θ, 1) goal is to test H0: θ = 0; vs H1: θ ≠ 0. Frequentist testing - likelihood ratio, Wald, score, UMP, confidence regions, etc. Need a test statistic T(y ( n)) T ( y ( n)) (and its sampling distribution) p-value: Calculate the probability of seeing a dataset/test ...
We first describe a standard Bayesian analysis of a single binomial response, going through the prior distribution choice and explaining how the posterior is calculated. We then discuss Bayesian hypothesis testing using the Bayes factor, a measure of how much the posterior odds of believing in one hypothesis changes from the prior odds.
In this chapter, you learnt about hypothesis testing using a Bayesian framework. The first two activities explored the logic of Bayesian statistics to make inferences and how it can be used to test hypotheses when expressed as the Bayes factor. ... This preprint outlines common errors and misconceptions when researcher report Bayes factors. 4.8 ...
We test the following hypotheses: H 0: δ = 0 versus H 1: δ ≠ 0. The following script of R code implements the Bayesian paired t -test and presents the p -value of the classical approach for comparison. The value r = 0.707 ( 2 / 2) denotes the scale of a Cauchy prior distribution of δ.
Again, we obtain a p-value less than 0.05, so we reject the null hypothesis. What does the Bayesian version of the t-test look like? Using the ttestBF() function, we can obtain a Bayesian analog of Student's independent samples t-test using the following command: ttestBF( formula = grade ~ tutor, data = harpo )
To use and report a Bayesian hypothesis test, predicted effect sizes must be specified. The article will provide guidance in specifying effect sizes of interest (which also will be of relevance to those using frequentist statistics). First, if a minimally interesting effect size can be specified, a null interval is defined as the effects ...
This article provides guidance on interpreting and reporting Bayesian hypothesis tests, in order to aid their understanding. To use and report a Bayesian hypothesis test, predicted effect sizes must be specified. The paper will provide guidance in specifying effect sizes of interest (which also will be of relevance to those using frequentist ...
discrepancy between the p-value and the objective Bayesian answers in precise hypothesis testing? Many Fisherians (and arguably Fisher) prefer likelihood ratios to p-values, when they are available (e.g., genetics). A lower bound on the Bayes factor (or likelihood ratio): choose π(θ) to be a point mass at θˆ, yielding B01(x) = Poisson(x j 0 ...
9.1.8 Bayesian Hypothesis Testing. Suppose that we need to decide between two hypotheses H0 H 0 and H1 H 1. In the Bayesian setting, we assume that we know prior probabilities of H0 H 0 and H1 H 1. That is, we know P(H0) = p0 P ( H 0) = p 0 and P(H1) = p1 P ( H 1) = p 1, where p0 + p1 = 1 p 0 + p 1 = 1. We observe the random variable (or the ...
In the context of Bayesian inference, hypothesis testing can be framed as a special case of model comparison where a model refers to a likelihood function and a prior distribution. Given two competing hypotheses and some relevant data, Bayesian hypothesis testing begins by specifying separate prior distributions to quantitatively describe each ...
To use and report a Bayesian hypothesis test, predicted effect sizes must be specified. The article will provide guidance in specifying effect sizes of interest (which also will be of relevance to those using frequentist statistics). First, if a minimally interesting effect size can be specified, a null interval is defined as the effects ...
To use and report a Bayesian hypothesis test, predicted effect sizes must be specified. The paper will provide guidance in specifying effect sizes of interest (which also will be of relevance to ...
To use and report a Bayesian hypothesis test, predicted effect sizes must be specified. The paper will provide guidance in specifying effect sizes of interest (which also will be of relevance to those using frequentist statistics). First, if a minimally interesting effect size can be specified, a null interval is defined as the effects smaller ...
13.1.1 A Bayesian one-sample t-test. A Bayesian alternative to a \(t\)-test is provided via the ttestBF function. Similar to the base R t.test function of the stats package, this function allows computation of a Bayes factor for a one-sample t-test or a two-sample t-tests (as well as a paired t-test, which we haven't covered in the course). Let's re-analyse the data we considered before ...
This normal-science and revolution pattern ties into a Bayesian workflow cycling between model building, inference, and model checking. The multiverse. The point of the "forking paths" metaphor in statistics is that multiple comparisons can be a problem, even when there is no "fishing expedition" or "p-hacking" and the research ...
Again, we obtain a p-value less than 0.05, so we reject the null hypothesis. What does the Bayesian version of the t-test look like? Using the ttestBF() function, we can obtain a Bayesian analog of Student's independent samples t-test using the following command: ttestBF( formula = grade ~ tutor, data = harpo )