import numpy as npimport polars as plimport scipy.stats as stimport numbaimport tqdmimport iqplotimport bebi103import bokeh.ioimport bokeh.plottingbokeh.io.output_notebook()
Loading BokehJS ...
In this fun exercise, we will investigate how replicable certain statistical conclusions are. What I mean by replicability is best understood be working through this notebook.
For this lesson we will use the zebrafish embryo sleep data from the Prober lab. A description of their work on the genetic regulation of sleep can be found on the research page of the lab website. In particular, the movie below comes from their experiments watching moving/sleeping larvae over time.
The data we will use are processed from raw data published in Gandhi et al., 2015. In their experiment they were studying the effect of a deletion in the gene coding for arylalkylamine N-acetyltransferase (aanat), which is a key enzyme in the rhythmic production of melatonin. Melatonin is a hormone responsible for regulation of circadian rhythms. It is often taken as a drug to treat sleep disorders. The goal of this study is to investigate the effects of aanat deletion on sleep pattern in 5+ day old zebrafish larvae.
Among other sleep properties, they measured the mean rest bout length on the sixth night, comparing wild type larvae to the homozygous mutant. A rest bout is defined as a period of time in which the fish does not move. The length of a rest bout is just the amount of time the fish is still. We are primarily interested in the difference in the mean bout lengths between the two genotypes. The processed data are found here.
Let’s load the data.
# Load datadf = pl.read_csv(os.path.join(data_path, 'mean_rest_bouts.csv'), comment_prefix='#')# Pull out wild type and mutant and drop NAsdf = df.filter(pl.col('genotype').is_in(['wt', 'mut'])).drop_nulls()
The distribution for the mutant appears to be shifted to the left of the wild type, meaning that the mutant has shorted rest bouts. Let’s add 95% confidence intervals to the plot to see how the ECDFs might change if we did the experiment again.
These is strong overlap of the confidence intervals, so it may just be that the differences are due to the finite sample size. We will investigate further with some modeling.
51.1 Cohen’s d
Cohen’s d is a commonly used measure of effect size in comparison of two data sets. It is the ratio of the difference of means compared to a pooled standard deviation.
Here, \(\bar{x}\) is the plug-in estimate for the mean of the data from sample \(x\), \(\hat{\sigma}_x^2\) is the plug-in estimate for the variance from sample \(x\), and \(n_x\) is the number of measurements in sample \(x\). The values for sample \(y\) are similarly defined.
Roughly speaking, Cohen’s d tells us how different the means of the data sets are compared to the variability in the data. A large Cohen’s d means that the effect is large compared to the variability of the measurement.
51.2 Estimates of the difference of means and Cohen’s d
First, we will compute nonparametric estimates from the data. We will estimate the difference in the mean bout length and Cohen’s d. For speed, we save the two data sets as NumPy arrays.
Next, we’ll write some functions to conveniently generate bootstrap replicates and do our hypothesis tests. These borrow heavily from the lessons on hacker stats.
@numba.jit(nopython=True)def cohen_d(x, y, return_abs=False):"""Cohen's d for two data sets.""" diff = x.mean() - y.mean() pooled_variance = (len(x) * np.var(x) +len(y) * np.var(y)) / (len(x) +len(y) -2)if return_abs:return np.abs(diff) / np.sqrt(pooled_variance)return diff / np.sqrt(pooled_variance)@numba.jit(nopython=True)def t_stat(x, y):"""Welch's t-statistic."""return (np.mean(x) - np.mean(y)) / np.sqrt( np.var(x) / (len(x) -1) + np.var(y) / (len(y) -1) )@numba.jit(nopython=True)def draw_perm_sample(x, y):"""Generate a permutation sample.""" concat_data = np.concatenate((x, y)) np.random.shuffle(concat_data)return concat_data[: len(x)], concat_data[len(x) :]@numba.jit(nopython=True)def draw_bs_sample(data):"""Draw a single bootstrap sample."""return np.random.choice(data, size=len(data))@numba.jit(nopython=True)def draw_perm_reps_t(x, y, size=10000): out = np.empty(size)for i inrange(size): x_perm, y_perm = draw_perm_sample(x, y) out[i] = t_stat(x_perm, y_perm)return out@numba.jit(nopython=True)def draw_bs_reps_mean(data, size=10000): out = np.empty(size)for i inrange(size): out[i] = np.mean(draw_bs_sample(data))return out@numba.jit(nopython=True)def draw_bs_reps_diff_mean(x, y, size=10000): out = np.empty(size)for i inrange(size): out[i] = np.mean(draw_bs_sample(x)) - np.mean(draw_bs_sample(y))return out@numba.jit(nopython=True)def draw_bs_reps_cohen_d(x, y, size=10000, return_abs=False): out = np.empty(size)for i inrange(size): out[i] = cohen_d(draw_bs_sample(x), draw_bs_sample(y), return_abs)return out@numba.jit(nopython=True)def draw_bs_reps_t(x, y, size=10000):""" Bootstrap replicates using the Welch's t-statistic. """ out = np.empty(size)for i inrange(size): out[i] = t_stat(draw_bs_sample(x), draw_bs_sample(y))return out
Now we can compute the replicates. First, let’s look at the means and their respective confidence intervals.
There is some overlap in the confidence interval, though wild type tend to have longer rest bouts. Just looking at these numbers, we may not be all that certain that there is a discernible difference between wild type and mutant.
Now, let’s look at the difference in the mean rest bout lengths, which we define as \(\delta = \bar{x}_\mathrm{wt} - \bar{x}_\mathrm{mut}\).
As we might expect, on the tail end of the confidence interval for the difference of means, we see that the mutant might actually have longer rest bouts that wild type.
Finally, lets compute the Cohen’s d to check the effect size.
So, the effect size is 0.54, meaning that the mutant tends to have rest bouts 0.5 standard deviations as large as wild type fix. Jacob Cohen would call this a “medium” sized effect.
51.3 Null hypothesis significance testing
We will now perform an NHST on these two data sets. We formulate the hypothesis as follows.
\(H_0\): Wild type and mutant fish have the same mean rest about length.
Test statistic: Cohen’s d.
At least as extreme as: Cohen’s d larger than what was observed.
We can then perform a bootstrap hypothesis test.
@numba.jit(nopython=True)def cohen_nhst(wt, mut, size=100000):""" Perform hypothesis test assuming equal means, using Cohen-d as test statistic. """# Shift data sets so that they have the same mean. wt_shifted = wt - np.mean(wt) + np.mean(np.concatenate((wt, mut))) mut_shifted = mut - np.mean(mut) + np.mean(np.concatenate((wt, mut)))# Draw replicates of Cohen's d reps = draw_bs_reps_cohen_d(wt_shifted, mut_shifted, size=size)# Compute p-valuereturn np.sum(reps >= cohen_d(wt, mut)) /len(reps)print("Cohen's d p-value:", cohen_nhst(wt, mut))
Cohen's d p-value: 0.06688
We get a p-value of about 0.07, which, if we use the typical bright line p-value for statistical significance, we would say that this difference is not statistically significant. We could also test what would happen if we used a different test statistic, like the difference of means.
# Shift data sets so that they have the same mean.wt_shifted = wt - np.mean(wt) + np.mean(np.concatenate((wt, mut)))mut_shifted = mut - np.mean(mut) + np.mean(np.concatenate((wt, mut)))# Draw replicates of difference of meansreps = draw_bs_reps_diff_mean(wt_shifted, mut_shifted, size=100000)# Compute p-valuep_val = np.sum(reps >= np.mean(wt) - np.mean(mut)) /len(reps)print('Difference of means p-value:', p_val)
Difference of means p-value: 0.04077
Here, we get a p-value of about 0.04. We would say that the result is statistically significant if we used a bright line value of 0.05.
Finally, let’s try a canonical test for this circumstance, the Welch’s t-test. As a reminder, the test statistic for the Welch’s t-test is
\[\begin{align}
T = \frac{\bar{x}_w - \bar{x}_m}{\sqrt{\hat{\sigma}_w^2/n_w + \hat{\sigma}_m^2/n_m}},
\end{align}\]
where \(\hat{\sigma}_w^2\) and \(\hat{\sigma}_m^2\) are plug-in estimates for the variances. Importantly, when performing a Welch’s t-test, Normality of the two samples is assumed. So, the hypothesis test is defined as follows.
\(H_0\): The two samples are both Normally distributed with equal means.
Test statistic: t-statistic.
At least as extreme as: t-statistic (wild type minus mutant) greater than or equal to what was observed.
This is implemented as scipy.stats.ttest_ind() using the kwarg equal_var=False. We divide by two to get the one-tailed test. Note that Welch’s t-test is not exact, but is asymptotically exact for large sample sizes.
Here, we are just above the bright line value of 0.05. We can perform a similar hypothesis test without the Normal assumption using the same test statistic as in the Welch’s test.
# Draw replicates of t statisticreps = draw_bs_reps_t(wt_shifted, mut_shifted, size=100000)# Compute p-valuep_val = np.sum(reps >= t_stat(wt, mut)) /len(reps)print("Welch's t-test without Normal assumption p-value:", p_val)
Welch's t-test without Normal assumption p-value: 0.06416
Finally, we will perform a permutation test. This test is specified as follows.
\(H_0\): The sleep bout lengths of mutant and wild type fish are identically distributed.
Test statistic: Welch’s t-statistic
At least as extreme as : difference of means is greater than what was observed.
So, all of our tests give p-values that are close to each other, ranging from about 0.04 to 0.07. If we choose bright line p-values to deem something as significant or not, some similar hypothesis/test statistic pairs can give different results. So, my advice is do not use brightline p-values. You went through the trouble of computing the p-value, just report it and leave it at that. Don’t change a float to a bool.
51.4 Model comparison
As an alternative to NHST, we can ask a similar (but different) question. We can compare two generative models. In one model, wild type and mutant sleep mean bout lengths come from the same Normal distribution. In the other, they come from different Normal distributions. We can compute an AIC for each model and then the Akaike weight for the first model (that they come from the same Normal distribution). Again, we can use the convenient feature that for Normal distributions, the MLE is given by the plug-in estimates for the parameters.
Now that we have this function, we can compute the Akaike weight for this data set.
akaike_weight(wt, mut)
np.float64(0.598909687738875)
The Akaike weight says that we should slightly favor the model where the data points come from the same Normal distribution. This runs contrary to our hypothesis tests. The AIC is penalizing the model with more parameters.
51.5 Dancing
We will now do a fun, instructive experiment. We will “re-acquire” the data by drawing random samples out of Normal distributions parametrized by the maximum likelihood estimates we obtain from the data. (Recall that the MLE for the mean and variance of a Normally distributed random variable is given by the plug-in estimates.) We will then compute the confidence interval and credible region for \(\delta\) and see how they vary from experiment to experiment. We will later repeat this with p-values and odds ratios.
The idea here is that if the data are indeed Gaussian distributed, we are looking at data that could plausibly be generated in an identical experiment.
First, we’ll write a function to generate new data and use it to generate 500 new data sets.
def new_data(mu, sigma, n):"""Generate new data"""return np.maximum(np.random.normal(mu, sigma, n), 0.01)# Values from real datamu_wt = np.mean(wt)mu_mut = np.mean(mut)sigma_wt = np.std(wt, ddof=0)sigma_mut = np.std(mut, ddof=0)# How many new data sets to generaten_new_data =500# Generate new datanew_wt = [new_data(mu_wt, sigma_wt, len(wt)) for _ inrange(n_new_data)]new_mut = [new_data(mu_mut, sigma_mut, len(mut)) for _ inrange(n_new_data)]
Now we can do the calculations. First, we’ll compute the confidence intervals for \(\delta\).
# Set up arrays for storing resultsconf_int = np.empty((n_new_data, 2))delta = np.empty(n_new_data)# Do calcs!for i, (wt_data, mut_data) inenumerate(tqdm.tqdm(zip(new_wt, new_mut))):# Compute confidence interval bs_reps = draw_bs_reps_diff_mean(wt_data, mut_data) conf_int[i, :] = np.percentile(bs_reps, (2.5, 97.5))# Sample difference of means delta[i] = wt_data.mean() - mut_data.mean()# Store the results convenientlydf_res = pl.DataFrame( schema=["conf_low", "conf_high", "delta"], data=np.hstack((conf_int, delta.reshape(n_new_data, 1))),)
500it [00:02, 198.75it/s]
Next, we can do some null hypothesis significance testing. We will compute three p-values, our custom bootstraped p-value with Cohen’s d, the p-value from a permutaiton test, and a p-value from Welch’s t-test. Remember, with our custom bootstrapped p-value, the hypothesis is that the mutant and wild type sleep bout lengths were drawn out of distributions of the same mean (and no other assumptions). The test statistic is Cohen’s d. The hypothesis in the permutation test is that the two data sets are identically distributed. The hypothesis in Welch’s t-test is that the mutant and wild type were drawn from Normal distributions with the same mean, but with difference variances. The test statistic is a t-statistic, defined above.
To visualize the results, we’ll plot the confidence intervals. We’ll plot the confidence interval as a bar, and then the bounds of the credible region as dots. For ease of viewing, we will only plot 100 of these and will sort them by the plug-in estimate for δ.
# Choose 100 and sort by delta for easy displaydf_res_sorted = df_res.sample(100).sort(by='delta')# Set up figurep = bokeh.plotting.figure(frame_height=250, frame_width=700, y_axis_label="δ (min)")# Populate glyphsp.scatter(np.arange(len(df_res_sorted)), df_res_sorted['delta'])x_conf = [[i, i] for i inrange(len(df_res_sorted))]y_conf = [[r["conf_low"], r["conf_high"]] for r in df_res_sorted.iter_rows(named=True)]p.multi_line(x_conf, y_conf, line_width=2, color="#1f77b4")# Turn off axis ticks for xp.xaxis.visible =Falsep.xgrid.visible =Falsebokeh.io.show(p)
The confidence interval can vary from experiment to experiment, but not that much. That is, the confidence intervals “dance” around, but all within about a factor of 3 of the original observed values.
51.5.2 Dancing: p-values
Now, let’s look at the p-values and the Akaike weights.
The custom p-value computed with a Cohen’s d test statistic and the permutation test p-value are nearly equal to the Welch’s p-value.
But what is really striking here is the scale! Wow! In 500 repeats, we get p-values ranging over four or five orders of magnitude! That’s three exclamations in a row! Four, now. Those exclamation points are there to highlight that the p-value is not a reproducible statistic at all.
The Akaike weights are less variable, and more conservative. Not many of them dip below 0.1, and we would be unlikely to select one model against another for most of the values of the Akaike weights we calculated. However, they are still rather variable, and in 500 repeats the can var over many orders of magnitude as well. Though better than the p-values, they are still not terribly reproducible.
Conversely, both confidence intervals don’t really dance much, and p-values and odds ratios dance like this.
51.5.3 The effect of sample size on dancing
The zebrafish sleep experiment had only about 20 samples, so maybe larger sample sizes will result in less extreme dancing of p-values. Let’s do a numerical experiment to look at that. We will take 15, 20, 50, and 100 samples for our experimental “repeats” and investigate how the p-value varies. For speed, we will only compute the p-value from Welch’s t-test, which we showed previously to track closely with our custom p-values.
n_new_data =1000n_samples = [15, 20, 50, 100]p_vals = np.empty((n_new_data, len(n_samples)))akaike_weights = np.empty((n_new_data, len(n_samples)))# Do calcs!for i in tqdm.tqdm(range(n_new_data)):for j, n inenumerate(n_samples):# Generate new data new_wt = new_data(mu_wt, sigma_wt, n) new_mut = new_data(mu_mut, sigma_mut, n)# Compute p-values and Akaikie weights p_vals[i,j] = st.ttest_ind(new_wt, new_mut, equal_var=False)[1]/2 akaike_weights[i, j] = akaike_weight(new_wt, new_mut)
Let’s look at the ECDFs of p-values and Akaike weights.
# Make tidy data frames for convenient plottingdf_p = pl.DataFrame(data=p_vals, schema=["n = "+str(n) for n in n_samples])df_p = df_p.unpivot(variable_name="n", value_name="p")df_akaike = pl.DataFrame( data=akaike_weights, schema=["n = "+str(n) for n in n_samples])df_akaike = df_akaike.unpivot(variable_name="n", value_name="akaike_weight")# Make plotsp1 = iqplot.ecdf( df_p, cats=["n"], q="p", x_axis_label="p-value", x_axis_type="log", order=["n = 15", "n = 20", "n = 50", "n = 100"], frame_width=500, frame_height=150, palette=bokeh.palettes.d3["Category20c"][4],)p2 = iqplot.ecdf( df_akaike, cats=["n"], q="akaike_weight", order=["n = 15", "n = 20", "n = 50", "n = 100"], x_axis_label="Akaike weight", x_axis_type="log", frame_width=500, frame_height=150, palette=bokeh.palettes.d3["Category20c"][8][4:],)p1.legend.location ="top_left"p2.legend.location ="top_left"p1.x_range = p2.x_rangebokeh.io.show(bokeh.layouts.column(p1, p2))
We see that even though the p-value and Akaike weight have large spreads as the number of samples increases, they also shift leftward. This is because small differences in samples can be discerned with large sample sizes. But notice that the p-value and the Akaike weight varies over orders of magnitude for similar data set.
51.6 Conclusion
This little exercise in reproducibility tells use that because the p-values “dance”, and to a lesser extent so do the Akaike weights, we had better be sure the dancefloor is far to the left. This suggests large \(n\) is needed.
I would argue that you should do a similar “dancing” analysis of your data sets when you have a reasonable generative model in mind so that you can decide what constitutes a small p-value of Akaike weight.