import warnings

import polars as pl
import numpy as np
import scipy.optimize
import scipy.special
import scipy.stats as st

import numba

import bebi103
import iqplot

import bokeh.io
bokeh.io.output_notebook()
46.1 Graphical model assessment
We have already discussed model assessment. In this lesson, we turn to the implementation of graphical model assessment.
We will use the transcript counts from Singer et al., this time focusing on the Rex1 gene, which shows clear bimodality. As a reminder, here is the ECDF for the transcript counts.
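To have the data set in hand, we first load it and plot the ECDF with iqplot. This is a minimal sketch; the file name singer_transcript_counts.csv, its location, and the '#' comment prefix in the file header are assumptions, so adjust to wherever the data set lives on your machine.

# Load in the data set (file name/path are assumed; adjust as needed)
df = pl.read_csv("singer_transcript_counts.csv", comment_prefix="#")

# ECDF of the Rex1 transcript counts
p = iqplot.ecdf(data=df, q="Rex1", x_axis_label="transcript counts")

bokeh.io.show(p)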
Before we can proceed to the model assessments, we need to write functions to calculate the MLE.
46.2 Likelihood calculators
In this and the following parts of this lesson, we will need the functions we have written to perform maximum likelihood estimation for a Negative Binomial distribution and for a mixture of two Negative Binomials. These are copied directly from the lesson on mixture models, so you do not need to focus on this rather long code cell.
def log_like_iid_nbinom(params, n):
    """Log likelihood for i.i.d. NBinom measurements, parametrized
    by alpha, b=1/beta."""
    alpha, b = params

    if alpha <= 0 or b <= 0:
        return -np.inf

    return np.sum(st.nbinom.logpmf(n, alpha, 1 / (1 + b)))


def mle_iid_nbinom(n):
    """Perform maximum likelihood estimates for parameters for i.i.d.
    NBinom measurements, parametrized by alpha, b=1/beta."""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")

        res = scipy.optimize.minimize(
            fun=lambda params, n: -log_like_iid_nbinom(params, n),
            x0=np.array([3, 3]),
            args=(n,),
            method="Powell",
        )

    if res.success:
        return res.x
    else:
        raise RuntimeError("Convergence failed with message", res.message)


def initial_guess_mix(n, w_guess):
    """Generate initial guess for mixture model."""
    n_low = n[n < np.percentile(n, 100 * w_guess)]
    n_high = n[n >= np.percentile(n, 100 * w_guess)]

    alpha1, b1 = mle_iid_nbinom(n_low)
    alpha2, b2 = mle_iid_nbinom(n_high)

    return alpha1, b1, alpha2, b2


def log_like_mix(alpha1, b1, alpha2, b2, w, n):
    """Log-likelihood of binary Negative Binomial mixture model."""
    # Fix nonidentifiability by enforcing bounds on w
    if w < 0 or w > 1:
        return -np.inf

    # Physical bounds on parameters
    if alpha1 < 0 or alpha2 < 0 or b1 < 0 or b2 < 0:
        return -np.inf

    logx1 = st.nbinom.logpmf(n, alpha1, 1 / (1 + b1))
    logx2 = st.nbinom.logpmf(n, alpha2, 1 / (1 + b2))

    # Multipliers for log-sum-exp
    lse_coeffs = np.tile([w, 1 - w], [len(n), 1]).transpose()

    # Log-likelihood for each measurement
    log_likes = scipy.special.logsumexp(
        np.vstack([logx1, logx2]), axis=0, b=lse_coeffs
    )

    return np.sum(log_likes)


def mle_mix(n, w_guess):
    """Obtain MLE estimate for parameters for binary mixture
    of Negative Binomials."""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")

        res = scipy.optimize.minimize(
            fun=lambda params, n: -log_like_mix(*params, n),
            x0=[*initial_guess_mix(n, w_guess), w_guess],
            args=(n,),
            method="Powell",
            tol=1e-6,
        )

    if res.success:
        return res.x
    else:
        raise RuntimeError("Convergence failed with message", res.message)
We can now carry out the MLE for both models.
# Pull out Rex1 transcript counts as a Numpy array
n = df['Rex1'].to_numpy()

# Single Negative Binomial MLE
alpha, b = mle_iid_nbinom(n)

# Mixture model MLE
alpha1, b1, alpha2, b2, w = mle_mix(n, 0.2)
46.3 Constructing a Q-Q plot
We will start by making a Q-Q plot for the single Negative Binomial case. The function bebi103.viz.qqplot() generates Q-Q plots. In order to provide the function with samples out of the generative distribution, let’s draw out of a single Negative Binomial with our \(\alpha\), \(b\) parametrization.
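A sketch of the sampling follows; the choice of 1000 replicate data sets, each the same size as the measured one, is ours, as is the use of Numpy's default generator.

rng = np.random.default_rng()

# Parametric bootstrap: 1000 replicate data sets, each the same size as the data
single_samples = np.array(
    [rng.negative_binomial(alpha, 1 / (1 + b), size=len(n)) for _ in range(1000)]
)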
With that in place, we can call bebi103.viz.qqplot() to generate the plot.
p = bebi103.viz.qqplot(
    data=n,
    samples=single_samples,
    x_axis_label="transcript counts",
    y_axis_label="transcript counts",
)

bokeh.io.show(p)
Clearly, the generative model cannot produce the observed transcript counts, since the Q-Q plot shows strong separation of the generative quantiles from the observed quantiles.
46.3.1 Q-Q plot for the mixture model
We can also make a Q-Q plot for the mixture model. We need to write a function to generate data out of the mixture model. As we did in a previous lesson, we first randomly choose a cell type with weight \(w\), and then draw out of a Negative Binomial distribution with parameters corresponding to that cell type.
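A sketch of such a function is below; draw_mix() is our own, hypothetical helper name, and we again choose 1000 replicate data sets.

def draw_mix(alpha1, b1, alpha2, b2, w, size):
    """Draw a sample of given size out of the mixture model."""
    # Choose a cell type for each draw; True means the low-expressing type,
    # which carries weight w in the mixture
    low_cell_type = rng.uniform(size=size) < w

    # Draw out of the Negative Binomial corresponding to each cell type
    out = np.empty(size, dtype=np.int64)
    out[low_cell_type] = rng.negative_binomial(
        alpha1, 1 / (1 + b1), size=low_cell_type.sum()
    )
    out[~low_cell_type] = rng.negative_binomial(
        alpha2, 1 / (1 + b2), size=(~low_cell_type).sum()
    )

    return out


# 1000 replicate data sets out of the mixture model
mixture_samples = np.array(
    [draw_mix(alpha1, b1, alpha2, b2, w, len(n)) for _ in range(1000)]
)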
p = bebi103.viz.qqplot(
    data=n,
    samples=mixture_samples,
    x_axis_label="transcript counts",
    y_axis_label="transcript counts",
)

bokeh.io.show(p)
The Q-Q plot shows much stronger agreement between the theoretical distribution and the observed one.
46.4 Predictive ECDFs
If we wish to accept a model with the MLE for the parameters as a good representation of the true generative model, we would like to see how data generated from that model would look. We can do a parametric bootstrap to generate data sets and then plot confidence intervals of the resulting ECDFs. We call ECDFs that come from the generative model predictive ECDFs. We can try that for the single Negative Binomial model.
Given the samples (which we already computed and stored in single_samples), we can then compute the predictive ECDFs for each of these samples. We want the values of the predictive ECDF at each possible value of \(n\), which requires us to compute the value of the ECDF at an arbitrary value. The function below accomplishes this.
@numba.njit
def ecdf(x, data):
    """Give the value of an ECDF at arbitrary points x."""
    y = np.arange(len(data) + 1) / len(data)
    return y[np.searchsorted(np.sort(data), x, side="right")]
Now we can compute the value of the ECDF for the respective values of \(n\).
n_theor = np.arange(0, single_samples.max() + 1)
ecdfs = np.array([ecdf(n_theor, sample) for sample in single_samples])
Now that we have the ECDF values, we can compute the middle 95% of the ECDF values at each value of \(n\), given by the 2.5th and 97.5th percentiles, and shade the envelope between them.
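A sketch of one way to do this, using bebi103.viz.fill_between() to shade the region between the two percentile curves (the straight-line fill is an approximation to the true discrete staircase):

# Middle 95% of the bootstrapped ECDF values at each value of n
ecdf_low, ecdf_high = np.percentile(ecdfs, [2.5, 97.5], axis=0)

p = bebi103.viz.fill_between(
    x1=n_theor,
    y1=ecdf_low,
    x2=n_theor,
    y2=ecdf_high,
    x_axis_label="n",
    y_axis_label="ECDF",
)

bokeh.io.show(p)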
This is a predictive ECDF. It is the envelope in which we would expect 95% of the data points to lie if the generative model were true.
46.4.1 Comparing with data
We can overlay the ECDF of the data to see if it falls within this envelope.
p = iqplot.ecdf(data=df, q="Rex1", palette=["orange"], p=p)

bokeh.io.show(p)
The Negative Binomial model generated some counts that were quite large, which is why the plot extends past counts of 1000. If you zoom in on the region where the data lie, the deviation of the measured ECDF from the ECDF generated by the model is clear.
We can also compare with the mixture model.
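As a sketch, we can repeat the envelope construction with the mixture samples and overlay the data.

# ECDF values for each replicate drawn from the mixture model
n_theor_mix = np.arange(0, mixture_samples.max() + 1)
ecdfs_mix = np.array([ecdf(n_theor_mix, sample) for sample in mixture_samples])

# Middle 95% envelope
ecdf_low, ecdf_high = np.percentile(ecdfs_mix, [2.5, 97.5], axis=0)

p = bebi103.viz.fill_between(
    x1=n_theor_mix,
    y1=ecdf_low,
    x2=n_theor_mix,
    y2=ecdf_high,
    x_axis_label="n",
    y_axis_label="ECDF",
)

# Overlay the measured ECDF
p = iqplot.ecdf(data=df, q="Rex1", palette=["orange"], p=p)

bokeh.io.show(p)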
46.4.2 Using predictive ECDFs in the bebi103 package
The bebi103 package enables automated generation of predictive ECDFs. To demonstrate its usage, we can generate samples from the mixture model and plot a predictive ECDF.
# Use discrete=True for discrete data
p = bebi103.viz.predictive_ecdf(
    samples=mixture_samples, data=n, discrete=True, x_axis_label="n"
)

bokeh.io.show(p)
Here we see that the predictive ECDF captures the measured data much better. If you zoom into the plot, you will notice that the bebi103.viz.predictive_ecdf() function plots quantiles in various shades of blue. By default, the outermost region (lightest blue) contains the middle 95% of the generated ECDFs, and the next one in captures the middle 68%. The median ECDF from the generative model is a dark blue line in the middle.
46.4.3 ECDF difference
While the predictive ECDF plot is useful, it is perhaps more illuminating to plot the difference between the predictive ECDF and the measured ECDF. We can do this with the diff='ecdf' kwarg.
p = bebi103.viz.predictive_ecdf(
    samples=mixture_samples, data=n, diff="ecdf", discrete=True, x_axis_label="n"
)

bokeh.io.show(p)
It is now clear that very few of the 279 data points lie outside the 95% envelope of the predictive ECDF. We might expect more to lie outside; it is conceivable that the model is too flexible.