50  Factor analysis


# Colab setup ------------------
import os, shutil, sys, subprocess, urllib.request
if "google.colab" in sys.modules:
    cmd = "pip install --upgrade watermark"
    process = subprocess.Popen(cmd.split(), stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    stdout, stderr = process.communicate()
    data_path = "https://s3.amazonaws.com/bebi103.caltech.edu/data/"
else:
    data_path = "../data/"
# ------------------------------
import warnings

import numpy as np
import scipy.io
import scipy.stats as st

import sklearn
import sklearn.decomposition
import sklearn.preprocessing

import bebi103

import bokeh.io
import bokeh.layouts
import bokeh.models
import bokeh.palettes
import bokeh.plotting
bokeh.io.output_notebook()

In Chapter 49, we took a heuristic approach to principal component analysis in which we did the following.

  1. We noted that we have (centered and scaled) high dimensional (dimension \(D\)) observations \(\mathbf{y}\) and posit that they could be generated from \(L\) causes, with \(L < D\), often with \(L \ll D\).
  2. We sought a linear projection of \(D\)-dimensional data \(\mathbf{y}\) onto \(L\)-dimensional latent variables (causes) \(\mathbf{z}\).
  3. We defined a loss function that is the mean squared difference between observed data and data recovered from a linear transformation of the causes.
  4. We minimized the loss function to get the optimal transformation.

To be clear, this transformation is optimal only in the sense of the loss function we defined.

We now approach this problem with more careful probabilistic modeling.

50.1 Formulation of factor analysis

In factor analysis, we have a similar goal as in PCA in that we define a model \(\mathbf{y} = h(\mathbf{z};\theta)\), where \(\theta\) is some set of parameters, and we want to infer the function \(h(\mathbf{z};\theta)\). As in PCA, we restrict \(h(\mathbf{z};\theta)\) to be a linear function defined by a \(D\times L\) loading matrix \(\mathsf{W}\) such that

\[\begin{align} h(\mathbf{z};\mathsf{W}) = \mathsf{W}\cdot \mathbf{z} + \boldsymbol{\mu}. \end{align} \]

Here we have included an additive term \(\boldsymbol{\mu}\) that we had set to zero in our treatment of PCA.

50.1.1 Generative distribution of factor analysis

In a factor analysis, in addition to choosing a linear function \(h(\mathbf{z};\mathsf{W})\), we assume an i.i.d. Normal likelihood,

\[\begin{align} \mathbf{y}_i \mid \boldsymbol{\mu}, \mathsf{W}, \mathbf{z}_i, \mathsf{\Psi} \sim \text{Norm}(\mathsf{W}\cdot \mathbf{z}_i + \boldsymbol{\mu}, \mathsf{\Psi})\;\forall i, \end{align} \]

where \(\mathsf{\Psi}\) is a \(D\times D\) covariance matrix describing how the observed data vary from the linear model. We typically take \(\mathsf{\Psi}\) to be diagonal, meaning that the residuals of the observations are uncorrelated (though the observations are correlated through the latent variables \(\mathbf{z}\)). We therefore define the vector \(\boldsymbol{\sigma}\) to be the diagonal elements of \(\mathsf{\Psi}\) such that \(\mathsf{\Psi} = \text{diag}(\boldsymbol{\sigma}^2)\), where \(\boldsymbol{\sigma}^2\) denotes element-wise squaring of the elements of \(\boldsymbol{\sigma}\).

We need to specify a prior on the latent variables \(\mathbf{z}\). In factor analysis, we take the latent variables to be \(L\)-variate Normally distributed,

\[\begin{align} \mathbf{z}_i \mid \boldsymbol{\mu}_0, \mathsf{\Sigma}_0 \sim \text{Norm}(\boldsymbol{\mu}_0, \mathsf{\Sigma}_0)\;\forall i. \end{align}\]

Note that since we are using a linear model and \(\mathbf{z}_i\) gets transformed according to \(\mathsf{W}\cdot \mathbf{z}_i + \boldsymbol{\mu}\), we can absorb \(\boldsymbol{\mu}_0\) and \(\mathsf{\Sigma}_0\) into \(\mathsf{W}\) and \(\boldsymbol{\mu}\). Absorbing these as described, we can write

\[\begin{align} \mathbf{z}_i \sim \text{Norm}(\mathbf{0}, \mathsf{I})\;\forall i, \end{align} \]

where \(\mathsf{I}\) is the \(L\times L\) identity matrix. The full generative model for a factor analysis is then

\[\begin{align} &\boldsymbol{\mu} \sim \text{prior for }\boldsymbol{\mu}, \\[1em] &\mathsf{W} \sim \text{prior for }\mathsf{W}, \\[1em] &\boldsymbol{\sigma} \sim \text{prior for }\boldsymbol{\sigma} \\[1em] &\mathsf{\Psi} = \text{diag}(\boldsymbol{\sigma}^2) \\[1em] &\mathbf{z}_i \sim \text{Norm}(\mathbf{0}, \mathsf{I})\;\forall i, \\[1em] &\mathbf{y}_i \mid\boldsymbol{\mu}, \mathsf{W}, \mathbf{z}_i, \mathsf{\Psi} \sim \text{Norm}(\mathsf{W}\cdot \mathbf{z}_i + \boldsymbol{\mu}, \mathsf{\Psi})\; \forall i \end{align} \]
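To make this generative model concrete, here is a minimal simulation sketch in Numpy. The dimensions, parameter values, and variable names (mu_sim, W_sim, and so on) are arbitrary illustrative choices, not quantities used elsewhere in this chapter.

# Minimal sketch: simulate data from the factor analysis generative model.
# All dimensions and parameter values here are arbitrary, for illustration only.
rng = np.random.default_rng(3252)

N, D, L = 500, 10, 2

# Parameters (in a full analysis these would be drawn from priors or inferred)
mu_sim = rng.normal(0.0, 1.0, size=D)             # location parameter, D-vector
W_sim = rng.normal(0.0, 1.0, size=(D, L))         # loading matrix, D x L
sigma_sim = np.abs(rng.normal(0.0, 0.3, size=D))  # noise scale for each dimension

# Latent variables, z_i ~ Norm(0, I)
z_sim = rng.normal(0.0, 1.0, size=(N, L))

# Observations, y_i ~ Norm(W . z_i + mu, diag(sigma^2))
y_sim = z_sim @ W_sim.T + mu_sim + rng.normal(0.0, sigma_sim, size=(N, D))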

For clarity and for reference, the dimensions of the variables are shown in the table below. The index \(i\) indexes data points, and there are \(N\) of them.

| Variable | Description | Dimension |
|---|---|---|
| \(\boldsymbol{\mu}\) | location parameter of measurements | \(D\)-vector |
| \(\mathsf{W}\) | loading matrix | \(D \times L\) matrix |
| \(\boldsymbol{\sigma}\) | standard deviation for each measurement | positive \(D\)-vector |
| \(\mathsf{\Psi}\) | matrix representation of \(\boldsymbol{\sigma}^2\) | \(D \times D\) nonnegative diagonal matrix |
| \(\mathbf{z}_i\) | latent variable for datum \(i\) | \(L\)-vector |
| \(\mathbf{y}_i\) | datum \(i\) | \(D\)-vector |

Obviously, we need priors for \(\mathsf{W}\), \(\boldsymbol{\mu}\), and \(\mathsf{\Psi}\). We will temporarily leave those priors unspecified and turn our attention to marginal distributions and then to identifiability of the parameters of a factor analysis.

50.1.2 Marginal likelihood in factor analysis

We note that we can write a marginalized likelihood for datum \(i\) as

\[\begin{align} f(\mathbf{y}_i\mid \boldsymbol{\mu}, \mathsf{W}, \mathsf{\Psi}) &= \int\mathrm{d}\mathbf{z}_i\,g(\mathbf{y}_i, \mathbf{z}_i\mid \boldsymbol{\mu}, \mathsf{W}, \mathsf{\Psi}) = \int\mathrm{d}\mathbf{z}_i\,f(\mathbf{y}_i\mid \boldsymbol{\mu}, \mathsf{W}, \mathbf{z}_i, \mathsf{\Psi})\,g(\mathbf{z}_i\mid \boldsymbol{\mu}, \mathsf{W}, \mathsf{\Psi}) =\int\mathrm{d}\mathbf{z}_i\,f(\mathbf{y}_i\mid \boldsymbol{\mu}, \mathsf{W}, \mathbf{z}_i, \mathsf{\Psi})\,g(\mathbf{z}_i), \end{align} \]

where we have used the fact that the latent variables \(\mathbf{z}_i\) are not conditioned on the parameters \(\boldsymbol{\mu}\), \(\mathsf{W}\), and \(\mathsf{\Psi}\). Computing the integral, we get a marginalized likelihood of

\[\begin{align} f(\mathbf{y}_i\mid \boldsymbol{\mu}, \mathsf{W}, \mathsf{\Psi}) = \int\mathrm{d}\mathbf{z}_i\,\mathcal{N}(\mathbf{y}_i\mid \mathsf{W}\cdot \mathbf{z}_i + \boldsymbol{\mu}, \mathsf{\Psi})\,\mathcal{N}(\mathbf{z}_i\mid\mathbf{0}, \mathsf{I}) = \mathcal{N}(\mathbf{y}_i\mid \boldsymbol{\mu},\mathsf{W}\cdot\mathsf{W}^\mathsf{T} + \mathsf{\Psi}), \end{align} \]

where we have used properties of Gaussian integrals to compute the integral, such that

\[\begin{align} \mathbf{y}_i\mid \boldsymbol{\mu}, \mathsf{W}, \mathsf{\Psi}\sim \text{Norm}(\boldsymbol{\mu},\mathsf{W}\cdot\mathsf{W}^\mathsf{T} + \mathsf{\Psi}). \end{align} \tag{50.1}\]

This marginalized likelihood makes clear what factor analysis does: it models the observed data as draws from a \(D\)-dimensional Gaussian distribution with mean \(\boldsymbol{\mu}\) and a covariance matrix \(\mathsf{W}\cdot\mathsf{W}^\mathsf{T} + \mathsf{\Psi}\) whose low-rank-plus-diagonal structure is determined by the low-dimensional latent space.
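As a quick numerical check of Equation 50.1 (continuing the illustrative simulation sketch above, with the same made-up parameters), the sample covariance of simulated data should approach \(\mathsf{W}\cdot\mathsf{W}^\mathsf{T} + \mathsf{\Psi}\) as the number of samples grows.

# Covariance implied by Equation 50.1 for the simulated parameters
cov_marginal = W_sim @ W_sim.T + np.diag(sigma_sim**2)

# Sample covariance of the simulated data (rows are observations)
cov_sample = np.cov(y_sim, rowvar=False)

# Largest absolute discrepancy; this shrinks as N grows
np.max(np.abs(cov_marginal - cov_sample))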

50.1.3 Recognition distribution in factor analysis

The recognition distribution is the distribution of latent variables conditioned on the observed data, \(g(\mathbf{z}_i\mid \mathbf{y}_i, \boldsymbol{\mu}, \mathsf{W}, \mathsf{\Psi})\). Bayes’s theorem gives

\[\begin{align} g(\mathbf{z}_i \mid \mathbf{y}_i, \boldsymbol{\mu}, \mathsf{W}, \mathsf{\Psi}) = \frac{f(\mathbf{y}_i\mid \boldsymbol{\mu}, \mathsf{W}, \mathbf{z}_i, \mathsf{\Psi})\,g(\mathbf{z}_i)}{f(\mathbf{y}_i\mid \boldsymbol{\mu}, \mathsf{W}, \mathsf{\Psi})}. \end{align} \]

Because the denominator has no \(\mathbf{z}_i\)-dependence, it is a normalization constant. The recognition distribution is therefore proportional to a product of two Normal distributions, since

\[\begin{align} &\mathbf{y}_i \mid\boldsymbol{\mu}, \mathsf{W}, \mathbf{z}_i, \mathsf{\Psi} \sim \text{Norm}(\mathsf{W}\cdot \mathbf{z}_i + \boldsymbol{\mu}, \mathsf{\Psi}),\\[1em] &\mathbf{z}_i \sim \text{Norm}(\mathbf{0}, \mathsf{I}). \end{align} \]

After some algebraic grunge, we can work out that

\[\begin{align} \mathbf{z}_i\mid \mathbf{y}_i, \boldsymbol{\mu}, \mathsf{W}, \mathsf{\Psi} \sim \text{Norm}(\boldsymbol{\mu}_{\mathbf{z}\mid\mathbf{y}_i}, \mathsf{\Sigma}_{\mathbf{z}}), \end{align} \]

where

\[\begin{align} \mathsf{\Sigma}_{\mathbf{z}} = (\mathsf{I} + \mathsf{W}^\mathsf{T}\cdot\mathsf{\Psi}^{-1}\cdot\mathsf{W})^{-1} \end{align} \tag{50.2}\]

and

\[\begin{align} \boldsymbol{\mu}_{\mathbf{z}\mid\mathbf{y}_i} = \mathsf{\Sigma}_{\mathbf{z}} \cdot\mathsf{W}^\mathsf{T}\cdot \mathsf{\Psi}^{-1}\cdot(\mathbf{y}_i - \boldsymbol{\mu}). \end{align} \tag{50.3}\]

We note that evaluating the recognition distribution is part of the E step of an EM algorithm for finding the MAP of a factor model, as described in Section 50.2.
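Equations 50.2 and 50.3 are straightforward to evaluate numerically. As a sketch, a small helper function like the one below (the name recognition_params is ours, not part of scikit-learn or bebi103) computes the recognition distribution parameters for a single datum from \(\mathsf{W}\), \(\boldsymbol{\sigma}\), \(\boldsymbol{\mu}\), and \(\mathbf{y}\).

def recognition_params(W, sigma, mu, y):
    """Mean and covariance of the recognition distribution for one datum.

    Evaluates Equations 50.2 and 50.3, assuming Psi = diag(sigma**2).
    Illustrative helper only; not part of scikit-learn or bebi103.
    """
    # Inverse of the diagonal covariance matrix Psi
    Psi_inv = np.diag(1.0 / sigma**2)

    # Equation 50.2: covariance, the same for every datum
    Sigma_z = np.linalg.inv(np.eye(W.shape[1]) + W.T @ Psi_inv @ W)

    # Equation 50.3: mean for this particular datum
    mu_z = Sigma_z @ W.T @ Psi_inv @ (y - mu)

    return mu_z, Sigma_z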

50.1.4 Nonidentifiability of the factor loading matrix and latent variables

In general, the factor loading matrix \(\mathsf{W}\) and latent variables \(\mathbf{z}\) are not identifiable. This can be seen by considering an arbitrary \(L\times L\) orthonormal matrix \(\mathsf{Q}\) (meaning that the rows and columns of \(\mathsf{Q}\) are orthonormal vectors, such that \(\mathsf{Q}\cdot\mathsf{Q}^\mathsf{T} = \mathsf{I}\)). We define a new loading matrix and latent variable vector according to

\[\begin{align} &\mathsf{W}_\mathrm{new} = \mathsf{W} \cdot \mathsf{Q}^\mathsf{T}, \\[1em] &\mathbf{z}_\mathrm{new} = \mathsf{Q}\cdot\mathbf{z}. \end{align}\]

Then,

\[\begin{align} \mathsf{W}_\mathrm{new}\cdot \mathbf{z}_\mathrm{new} = \mathsf{W} \cdot \mathsf{Q}^\mathsf{T}\cdot\mathsf{Q}\cdot\mathbf{z} = \mathsf{W}\cdot\mathbf{z}. \end{align}\]

Thus, upon multiplication by an orthonormal matrix, two different loading matrices and vectors of latent variables give exactly the same likelihood.
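We can verify this numerically with a quick sketch. Here we build a random orthonormal \(\mathsf{Q}\) from a QR decomposition (our own choice of construction, with made-up parameters) and check that the covariance \(\mathsf{W}\cdot\mathsf{W}^\mathsf{T} + \mathsf{\Psi}\) appearing in the marginal likelihood of Equation 50.1 is unchanged.

# Sketch: rotational nonidentifiability with arbitrary illustrative parameters
rng = np.random.default_rng(3252)
D, L = 10, 2
W_demo = rng.normal(size=(D, L))
Psi_demo = np.diag(np.abs(rng.normal(0.0, 0.3, size=D)) ** 2)

# Random L x L orthonormal matrix from a QR decomposition
Q, _ = np.linalg.qr(rng.normal(size=(L, L)))

# Rotated loading matrix
W_new = W_demo @ Q.T

# The covariance in the marginal likelihood is unchanged by the rotation
np.allclose(W_demo @ W_demo.T + Psi_demo, W_new @ W_new.T + Psi_demo)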

This presents a real problem! The variables we are most interested in, the latent variables and the factor loading matrix, are not in general identifiable. We will discuss methods to deal with this nonidentifiability momentarily, but for now we will proceed without it giving us too much concern. Chris Bishop gives a nice motivation for this approach in his book Pattern Recognition and Machine Learning.

We shall view factor analysis as a form of latent variable density model, in which the form of the latent space is of interest but not the particular choice of coordinates used to describe it.

50.1.5 Specifying priors

We have left priors for \(\boldsymbol{\mu}\), \(\mathsf{W}\), and \(\mathsf{\Psi}\) unspecified. The rotational nonidentifiability makes prior assignment difficult, as we would like to ascribe a prior that breaks the nonidentifiability. There are several approaches to this. One is to take the same approach we did with PCA (see Chapter 49), and insist in our prior that the columns of \(\mathsf{W}\) are orthonormal. This breaks much of the nonidentifiability because in this case, \(\mathsf{W}\cdot \mathsf{Q}^\mathsf{T}\) is itself orthonormal, as can be shown by noting that

\[\begin{align} \left(\mathsf{W}\cdot \mathsf{Q}^\mathsf{T}\right)^\mathsf{T} \cdot \left(\mathsf{W}\cdot \mathsf{Q}^\mathsf{T}\right) = \mathsf{Q} \cdot \mathsf{W}^\mathsf{T} \cdot \mathsf{W}\cdot \mathsf{Q}^\mathsf{T} = \mathsf{Q} \cdot \mathsf{Q}^\mathsf{T} = \mathsf{I}. \end{align}\]

However, any given column of \(\mathsf{W}\) can still be arbitrarily multiplied by \(-1\), so \(\mathsf{W}\), and therefore also the latent variables \(\mathbf{z}\), are not identifiable (see also Section 49.1.6).

Even if we wanted to formulate a prior to insist that the columns of \(\mathsf{W}\) are orthonormal, this is difficult to implement. An easier, but fraught, approach is to force \(\mathsf{W}\) to be lower triangular. Forcing this upon the loading matrix can lead to constraints on the resulting latent variables \(\mathbf{z}\), clouding their interpretation.

In our analysis, we will punt on this and take the egregious step of leaving the priors unspecified, following Bishop’s philosophy that we are using the factor analysis to expose the form of the latent space and need not concern ourselves with the particularities of coordinate choices. We will also forgo evaluating the full posterior using MCMC because of the awful nonidentifiabilities and will instead proceed to find a MAP (using an indefinite article because there are many MAPs!).

(Rick Farouni has written a couple of nice blog posts about the nonidentifiability challenges of factor analysis.)

50.2 EM for factor analysis

As a latent variable model, factor analysis is well-suited for the EM algorithm. We can take the usual approach for deriving the EM algorithm by evaluating the recognition distribution, defining the parameter-dependent part of the surrogate function, and then finding the parameter values that maximize it. It turns out that the MAP estimate for \(\boldsymbol{\mu}\) is the sample mean, so that can be computed directly outside of the EM algorithm.

\[\begin{align} \boldsymbol{\mu}_\mathrm{MAP} = \bar{\mathbf{y}} = \frac{1}{N}\sum_{i=1}^N \mathbf{y}_i. \end{align} \]

In what follows, recall that \(\mathsf{\Psi} = \text{diag}(\boldsymbol{\sigma}^2)\). We also define

\[\begin{align} \hat{\boldsymbol{\sigma}}^2 = \frac{1}{N}\sum_{i=1}^N (\mathbf{y}_i - \bar{\mathbf{y}}) \odot (\mathbf{y}_i - \bar{\mathbf{y}}) \end{align} \]

to be the plug-in estimate for \(\boldsymbol{\sigma}^2\), where the symbol \(\odot\) denotes element-wise multiplication. With these definitions in hand, the EM algorithm proceeds as follows; a minimal code sketch follows the list of steps.

  1. Initialize \(\mathsf{W}\) and \(\boldsymbol{\sigma}\).
  2. Compute \(\mathsf{\Sigma}_\mathbf{z} = (\mathsf{I} + \mathsf{W}^\mathsf{T}\cdot\mathsf{\Psi}^{-1}\cdot \mathsf{W})^{-1}\). (E step)
  3. Compute the \(\mathbf{E}_i = \mathsf{\Sigma}_\mathbf{z}\cdot\mathsf{W}^\mathsf{T}\cdot\mathsf{\Psi}^{-1}\cdot(\mathbf{y}_i - \bar{\mathbf{y}})\) for all \(i\). (E step)
  4. Compute \(\mathsf{E}_{ii} = \mathsf{\Sigma}_\mathbf{z} + \mathbf{E}_i\cdot \mathbf{E}_i^\mathsf{T}\) for all \(i\). (E step)
  5. Compute \(\displaystyle{\mathsf{W}^* = \left(\sum_{i=1}^N(\mathbf{y}_i - \bar{\mathbf{y}})\cdot \mathbf{E}_i^\mathsf{T}\right)\left(\sum_{i=1}^N \mathsf{E}_{ii}\right)^{-1}}\). (M step)
  6. Compute \(\displaystyle{(\boldsymbol{\sigma}^*)^2 = \hat{\boldsymbol{\sigma}}^2 - \text{diag}\left(\mathsf{W}^*\cdot\frac{1}{N}\sum_{i=1}^N \mathbf{E}_i\cdot(\mathbf{y}_i - \bar{\mathbf{y}})^\mathsf{T}\right)}\), where \(\text{diag}(\cdot)\) here extracts the diagonal of a matrix as a vector. (M step)
  7. If the difference between the log posterior (which in our case is equal to the log likelihood) evaluated at \(\boldsymbol{\mu}_\mathrm{MAP}\), \(\mathsf{W}^*\), and \(\boldsymbol{\sigma}^*\) and the log posterior evaluated at \(\boldsymbol{\mu}_\mathrm{MAP}\), \(\mathsf{W}\), and \(\boldsymbol{\sigma}\) is smaller than a predefined threshold, STOP and record \(\boldsymbol{\mu}_\mathrm{MAP}\), \(\mathsf{W}^*\), and \(\boldsymbol{\sigma}^*\) as the MAP estimate. Otherwise, continue to step 8.
  8. Set \(\mathsf{W} = \mathsf{W}^*\) and \(\boldsymbol{\sigma} = \boldsymbol{\sigma}^*\).
  9. Go to 2.
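For concreteness, a minimal Numpy sketch of these steps might look like the function below. This is our own illustration, assuming flat priors (so the MAP coincides with the MLE) and using a fixed number of iterations in place of the convergence check in step 7; in the next section we use scikit-learn's implementation instead.

def factor_analysis_em(y, L, n_iter=200, seed=3252):
    """Minimal sketch of EM for factor analysis on an N x D array `y`."""
    N, D = y.shape

    # MAP estimate of mu is the sample mean; work with centered data
    y_bar = y.mean(axis=0)
    x = y - y_bar

    # Plug-in estimate of sigma^2 (element-wise)
    sigma2_hat = np.mean(x**2, axis=0)

    # Step 1: initialize W and sigma
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(D, L))
    sigma2 = sigma2_hat.copy()

    for _ in range(n_iter):
        # Steps 2-4 (E step): recognition distribution quantities
        Psi_inv = np.diag(1.0 / sigma2)
        Sigma_z = np.linalg.inv(np.eye(L) + W.T @ Psi_inv @ W)
        E = x @ Psi_inv @ W @ Sigma_z        # N x L; row i is E_i
        sum_E_ii = N * Sigma_z + E.T @ E     # sum over i of E_ii

        # Step 5 (M step): update the loading matrix
        W = (x.T @ E) @ np.linalg.inv(sum_E_ii)

        # Step 6 (M step): update sigma^2 using the new W
        sigma2 = sigma2_hat - np.diag(W @ (E.T @ x) / N)

    return y_bar, W, np.sqrt(sigma2)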

50.2.1 Factor analysis with scikit-learn

We will not code up a full EM algorithm here as we did for Gaussian mixture models (see the EM example in that chapter), since it is not particularly instructive. Instead, we will use the implementation in scikit-learn. Note, though, that if we did have priors on the parameters, we would need to update the above algorithm and implement it ourselves, since scikit-learn’s implementation does not include priors.

We will use the same data set from Remedios et al. that we used in Section 49.2. We first load in the data set.

# Load in the data set and separate into Numpy arrays
data = scipy.io.loadmat(os.path.join(data_path, 'hypothalamus_calcium_imaging_remedios_et_al.mat'))
neural_data = data['neural_data']
attack_vector = data['attack_vector'].flatten()
sex_vector = data['sex_vector'].flatten()

# Time points at 30 Hz sampling
t = np.arange(len(attack_vector)) / 30

Scikit-learn’s implementation of factor analysis operates on centered data. Because we know the MAP value of \(\boldsymbol{\mu}\) analytically (it is the sample mean), we can go ahead and compute it directly.

# MAP estimate for parameter mu
mu = np.mean(neural_data, axis=1)

We should therefore center the data. We will not scale the data set, since the factor analysis involves estimation of the variance via the parameter \(\boldsymbol{\sigma}\) along each dimension.

# Center the data by subtracting the mean (no scaling)
scaler = sklearn.preprocessing.StandardScaler(with_std=False).fit(neural_data.T)
y = scaler.transform(neural_data.T)

Next, we can instantiate a FactorAnalysis instance, which requires specification of the number of components. We will again use two components. We will go ahead and run the fit() method to perform the factor analysis.

with warnings.catch_warnings():
    warnings.simplefilter("ignore")

    fa = sklearn.decomposition.FactorAnalysis(n_components=2).fit(y)

We can now explore the results of the factor analysis. The MAP estimate for the location parameter (up to an additive constant) of the recognition distribution, \(\boldsymbol{\mu}_{\mathbf{z}\mid\mathbf{y}_i}\), is computed using the transform() method.

mu_z = fa.transform(y)

The parameter \(\boldsymbol{\sigma}^2\) is given by the noise_variance_ attribute, from which we compute \(\boldsymbol{\sigma}\).

sigma = np.sqrt(fa.noise_variance_)

The loading matrix is given by the components_ attribute.

# D x L loading matrix is the transpose of the components_ attribute
W = fa.components_.T

Scikit-learn does not compute the covariance matrix of the recognition distribution, \(\mathsf{\Sigma}_\mathbf{z}\). As shown in Section 50.1.3, \(\mathsf{\Sigma}_\mathbf{z}\) is the same for all data points, given by Equation 50.2. We can therefore compute it.

Sigma = np.linalg.inv(np.eye(2) + W.T @ np.diag(1 / fa.noise_variance_) @ W)

# Take a look
Sigma
array([[1.41289949e-02, 1.09676565e-15],
       [1.09645742e-15, 2.29261582e-02]])

Evidently, \(\mathsf{\Sigma}_\mathbf{z}\) is approximately diagonal. This is not guaranteed in general for factor analysis, but it holds here: the components of the latent variables are essentially independent of each other.

50.2.2 Visualizing FA results

In a factor analysis, instead of having a point estimate for the latent variables \(\mathbf{z}\) as in PCA, we have a posterior distribution for them. We have already computed its parameters \(\boldsymbol{\mu}_{\mathbf{z}\mid\mathbf{y}}\) and \(\mathsf{\Sigma}_\mathbf{z}\). If we want to plot one latent variable against another, we can do so using ellipse glyphs that encompass 95% of the posterior probability mass. To plot an ellipse, Bokeh requires its major axis length, minor axis length, and angle from the horizontal. Let \(\lambda_1\) and \(\lambda_2\) be the eigenvalues of \(\mathsf{\Sigma}_\mathbf{z}\) with \(\lambda_1 \le \lambda_2\), and let \(\mathbf{v}_1\) and \(\mathbf{v}_2\) be the corresponding eigenvectors. Then, for a 2D Normal distribution, the ellipse that encompasses a fraction \(p\) of the total posterior probability mass has:

  • Major axis length: \(2\sqrt{\lambda_2\,\chi_2^2(p)}\)
  • Minor axis length: \(2\sqrt{\lambda_1\,\chi_2^2(p)}\)
  • Angle from horizontal: \(\arctan(v_{2,y} / v_{2,x})\)

Here, \(\chi_2^2(p)\) denotes the \(p\)th quantile (the inverse CDF evaluated at \(p\)) of a chi-squared distribution with two degrees of freedom. We can calculate these quantities from \(\mathsf{\Sigma}_{\mathbf{z}}\).

chi = st.chi2.ppf(0.95, df=2)

# Eigensystem (ordered using eigh)
lam, v = np.linalg.eigh(Sigma)

# Major and minor axes
b = 2 * np.sqrt(lam[1] * chi)
a = 2 * np.sqrt(lam[0] * chi)

# Angle of the major axis from the horizontal
angle = np.arctan2(v[1, 1], v[0, 1])

Now let’s make the plot!

p = bokeh.plotting.figure(
    frame_height=350,
    frame_width=350,
    x_axis_label="z₁",
    y_axis_label="z₂",
)

# Set up data for the plot
time_color = bebi103.viz.q_to_color(t, bokeh.palettes.Viridis256)
sex_color = ["orchid" if s else "dodgerblue" for s in sex_vector]
cds = bokeh.models.ColumnDataSource(dict(z1=mu_z[:, 0], z2=mu_z[:, 1], t=t, color=time_color))

# We'll allow selection of which color we want to visualize
color_selector = bokeh.models.RadioButtonGroup(labels=["time", "sex"], active=0)
js_code = """
cds.data['color'] = color_selector.active == 0 ? time_color : sex_color;
cds.change.emit();
"""

color_selector.js_on_event(
    "button_click",
    bokeh.models.CustomJS(
        code=js_code,
        args=dict(
            time_color=time_color,
            sex_color=sex_color,
            cds=cds,
            color_selector=color_selector,
        ),
    ),
)

# Populate glyphs
p.ellipse(source=cds, x="z1", y="z2", width=b, height=a, angle=angle, alpha=0.2, color="color")

# Lay out and show!
bokeh.io.show(
    bokeh.layouts.row(p, bokeh.layouts.column(bokeh.models.Div(text="Color by"), color_selector)),
)

This is a very similar result to what we got when doing PCA, but now we can see a measure of the posterior variation in the inferred latent variables.

Reconstruction of the data based on the reduced representation is not as straightforward, as we would have to sample out of the generative distribution parametrized by the posterior distribution. Instead, we can take \(\boldsymbol{\mu}_{\mathbf{z}\mid \mathbf{y}}\) as a point estimate for the latent variables \(\mathbf{z}\).

# Compute the reconstruction
y_hat = np.dot(W, mu_z.T).T

# Uncenter the reconstruction
y_hat = scaler.inverse_transform(y_hat)

# Kwargs for plotting
kwargs = dict(
    x_interpixel_distance=1/30,
    y_interpixel_distance=1,
    frame_height=150,
    frame_width=600,
    x_axis_label="time (s)",
    y_axis_label="neuron",
)

# Plot original data
p_neuron = bebi103.image.imshow(neural_data, **kwargs)

# Plot reconstructed data
p_hat = bebi103.image.imshow(y_hat.T, **kwargs)

p_neuron.yaxis.major_label_text_font_size = '0pt'
p_hat.yaxis.major_label_text_font_size = '0pt'

bokeh.io.show(bokeh.layouts.column(p_neuron, p_hat))

Computing environment

%load_ext watermark
%watermark -v -p numpy,scipy,sklearn,bebi103,bokeh,jupyterlab
Python implementation: CPython
Python version       : 3.12.9
IPython version      : 9.1.0

numpy     : 2.1.3
scipy     : 1.15.2
sklearn   : 1.6.1
bebi103   : 0.1.27
bokeh     : 3.6.2
jupyterlab: 4.4.3