49.1 Principal component analysis: A heuristic approach
In our discussion of principal component analysis, we will first approach the problem heuristically without defining a full generative model. We will then demonstrate a Bayesian model behind this heuristic.
49.1.1 The problem of PCA
We often obtain high-dimensional data sets, in which each observation has many dimensions. In the parlance of practitioners of machine learning, each multidimensional observation is called a sample and each dimension is called a feature. We will use the observation/dimension and sample/feature nomenclature interchangeably. Considering the Palmer’s penguin data set, each observation (sample) is a penguin, and the dimensions (features) are bill depth, bill length, flipper length, and body mass.
The features, though many, may be related. We can imagine that there are unobservable variables, called latent variables, at play that influence many features. For example, consider the four features of the penguins. They could all scale with the caloric intake of the penguins, in which case the latent variable of caloric intake could be used to predict all four of the features; it is responsible for most of the variation in what is observed. (We have seen latent variables before with the \(z_{ik}\) in the Gaussian mixture models we encountered in Chapter 41.)
Let us now formalize this. Let \(\mathbf{y}\in \mathbb{R}^D\) be a set of observable features. There are \(D\) of them. In the case of the penguins, \(D = 4\). Let \(\mathbf{z} \in \mathbb{R}^L\) be the set of latent variables that influence the observations. Typically, \(L < D\) and often \(L \ll D\), and, as such, if we can infer the latent variable \(\mathbf{z}\), we say we have achieved dimensionality reduction. Specifically, we define a model for \(\mathbf{y}\), \(\mathbf{y} = h(\mathbf{z};\theta)\), where \(\theta\) is some set of parameters. So, to achieve dimensionality reduction, we want to find a function \(h(\mathbf{z};\theta)\) that most faithfully reproduces the observed \(\mathbf{y}\).
49.1.2 Linear \(h(\mathbf{z};\theta)\)
As is commonly done, we will restrict the functions \(h(\mathbf{z};\theta)\) we choose to be linear functions. Further, we stipulate that \(h(\mathbf{z};\theta)\) serves to orthogonally project \(\mathbf{z}\) into the space of the observations,

\[
\hat{\mathbf{y}} = h(\mathbf{z};\theta) = \mathsf{W}\cdot\mathbf{z},
\]

where the parameters \(\theta\) are given by the \(D \times L\) matrix \(\mathsf{W}\), commonly referred to as a loading matrix. This matrix is semiorthogonal, such that its columns are orthonormal vectors and \(\mathsf{W}^\mathsf{T}\cdot \mathsf{W} = \mathsf{I}\), the \(L\times L\) identity matrix. That is, if \(\mathbf{w}_j\) and \(\mathbf{w}_k\) are columns of \(\mathsf{W}\), then

\[
\mathbf{w}_j^\mathsf{T}\cdot\mathbf{w}_k = \delta_{jk},
\]

where \(\delta_{jk}\) is the Kronecker delta.
We seek a loading matrix \(\mathsf{W}\) and set of latent variables \(\{\mathbf{z}\}\) that most faithfully reproduce the measured data \(\{\mathbf{y}\}\), where I am using the braces to denote a set of data; that is, \(\{\mathbf{y}\} = \{\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_N\}\), where \(N\) is the number of observations we have made. In other words, we want \(\hat{\mathbf{y}}_i\) to be as close as possible to \(\mathbf{y}_i\) for all \(i\). Toward this end, we define a loss function as the mean square distance between \(\mathbf{y}_i\) and \(\hat{\mathbf{y}}_i\).
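Written out explicitly (denoting the loss by \(\ell\); the \(1/N\) prefactor is a convention and does not affect the minimizer), the mean square distance is

\[
\ell\left(\mathsf{W}, \{\mathbf{z}\}\right) = \frac{1}{N}\sum_{i=1}^N \left\lVert \mathbf{y}_i - \hat{\mathbf{y}}_i \right\rVert^2 = \frac{1}{N}\sum_{i=1}^N \left\lVert \mathbf{y}_i - \mathsf{W}\cdot\mathbf{z}_i \right\rVert^2,
\]

to be minimized over \(\mathsf{W}\) and \(\{\mathbf{z}\}\) subject to \(\mathsf{W}^\mathsf{T}\cdot\mathsf{W} = \mathsf{I}\).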
The goal of principal component analysis is to find the \(\mathsf{W}\) that minimizes this loss function. Once this optimal \(\mathsf{W}\) is found, we compute \(\hat{\mathbf{z}} = \mathsf{W}^\mathsf{T}\cdot \mathbf{y}\); the entries of \(\hat{\mathbf{z}}\) are called the principal components.
49.1.3 The optimal projection
In this section, we work out what loading matrix \(\mathsf{W}\) optimizes the above loss function. Be careful not to get lost in the weeds of the mathematics that follows. Just remember that we are simply finding the projection that minimizes the mean squared difference between measured and projected data.
As we work to find the optimal projection, it will help us to write the loss function in terms of the columns of \(\mathsf{W}\).
We wish to find the \(\{\mathbf{w}_j\}\) and \(\{\mathbf{z}_i\}\) that minimize the loss function. To do so while enforcing the orthonormality of \(\mathsf{W}\), we define a Lagrangian
where the Lagrange multipliers denoted with \(\lambda\) enforce the normality of the columns of \(\mathsf{W}\) and those denoted with \(\mu\) enforce the orthogonality of the columns of \(\mathsf{W}\).
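Explicitly, one way to write a Lagrangian with these constraints (the signs attached to the multipliers are a convention, and \(z_{ij}\) denotes the \(j\)th entry of \(\mathbf{z}_i\)) is

\[
\mathcal{L} = \frac{1}{N}\sum_{i=1}^N \left\lVert \mathbf{y}_i - \sum_{j=1}^L z_{ij}\,\mathbf{w}_j \right\rVert^2 + \sum_{j=1}^L \lambda_j\left(1 - \mathbf{w}_j^\mathsf{T}\cdot\mathbf{w}_j\right) + \sum_{j=1}^L\sum_{k=j+1}^L \mu_{jk}\,\mathbf{w}_j^\mathsf{T}\cdot\mathbf{w}_k.
\]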
Because the loss function and the normality constraints are all strictly convex, the necessary and sufficient conditions for minimizing the loss function are the Karush-Kuhn-Tucker (KKT) conditions,
for all \(i\in[1,N]\), \(j\in[1,L]\), and \(k \in [j+1,L]\). To avoid stumbling over indices that are present in sums, I prefer to differentiate with respect to a variable with a specific index, e.g., with respect to \(z_{lm}\). With that in mind, let us compute the derivatives of the KKT conditions one by one, in the reverse of the order in which they are listed above.
noting that \(m > l\). Not surprisingly, this means that \(\mathbf{w}_j^\mathsf{T}\cdot\mathbf{w}_k = 0\) for all \(j\ne k\); the columns of \(\mathsf{W}\) are orthogonal.
Turning now to the derivative with respect to the \(\lambda\)’s,
which means that \(\mathbf{w}_j^\mathsf{T}\cdot\mathbf{w}_j = 1\). With the previous KKT condition, this means that the columns of \(\mathsf{W}\) are orthonormal, or \(\mathbf{w}_j^\mathsf{T}\cdot\mathbf{w}_k = \delta_{jk}\), where \(\delta_{jk}\) is the Kronecker delta.
where we have used \(\mathbf{w}_m^\mathsf{T}\cdot \mathbf{w}_k = \delta_{mk}\), which we worked out from the first two KKT conditions we considered. Thus, we have \(z_{lm} = \mathbf{w}_m^\mathsf{T}\cdot\mathbf{y}_l\) for all \(l\) and \(m\); that is, \(\mathbf{z}_i = \mathsf{W}^\mathsf{T}\cdot\mathbf{y}_i\) for every observation.
We have satisfied three of the four KKT conditions. We can substitute these results into the loss function because doing so does not change the feasible set of minimizers. The updated Lagrangian is then
We recognize this equation as an expression for a Rayleigh quotient. Evidently, then, the Lagrange multipliers \(\lambda_j\) are the eigenvalues of \(\hat{\mathsf{\Sigma}}\) and \(\mathbf{w}_j\) are the normalized eigenvectors. Inserting this expression for \(\mathbf{w}_j\) into the loss function yields
The loss function is minimized if we choose the \(L\) largest eigenvalues of \(\hat{\mathsf{\Sigma}}\). Therefore, the columns of \(\mathsf{W}\) are comprised of the eigenvectors of \(\hat{\mathsf{\Sigma}}\) corresponding to its largest eigenvalues.
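To make the last step concrete, substituting \(z_{ij} = \mathbf{w}_j^\mathsf{T}\cdot\mathbf{y}_i\) and \(\hat{\mathsf{\Sigma}}\cdot\mathbf{w}_j = \lambda_j\,\mathbf{w}_j\) into the loss function gives

\[
\frac{1}{N}\sum_{i=1}^N \left\lVert \mathbf{y}_i \right\rVert^2 - \sum_{j=1}^{L}\lambda_j.
\]

The first term is fixed by the data (for centered data and the plug-in normalization of \(\hat{\mathsf{\Sigma}}\), it equals \(\mathrm{tr}\,\hat{\mathsf{\Sigma}} = \sum_{j=1}^{D}\lambda_j\)), so the minimal loss is \(\sum_{j=L+1}^{D}\lambda_j\), achieved by retaining the \(L\) largest eigenvalues.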
49.1.4 Centering and scaling, the interpretation of \(\hat{\mathsf{\Sigma}}\), and maximal variance
Assume for a moment that the mean of \(\mathbf{y}\) is zero, that is, \(\frac{1}{N}\sum_{i=1}^N \mathbf{y}_i = 0\). Then the entries of \(\hat{\mathsf{\Sigma}}\) are

\[
\hat{\Sigma}_{jk} = \frac{1}{N}\sum_{i=1}^N y_{ij}\,y_{ik},
\]

which we recognize as the elements of the empirical covariance matrix (or, equivalently, since the diagonal entries are all one for scaled data, an empirical correlation matrix). Thus, \(\hat{\mathsf{\Sigma}}\) is the plug-in estimate for the covariance matrix when using centered-and-scaled data.
Consider a principal component \(z_{ij}\). The plug-in estimate for its variance is

\[
\frac{1}{N}\sum_{i=1}^N z_{ij}^2 = \frac{1}{N}\sum_{i=1}^N \left(\mathbf{w}_j^\mathsf{T}\cdot\mathbf{y}_i\right)^2 = \mathbf{w}_j^\mathsf{T}\cdot\hat{\mathsf{\Sigma}}\cdot\mathbf{w}_j = \lambda_j.
\]
Therefore, finding the transformation \(\mathbf{w}_j\) that minimizes the loss function is the same as finding the transformation that maximizes the variance of the projection. This is why the principal components are often referred to as the directions that have maximal variance.
As a final note, I mention that it is important to center and scale the data set before performing PCA because a given direction may have high variance simply because of the units or scale of the measurement. This is generally not variance we wish to interpret.
49.1.5 Prescription for PCA
Following the above analysis, we arrive at a prescription for PCA. After choosing the number of principal components \(L\), do the following; a minimal code sketch follows the list.
Center and scale the data set by subtracting the mean and dividing by the standard deviation along each dimension.
Compute the empirical covariance matrix.
Compute the first \(L\) eigenvalues and eigenvectors.
Construct the loading matrix \(\mathsf{W}\) from the eigenvectors corresponding to the \(L\) largest eigenvalues.
The principal components are \(\mathbf{z}_i = \mathsf{W}^\mathsf{T} \cdot \mathbf{y}_i \;\forall i\).
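As a sketch of this prescription as code (the function name pca and its interface are illustrative choices, not part of the analysis above), acting on a generic \(N\times D\) data matrix:

# Minimal PCA sketch following the prescription above
import numpy as np


def pca(y, L):
    """Return top-L eigenvalues, loading matrix, and principal components
    for a data matrix y with shape (N, D)."""
    # Center and scale each dimension
    y = (y - np.mean(y, axis=0)) / np.std(y, axis=0)

    # Empirical covariance matrix, shape (D, D)
    cov = np.cov(y.T)

    # Eigendecomposition; eigh is appropriate for symmetric matrices
    # and returns eigenvalues in ascending order
    eigenvalues, eigenvectors = np.linalg.eigh(cov)

    # Keep the L largest eigenvalues and corresponding eigenvectors
    lam = eigenvalues[::-1][:L]
    W = eigenvectors[:, ::-1][:, :L]

    # Principal components, shape (L, N)
    z = W.T @ y.T

    return lam, W, z

Here, np.linalg.eigh is used rather than np.linalg.eig because the covariance matrix is symmetric, which guarantees real eigenvalues and orthogonal eigenvectors.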
49.1.6 Nonidentifiability of PCA
The loading matrix \(\mathsf{W}\) is comprised of the eigenvectors of the empirical covariance matrix, and therefore the columns are orthogonal, since the covariance matrix is symmetric. However, any of these eigenvectors can be multiplied by a scalar (which is true in general for an eigenvector; if \(\mathbf{v}\) is an eigenvector of \(\mathsf{A}\), then so is \(\alpha \mathbf{v}\) for real \(\alpha \ne 0\)). We can insist that the columns of \(\mathsf{W}\) form an orthonormal basis, such that the magnitude of each eigenvector is 1, which is what is commonly done. However, any given column of \(\mathsf{W}\) may still be multiplied by \(-1\), resulting in a nonidentifiability. We therefore cannot uniquely find a loading matrix \(\mathsf{W}\), and therefore cannot find a unique set of principal components.
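To see the sign nonidentifiability concretely, let \(\mathsf{S}\) be any diagonal \(L\times L\) matrix with diagonal entries \(\pm 1\) (the symbol \(\mathsf{S}\) is introduced here only for illustration). Then

\[
\left(\mathsf{W}\cdot\mathsf{S}\right)^\mathsf{T}\cdot\left(\mathsf{W}\cdot\mathsf{S}\right) = \mathsf{S}\cdot\mathsf{W}^\mathsf{T}\cdot\mathsf{W}\cdot\mathsf{S} = \mathsf{S}^2 = \mathsf{I},
\qquad
\left(\mathsf{W}\cdot\mathsf{S}\right)\cdot\left(\mathsf{S}\cdot\mathbf{z}\right) = \mathsf{W}\cdot\mathbf{z},
\]

so \(\mathsf{W}\cdot\mathsf{S}\) is an equally valid loading matrix, with the corresponding principal components flipped in sign.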
49.2 PCA of a data set
We will perform PCA on data from Remedios et al. that we visited in Chapter 8. Recall that in that data set, we had recordings from 115 neurons over 10 or so minutes. Let’s load in the data set and convert everything to Numpy arrays as we did in Chapter 8.
# Load in the data set and separate into Numpy arrays
data = scipy.io.loadmat(os.path.join(data_path, 'hypothalamus_calcium_imaging_remedios_et_al.mat'))
neural_data = data['neural_data']  # Row is neuron, column is time
attack_vector = data['attack_vector'].flatten()
sex_vector = data['sex_vector'].flatten()
We also set up the time points, using the fact that the sampling rate is 30 Hz.
# Time points at 30 Hz sampling
t = np.arange(len(attack_vector)) / 30
Finally, we can center and scale the data. We will also transpose the neural data so that it has dimension \(N\times D\) for ease in comparing to the above theoretical results.
# Transpose so that rows are observations (time points) and columns are neurons
y = neural_data.T

# Center and scale by subtracting each neuron's mean and dividing by its standard deviation
y = (y - np.mean(y, axis=0)) / np.std(y, axis=0)
We can now carry out the prescription of PCA. We are hoping to visualize the trajectory of the neuronal activity of the mouse on a 2D plot, so we will specify that there are \(L = 2\) latent variables (principal components).
# Two principal components
L = 2

# Compute the plug-in estimate for the covariance matrix
cov = np.cov(y.T)

# Compute the eigensystem
eigenvalues, eigenvectors = np.linalg.eig(cov)

# Sort highest to lowest
idx = eigenvalues.argsort()[::-1]

# Extract largest L to build eigenvalues and loading matrix
lam = eigenvalues[idx][:L]
W = eigenvectors[:, idx][:, :L]

# Compute principal components
z = np.dot(W.T, y.T)
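As a quick sanity check (an addition for illustration, not part of the prescription), we can verify that the loading matrix we just built is semiorthogonal, as the theory requires.

# Sanity check: the columns of W should be orthonormal
np.allclose(W.T @ W, np.eye(L))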
We will plot the visualization of the principal components in a moment, but for now, we will take the opportunity to compute the fraction of the variance that each principal component contributes. Because the eigenvalues are those of the covariance matrix, each eigenvalue gives the scale of the variance along its corresponding eigenvector. Thus, the fraction of the total variance that runs along principal component \(j\) is \(\lambda_j / \sum_{k=1}^D \lambda_k\).
lam / eigenvalues.sum()
array([0.22153642, 0.15257567])
Evidently, the first two principal components account for nearly 40% of the total variance.
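If we wanted to use these fractions to guide the choice of \(L\) (an aside beyond the analysis here), we could inspect the cumulative fraction of variance explained by the sorted eigenvalues.

# Cumulative fraction of variance explained, sorted from largest eigenvalue
np.cumsum(eigenvalues[idx]) / eigenvalues.sum()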
Now, we will proceed to plot the principal components. We will plot them against each other and color the glyphs by time or by sex of the presented mouse. We will also plot each of the two principal components vs time.
p1 = bokeh.plotting.figure(
    frame_height=350,
    frame_width=350,
    x_axis_label="PC 1",
    y_axis_label="PC 2",
)
p2 = bokeh.plotting.figure(
    frame_height=150,
    frame_width=550,
    x_axis_label="time",
    y_axis_label="PC",
)

# Set up data for the plot
time_color = bebi103.viz.q_to_color(t, bokeh.palettes.Viridis256)
sex_color = ["orchid" if s else "dodgerblue" for s in sex_vector]
cds = bokeh.models.ColumnDataSource(dict(pc1=z[0], pc2=z[1], t=t, color=time_color))

# We'll allow selection of which color we want to visualize
color_selector = bokeh.models.RadioButtonGroup(labels=["time", "sex"], active=0)

js_code = """
cds.data['color'] = color_selector.active == 0 ? time_color : sex_color;
cds.change.emit();
"""

color_selector.js_on_event(
    "button_click",
    bokeh.models.CustomJS(
        code=js_code,
        args=dict(
            time_color=time_color,
            sex_color=sex_color,
            cds=cds,
            color_selector=color_selector,
        ),
    ),
)

# Populate glyphs
p1.scatter(source=cds, x="pc1", y="pc2", color="color")
p2.line(source=cds, x='t', y='pc1', legend_label='PC 1')
p2.line(source=cds, x='t', y='pc2', color="orange", legend_label='PC 2')

# Lay out and show!
bokeh.io.show(
    bokeh.layouts.column(
        bokeh.layouts.row(
            p1, bokeh.layouts.column(bokeh.models.Div(text="Color by"), color_selector)
        ),
        p2,
    )
)
When looking at the coloring by sex, it is clear that there is a difference along the PC 1 axis depending on the sex of the mouse that is presented. When the male mouse is introduced about 250 seconds into the experiment, PC 1 shifts to exceed PC 2.
49.2.1 Reconstruction of neuronal measurements from principal components
By restricting ourselves to only two latent variables, we have necessarily omitted some of the dynamics observed in the system. To assess what we have lost, we can reconstruct the neuronal signal from the estimates of the latent variables and the corresponding loading matrix \(\mathsf{W}\). Recall from Equation 49.1 that the reconstructed data set is \(\hat{\mathbf{y}}_i = \mathsf{W}\cdot\hat{\mathbf{z}}_i\).
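As a minimal sketch of this reconstruction (reusing W and z computed above; the variable name y_hat is just an illustrative choice), we can compute the reconstructed data and the fraction of the total variance it captures.

# Reconstruct the centered-and-scaled data from the L = 2 principal components
y_hat = (W @ z).T  # shape (N, D), matching y

# Fraction of the total variance captured by the reconstruction
1 - np.sum((y - y_hat)**2) / np.sum(y**2)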