49.1 Principal component analysis: A heuristic approach
In our discussion of principal component analysis, we will first approach the problem heuristically without defining a full generative model. We will then demonstrate a Bayesian model behind this heuristic.
49.1.1 The problem of PCA
We often obtain high-dimensional data sets, in which each observation has many dimensions. In the parlance of practitioners of machine learning, each multidimensional observation is called a sample and each dimension is called a feature. We will use the observation/dimension and sample/feature nomenclature interchangeably. Considering the Palmer’s penguin data set, each observation (sample) is a penguin, and the dimensions (features) are bill depth, bill length, flipper length, and body mass.
The features, though many, may be related. We can imagine that there are unobservable variables, called latent variables, at play that influence many features. For example, consider the four features of the penguins. They could all scale with the caloric intake of the penguins, in which case the latent variable of caloric intake could be used to predict all four of the features; it is responsible for most of the variation in what is observed. (We have seen latent variables before with the \(z_{ik}\) in the Gaussian mixture models we encountered in Chapter 41.)
Let us now formalize this. Let \(\mathbf{y}\in \mathbb{R}^D\) be a set of observable features. There are \(D\) of them. In the case of the penguins, \(D = 4\). Let \(\mathbf{z} \in \mathbb{R}^L\) be the set of latent variables that influence the observations. Typically, \(L < D\) and often \(L \ll D\), and, as such, if we can infer the latent variable \(\mathbf{z}\), we say we have achieved dimensionality reduction. Specifically, we define a model for \(\mathbf{y}\), \(\mathbf{y} = h(\mathbf{z};\theta)\), where \(\theta\) is some set of parameters. So, to achieve dimensionality reduction, we want to find a function \(h(\mathbf{z};\theta)\) that most faithfully reproduces the observed \(\mathbf{y}\).
49.1.2 Linear \(h(\mathbf{z};\theta)\)
As is commonly done, we will restrict the functions \(h(\mathbf{z};\theta)\) we choose to be linear functions. Further, we stipulate that \(h(\mathbf{z};\theta)\) serves to orthogonally project \(\mathbf{z}\) into the space of the observations,

\[
\hat{\mathbf{y}} = h(\mathbf{z};\theta) = \mathsf{W}\cdot\mathbf{z},
\]

where the parameters \(\theta\) are given by the \(D \times L\) matrix \(\mathsf{W}\), commonly referred to as a loading matrix. This matrix is semiorthogonal, such that its columns are orthonormal vectors and \(\mathsf{W}^\mathsf{T}\cdot \mathsf{W} = \mathsf{I}\), the \(L\times L\) identity matrix. That is, if \(\mathbf{w}_j\) and \(\mathbf{w}_k\) are columns of \(\mathsf{W}\), then

\[
\mathbf{w}_j^\mathsf{T}\cdot\mathbf{w}_k = \delta_{jk},
\]

where \(\delta_{jk}\) is the Kronecker delta.
We seek a loading matrix \(\mathsf{W}\) and set of latent variables \(\{\mathbf{z}\}\) that most faithfully reproduce the measured data \(\{\mathbf{y}\}\), where I am using the braces to denote a set of data; that is, \(\{\mathbf{y}\} = \{\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_N\}\), where \(N\) is the number of observations we have made. In other words, we want \(\hat{\mathbf{y}}_i\) to be as close as possible to \(\mathbf{y}_i\) for all \(i\). Toward this end, we define a loss function as the mean square distance between \(\mathbf{y}_i\) and \(\hat{\mathbf{y}}_i\).
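Written out explicitly (denoting the loss by \(\ell\); the \(1/N\) prefactor is a convention and does not affect the minimizer), the mean square distance is

\[
\ell\left(\mathsf{W}, \{\mathbf{z}\}\right) = \frac{1}{N}\sum_{i=1}^N \left\lVert \mathbf{y}_i - \hat{\mathbf{y}}_i \right\rVert^2 = \frac{1}{N}\sum_{i=1}^N \left\lVert \mathbf{y}_i - \mathsf{W}\cdot\mathbf{z}_i \right\rVert^2,
\]

to be minimized over \(\mathsf{W}\) and \(\{\mathbf{z}\}\) subject to \(\mathsf{W}^\mathsf{T}\cdot\mathsf{W} = \mathsf{I}\).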
The goal of principal component analysis is to find the \(\mathsf{W}\) that minimizes this loss function. Once this optimal \(\mathsf{W}\) is found, we compute \(\hat{\mathbf{z}} = \mathsf{W}^\mathsf{T}\cdot \mathbf{y}\); the entries of \(\hat{\mathbf{z}}\) are called the principal components.
49.1.3 The optimal projection
In this section, we work out what loading matrix \(\mathsf{W}\) optimizes the above loss function. Be careful not to get lost in the weeds of the mathematics that follows. Just remember that we are simply finding the projection that minimizes the mean squared difference between measured and projected data.
As we work to find the optimal projection, it will help us to write the loss function in terms of the columns of \(\mathsf{W}\).
We wish to find the \(\{\mathbf{w}_j\}\) and \(\{\mathbf{z}_i\}\) that minimize the loss function. To do so while enforcing the orthonormality of \(\mathsf{W}\), we define a Lagrangian
where the Lagrange multipliers denoted with \(\lambda\) enforce the normality of the columns of \(\mathsf{W}\) and those denoted with \(\mu\) enforce the orthogonality of the columns of \(\mathsf{W}\).
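Explicitly, one way to write a Lagrangian with these constraints (the signs attached to the multipliers are a convention, and \(z_{ij}\) denotes the \(j\)th entry of \(\mathbf{z}_i\)) is

\[
\mathcal{L} = \frac{1}{N}\sum_{i=1}^N \left\lVert \mathbf{y}_i - \sum_{j=1}^L z_{ij}\,\mathbf{w}_j \right\rVert^2 + \sum_{j=1}^L \lambda_j\left(1 - \mathbf{w}_j^\mathsf{T}\cdot\mathbf{w}_j\right) + \sum_{j=1}^L\sum_{k=j+1}^L \mu_{jk}\,\mathbf{w}_j^\mathsf{T}\cdot\mathbf{w}_k.
\]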
Because the loss function and the normality constraints are all strictly convex, the necessary and sufficient conditions for minimizing the loss function are the Karush-Kuhn-Tucker (KKT) conditions,
for all \(i\in[1,N]\), \(j\in[1,L]\), and \(k \in [j+1,L]\). To avoid stumbling over indices that are present in sums, I prefer to differentiate with respect to a variable with a specific index, e.g., with respect to \(z_{lm}\). With that in mind, let us compute the derivatives of the KKT conditions one by one, in the reverse of the order in which they are listed above.
noting that \(m > l\). Not surprisingly, this means that \(\mathbf{w}_j^\mathsf{T}\cdot\mathbf{w}_k = 0\) for all \(j\ne k\); the columns of \(\mathsf{W}\) are orthogonal.
Turning now to the derivative with respect to the \(\lambda\)’s,
which means that \(\mathbf{w}_j^\mathsf{T}\cdot\mathbf{w}_j = 1\). With the previous KKT condition, this means that the columns of \(\mathsf{W}\) are orthonormal, or \(\mathbf{w}_j^\mathsf{T}\cdot\mathbf{w}_k = \delta_{jk}\), where \(\delta_{jk}\) is the Kronecker delta.
where we have used \(\mathbf{w}_m^\mathsf{T}\cdot \mathbf{w}_k = \delta_{mk}\), which we worked out from the first two KKT conditions we considered. Thus, we have \(z_{lm} = \mathbf{w}_m^\mathsf{T}\cdot\mathbf{y}_l\) for all \(l\) and \(m\); that is, \(\mathbf{z}_i = \mathsf{W}^\mathsf{T}\cdot\mathbf{y}_i\) for every observation.
We have satisfied three of the four KKT conditions. We can substitute these results into the loss function because doing so does not change the feasible set of minimizers. The updated Lagrangian is then
We recognize this equation as an expression for a Rayleigh quotient. Evidently, then, the Lagrange multipliers \(\lambda_j\) are the eigenvalues of \(\hat{\mathsf{\Sigma}}\) and \(\mathbf{w}_j\) are the normalized eigenvectors. Inserting this expression for \(\mathbf{w}_j\) into the loss function yields
The loss function is minimized if we choose the \(L\) largest eigenvalues of \(\hat{\mathsf{\Sigma}}\). Therefore, the columns of \(\mathsf{W}\) are comprised of the eigenvectors of \(\hat{\mathsf{\Sigma}}\) corresponding to its largest eigenvalues.
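To make the last step concrete, substituting \(z_{ij} = \mathbf{w}_j^\mathsf{T}\cdot\mathbf{y}_i\) and \(\hat{\mathsf{\Sigma}}\cdot\mathbf{w}_j = \lambda_j\,\mathbf{w}_j\) into the loss function gives

\[
\frac{1}{N}\sum_{i=1}^N \left\lVert \mathbf{y}_i \right\rVert^2 - \sum_{j=1}^{L}\lambda_j.
\]

The first term is fixed by the data (for centered data and the plug-in normalization of \(\hat{\mathsf{\Sigma}}\), it equals \(\mathrm{tr}\,\hat{\mathsf{\Sigma}} = \sum_{j=1}^{D}\lambda_j\)), so the minimal loss is \(\sum_{j=L+1}^{D}\lambda_j\), achieved by retaining the \(L\) largest eigenvalues.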
49.1.4 Centering and scaling, the interpretation of \(\hat{\mathsf{\Sigma}}\), and maximal variance
Assume for a moment that the mean of \(\mathbf{y}\) is zero, that is, \(\frac{1}{N}\sum_{i=1}^N \mathbf{y}_i = 0\). Then the entries of \(\hat{\mathsf{\Sigma}}\) are

\[
\hat{\Sigma}_{jk} = \frac{1}{N}\sum_{i=1}^N y_{ij}\,y_{ik},
\]

which we recognize as the elements of the empirical covariance matrix (or, equivalently, since the diagonal entries are all one for scaled data, an empirical correlation matrix). Thus, \(\hat{\mathsf{\Sigma}}\) is the plug-in estimate for the covariance matrix when using centered-and-scaled data.
Consider a principal component \(z_{ij}\). The plug-in estimate for its variance is

\[
\frac{1}{N}\sum_{i=1}^N z_{ij}^2 = \frac{1}{N}\sum_{i=1}^N \left(\mathbf{w}_j^\mathsf{T}\cdot\mathbf{y}_i\right)^2 = \mathbf{w}_j^\mathsf{T}\cdot\hat{\mathsf{\Sigma}}\cdot\mathbf{w}_j = \lambda_j.
\]
Therefore, finding the transformation \(\mathbf{w}_j\) that minimizes the loss function is the same as finding the transformation that maximizes the variance of the projection. This is why the principal components are often referred to as the directions that have maximal variance.
As a final note, I mention that it is important to center and scale the data set before performing PCA because a given direction may have high variance simply because of the units or scale of the measurement. This is generally not variance we wish to interpret.
49.1.5 Prescription for PCA
Following the above analysis, we arrive at a prescription for PCA. After choosing the number of principal components \(L\), do the following; a minimal code sketch follows the list.
Center and scale the data set by subtracting the mean and dividing by the standard deviation along each dimension.
Compute the empirical covariance matrix.
Compute the first \(L\) eigenvalues and eigenvectors.
Construct the loading matrix \(\mathsf{W}\) from the eigenvectors corresponding to the \(L\) largest eigenvalues.
The principal components are \(\mathbf{z}_i = \mathsf{W}^\mathsf{T} \cdot \mathbf{y}_i \;\forall i\).
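As a sketch of this prescription as code (the function name pca and its interface are illustrative choices, not part of the analysis above), acting on a generic \(N\times D\) data matrix:

# Minimal PCA sketch following the prescription above
import numpy as np


def pca(y, L):
    """Return top-L eigenvalues, loading matrix, and principal components
    for a data matrix y with shape (N, D)."""
    # Center and scale each dimension
    y = (y - np.mean(y, axis=0)) / np.std(y, axis=0)

    # Empirical covariance matrix, shape (D, D)
    cov = np.cov(y.T)

    # Eigendecomposition; eigh is appropriate for symmetric matrices
    # and returns eigenvalues in ascending order
    eigenvalues, eigenvectors = np.linalg.eigh(cov)

    # Keep the L largest eigenvalues and corresponding eigenvectors
    lam = eigenvalues[::-1][:L]
    W = eigenvectors[:, ::-1][:, :L]

    # Principal components, shape (L, N)
    z = W.T @ y.T

    return lam, W, z

Here, np.linalg.eigh is used rather than np.linalg.eig because the covariance matrix is symmetric, which guarantees real eigenvalues and orthogonal eigenvectors.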
49.1.6 Nonidentifiability of PCA
The loading matrix \(\mathsf{W}\) is comprised of the eigenvectors of the empirical covariance matrix, and therefore the columns are orthogonal, since the covariance matrix is symmetric. However, any of these eigenvectors can be multiplied by a scalar (which is true in general for an eigenvector; if \(\mathbf{v}\) is an eigenvector of \(\mathsf{A}\), then so is \(\alpha \mathbf{v}\) for real \(\alpha \ne 0\)). We can insist that the columns of \(\mathsf{W}\) form an orthonormal basis, such that the magnitude of each eigenvector is 1, which is what is commonly done. However, any given column of \(\mathsf{W}\) may still be multiplied by \(-1\), resulting in a nonidentifiability. We therefore cannot uniquely find a loading matrix \(\mathsf{W}\), and therefore cannot find a unique set of principal components.
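To see the sign nonidentifiability concretely, let \(\mathsf{S}\) be any diagonal \(L\times L\) matrix with diagonal entries \(\pm 1\) (the symbol \(\mathsf{S}\) is introduced here only for illustration). Then

\[
\left(\mathsf{W}\cdot\mathsf{S}\right)^\mathsf{T}\cdot\left(\mathsf{W}\cdot\mathsf{S}\right) = \mathsf{S}\cdot\mathsf{W}^\mathsf{T}\cdot\mathsf{W}\cdot\mathsf{S} = \mathsf{S}^2 = \mathsf{I},
\qquad
\left(\mathsf{W}\cdot\mathsf{S}\right)\cdot\left(\mathsf{S}\cdot\mathbf{z}\right) = \mathsf{W}\cdot\mathbf{z},
\]

so \(\mathsf{W}\cdot\mathsf{S}\) is an equally valid loading matrix, with the corresponding principal components flipped in sign.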
49.2 PCA of a data set
We will perform PCA on data from Remedios et al. that we visited in Chapter 8. Recall that in that data set, we had recordings from 115 neurons over 10 or so minutes. Let’s load in the data set and convert everything to Numpy arrays as we did in Chapter 8.
# Load in the data set and separate into Numpy arrays
data = scipy.io.loadmat(os.path.join(data_path, 'hypothalamus_calcium_imaging_remedios_et_al.mat'))
neural_data = data['neural_data']  # Row is neuron, column is time
attack_vector = data['attack_vector'].flatten()
sex_vector = data['sex_vector'].flatten()
We also set up the time points, using the fact that the sampling rate is 30 Hz.
# Time points at 30 Hz sampling
t = np.arange(len(attack_vector)) / 30
Finally, we can center and scale the data. We will also transpose the neural data so that it has dimension \(N\times D\) for ease in comparing to the above theoretical results.
# Transpose so that rows are observations (time points) and columns are neurons
y = neural_data.T

# Center and scale by subtracting each neuron's mean and dividing by its standard deviation
y = (y - np.mean(y, axis=0)) / np.std(y, axis=0)
We can now carry out the prescription of PCA. We are hoping to visualize the trajectory of the neuronal activity of the mouse on a 2D plot, so we will specify that there are \(L = 2\) latent variables (principal components).
# Two principal components
L = 2

# Compute the plug-in estimate for the covariance matrix
cov = np.cov(y.T)

# Compute the eigensystem
eigenvalues, eigenvectors = np.linalg.eig(cov)

# Sort highest to lowest
idx = eigenvalues.argsort()[::-1]

# Extract largest L to build eigenvalues and loading matrix
lam = eigenvalues[idx][:L]
W = eigenvectors[:, idx][:, :L]

# Compute principal components
z = np.dot(W.T, y.T)
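As a quick sanity check (an addition for illustration, not part of the prescription), we can verify that the loading matrix we just built is semiorthogonal, as the theory requires.

# Sanity check: the columns of W should be orthonormal
np.allclose(W.T @ W, np.eye(L))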
We will plot the visualization of the principal components in a moment, but for now, we will take the opportunity to compute the fraction of the variance that each principal component contributes. Because the eigenvalues are those of the covariance matrix, each eigenvalue gives the scale of the variance along its corresponding eigenvector. Thus, the fraction of the total variance that runs along principal component \(j\) is \(\lambda_j / \sum_{k=1}^D \lambda_k\).
lam / eigenvalues.sum()
array([0.22153642, 0.15257567])
Evidently, the first two principal components account for nearly 40% of the total variance.
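If we wanted to use these fractions to guide the choice of \(L\) (an aside beyond the analysis here), we could inspect the cumulative fraction of variance explained by the sorted eigenvalues.

# Cumulative fraction of variance explained, sorted from largest eigenvalue
np.cumsum(eigenvalues[idx]) / eigenvalues.sum()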
Now, we will proceed to plot the principal components. We will plot them against each other and color the glyphs by time or by sex of the presented mouse. We will also plot each of the two principal components vs time.
p1 = bokeh.plotting.figure(
    frame_height=350,
    frame_width=350,
    x_axis_label="PC 1",
    y_axis_label="PC 2",
)
p2 = bokeh.plotting.figure(
    frame_height=150,
    frame_width=550,
    x_axis_label="time",
    y_axis_label="PC",
)

# Set up data for the plot
time_color = bebi103.viz.q_to_color(t, bokeh.palettes.Viridis256)
sex_color = ["orchid" if s else "dodgerblue" for s in sex_vector]
cds = bokeh.models.ColumnDataSource(dict(pc1=z[0], pc2=z[1], t=t, color=time_color))

# We'll allow selection of which color we want to visualize
color_selector = bokeh.models.RadioButtonGroup(labels=["time", "sex"], active=0)

js_code = """
cds.data['color'] = color_selector.active == 0 ? time_color : sex_color;
cds.change.emit();
"""

color_selector.js_on_event(
    "button_click",
    bokeh.models.CustomJS(
        code=js_code,
        args=dict(
            time_color=time_color,
            sex_color=sex_color,
            cds=cds,
            color_selector=color_selector,
        ),
    ),
)

# Populate glyphs
p1.scatter(source=cds, x="pc1", y="pc2", color="color")
p2.line(source=cds, x='t', y='pc1', legend_label='PC 1')
p2.line(source=cds, x='t', y='pc2', color="orange", legend_label='PC 2')

# Lay out and show!
bokeh.io.show(
    bokeh.layouts.column(
        bokeh.layouts.row(
            p1, bokeh.layouts.column(bokeh.models.Div(text="Color by"), color_selector)
        ),
        p2,
    )
)
When looking at the coloring by sex, it is clear that there is a difference along the PC 1 axis depending on the sex of the mouse that is presented. When the male mouse is introduced about 250 seconds into the experiment, PC 1 shifts to exceed PC 2.
49.2.1 Reconstruction of neuronal measurements from principal components
By restricting ourselves to only two latent variables, we have necessarily omitted some of the dynamics observed in the system. To assess what we have lost, we can reconstruct the neuronal signal from the estimates of the latent variables and the corresponding loading matrix \(\mathsf{W}\). Recall from Equation 49.1 that the reconstructed data set is \(\hat{\mathbf{y}}_i = \mathsf{W}\cdot\hat{\mathbf{z}}_i\).
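As a minimal sketch of this reconstruction (reusing W and z computed above; the variable name y_hat is just an illustrative choice), we can compute the reconstructed data and the fraction of the total variance it captures.

# Reconstruct the centered-and-scaled data from the L = 2 principal components
y_hat = (W @ z).T  # shape (N, D), matching y

# Fraction of the total variance captured by the reconstruction
1 - np.sum((y - y_hat)**2) / np.sum(y**2)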