Machine learning


Unsupervised learning and spectral clustering

Clustering is an unsupervised learning technique that aims to identify meaningful coarse structures in a point cloud X. Spectral clustering is a particularly successful approach to this problem: a graph is constructed on X together with a graph Laplacian operator L, and an embedding F is defined using the eigenvectors of L that maps X to a low-dimensional space in which the cluster structure of X is revealed more easily.
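As a concrete illustration, this pipeline can be sketched in a few lines. This is a minimal version using a Gaussian similarity graph and the unnormalized Laplacian; the function name, kernel bandwidth, and synthetic data are illustrative choices, not the constructions studied in our work.

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.linalg import eigh

def spectral_embedding(X, sigma=1.0, k=2):
    # Gaussian similarity graph on the point cloud X
    W = np.exp(-cdist(X, X, "sqeuclidean") / (2 * sigma**2))
    np.fill_diagonal(W, 0.0)
    # Unnormalized graph Laplacian L = D - W
    L = np.diag(W.sum(axis=1)) - W
    # Embedding F: eigenvectors of L with the k smallest eigenvalues
    _, vecs = eigh(L)
    return vecs[:, :k]

# Two well-separated blobs; the sign of the second eigenvector
# (the Fiedler vector) recovers the two clusters
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.4, (50, 2)), rng.normal(2, 0.4, (50, 2))])
F = spectral_embedding(X)
labels = (F[:, 1] > 0).astype(int)
```

On well-separated data like this, thresholding the second coordinate of F already separates the clusters; in general one runs k-means on the rows of F.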


graphn100.jpeg
Generic proximity graph on a dataset consisting of two clusters.

Theoretical analysis of spectral clustering is still quite scarce. We studied the discrete and continuum limits of spectral clustering in this preprint, analyzing the geometry of the Laplacian embedding F when the points in X are i.i.d. samples from a mixture model. We showed that, as the components of this mixture become better separated, spectral clustering identifies the correct clusters in X with high probability.

eign800.jpeg
Second eigenvector of the graph Laplacian (the Fiedler vector) of a dense graph. The values of this eigenvector naturally split the data into two clusters.

Continuum limit of graph Laplacians

It was shown in this article that, in the limit as the number of points in X
goes to infinity, the graph Laplacian operators L converge to weighted elliptic operators of the form
\mathcal L : u \mapsto - \rho^{-p} \text{div} ( \rho^q \nabla (u \rho^{-r}))
for certain values of the parameters p,q,r depending on the normalization of the graph Laplacian. Here \rho is the probability density function according to which the points in X are distributed. Understanding the spectral properties of \mathcal L is therefore fundamental to the analysis of a large number of unsupervised and semi-supervised learning algorithms in the large data limit.
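The dependence on the normalization mentioned above refers to the standard variants of the graph Laplacian. The sketch below shows the three common constructions (the function name is illustrative; the specific correspondence between each normalization and the values of p, q, r is worked out in the article, and is not claimed here):

```python
import numpy as np

def graph_laplacians(W):
    # Degree vector of the weighted graph with symmetric weights W
    d = W.sum(axis=1)
    n = len(d)
    L_un = np.diag(d) - W                 # unnormalized: D - W
    L_rw = np.eye(n) - W / d[:, None]     # random walk: I - D^{-1} W
    Dis = np.diag(1.0 / np.sqrt(d))
    L_sym = np.eye(n) - Dis @ W @ Dis     # symmetric: I - D^{-1/2} W D^{-1/2}
    return L_un, L_rw, L_sym

# Small symmetric weight matrix for illustration
rng = np.random.default_rng(1)
A = rng.random((5, 5))
W = (A + A.T) / 2
np.fill_diagonal(W, 0.0)
L_un, L_rw, L_sym = graph_laplacians(W)
```

Each variant annihilates constants (rows sum to zero for the first two), but they weight the density \rho differently in the continuum, which is what produces different (p, q, r).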
density
Black-and-white image of two neuron cells used as the density \rho in the definition of \mathcal L.
evec2
Eigenvector of \mathcal L associated with its first nontrivial eigenvalue. Its sign clearly identifies the two distinct neurons (blue is negative and red is positive).

We analyze the spectrum of \mathcal L, focusing on cases where \rho is concentrated on certain clusters. We show that, depending on the values of p, q, r, a gap manifests in the spectrum of \mathcal L that reveals the number of clusters in \rho, and that the geometry of the clusters is apparent in the eigenfunctions of \mathcal L associated with the small eigenvalues.
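This spectral gap can also be observed numerically at the discrete level. The sketch below (an illustrative example with Gaussian-blob data and an unnormalized graph Laplacian, not the operator \mathcal L itself) estimates the number of clusters from the largest gap among the smallest eigenvalues:

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.linalg import eigvalsh

def estimate_num_clusters(X, sigma=1.0, kmax=8):
    # Gaussian-weighted graph Laplacian on X
    W = np.exp(-cdist(X, X, "sqeuclidean") / (2 * sigma**2))
    np.fill_diagonal(W, 0.0)
    L = np.diag(W.sum(axis=1)) - W
    # kmax smallest eigenvalues; the largest gap suggests the cluster count
    vals = eigvalsh(L)[:kmax]
    return int(np.argmax(np.diff(vals))) + 1

# Three well-separated blobs give three near-zero eigenvalues,
# followed by a pronounced spectral gap
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, (40, 2)) for c in (-4.0, 0.0, 4.0)])
k_est = estimate_num_clusters(X)
```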

Semi-supervised learning

Semi-supervised learning (SSL) is the problem of labelling a collection of unlabelled points X from noisily observed labels on a small subset X' \subset X. The probit and one-hot methods are two widely used approaches to SSL: probit recovers binary labels, while one-hot can handle finitely many labels. Both methods combine the observed labels with geometric information about X to label the remaining points. As in spectral clustering, a graph Laplacian operator L is constructed on X. The labels on all of X are then found by solving an optimization problem consisting of a data fidelity term for the observed labels on X' and a Dirichlet energy regularization term involving L that injects the geometry of X encoded in its spectrum.
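A minimal sketch of this fidelity-plus-Dirichlet-energy optimization in the one-hot setting is given below, assuming a Gaussian similarity graph and an illustrative regularization weight lam (the function name and parameter choices are hypothetical, not the formulation from the article). With a quadratic fidelity term, the minimizer solves one linear system per class:

```python
import numpy as np
from scipy.spatial.distance import cdist

def one_hot_ssl(X, labeled_idx, labels, n_classes, sigma=1.0, lam=1.0):
    # Graph Laplacian on all points, labelled and unlabelled
    W = np.exp(-cdist(X, X, "sqeuclidean") / (2 * sigma**2))
    np.fill_diagonal(W, 0.0)
    L = np.diag(W.sum(axis=1)) - W
    n = len(X)
    # Data fidelity only on the observed subset X'
    P = np.zeros(n)
    P[labeled_idx] = 1.0
    Y = np.zeros((n, n_classes))
    Y[labeled_idx, labels] = 1.0
    # Minimize |P(U - Y)|^2 + lam * tr(U^T L U): a linear system per class
    U = np.linalg.solve(np.diag(P) + lam * L, P[:, None] * Y)
    return U.argmax(axis=1)

# Two blobs, one observed label in each; the Dirichlet term
# propagates the labels through the graph
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.4, (30, 2)), rng.normal(2, 0.4, (30, 2))])
pred = one_hot_ssl(X, [0, 30], [0, 1], 2)
```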

We analyze the consistency of SSL in this article. We show that, under appropriate geometric assumptions, the probit and one-hot methods recover the correct labels of all points in X in the limit of small observational noise. Our analysis also reveals interesting interactions between the hyperparameters of both methods.

u01
Using probit to label pixels in the above black-and-white image to identify two separate neurons. The red and blue colors indicate the two classes; grey indicates a third class, i.e., the background.