Unsupervised learning and spectral clustering
Clustering is an unsupervised learning technique aiming to identify meaningful coarse structures in a point cloud . Spectral clustering is a particularly successful approach to this problem where a graph is constructed on as well as a graph Laplacian operator . Then an embedding is defined using the eigenvector of that maps to a low-dimensional space where the clustering of is revealed easier.
Theoretical analysis of spectral clustering is still quite scarce. We studied the discrete and the continuum limits of spectral clustering in this preprint. We analyzed the geometry of the Laplacian embedding when the points in are i.i.d. with respect to a mixture model. We showed that as the components of this mixture become better separated spectral clustering will identify the correct clusters in with highprobability.
Continuum limit of graph Laplacians
goes to infinity, the graph Laplacian operators converge to weighted elliptic operators of the form
for certain values of the parameters depending on the normalization of the graph Laplacian. Here is the probability density function according to which the points in are distributed. Understanding the spectral properties of is therefore fundamental to the analysis of a large number of unsupervised and semi-supervised learning algorithms in the large data limit.
Semi-supervised learning (SSL) is the problem of labelling a collection of unlabelled points from noisily observed labels of a small subset . The probit and one-hot methods are two widely used approaches to SSL — probit recovers binary labels while one-hot can handle finitely many labels. Both methods combine the observed labels with geometric information about to label the rest of the points. Similar to spectral clustering a graph Laplacian operator is constructed on . The labels on all of are then found by solving an optimization problem consisting of a data fidelity term for the observed labels on , and a Dirichlet energy regularization term involving the graph Laplacian that injects the geometry of encoded in the spectrum of .
We analyze consistency of SSL in this article. We show that, under appropriate geometric assumptions the probit and one-hot methods recover the correct label of all points in in the limit of small observational noise. Our analysis reveals interesting interactions between different hyperparameters in both methods.