Spectral analysis of weighted Laplacians in data clustering

We just uploaded a preprint on spectral analysis of weighted Laplacians that arise as the continuum limits of graph Laplacian operators often used in clustering semi-supervised learning. There are some surprising results pertaining to different normalizations of graph Laplacians. Turns out, they are not equivalent and the continuum limit operators can behave quite differently.


Graph Laplacians computed from weighted adjacency matrices are widely used to identify geometric structure in data, and clusters in particular; their spectral properties play a central role in a number of unsupervised and semi-supervised learning algorithms. When suitably scaled, graph Laplacians approach limiting continuum operators in the large data limit. Studying these limiting operators, therefore, sheds light on learning algorithms. This paper is devoted to the study of a parameterized family of divergence form elliptic operators that arise as the large data limit of graph Laplacians. The link between a three-parameter family of graph Laplacians and a three-parameter family of differential operators is explained. The spectral properties of these differential perators are analyzed in the situation where the data comprises two nearly separated clusters, in a sense which is made precise. In particular, we investigate how the spectral gap depends on the three parameters entering the graph Laplacian and on a parameter measuring the size of the perturbation from the perfectly clustered case. Numerical results are presented which exemplify and extend the analysis; in particular the computations study situations with more than two clusters. The findings provide insight into parameter choices made in learning algorithms which are based on weighted adjacency matrices; they also provide the basis for analysis of the consistency of various unsupervised and semi-supervised learning algorithms, in the large data limit.




Consistency of semi-supervised learning algorithms on graphs: Probit and one-hot methods

We just uploaded a new manuscript on consistency of certain semi-supervised learning algorithms on graphs. This joint work with my wonderful collaborators Dr. Franca Hoffmann, Zhi Ren and Prof. Andrew Stuart. The first of multiple papers we’ve been working on where we develop a framework for consistency analysis of semi-supervised learning algorithms.


Graph-based semi-supervised learning is the problem of propagating labels from a small number of labelled data points to a larger set of unlabelled data. This paper is concerned with the consistency of optimization-based techniques for such problems, in the limit where the labels have small noise and the underlying unlabelled data is well clustered. We study graph-based probit for binary classification, and a natural generalization of this method to multi-class classification using one-hot encoding. The resulting objective function to be optimized comprises the sum of a quadratic form defined through a rational function of the graph Laplacian, involving only the unlabelled data, and a fidelity term involving only the labelled data. The consistency analysis sheds light on the choice of the rational function defining the optimization.

Geometric structure of graph Laplacian embeddings

We just uploaded a new preprint on the geometry of graph Laplacian embeddings in the continuum limit. Joint work with Nicolas Garcia trillos and Franca Hoffmann.


We analyze the spectral clustering procedure for identifying coarse structure in a data set x_1, \dots, x_n , and in particular study the geometry of graph Laplacian embeddings which form the basis for spectral clustering algorithms. More precisely, we assume that the data is sampled from a mixture model supported on a manifold M embedded in \mathbb{R}^d and pick a connectivity length-scale \varepsilon >0 to construct a kernelized graph Laplacian. We introduce a notion of a well-separated mixture model which only depends on the model itself, and prove that when the model is well separated, with high probability the embedded data set concentrates on cones that are centered around orthogonal vectors. Our results are meaningful in the regime where \varepsilon = \varepsilon(n) is allowed to decay to zero at a slow enough rate as the number of data points grows. This rate depends on the intrinsic dimension of the manifold on which the data is supported.

Two Metropolis-Hastings algorithms for non-Gaussian priors

I just updated my preprint ( on arXiv after major revisions following reviewer comments. Here’s the new abstract:

We introduce two classes of Metropolis-Hastings algorithms for sampling target measures that are absolutely continuous with respect to underlying non-Gaussian measures on infinite-dimensional Hilbert spaces. We particularly focus on measures that are highly non-Gaussian and utilize certain properties of the underlying prior measures to construct autoregressive-type proposal kernels that are prior reversible and result in algorithms that satisfy detailed balance with respect to the target measures. We then introduce a new class of prior measures, called the Bessel-K priors, as a generalization of the gamma distribution to measures in infinite dimensions. The Bessel-K priors interpolate between well-known priors such as the gamma distribution and Besov priors and can model sparse or compressible parameters. We present concrete instances of our algorithms for the Bessel-K priors and present numerical examples in density estimation, finite-dimensional denoising and deconvolution on the circle.


New preprint on model calibration and source inversion in atmospheric dispersion

We just uploaded a new preprint “Simultaneous model calibration and source inversion in atmospheric dispersion models” that I co-authored with Juan Garcia and John Stockie. This is mainly thanks to Juan’s work for his master’s thesis. I am very proud of him. Here’s the abstract:

We present a cost-effective method for model calibration and solution of source inversion problems in atmospheric dispersion modelling. We use Gaussian process emulations of atmospheric dispersion models within a Bayesian framework for solution of inverse problems. The model and source parameters are treated as unknowns and we obtain point estimates and approximation of uncertainties for sources while simultaneously calibrating the forward model. The method is validated in the context of an industrial case study involving emissions from a smelting operation for which cumulative monthly measurements of zinc particulate depositions are available.

A day of celebration at SFU

I had the opportunity to celebrate my graduation with some of the lovelies people in my life. My two supervisors John Stockie and Nilima Nigam hooded me at the convocation ceremony. I was also awarded the Governor General’s gold medal by President Andrew Petter.  This is a huge honor which I credit to my supervisors as well as the wonderful people at SFU who supported me for the past six years.