Data science

Below I showcase some of my projects in data science and machine learning. For an up-to-date list of my papers please refer to the publications page.


Conditional sampling with MGANs

Conditional sampling is a fundamental problem in statistics and machine learning. Consider the supervised learning problem of predicting an output y at an input x. We cast this as the problem of identifying the conditional measure y|x from a training data set consisting of input-output samples (x_i, y_i).

Monotone Generative Adversarial Networks (MGANs) are a variant of standard GANs that add appropriate structural and monotonicity conditions so that they can sample the desired conditionals y|x. More precisely, an MGAN gives a mapping T(x,y) such that, for any new input x^\ast, the map T(x^\ast, \cdot) pushes a standard Gaussian in the y variable forward to the desired conditional y|x^\ast.
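
Once trained, conditional sampling is just a forward pass through the map. The sketch below is a minimal illustration in PyTorch; the trained map `T` is hypothetical and the names are illustrative, not the actual MGAN implementation. Repeated calls give independent conditional samples.

```python
import torch

# Minimal sketch of conditional sampling with a trained monotone map.
# `T` is a hypothetical trained MGAN network T(x, z), monotone in z,
# so that z ~ N(0, I) is pushed forward to the conditional y | x.
def sample_conditional(T, x_star, n_samples, latent_dim):
    z = torch.randn(n_samples, latent_dim)   # reference Gaussian samples
    x = x_star.expand(n_samples, -1)         # repeat x* (shape (1, d_x)) per sample
    with torch.no_grad():                    # inference only, no gradients
        return T(x, z)                       # samples of y | x*
```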

Image in-painting with MGANs. The first column shows the original image and the second column the masked image. Columns three through five are independent conditional samples of the in-painted image. The jet-colored images show the mean and standard deviation of the conditional, while the far-right column shows the classification probability of the conditional samples.

Spectral clustering

Clustering is an unsupervised learning technique that aims to identify meaningful coarse structures in a point cloud X. Spectral clustering is a particularly successful approach to this problem: a graph is constructed on X together with a graph Laplacian operator L, and an embedding F, defined using the eigenvectors of L, maps X to a low-dimensional space where the clusters in X are revealed more easily.
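
As a concrete illustration, here is a minimal NumPy/scikit-learn sketch of the pipeline just described; the Gaussian kernel, the bandwidth `sigma`, and the unnormalized Laplacian are illustrative choices, not the specific construction used in our work.

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def spectral_clustering(X, k, sigma=1.0):
    # Gaussian-kernel proximity graph on the point cloud X (n x d)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    L = np.diag(W.sum(axis=1)) - W            # unnormalized graph Laplacian
    # embedding F: eigenvectors of L for the k smallest eigenvalues
    _, F = eigh(L, subset_by_index=[0, k - 1])
    # clusters are easier to find in the embedded coordinates
    return KMeans(n_clusters=k, n_init=10).fit_predict(F)
```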

Generic proximity graph on a dataset consisting of two clusters.

Theoretical analysis of spectral clustering is still quite scarce. We studied the discrete and continuum limits of spectral clustering by analyzing the geometry of the Laplacian embedding F when the points in X are drawn i.i.d. from a mixture model. We showed that, as the components of this mixture become better separated, spectral clustering identifies the correct clusters in X with high probability.
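
A quick numerical experiment along these lines, using scikit-learn's off-the-shelf spectral clustering rather than our construction (a hedged sketch; the mixture and kernel parameters are illustrative): as the separation between two Gaussian components grows, the recovered labels increasingly match the true components.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(0)
for sep in [1.0, 2.0, 4.0, 8.0]:
    # two-component Gaussian mixture with increasing separation
    X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
                   rng.normal(sep, 1.0, size=(100, 2))])
    labels = SpectralClustering(n_clusters=2, affinity="rbf",
                                gamma=0.5).fit_predict(X)
    # accuracy up to relabelling of the two clusters
    hits = (labels[:100] == 0).sum() + (labels[100:] == 1).sum()
    acc = max(hits, 200 - hits) / 200
    print(f"separation {sep}: accuracy {acc:.2f}")
```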

Second eigenvector of the graph Laplacian (the Fiedler vector) of a dense graph. The values of this eigenvector naturally split the data into two clusters.

Continuum limit of graph Laplacians

It was shown in this article that, in the limit as the number of points in X goes to infinity, the graph Laplacian operators L converge to weighted elliptic operators of the form
\mathcal L : u \mapsto - \rho^{-p} \text{div} ( \rho^q \nabla (u \rho^{-r}))
for certain values of the parameters p, q, r depending on the normalization of the graph Laplacian. Here \rho is the probability density function according to which the points in X are distributed. Understanding the spectral properties of \mathcal L is therefore fundamental to the analysis of a large number of unsupervised and semi-supervised learning algorithms in the large-data limit.
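
To build intuition for \mathcal L, one can discretize it directly. The sketch below is my own finite-difference toy in 1D with no-flux boundaries, not code from the paper; it assembles \mathcal L u = -\rho^{-p} (\rho^q (u \rho^{-r})')' on a uniform grid.

```python
import numpy as np

def weighted_laplacian_1d(rho, h, p, q, r):
    """Finite-difference discretization, with no-flux boundaries, of
    L u = -rho^{-p} ( rho^q ( u * rho^{-r} )' )'  on a uniform 1D grid.
    A toy sketch for building intuition; not the analysis in the paper."""
    n = len(rho)
    w = ((rho[:-1] + rho[1:]) / 2.0) ** q      # rho^q at cell midpoints
    A = np.zeros((n, n))                       # A v  ~  -( rho^q v' )'
    for i in range(n - 1):
        A[i, i] += w[i]
        A[i, i + 1] -= w[i]
        A[i + 1, i + 1] += w[i]
        A[i + 1, i] -= w[i]
    A /= h * h
    return np.diag(rho ** (-p)) @ A @ np.diag(rho ** (-r))

# sanity check: for constant rho the operator reduces to the usual -u''
x = np.linspace(0.0, 1.0, 200)
L = weighted_laplacian_1d(np.ones_like(x), x[1] - x[0], p=1, q=2, r=0)
print(np.sort(np.linalg.eigvals(L).real)[:3])  # ~ Neumann Laplacian spectrum
```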
 
 
Black-and-white image of two neuron cells, used as the density \rho in the definition of \mathcal L.

Second eigenvector of \mathcal L. The sign of this eigenvector clearly identifies the two distinct neurons (blue is negative and red is positive).

We analyze the spectrum of \mathcal L, focusing on cases where \rho concentrates on certain clusters, and show that, depending on the values of p, q, r, a gap manifests in the spectrum of \mathcal L that reveals the number of clusters in \rho; the geometry of the clusters is then apparent in the eigenfunctions of \mathcal L associated with the small eigenvalues.
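
The finite-sample analogue of this spectral gap is easy to observe numerically. In the hedged sketch below (illustrative bandwidths and sample sizes, not the setting of the paper), points are drawn from a two-bump density and the graph Laplacian built on them exhibits two near-zero eigenvalues followed by a clear jump.

```python
import numpy as np

rng = np.random.default_rng(1)
# samples from a density rho concentrated on two clusters
x = np.concatenate([rng.normal(-1.5, 0.2, 200), rng.normal(1.5, 0.2, 200)])
W = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * 0.1 ** 2))
np.fill_diagonal(W, 0.0)
L = np.diag(W.sum(axis=1)) - W
vals = np.linalg.eigvalsh(L)     # eigenvalues in ascending order
print(vals[:4])  # two near-zero eigenvalues, then a jump: the spectral gap
```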
 
 
  • Franca Hoffmann, Bamdad Hosseini, Assad A. Oberai and Andrew M. Stuart, “Spectral analysis of weighted Laplacians arising in data clustering” (2019). url: https://arxiv.org/abs/1909.06389

Semi-supervised learning

Semi-supervised learning (SSL) is the problem of labelling a collection of unlabelled points X from noisily observed labels on a small subset X' \subset X. The probit and one-hot methods are two widely used approaches to SSL: probit recovers binary labels while one-hot can handle finitely many labels. Both methods combine the observed labels with geometric information about X to label the remaining points. As in spectral clustering, a graph Laplacian operator L is constructed on X. The labels on all of X are then found by solving an optimization problem consisting of a data-fidelity term for the observed labels on X' and a Dirichlet-energy regularization term involving L, which injects the geometry of X encoded in the spectrum of L.
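
For concreteness, here is a minimal sketch of the probit method under the formulation above (the helper name, noise level gamma, and toy graph are illustrative assumptions; the one-hot method replaces the data-fidelity term):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def probit_ssl(L, labeled_idx, y, gamma=0.1):
    """Label all points by minimizing
        -sum_j log Phi( y_j u_j / gamma )  +  0.5 u^T L u
    over u, where the sum runs over the labeled subset X'.
    A minimal sketch of the probit method; sign(u) gives the labels."""
    def objective(u):
        fidelity = -norm.logcdf(y * u[labeled_idx] / gamma).sum()
        return fidelity + 0.5 * u @ L @ u
    u = minimize(objective, np.zeros(L.shape[0]), method="L-BFGS-B").x
    return np.sign(u)

# usage sketch: two 1D clusters with one observed label in each
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-1, 0.2, 50), rng.normal(1, 0.2, 50)])
W = np.exp(-(x[:, None] - x[None, :]) ** 2 / 0.08)
np.fill_diagonal(W, 0.0)
L = np.diag(W.sum(axis=1)) - W
print(probit_ssl(L, labeled_idx=np.array([0, 50]), y=np.array([-1.0, 1.0])))
```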

We analyze the consistency of SSL and show that, under appropriate geometric assumptions, the probit and one-hot methods recover the correct labels of all points in X in the limit of small observational noise. Our analysis reveals interesting interactions between the different hyperparameters of both methods.

Using probit to label the pixels of the black-and-white image above in order to identify two separate neurons. The red and blue colors indicate different classes; grey indicates a third class, i.e., the background.
  • Franca Hoffmann, Bamdad Hosseini, Zhi Ren and Andrew M. Stuart, “Consistency of semi-supervised learning algorithms on graphs: Probit and one-hot methods”, Journal of Machine Learning Research (2020, in press). url: https://arxiv.org/abs/1906.07658