# Data science

Below I showcase some of my projects in data science and machine learning. For an up to date list of my papers please refer to the publications page.

## Conditional sampling with MGANs

Conditional sampling is a fundamental problem in statistics and machine learning. Consider the supervised learning problem of predicting an output $y$ at an input $x$. We cast this problem as the problem of identifying the conditional measure $y|x$ from a training data set consisting of input-output samples $(x_i, y_i)$

Monotone Generative Adversarial Networks (MGANs) are a variant of the standard GANs, by adding appropriate structural and monotonicity conditions, that are able to sample the desired conditionals $y|x$. More precisely, MGANs give a mapping $T(x,y)$ so that for any new input $x^\ast$ the map $T(x^\ast, \cdot)$ pushes a standard Gaussian on $y$ to the desired conditional $y|x^\ast$

## Spectral clustering

Clustering is an unsupervised learning technique aiming to identify meaningful coarse structures in a point cloud $X$.  Spectral clustering is a particularly successful approach to this problem where a graph is constructed on $X$ as well as a graph Laplacian operator $L$. Then an embedding $F$ is defined using the eigenvector of $L$ that maps $X$ to a low-dimensional space where the clustering of $X$ is revealed easier.

Theoretical analysis of spectral clustering is still quite scarce.  We studied the discrete and the continuum limits of spectral clustering by analyzing the geometry of the Laplacian embedding $F$ when the points in $X$ are i.i.d. with respect to a mixture model. We showed that as the components of this mixture become better separated spectral clustering will identify the correct clusters in $X$ with high probability.

## Continuum limit of graph Laplacians

It was shown in this article that, in the limit as the number of points in $X$
goes to infinity, the graph Laplacian operators $L$ converge to weighted elliptic operators of the form
$\mathcal L : u \mapsto - \rho^{-p} \text{div} ( \rho^q \nabla (u \rho^{-r}))$
for certain values of the parameters $p,q,r$ depending on the normalization of the graph Laplacian. Here $\rho$ is the probability density function according to which the points in $X$ are distributed. Understanding the spectral properties of $\mathcal L$ is therefore fundamental to the analysis of a large number of unsupervised and semi-supervised learning algorithms in the large data limit.

We analyze the spectrum of $\mathcal L$ focusing on cases where $\rho$ is concentrated on certain clusters and show that, depending on the values of $p,q,r$, a gap manifests in the spectrum of $\mathcal L$ that reveals the number of clusters in $\rho$ and the geometry of the clusters is apparent in the eigenfunctions of $\mathcal L$ associated to the small eigenvalues.

• Franca Hoffmann, Bamdad Hosseini, Assad A. Oberai and Andrew M. Stuart “Spectral analysis of weighted Laplacians arising in data clustering” (2019). url:https://arxiv.org/abs/1909.06389

## Semi-supervised learning

Semi-supervised learning (SSL) is the problem of labelling a collection of unlabelled points $X$ from noisily observed labels of a small subset $X' \subset X$. The probit and one-hot methods are two widely used approaches to SSL — probit recovers binary labels while one-hot can handle finitely many labels. Both methods combine the observed labels with geometric information about $X$ to label the rest of the points. Similar to spectral clustering a graph Laplacian operator $L$ is constructed on $X$. The labels on all of $X$ are then found by solving an optimization problem consisting of a data fidelity term for the observed labels on $X'$, and a Dirichlet energy regularization term involving the graph Laplacian $L$ that injects the geometry of $X$ encoded in the spectrum of $L$.

We analyze consistency of SSL and show that, under appropriate geometric assumptions the probit and one-hot methods recover the correct label of all points in $X$ in the limit of small observational noise. Our analysis reveals interesting interactions between different hyperparameters in both methods.

• Franca Hoffmann, Bamdad Hosseini, Zhi Ren and Andrew M. Stuart “Consistency of semi-supervised learning algorithms on graphs: Probit and one-hot methods”. Journal Of Machine Learning Research (2020, In press). url:https://arxiv.org/abs/1906.07658