Below I showcase some of my projects in data science and machine learning. For an up to date list of my papers please refer to the publications page.
Conditional sampling with MGANs
Conditional sampling is a fundamental problem in statistics and machine learning. Consider the supervised learning problem of predicting an output at an input . We cast this problem as the problem of identifying the conditional measure from a training data set consisting of input-output samples .
Monotone Generative Adversarial Networks (MGANs) are a variant of the standard GANs, by adding appropriate structural and monotonicity conditions, that are able to sample the desired conditionals . More precisely, MGANs give a mapping so that for any new input the map pushes a standard Gaussian on to the desired conditional .
- Nikola B. Kovachki, Ricardo Baptista, Bamdad Hosseini, and Youssef M. Marzouk, “Conditional sampling with monotone GANs” (2020). url:https://arxiv.org/abs/2006.06755
Clustering is an unsupervised learning technique aiming to identify meaningful coarse structures in a point cloud . Spectral clustering is a particularly successful approach to this problem where a graph is constructed on as well as a graph Laplacian operator . Then an embedding is defined using the eigenvector of that maps to a low-dimensional space where the clustering of is revealed easier.
Theoretical analysis of spectral clustering is still quite scarce. We studied the discrete and the continuum limits of spectral clustering by analyzing the geometry of the Laplacian embedding when the points in are i.i.d. with respect to a mixture model. We showed that as the components of this mixture become better separated spectral clustering will identify the correct clusters in with high probability.
- Nicolas Garcia Trillos, Franca Hoffmann and Bamdad Hosseini “Geometric structure of graph Laplacian embeddings” (2019). url:https://arxiv.org/abs/1901.10651.
Continuum limit of graph Laplacians
goes to infinity, the graph Laplacian operators converge to weighted elliptic operators of the form
for certain values of the parameters depending on the normalization of the graph Laplacian. Here is the probability density function according to which the points in are distributed. Understanding the spectral properties of is therefore fundamental to the analysis of a large number of unsupervised and semi-supervised learning algorithms in the large data limit.
- Franca Hoffmann, Bamdad Hosseini, Assad A. Oberai and Andrew M. Stuart “Spectral analysis of weighted Laplacians arising in data clustering” (2019). url:https://arxiv.org/abs/1909.06389
Semi-supervised learning (SSL) is the problem of labelling a collection of unlabelled points from noisily observed labels of a small subset . The probit and one-hot methods are two widely used approaches to SSL — probit recovers binary labels while one-hot can handle finitely many labels. Both methods combine the observed labels with geometric information about to label the rest of the points. Similar to spectral clustering a graph Laplacian operator is constructed on . The labels on all of are then found by solving an optimization problem consisting of a data fidelity term for the observed labels on , and a Dirichlet energy regularization term involving the graph Laplacian that injects the geometry of encoded in the spectrum of .
We analyze consistency of SSL and show that, under appropriate geometric assumptions the probit and one-hot methods recover the correct label of all points in in the limit of small observational noise. Our analysis reveals interesting interactions between different hyperparameters in both methods.
- Franca Hoffmann, Bamdad Hosseini, Zhi Ren and Andrew M. Stuart “Consistency of semi-supervised learning algorithms on graphs: Probit and one-hot methods”. Journal Of Machine Learning Research (2020, In press). url:https://arxiv.org/abs/1906.07658