Below are links to representative publications of my work (with a preference for journal publications), organized by topic.

Column Subset Selection

The Column Subset Selection Problem (CSSP) is the following combinatorial optimization problem on matrices: given an m × n matrix A and an integer k, select k columns from A to form an m × k matrix C such that ‖A − CC⁺A‖_F (or ‖A − CC⁺A‖₂) is minimized among all possible choices for the matrix C. CSSP is important because a solution to it reveals the most ‘‘influential’’ columns of A. My work on CSSP has focused on the algorithmic aspects of the problem, providing polynomial-time methods (see FOCS paper) as well as input-sparsity-time methods (see STOC paper) with optimal approximation bounds. A representative result is this: given an m × n matrix A, an integer k, and an accuracy parameter 0 < ε < 1, there exists a randomized algorithm to select c = O(k/ε) columns of A and form an m × c matrix C such that the following bound holds in expectation: ‖A − CC⁺A‖_F ≤ (1 + ε) ‖A − A_k‖_F; here, A_k is the best rank-k approximation of A computed via the SVD of A.
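As a toy illustration of the CSSP objective (not the FOCS/STOC algorithms themselves, whose sampling schemes are more involved), the sketch below samples columns with probabilities proportional to rank-k leverage scores and compares the projection error ‖A − CC⁺A‖_F against the SVD baseline ‖A − A_k‖_F. All helper names here are made up for the example:

```python
import numpy as np

def css_error(A, cols):
    """Frobenius error of projecting A onto the span of the selected columns."""
    C = A[:, cols]
    return np.linalg.norm(A - C @ np.linalg.pinv(C) @ A)

def leverage_score_columns(A, k, c, rng):
    """Sample c columns with probabilities given by rank-k leverage scores
    (a simplified sampling scheme, for illustration only)."""
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    scores = (Vt[:k] ** 2).sum(0)          # rank-k column leverage scores
    return rng.choice(A.shape[1], size=c, replace=False, p=scores / scores.sum())

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 8)) @ rng.standard_normal((8, 100))
A = A + 0.01 * rng.standard_normal(A.shape)   # nearly rank-8 matrix
k = 8
cols = leverage_score_columns(A, k, c=2 * k, rng=rng)
s = np.linalg.svd(A, compute_uv=False)
svd_baseline = np.sqrt((s[k:] ** 2).sum())    # ||A - A_k||_F, the benchmark
col_error = css_error(A, cols)                # within a small factor of the benchmark
```

On near-low-rank inputs like this one, the sampled columns capture the dominant column space, so the column-based error lands close to the SVD benchmark.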
Principal Component Analysis

Given an m × n matrix A and an integer k, the principal component analysis (PCA) problem is defined as finding a k × n matrix W and an m × k matrix U with orthonormal columns that simultaneously minimize the error ‖A − UW‖_F. An optimal PCA is available via setting W = U_k^T A and U = U_k, where U_k is the m × k matrix containing the top k left singular vectors of A. I have studied principal component analysis from different perspectives.
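The optimal solution above can be sketched in a few lines of numpy (illustrative names; by the Eckart-Young theorem the residual equals the norm of the trailing singular values):

```python
import numpy as np

def pca_topk(A, k):
    """Optimal rank-k PCA: Uk holds the top-k left singular vectors of A
    (orthonormal columns), and W = Uk^T A holds the coordinates."""
    U, _, _ = np.linalg.svd(A, full_matrices=False)
    Uk = U[:, :k]
    return Uk, Uk.T @ A

rng = np.random.default_rng(1)
A = rng.standard_normal((40, 60))
Uk, W = pca_topk(A, 5)
err = np.linalg.norm(A - Uk @ W)
# by Eckart-Young, err equals sqrt(sum of the squared trailing singular values)
```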
k-means Clustering

In the k-means clustering problem we are given a set of m points in R^n and the number of clusters k. The problem asks to find a partition of the points into k disjoint sets (clusters) such that the sum, over all clusters, of the squared distances of the points to the corresponding cluster centroid is minimized among all possible partitions. My work on this problem has focused on designing provably accurate dimensionality reduction methods: feature selection and feature extraction methods. Such dimensionality reduction methods can be used to solve the k-means problem in two steps: 1) ‘‘project’’ the input points to some r-dimensional space (r ≪ n); 2) use some standard k-means algorithm on the low-dimensional points to find the desired partition. A highlight of the results that I have obtained for this problem is the following: for a given point set, number of clusters k, and accuracy parameter ε > 0, there is a randomized algorithm to select a small number r of features from the input points such that solving the k-means problem on the resulting r-dimensional points gives an approximation, to within a small constant factor, of the optimal k-means partition. Indeed, this result is the first provably accurate feature selection method for k-means clustering.
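The two-step pipeline can be sketched as follows, with a random projection playing the role of feature extraction (a generic illustration, not the provable methods above, which choose the target dimension r as a function of k and ε; the helper names are made up):

```python
import numpy as np

def farthest_first(X, k):
    """Simple deterministic seeding: greedily pick far-apart points as centers."""
    centers = [X[0]]
    for _ in range(k - 1):
        d = np.min([((X - c) ** 2).sum(1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    return np.array(centers)

def lloyd(X, k, iters=30):
    """Basic Lloyd's k-means; returns cluster labels."""
    centers = farthest_first(X, k)
    for _ in range(iters):
        labels = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(0)
    return labels

rng = np.random.default_rng(2)
# two planted clusters in n = 200 dimensions
X = np.vstack([rng.standard_normal((30, 200)),
               rng.standard_normal((30, 200)) + 8.0])
r = 20                                          # target dimension, r << n
S = rng.standard_normal((200, r)) / np.sqrt(r)  # step 1: project the points
labels = lloyd(X @ S, k=2)                      # step 2: k-means in r dimensions
```

Because the projection approximately preserves the clustering structure, the low-dimensional k-means recovers the planted two-cluster partition.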
Least-squares Problems

A linear algebraic view of a least-squares problem is: given an m × n matrix A and a vector b in R^m, find a vector x in R^n to minimize ‖Ax − b‖₂. In this generic form, this problem has found a number of applications in numerical linear algebra, machine learning, and statistical data analysis. My work on least-squares problems/algorithms spans different topics.
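One common randomized approach in this area (a generic ‘‘sketch-and-solve’’ illustration, not necessarily the specific algorithms in the papers) compresses the rows of the problem before solving it:

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 2000, 10
A = rng.standard_normal((m, n))
b = A @ rng.standard_normal(n) + 0.01 * rng.standard_normal(m)

# exact solution: minimize ||Ax - b||_2
x_exact, *_ = np.linalg.lstsq(A, b, rcond=None)

# sketch-and-solve: compress the m rows down to s << m rows with a random
# matrix S, then solve the much smaller least-squares problem
s = 400
S = rng.standard_normal((s, m)) / np.sqrt(s)
x_sketch, *_ = np.linalg.lstsq(S @ A, S @ b, rcond=None)
```

A plain Gaussian sketch is used here for simplicity; fast methods use structured or sparse sketches so that S @ A is cheap to form.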
Canonical Correlation Analysis

The goal of Canonical Correlation Analysis (CCA) is to analyze the relation between a pair of datasets (each in matrix form, say A and B). From a statistical point of view, CCA finds the directions of maximal correlation between A and B. From a linear algebraic point of view, CCA measures the similarities between the column spaces of A and B. From a geometric point of view, CCA computes the cosines of the principal angles between range(A) and range(B). In the paper below, we presented and analyzed a fast approximation algorithm for CCA. Our algorithm is based on dimensionality reduction via fast random projections: given two tall-and-skinny matrices A and B, our algorithm first multiplies the two matrices from the left with some random matrix S to significantly reduce the number of rows of A and B, and then employs a standard deterministic CCA algorithm on the ‘‘small’’ matrices.
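A small numpy sketch of this scheme, assuming a plain Gaussian sketch S for simplicity (the actual algorithm uses fast random projections): the canonical correlations are the singular values of Q_A^T Q_B, where Q_A and Q_B are orthonormal bases for range(A) and range(B).

```python
import numpy as np

def canonical_correlations(A, B):
    """Canonical correlations = singular values of Qa^T Qb, where Qa and Qb
    are orthonormal bases for the column spaces of A and B."""
    Qa, _ = np.linalg.qr(A)
    Qb, _ = np.linalg.qr(B)
    return np.linalg.svd(Qa.T @ Qb, compute_uv=False)

rng = np.random.default_rng(4)
m = 5000                                  # tall-and-skinny inputs
A = rng.standard_normal((m, 4))
B = A @ rng.standard_normal((4, 3)) + 0.1 * rng.standard_normal((m, 3))

exact = canonical_correlations(A, B)

# sketched CCA: multiply both matrices from the left by a short random matrix,
# then run the deterministic algorithm on the small sketched matrices
s = 500
S = rng.standard_normal((s, m)) / np.sqrt(s)
approx = canonical_correlations(S @ A, S @ B)
```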
Determinant

The determinant of a matrix is one of the most important quantities associated with it, and it has countless applications in scientific computing. In the paper below, we designed, analyzed, and implemented a fast approximation algorithm for the logarithm of the determinant of an arbitrary Symmetric Positive Definite (SPD) matrix. To the best of our knowledge, our algorithm is the first that comes with strong worst-case theoretical bounds and works for all SPD matrices.
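A sketch of the general trace-estimation idea behind such approximations (a simplified Taylor-series/Hutchinson variant, not the paper's exact algorithm or its bounds; all parameter names are made up):

```python
import numpy as np

def logdet_spd_approx(A, num_terms=80, num_probes=30, seed=0):
    """Approximate log det(A) for SPD A via the identity
    log det(A) = n*log(alpha) + trace(log(A/alpha)), expanding
    log(A/alpha) = -sum_{j>=1} (I - A/alpha)^j / j and estimating the trace
    with Hutchinson's Rademacher probes (matrix-vector products only)."""
    n = A.shape[0]
    rng = np.random.default_rng(seed)
    alpha = np.linalg.norm(A, 2) * 1.01      # eigenvalues of A/alpha lie in (0, 1)
    B = np.eye(n) - A / alpha                # spectral radius < 1
    total = 0.0
    for _ in range(num_probes):
        z = rng.choice([-1.0, 1.0], size=n)  # Rademacher probe vector
        v = z.copy()
        acc = 0.0
        for j in range(1, num_terms + 1):
            v = B @ v                        # v = B^j z
            acc += z @ v / j
        total -= acc
    return n * np.log(alpha) + total / num_probes

rng = np.random.default_rng(5)
M = rng.standard_normal((100, 100))
A = M @ M.T + 100 * np.eye(100)              # well-conditioned SPD matrix
approx = logdet_spd_approx(A)
exact = np.linalg.slogdet(A)[1]              # ground truth for comparison
```

The only access to A is through matrix-vector products, which is what makes this style of estimator attractive for large sparse SPD matrices.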
Spectral Clustering

Spectral clustering is the process of clustering a given set of points using the eigenvectors of the Laplacian matrix of the similarity graph corresponding to those points. My work on this topic has focused on fast approximation algorithms for spectral clustering. The first paper below presents a survey of such algorithms, including several numerical experiments. The more recent paper presents a rigorous theoretical analysis of the use of the so-called power method to speed up spectral clustering.
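The flavor of the power-method speedup can be sketched as follows (a toy version with made-up helper names: block power iteration approximates the dominant eigenspace of the normalized similarity matrix, and the resulting embedding is then clustered; the papers' algorithms and analysis are more careful):

```python
import numpy as np

def power_method_subspace(M, k, iters=200, seed=0):
    """Block power iteration: approximate the dominant k-dimensional
    eigenspace of a symmetric matrix, re-orthonormalizing each step."""
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.standard_normal((M.shape[0], k)))
    for _ in range(iters):
        Q, _ = np.linalg.qr(M @ Q)
    return Q

rng = np.random.default_rng(6)
# two planted clusters of 2-d points; Gaussian similarity graph
X = np.vstack([rng.standard_normal((20, 2)),
               rng.standard_normal((20, 2)) + 15.0])
d2 = ((X[:, None] - X[None, :]) ** 2).sum(-1)
W = np.exp(-d2 / 8.0)                             # similarity matrix
deg = W.sum(1)
N = W / np.sqrt(np.outer(deg, deg))               # D^{-1/2} W D^{-1/2}
Q = power_method_subspace(N, k=2)                 # approximate spectral embedding
R = Q / np.linalg.norm(Q, axis=1, keepdims=True)  # row-normalize the embedding
labels = (R @ R[0] > 0.5).astype(int)             # same cluster as point 0 or not
```

For well-separated clusters the embedding rows of the two groups are nearly orthogonal, so the simple cosine test above recovers the planted partition.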
Support Vector Machines

A support vector machine (SVM) is an algorithm for classifying high-dimensional data. I have worked on the topic of ‘‘dimensionality reduction’’ for SVM classification. The paper below provides a provably correct, random-projection-based dimensionality reduction algorithm of this kind. There are also similar results for support vector regression.
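The geometric fact underlying such random-projection results can be demonstrated directly (a generic Johnson-Lindenstrauss-style experiment, not the paper's specific construction): pairwise distances, and hence margins, are approximately preserved after projection.

```python
import numpy as np

def sq_dists(X):
    """Pairwise squared Euclidean distances via the Gram matrix."""
    g = (X * X).sum(1)
    return g[:, None] + g[None, :] - 2.0 * (X @ X.T)

rng = np.random.default_rng(7)
n, d, r = 100, 1000, 400
X = rng.standard_normal((n, d))

S = rng.standard_normal((d, r)) / np.sqrt(r)   # Gaussian random projection
Y = X @ S                                      # n points in r << d dimensions

mask = ~np.eye(n, dtype=bool)                  # ignore self-distances
ratios = sq_dists(Y)[mask] / sq_dists(X)[mask]
# the ratios concentrate around 1: the geometry survives the projection,
# which is why a classifier trained on Y behaves like one trained on X
```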
Nonnegative Matrix Factorization

In Nonnegative Matrix Factorization (NMF), we are given an m × n matrix A with nonnegative entries and an integer k; the problem asks to find an m × k matrix W with nonnegative entries and a k × n matrix H with nonnegative entries such that the error ‖A − WH‖_F is as small as possible. During my undergraduate studies, I developed an algorithm to find matrices W and H that are ‘‘good’’ starting points for the iterative algorithms that are used to solve the NMF problem.
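A sketch of an SVD-based initialization in the spirit of such ‘‘good starting point’’ schemes (a simplified variant, not necessarily the exact published algorithm): for each singular triplet, keep the dominant nonnegative parts of the singular vectors.

```python
import numpy as np

def nnsvd_init(A, k):
    """SVD-based nonnegative initialization for NMF: split each singular
    vector into its positive and negative parts and keep the dominant pair."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    m, n = A.shape
    W, H = np.zeros((m, k)), np.zeros((k, n))
    # for a nonnegative A, the leading singular vectors can be taken nonnegative
    W[:, 0] = np.sqrt(s[0]) * np.abs(U[:, 0])
    H[0, :] = np.sqrt(s[0]) * np.abs(Vt[0, :])
    for j in range(1, k):
        u, v = U[:, j], Vt[j, :]
        up, un = np.maximum(u, 0), np.maximum(-u, 0)
        vp, vn = np.maximum(v, 0), np.maximum(-v, 0)
        # keep whichever of the (positive, positive) / (negative, negative)
        # part pairs carries more mass
        if np.linalg.norm(up) * np.linalg.norm(vp) >= np.linalg.norm(un) * np.linalg.norm(vn):
            x, y = up, vp
        else:
            x, y = un, vn
        nx, ny = np.linalg.norm(x), np.linalg.norm(y)
        if nx * ny > 0:
            scale = np.sqrt(s[j] * nx * ny)
            W[:, j] = scale * x / nx
            H[j, :] = scale * y / ny
    return W, H

rng = np.random.default_rng(8)
A = rng.random((30, 20))        # nonnegative data
W, H = nnsvd_init(A, k=5)
# W and H are entrywise nonnegative starting points for an iterative NMF solver
```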
