Karthik Devarajan, PhD

Associate Member
Medical Science Division


Karthik.Devarajan@fccc.edu
Phone: 215-728-2794


The explosion in DNA microarray technology in the last decade has given rise to extensive biological data in the form of expression profiles of tens of thousands of genes and proteins, often from only a handful of tissue samples. The principal objective of a high-throughput experiment can generally be characterized as class comparison, class prediction, or molecular pattern discovery. Class comparison studies are designed to identify genes that are differentially expressed between classes such as tissue types, patients, or experimental conditions. In class prediction, the emphasis is on building a predictive gene set from the class labels and expression profiles of known samples, and applying it to a new sample to predict its class. In molecular pattern discovery, by contrast, the classes are not defined independently of the gene expression profiles and are unknown a priori.

The focus of my research is the development of novel statistical methodology for the analysis of large-scale biological data stemming from high-throughput experiments such as microarrays, comparative genomic hybridization, and proteomics. This includes methods for relating outcome variables (qualitative or quantitative) to large numbers of covariates, and for molecular pattern discovery, based on supervised and unsupervised learning methods. Class comparison for the identification of differential expression and class prediction both fall within this framework. Two methods we are currently investigating are non-negative matrix factorization and support vector machines. We aim to combine an information-theoretic approach with learning-theoretic methods to discriminate between competing models and to elucidate clusters and hidden variables within such large-scale data.

We are exploring various matrix factorization methods to understand their relative strengths and weaknesses, their applicability, and their relationship to one another in this setting. One specific problem of interest is associating large-scale molecular and clinical data with survival time in the presence of censoring. This is an important issue in translational medicine; however, very little research has been done in this area.