## Software for Regularized ROC for Disease Classification and Biomarker Selection

(Last Update: Dec. 18th, 2005. By Shuangge Ma)

Shuangge Ma, Ph.D.

Collaborative Health Studies Coordinating Center and Department of Biostatistics

University of Washington, Seattle, WA 98195

Email: shuangge@u.washington.edu

Jian Huang, Ph.D.

Department of Statistics & Actuarial Science and Program in Public Health Genetics

University of Iowa, Iowa City, IA 52242

Email: jian@stat.uiowa.edu

Introduction:

We provide software (written in R) for estimation and variable selection using the regularized ROC approach proposed in the paper "Regularized ROC method for disease classification and biomarker selection with microarray data", published in Bioinformatics and coauthored by Shuangge Ma and Jian Huang. The abstract of the paper is given below. We strongly recommend reading the full paper before using the software.

Abstract

An important application of microarrays is to discover genomic biomarkers, among tens of thousands of genes assayed, for disease classification. Thus there is a need for developing statistical methods that can efficiently use such high-throughput genomic data, select biomarkers with discriminant power and construct classification rules. The ROC (receiver operating characteristic) technique has been widely used in disease classification with low dimensional biomarkers because (1) it does not assume a parametric form of the class probability, as required for example in the logistic regression method; (2) it accommodates case-control designs; and (3) it allows treating false positives and false negatives differently. However, due to computational difficulties, the ROC based classification has not been used with microarray data. Moreover, the standard ROC technique does not incorporate built-in biomarker selection.

We propose a novel method for biomarker selection and classification using the ROC technique for microarray data. The proposed method uses a sigmoid approximation to the area under the ROC curve as the objective function for classification and the threshold gradient descent regularization method for estimation and biomarker selection. Tuning parameter selection based on the V-fold cross validation and predictive performance evaluation are also investigated. The proposed approach is demonstrated with a simulation study, the Colon data and the Estrogen data. The proposed approach yields parsimonious models with excellent classification performance.

Special Notes:

We would like to thank you for your interest in our study. You are encouraged to try out the software. However, please note:

This is research software. Shuangge Ma and Jian Huang assume no responsibility for any results produced by the software. (12/18/05)

You are welcome to download the software and build improved versions. Please acknowledge our paper or the software properly if you use the software in your study. (12/18/05)

Please let us know if you find any errors in the software.

Documentation:

Shuangge Ma and Jian Huang. Regularized ROC method for disease classification and biomarker selection with microarray data. Bioinformatics 2005; 21: 4356-4362.

Source Code and a Toy Example:

The following files are the source code and a toy example demonstrating the use of the software. The software is written in R and has been tested under a standard Linux operating system. Certain modifications may be needed for other operating systems.

Consider a simple example with sample size 50 and 500 covariates. The covariates are assumed to be independent and marginally normally distributed. The first 0.05*500 = 25 covariates (genes) are differentially expressed.
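The setup described above can be sketched in a few lines of R. This is only a minimal illustration and not the downloadable data generation code: the balanced case/control split, the mean shift of 1 for the differentially expressed genes, the seed, and all variable names are assumptions made for the example.

```r
# Toy data matching the description: n = 50 subjects, p = 500 covariates,
# independent normal covariates, first 5% differentially expressed.
set.seed(2005)                        # arbitrary seed, for reproducibility only

n <- 50
p <- 500
n_signal <- round(0.05 * p)           # 25 differentially expressed genes
shift <- 1                            # ASSUMED mean difference in cases

y <- rep(c(0, 1), each = n / 2)       # ASSUMED balanced design: 25 controls, 25 cases
x <- matrix(rnorm(n * p), nrow = n, ncol = p)

# Shift the mean of the first n_signal covariates for the cases only
signal_cols <- seq_len(n_signal)
x[y == 1, signal_cols] <- x[y == 1, signal_cols] + shift

# Quick check: average case-control mean difference, signal genes versus the rest
mean_diff <- colMeans(x[y == 1, ]) - colMeans(x[y == 0, ])
signal_gap <- mean(mean_diff[signal_cols])
noise_gap <- mean(mean_diff[-signal_cols])
```

Under these assumptions, `signal_gap` should be close to the chosen shift and `noise_gap` close to zero.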

The data generation code can be downloaded here.

Sample outcome data.

Sample covariate data.

V-fold cross validation is used to select the optimal tuning parameter. The cross validation code with the threshold equal to 1 can be downloaded here.

The cross validation code generates two files. The first one shows the CV score as a function of the number of iterations. The second one identifies the number of iterations needed, based on the first file.

Note that we randomly permute the data in the cross validation to avoid any "hidden pattern" (for example, the first half of the subjects all coming from the case group). As a result, repeated runs of the cross validation code may give similar but not identical results.

To run cross validation with a different threshold, you need to change its value manually in the code. The "increase" value may also need to be adjusted.
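To illustrate why the permutation matters, here is a small sketch of V-fold assignment with and without shuffling. The helper `make_folds`, the choice of V = 5, and the toy outcome vector are hypothetical and are not taken from the downloadable code.

```r
# Assign n subjects to v folds, optionally after a random permutation.
make_folds <- function(n, v, permute = TRUE) {
  idx <- if (permute) sample(n) else seq_len(n)
  fold_id <- sort(rep(seq_len(v), length.out = n))   # blocks of near-equal size
  split(idx, fold_id)
}

n <- 50
v <- 5
y <- rep(c(1, 0), each = n / 2)       # hidden pattern: all cases come first

# Without permutation, folds are contiguous blocks: most contain one class only
plain_folds <- make_folds(n, v, permute = FALSE)
plain_case_rate <- sapply(plain_folds, function(i) mean(y[i]))

# With permutation, cases and controls are spread across folds at random,
# so results vary from run to run unless the seed is fixed
set.seed(1)
perm_folds <- make_folds(n, v, permute = TRUE)
perm_case_rate <- sapply(perm_folds, function(i) mean(y[i]))
```

Comparing `plain_case_rate` with `perm_case_rate` shows the proportion of cases per fold under each scheme. Fixing the seed, as done here, is one way to make a cross validation run repeatable.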

After cross validation, estimation can be achieved using the code here, which should generate a file showing the final estimate.
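As a rough illustration of the estimation step, the sketch below runs threshold gradient descent on a sigmoid-smoothed AUC for a small simulated data set. It is not the downloadable code: the function names, the smoothing parameter `sigma`, the step size `delta`, and the toy data are all assumptions. No normalization of the coefficients is applied, although the AUC depends on the score only up to a positive scale. The number of iterations `n_iter` stands in for the value chosen by cross validation.

```r
# Sigmoid-smoothed AUC of the linear score x %*% beta (cases: y == 1).
smoothed_auc <- function(beta, x, y, sigma) {
  score <- as.vector(x %*% beta)
  d <- outer(score[y == 1], score[y == 0], "-")   # case minus control scores
  mean(plogis(d / sigma))
}

# Gradient of the smoothed AUC with respect to beta.
smoothed_auc_grad <- function(beta, x, y, sigma) {
  x1 <- x[y == 1, , drop = FALSE]
  x0 <- x[y == 0, , drop = FALSE]
  d <- outer(as.vector(x1 %*% beta), as.vector(x0 %*% beta), "-")
  s <- plogis(d / sigma)
  w <- s * (1 - s) / sigma                        # sigmoid derivative for each pair
  as.vector(crossprod(x1, rowSums(w)) - crossprod(x0, colSums(w))) / length(d)
}

# Threshold gradient updates: at each step, move only the coordinates whose
# gradient is at least tau times the largest absolute gradient component.
tgdr <- function(x, y, tau = 1, delta = 0.05, n_iter = 100, sigma = 1) {
  beta <- rep(0, ncol(x))
  for (k in seq_len(n_iter)) {
    g <- smoothed_auc_grad(beta, x, y, sigma)
    keep <- abs(g) >= tau * max(abs(g))
    beta <- beta + delta * g * keep               # ascent step (AUC is maximized)
  }
  beta
}

# Toy data (ASSUMED): 50 subjects, 20 covariates, the first two informative
set.seed(1)
n <- 50
p <- 20
y <- rep(c(0, 1), each = n / 2)
x <- matrix(rnorm(n * p), n, p)
x[y == 1, 1:2] <- x[y == 1, 1:2] + 1.5

beta_hat <- tgdr(x, y, tau = 1, delta = 0.05, n_iter = 100, sigma = 1)
selected <- which(beta_hat != 0)                  # covariates with nonzero estimates
fit_auc <- smoothed_auc(beta_hat, x, y, sigma = 1)
```

With `tau = 1`, only the coordinate or coordinates with the largest absolute gradient move at each step, so coefficients that are never updated stay exactly zero. Smaller values of `tau` let more coordinates move per step and typically give denser estimates.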

Acknowledgment:

The work of Ma is partially supported by the NIH grant N01-HC-95159. The work of Huang is supported in part by the NIH grant HL72288.

For questions, comments or suggestions, email shuangge@u.washington.edu or jian@stat.uiowa.edu.