

Software for
Clustering Threshold Gradient Descent Regularization

(Last Update: May 24th, 2006. By Shuangge Ma)


Shuangge Ma, Ph.D.
Collaborative Health Studies Coordinating Center and Department of Biostatistics
University of Washington, Seattle, WA 98195
Email: shuangge@u.washington.edu

Jian Huang, Ph.D.
Department of Statistics & Actuarial Science and Program in Public Health Genetics
University of Iowa, Iowa City, IA 52242
Email: jian@stat.uiowa.edu


Introduction:
We provide R source code for estimation and variable selection using the Clustering Threshold Gradient Descent Regularization (CTGDR) method in the logistic regression and Cox proportional hazards models. A detailed description of the algorithm can be found in the paper Clustering Threshold Gradient Descent Regularization: with Applications to Microarray Studies, coauthored by Shuangge Ma and Jian Huang. Below is the abstract of the paper. We strongly recommend reading the full paper before using the software.

Abstract
Motivation:
An important application of microarray technology is to discover important genes and pathways that are correlated with clinical outcomes such as disease status and survival. While a typical microarray experiment surveys gene expressions on a global scale, there may be only a small number of genes that have significant influence on a clinical outcome of interest. In addition, expression data have cluster structures and the genes within a cluster have coordinated influence on the response, but the effects of individual genes in the same cluster may be different. Accordingly, we seek to build statistical models with the following properties. First, the model is sparse in the sense that only a subset of the parameter vector is non-zero. Second, the cluster structures of covariates (genes) are properly accounted for.

Results:
For microarray studies with smooth objective functions and well defined cluster structure for genes, we propose a clustering threshold gradient descent regularization (CTGDR) method, for simultaneous cluster selection and within cluster gene selection. We apply this method to regression models for binary classification and censored survival data with microarray gene expression data as covariates, assuming known cluster structures of gene expressions. Compared to the standard TGDR and other regularization methods, the CTGDR takes into account the cluster structure and carries out feature selection at both the cluster level and within-cluster gene level. We demonstrate the CTGDR on two studies of cancer classification using microarray data and two studies of correlating survival of lymphoma patients with microarray data.


Special Notes:
We would like to thank you for your interest in our study. You are encouraged to try out the software. However, please note:

  • The software is "research software". Shuangge Ma and Jian Huang assume no responsibility for any results produced by the software. (5/24/06)

  • You are welcome to download the software and build an improved version. Please acknowledge our paper or the software properly if you use the software in your study. (5/24/06)


  • Please kindly inform us if you find any errors in the software.


    Documentation:

  • Clustering Threshold Gradient Descent Regularization: with Applications to Microarray Studies. Shuangge Ma and Jian Huang. Technical Report 348, Department of Statistics and Actuarial Science, University of Iowa.



    Source Code and Illustrative Examples:

    Survival analysis of right censored data, Cox model

  • Data Files:
    Gene expression levels. To make the genes more comparable, we suggest normalizing each gene to zero mean and unit variance.
    Survival information. The first column is the sample ID; the second column indicates training set versus testing set for this specific data set; the third column is the event time; the fourth column is the event indicator.
    Clustering information. The first column is the natural rank of genes. The second column is their cluster position.
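
    The data layout described above can be sketched in R as follows. This is only an illustration under stated assumptions, not the distributed source code: the toy values, the object names (expr, surv, clus), the samples-in-rows orientation of the expression matrix, and the 1/0 codings of the training and event indicators are all hypothetical. With the real files, the three objects would be read in (for example with read.table) instead of simulated.

```r
# Hypothetical toy data standing in for the three data files described above.
set.seed(1)
n <- 6  # number of samples
p <- 4  # number of genes

# Gene expression levels; ASSUMPTION: rows are samples, columns are genes.
expr <- matrix(rnorm(n * p, mean = 5, sd = 2), nrow = n, ncol = p)

# Survival information with the four columns described in the text.
surv <- data.frame(
  id     = 1:n,
  train  = c(1, 1, 1, 1, 0, 0),              # ASSUMED coding: 1 = training, 0 = testing
  time   = c(2.1, 5.3, 0.7, 3.8, 4.4, 1.2),  # event time
  status = c(1, 0, 1, 1, 0, 1)               # ASSUMED coding: 1 = event, 0 = censored
)

# Clustering information: gene index and the cluster it belongs to.
clus <- data.frame(gene = 1:p, cluster = c(1, 1, 2, 2))

# Normalize each gene (column) to zero mean and unit variance.
expr_std <- scale(expr, center = TRUE, scale = TRUE)

# Split the samples by the training/testing indicator.
is_train <- surv$train == 1
x_train  <- expr_std[is_train, , drop = FALSE]
x_test   <- expr_std[!is_train, , drop = FALSE]

# Collect the gene indices belonging to each cluster.
genes_by_cluster <- split(clus$gene, clus$cluster)
```

    Here the normalization uses all samples at once; whether to normalize with training-set statistics only is a design choice left to the user.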


  • Source Code:

  • V-fold cross validation.
    Estimation with cross validated k.


    Binary classification, Logistic regression model

  • Data Files:
    Gene expression levels. To make the genes more comparable, we suggest normalizing each gene to zero mean and unit variance.
    Binary outcome.
    Clustering information. The first column is the natural rank of genes. The second column is their cluster position.


  • Source Code:

  • V-fold cross validation.
    Estimation with cross validated k.
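
    As a rough illustration of V-fold cross validation followed by estimation with the selected k, here is a minimal R sketch for the binary outcome case. It rests on several assumptions: k is taken to be the number of gradient iterations, and the fitting routine is a plain thresholded gradient ascent on the logistic log-likelihood used as a stand-in. It is not the CTGDR update from the paper and ignores the cluster structure. The function names (fit_path, neg_loglik, cv_select_k), the threshold tau, the step size delta, and the simulated data are all hypothetical.

```r
# Stand-in fitting routine (NOT the CTGDR algorithm): thresholded gradient
# ascent on the logistic log-likelihood, without an intercept.
# Returns the coefficient path, one column per iteration k = 1, ..., k_max.
fit_path <- function(x, y, k_max, tau = 0.8, delta = 0.01) {
  beta <- rep(0, ncol(x))
  path <- matrix(0, nrow = ncol(x), ncol = k_max)
  for (k in seq_len(k_max)) {
    prob <- 1 / (1 + exp(-drop(x %*% beta)))
    grad <- drop(crossprod(x, y - prob))        # gradient of the log-likelihood
    keep <- abs(grad) >= tau * max(abs(grad))   # update only large-gradient coordinates
    beta <- beta + delta * grad * keep
    path[, k] <- beta
  }
  path
}

# Negative log-likelihood on held-out data for every k along the path.
neg_loglik <- function(x, y, path) {
  eta <- x %*% path                             # (held-out samples) x k_max
  colSums(log1p(exp(eta)) - y * eta)
}

# V-fold cross validation: choose the k that minimizes the summed held-out loss.
cv_select_k <- function(x, y, V = 5, k_max = 200, ...) {
  fold <- sample(rep(seq_len(V), length.out = nrow(x)))  # random fold labels
  cv_loss <- numeric(k_max)
  for (v in seq_len(V)) {
    held <- fold == v
    path <- fit_path(x[!held, , drop = FALSE], y[!held], k_max, ...)
    cv_loss <- cv_loss + neg_loglik(x[held, , drop = FALSE], y[held], path)
  }
  list(k = which.min(cv_loss), cv_loss = cv_loss)
}

# Toy usage on simulated data (hypothetical, for illustration only).
set.seed(2)
n <- 100
p <- 10
x <- scale(matrix(rnorm(n * p), n, p))
y <- rbinom(n, 1, 1 / (1 + exp(-(1.5 * x[, 1] - 1.5 * x[, 2]))))

cv <- cv_select_k(x, y, V = 5, k_max = 200)

# Estimation with the cross-validated k, using all samples.
beta_hat <- fit_path(x, y, cv$k)[, cv$k]
```

    Computing the whole coefficient path once per fold lets every candidate k be scored from a single fit, rather than refitting separately for each k.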

    Acknowledgment:
    The work of Ma is partially supported by the NIH grant N01-HC-95159. The work of Huang is supported in part by the NIH grant HL72288. We thank the Department of Statistics, University of Iowa for hosting the website.


    For questions, comments or suggestions, email shuangge@u.washington.edu or jian@stat.uiowa.edu.