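..., $x_{ip})^T$, $i = 1, \ldots, n$. Gene expression data on $p$ genes for $n$ mRNA samples may be summarized by an $n \times p$ matrix $X = (x_{ij})_{n \times p}$. Let $C_k$ be the indices of the $n_k$ samples in class $k$, where $n_k$ denotes the number of observations belonging to class $k$ and $n = n_1 + \ldots + n_K$. A predictor or classifier for $K$ tumor classes can be built from a learning set $L$ by $C(\cdot, L)$; the predicted class for an observation $x^*$ is $C(x^*, L)$. The $j$th component of the centroid for class $k$ is $\bar{x}_{kj} = \sum_{i \in C_k} x_{ij} / n_k$, and the $j$th component of the overall centroid is $\bar{x}_j = \sum_{i=1}^{n} x_{ij} / n$.

Prediction analysis for microarrays/nearest shrunken centroid method, PAM/NSC

The PAM [3] algorithm shrinks the class centroids $\bar{x}_{kj}$ towards the overall centroid $\bar{x}_j$. Let

$$d_{kj} = \frac{\bar{x}_{kj} - \bar{x}_j}{m_k (s_j + s_0)} \tag{1}$$

where $d_{kj}$ is a t statistic for gene $j$, comparing class $k$ to the overall centroid, and $s_j$ is the pooled within-class standard deviation for gene $j$:

$$s_j^2 = \frac{1}{n - K} \sum_{k=1}^{K} \sum_{i \in C_k} (x_{ij} - \bar{x}_{kj})^2 \tag{2}$$

Here $m_k$ is a scaling factor, depending on $n_k$ and $n$, that makes $m_k s_j$ the estimated standard error of the numerator of $d_{kj}$, and $s_0$ is a positive constant, usually set to the median value of the $s_j$ over the set of genes. Equation (1) can be rearranged to

$$\bar{x}_{kj} = \bar{x}_j + m_k (s_j + s_0) d_{kj} \tag{3}$$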
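As an illustration of the quantities in Equations (1) to (3), the following is a minimal NumPy sketch, not the authors' implementation. The function name `pam_statistics` and its return keys are hypothetical. The form $m_k = \sqrt{1/n_k + 1/n}$ used here is an assumption; some implementations use a minus sign instead.

```python
import numpy as np


def pam_statistics(X, y):
    """Compute class centroids, pooled SDs and the d_kj statistics.

    X is an n x p expression matrix (samples in rows, genes in columns);
    y holds the class label of each sample.
    """
    X = np.asarray(X, dtype=float)
    classes, idx = np.unique(np.asarray(y), return_inverse=True)
    n = X.shape[0]
    K = classes.size
    n_k = np.bincount(idx).astype(float)  # class sizes n_1, ..., n_K

    # Overall centroid and per-class centroids (one row per class).
    overall = X.mean(axis=0)
    centroids = np.vstack([X[idx == k].mean(axis=0) for k in range(K)])

    # Pooled within-class standard deviation s_j, Equation (2).
    resid = X - centroids[idx]
    s = np.sqrt((resid ** 2).sum(axis=0) / (n - K))

    # s_0: median of the s_j over all genes.
    s0 = np.median(s)

    # ASSUMPTION: m_k = sqrt(1/n_k + 1/n); a minus sign is also used in practice.
    m = np.sqrt(1.0 / n_k + 1.0 / n)

    # d_kj, Equation (1); broadcasting gives a K x p matrix.
    d = (centroids - overall) / (m[:, None] * (s + s0))

    return {"classes": classes, "centroids": centroids, "overall": overall,
            "s": s, "s0": s0, "m": m, "d": d}
```

Rearranging as in Equation (3), `overall + m[:, None] * (s + s0) * d` recovers `centroids` up to floating-point error, which is a convenient sanity check on any implementation.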
PAM shrinks each $d_{kj}$ toward zero, yielding shrunken centroids

$$\bar{x}'_{kj} = \bar{x}_j + m_k (s_j + s_0) d'_{kj} \tag{4}$$

Soft thresholding is defined by

$$d'_{kj} = \operatorname{sign}(d_{kj}) (|d_{kj}| - \Delta)_+ \tag{5}$$

where $+$ denotes the positive part ($t_+ = t$ if $t > 0$ and zero otherwise). For a gene $j$, if $d_{kj}$ is shrunken to zero for all classes $k$, then the centroid for gene $j$ is $\bar{x}_j$, the same for all classes. Thus gene $j$ does not contribute to the nearest-centroid computation. The soft threshold $\Delta$ was chosen by cross-validation.

Shrinkage discriminant analysis, SDA

In SDA, feature selection is controlled using the higher criticism threshold (HCT) or false non-discovery rates (FNDR) [5]. The HCT is the order statistic of the Z-score corresponding to the index $i$ that maximizes the higher criticism objective, where $\pi_i$ is the p-value associated with the $i$th Z-score and $\pi_{(i)}$ is the $i$th order statistic of the collection of p-values ($1 \le i \le p$). The ideal threshold optimizes the classification error. SDA consists of shrinkage linear discriminant analysis (SLDA) and shrinkage diagonal discriminant analysis (SDDA) [15, 16].

Shrunken centroids regularized discriminant analysis, SCRDA

There are two tuning parameters in SCRDA [4]: $\alpha$ ($0 < \alpha < 1$) and the soft threshold $\Delta$. The optimal parameter pair $(\alpha, \Delta)$ was chosen by cross-validation, following a "Min-Min" rule. First, all pairs $(\alpha, \Delta)$ that attained the minimal cross-validation error on the training samples were found. Second, among these, the pair or pairs that used the minimal number of genes were selected. When there was more than one optimal pair, the average test error over all chosen pairs was calculated. Because traditional LDA is not suited to the "large p, small N" paradigm, we did not use it to select feature genes.
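The two-step "Min-Min" selection can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code. It assumes the cross-validation errors and gene counts are already available as grids indexed by (α, Δ); the function name `min_min_rule`, its arguments, and the tolerance `tol` are hypothetical.

```python
import numpy as np


def min_min_rule(cv_error, n_genes, alphas, deltas, tol=1e-12):
    """Select (alpha, delta) pairs by the two-step Min-Min rule.

    cv_error[i, j] and n_genes[i, j] are the cross-validation error and the
    number of genes used for the pair (alphas[i], deltas[j]).
    Returns the chosen pairs and a boolean mask over the grid.
    """
    cv_error = np.asarray(cv_error, dtype=float)
    n_genes = np.asarray(n_genes)

    # Step 1: all pairs attaining the minimal cross-validation error.
    candidates = np.isclose(cv_error, cv_error.min(), rtol=0.0, atol=tol)

    # Step 2: among those, keep the pairs that use the fewest genes.
    fewest = n_genes[candidates].min()
    chosen = candidates & (n_genes == fewest)

    rows, cols = np.nonzero(chosen)
    pairs = [(alphas[i], deltas[j]) for i, j in zip(rows, cols)]
    return pairs, chosen
```

If several pairs are returned, the averaged test error described in the text would be `test_error[chosen].mean()`, where `test_error` is a hypothetical grid of test errors with the same shape as `cv_error`.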