Stats: Differential expression in microarray data (no date)

Differential expression in microarray data (no date)

This page is moving to a new website.

You can compute an expression ratio for each gene by taking the average of the log expression levels in the treatment group and subtracting the average of the log expression levels in the control group. This actually produces a log ratio, and you can compute the actual ratio by taking the antilog.

m2.control <- as.matrix(m2.normalize3[,1:4]) m2.treat <- as.matrix(m2.normalize3[,5:8]) m2.logratio <- apply(m2.treat,1,mean)-apply(m2.control,1,mean)

Which of these expression levels is significantly different from 1? The answer to this question is surprisingly difficult. Here are some approaches.

Declare any ratio bigger than 2 or smaller than 0.5 to be significant

Compute a two-sample t-test for each gene

Modify the significance level of the t-test using Bonferroni

Add a fudge factor to the denominator of each t-test.

Pool results across similar genes

The simplest approach, but an approach that is probably not defensible, is to state that any ratios beyond a certain threshold represent genes with differential expression. Typically, this approach highlights ratios larger than 2 (and smaller than 1/2) though threshold of 3 (1/3) and 5 (1/5) have also been used.

Although I cannot recommend this approach, here is how it would work in R. Recall that a ratios of 2 and 1/2 correspond to log ratios of -1 and +1 if you use base 2 logarithms.

diff1 <- m2.logratio[abs(m2.logratio)>1] names(diff1) [1] "5" "11" "16" "17" "37" "49" "50" "53" "54" "57" [11] "58" "60" "63" "64" "65" "68" "80" "98" "102" "105" [21] "107" "116" "125" "133" "134" "147" "150" "158" "162" "166" [31] "168" "171" "173" "177" "193" "205" "218" "224" "232" "236" . . . [601] "3253" "3260" "3261" "3271" "3272" "3282" "3289" "3290" "3311" "3331" [611] "3348" "3350" "3354" "3358" "3367" "3379" "3404" "3421" "3423"

Another approach is to apply a two-sample t-test or a related statistic to each gene.

f.custom.t <- function(x) { t.test(x[1:4],x[5:8])$p.value } m2.t.tests <- apply(m2.normalize3,1,f.custom.t) diff2 <- m2.t.tests[m2.t.tests < 0.05] names(diff2) [1] "3" "5" "8" "9" "11" "13" "16" "17" "19" "21" [11] "22" "26" "27" "30" "32" "33" "35" "36" "37" "40" [21] "42" "47" "49" "50" "53" "54" "55" "57" "58" "59" [31] "60" "63" "64" "65" "68" "72" "76" "79" "80" "81" . . . [1731] "3401" "3403" "3404" "3405" "3407" "3408" "3410" "3412" "3421" "3423" [1741] "3430" "3433"

The problem with this approach, of course, is that you are running thousands of t-test simultaneously. Some of these tests will end up being significant, even if nothing is going on.

To adjust for the large number of multiple tests, you can use a correction, such as Bonferroni. Here's how Bonferroni would work.

diff3 <- m2.t.tests[m2.t.tests < (0.05/m2.n.genes)] names(diff3) [1] "5" "64" "102" "224" "232" "305" "311" "339" "354" "384" [11] "462" "516" "560" "591" "603" "736" "745" "758" "778" "798" [21] "805" "834" "931" "967" "987" "988" "1000" "1029" "1065" "1103" [31] "1129" "1157" "1202" "1235" "1263" "1269" "1290" "1305" "1352" "1356" . . . [101] "2905" "2910" "2990" "3031" "3063" "3142" "3195" "3209" "3220" "3225" [111] "3252" "3267" "3311" "3354" "3358"

The Bonferroni approach has been criticized for being too conservative. There are several alternatives that look at p-values in sequence and apply a more strict standard for the smallest p-values, but loosen this standard as more p-values are examined.

Holm-Bonferroni

Holm-Sidak

Hochberg

Westfall-Young

The Holm-Bonferroni approach is the easiest one to follow. Sort the p-values from smallest to largest. Adjust the smallest p-value by mulitplying by K where K is the number of genes being tested. Multiply the next smallest p-value by (K-1), the next smallest by (K-2) and so forth.

o <- order(m2.t.tests) m2.sorted.pvalues <- m2.t.tests[o] m2.sorted.names <- names(m2.t.tests)[o] m2.adjusted.pvalues <- cummax(pmin(m2.sorted.pvalues*(m2.n.genes:1),1.0)) m2.sorted.names[m2.adjusted.pvalues < 0.05] [1] "354" "2854" "1103" "1379" "3225" "2252" "2766" "3358" "1458" "1546" [11] "1696" "3195" "2067" "1690" "2744" "1430" "384" "1629" "560" "1758" [21] "1269" "3031" "1915" "1499" "2199" "988" "2445" "3063" "1751" "2910" [31] "798" "1821" "1434" "339" "3252" "3311" "1548" "1352" "745" "2038" . . . [101] "1931" "3267" "1925" "311" "591" "834" "1889" "1671" "64" "516" [111] "2905" "2810" "2651" "2863" "2023" "2686"

Differential expression (DE)

DE - Within gene analysis

[Explain]

DE - Borrowing strength across genes

[Explain]

Local pooled error. At various levels of expression, pool standard deviations of multiple genes.

Model variance as a function of the mean. Sometimes this approach is not flexible enough.

Shrinkage estimates. Add a constant to the denominator.

DE - Adjusting for multiple tests

Family Wise Error Rate: Bonferroni, Holm (1979) step down, Hochberg.

False Discovery Rate

Classic reference Alizadeh et al (2000)