Stats: Normalization for microarray data (no date)

Normalization for microarray data (no date) [incomplete]

This page is moving to a new website.

Normalization is the process of adjusting values in a microarray experiment to improve consistency and reduce bias.

Image processing

[explain]

Background correction

[explain]

Log transformation

[explain]

Normalization

If you run the same biological sample on two separate microarrays (or on both the red and the green channel of a two color array), you will get slightly different results. This is just part of the inherent variation that you have with any laboratory assay.

Normalization is a method that attempts to remove some of this variation. There are several approaches which can be used separately or in combination to normalize a set of microarrays.

1. Multiply each array by a constant to make the mean (median) intensity the same for each individual array.

2. Adjust the arrays using some control or housekeeping genes that you would expect to have the same intensity level across all of the samples.

3. Match the percentiles of each array.

4. Adjust using a nonlinear smoothing curve.

5. Adjust using control genes

Amaratunga and Cabrera (2004) include a library of R functions with their book and that library includes several sample data sets. The data set mice2.txt represents a simple microarray experiment with four control mice and four treated mice. There is substantial variation in the arrays.

The following program reads the mice2.txt file, compute the base 2 logarithm for each value, and then estimates the average intensity.

rDirectory <- "c:/Program Files/R/rw1090" FileName <- "/library/DNAMR/data/mice2.txt" m2.dat <- read.table(file=paste(rDirectory,FileName,sep="")) m2.log <- log(m2.dat,base=2) m2.array.means <- apply(m2.log,2,mean) print(round(m2.array.means,2))

Here is what the output looks like (I slightly modified the appearance of this output and some of the later output so it would fit better on this web page).

cnt1011a1 9.07 cnt1011a2 9.03 cnt1011b1 8.93 cnt1011b2 9.2 trt2501a1 8.8 trt2501a2 9.08 trt2501b1 9.49 trt2501b2 9.23

boxplot(split(as.matrix(m2.log),col(as.matrix(m2.log))))

Notice that even within the control and treatment arrays there is some variation in the average intensity. There are several factors that can cause this. Perhaps one array got slightly more DNA, or maybe there are slight variations during the production of the arrays. Maybe there were variations in the laboratory environment (temperature or humidity) during the preparation of these samples that influenced the readings.

Here's a direct comparison of the seventh and eighth arrays.

par(mar=c(5,4,0,0)+0.1) plot(m2.log[,7],m2.log[,8], xlim=range(m2.log),ylim=range(m2.log)) abline(a=0,b=1)

Notice how the bulk of the data lies below the diagonal line. This is an indication that signal intensity was lower, across the board, for the eighth array.

What you are seeing with these eight arrays is not too much different than what you might experience if you burn your own mix of songs onto a CD. Some of the songs might be recorded at a louder or softer volume than others, and if you take no steps to adjust the recording, then you would have to be constantly adjusting your volume as you listen to that CD.

The simplest adjustment is to take each signal and divide it by the average signal for each array. This guarantees that the adjusted signals will all have a mean of 1.0. Here is the R code to normalize each array to have the same mean intensity.

m2.n.genes <- dim(m2.log)[[1]] m2.n.arrays <- dim(m2.log)[[2]] m2.normalize1 <- matrix(NA,m2.n.genes,m2.n.arrays) for (i in 1:m2.n.arrays) { m2.normalize1[,i] <- m2.log[,i]/mean(m2.log[,i]) }

By the way, experts in R would probably point out the inefficiencies in this code. In general, if you can use matrix operations to avoid using loops, you should. Here's a bit more efficient code

one <- matrix(1,3434,1) m2.normalize1a <- m2.log / (one%*%m2.array.means)

You might wonder if this normalization is good or bad. It is possible, perhaps, that the treatment changes the cells by upregulating most genes to produce more mRNA. This is indeed possible, and normalization would effectively remove most of the effect of the treatment before you had a chance to analyze it.

Fortunately, the more likely scenario is that a treatment would upregulate only a few genes, downregulate only a few more genes, and would leave the vast majority of genes unaffected. There's safety in numbers for most microarrays, so normalization does not end up distorting your data.

Since the mean is influenced by outliers, some people prefer to adjust by the median intensity level rather than the mean level. Another commonly used choice is to adjust using the 75th percentile. The rationale for this choice is that about half the genes might not show any significant expression, so the 75th percentile represents the median of the remaining 50%.

> round(f.concor(m2.log),2)      cna1 cna2 cnb1 cnb2 tra1 tra2 trb1 trb2 cna1 1.00 0.97 0.97 0.96 0.78 0.80 0.75 0.78 cna2 0.97 1.00 0.97 0.96 0.78 0.81 0.76 0.79 cnb1 0.97 0.97 1.00 0.96 0.82 0.80 0.75 0.79 cnb2 0.96 0.96 0.96 1.00 0.77 0.81 0.79 0.81 tra1 0.78 0.78 0.82 0.77 1.00 0.92 0.85 0.89 tra2 0.80 0.81 0.80 0.81 0.92 1.00 0.92 0.96 trb1 0.75 0.76 0.75 0.79 0.85 0.92 1.00 0.95 trb2 0.78 0.79 0.79 0.81 0.89 0.96 0.95 1.00 > round(f.concor(m2.normalize1),2)      cna1 cna2 cnb1 cnb2 tra1 tra2 trb1 trb2 cna1 1.00 0.97 0.98 0.97 0.80 0.80 0.79 0.79 cna2 0.97 1.00 0.97 0.97 0.79 0.81 0.80 0.80 cnb1 0.98 0.97 1.00 0.98 0.82 0.81 0.80 0.81 cnb2 0.97 0.97 0.98 1.00 0.80 0.82 0.81 0.81 tra1 0.80 0.79 0.82 0.80 1.00 0.94 0.95 0.94 tra2 0.80 0.81 0.81 0.82 0.94 1.00 0.96 0.97 trb1 0.79 0.80 0.80 0.81 0.95 0.96 1.00 0.96 trb2 0.79 0.80 0.81 0.81 0.94 0.97 0.96 1.00

plot(m2.normalize1[,7],m2.normalize1[,8], xlim=range(m2.normalize1),ylim=range(m2.normalize1)) abline(a=0,b=1)

boxplot(split(as.matrix(m2.normalize1), col(as.matrix(m2.normalize1))))

This normalization helps only a little. There is still too much of the data below the diagonal line. The problem is that the variation from array to array is often intensity dependent and these arrays show systematic variations at low intensities that differ from variations seen at medium and high intensities.

Another approach is quantile normalization. If you look at the sorted values of any two arrays, they will deviate from an identity line.

plot(sort(m2.log[,7]),sort(m2.log[,8]), xlim=range(m2.log),ylim=range(m2.log)) abline(a=0,b=1)

Quantile normalization will match up these values across all the arrays so that the smallest value on each array is identical, the second smallest is identical, and so forth. Note that the smallest value for one array might be a different gene than the smallest value on another array.

The DNAMR package includes a function, f.qn() that performs quantile normalization. Let's apply this to the data and see what we get.

library("DNAMR") m2.normalize2 <- f.qn(m2.log) plot(sort(m2.normalize2[,7]),sort(m2.normalize2[,8]), xlim=range(m2.normalize2),ylim=range(m2.normalize2)) abline(a=0,b=1)

> round(f.concor(m2.normalize2),2)      cna1 cna2 cnb1 cnb2 tra1 tra2 trb1 trb2 cna1 1.00 0.97 0.98 0.97 0.80 0.82 0.83 0.81 cna2 0.97 1.00 0.97 0.97 0.80 0.82 0.83 0.81 cnb1 0.98 0.97 1.00 0.99 0.82 0.82 0.84 0.83 cnb2 0.97 0.97 0.99 1.00 0.81 0.82 0.83 0.82 tra1 0.80 0.80 0.82 0.81 1.00 0.95 0.97 0.94 tra2 0.82 0.82 0.82 0.82 0.95 1.00 0.96 0.97 trb1 0.83 0.83 0.84 0.83 0.97 0.96 1.00 0.97 trb2 0.81 0.81 0.83 0.82 0.94 0.97 0.97 1.00

plot(m2.normalize2[,7],m2.normalize2[,8], xlim=range(m2.normalize2),ylim=range(m2.normalize2)) abline(a=0,b=1)

boxplot(split(as.matrix(m2.normalize2), col(as.matrix(m2.normalize2))))

Notice that the concordance correlations are just a bit better. The seventh and eighth arrays line up better as well, with about half the genes above and below the diagonal.

It's interesting to look at the code behind the f.qn() function.

> f.qn function (x) { xm <- apply(x, 2, sort) xxm <- f.rmedian.na(xm) xr <- c(apply(x, 2, rank)) array(approx(1:nrow(x), xxm, xr)$y, dim(x), dimnames(x)) }

The program creates a temporary matrix xm consisting of the sorted columns of x. Then it computes a median across all the rows of xm. Then the approx() function applies linear interpolation to the data.

A third approach to normalization uses smoothing curves to adjust. Define a median signal across all the arrays and then look at the relationship between the smooth fit and the diagonal line. Adjust the array values up or down depending on whether the smooth fit is below or above the diagonal line.

m2.median <- apply(m2.log,1,median) m2.deviation <- matrix(NA,m2.n.genes,m2.n.arrays) for (i in 1:8) { smooth.fit <- fitted(loess(m2.log[,i]~m2.median)) m2.deviation[,i] <- smooth.fit - m2.median plot(m2.median,m2.log[,i],     xlim=range(m2.log),ylim=range(m2.log)) abline(a=0,b=1) points(m2.median,smooth.fit,col=3) }

m2.normalize3 <- m2.log - m2.deviation round(f.concor(m2.normalize3),2)      cna1 cna2 cnb1 cnb2 tra1 tra2 trb1 trb2 cna1 1.00 0.98 0.98 0.98 0.83 0.82 0.80 0.82 cna2 0.98 1.00 0.98 0.98 0.82 0.82 0.79 0.81 cnb1 0.98 0.98 1.00 0.99 0.85 0.83 0.81 0.82 cnb2 0.98 0.98 0.99 1.00 0.82 0.81 0.78 0.80 tra1 0.83 0.82 0.85 0.82 1.00 0.95 0.96 0.94 tra2 0.82 0.82 0.83 0.81 0.95 1.00 0.95 0.96 trb1 0.80 0.79 0.81 0.78 0.96 0.95 1.00 0.96 trb2 0.82 0.81 0.82 0.80 0.94 0.96 0.96 1.00 plot(m2.normalize3[,7],m2.normalize3[,8], xlim=range(m2.normalize2),ylim=range(m2.normalize2)) abline(a=0,b=1)

boxplot(split(as.matrix(m2.normalize3), col(as.matrix(m2.normalize3))))

This normalization seems to work about as well as the quantile normalization did.

You can get a better feel for the need for normalization by using an MvA plot. This plot effectively rotates the plot 45 degrees so that the diagonal reference line becomes a horizontal reference line. Here's some code for creating an MvA plot and an example of how that plot differs from the one shown just above.

mva.plot <- function(x,y) { m <- y-x a <- (x+y)/2 plot(a,m) abline(h=0) } mva.plot(m2.normalize3[,7],m2.normalize3[,8])

You can do your normalization calculations on the MvA plot, though they should not be significantly different from the normalizations described above. The MvA plot is also called a Bland-Altman plot.

Depending on what you know about the microarrays, you might want to consider additional normalization steps. For example, if you know the physical location of the spots on the microarray, you might want to see if there are physical regions of the array where signals are stronger or weaker. If the array was printed used a set of 16 print tips, you might want to see if some of the tips produced stronger or weaker signals than the others.

I did not discuss the use of control genes for normalization. There are several problems with the use of control genes. First, these genes might not cover the full range of intensity levels on the array. Second, there is some question about whether certain genes will truly serve effectively as controls, and they may not have the same expression levels across different samples.

Once the data is normalized, you may wish to store in an object of class "exprSet". Here are the R commands to do this.

library("marray") m2.indicators <- data.frame(ic=c(1,1,1,1,0,0,0,0), it=c(0,0,0,0,1,1,1,1),row.names=dimnames(m2.dat)[[2]]) covLabels <- list("Indicator for Control","Indicator for Treatment") names(covLabels) <- names(m2.indicators) m2.phenoData <- new("phenoData",pData=m2.indicators,varLabels=covLabels) m2.exprSet <- new("exprSet",exprs=as.matrix(m2.normalize3), phenoData=m2.phenoData) print(m2.exprSet) Expression Set (exprSet) with 3434 genes 8 samples phenoData object with 2 variables and 8 cases varLabels ic: Indicator for Control it: Indicator for Treatment

You can then use this object with most of the Bioconductor routines for differential expression, clustering, and so forth.