Importing data from microarray studies (no date) [incomplete]

There are so many different ways that data can come to you in a microarray experiment that it is hard to document how to import the data. Here are a few examples, plus some random notes and thoughts.

The data structure for Affymetrix chips

Affymetrix has several formats: DAT (image), CEL (probe level), and CHP (probe set level). [Explain what these formats are]

The data structure for cDNA arrays

A cDNA array (spotted array) is interesting from a statistical perspective because two samples, one labeled red and one labeled green, are measured on the same spots. This pairing adds precision, but it also adds a layer of complexity.

The red signal is also called the Cy5 signal. The green signal is also called the Cy3 signal.

Typically, data from these arrays appears in large text files with tab or comma delimiters. There are several header lines at the top of the file before you see data on individual spots. Because these formats are text files, you can manipulate them easily. But there is only limited standardization of these files at this stage. A nice summary of the conflicting data formats appears at www.scmbb.ulb.ac.be/~jvanheld/web_course_microarrays/practicals/data_formats.html.

There is usually one file per slide with a header of several lines. Here's an example of the header for array lc4b007 from the website providing supplemental data for the Alizadeh 2000 Nature study. This file was produced by the ScanAlyze software system, which is available for free to academic and non-profit researchers from the Eisen Lab at Lawrence Berkeley National Lab.

The first line provides names for each of the columns to follow. The SPOT, GRID, TOP, LEFT, BOT, RIGHT, ROW, and COL columns give information about the physical location of the spot. The values for CH*I (* equals 1 for channel 1 and 2 for channel 2) are the green and red intensities for an individual spot on the array. The values for CH*B are median background intensities and CH*BA are mean background intensities. SPIX and BGPIX are the number of pixels used for the spot and the background, respectively. The quantities MRAT, REGR, CORR, and LFRAT represent quality checks on the spot image. In a large production environment, these values might be useful in a quality control chart. Additional parameters for assessing the quality of a spot are the CH*GTB1 values, which represent the fraction of pixels in the spot that exceed background. For a weak spot, these values will be close to 0.5. Alternate measures CH*GTB2, CH*KSD, and CH*KSP represent other criteria for identifying weak spots. FLAG is a user-defined variable to identify spots that the user has special information on.
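Here is a minimal sketch in R of how you might use the CH*GTB1 columns to flag weak spots. The spot values and the 0.6 cutoff are invented for illustration; they are my assumptions and do not come from the ScanAlyze documentation or the Alizadeh data.

```r
# Toy ScanAlyze-style spot table; all values are invented for illustration.
spots <- data.frame(
  SPOT    = 1:4,
  CH1GTB1 = c(0.98, 0.52, 0.91, 0.49),
  CH2GTB1 = c(0.97, 0.55, 0.62, 0.51)
)

# Hypothetical cutoff: call a spot weak when, in both channels,
# only a little more than half of its pixels exceed background.
cutoff <- 0.6
spots$weak <- spots$CH1GTB1 < cutoff & spots$CH2GTB1 < cutoff

print(spots)
```

Requiring both channels to be weak is only one possible rule. A spot that is weak in a single channel may still carry useful information about differential expression.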

Here's what the header looks like for another file, produced locally with Affymetrix Jaguar Image Analysis software:

The spot intensities are stored under "Girl 647" and "Normal 555." The standard deviation of the spot intensity is labelled SD, Pixels is the number of pixels in the spot, BG is the intensity of the background, and BGSD is the standard deviation of the background. The 647/555 column represents the ratio of spot intensities (unadjusted for background levels).

GenePix uses a format (GPR) that has similar information about the signal and background, with some quality checks along the lines of those produced by ScanAlyze software. The specification for the GPR format is on the web at www.axon.com/gn_GenePix_File_Formats.html#gpr.

The GPR format is also a text file. It has several header lines that give information about the particular experiment. After the header lines comes data about particular spots:

The first seven columns of data give the location and information about individual spots. The red signal information is labeled as 635 and the green signal information is labeled as 532 (these are the laser wavelengths in nanometers), though later in the file the signals are 1 and 2, respectively. The letter F refers to the foreground and B refers to the background.

Spot is an open source software program for image analysis of microarrays that also uses the R programming language. The files produced by Spot have a single header line:

The letter G represents the green signal, R represents red, and bg represents background information.

Layout files

A second file will provide information linking particular spots for an array to gene names. For the Alizadeh study, this appears as two separate text files. The first text file links spots on a particular slide to an in-house DNA ID number. The second text file links the in-house DNA ID number to CLONEID and curated names. This is a bit messy because several different batches of chips were used and the gene locations moved around from one batch to another.

GenePix software has a format for a layout file (GAL format) that appears to be widely used. The specification for the GAL format is on the web at www.axon.com/gn_GenePix_File_Formats.html#gal.

Here are the header lines for a GAL file that comes with Bioconductor:

The "19 5" in the second line tells you that there are 19 header lines and that the information about individual spots comes in 5 columns. This particular array has a four by four grid of printing tips and the Block information gives details about where these tips produced spots on the microarray. On this microarray, each spot is uniquely identified by the printing tip (block), row and column. Id is an internal code name and Name is the actual name of the gene. 
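The two counts in the second line make it possible to skip the header without counting lines by hand. Here is a rough sketch in R. The function name read.atf and the tiny file it reads are invented for illustration, and I am assuming that the header count does not include the first two lines of the file.

```r
# Read a GAL-style text file, using the counts on line 2 to skip the header.
# read.atf is a hypothetical helper, not a Bioconductor function.
read.atf <- function(path) {
  top <- readLines(path, n = 2)
  counts <- as.integer(strsplit(trimws(top[2]), "[ \t]+")[[1]])
  n.header <- counts[1]   # number of header lines after line 2
  n.cols <- counts[2]     # number of data columns
  # Skip the two leading lines plus the header lines; the next line
  # holds the column names.
  dat <- read.delim(path, skip = 2 + n.header, header = TRUE,
                    stringsAsFactors = FALSE)
  stopifnot(ncol(dat) == n.cols)
  dat
}

# Tiny invented layout file, written to a temporary location.
gal.file <- tempfile(fileext = ".gal")
writeLines(c(
  "ATF\t1.0",
  "2\t5",
  "Type=GenePix ArrayList V1.0",
  "BlockCount=1",
  "Block\tRow\tColumn\tID\tName",
  "1\t1\t1\tH001\tgeneA",
  "1\t1\t2\tH002\tgeneB"
), gal.file)

layout <- read.atf(gal.file)
print(layout)
```

The check on the number of columns is a cheap way to notice when the skip count is off by a line.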

If your layout information does not come in a GAL file, it is not too difficult to convert it to this format. You can skip all of the block information, if you like. It helps some software link particular data values to locations on the image file, but that is completely optional. A minimal GAL file would have the following header:

If your data has only one block, then number that block "1" for all of the spots on your array.

Here's some R code that reads in the two layout files for the Alizadeh study, selects the rows corresponding to the

Image files

In addition to these text files, microarray data sets will often include an image of the array in TIFF format.

Reading information into Bioconductor

In all of these files, the four key pieces of information you definitely need are the green signal, the green background, the red signal, and the red background.
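To see why these four pieces matter, here is a small sketch in R that turns them into a log ratio and an average log intensity for each spot. The intensities are invented, the variable names are my own, and simple background subtraction is only one of several possible corrections.

```r
# Invented foreground and background intensities for three spots.
Rf <- c(4300, 900, 2300)   # red signal
Rb <- c(300, 100, 300)     # red background
Gf <- c(1200, 900, 4200)   # green signal
Gb <- c(200, 100, 200)     # green background

# Background-corrected intensities.
R <- Rf - Rb
G <- Gf - Gb

M <- log2(R / G)               # log ratio of red to green
A <- (log2(R) + log2(G)) / 2   # average log intensity

print(cbind(M, A))
```

A spot whose signal is at or below its background gives a corrected value of zero or less, and the logarithm is then undefined, so such spots need special handling.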

The marrayInput library of Bioconductor has functions for reading various microarray formats. While you are reading in the data, you can also specify layout values, probe sequence information, and/or target sample information.

MAGE-ML or Microarray Gene Expression Markup Language

[Explain]

Importing data from the prenatal liver study

I received 22 Excel files for this project.

These were tab delimited files, with key information stored in the name of the file itself. The filename can be split into eight pieces:

The first line in each file included the following names for the data:

The second column, the signal, is the most important piece of information. The third column gives a detection code based on the fourth column, the detection p-value. The three codes are:

I had to convert from a tab delimited file to a comma separated file. There are several ways to do this. For example, you can read the tab delimited file into Excel and then save it as a .CSV file. I found it faster to search and replace the TAB character with a comma. It is hard to search directly for a tab character, but if your software allows it, you can look for the ASCII code 09.
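Another option is to do the conversion in R itself. Here is a minimal sketch, using an invented two-row file in place of the real data. The Signal column name is my assumption.

```r
# Invented tab delimited file standing in for one of the originals.
tab.file <- tempfile(fileext = ".txt")
writeLines(c(
  "Probeset\tSignal",
  "1007_s_at\t512.3",
  "1053_at\t48.9"
), tab.file)

# read.delim() expects tab separators by default.
dat <- read.delim(tab.file, stringsAsFactors = FALSE)

# Write the same table back out with comma separators.
csv.file <- sub("\\.txt$", ".csv", tab.file)
write.csv(dat, csv.file, row.names = FALSE)

# Read the new file to confirm that nothing changed.
back <- read.csv(csv.file, stringsAsFactors = FALSE)
print(back)
```

Unlike a blind search and replace, write.csv puts quotes around text fields, so a value that happens to contain a comma does not split into two columns.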

Once I had the comma separated values, I created the pieces of the filename separately and pasted them together. This took longer than just typing the names of the files, but in the long run I saved time because several of these pieces became important variables in the analysis.

Here is the R code for importing this data.

us <- "_"
dir <- "x:/sleeder/csv/"
id1 <- as.character(c(6286:6294,7446:7458))
loc <- c(   "UM",   "UM",   "UM",    "H",    "H",    "H",
             "H",    "H",    "H",   "UM",   "UM",   "UM",
            "UM",    "H",    "H",    "H",    "H",    "H",
             "H",   "WU",   "WU",   "WU")
id2 <- c( "1589", "1589", "1589","18058","18058","18058",
         "17869","17869","17869", "1690", "1621", "1631",
          "1566","18354","18381","18390","18401","18508",
         "18535", "0831", "3881", "5025")
tis <- c(   "Ki",   "Li",   "Lu",   "Ki",   "Lu",   "Li",
            "Lu",   "Ki",   "Li",   "Li",   "Li",   "Li",
            "Li",   "Li",   "Li",   "Li",   "Li",   "Li",
            "Li",   "Li",   "Li",   "Li")
age <- c(  "133",  "133",  "133",  "098",  "098",  "098",
           "075",  "075",  "075",  "140",  "130",  "133",
           "134",  "096",  "096",  "094",  "076",  "076",
           "075",  "1.7",  "3.0",  "2.7")
unt <- c(     "",     "",     "",     "",     "",     "",
              "",     "",     "",     "",     "",     "",
              "",     "",     "",     "",     "",     "",
              "",    "y",    "y",    "y")
sex <- c(    "F",    "F",    "F",     "F",    "F",   "F",
             "X",    "X",    "X",     "M",    "M",   "M",
             "M",    "M",    "M",     "M",    "X",   "X",
             "X",    "F",    "M",     "F")
rac <- c(   "AA",   "AA",   "AA",   "XX",   "XX",   "XX",
            "CA",   "CA",   "CA",   "CA",   "CA",   "AA",
            "AA",   "XX",   "XX",   "XX",   "XX",   "XX",
            "XX",   "CA",   "AA",   "XX")
ext <- ".csv"

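# Paste the filename pieces together to get the 22 full paths.
fz <- paste(dir,id1,us,loc,id2,tis,us,age,unt,us,sex,rac,ext,sep="")

# One row per probeset (54675) and one column per array (22).
# The -1 entries are placeholders that get overwritten below.
signal.all <- matrix(-1,54675,22)
detect.all <- matrix(-1,54675,22)
for (i in 1:22) {
        tmp <- read.csv(fz[i])
        signal.all[,i] <- tmp[,2]   # column 2: signal
        detect.all[,i] <- tmp[,3]   # column 3: detection code
}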

The value of breaking the filenames up into segments becomes apparent when you want to look at a particular subset of the arrays. For example, the R code

signal.li <- signal.all[,tis=="Li"]

will create an array consisting of the 16 chips associated with liver tissue. We can get the gender information for these 16 patients with the command

gender.li <- sex[tis=="Li"]

and so forth.

Averting a disaster in the prenatal liver study

When I was trying to normalize the data, I noticed that three of the arrays had rather unusual properties. When trying to normalize array 6287 versus the median array, the R vs I plot looked like

which was much more scattered than most of the other plots, such as 7446.

When I plotted pairs of arrays versus each other, it became even more apparent. Here is what 6287 versus 7446 looked like.

Compare this to 7446 versus 7447.

It turns out that the order of the genes was not the same in all of the files. For example, in file 6287, the first ten genes were

  1. 1007_s_at
  2. 1053_at
  3. 117_at
  4. 121_at
  5. 1255_g_at
  6. 1294_at
  7. 1316_at
  8. 1320_at
  9. 1405_i_at
  10. 1431_at

while in file 7446, the first ten genes were

  1. 117_at
  2. 121_at
  3. 177_at
  4. 179_at
  5. 320_at
  6. 336_at
  7. 564_at
  8. 632_at
  9. 823_at
  10. 1053_at

By assuming that all the files listed their genes in exactly the same order, I had effectively shuffled the values of three of the arrays and ruined any analyses. To fix this, I had to sort the CSV files to ensure that the gene names were in the same order for each file. Then I added a couple of extra lines of code to double-check that the files were now in a consistent order. First, I got the probeset list from the first file. Then, when reading in the remaining files, I compared each probeset list to the one from the first file. The code prints the number of mismatches for each file, so any value larger than zero signals a problem.

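library(limma)  # provides trimWhiteSpace()

# Read the first file and keep its probeset list as the reference.
tmp <- read.csv(fz[1])
signal.all[,1] <- tmp[,2]
detect.all[,1] <- tmp[,3]
gene.probeset <- trimWhiteSpace(as.character(tmp$Probeset))

for (i in 2:22) {
        tmp <- read.csv(fz[i])
        signal.all[,i] <- tmp[,2]
        detect.all[,i] <- tmp[,3]
        check.probeset <- trimWhiteSpace(as.character(tmp$Probeset))
        # Number of positions that disagree with the first file; should be 0.
        print(sum(gene.probeset!=check.probeset))
}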
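An alternative to sorting the CSV files by hand is to let R line up the rows by probeset name. This is a sketch of that idea and not what I actually did; the two small data frames and their signal values are invented.

```r
# Invented values: the same probesets listed in two different orders.
ref <- data.frame(Probeset = c("1007_s_at", "1053_at", "117_at", "121_at"),
                  Signal = c(10, 20, 30, 40), stringsAsFactors = FALSE)
other <- data.frame(Probeset = c("117_at", "121_at", "1053_at", "1007_s_at"),
                    Signal = c(31, 41, 21, 11), stringsAsFactors = FALSE)

# For each reference probeset, find its row in the other file.
idx <- match(ref$Probeset, other$Probeset)
stopifnot(!any(is.na(idx)))   # every probeset must be present
aligned <- other[idx, ]

# Count the positions that still disagree; zero means the orders agree.
print(sum(ref$Probeset != aligned$Probeset))
```

With this approach the original files stay untouched, and the stopifnot line catches a file that is missing a probeset instead of silently filling in NA values.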

I should have been more careful at the beginning, but at least I caught the problem before I ran any serious analyses. Whew!