Importing data from microarray studies (no date) [incomplete]
There are so many different ways that data can come to you in a microarray experiment that it is hard to document how to import the data. Here are a few examples, plus some random notes and thoughts.
The data structure for Affymetrix chips
Affymetrix has several formats, DAT (image), CEL (probe), and CHP. [Explain what these formats are]
The data structure for cDNA arrays
A cDNA array (spotted array) is interesting from a statistical perspective because there is a pairing that adds precision but which also adds a layer of complexity.
The red signal is also called the Cy5 signal, and the green signal is also called the Cy3 signal.
Typically, data from these arrays appears in large text files with tab or comma delimiters. There are several header lines at the top of the file before you see data on individual spots. Because these formats are text files, you can manipulate them easily. But there is only limited standardization of these files at this stage. A nice summary of the conflicting data formats appears at www.scmbb.ulb.ac.be/~jvanheld/web_course_microarrays/practicals/data_formats.html.
There is usually one file per slide, with a header of several lines. Here's an example of the header for array lc4b007 from the website providing supplemental data for the Alizadeh 2000 Nature study. This file was produced by the ScanAlyze software system, which is available for free for academic and non-profit researchers from the Eisen Lab at Lawrence Berkeley National Lab.
- HEADER SPOT GRID TOP LEFT BOT RIGHT ROW COL CH1I CH1B CH1AB CH2I CH2B CH2AB SPIX BGPIX EDGE RAT2 MRAT REGR CORR LFRAT CH1GTB1 CH2GTB1 CH1GTB2 CH2GTB2 CH1EDGEA CH2EDGEA FLAG CH1KSD CH1KSP CH2KSD CH2KSP
- REMARK SOFTWARE ScanAlyze
- REMARK SOFTVERS 2.30
- REMARK CH1 IMAGE lc4b007g1
- REMARK CH2 IMAGE lc4b007r1
- REMARK GRID FILE ..\..\AshGrids\lc4b007.SAG
- REMARK DATE 8/28/99
- REMARK TIME 10:59:34 AM
The first line provides names for each of the columns to follow. SPOT, GRID, TOP, LEFT, BOT, RIGHT, ROW, and COL give information about the physical location of the spot. The values for CH*I (* equals 1 for channel 1 and 2 for channel 2) are the green and red intensities for an individual spot on the array. The values for CH*B are median background intensities and CH*AB are mean background intensities. SPIX and BGPIX are the number of pixels used for the spot and the background, respectively. The quantities MRAT, REGR, CORR, and LFRAT represent quality checks on the spot image. In a large production environment, these values might be useful in a quality control chart. Additional parameters for assessing the quality of a spot are CH*GTB1, which represent the fraction of pixels in the spot that exceed background. For a weak spot, these values will be close to 0.5. Alternate measures CH*GTB2, CH*KSD, and CH*KSP represent other criteria for identifying weak spots. FLAG is a user-defined variable to identify spots that the user has special information on.
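As a rough illustration of how a file like this might be read into R, here is a minimal sketch. It assumes the file is tab delimited, that every REMARK line starts with the word REMARK, and that each data row starts with the word SPOT under the HEADER column. The toy rows and all of the intensity values are invented for the example and do not come from the Alizadeh data.

```r
# Toy ScanAlyze-style content; the layout and the numbers are assumptions
txt <- c(
  "HEADER\tSPOT\tGRID\tCH1I\tCH1B\tCH2I\tCH2B",
  "REMARK\tSOFTWARE\tScanAlyze",
  "REMARK\tSOFTVERS\t2.30",
  "SPOT\t1\t1\t1200\t200\t2200\t200",
  "SPOT\t2\t1\t500\t100\t300\t100"
)
# Separate the REMARK lines (experiment information) from the rest
is.remark <- grepl("^REMARK", txt)
remarks <- sub("^REMARK\t", "", txt[is.remark])
# The remaining lines are the column names followed by one row per spot
spots <- read.table(text = txt[!is.remark], header = TRUE, sep = "\t")
# Background-subtracted intensities for channel 1 and channel 2
ch1 <- spots$CH1I - spots$CH1B
ch2 <- spots$CH2I - spots$CH2B
print(data.frame(spot = spots$SPOT, ch1 = ch1, ch2 = ch2))
```

With a real file you would replace the txt vector with readLines() applied to the file name.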
Here's what the header looks like for another file, produced locally with Affymetrix Jaguar Image Analysis software:
- Row Col SpotQuality Girl647 ALEXA647SD ALEXA647Pixels ALEXA647BG ALEXA647BGSD Normal555 ALEXA555SD ALEXA555Pixels ALEXA555BG ALEXA555BGSD `647/555 Name Name Name Name
The spot intensities are stored under "Girl 647" and "Normal 555." The standard deviation of the spot intensity is labelled SD, Pixels is the number of pixels in the spot, BG is the intensity of the background, and BGSD is the standard deviation of the background. The 647/555 column represents the ratio of spot intensities (unadjusted for background levels).
Genepix uses a format (GPR) that has similar information about the signal and background, with some quality checks along the lines of those produced by ScanAlyze software. The specification for the GPR format is on the web at www.axon.com/gn_GenePix_File_Formats.html#gpr.
The GPR format is also a text file. It has several header lines that give information about the particular experiment. After the header lines comes data about particular spots:
- Block Column Row Name ID X Y Dia. F635Median F635Mean F635SD B635Median B635Mean B635SD %>B635+1SD %>B635+2SD F635%Sat. F532Median F532Mean F532SD B532Median B532Mean B532SD %>B532+1SD %>B532+2SD F532%Sat. RatioOfMedians RatioOfMeans MedianOfRatios MeanOfRatios RatioSD RgnRatio RgnR2 FPixels BPixels SumOfMedians SumOfMeans LogRatio Flags Normalize F1Median-B1 F2Median-B2 F1Mean-B1 F2Mean-B2 SNR1 F1TotalIntensity Index UserDefined
The first eight columns of data give the location, size, and identifying information for individual spots. The red signal information is labeled as 635 and the green signal information is labeled as 532 (the scanning wavelengths in nanometers), though later in the file the signals are labeled 1 and 2, respectively. The letter F refers to the foreground and B refers to the background.
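The foreground and background columns are usually combined into a log ratio and an average log intensity for each spot. The sketch below shows one common way to do this using the median columns; the choice of medians rather than means is an assumption, and the four spot values are made up so that the arithmetic is easy to follow.

```r
# Two invented spots, using column names from the GPR header above
gpr <- data.frame(
  F635Median = c(1124, 306), B635Median = c(100, 50),
  F532Median = c(356, 562),  B532Median = c(100, 50)
)
# Background-subtracted signal at each wavelength
s635 <- gpr$F635Median - gpr$B635Median
s532 <- gpr$F532Median - gpr$B532Median
# M is the log2 ratio (635 over 532); A is the average log2 intensity
M <- log2(s635 / s532)
A <- (log2(s635) + log2(s532)) / 2
print(cbind(M, A))
```

Spots where the background exceeds the foreground would produce negative values and an undefined logarithm, so real data needs some rule for handling those spots.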
Spot is an open source software program for image analysis of microarrays that also uses the R programming language. The files produced by Spot have a single header line:
- indexs grid.r grid.c spot.r spot.c area Gmean Gmedian GIQR Rmean Rmedian RIQR bgGmean bgGmed bgGSD bgRmean bgRmed bgRSD valleyG valleyR morphG morphG.erode morphG.close.open morphR morphR.erode morphR.close.open logratio perimeter circularity badspot
The letter G represents the green signal, R represents the red signal, and bg represents background information.
Layout files
A second file will provide information linking particular spots for an array to gene names. For the Alizadeh study, this appears as two separate text files. The first text file links spots on a particular slide to an inhouse DNA ID number. The second text file links the inhouse DNA ID number to CLONEID and curated names. This is a bit messy because several different batches of chips were used and the gene locations moved around from one batch to another.
Genepix software has a format for a layout file (GAL format) that appears to be widely used. The specification for the GAL format is on the web at www.axon.com/gn_GenePix_File_Formats.html#gal.
Here are the header lines for a GAL file that comes with Bioconductor:
- ATF 1.0
- 19 5
- "Type=GenePix ArrayList V1.0"
- "BlockCount=16"
- "BlockType=0"
- "Block1= 500, 500, 100, 24, 180, 22, 180"
- "Block2= 4996, 500, 100, 24, 180, 22, 180"
- "Block3= 9492, 500, 100, 24, 180, 22, 180"
- "Block4= 13988, 500, 100, 24, 180, 22, 180"
- "Block5= 500, 4996, 100, 24, 180, 22, 180"
- "Block6= 4996, 4996, 100, 24, 180, 22, 180"
- "Block7= 9492, 4996, 100, 24, 180, 22, 180"
- "Block8= 13988, 4996, 100, 24, 180, 22, 180"
- "Block9= 500, 9492, 100, 24, 180, 22, 180"
- "Block10= 4996, 9492, 100, 24, 180, 22, 180"
- "Block11= 9492, 9492, 100, 24, 180, 22, 180"
- "Block12= 13988, 9492, 100, 24, 180, 22, 180"
- "Block13= 500, 13988, 100, 24, 180, 22, 180"
- "Block14= 4996, 13988, 100, 24, 180, 22, 180"
- "Block15= 9492, 13988, 100, 24, 180, 22, 180"
- "Block16= 13988, 13988, 100, 24, 180, 22, 180"
- "Block" "Row" "Column" "ID" "Name"
The "19 5" in the second line tells you that there are 19 header lines (everything from the Type line through the Block16 line) and that the information about individual spots comes in 5 columns. This particular array has a four by four grid of printing tips and the Block information gives details about where these tips produced spots on the microarray. On this microarray, each spot is uniquely identified by the printing tip (block), row, and column. ID is an internal code name and Name is the actual name of the gene.
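Because the second line states how many header lines follow, you can use it to skip straight to the spot data. Here is a minimal sketch, assuming the fields are tab delimited and quoted with double quotes. The tiny file written to a temporary location and the gene names in it are invented for illustration.

```r
# A tiny made-up file in the same layout: 2 header lines, 5 columns
gal.lines <- c(
  "ATF\t1.0",
  "2\t5",
  "\"Type=GenePix ArrayList V1.0\"",
  "\"BlockCount=1\"",
  "\"Block\"\t\"Row\"\t\"Column\"\t\"ID\"\t\"Name\"",
  "1\t1\t1\t\"H1\"\t\"gene A\"",
  "1\t1\t2\t\"H2\"\t\"gene B\""
)
f <- tempfile(fileext = ".gal")
writeLines(gal.lines, f)
# The second line holds the header line count and the column count
counts <- scan(f, what = integer(), skip = 1, nlines = 1, quiet = TRUE)
n.header <- counts[1]
n.col <- counts[2]
# Skip the two fixed lines plus the header lines, then read the spot table
layout <- read.table(f, header = TRUE, sep = "\t", quote = "\"",
                     skip = 2 + n.header, stringsAsFactors = FALSE)
print(layout)
```

Checking that ncol(layout) equals n.col is a cheap way to confirm that the file was parsed the way you expected.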
If your layout information does not come in a GAL file, it is not too difficult to convert it to this format. You can skip all of the block information, if you like. It helps some software link particular data values to locations on the image file, but that is completely optional. A minimal GAL file would have the following header:
- ATF 1.0
- 1 4
- "Type=GenePix ArrayList V1.0"
- "Block" "Column" "Row" "ID"
If your data has only one block then number that block "1" for all of the spots on your array.
Here's some R code that reads in the two layout files for the Alizadeh study, selects the rows corresponding to the lc4b batch of chips, and writes the result out as a minimal GAL file.
- #
- # f0 gives the path, f1 and f2 the file names
- #
- f0 <- "d:/Data/work040524/Bioconductor/"
- f1 <- "chip_spot_well_tab.txt"
- f2 <- "well_cloneid_name_tab.txt"
- #
- # the read.table function gets data from a text file.
- # sep="\t" for tab delimited file
- # fill=T pads rows that have fewer fields than the others
- #
- chip.spot <- read.table(paste(f0,f1,sep=""),header=FALSE,sep="\t",fill=T)
- well.clone <- read.table(paste(f0,f2,sep=""),header=FALSE,sep="\t",fill=T)
- #
- # Since there is no header, we provide variable names below
- #
- names(chip.spot) <- c("ChipBatch","SpotNum","InhouseID")
- names(well.clone) <- c("InhouseID","CloneID","CuratedName")
- #
- # select the rows of chip.spot corresponding to lc4b
- #
- ch0 <- chip.spot[chip.spot$ChipBatch=="lc4b",]
- #
- # combine these rows with well.clone
- # all.x=T preserves any unmatched rows
- #
- ch1 <- merge(ch0,well.clone,all.x=T)
- #
- # This layout file does not include rows and columns
- # so we compute them ourselves
- #
- row <- ceiling(ch1$SpotNum/24)
- col <- ch1$SpotNum-24*(row-1)
- #
- # We need four lines at the beginning of the file
- # Since the GAL format uses double quotes,
- # you should surround these strings with single quotes
- #
- h1 <- 'ATF 1.0'
- h2 <- '1 4'
- h3 <- '"Type=GenePix ArrayList V1.0"'
- h4 <- '"Block" "Column" "Row" "ID"'
- #
- # Combine the block (always=1), col, row, and CloneID
- # in the same order as the column names in h4
- # sep="\t" places tabs between each value
- #
- tail <- paste(rep(1,nrow(ch1)),col,row,ch1$CloneID,sep="\t")
- #
- # f3 is the name of the new file
- #
- f3 <- paste("lc4b",".gal",sep="")
- write(c(h1,h2,h3,h4,tail),file=paste(f0,f3,sep=""))
Image files
In addition to these text files, microarray data sets will often include an image of the array in TIFF format.
Reading information into Bioconductor
In all of these files, the four key pieces of information you definitely need are the green signal, the green background, the red signal, and the red background.
The marrayInput library of Bioconductor provides the functions
- read.Genepix()
- read.SMD()
- read.Spot()
- read.marrayRaw()
for reading various microarray formats. While you are reading in the data, you can also specify layout values, probe sequence information, and/or target sample information.
MAGE-ML or Microarray Gene Expression Markup Language
[Explain]
Importing data from the prenatal liver study
I received 22 Excel files for this project.
6286_UM1589Ki_133_FAA.txt
6287_UM1589Li_133_FAA.txt
6288_UM1589Lu_133_FAA.txt
6289_H18058Ki_098_FXX.txt
6290_H18058Lu_098_FXX.txt
6291_H18058Li_098_FXX.txt
6292_H17869Lu_075_XCA.txt
6293_H17869Ki_075_XCA.txt
6294_H17869Li_075_XCA.txt
7446_UM1690Li_140_MCA.txt
7447_UM1621Li_130_MCA.txt
7448_UM1631Li_133_MAA.txt
7449_UM1566Li_134_MAA.txt
7450_H18354Li_096_MXX.txt
7451_H18381Li_096_MXX.txt
7452_H18390Li_094_MXX.txt
7453_H18401Li_076_XXX.txt
7454_H18508Li_076_XXX.txt
7455_H18535Li_075_XXX.txt
7456_WU0831Li_1.7y_FCA.txt
7457_WU3881Li_3.0y_MAA.txt
7458_WU5025Li_2.7y_FXX.txt
These were tab delimited files, with key information stored in the name of the file itself. The filename can be split into eight pieces:
- id1 (a unique id code for each microarray chip)
- loc (location. UM=University of Maryland, H=Harvard, WU=Washington University)
- id2 (a second id code)
- tis (tissue type. Ki=kidney, Li=liver, Lu=lung)
- age (age in days if prenatal, in years if postnatal)
- unt (unit, empty=days, y=years)
- sex (F=female, M=male, X=unknown)
- rac (race/ethnicity. AA=African American, CA=Caucasian, XX=unknown)
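If you would rather pull these pieces out of the file names automatically, a regular expression can do the splitting. This is only a sketch: the pattern is my own guess at the naming rules, based on the 22 file names listed above, and it would need adjusting if other tissue, sex, or race codes appeared.

```r
# Two of the file names listed above
fn <- c("6286_UM1589Ki_133_FAA.txt", "7456_WU0831Li_1.7y_FCA.txt")
# One capture group per piece: id1, loc, id2, tis, age, unt, sex, rac
pattern <- paste0("^([0-9]+)_([A-Z]+)([0-9]+)(Ki|Li|Lu)_",
                  "([0-9.]+)(y?)_([FMX])(AA|CA|XX)\\.txt$")
pieces <- regmatches(fn, regexec(pattern, fn))
# Drop the first element of each match (the whole file name)
parts <- do.call(rbind, lapply(pieces, function(p) p[-1]))
colnames(parts) <- c("id1", "loc", "id2", "tis", "age", "unt", "sex", "rac")
parts <- as.data.frame(parts, stringsAsFactors = FALSE)
print(parts)
```

A file name that does not fit the pattern yields an empty match, so it is worth checking that nrow(parts) equals the number of files.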
The first line in each file included the following names for the data:
- Probeset
- 6286_Signal
- 6286_Detection
- 6286_Detection P Value
- 6286_Stat Pairs
- 6286_Stat Pairs Used
The second column, the signal, is the most important piece of information. The third column gives a detection code based on the fourth column, the detection p-value. The three codes are:
- A (if the p-value is larger than 0.065),
- M (if the p-value is between 0.065 and 0.05), and
- P (if the p-value is smaller than 0.05).
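The rule above can be written as a small R function, which is handy if you want to recompute the codes or check them against the file. The function name is hypothetical, and the treatment of p-values that fall exactly on 0.05 or 0.065 is an assumption, since the description above does not say which side they belong to.

```r
# Convert detection p-values to A/M/P codes using the cutoffs given above.
# Assumption: exactly 0.05 and exactly 0.065 are both coded as M.
detection.call <- function(p) {
  ifelse(p < 0.05, "P", ifelse(p <= 0.065, "M", "A"))
}
print(detection.call(c(0.01, 0.06, 0.50)))
```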
I had to convert from a tab delimited file to a comma separated file. There are several ways to do this. For example, you can read the tab delimited file into Excel and then save it as a .CSV file. I found it faster to search and replace the TAB character with a comma. It is hard to search directly for a tab character, but if your software allows it, you can look for the ASCII code 09.
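The conversion can also be done inside R. This is a sketch of an alternative rather than the method I used; the temporary file and the three lines of data in it are made up for the example. One advantage over search and replace is that a field that happens to contain a comma gets quoted properly.

```r
# Write a small made-up tab delimited file to a temporary location
tab.file <- tempfile(fileext = ".txt")
writeLines(c("Probeset\t6286_Signal\t6286_Detection",
             "1007_s_at\t512.3\tP",
             "1053_at\t48.1\tA"), tab.file)
# Read the tab delimited file; check.names=FALSE keeps names like 6286_Signal
tmp <- read.delim(tab.file, check.names = FALSE, stringsAsFactors = FALSE)
# Write the same data back out with commas as the delimiter
csv.file <- sub("\\.txt$", ".csv", tab.file)
write.csv(tmp, csv.file, row.names = FALSE)
print(readLines(csv.file))
```

Since read.delim reads tab delimited files directly, the conversion step could be skipped entirely by using it in place of read.csv.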
Once I had the comma separated values, I created the pieces of the file names separately and pasted them together. This took longer than just typing the names of the files, but in the long run I saved time because several of these pieces become important variables in the analysis.
Here is the R code for importing this data.
us <- "_"
dir <- "x:/sleeder/csv/"
id1 <- as.character(c(6286:6294,7446:7458))
loc <- c(
  "UM", "UM", "UM", "H", "H", "H",
  "H", "H", "H", "UM", "UM", "UM",
  "UM", "H", "H", "H", "H", "H",
  "H", "WU", "WU", "WU")
id2 <- c(
  "1589", "1589", "1589", "18058", "18058", "18058",
  "17869", "17869", "17869", "1690", "1621", "1631",
  "1566", "18354", "18381", "18390", "18401", "18508",
  "18535", "0831", "3881", "5025")
tis <- c(
  "Ki", "Li", "Lu", "Ki", "Lu", "Li",
  "Lu", "Ki", "Li", "Li", "Li", "Li",
  "Li", "Li", "Li", "Li", "Li", "Li",
  "Li", "Li", "Li", "Li")
age <- c(
  "133", "133", "133", "098", "098", "098",
  "075", "075", "075", "140", "130", "133",
  "134", "096", "096", "094", "076", "076",
  "075", "1.7", "3.0", "2.7")
unt <- c(
  "", "", "", "", "", "",
  "", "", "", "", "", "",
  "", "", "", "", "", "",
  "", "y", "y", "y")
sex <- c(
  "F", "F", "F", "F", "F", "F",
  "X", "X", "X", "M", "M", "M",
  "M", "M", "M", "M", "X", "X",
  "X", "F", "M", "F")
rac <- c(
  "AA", "AA", "AA", "XX", "XX", "XX",
  "CA", "CA", "CA", "CA", "CA", "AA",
  "AA", "XX", "XX", "XX", "XX", "XX",
  "XX", "CA", "AA", "XX")
ext <- ".csv"
fz <- paste(dir,id1,us,loc,id2,tis,us,age,unt,us,sex,rac,ext,sep="")
signal.all <- matrix(-1,54675,22)
detect.all <- matrix(-1,54675,22)
for (i in 1:22) {
  tmp <- read.csv(fz[i])
  signal.all[,i] <- tmp[,2]
  detect.all[,i] <- tmp[,3]
}
The value of breaking the filenames up into segments becomes apparent when you want to look at a particular subset of the chips. For example, the R code
signal.li <- signal.all[,tis=="Li"]
will create a matrix consisting of the 16 chips associated with liver tissue. We can get the sex information for these 16 patients with the command
sex.li <- sex[tis=="Li"]
and so forth.
Averting a disaster in the prenatal liver study
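When I was trying to normalize the data, I noticed that three of the arrays had rather unusual properties. When trying to normalize array 6287 versus the median array, the R vs I plot looked like this:
[Figure: R vs I plot for array 6287 versus the median array]
This was much more scattered than most of the other plots, such as the one for 7446:
[Figure: R vs I plot for array 7446 versus the median array]
When I plotted pairs of arrays versus each other, it became even more apparent. Here is what 6287 versus 7446 looked like:
[Figure: plot of array 6287 versus array 7446]
Compare this to 7446 versus 7447:
[Figure: plot of array 7446 versus array 7447]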
It turns out that the order of the genes was not the same in all of the files. For example, in file 6287, the first ten genes were
- 1007_s_at
- 1053_at
- 117_at
- 121_at
- 1255_g_at
- 1294_at
- 1316_at
- 1320_at
- 1405_i_at
- 1431_at
while in file 7446, the first ten genes were
- 117_at
- 121_at
- 177_at
- 179_at
- 320_at
- 336_at
- 564_at
- 632_at
- 823_at
- 1053_at
By assuming that all the files listed their genes in the exact same order, I had effectively shuffled the values of three of the arrays and ruined any analyses. To fix this, I had to sort the CSV files to ensure that the gene names were in the same order for each file. Then I added a couple of extra lines of code to double-check that the files were now in a consistent order. First, I got the probeset list from the first file. Then, when reading in the remaining files, I compared each probeset list to the one from the first file. If there were any mismatches, then the sum would equal a value larger than zero.
# trimWhiteSpace is assumed here to come from the limma package
library(limma)
tmp <- read.csv(fz[1])
signal.all[,1] <- tmp[,2]
detect.all[,1] <- tmp[,3]
gene.probeset <- trimWhiteSpace(as.character(tmp$Probeset))
for (i in 2:22) {
  tmp <- read.csv(fz[i])
  signal.all[,i] <- tmp[,2]
  detect.all[,i] <- tmp[,3]
  check.probeset <- trimWhiteSpace(as.character(tmp$Probeset))
  # count the mismatches with the first file; this should print 0
  print(sum(gene.probeset!=check.probeset))
}
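An alternative to sorting the files by hand is to let R line the rows up by probeset name as each file is read. This is a sketch of that idea using the match function, not the approach I took at the time. The two small data frames stand in for two of the CSV files and their signal values are invented.

```r
# Two made-up files with the same probesets in a different order
ref <- data.frame(Probeset = c("1007_s_at", "1053_at", "117_at"),
                  Signal = c(10, 20, 30), stringsAsFactors = FALSE)
other <- data.frame(Probeset = c("117_at", "1007_s_at", "1053_at"),
                    Signal = c(33, 11, 22), stringsAsFactors = FALSE)
# For each probeset in the reference file, find its row in the other file
idx <- match(ref$Probeset, other$Probeset)
# Stop if any probeset is missing from the other file
stopifnot(!anyNA(idx))
aligned <- other[idx, ]
# Stop unless the probesets now line up row by row
stopifnot(identical(aligned$Probeset, ref$Probeset))
signal.pair <- cbind(ref$Signal, aligned$Signal)
print(signal.pair)
```

Using stopifnot instead of print means that a mismatch halts the script rather than scrolling past unnoticed.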
I should have been more careful at the beginning, but at least I caught the problem before I ran any serious analyses. Whew!