P.Mean >> Category >> Data mining (created 2007-06-18).

Data mining is a broad class of statistical tools that are designed for massive data sets. Many of the links in this category refer to methods for genetic data sets, especially microarray studies. Articles are arranged by date with the most recent entries at the top. You can find outside resources below the list of articles. Other entries about data mining can be found on the data mining page at the StATS website.


1. P.Mean: Comparing a set of microarray experiments to a model experiment (created 2008-11-01). I have a matrix of effect sizes from numerous microarray experiments. For example, in one matrix I have 200 genes (rows) and 107 experiments (columns). I also have a sort of “model experiment” which contains the values in which I am most interested. I am trying to determine which genes are not statistically different from the “model experiment” value.

Outside resources:

E E Schadt, C Li, C Su, W H Wong. Analyzing high-density oligonucleotide gene expression array data. J. Cell. Biochem. 2000;80(2):192-202. Abstract: "We have developed methods and identified problems associated with the analysis of data generated by high-density, oligonucleotide gene expression arrays. Our methods are aimed at accounting for many of the sources of variation that make it difficult, at times, to realize consistent results. We present here descriptions of some of these methods and how they impact the analysis of oligonucleotide gene expression array data. We will discuss the process of recognizing the 'spots' (or features) on the Affymetrix GeneChip(R) probe arrays, correcting for background and intensity gradients in the resulting images, scaling/normalizing an array to allow array-to-array comparisons, monitoring probe performance with respect to hybridization efficiency, and assessing whether a gene is present or differentially expressed. Examples from the analyses of gene expression validation data are presented to contrast the different methods applied to these types of data." [Accessed February 22, 2011]. Available at: http://www.ncbi.nlm.nih.gov/pubmed/11074587.

Eugene Chudin, Randal Walker, Alan Kosaka, et al. Assessment of the relationship between signal intensities and transcript concentration for Affymetrix GeneChip arrays. Genome Biol. 2002;3(1):RESEARCH0005. Abstract: "BACKGROUND: Affymetrix microarrays have become increasingly popular in gene-expression studies; however, limitations of the technology have not been well established for commercially available arrays. The hybridization signal has been shown to be proportional to actual transcript concentration for specialized arrays containing hundreds of distinct probe pairs per gene. Additionally, the technology has been described as capable of distinguishing concentration levels within a factor of 2, and of detecting transcript frequencies as low as 1 in 2,000,000. Using commercially available arrays, we assessed these representations directly through a series of 'spike-in' hybridizations involving four prokaryotic transcripts in the absence and presence of fixed eukaryotic background. The contribution of probe-target interactions to the mismatch signal was quantified under various analyte concentrations. RESULTS: A linear relationship between transcript abundance and signal was consistently observed between 1 pM and 10 pM transcripts. The signal ceased to be linear above the 10 pM level and commenced saturating around the 100 pM level. The 0.1 pM transcripts were virtually undetectable in the presence of eukaryotic background. Our measurements show that preponderance of the signal for mismatch probes derives from interactions with the target transcripts. CONCLUSIONS: Landmark studies outlining an observed linear relationship between signal and transcript concentration were carried out under highly specialized conditions and may not extend to commercially available arrays under routine operating conditions. Additionally, alternative metrics that are not based on the difference in the signal of members of a probe pair may further improve the quantitative utility of the Affymetrix GeneChip array." [Accessed February 22, 2011]. Available at: http://genomebiology.com/2001/3/1/research/0005.

Torsten Hothorn, Berthold Lausen, Axel Benner, Martin Radespiel-Tröger. Bagging survival trees. Stat Med. 2004;23(1):77-91. Abstract: "Predicted survival probability functions of censored event free survival are improved by bagging survival trees. We suggest a new method to aggregate survival trees in order to obtain better predictions for breast cancer and lymphoma patients. A set of survival trees based on B bootstrap samples is computed. We define the aggregated Kaplan-Meier curve of a new observation by the Kaplan-Meier curve of all observations identified by the B leaves containing the new observation. The integrated Brier score is used for the evaluation of predictive models. We analyse data of a large trial on node positive breast cancer patients conducted by the German Breast Cancer Study Group and a smaller 'pilot' study on diffuse large B-cell lymphoma, where prognostic factors are derived from microarray expression values. In addition, simulation experiments underline the predictive power of our proposal." [Accessed February 22, 2011]. Available at: http://www.ncbi.nlm.nih.gov/pubmed/14695641.

Paola Rancoita, Marcus Hutter, Francesco Bertoni, Ivo Kwee. Bayesian DNA copy number analysis. BMC Bioinformatics. 2009;10(1):10. Abstract: "BACKGROUND: Some diseases, like tumors, can be related to chromosomal aberrations, leading to changes of DNA copy number. The copy number of an aberrant genome can be represented as a piecewise constant function, since it can exhibit regions of deletions or gains. Instead, in a healthy cell the copy number is two because we inherit one copy of each chromosome from each our parents. Bayesian Piecewise Constant Regression (BPCR) is a Bayesian regression method for data that are noisy observations of a piecewise constant function. The method estimates the unknown segment number, the endpoints of the segments and the value of the segment levels of the underlying piecewise constant function. The Bayesian Regression Curve (BRC) estimates the same data with a smoothing curve. However, in the original formulation, some estimators failed to properly determine the corresponding parameters. For example, the boundary estimator did not take into account the dependency among the boundaries and succeeded in estimating more than one breakpoint at the same position, losing segments. RESULTS: We derived an improved version of the BPCR (called mBPCR) and BRC, changing the segment number estimator and the boundary estimator to enhance the fitting procedure. We also proposed an alternative estimator of the variance of the segment levels, which is useful in case of data with high noise. Using artificial data, we compared the original and the modified version of BPCR and BRC with other regression methods, showing that our improved version of BPCR generally outperformed all the others. Similar results were also observed on real data. CONCLUSIONS: We propose an improved method for DNA copy number estimation, mBPCR, which performed very well compared to previously published algorithms. In particular, mBPCR was more powerful in the detection of the true position of the breakpoints and of small aberrations in very noisy data. Hence, from a biological point of view, our method can be very useful, for example, to find targets of genomic aberrations in clinical cancer samples." [Accessed February 24, 2009]. Available at: http://www.biomedcentral.com/1471-2105/10/10.

Joseph G Ibrahim, Ming-Hui Chen, Robert J Gray. Bayesian Models for Gene Expression With DNA Microarray Data. Journal of the American Statistical Association. 2002;97(457):88-99. Abstract: "Two of the critical issues that arise when examining DNA microarray data are (1) determination of which genes best discriminate among the different types of tissue, and (2) characterization of expression patterns in tumor tissues. For (1), there are many genes that characterize DNA expression, and it is of critical importance to try and identify a small set of genes that best discriminate between normal and tumor tissues. For (2), it is critical to be able to characterize the DNA expression of the normal and tumor tissue samples and develop suitable models that explain patterns of DNA expression for these types of tissues. Toward this goal, we propose a novel Bayesian model for analyzing DNA microarray data and propose a model selection methodology for identifying subsets of genes that show different expression levels between normal and cancer tissues. In addition, we propose a novel class of hierarchical priors for the parameters that allow us to borrow strength across genes for making inference. The properties of the priors are examined in detail. We introduce a Bayesian model selection criterion for assessing the various models, and develop Markov chain Monte Carlo algorithms for sampling from the posterior distributions of the parameters and for computing the criterion. We present a detailed case study in endometrial cancer to demonstrate our proposed methodology." [Accessed February 22, 2011]. Available at: http://pubs.amstat.org/doi/abs/10.1198/016214502753479257.

Journal article: Chang Gue Son, Sven Bilke, Sean Davis, Braden T. Greer, Jun S. Wei, Craig C. Whiteford, Qing-Rong Chen, Nicola Cenacchi, Javed Khan. Database of mRNA gene expression profiles of multiple human organs. Genome Research. 2005;15(3):443-450. Abstract: "Genome-wide expression profiling of normal tissue may facilitate our understanding of the etiology of diseased organs and augment the development of new targeted therapeutics. Here, we have developed a high-density gene expression database of 18,927 unique genes for 158 normal human samples from 19 different organs of 30 different individuals using DNA microarrays. We report four main findings. First, despite very diverse sample parameters (e.g., age, ethnicity, sex, and postmortem interval), the expression profiles belonging to the same organs cluster together, demonstrating internal stability of the database. Second, the gene expression profiles reflect major organ-specific functions on the molecular level, indicating consistency of our database with known biology. Third, we demonstrate that any small (i.e., n ~ 100), randomly selected subset of genes can approximately reproduce the hierarchical clustering of the full data set, suggesting that the observed differential expression of >90% of the probed genes is of biological origin. Fourth, we demonstrate a potential application of this database to cancer research by identifying 19 tumor-specific genes in neuroblastoma. The selected genes are relatively underexpressed in all of the organs examined and belong to therapeutically relevant pathways, making them potential novel diagnostic markers and targets for therapy. We expect this database will be of utility for developing rationally designed molecularly targeted therapeutics in diseases such as cancer, as well as for exploring the functions of genes." [Accessed on October 10, 2011]. http://genome.cshlp.org/content/15/3/443.abstract.

Amber Hackstadt, Ann Hess. Filtering for Increased Power for Microarray Data Analysis. BMC Bioinformatics. 2009;10(1):11. Abstract: "BACKGROUND: Due to the large number of hypothesis tests performed during the process of routine analysis of microarray data, a multiple testing adjustment is certainly warranted. However, when the number of tests is very large and the proportion of differentially expressed genes is relatively low, the use of a multiple testing adjustment can result in very low power to detect those genes which are truly differentially expressed. Filtering allows for a reduction in the number of tests and a corresponding increase in power. Common filtering methods include filtering by variance, average signal or MAS detection call (for Affymetrix arrays). In this paper, we study the effects of filtering in combination with the Benjamini-Hochberg method for false discovery rate control and q-value for false discovery rate estimation. RESULTS: Three case studies are used to compare three different filtering methods in combination with the two false discovery rate methods and three different preprocessing methods. For the case studies considered, filtering by detection call and variance (on the original scale) consistently led to an increase in the number of differentially expressed genes identified. On the other hand, filtering by variance on the log2 scale had a detrimental effect when paired with MAS5 and PLIER preprocessing methods, even when the testing was done on the log2 scale. A simulation study was done to further examine the effect of filtering by variance. We find that filtering by variance leads to higher power, often with a decrease in false discovery rate, when paired with either false discovery rate method. This holds regardless of the proportion of genes which are differentially expressed or whether we assume dependence or independence among genes. CONCLUSIONS: The case studies show that both detection call and variance filtering are viable methods of filtering which can increase the number of differentially expressed genes identified. The simulation study demonstrates that when paired with a false discovery rate method, filtering by variance can increase power while still controlling the false discovery rate. Filtering out 50% of probe sets seems reasonable as long as the majority of genes are not expected to be differentially expressed." [Accessed February 24, 2009]. Available at: http://www.biomedcentral.com/1471-2105/10/11.

J. C. Barrett, B. Fry, J. Maller, M. J. Daly. Haploview: analysis and visualization of LD and haplotype maps. Bioinformatics. 2005;21(2):263-265. Abstract: "Summary: Research over the last few years has revealed significant haplotype structure in the human genome. The characterization of these patterns, particularly in the context of medical genetic association studies, is becoming a routine research activity. Haploview is a software package that provides computation of linkage disequilibrium statistics and population haplotype patterns from primary genotype data in a visually appealing and interactive interface." [Accessed February 22, 2011]. Available at: http://bioinformatics.oxfordjournals.org/content/21/2/263.abstract.

Newspaper article: Gina Kolata. How a New Hope in Cancer Fell Apart. The New York Times. July 7, 2011. Excerpt: "The episode is a stark illustration of serious problems in a field in which the medical community has placed great hope: using patterns from large groups of genes or other molecules to improve the detection and treatment of cancer. Companies have been formed and products have been introduced that claim to use genetics in this way, but assertions have turned out to be unfounded. While researchers agree there is great promise in this science, it has yet to yield many reliable methods for diagnosing cancer or identifying the best treatment." [Accessed on July 9, 2011]. http://www.nytimes.com/2011/07/08/health/research/08genes.html.

Jing Tang, Chong Tan, Matej Oresic, Antonio Vidal-Puig. Integrating post-genomic approaches as a strategy to advance our understanding of health and disease. 2009. [Accessed April 22, 2009]. Available at: http://genomemedicine.com/content/1/3/35/.

T. R. Golub, D. K. Slonim, P. Tamayo, et al. Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring. Science. 1999;286(5439):531-537. Abstract: "Although cancer classification has improved over the past 30 years, there has been no general approach for identifying new cancer classes (class discovery) or for assigning tumors to known classes (class prediction). Here, a generic approach to cancer classification based on gene expression monitoring by DNA microarrays is described and applied to human acute leukemias as a test case. A class discovery procedure automatically discovered the distinction between acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL) without previous knowledge of these classes. An automatically derived class predictor was able to determine the class of new leukemia cases. The results demonstrate the feasibility of cancer classification based solely on gene expression monitoring and suggest a general strategy for discovering and predicting cancer classes for other types of cancer, independent of previous biological knowledge." [Accessed February 22, 2011]. Available at: http://www.sciencemag.org/content/286/5439/531.abstract.

Victoria Martin-Requena, Antonio Munoz-Merida, M.Gonzalo Claros, Oswaldo Trelles. PreP+07: improvements of a user friendly tool to pre-process and analyse microarray data. BMC Bioinformatics. 2009;10(1):16. Abstract: "BACKGROUND: Nowadays, microarray gene expression analysis is a widely used technology that scientists handle but whose final interpretation usually requires the participation of a specialist. The need for this participation is due to the requirement of some background in statistics that most users lack or have a very vague notion of. Moreover, programming skills could also be essential to analyse these data. An interactive, easy to use application seems therefore necessary to help researchers to extract full information from data and analyse them in a simple, powerful and confident way. RESULTS: PreP+07 is a standalone Windows XP application that presents a friendly interface for spot filtration, inter- and intra-slide normalization, duplicate resolution, dye-swapping, error removal and statistical analyses. Additionally, it contains two unique implementation of the procedures--double scan and Supervised Lowess--, a complete set of graphical representations--MA plot, RG plot, QQ plot, PP plot, PN plot-- and can deal with many data formats, such as tabulated text, GenePix GPR and ArrayPRO. PreP+07 performance has been compared with the equivalent functions in Bioconductor using a tomato chip with 13056 spots. The number of differentially expressed genes considering p-values coming from the PreP+07 and Bioconductor Limma packages were statistically identical when the data set was only normalized; however, a slight variability was appreciated when the data was both normalized and scaled. CONCLUSIONS: PreP+07 implementation provides a high degree of freedom in selecting and organizing a small set of widely used data processing protocols, and can handle many data formats. Its reliability has been proven so that a laboratory researcher can afford a statistical pre-processing of his/her microarray results and obtain a list of differentially expressed genes using PreP+07 without any programming skills. All of this gives support to scientists that have been using previous PreP releases since its first version in 2003." [Accessed February 24, 2009]. Available at: http://www.biomedcentral.com/1471-2105/10/16.

Richard Pearson, Xuejun Liu, Guido Sanguinetti, et al. puma: a Bioconductor package for propagating uncertainty in microarray analysis. BMC Bioinformatics. 2009;10(1):211. Abstract: "BACKGROUND: Most analyses of microarray data are based on point estimates of expression levels and ignore the uncertainty of such estimates. By determining uncertainties from Affymetrix GeneChip data and propagating these uncertainties to downstream analyses it has been shown that we can improve results of differential expression detection, principal component analysis and clustering. Previously, implementations of these uncertainty propagation methods have only been available as separate packages, written in different languages. Previous implementations have also suffered from being very costly to compute, and in the case of differential expression detection, have been limited in the experimental designs to which they can be applied. RESULTS: puma is a Bioconductor package incorporating a suite of analysis methods for use on Affymetrix GeneChip data. puma extends the differential expression detection methods of previous work from the 2-class case to the multi-factorial case. puma can be used to automatically create design and contrast matrices for typical experimental designs, which can be used both within the package itself but also in other Bioconductor packages. The implementation of differential expression detection methods has been parallelised leading to significant decreases in processing time on a range of computer architectures. puma incorporates the first R implementation of an uncertainty propagation version of principal component analysis, and an implementation of a clustering method based on uncertainty propagation. All of these techniques are brought together in a single, easy-to-use package with clear, task-based documentation. CONCLUSION: For the first time, the puma package makes a suite of uncertainty propagation methods available to a general audience. These methods can be used to improve results from more traditional analyses of microarray data. puma also offers improvements in terms of scope and speed of execution over previously available methods. puma is recommended for anyone working with the Affymetrix GeneChip platform for gene expression analysis and can also be applied more generally." [Accessed August 19, 2009]. Available at: http://www.biomedcentral.com/1471-2105/10/211.

Alexander Karpikov, Joel Rozowsky, Mark Gerstein. Tiling array data analysis: a multiscale approach using wavelets. BMC Bioinformatics. 2011;12(1):57. Abstract: "BACKGROUND: Tiling array data is hard to interpret due to noise. The wavelet transformation is a widely used technique in signal processing for elucidating the true signal from noisy data. Consequently, we attempted to denoise representative tiling array datasets for ChIP-chip experiments using wavelets. In doing this, we used specific wavelet basis functions, Coiflets, since their triangular shape closely resembles the expected profiles of true ChIP-chip peaks. RESULTS: In our wavelet-transformed data, we observed that noise tends to be confined to small scales while the useful signal-of-interest spans multiple large scales. We were also able to show that wavelet coefficients due to non-specific cross-hybridization follow a log-normal distribution, and we used this fact in developing a thresholding procedure. In particular, wavelets allow one to set an unambiguous, absolute threshold, which has been hard to define in ChIP-chip experiments. One can set this threshold by requiring a similar confidence level at different length-scales of the transformed signal. We applied our algorithm to a number of representative ChIP-chip data sets, including those of Pol II and histone modifications, which have a diverse distribution of length-scales of biochemical activity, including some broad peaks. CONCLUSIONS: Finally, we benchmarked our method in comparison to other approaches for scoring ChIP-chip data using spike-ins on the ENCODE Nimblegen tiling array. This comparison demonstrated excellent performance, with wavelets getting the best overall score." [Accessed February 22, 2011]. Available at: http://www.biomedcentral.com/1471-2105/12/57.

Don Maier, Farrell Wymore, Gavin Sherlock, Catherine Ball. The XBabelPhish MAGE-ML and XML Translator. BMC Bioinformatics. 2008;9(1):28. Abstract: "BACKGROUND: MAGE-ML has been promoted as a standard format for describing microarray experiments and the data they produce. Two characteristics of the MAGE-ML format compromise its use as a universal standard: First, MAGE-ML files are exceptionally large - too large to be easily read by most people, and often too large to be read by most software programs. Second, the MAGE-ML standard permits many ways of representing the same information. As a result, different producers of MAGE-ML create different documents describing the same experiment and its data. Recognizing all the variants is an unwieldy software engineering task, resulting in software packages that can read and process MAGE-ML from some, but not all producers. This Tower of MAGE-ML Babel bars the unencumbered exchange of microarray experiment descriptions couched in MAGE-ML. RESULTS: We have developed XBabelPhish - an XQuery-based technology for translating one MAGE-ML variant into another. XBabelPhish's use is not restricted to translating MAGE-ML documents. It can transform XML files independent of their DTD, XML schema, or semantic content. Moreover, it is designed to work on very large (> 200 Mb.) files, which are common in the world of MAGE-ML. CONCLUSION: XBabelPhish provides a way to inter-translate MAGE-ML variants for improved interchange of microarray experiment information. More generally, it can be used to transform most XML files, including very large ones that exceed the capacity of most XML tools." [Accessed February 22, 2011]. Available at: http://www.biomedcentral.com/1471-2105/9/28.

All of the material above this paragraph is licensed under a Creative Commons Attribution 3.0 United States License. This page was written by Steve Simon and was last modified on 2010-04-11. The material below this paragraph links to my old website, StATS. Although I wrote all of the material listed below, my ex-employer, Children's Mercy Hospital, has claimed copyright ownership of this material. The brief excerpts shown here are included under the fair use provisions of U.S. Copyright laws.


36. StATS: Justifying the sample size for a microarray study (August 9, 2007). I'm helping out with a grant proposal that is using microarrays for part of the analysis. A microarray is a system for quantitative measurement of circulating mRNA in human, animal, or plant tissue. A microarray will typically measure thousands or tens of thousands of different mRNA sequences. An important issue for this particular grant (and many grants involving microarray data) is how to justify the sample size. Here are a few references that I will use to develop such a justification.

35. StATS: Resources for fMRI data analysis (February 8, 2007). I was asked to provide feedback on a grant that will use functional magnetic resonance imaging (fMRI) as one component of the research. This technique is used to quantify brain activity by measuring changes in blood flow in various regions of the brain. It effectively produces information in the three dimensions of the brain structure, plus a dimension of time. The technology today can produce images localized to a cube with dimensions of approximately 2-4 mm, and these can be measured every 1-4 seconds. Each individual cubic region is called a voxel, a contraction of the words "volume" and "pixel."

34. StATS: Resources describing biplots (January 15, 2007). I've written some code in R to present a graphical summary of a complex data set using biplots. I wrote most of the code myself using the singular value decomposition function (svd) in R. There is a wide range of techniques that can be loosely classified as biplots, such as principal components analysis, multidimensional scaling, correspondence analysis, and canonical variate analysis.


33. StATS: Univariate Model Based Clustering (April 18, 2006). Back in 2001, I attended an excellent short course on a new approach to cluster analysis taught by Adrian Raftery and Chris Fraley at the Joint Statistics Meetings. Their approach, model based clustering, examined the fits of mixtures of normal distributions. This approach is useful for unidimensional and multidimensional data and has many advantages over other clustering approaches like hierarchical clustering and k-means clustering.
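
The idea of fitting a mixture of normal distributions to one-dimensional data can be illustrated with a minimal sketch in Python. It is not the course material or any particular software package: it assumes exactly two components, uses a crude starting split chosen only for illustration, and fits the mixture with the standard EM algorithm. The function names `normal_pdf` and `fit_two_normals` are hypothetical. Choosing the number of components by comparing fits is not shown.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def fit_two_normals(values, iterations=200):
    """Fit a two-component normal mixture to a list of numbers with EM.

    Returns (weights, means, sds), each a two-element list.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    # Crude starting values (an assumption of this sketch): split the sorted data in half.
    means = [sum(ordered[:half]) / half,
             sum(ordered[half:]) / (len(ordered) - half)]
    overall_mean = sum(values) / len(values)
    overall_sd = math.sqrt(sum((x - overall_mean) ** 2 for x in values) / len(values))
    sds = [overall_sd, overall_sd]
    weights = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: posterior probability that each value belongs to component 0.
        resp0 = []
        for x in values:
            d0 = weights[0] * normal_pdf(x, means[0], sds[0])
            d1 = weights[1] * normal_pdf(x, means[1], sds[1])
            resp0.append(d0 / (d0 + d1))
        # M-step: weighted updates of the mixing weights, means, and standard deviations.
        for k, resp in enumerate((resp0, [1.0 - r for r in resp0])):
            total = sum(resp)
            weights[k] = total / len(values)
            means[k] = sum(r * x for r, x in zip(resp, values)) / total
            var = sum(r * (x - means[k]) ** 2 for r, x in zip(resp, values)) / total
            sds[k] = max(math.sqrt(var), 1e-6)  # floor guards against a collapsed component
    return weights, means, sds


weights, means, sds = fit_two_normals([1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9])
print(weights, means, sds)
```

With two well-separated groups like the made-up values above, the fitted means settle near the two group centers and each observation ends up assigned almost entirely to one component.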

32. StATS: Pharmacogenetics Research Network (September 14, 2006). I received an email today discussing a special conference being held by the Pharmacogenetics Research Network (PGRN). It's an "invitation only" conference, so I must have had someone recommend me for this group.

31. StATS: Seminar notes: Working with molecular biologists (July 17, 2006). One of the talks at the 18th Annual Applied Statistics in Agriculture Conference, sponsored by Kansas State University, was "A visual aid to help a statistician work with a molecular biologist" by Debbie Boykin of the USDA Agricultural Research Service, coauthored by Earl Taliercio, also of the USDA Agricultural Research Service, and Rowena Kelly from Mississippi State University. The original title of the talk was "Improving Power of Microarray Experiments by Adjusting Data so Fewer Differentially Expressed Genes are Overlooked" but Dr. Boykin reviewed the material and decided to change the focus.

30. StATS: A simple function for a Biplot in R (January 24, 2006). I regularly use a biplot or principal components plot as an initial exploratory tool for microarray analyses, but I have not found a good package that does this for me automatically. Rather than re-inventing the code every time, I created a simple R function that does the job for me. It's not the fanciest or best code in the world, but I wanted to put it here and comment on the various alternative forms of the biplot and principal components plot when I have time.

29. StATS: How do the various clustering algorithms work? (January 31, 2006). I'm working with someone on some clustering models for his microarray experiment. He asked how the various clustering algorithms work.

28. StATS: An Ensembl search (February 1, 2006). While working on a microarray experiment with a researcher, we had to find a bit of information about a gene with the gene symbol NCOA3. We went to the Ensembl web site (www.ensembl.org) and did a search which yielded the following information.

27. StATS: Methods for haplotype analysis (May 31, 2006). I am not an expert on haplotype analysis, but as I understand it, a haplotype is a combination of several SNPs (Single Nucleotide Polymorphisms) that show a stronger association with disease than any single SNP might.

26. StATS: Haplotype analysis (January 13, 2006). One of the people I work with wants to include a haplotype analysis in their research grant. I know nothing about haplotype analysis, so I am currently investigating various publications, web sites, and software. I want to include these resources here and eventually organize a web page that describes the statistical approach to haplotype analysis.

25. StATS: The Healthcare Cost and Utilization Project (April 11, 2006). On April 20, I will be attending a webcast sponsored by the Agency for Healthcare Research and Quality (AHRQ) on a large data set they collected, the Healthcare Cost and Utilization Project. The acronym for this data set is HCUP, which I always pronounced HICCUP, but apparently, you are supposed to pronounce it H-CUP. That's not the first time I mispronounced an important acronym. I use a program called STATA, and I used to use a soft A (STAH-TA). But it is actually a hard A (STAY-TA).

24. StATS: Machine Learning tools in R (January 24, 2006). There are a variety of different models for supervised learning or classification problems.


23. StATS: Application of the ROC curve to microarray data (May 26, 2005). Life is full of surprises. When I was looking at whether the software package R could compute and analyze Receiver Operating Characteristic (ROC) curves, I found out that there is an application of ROC curves for microarray data. Apparently, the positive false discovery rate can be conceived of in a diagnostic testing format as relating to the positive predictive value.
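
The link between the two ideas can be made concrete with a small Python sketch. Treat each gene flagged as differentially expressed as a "positive test": the proportion of flagged genes that are false discoveries is then one minus the positive predictive value. The counts below are hypothetical, and the function names are illustrative only. The positive false discovery rate itself is the expected value of this proportion given at least one flagged gene, so the code shows the observed proportion for a single made-up table, not an estimate of the rate.

```python
def positive_predictive_value(true_positives, false_positives):
    """Proportion of positive calls that are correct."""
    called = true_positives + false_positives
    if called == 0:
        raise ValueError("no positive calls, so the PPV is undefined")
    return true_positives / called


def false_discovery_proportion(true_positives, false_positives):
    """Proportion of positive calls that are wrong, which equals 1 - PPV."""
    return 1.0 - positive_predictive_value(true_positives, false_positives)


# Hypothetical table: 100 genes flagged, 80 of them truly differentially expressed.
print(positive_predictive_value(80, 20))
print(false_discovery_proportion(80, 20))
```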

22. StATS: Permutation tests for microarrays (July 27, 2005). A very simple approach to estimating the proportion of differentially expressed genes uses a permutation approach.
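
As a building block, here is a minimal Python sketch of a permutation p-value for a single gene measured in two groups. It does not reproduce the estimate of the proportion of differentially expressed genes that the entry refers to. The test statistic (absolute difference in group means), the exhaustive enumeration of relabelings, and the function name `permutation_p_value` are all assumptions made for illustration. With larger groups you would sample relabelings at random, because enumeration grows quickly.

```python
from itertools import combinations


def permutation_p_value(group_a, group_b):
    """Two-sided exact permutation p-value for a difference in group means."""
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    total = sum(pooled)

    def abs_mean_diff(indices_a):
        # Mean of the values labeled A minus the mean of the rest, in absolute value.
        sum_a = sum(pooled[i] for i in indices_a)
        mean_a = sum_a / n_a
        mean_b = (total - sum_a) / (len(pooled) - n_a)
        return abs(mean_a - mean_b)

    observed = abs_mean_diff(range(n_a))
    count = 0
    relabelings = 0
    # Enumerate every way of assigning n_a of the pooled values to group A.
    for indices_a in combinations(range(len(pooled)), n_a):
        relabelings += 1
        # The small tolerance treats floating-point ties as ties.
        if abs_mean_diff(indices_a) >= observed - 1e-12:
            count += 1
    return count / relabelings


# Made-up expression values for one gene in two tissue types.
print(permutation_p_value([1.0, 1.1, 0.9, 1.2], [2.0, 2.1, 1.9, 2.2]))
```

In this made-up example the two groups do not overlap, so only the observed labeling and its mirror image are as extreme as the data, and the p-value is 2 out of the 70 possible relabelings.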

21. StATS: A totally negative microarray experiment (October 14, 2005). I've been cleaning out my old emails and am finding some real gems of good advice. Someone wrote into the Bioconductor email list wondering what to do when the lowest adjusted p-value in the entire experiment was still very large (0.66). A nice response outlined three strategies:

20. StATS: Naming conventions for genes, proteins, etc. (September 8, 2005). When you are analyzing a microarray experiment, the mRNA sequences can be referred to by several different names.

19. StATS: Step-down procedures for multiple comparisons (June 16, 2005). In some research studies, you have a large and difficult-to-manage set of outcome measures. This is especially true in microarray experiments, where for thousands or tens of thousands of genes, you are measuring the difference in expression levels between two types of tissue. A simple unadjusted p-value is worthless in this situation, because it will be swamped by thousands of other p-values.

18. StATS: RMA normalization of microarrays (October 24, 2005). If you ask most statisticians if they want raw data or processed data, they will, for the most part, prefer to look at the raw data. There are two reasons for this. First, the statisticians want to understand the processing of the data and how that might influence the precision of any further calculations based on the raw data. Second, statisticians may want to try alternative approaches for processing the data and see if that produces better results.

17. StATS: Statistical Analysis of Microarrays by Insightful (August 31, 2005). I attended a seminar presented by Michael O'Connell of Insightful Corporation on microarray analysis. Insightful Corporation has a program, S+ArrayAnalyzer, and this talk showed some of the capabilities of this software.

16. StATS: Publicly available microarray data (August 18, 2005). A working paper from the Johns Hopkins Biostatistics department, Searching for Differentially Expressed Gene Combinations by Marcel Dettling, Edward Gabrielson, and Giovanni Parmigiani, uses two microarray data sets to test the authors' methodology.

15. StATS: Microarray data analysis, again (April 22, 2005). One of these days, I will have a coherent set of pages talking about microarray data analysis, but for now, all I have is a haphazard set of pages and weblog entries, most of which are woefully incomplete. In an effort to try to pull these together, I am listing below all of these links.

14. StATS: More articles on microarrays (March 10, 2005). It's impossible to keep up with the flood of research on microarrays, but here are a few articles that sounded interesting, all published in the journal Statistical Applications in Genetics and Molecular Biology.

13. StATS: Review articles on microarrays (March 7, 2005). The Medical Science Monitor listed these three articles among their most frequently requested downloads. They all look like good overviews of microarray technology.

12. StATS: Analysis of Gene Expression Data Short Course (July 26, 2005). I'll be taking a short course at the Joint Statistical Meetings next month. It will be taught by Terry Speed, Jean Yang, Ben Bolstad, and James Wettenhall.

11. More on discovering gene information (October 12, 2005). I was reading an interesting microarray article: A mechanism of cyclin D1 action encoded in the patterns of gene expression in human cancer. Lamb J, Ramaswamy S, Ford HL, Contreras B, Martinez RV, Kittrell FS, Zahnow CA, Patterson N, Golub TR, Ewen ME. Cell 2003: 114(3); 323-34. I was curious what information I could find about cyclin D1. The article mentions the gene symbol (CCND1) but provides no other obvious clues (at least clues that were obvious to me).

10. StATS: Finding more information about a gene (September 6, 2005). I ran a few simple experiments using microarray data from a public source (www.genome.org/cgi/content/full/15/3/443/DC1). This is the data set used in the publication Database of mRNA gene expression profiles of multiple human organs. Son CG, Bilke S, Davis S, Greer BT, Wei JS, Whiteford CC, Chen QR, Cenacchi N, Khan J. Genome Res 2005: 15(3); 443-50.

9. StATS: Dimension reduction in a microarray experiment (May 25, 2005). Given the large number of genes in a microarray experiment, you need to find some way of looking at subsets or linear combinations of these genes. Assume that you have G genes and M microarrays and that the normalized signals are in a matrix X with G rows and M columns. Assume that information about the particular tissues (phenotypic data) is in a matrix Y with M rows (one per microarray) and P columns.

8. StATS: Two cautionary tales about data mining (January 6, 2005). I attended a 7am seminar this morning on data warehousing and data mining, which was quite good. It reminded me of a couple of stories I heard about the pitfalls of data mining.


7. StATS: A simple microarray experiment (September 21, 2004). Someone just gave me some data with a small microarray. There are four exposed animals (Exp.1 through Exp.4) and four control animals (Exp.5 through Exp.8). The microarray has 96 genes, as well as some housekeeping genes.

6. StATS: Microarray data analysis (March 18, 2004). The large amount of data in a typical DNA microarray assay makes for a lot of challenges for us statisticians. I've wanted to write a simple introductory web page on this topic for a while, but have never found the time to do it well. There are a couple of recent articles on microarrays published on BioMed Central with full text available online. The first, The Limits of Log-Ratios by Sharov et al, points out that there are mathematical constraints on the Ratio-Intensity plot and that if your data bump up against these constraints, you may see some artefactual patterns in the plot. The second, Universal Reference RNA as a standard for microarray experiments by Novoradovskaya et al, develops a reference standard useful "for monitoring and controlling intra- and inter-experimental variation." These articles are too recent to have appeared in Medline yet. An earlier article about universal reference RNA samples is also worth mentioning.

5. StATS: Data Mining with Clementine (June 15, 2004). I attended an SPSS web seminar about their Clementine program, which performs data mining. The talk was oriented to business applications, but still had some interesting general insights. The speaker started with the claim that projects that incorporated data mining technologies had a much greater return on investment than other projects. -- www.spss.com/dk/IDC%20Predictive%20Analytics%20and%20ROI%20Report.pdf

4. StATS: S+ArrayAnalyzer web seminar (June 22, 2004). Michael O'Connell and Richard Park gave a nice web seminar on the S+ArrayAnalyzer, a software program for analysis of microarray data that is marketed by Insightful Corporation. This company makes a lot of very nice software: S+, an object-oriented language for statistical analysis; S+SeqTrial, a system for designing and analyzing group sequential trials; Insightful Miner, data mining software; and Infact, text mining software.

3. StATS: Acuity microarray analysis software. I got a request to evaluate some software by Axon Instruments for the analysis of microarray data. The software, Acuity version 3.1, costs $4,000 per person and has to compete with other commercial software such as S+ ArrayAnalyzer from Insightful Corporation, Microarray Solution from SAS Corporation, GeneSpring from Silicon Genetics, and DecisionSite Statistics from Spotfire Corporation. StatSci.org has a nice list of companies and institutions that produce statistical analysis software for microarrays as part of their overview of microarray data analysis.

2. StATS: Microarray data analysis (March 18, 2004). The large amount of data in a typical DNA microarray assay makes for a lot of challenges for us statisticians. I've wanted to write a simple introductory web page on this topic for a while, but have never found the time to do it well. There are a couple of recent articles on microarrays published on BioMed Central with full text available online. The first, The Limits of Log-Ratios by Sharov et al, points out that there are mathematical constraints on the Ratio-Intensity plot and that if your data bump up against these constraints, you may see some artefactual patterns in the plot. The second, Universal Reference RNA as a standard for microarray experiments by Novoradovskaya et al, develops a reference standard useful "for monitoring and controlling intra- and inter-experimental variation." These articles are too recent to have appeared in Medline yet. An earlier article about universal reference RNA samples is also worth mentioning.


1. StATS: Steps in a typical data mining model (September 22, 2003). I'm not an expert on data mining, but I wanted to outline some of the basic issues associated with data mining problems. This material is based largely on notes that I took during a training class on data mining taught by Richard De Veaux.
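
As a rough illustration of the step-down idea mentioned in entry 19 above, here is a minimal Python sketch of Holm's procedure, which is one common step-down method. The choice of Holm's method, the function name holm_adjust, and the example p-values are assumptions made for illustration only; the entry itself may cover other step-down procedures.

```python
def holm_adjust(p_values):
    """Return Holm step-down adjusted p-values, in the original order.

    Hypothetical helper for illustration; not taken from the StATS page.
    """
    m = len(p_values)
    # Indices of the p-values, ordered from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (rank = k - 1) is multiplied by m - k + 1,
        # and the result is capped at 1.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Step-down monotonicity: an adjusted p-value never falls below
        # the adjusted value of any smaller raw p-value.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Made-up raw p-values for four genes.
raw = [0.001, 0.04, 0.03, 0.20]
adj = holm_adjust(raw)
print(adj)
```

With these made-up inputs the adjusted values are roughly 0.004, 0.09, 0.09, and 0.2. The third-smallest raw p-value is pulled up to 0.09 by the monotonicity step, which is what distinguishes a step-down adjustment from multiplying each p-value independently.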

No date

Microarray bibliography and links. Here are some resources if you, like me, are just starting to learn about microarray data analysis.

Data management in a microarray experiment.

Design of microarray experiments. There are a variety of research designs that you can use in a microarray experiment.

Differential expression in microarray data. You can compute an expression ratio for each gene by taking the average of the log expression levels in the treatment group and subtracting the average of the log expression levels in the control group. This actually produces a log ratio, and you can compute the actual ratio by taking the antilog.
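
The calculation just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the base-2 logarithm, the function name log2_ratio, and the expression values are all made up for the example, since the text above does not specify a log base or any data.

```python
import math
from statistics import mean


def log2_ratio(treatment, control):
    """Mean log2 expression in treatment minus mean log2 expression in control.

    Hypothetical helper; base 2 is an assumed convention.
    """
    treatment_logs = [math.log2(x) for x in treatment]
    control_logs = [math.log2(x) for x in control]
    return mean(treatment_logs) - mean(control_logs)


# Made-up expression levels for one gene (four arrays per group).
treatment = [800.0, 1600.0, 1600.0, 3200.0]
control = [200.0, 400.0, 400.0, 800.0]

lr = log2_ratio(treatment, control)  # the log ratio
ratio = 2 ** lr                      # antilog gives the ratio itself
print(lr, ratio)
```

In this made-up example every treatment value is four times the corresponding control value, so the log ratio is 2 and the antilog is 4, up to floating-point rounding. Note that the antilog of a difference in mean logs is a ratio of geometric means, not of arithmetic means.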

Importing data from microarray studies. There are so many different ways that data can come to you in a microarray experiment that it is hard to document how to import the data. Here are a few examples, plus some random notes and thoughts.

Normalization for microarray data. Normalization is the process of adjusting values in a microarray experiment to improve consistency and reduce bias.

Software for microarray data analysis: R and the Bioconductor package. There is a wide range of software available for the analysis of microarrays. I will use Bioconductor, which is a set of libraries for a statistical programming language called R. Both Bioconductor and R are open source, which means that you can obtain the packages at no cost.

Supervised learning. Here are some documented examples of how to use supervised learning methods for the analysis of microarray data.

Unsupervised learning. Here are some documented examples of how to use unsupervised learning methods for the analysis of microarray data.

What is a microarray? A microarray is a tool for measuring the amount of messenger RNA (mRNA) that is present in a cell. It is the mRNA that carries information from the genes in the DNA inside the nucleus of a cell so that various proteins can be made. Even though they have the exact same DNA, different cells have different amounts of various mRNA because they need to produce different proteins. For example, only certain cells in the pancreas produce insulin even though the DNA code for producing insulin exists inside all cells.

Creative Commons License This work is licensed under a Creative Commons Attribution 3.0 United States License. This page was written by Steve Simon and was last modified on 2010-04-11.