Category >> Statistical computing (created 2007-08-22)

These pages describe the computational aspects of statistics. Also see Category: Data management, Category: R software, Category: SPSS software, Category: Statistical theory, Category: Unusual data. Other entries about statistical computing can be found in the statistical computing page at the StATS website.

2011

36. P.Mean: Validation of OpenEpi software (created 2011-07-27). I was asked to "validate" a software program called OpenEpi. If you want to validate software, you need to show that it produces correct answers for a variety of test cases. This webpage outlines the range of test cases and demonstrates validity for those cases by comparing them to an alternative program and to published peer-reviewed research sources.

35. P.Mean: What I'd look for in a new computer (created 2011-05-16). I am hardly an expert on computing, but I do try to help out when someone asks me about what sort of computer they should buy for statistical analyses. Here are some general guidelines that I offer. I'm assuming that you want a system that can run Windows and the advice here is not all that helpful if you are using the MacOS or Linux.

34. P.Mean: Macros in Stata (created 2011-03-08). I have just started using macros in Stata. I like R better, but Stata has a pretty good set of macro facilities, once you get the hang of things. Here is a simple example.

2010

33. P.Mean: Exponential interpolation (created 2010-02-11). Someone wanted an exponential interpolation formula. It's not quite a statistics question, but it caught my interest.
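
The entry above does not reproduce the formula. One common reading of "exponential interpolation" (an assumption here) is to pass a curve y = a*exp(b*x) through two known points, which amounts to interpolating linearly on the log scale. A minimal Python sketch, with an illustrative function name:

```python
import math

def exp_interpolate(x, x0, y0, x1, y1):
    """Interpolate at x on the curve y = a * exp(b * x) through (x0, y0) and (x1, y1).

    Assumes y0 and y1 are positive and x0 != x1.
    """
    if y0 <= 0 or y1 <= 0:
        raise ValueError("both y values must be positive")
    if x0 == x1:
        raise ValueError("x0 and x1 must differ")
    t = (x - x0) / (x1 - x0)  # fractional position between the two points
    # Interpolate log(y) linearly, then transform back.
    return math.exp((1 - t) * math.log(y0) + t * math.log(y1))

# Halfway between y = 1 and y = 100 the curve passes through the geometric mean.
print(exp_interpolate(5, 0, 1.0, 10, 100.0))
```

At the midpoint this gives the geometric mean of the two y values rather than the arithmetic mean that linear interpolation would give.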

32. P.Mean: SPSS or Stata? (created 2010-01-19). I am an SPSS user. Some of my friends are choosing to leave SPSS and learn Stata. What are the advantages of Stata over SPSS?

Other resources:

Murrell J. 1 Guy vs. 1-Click - guy wins. Description: Software patents are controversial, and the one patent that best illustrates this controversy is the Amazon 1-Click patent. This entry from the October 17, 2007 Good Morning Silicon Valley blog traces the history of this patent and of a lone individual who successfully challenged it. This website was last verified on 2007-10-17. URL: svextra.com/blogs/gmsv/2007/10/1_guy_vs_1-click_--_guy_wins.html

Knusel L. Accuracy of Statistical Packages. Description: An overview of the numerical accuracy of statistical algorithms in Gauss, Excel, Matlab, SAS, and S-Plus. This website was last verified on 2008-02-25. URL: www.stat.uni-muenchen.de/~knuesel/elv/accuracy.html

The University of Texas M. D. Anderson Cancer Center. BAM Software Download Site. Link last verified on 2008-07-08. Description: This page lists a wide range of programs. Each program focuses on a single purpose. Two examples are BP1CI (Computes Clopper-Pearson confidence intervals for one-sample binomial and Poisson) and Adaptive Randomization (Outcome-adaptive randomization for clinical trials). biostatistics.mdanderson.org/SoftwareDownload/

Heiser DA. Errors, Faults and Fixes for Excel Statistical Functions and Routines (as of June, 2009). Excerpt: Spreadsheets have become indispensable tools for getting the informational work done. They are empowering tools that are expressive and apparently simple, yet underneath very complex. Text and numbers can be intermingled. They can be subservient in that they facilitate peer-to-peer sharing, non-technical people can do analysis and share the data and results. They facilitate back-channel, behind the scenes communications. They also create enormous problems with errors in data entry, errors in equations, misuse of data sections, and incorrect use of functions. Available at: http://www.daheiser.info/excel/frontpage.html [Accessed September 22, 2009].

Webpage: Rick Wicklin. Estimating popularity based on Google searches: Why it's a bad idea - The DO Loop. Excerpt: "Some people search the Internet for a set of topics and then use the number of search results ("hits") for each topic to rank the relative popularity of the topics. At the 2011 Joint Statistical Meetings (JSM), I had the opportunity to attend several talks by statisticians from Google and other large Internet companies. When I chatted with some of these statisticians after talks, they confirmed what I had suspected: it's a bad idea to estimate the popularity of a person or product based on the results of an Internet search." [Accessed on September 15, 2011]. http://blogs.sas.com/content/iml/2011/08/19/estimating-popularity-based-on-google-searches-why-its-a-bad-idea/.

EuSpRIG. European Spreadsheet Risks Interest Group - spreadsheet risk management and solutions conference. Excerpt: "EuSpRIG is the largest source of information on practical methods for introducing into organisations processes and methods to inventory, test, correct, document, backup, archive, compare and control the legions of spreadsheets that support critical corporate infrastructure." [Accessed May 1, 2010]. Available at: http://www.eusprig.org/.

SAS Institute. FASTats: Frequently Asked-For Statistics. Usage Note 30333.  Description: This webpage lists hundreds of specialized statistics and explains how to compute them using SAS software. [Accessed February 11, 2009] Available at: http://support.sas.com/kb/30/333.html.

Webpage: Robert Muenchen. The Popularity of Data Analysis Software. Excerpt: "Abstract: This page presents various ways of measuring the popularity or market share of BMDP, JMP, Minitab, R, R-PLUS, Revolution R, S-PLUS, SAS, SPSS, Stata, Statistica, and Systat, as well as two implementations of the SAS Language, Carolina and WPS. I plan to update this paper twice a year at http://r4stats.com to provide an ongoing view of the software. The most recent update was on March 20, 2011 when I added updated most the data overall and added coverage of Statistica, WPS and Carolina." [Accessed on March 30, 2011]. http://r4stats.com/popularity

Cryer JD. Problems With Using Microsoft Excel for Statistics. Excerpt: In this talk I will illustrate Excel's serious deficiencies in five areas of basic statistics: Graphics, Help Screens, Computing Algorithms, Treatment of Missing Data, and Regression. http://www.cs.uiowa.edu/~jcryer/JSMTalk2001.pdf

Patrick Vandewalle, Jelena Kovacevic, Martin Vetterli. Reproducible Research. Excerpt: "Welcome on this site about reproducible research. This site is intended to gather a lot of information and useful links about reproducible research. As the authors (Patrick Vandewalle, Jelena Kovacevic and Martin Vetterli) are all doing research in signal/image processing, that will also be the main focus of this site." [Accessed October 5, 2010]. Available at: http://reproducibleresearch.net.

Webpage: Margaret Yau. Statcato: Open-Source Java Software for Elementary Statistics. Excerpt: "Statcato is a free Java software application developed for elementary statistical computations. Its features are tailored for community college statistics students and instructors" [Accessed on May 29, 2012]. http://www.statcato.org/statcato/.

Keeling, K.B. & Pavur, R.J. (2011). Statistical accuracy of spreadsheet software. American Statistician, 65, 265-273.

Michael N. Mitchell. Strategically using General Purpose Statistics Packages: A Look at Stata, SAS and SPSS. Abstract: "This report describes my experiences using general purpose statistical software over 20 years and for over 11 years as a statistical consultant helping thousands of UCLA researchers. I hope that this information will help you make strategic decisions about statistical software: the software you choose to learn, and the software you choose to use for analyzing your research data." [Accessed January 14, 2010]. Available at: http://www.ats.ucla.edu/stat/technicalreports/number1_editedFeb_2_2007/ucla_ATSstat_tr1_1.1_0207.pdf

SAS Institute. Ten Great Reasons Why A Statistician Should Update to SAS 9.2.  Description: I have not used SAS in ten years, but it helps to know what I am missing. The latest version of SAS has generalized linear mixed models, quantile regression, and Markov Chain Monte Carlo solutions for several procedures. SAS has also implemented certain model selection approaches like LAR and LASSO.  [Accessed May 26, 2009] Available at: http://support.sas.com/rnd/app/da/stat_top10.html.

Burns P. Spreadsheet Addiction. Excerpt: The goal of computing is not to get an answer, but to get the correct answer. Often a wrong answer is much worse than no answer at all. There are a number of features of spreadsheets that present a challenge to error-free computing.  [Accessed August 18, 2009] Available at: www.burns-stat.com/pages/Tutor/spreadsheet_addiction.html.

All of the material above this paragraph is licensed under a Creative Commons Attribution 3.0 United States License. This page was written by Steve Simon and was last modified on 2017-06-15. The material below this paragraph links to my old website, StATS. Although I wrote all of the material listed below, my ex-employer, Children's Mercy Hospital, has claimed copyright ownership of this material. The brief excerpts shown here are included under the fair use provisions of U.S. Copyright laws.

2007

31. Stats: Some UNIX humor (March 22, 2007). The website www.netfunny.com has a lot of computer humor. Here's one that caught my eye because I am taking a class in Bioinformatics and am learning a lot about UNIX.

30. Stats: Manipulating graphic images using a touchscreen interface (January 23, 2007). There is a cute video out showing some examples of manipulating graphic images using a touchscreen interface: www.flixxy.com/minority-report-interface.htm. In particular, the operator can expand images and zoom in by stretching two fingers apart.

2006

29. Stats: Citing statistical software in research papers (August 25, 2006). One of the most common questions I get is how to properly acknowledge the use of a program like SPSS in a research publication. In a series of messages in the MEDSTATS discussion group (August 2006) about the citation of statistical software, one writer (JU) pointed out that a bibliographic citation in a consistent format is important because various sources track the frequency of citations in research.

28. Stats: Likelihood software for clinical trials (June 29, 2006). A member of our local IRB forwarded a press release announcing new software that would revolutionize the conduct of clinical trials. The company is called Analytical Edge and they offer a software program called Pure Likelihood.

27. Stats: Resources for Item Response Theory (June 28, 2006). One area that I would like to investigate when I have time is Item Response Theory (IRT). This is an approach that is sometimes called a latent trait model or a Rasch model, though there are some technical distinctions between these terms. IRT is popular in testing as it allows for the modeling of the difficulty of individual questions on a test and their ability to discriminate among individuals of different abilities. It can also be used for examining the validity of a scale that consists of the sum of a number of Likert scale items (or in general a sum of ordinal scale items).

26. Stats: Software for structural equations models (March 14, 2006). A recent email on the Medstats discussion group discussed software for structural equations models.

25. Stats: The Journal of Statistical Software (January 18, 2006). The January 2006 issue of Amstat News had the following announcement: The Journal of Statistical Software is now an official ASA journal. Published by UCLA Statistics since 1996, the editorial board and the website will remain as they now are, and UCLA Statistics will continue to host and manage the journal. Access will remain free for both articles and software.

2005

24. Stats: Web page for Fisher's Exact test (November 17, 2005). I get lots of simple requests, such as a request to calculate Fisher's Exact test for a particular two by two table. I am happy to help out, but there are free web pages that perform these calculations for you.
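
For readers who would rather compute the test themselves, here is a minimal Python sketch of the two-sided Fisher's Exact test for a two by two table. It is not taken from any of the web pages mentioned in the entry. The function name is illustrative, and the two-sided rule used (sum the probabilities of all tables no more likely than the observed one) is one common convention, assumed here.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's Exact test p-value for the table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2

    def ways(k):
        # Number of arrangements with k in the top-left cell, all margins fixed.
        return comb(row1, k) * comb(row2, col1 - k)

    k_min = max(0, col1 - row2)
    k_max = min(row1, col1)
    observed = ways(a)
    # Add up every table that is no more likely than the observed table.
    # Integer counts are compared, so there are no floating-point ties to worry about.
    extreme = sum(ways(k) for k in range(k_min, k_max + 1) if ways(k) <= observed)
    return extreme / comb(n, col1)

# A small made-up table: 3 and 1 in the first row, 1 and 3 in the second.
print(fisher_exact_two_sided(3, 1, 1, 3))
```

For this table the tables at least as extreme account for 34 of the 70 possible arrangements, so the p-value is 34/70.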

23. Stats: Free statistics software (September 15, 2005). The number and quality of free statistical software offerings appears to be increasing. Here are two interesting sites that offer good quality software for free.

22. Stats: Computing normal probabilities (May 19, 2005). One of the programmers here at Children's Mercy Hospital is helping me take an Excel spreadsheet and develop a Windows program that will do the same calculations faster and cleaner. The Excel spreadsheet uses the NORMSDIST function, which computes the probability that a standard normal distribution is less than or equal to a given value. Basically, it approximates the area under the bell shaped curve. He wanted to know what the formula was for this.
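
The standard normal probability has no closed form, but it can be written in terms of the error function: P(Z <= z) = 0.5 * (1 + erf(z / sqrt(2))). The Python sketch below uses that identity with the standard library's math.erf. It is offered as one way to get the same quantity, not as the formula Excel uses internally or the one given in the entry.

```python
import math

def std_normal_cdf(z):
    """P(Z <= z) for a standard normal Z, using the error function identity."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

for z in (-1.96, 0.0, 1.0, 1.96):
    print(z, round(std_normal_cdf(z), 4))
```

For values far in the lower tail, 0.5 * math.erfc(-z / math.sqrt(2.0)) is the same quantity and tends to keep more significant digits.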

21. Stats: The S+ CorrelatedData Library (May 19, 2005). Jose Pinheiro gave a web seminar on the S+ CorrelatedData Library. The CorrelatedData library extends the Generalized Linear Model (GLM) to single level and multi-level group problems.

20. Stats: S-plus version 7 (April 19, 2005). I attended a web seminar by David Smith introducing the latest version of S-plus, version 7. This software, produced by Insightful Corporation, is one of my favorite products for producing advanced statistical analyses.

19. Stats: Using Mathematica and Matlab for Statistics (March 22, 2005). Mathematica, produced by Wolfram Research, is a program which performs numeric and symbolic computations. Matlab, produced by Mathworks, is a high level language for solving mathematical problems.

18. Stats: Seminar notes, S-PLUS Clinical Safety Miner (March 10, 2005). I attended a web seminar by Michael O'Connell, "Applications in Drug Discovery and Development. S-PLUS Clinical Safety Miner." Michael O'Connell is the Director, Life Science Solutions at Insightful Corporation, the company that produces S-plus software.

17. Stats: Publicon software (March 14, 2005). Wolfram Research, makers of the famous software for symbolic mathematical calculations, Mathematica, have released a new product for technical publications, Publicon. According to their website, Publicon provides an easy-to-use graphical interface for creating publication-quality technical documents that integrate text, searchable typeset equations, graphics, hyperlinks, endnotes, and references. Built-in palettes, templates, and style sheets simplify the creation of documents that conform to established formats but also allow for complete customization.

16. Stats: Optimization using the MM algorithm (February 14, 2005). Optimization using a computer is a rather difficult and complex process because an approach that works well for one set of problems may perform poorly for another set. A nice tutorial on the state of the art for optimization is: A Tutorial on MM Algorithms. Hunter DR, Lange K. The American Statistician 2004; 58(1): 30-37.

2004

15. Stats: Surromed and the DecisionSite S-Plus Server Solution (December 8, 2004). Keith Joho from Surromed, Michael O'Connell from Insightful and Ed Tobin from Spotfire presented a web seminar on Surromed and the DecisionSite S-Plus Server Solution.

14. Stats: Software patents (November 19, 2004). I have always found the concept of software patents confusing. I am listed as a co-inventor on a software patent, so perhaps that raises a conflict of interest. Still, here is my best understanding of the issue.

13. Stats: Automating statistical analyses (October 25, 2004).  Many of my projects involve writing a program in my text editor (Textpad), copying it into the clipboard, and then pasting it into a statistical program like S-plus or Stata. I just found out through the Stata Text Editor FAQ about a new product, AutoIt, that can simplify this process. In the past, I had used a commercial product, QuicKeys, to automate some of my work, but have not used it recently. AutoIt is an Open Source program which means that the source code is open for others to modify or improve it.

12. Stats: Statistical software (August 25, 2004). Back when I was writing the FAQ (Frequently Asked Questions) for stat-l/sci.stat.consult, I had a question about how to contact commercial statistical software vendors and where to find free statistical software. Unfortunately, stat-l has fallen on hard times, so I have not updated this FAQ for several years. But I do want to convert those two lists to a different part of my web pages and update them.

11. Stats: Microsoft Excel Pivot Tables (August 5, 2004). Someone at my church wanted some advice about exploring relationships in a survey that he had taken. He asked a bunch of demographic questions (age, sex, income, etc.) and then some yes/no questions. He had the data in an Excel spreadsheet and didn't quite know what to do with it. I'm not a big fan of Excel for statistical analysis, but a few simple pivot tables would be a nice start.

10. Stats: Statalist (July 29, 2004). I recently upgraded to version 8 of Stata, which is a nice program for advanced statistical analyses. During the registration of the software, the program asked me if I wanted to join Statalist, which is a listserv for discussion about Stata.

9. Stats: Aliasing patterns (July 19, 2004). When you draw lines and curves on a computer screen, most of them end up with a subtle staircase pattern because you are using discrete pixels to represent a smooth line or curve. Most of the time, this pattern is barely noticeable. But when you try to fit too many lines or curves together, aliasing can create some false and artificial patterns. I wrote a simple program in R to illustrate this.

8. Stats: Provisional means algorithm (July 9, 2004). One of the new fellows asked me about a data summary for two groups of patients. The ages were quite different: the mean age was 4.6 years in the exposure group and 8.5 years in the control group. But the control group happened to include a 21 year old patient (a bit of an outlier in a pediatric hospital). In the treatment group, the oldest patient was 10 years old. To what extent is the outlier influencing the difference in average ages? I did not have the original data from this study, but there is a cute trick that allows you to remove a single data point from your summary calculations. It relies on an algorithm for the mean and variance calculations known as the provisional means algorithm.
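
The trick can be sketched in a few lines of Python. The provisional means algorithm updates the count, the mean, and the sum of squared deviations one observation at a time, and the same update can be run in reverse to take an observation back out. The function names and the ages below are made up for illustration; they are not the data from the study.

```python
def add_point(n, mean, m2, x):
    """Update (count, mean, sum of squared deviations) after adding x."""
    n += 1
    delta = x - mean
    mean += delta / n
    m2 += delta * (x - mean)
    return n, mean, m2

def remove_point(n, mean, m2, x):
    """Reverse the update: the summary as if x had never been included."""
    if n < 2:
        raise ValueError("need at least two points to remove one")
    new_mean = (n * mean - x) / (n - 1)
    m2 -= (x - mean) * (x - new_mean)
    return n - 1, new_mean, m2

# Hypothetical ages, including one 21 year old outlier.
ages = [5, 7, 8, 9, 21]
n, mean, m2 = 0, 0.0, 0.0
for age in ages:
    n, mean, m2 = add_point(n, mean, m2, age)
print(n, mean, m2 / (n - 1))      # summary with the outlier

n, mean, m2 = remove_point(n, mean, m2, 21)
print(n, mean, m2 / (n - 1))      # summary without the outlier
```

When only a published summary is available, the sum of squared deviations can be recovered as (n - 1) times the squared standard deviation, so remove_point needs just the count, mean, and standard deviation plus the value being removed.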

7. Stats: The impact of XML on Statistics (June 23, 2004). XML (eXtensible Markup Language) is a standard for storing information that has the flexibility needed to handle complex data. It is similar to HTML (HyperText Markup Language) in that it uses tags to denote information. An XML file is a simple text file, which means that you can use a text editor to view the raw data. Although this may not be the most efficient way to look at an XML file, it allows you to peek at the information in any XML file without any special software. There are three areas that show how useful XML is for Statistics.
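
To make the idea of tagged data concrete, here is a small Python sketch that reads a made-up XML data set with the standard library's xml.etree.ElementTree module. The tag names (study, patient, age, group) and the values are hypothetical and do not come from the entry.

```python
import xml.etree.ElementTree as ET

# A made-up data set; every tag name and value here is hypothetical.
xml_text = """
<study>
  <patient id="1"><age>4.5</age><group>exposed</group></patient>
  <patient id="2"><age>8.0</age><group>control</group></patient>
  <patient id="3"><age>6.5</age><group>control</group></patient>
</study>
""".strip()

root = ET.fromstring(xml_text)

# The tags say what each value means, so fields are pulled out by name.
ages = [float(p.findtext("age")) for p in root.findall("patient")]
groups = [p.findtext("group") for p in root.findall("patient")]

print(ages)
print(groups)
print(sum(ages) / len(ages))
```

Because the file is plain text, the same data can also be read directly in a text editor, as the entry points out.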

6. Stats: S+ArrayAnalyzer web seminar (June 22, 2004). Michael O'Connell and Richard Park gave a nice web seminar on the S+ArrayAnalyzer, a software program for analysis of microarray data that is marketed by Insightful Corporation.

5. Stats: Data Mining with Clementine (June 15, 2004). I attended an SPSS web seminar about their Clementine program, which performs data mining. The talk was oriented to business applications, but still had some interesting general insights. The speaker started with the claim that projects that incorporated data mining technologies had a much greater return on investment than other projects.

4. Stats: Monte Carlo methods (May 31, 2004). I got a question from someone in my office who is taking a class on research methods. She asked me to define the term "Monte Carlo" that her teacher had used without much of an explanation.
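
A short illustration of the term, in Python: a Monte Carlo method answers a question by simulating random outcomes many times and summarizing the results. The sketch below estimates the probability that two dice sum to seven, which is known to be 1/6, so the simulated answer can be checked. The example and the function name are mine, not from the entry.

```python
import random

def estimate_prob_seven(n_trials, seed=0):
    """Estimate P(two dice sum to 7) by simulating n_trials rolls."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    hits = 0
    for _ in range(n_trials):
        if rng.randint(1, 6) + rng.randint(1, 6) == 7:
            hits += 1
    return hits / n_trials

estimate = estimate_prob_seven(100_000)
print(estimate, "versus the exact value", 1 / 6)
```

The estimate gets closer to the exact value as the number of trials grows, with an error that shrinks roughly in proportion to one over the square root of the number of trials.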

3. Stats: Acuity microarray analysis software (May 6, 2004). I got a request to evaluate some software by Axon Instruments for the analysis of microarray data. The software, Acuity version 3.1, costs $4,000 per person and has to compete with other commercial software such as S+ ArrayAnalyzer from Insightful Corporation, Microarray Solution from SAS Corporation, GeneSpring from Silicon Genetics, and DecisionSite Statistics from Spotfire Corporation.

2. Stats: EM Algorithm (March 15, 2004). I received an email question about the EM Algorithm. This is a computational approach that works well for missing data problems and data models with latent (unobserved) variables. The basic approach is to estimate the missing or latent data (E-step), compute maximum likelihood estimates that incorporate the missing/latent estimates (M-step), then update the missing or latent data (E-step) and so forth.
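
The alternation between the two steps can be sketched in Python for one simple latent variable model: a mixture of two normal distributions with a known, common standard deviation, where the latent variable is which component produced each observation. The model, the function name, the data, and the starting values are all assumptions chosen for illustration.

```python
import math

def em_two_normals(data, mu1, mu2, weight=0.5, sd=1.0, n_iter=50):
    """EM for a mixture of two normals with a known, common SD.

    Returns the estimated means and the mixing weight of component 1.
    Assumes neither component ends up with essentially no observations.
    """
    def density(x, mu):
        # The normalizing constant is the same for both components and cancels.
        return math.exp(-0.5 * ((x - mu) / sd) ** 2)

    for _ in range(n_iter):
        # E-step: expected membership in component 1 for each observation.
        resp = []
        for x in data:
            p1 = weight * density(x, mu1)
            p2 = (1 - weight) * density(x, mu2)
            resp.append(p1 / (p1 + p2))
        # M-step: weighted maximum likelihood estimates given those memberships.
        total1 = sum(resp)
        total2 = len(data) - total1
        mu1 = sum(r * x for r, x in zip(resp, data)) / total1
        mu2 = sum((1 - r) * x for r, x in zip(resp, data)) / total2
        weight = total1 / len(data)
    return mu1, mu2, weight

# Made-up data with two well separated clusters.
data = [-0.2, 0.1, 0.3, -0.1, 9.8, 10.1, 10.3, 9.9, 10.0]
print(em_two_normals(data, mu1=1.0, mu2=8.0))
```

With clusters this far apart the memberships are nearly certain, so the estimates settle almost immediately on the two cluster means; with overlapping clusters the iterations matter much more.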

2003

1. Stats: Accuracy computations (November 26, 2003). Dear Professor Mean, I've heard a lot about accuracy problems with Microsoft Excel, but I'd like to see an example where this really is a problem.
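
The entry's own example is not reproduced in this excerpt. As a stand-in, the Python sketch below shows one well-known kind of accuracy problem in statistical software generally: the one-pass "calculator" formula for the variance loses nearly all of its precision when the data have a large mean and a small spread. The data and function names are made up for illustration, and nothing here is a claim about how any particular version of Excel computes a variance.

```python
def naive_variance(xs):
    """One-pass formula: (sum of squares - (sum)^2 / n) / (n - 1)."""
    n = len(xs)
    total = sum(xs)
    total_sq = sum(x * x for x in xs)
    return (total_sq - total * total / n) / (n - 1)

def two_pass_variance(xs):
    """Compute the mean first, then sum the squared deviations from it."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

# Three values with a huge mean and a tiny spread; the true sample variance is 1.
data = [1e9 + 1, 1e9 + 2, 1e9 + 3]
print("one-pass:", naive_variance(data))
print("two-pass:", two_pass_variance(data))
```

The one-pass formula subtracts two numbers near 3e18 that agree in almost every digit, so rounding error in double precision swamps the true answer, while the two-pass version works with deviations of -1, 0, and 1 and is exact here.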
