[The Monthly Mean] December 2008--Combining measures on different scales (released 2008-12-05)

Monthly Mean newsletter, December 2008

You are viewing the webpage version of the Monthly Mean newsletter for December 2008. This newsletter was sent out on December 5, 2008.

The monthly mean for December is 16.0.

Welcome to the Monthly Mean newsletter for December 2008. If you are having trouble reading this newsletter in your email system, please go to www.pmean.com/news/2008-12.html. If you are not yet subscribed to this newsletter, you can sign on at www.pmean.com/news. If you no longer wish to receive this newsletter, there is a link to unsubscribe at the bottom of this email. Here's a list of topics.

Monthly Mean Quote.

Combining measures on different scales

A simple example of overfitting

Explaining CART models in simple terms

Should you compare a two-sided p-value to 0.025?

Monthly Mean Article: What is meant by intention to treat analysis? Survey of published randomised controlled trials.

Monthly Mean Blog: Realizations in Biostatistics by Random John.

Monthly Mean Book: Biostatistics, The Bare Essentials, Geoffrey R. Norman and David L. Streiner.

Monthly Mean Website: EDF 5481. Methods of Educational Research. Susan Carol Losh

Nick News: Nicholas the artist.

Very bad joke: How many IRB members does it take to screw in a light bulb?

Tell me what you think

1. Monthly Mean Quote.

"The statistics on sanity are that one out of every four Americans is suffering from some form of mental illness. Think of your three best friends. If they're okay, then it's you." Rita Mae Brown, as quoted at www.quotationspage.com/quote/1380.html.

2. Combining measures on different scales.

Someone was asking about meta-analysis and the process of combining outcomes measured on different scales. Some of the papers in the meta-analysis described their outcomes as a percentage change from baseline, and others as a simple difference in means. The difference in means has a unit attached to it (e.g. mg/dL, mmol/L, etc.), but the percentage change is unitless. There is a way to combine these measures, of course. Simply convert them to a standardized scale (Z-score) and then combine the Z-scores. The question is whether this is a legitimate approach.

I responded that there's plenty of precedent for this, and I think you could find justification by looking in any book about meta-analysis. My library is totally disorganized because of my change in careers, so I can't check my books for a specific page number.

The issue remains, though, about whether combining different outcomes introduces so much heterogeneity that the results just can't be trusted. This is a "gut" call and there is no metric that you can apply to show whether the measures are too different to allow a quantitative pooling. I like to think that apples and oranges can be combined (fruit salad), but sometimes the heterogeneity is so bad that it becomes an "apples and onions" situation. I don't know of any cook who would deliberately mix apples and onions in the same recipe.

Someone wrote me, by the way, with a recipe that uses apples and onions, so I have to take that last comment back.

I talk about heterogeneity a lot in my book, Statistical Evidence in Medical Trials. I start out the chapter on meta-analysis with a New Yorker cartoon. You can view this cartoon by Dana Fradon at the Cartoonbank website. I won't ruin the punchline here, but clearly some numbers should never be combined. By the way, if you regularly use cartoons in your PowerPoint talks, you should pay the author of the cartoon a royalty. Cartoonbank makes it easy to do so. I paid $150 to include this cartoon in my book and it only costs $20 to include this cartoon in your PowerPoint talk.

Later in the same chapter of my book, I give a more specific example.

Example: In a systematic review of beta-2 agonists for treating chronic obstructive pulmonary disease (Husereau 2004), researchers identified 12 studies. But the authors could not pool the results because they "found that even commonly measured outcomes, such as FEV1, could not be combined by meta-analysis because of differences in how they were reported. For example, in the six trials comparing salmeterol with placebo, FEV1 was reported as a mean change in percent predicted, a mean change overall, a mean difference between trial arms, no difference (without data), baseline and overall FEV1 (after 24 hrs without medication) and as an 0 to 12 hour area-under-the-curve (FEV1-AUC) function. We were not successful in obtaining more data from study authors. We also had concerns about the meta-analysis of data from trials of parallel and crossover design and differences in spirometry protocols including allowable medications. Therefore, we decided on a best evidence synthesis approach instead."

Long acting beta2 agonists for stable chronic obstructive pulmonary disease with poor reversibility: a systematic review of randomised controlled trials. D. Husereau, V. Shukla, M. Boucher, S. Mensinkai, R. Dales. BMC Pulm Med 2004: 4(1); 7. Full free text is available at www.biomedcentral.com/1471-2466/4/7

Dr. Husereau doesn't get any royalties and isn't expecting any because (bless his soul) he published in an open access journal.

So, yes, you can use Z-scores to combine outcomes measured on different scales, and yes you can find justification for it. Even so, you should still take a slow and careful look at the problem and decide if you are really comfortable with this approach.
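
To make the mechanics concrete, here is a minimal sketch in Python. It assumes that "standardized scale" means the standardized mean difference (the difference in means divided by the pooled standard deviation), combined with inverse-variance weights under a fixed-effect model. That is one common approach, not necessarily the one your meta-analysis book recommends. The function names and the study numbers are made up for illustration and do not come from any of the papers mentioned above.

```python
import math

def standardized_mean_difference(mean_t, mean_c, sd_t, sd_c, n_t, n_c):
    """Return (d, var_d): the standardized mean difference and its
    approximate large-sample variance."""
    pooled_sd = math.sqrt(
        ((n_t - 1) * sd_t ** 2 + (n_c - 1) * sd_c ** 2) / (n_t + n_c - 2)
    )
    d = (mean_t - mean_c) / pooled_sd
    var_d = (n_t + n_c) / (n_t * n_c) + d ** 2 / (2 * (n_t + n_c))
    return d, var_d

def pool_fixed_effect(effects):
    """Inverse-variance weighted average of (d, var_d) pairs."""
    weights = [1 / var for _, var in effects]
    pooled = sum(w * d for w, (d, _) in zip(weights, effects)) / sum(weights)
    se = math.sqrt(1 / sum(weights))
    return pooled, se

# Hypothetical studies (invented numbers, for illustration only).
# Study A reports a change in mg/dL; study B reports a percentage change.
study_a = standardized_mean_difference(-12.0, -4.0, 16.0, 16.0, 50, 50)
study_b = standardized_mean_difference(-9.0, -3.0, 12.0, 12.0, 40, 40)

pooled_d, pooled_se = pool_fixed_effect([study_a, study_b])
print(f"Study A d = {study_a[0]:.2f}, study B d = {study_b[0]:.2f}")
print(f"Pooled d = {pooled_d:.2f} (SE {pooled_se:.2f})")
```

Both hypothetical studies end up on the same unitless scale even though one started in mg/dL and the other as a percentage change. That makes the pooling arithmetically possible, but it tells you nothing about whether the pooling is sensible.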

Did you like this article? Visit http://www.pmean.com/category/SystematicOverviews.html for related links and pages.

3. A simple example of overfitting

A couple of the Internet discussion groups that I participate in have been discussing the concept of overfitting. Overfitting occurs when a model is too complex for a given sample size. I want to show a simple example of the negative consequences of overfitting.

In a previous page discussing segmented regression models, I used a data set showing firearm deaths in Australia per 100,000 people over a fifteen year span (1983 to 1997). The following graph shows a linear regression fit (order 1 polynomial) to the data.

Most of you would stop here and say that there is a roughly linear decline in firearm deaths in Australia over the fifteen year span. Or you might try a simple log transformation of the firearm deaths. If you had data on the exact number of deaths rather than the rate, you might consider a Poisson regression model or some variation of it.

But for the benefit of pedagogy, what would happen if you threw all caution to the wind and fit a ridiculously complex model, something like a seventh order polynomial? Here's what the fit would look like.

There are some weird things going on here. For example, the fitted value halfway between 1983 and 1984 is actually lower than the observed rates at either 1983 or 1984. This is a symptom of overfitting, the choice of a model that is far too complex for the amount of data that you have. If you continue to add more terms to the polynomial, the results look more and more bizarre. Take a look at the whole series of pictures from a linear polynomial to a fourteenth order polynomial at www.pmean.com/08/OverfittingExample.html.
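
If you want to experiment with this yourself, here is a rough sketch in Python. It uses simulated data (a straight-line decline plus random noise), not the Australian firearm data, and the helper functions fit_polynomial, predict, and rss are names I made up for illustration. The fit uses ordinary least squares through the normal equations, which is fine for a small demonstration but is not how production software would do it.

```python
import random

def fit_polynomial(xs, ys, degree):
    """Least-squares polynomial coefficients (lowest power first)."""
    n = degree + 1
    # Normal equations (X'X) b = X'y for the basis 1, x, x^2, ...
    a = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            for k in range(col, n):
                a[row][k] -= factor * a[col][k]
            b[row] -= factor * b[col]
    # Back substitution.
    coefs = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(a[row][k] * coefs[k] for k in range(row + 1, n))
        coefs[row] = (b[row] - tail) / a[row][row]
    return coefs

def predict(coefs, x):
    """Evaluate the fitted polynomial at x."""
    return sum(c * x ** i for i, c in enumerate(coefs))

def rss(coefs, xs, ys):
    """Residual sum of squares on the data used for fitting."""
    return sum((y - predict(coefs, x)) ** 2 for x, y in zip(xs, ys))

# Simulated rates (NOT the real firearm data): linear decline plus noise.
random.seed(2008)
years = list(range(1983, 1998))
xs = [(year - 1990) / 7 for year in years]  # rescale years to [-1, 1]
ys = [4.0 - 0.8 * x + random.gauss(0, 0.3) for x in xs]

linear = fit_polynomial(xs, ys, 1)
seventh = fit_polynomial(xs, ys, 7)
print("RSS, linear:", round(rss(linear, xs, ys), 3))
print("RSS, seventh order:", round(rss(seventh, xs, ys), 3))

# Fitted values halfway between the first two years.
mid = (xs[0] + xs[1]) / 2
print("Midpoint fit, linear:", round(predict(linear, mid), 3))
print("Midpoint fit, seventh order:", round(predict(seventh, mid), 3))
```

The seventh order fit will have a residual sum of squares no larger than the linear fit, because it contains the linear model as a special case. A smaller residual sum of squares on the data used for fitting is not evidence of a better model, and that is the trap of overfitting.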

4. Explaining CART models in simple terms

Someone asked about CART models and when you would use them instead of linear or logistic regression.

CART was developed to handle problems with overfitting. This is its primary advantage over stepwise regression. Generally, CART models are considered exploratory rather than confirmatory.

From the perspective of critical appraisal, classification and regression trees (CART) are not too much different from linear and logistic regression. They are an approach to make sense of data where there are multiple predictor variables. They work reasonably well if the data is good quality, but like any statistical procedure, the quality of the analysis is limited by the quality of the data coming in.

A classification tree is used when the outcome variable is categorical and a regression tree is used when the outcome variable is continuous. Both methods rely on a similar approach, known as recursive partitioning.

Generally, this approach is used when there are numerous predictor variables and the researcher desires a simple prediction involving a small number of these predictor variables.

Recursive partitioning divides each predictor variable into discrete groups. The groups are typically required to have a minimum sample size (usually 5), but otherwise are allowed to vary considerably. So if one of the predictor variables is the one-minute Apgar score, then the possible groups to be considered are

1 versus 2-10

1-2 versus 3-10

1-3 versus 4-10

. . .

1-9 versus 10

For a discrete variable with levels A, B, C and D and no particular order among the categories, the possible groups to be considered are

A versus BCD

AB versus CD

AC versus BD

AD versus BC

B versus ACD

. . .

A CART model examines all possible partitions among all possible variables and selects the partition that produces the best possible prediction of the outcome variable. Once that partition is selected, each of the two subgroups is examined versus all possible remaining partitions.

The result of a CART model is a tree diagram. I show some examples of these tree diagrams at www.pmean.com/08/ExplainingCart.html.
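
The lists of candidate groups described above can be generated with a few lines of Python. This is only a sketch of the enumeration step, with function names I made up. It does not enforce the minimum sample size, and it does not score the splits, which a real CART program would do next.

```python
from itertools import combinations

def ordered_splits(levels):
    """Two-group splits of an ordered variable: cut between adjacent levels."""
    levels = sorted(levels)
    return [(levels[:i], levels[i:]) for i in range(1, len(levels))]

def unordered_splits(categories):
    """Two-group splits of an unordered variable: every way to divide
    the categories into two non-empty groups."""
    categories = list(categories)
    first, rest = categories[0], categories[1:]
    splits = []
    # Keep the first category in the left group so each split appears once.
    for size in range(len(rest)):
        for extra in combinations(rest, size):
            left = [first, *extra]
            right = [c for c in rest if c not in extra]
            splits.append((left, right))
    return splits

# An ordered score running from 1 to 10, as in the example above.
for left, right in ordered_splits(range(1, 11)):
    print(left, "versus", right)

# An unordered variable with levels A, B, C and D.
for left, right in unordered_splits("ABCD"):
    print("".join(left), "versus", "".join(right))
```

An ordered variable with ten levels gives nine candidate splits, while an unordered variable with four levels gives seven. The count for unordered variables roughly doubles with each added category, which is one reason that variables with many categories can be troublesome in these models.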

5. Should you compare a two-sided p-value to 0.025?

Ray, one of my colleagues at Children's Mercy Hospital, asked a question a few weeks before I left that I couldn't answer in time, so let me put the answer here. He was looking at an article that computed a p-value for a two-sided test and wondered if he should compare the p-value to 0.025 instead of 0.05 because it was a two-sided test.

The answer is related to the famous O. Henry story, "The Gift of the Magi". In that story a married couple is in a quandary about what gifts to get each other for Christmas, as money is very tight. The wife sees a beautiful chain for the pocket watch that the husband treasures. The husband sees a beautiful comb for the wife's lovely long tresses of hair. Neither has enough spare cash to buy such a lavish gift, so they both make a major sacrifice because of their love for each other. The woman cuts off her hair and sells it to a wig maker. The husband sells his beloved watch. When they present each other with their gifts, they realize that their sacrifices to get their gift have made the other person's gift useless.

The p-value is already adjusted for the type of hypothesis used. It is always compared to the traditional alpha level of 0.05. If you also adjust the alpha level, you are cutting off your hair just before you are about to get that beautiful comb. Don't do it, Ray!

If you see a p-value in a published report, you don't have to think about whether it is computed from a one-sided hypothesis or from a two-sided hypothesis. Just compare it to 0.05 (or 0.01 if you are more conservative--0.10 if you are more liberal).

Now you still need to be careful when you are using computer software. Most computer software will assume, unless you tell it otherwise, that you have a two-sided hypothesis. Some software will allow you to specify a one-sided hypothesis, but others will make you do the math yourself. The formula to convert a p-value for a two-sided hypothesis to a p-value for a one-sided hypothesis is quite simple, but I won't show the details here.

You also have to be careful if the author of a report uses a one-sided hypothesis and you prefer a two-sided hypothesis. The adjustment here is easy as well, but I won't include the formulas. Write to me if you want the details.
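
For the curious, here is a minimal sketch of both conversions in Python. It assumes a test statistic with a symmetric distribution (a t-test or a z-test, for example), and the function names are my own. For tests without that symmetry, these shortcuts do not apply.

```python
def two_sided_to_one_sided(p_two_sided, effect_in_hypothesized_direction):
    """Convert a two-sided p-value to a one-sided p-value.

    Assumes a symmetric test statistic. If the observed effect points the
    way the one-sided hypothesis predicted, halve the p-value; otherwise
    the one-sided p-value is one minus that half.
    """
    if not 0.0 <= p_two_sided <= 1.0:
        raise ValueError("p-value must be between 0 and 1")
    half = p_two_sided / 2
    return half if effect_in_hypothesized_direction else 1 - half

def one_sided_to_two_sided(p_one_sided):
    """Convert a one-sided p-value to a two-sided p-value.

    Assumes a symmetric test statistic: double the smaller tail area.
    """
    if not 0.0 <= p_one_sided <= 1.0:
        raise ValueError("p-value must be between 0 and 1")
    return 2 * min(p_one_sided, 1 - p_one_sided)

print(two_sided_to_one_sided(0.08, True))
print(two_sided_to_one_sided(0.08, False))
print(one_sided_to_two_sided(0.04))
```

Either way, the converted p-value is still compared to the same alpha level. The conversion changes the p-value to match the hypothesis, never the threshold.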

The p-value is tailored to the particular hypothesis so you never have to adjust the alpha level. If you forget this, you're a bald woman with a beautiful comb.

6. Monthly Mean Article: What is meant by intention to treat analysis? Survey of published randomised controlled trials.

"What is meant by intention to treat analysis? Survey of published randomised controlled trials." Hollis S, Campbell F. British Medical Journal 1999; 319(7211): 670-674. Full free text is available at www.bmj.com/cgi/content/full/319/7211/670. This is an old article, but the confusion about intention to treat remains today, so it is well worth reading. There is ambiguity in how authors describe intention to treat analysis. When there are missing outcomes for some of the patients, the trouble is compounded. Read this article before you state that you used "intention to treat" in your research publication to make sure you know what you're talking about.

7. Monthly Mean Blog: Realizations in Biostatistics by Random John.

I like this blog because the author says nice things about me. Flattery is a very powerful motivator, and if anyone who reads this is willing to add a comment to my testimonials page, I would be forever grateful. More seriously, I love the list of topics that this blog covers: "Biostatistics, clinical trial design, critical thinking about drugs and healthcare, skepticism, the scientific process". Here's a sampling of the blog entries that caught my eye. A November 13 entry notes an outrageous study linking autism rates with rainfall patterns, two entries (October 17, 2008 and another one also on October 17, 2008) talk about getting output from R into Microsoft Word, and another nice pair of articles explains the value of blinding and randomization (May 25, 2008 and May 14, 2008). This blog has lots of links to other very good blogs. URL: realizationsinbiostatistics.blogspot.com

8. Monthly Mean Book: Biostatistics, The Bare Essentials, Geoffrey R. Norman and David L. Streiner

If you want to start to get an understanding of the fundamentals of Biostatistics, this book is a great place to start. It has all the formulas that you need, but does not assume that you know or remember calculus. The coverage is quite good, and the authors stress many pragmatic issues.

This book uses a lot of fictional examples, with an effort to make the examples extremely bizarre. The chapter on analysis of variance, for example, starts out with this fictional scenario.

To further the goal of "Safe Sex for Sinners," you decide to investigate which is the most cost-effective condom. You are rapidly discouraged by the challenge, as a visit to the local pharmacy reveals an overwhelming array of choices. What you really want to do is select a few brands and determine if any difference overall exists among the group means, then try to find out what affects these differences.

The authors' philosophy on the use of fictional data is described in the preface.

Most Chapters begin with an example to set the stage. Usually the examples were dreamt up in our fertile imaginations and are, we hope, entertaining. Occasionally we reverted to real world data, simply because sometimes the real world is at least as bizarre as anything imagination could invent. Although many reviews of statistics books praise the users of real examples and castigate others, we are unapologetic in our decision for several reasons: (1) the book is aimed at all types of health professionals and we didn't want to waste your time and ours explaining the intricacies of podiatry for others; (2) the real world is a messy place, and it is difficult, or well nigh impossible, to locate real examples that illustrate the pedagogic points simply; and (3) we happen to believe, and can cite good psychological evidence to back it up, that memorable (read "bizarre") examples are potent allies in learning and remembering concepts.

I generally dislike the use of fictional examples, but it seems to work here. The key thing that these examples provide is a level of light-heartedness that makes the book easy to pick up and hard to put down.

There are several editions, and the most recent edition includes SPSS examples. I am most familiar with the first edition, published in 1998. This is probably the best book for a beginner wanting to learn more on their own about Statistics.

9. Monthly Mean Website: EDF 5481. Methods of Educational Research. Susan Carol Losh.

It seems like just about everybody is putting their course notes up on the web these days. This is one of the better ones I found because it is fairly detailed and it has a range of topics that are not covered as much as they should be on the web. The site offers good definitions for terms like conceptual hypothesis, operational variable, construct validity. There is also some excellent material on pilot studies and survey research. URL: edf5481-01.fa01.fsu.edu/Overview.html

10. Nick News: Nicholas the artist

Until just recently, Nicholas has not taken a great interest in art projects. There was a recent project, though, that inspired him. The students in his classroom were all asked to color a turkey as a special character. Nicholas chose a Spiderman turkey. We brought out his Spiderman t-shirt and figured which parts of the turkey should be red and which should be blue. The spiderweb lines on the red were the finishing touch.

There are a couple more pictures of recent artwork at www.pmean.com/personal/artist.html.

11. Very bad joke: How many IRB members does it take to screw in a light bulb?

I wrote this joke, and you can find it at my old website: www.childrens-mercy.org/stats/plan/irb.asp.

It has been reproduced (without permission and three out of four times without proper attribution, but that's okay) at

John Furedy's website on the North American Bioethics Industry,

the Society for Academic Freedom and Scholarship newsletter,

K.H. Grobman's website on Developmental Psychology, and

Maynard Clark's Veggie and Boston blog.

How many IRB members does it take to screw in a lightbulb?

As documented in 45 CFR 46.107(a), this review board must consist of five (5) or more members, and at least one of these members must possess a background in Electrical Engineering. In addition, at least one of the members must come from a home without any electricity. Any member of the IRB who owns stock in an electrical utility or who regularly pays bills to an electrical utility should recuse themselves from participation in the review of this research.

If the bulb should burn too brightly, burn too dimly, or flicker, then an adverse event report should be sent to the IRB (21 CFR 312.32). If the light bulb is dropped, then a serious adverse event report should be sent to the FDA by telephone or by facsimile transmission no later than seven (7) calendar days after the sponsor's initial receipt of the information.

If this is a multi-center light bulb trial, then a data and safety monitoring board (DSMB) may be needed (NIH Policy for Data and Safety Monitoring, June 10, 1998, grants.nih.gov/grants/guide/notice-files/not98-084.html, accessed on October 9, 2002). The DSMB should review any adverse event reports and interim results. If the clinical equipoise of the light bulb is lost, then the DSMB should terminate the study and provide all previously recruited light bulbs with the best available light bulb socket.

In order to maintain scientific integrity, the use of a placebo socket may be necessary. The placebo socket should have the same taste, appearance, and smell of a regular socket and the fact that this socket has no electricity should be hidden from the light bulb and from the person screwing in the light bulb. According to the 2000 revision of the Declaration of Helsinki, paragraph 29, the use of placebo sockets is acceptable where no proven prophylactic, diagnostic, or therapeutic socket exists.

A systematic review of all previous research into light bulbs must be presented so that the IRB can determine, per 45 CFR 46.111(a)(2), that the risks to the light bulb are reasonable in relation to anticipated benefits. The IRB should also ensure that the selection of light bulbs is equitable (45 CFR 46.111(a)(3)). If the light bulb has less than 18 watts of power, then additional requirements (45 CFR 46.401 through 409) apply.

The IRB must ensure that an informed consent document be prepared in language that the light bulb understands (45 CFR 46.116). This document should explain the expected duration of the light bulb's participation in the research, any reasonably foreseeable risks, and the extent to which the confidentiality of the light bulb will be maintained. This document should also emphasize that participation is voluntary and the light bulb can withdraw itself from the socket at any time without any penalty or loss of benefits.

12. Tell me what you think.

How did you like this newsletter? I have three short open-ended questions that I'd like to ask. It's totally optional on your part. Your responses will be kept anonymous, and will only be used to help improve future versions of this newsletter. I got some very positive feedback from three people last month. Two people cited the article on ANOVA versus regression as being the most helpful. The one suggestion for improvement was to drop the section on Wikipedia. I agree and in the next newsletter I might replace it with a "current events" entry that highlights a recent newspaper article touching on issues in Statistics. Some suggested topics for future newsletters are principles for good graphical displays (such as the guidance from Edward Tufte) and important events in the history of Statistics.

This work is licensed under a Creative Commons Attribution 3.0 United States License. This page was written by Steve Simon and was last modified on 2017-06-15.