[[Image available at http://www.pmean.com/news/images/MonthlyMean.png]]

The Monthly Mean newsletter, February 2013. Released 2013-03-07.

--> Introduction
--> Why do I have to work so hard to find all those articles?
--> Stop and think before you start entering your data
--> Monthly Mean Article (peer reviewed): Bias in reporting of end points of efficacy and toxicity in randomized, clinical trials for women with breast cancer
--> Monthly Mean Book. Stef van Buuren. Flexible Imputation of Missing Data.
--> Monthly Mean Trivia Question: Most movie sequels include ...
--> Nick News: Still more fun in the snow
--> Very bad joke: Some clinical trials are impossible to blind...
--> Tell me what you think.
--> Join me on Facebook, LinkedIn and Twitter
--> Permission to re-use any of the material in this newsletter

--> Introduction. Welcome to the Monthly Mean newsletter for February 2013. The Monthly Mean is a newsletter with articles about Statistics with occasional forays into research ethics and evidence based medicine. If you are having trouble reading this newsletter in your email system, please go to the web version (www.pmean.com/news/201302.html). If you are not yet subscribed to this newsletter, you can sign on at the newsletter page (www.pmean.com/news). If you no longer wish to receive this newsletter, there is a link to unsubscribe at the bottom of this email.

--> Why do I have to work so hard to find all those articles? There was a recent discussion on the Evidence-Based Health list about how hard you need to work to find ALL the articles associated with a specific systematic review. Would it be okay to live with a rapid review that didn't get all of the available literature but which got the answer more quickly and which allowed more reviews to be done? That's a difficult question to answer and it relates to the commonly quoted dictum, "The good is the enemy of the great" or "The perfect is the enemy of the good." It would be great if I could find the perfect words for this quote, but what I have here is good enough.

I won't try to answer the question of whether the extra searching effort is justified, but it wouldn't hurt to review what the problems are with a less than perfect search.

Publication bias. Publication bias is the tendency for negative results to appear in the peer-reviewed literature less frequently than positive results. It was originally attributed to certain restrictive editorial policies, but now it appears that it is more likely to occur because the authors themselves report negative results less frequently. It certainly seems consistent with human behavior. If you send out two articles and they both come back with scathing comments from the peer reviewers, you are more likely to shelve the article that has a negative result and more likely to persist and resubmit (or submit to a different journal) for the article that has a positive result. Failure to publish is a serious ethical breach and in some settings could be considered fraud. Publication bias can also occur if both positive and negative studies get published, but the negative studies take longer to appear in the peer-reviewed literature. There is a lot of empirical evidence that publication bias is real. For an example, see Fujian Song, Sheetal Parekh-Bhurke, Lee Hooper, et al. Extent of publication bias in different categories of research cohorts: a meta-analysis of empirical studies. BMC Medical Research Methodology. 2009;9(1):79. Full text available.

Duplicate publication bias. There is a surprising factor in systematic overviews that represents the flip side of the coin of publication bias. Some articles based on the exact same data set get published in two different journals. This is done covertly, so that the reader of either of the two articles would have no direct indication that the other article existed. Duplicate publication is a serious ethical problem because it wastes the limited journal space available for publication and squanders the time of volunteer editors and peer reviewers. Translation of articles would be an obvious exception as would a layperson's summary that appears in the popular press. There is empirical data to show that duplicate publications are more likely to be positive. If the data from these duplicate studies are double counted, this could seriously bias the results of the systematic overview. For more information, see Martin R Tramer, D John M Reynolds, R Andrew Moore, Henry J McQuay. Impact of covert duplicate publication on meta-analysis: a case study. BMJ. 1997;315(7109):635-640. Abstract available.
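To see how double counting can shift a pooled result, here is a minimal sketch in Python. The three effect sizes and standard errors are made up purely for illustration, and the simple inverse-variance (fixed-effect) average is my assumption, not the method used in the Tramer article.

```python
def pooled_estimate(studies):
    """Inverse-variance (fixed-effect) pooled estimate from (effect, standard_error) pairs."""
    weights = [1 / se ** 2 for _, se in studies]
    weighted_sum = sum(w * effect for w, (effect, _) in zip(weights, studies))
    return weighted_sum / sum(weights)

# Hypothetical studies: two near-null results and one clearly positive result,
# all with the same standard error.
studies = [(0.10, 0.2), (0.00, 0.2), (0.50, 0.2)]

honest = pooled_estimate(studies)
# The positive study was covertly published twice and gets counted twice.
double_counted = pooled_estimate(studies + [(0.50, 0.2)])

print(round(honest, 3), round(double_counted, 3))
```

In this toy example the pooled estimate moves from about 0.2 to about 0.275 just because one positive study appears twice.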

FUTON bias. FUTON stands for Full Text on the Net. FUTON bias occurs if you restrict your attention to those articles that are freely available on the Internet. This is certainly tempting, because it costs time and/or money to access articles that do not have the full text available on the Internet. There has been some speculation on whether FUTON bias is akin to publication bias, but as far as I know, there is no empirical data to support this. For more information, see Reinhard Wentz. Visibility of research: FUTON bias. The Lancet. 2002;360(9341):1256. Full text available.

NAA bias. NAA stands for No Abstract Available. This is closely related to FUTON bias. The term "No Abstract Available" appears in your PubMed search for certain items, such as letters to the editor, that do not have an abstract. If you skip over the items that don't have an abstract and review only those items that have an abstract in PubMed, you save time. The abstract allows you to quickly decide whether an article obviously does not belong without having to retrieve and review the full item. Again, there is speculation that NAA bias is akin to publication bias, but no empirical data to support this. The Wentz article cited above is a nice resource for NAA bias as well.

Language bias. Most of the medical research is published in English, but there is still a large amount of publication in other languages as well. Language bias is the tendency for negative articles to appear more often in publications written in languages other than English. It is an understandable bias. If you were an author and had two possible studies to publish, which one would you be most likely to send to an English language journal rather than a journal in your native language? It seems likely that you would use an English language publication for the study that is positive. It would take more work, but you would be more likely to justify that extra work for a positive study. The negative study would be just fine, you think, for the native language publication. There is empirical evidence to support language bias. For more details, see M Egger, T Zellweger-Zahner, M Schneider, et al. Language bias in randomised controlled trials published in English and German. Lancet. 1997;350(9074):326-329. [Accessed May 19, 2010]. PubMed citation available.

Medline bias. Medline is a database of "journal citations and abstracts for biomedical literature from around the world" produced by the National Library of Medicine. It can be accessed using PubMed, a freely available computerized tool for searching for information in Medline. While Medline includes thousands of peer-reviewed journals, it does not include all journals. There is some speculation that failure to include articles found in other databases but not in Medline would produce a biased result, but as far as I know, there is no empirical evidence to support this. There is a lot of commentary on the web about how Medline excludes nutritional studies, for example, but these reports are subjective and anecdotal. The Cochrane Handbook (section recommends against relying on Medline alone in a systematic review, but again it does not offer any empirical support for the belief that relying only on Medline would produce bias.

Google Scholar bias. Google Scholar is a search engine available to the general public that searches across the indexed text of numerous research articles. It is often described as a useful complement to PubMed and to various commercial research databases, but there are questions about whether reliance on Google Scholar, by itself, would lead to bias. One article argues in favor of Google Scholar (Jean-Francois G, Laetitia R, Stefan D. Is the coverage of Google Scholar enough to be used alone for systematic reviews. BMC Med Inform Decis Mak. 2013 Jan 9;13:7. Full text available), but it uses a rather questionable methodology. There are some empirical comparisons of search by PubMed and Google Scholar, but these are not written from the perspective of systematic overviews.

Those are the big issues that you should think about when debating how comprehensive you need to make your search. Did I miss anything?

Did you like this article? Visit my Publication Bias category page for related links and pages.

--> Stop and think before you start entering your data. Your research study is over and you have all this data on paper. It might be a lab notebook, or it might be a pile of filled out questionnaires. The only thing you have to do now is to start entering your data into the computer. It seems easy enough. But if you make a mistake here, you may end up paying dearly for it when you start your data analysis. Don't plunge immediately into data entry. Stop and think a bit first about how you are going to do this. Here are some general guidelines that you should consider when you are thinking about data entry.

Step 1. Arrange your data in a rectangular grid. A rectangular grid is a table where every cell (every intersection of a row and a column) contains one and only one number. Sometimes you need to spend some time re-structuring your data before it can fit into a rectangular grid.

[[Image available at http://www.pmean.com/news/images/201302/2a.png]]

The data shown above has a ragged format, with four rows of data for the women who were not breast feeding at six months, another three rows of data for the women who were breast feeding at six months, and another two rows of data for women who were lost to follow-up. While you could enter this data as is, it would create difficulty with almost the first data analysis. Notice, for example, that the mother's age appears in three different columns, so it would be tricky to compute an average age across all subjects.

You should re-arrange the data so that each row of your data corresponds to a different mother. When you do this, you need to supplement your data set with a column reminding you which group the mother belonged to.

[[Image available at http://www.pmean.com/news/images/201302/2b.png]]

The table above shows the data re-arranged into a rectangular format. There are still some "holes" in the data that you need to think about (see below).
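If your ragged data is already in electronic form, the re-arrangement can be done with a few lines of code. Here is a minimal sketch in Python. The values are invented (they are not the values in the tables above), and the column names are borrowed from the variable names used later in this article.

```python
# Hypothetical ragged layout: a separate block of records for each group.
# Each record is (mother's age, marital status, birth weight in kg); all values are made up.
ragged = {
    "No": [(26, "Single", 3.1), (31, "Married", 3.4)],
    "Yes": [(29, "Married", 3.6)],
    "Lost": [(22, "Single", 2.9)],
}

# Rectangular layout: one row per mother, plus a column recording her group.
rows = [
    {"br_feed": group, "mom_age": age, "mar_st": status, "birth_wt": weight}
    for group, records in ragged.items()
    for age, status, weight in records
]

# With a single mom_age column, the overall average age is a one-liner.
mean_age = sum(row["mom_age"] for row in rows) / len(rows)
print(len(rows), mean_age)
```

The key design choice is that the group membership moves out of the layout and into its own column, so nothing is lost when the blocks are stacked.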

There is a tendency among some researchers to try to squeeze two data values into one cell (a cell is the intersection of a row and column). A common example of this is listing systolic and diastolic pressure together. This would produce values like 120/80. Fitting two data values into a single cell is a bad idea for two reasons. First, it does not allow you to calculate a separate mean for systolic and diastolic pressure. Second, the slash in the cell will confuse some programs. They'll interpret the slash as division and replace 120/80 with 1.5.
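If you inherit a data set where blood pressure was already entered as a single text value, you can split it after the fact. This is a minimal sketch in Python with made-up readings; the function name is hypothetical.

```python
def split_blood_pressure(cell):
    """Split a 'systolic/diastolic' string such as '120/80' into two integers."""
    systolic, diastolic = cell.split("/")
    return int(systolic), int(diastolic)

readings = ["120/80", "135/90", "110/70"]  # made-up readings
pairs = [split_blood_pressure(reading) for reading in readings]

# Once the two values live in separate columns, separate means are easy.
mean_systolic = sum(s for s, _ in pairs) / len(pairs)
mean_diastolic = sum(d for _, d in pairs) / len(pairs)
print(round(mean_systolic, 1), mean_diastolic)
```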

Step 2. Create codes for missing values. Many data sets have missing values. You need to document the reason that the data are missing, which is a topic of its own. But just as importantly, you need to create a code for those values that are missing. Don't leave a cell empty to represent a missing value. First, it makes it impossible for you to tell whether a value is missing or you haven't gotten around to the data entry for this value just yet. Second, some systems will convert an empty cell to zero when importing the data.

Generally, you want to choose a code for missing values that is well outside the range of real possible values to ensure that it won't accidentally get confused with the rest of your data. An extreme value like 9, 99, or 999 often works well, and sometimes a -1 works well. It depends entirely on the variable you are considering.

[[Image available at http://www.pmean.com/news/images/201302/2c.png]]

In this data set, I converted missing ages to 99, since very few 99 year old mothers have a child that is still breast feeding. I also used a 9 for missing marital status, and a -1 for a missing birth weight. A birth weight of -1 would represent a baby that, after it was born, would float to the ceiling.

If there are several distinct reasons why a data value might be missing (e.g., missing because it was not applicable, missing because the subject did not answer, or missing because the handwriting was illegible), then you should use separate missing codes (e.g., 97, 98, and 99) to represent these reasons.
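A missing value code only helps if you remember to exclude it before you compute anything. Here is a minimal sketch in Python using the codes chosen above (99 for age, 9 for marital status, -1 for birth weight). The three rows of data and the function name are made up for illustration.

```python
# Missing value codes for each variable, as chosen in this article.
MISSING_CODES = {"mom_age": {99}, "mar_st": {9}, "birth_wt": {-1}}

def valid_values(rows, variable):
    """Return the values of one variable with its missing codes filtered out."""
    codes = MISSING_CODES[variable]
    return [row[variable] for row in rows if row[variable] not in codes]

# Made-up rows; the second mother's age and the third mother's
# marital status and birth weight are missing.
rows = [
    {"mom_age": 24, "mar_st": 1, "birth_wt": 3.2},
    {"mom_age": 99, "mar_st": 0, "birth_wt": 3.5},
    {"mom_age": 30, "mar_st": 9, "birth_wt": -1},
]

naive_mean = sum(row["mom_age"] for row in rows) / len(rows)  # treats 99 as a real age
ages = valid_values(rows, "mom_age")
clean_mean = sum(ages) / len(ages)                            # excludes the 99
print(naive_mean, clean_mean)
```

The naive mean of 51 years is a reminder of why the code should be far outside the plausible range: a result that absurd gets noticed.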

Step 3. Add variable names at the top of your data. Each column of data should start with a variable name, a brief name that describes what the data in a column represents. This is especially important if you have more than a couple dozen variables. The variable names will help you negotiate through the maze of analyses much more easily. Keep the variable name in the first row and don't add more to the variable name in a second or third row, please. Some software programs will treat anything beyond the first row as data.

You should come up with a descriptive variable name, but it also needs to be reasonably short. Try to keep the length of the variable name around 8-16 characters. A mixture of numbers and letters is okay, but avoid special symbols such as $, &, or %. Most statistical software will reserve these special symbols for other purposes. The one major exception is the underscore (_), which is usually found on the same key as the minus sign. Finally, don't rely on case alone to distinguish between two variable names (e.g., X and x).

It's a good idea to avoid embedded blanks. In most statistical software, an embedded blank will cause the software to presume that you are referring to two variables. SPSS, for example, gets confused when you ask for a histogram for mom age and will try instead to produce two histograms, one for mom and one for age. Here's where the underscore comes in handy. The variable mom_age is easy to read. Compare this to the alternative, momage, which looks like a nonsense word rhyming with homage. A period is also fine for most statistical software, so you might consider mom.age. A third choice is to use mixed capitalization (also known as Camel Case) to make your variable name more readable (MomAge).
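You can check a list of proposed variable names against these guidelines automatically. This is a minimal sketch in Python. The exact rule (start with a letter, then letters, digits, or underscores, at most 16 characters) is my own strict reading of the advice above, so it rejects periods even though most statistical software accepts them. The function name is hypothetical.

```python
import re

# Assumed rule: a letter, followed by up to 15 letters, digits, or underscores.
NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9_]{0,15}")

def check_names(names):
    """Return (name, problem) pairs for names that break the guidelines."""
    problems = []
    seen = {}  # lower-cased name -> first spelling encountered
    for name in names:
        if not NAME_PATTERN.fullmatch(name):
            problems.append((name, "invalid characters or length"))
        key = name.lower()
        if key in seen and seen[key] != name:
            problems.append((name, "differs from %r only by case" % seen[key]))
        seen.setdefault(key, name)
    return problems

for name, problem in check_names(["mom_age", "mom age", "birth_wt%", "X", "x"]):
    print(name, "->", problem)
```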

There's a story I tell here about why the underscore, period, or Camel Case is important. There was a group called Writer's Exchange that wanted to set up a website. They followed the web tradition of just running words together and asked for a domain name of writersexchange.com. They did not realize until after the website was up and running that many people were reading the name as three words rather than two.

Here's the data set with the lengthy multi-line descriptions replaced with a brief variable name.

[[Image available at http://www.pmean.com/news/images/201302/2d.png]]

Step 4. Assign number or letter codes to your categorical data. If you have categorical data, assign a code to each category level. Use the code during data entry to save time and minimize errors.

Here are some examples of number codes: Gender 1=Male, 2=Female, 9=Unknown; Race 1=White, 2=Black, 3=Asian, 4=Hispanic, 5=Native American, 8=Multiracial, 9=Unknown; Likert scale 1=Strongly Disagree, 2=Disagree, 3=Neutral, 4=Agree, 5=Strongly Agree, 9=No answer. While I prefer to use number codes, there are some advantages to using short letter codes. Here are some examples of letter codes: Gender M, F, and U (Male, Female, and Unknown); Race W, B, A, H, N, M, and U (White, Black, Asian-American, Hispanic, Native American, Mixed, and Unknown); Likert scale SD, D, N, A, SA, NA (Strongly Disagree, Disagree, Neutral, Agree, Strongly Agree, No Answer). Letter codes are easier to remember, and sometimes can be used effectively as plotting symbols.

I prefer number codes because they offer more flexibility during statistical analysis. For example, SPSS will not allow you to draw a scatterplot when one of your variables uses letter codes. Other software will alphabetize your letter codes, which may not be what you intended. For example, an alphabetized Likert scale would be printed in the following meaningless order: Agree, Disagree, Neutral, No Answer, Strongly Agree, Strongly Disagree.
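The ordering problem is easy to demonstrate. Here is a minimal sketch in Python that sorts the Likert labels alphabetically and then by the number codes listed above.

```python
# Number codes for the Likert scale, as listed earlier in this article.
LIKERT_LABELS = {
    1: "Strongly Disagree",
    2: "Disagree",
    3: "Neutral",
    4: "Agree",
    5: "Strongly Agree",
    9: "No answer",
}

# Sorting the labels themselves gives the meaningless alphabetical order.
alphabetical = sorted(LIKERT_LABELS.values())

# Sorting by number code keeps the natural order of the scale.
by_code = [LIKERT_LABELS[code] for code in sorted(LIKERT_LABELS)]

print(alphabetical)
print(by_code)
```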

For binary variables (categorical variables with only two possible values), I prefer to use the number codes 0 and 1. The 0 represents absence of a certain characteristic and 1 represents its presence. So my placebo group is always 0 and the active medication is always 1. The unexposed group is 0 and the exposed group is 1. I also use 0-1 coding for gender, using 0 for females and 1 for males. I remember it as presence or absence of the Y chromosome. There are some advantages for this system, but it is still a fairly arbitrary choice.

Let's assign number codes for the categorical variables in the breast feeding data example. For br_feed, let 0=No, 1=Yes, and 9=Lost. For mar_st, let 0=Single, 1=Married, and 9=Missing. With this change, the data will look like this:

[[Image available at http://www.pmean.com/news/images/201302/2e.png]]

Step 5. Add unique identification codes to each row of your data set. If practical, place a unique code in each row. A unique ID makes it much easier to track down errors during data entry. Unique codes are critical if you plan to combine your data with data from another source. Don't rely on the order of the data as it is entered as a way of telling which observations came from which records. Some data management and data analysis tasks will require you to reorder your data and then you lose the ability to check your records.

If you do use a unique code for each row, make sure it is not something that could compromise the privacy of the people who provided you with this data. Numbers from a medical record system are bad, and often patient initials are bad as well. Anything that might compromise privacy should be kept in a separate location. For example, you could have a random id attached to each row of the data set and then in a totally separate database, you could have information that links this random id to the medical record number.
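Here is a minimal sketch in Python of the random id approach. The record numbers are made up, the four-digit range is an arbitrary choice of mine, and the function name is hypothetical. Note that Python's random module is fine for shuffling ids but is not designed for security purposes.

```python
import random

def assign_random_ids(record_numbers, seed=None):
    """Map each medical record number to a unique random four-digit study id."""
    rng = random.Random(seed)
    # sample() draws without replacement, so no two records share an id.
    ids = rng.sample(range(1000, 10000), k=len(record_numbers))
    return dict(zip(record_numbers, ids))

# Made-up medical record numbers. The linking table is the only place where
# a record number and a study id appear together, so store it separately.
linking_table = assign_random_ids(["MRN-001", "MRN-002", "MRN-003"], seed=2013)

# Only the study ids go into the analysis data set.
analysis_ids = sorted(linking_table.values())
print(analysis_ids)
```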

Here's the database with unique identifiers.

[[Image available at http://www.pmean.com/news/images/201302/2f.png]]

Step 6. Write down all the relevant information about your data set. Lay out on a single sheet (or multiple sheets for a very large data set), all of the relevant information about your data set that someone ought to know. For each variable name, provide a more detailed explanation of what the variable represents including the units of measurement. List the missing value code(s) and document the reasons why the data are missing. If you used number or letter codes, document those here as well.


id: a randomly generated identification code for each subject.

br_feed: breast feeding status at six months. 0 = no breast feeding, 1 = full or partial breast feeding, 9 = missing (lost to follow-up).

mom_age: mother's age in years. 99 = missing (mother did not answer the question).

mar_st: marital status. 1 = married, 0 = single, divorced, or widowed, 9 = missing (mother did not answer the question).

birth_wt: birth weight of the child in kilograms. -1 = missing (not found in the medical record).
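Documentation like this can also be turned into an automated check on your data entry. Here is a minimal sketch in Python covering three of the variables above. The plausible range for birth weight is my own assumption, and the function name and the row of data are hypothetical.

```python
# Part of the codebook above, written as a dictionary.
CODEBOOK = {
    "br_feed": {"allowed": {0, 1}, "missing": {9}},
    "mar_st": {"allowed": {0, 1}, "missing": {9}},
    # The plausible range in kilograms is an assumption, not part of the codebook.
    "birth_wt": {"range": (0.3, 7.0), "missing": {-1}},
}

def find_entry_errors(row):
    """Return the variables whose value is neither valid nor a documented missing code."""
    errors = []
    for name, rule in CODEBOOK.items():
        value = row[name]
        if value in rule["missing"]:
            continue  # a documented missing code is not an error
        if "allowed" in rule and value not in rule["allowed"]:
            errors.append(name)
        if "range" in rule and not rule["range"][0] <= value <= rule["range"][1]:
            errors.append(name)
    return errors

# A made-up row with an undefined marital status code and a misplaced decimal point.
print(find_entry_errors({"id": 4317, "br_feed": 1, "mar_st": 2, "birth_wt": 35.0}))
```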

I talk a bit about trying to squeeze two numbers into one cell in the January 2009 issue of The Monthly Mean and debate the pros and cons of entering your data in a spreadsheet versus a database in the March/April 2011 issue. You should also look at my general guide for data entry, written back in 1999.

Did you like this article? Visit my Data Management category page for related links and pages.

--> Monthly Mean Article: F. E. Vera-Badillo, R. Shapiro, A. Ocana, E. Amir, I. F. Tannock. Bias in reporting of end points of efficacy and toxicity in randomized, clinical trials for women with breast cancer. Ann Oncol. 2013. Background: Phase III randomized, clinical trials (RCTs) assess clinically important differences in end points that reflect benefit to patients. Here, we evaluate the quality of reporting of the primary end point (PE) and of toxicity in RCTs for breast cancer. Methods: PUBMED was searched from 1995 to 2011 to identify RCTs for breast cancer. Bias in the reporting of the PE and of toxicity was assessed using pre-designed algorithms. Associations of bias with the Journal Impact Factor (JIF), changes in the PE compared with information in ClinicalTrials.gov and funding source were evaluated. Results: Of 164 included trials, 33% showed bias in reporting of the PE and 67% in the reporting of toxicity. The PE was more likely to be reported in the concluding statement of the abstract when significant differences favoring the experimental arm were shown; 59% of 92 trials with a negative PE used secondary end points to suggest benefit of experimental therapy. Only 32% of articles indicated the frequency of grade 3 and 4 toxicities in the abstract. A positive PE was associated with under-reporting of toxicity. Conclusion: Bias in reporting of outcome is common for studies with negative PEs. Reporting of toxicity is poor, especially for studies with positive PEs. [Accessed on January 17, 2013]. Full text available.

--> Monthly Mean Book. Stef van Buuren. Flexible Imputation of Missing Data. Chapman & Hall. ISBN-13: 978-1439868249. Here's the blurb from the back cover. "Missing data form a problem in every scientific discipline, yet the techniques required to handle them are complicated and often lacking. One of the great ideas in statistical science—multiple imputation—fills gaps in the data with plausible values, the uncertainty of which is coded in the data itself. It also solves other problems, many of which are missing data problems in disguise. Flexible Imputation of Missing Data is supported by many examples using real data taken from the author's vast experience of collaborative research, and presents a practical guide for handling missing data under the framework of multiple imputation. Furthermore, detailed guidance of implementation in R using the author's package MICE is included throughout the book. Assuming familiarity with basic statistical concepts and multivariate methods, Flexible Imputation of Missing Data is intended for two audiences: (Bio)statisticians, epidemiologists, and methodologists in the social and health sciences, and substantive researchers who do not call themselves statisticians, but who possess the necessary skills to understand the principles and to follow the recipes. This graduate-tested book avoids mathematical and technical details as much as possible: formulas are accompanied by a verbal statement that explains the formula in layperson terms. Readers less concerned with the theoretical underpinnings will be able to pick up the general idea, and technical material is available for those who desire deeper understanding. The analyses can be replicated in R using a dedicated package developed by the author."

--> Monthly Mean Trivia Question. Most movie sequels include the number "2" in their title (Godfather, Part 2) and sequels of sequels usually have the number "3" in their title (Toy Story 3). Can you name a sequel that has a number other than "2" in its title and a sequel of that sequel that has a number other than "3" in its title? Roman numerals and other variations on "2" and "3" don't count. I can think of two answers, actually, but I am sure there are more. The first person to answer correctly and the first person to provide a different unique answer will get their name mentioned in the next newsletter.

Last month's trivia question was: What Beatles song has the profound mathematical insight "One and one and one is three"? The answer is "Come Together." Several people came up with the answer quickly, but Simon Blomberg provided the first answer, arriving only 5 minutes after the newsletter was sent out. Congratulations!

--> Nick News: Still more fun in the snow. We got blanketed with two big blizzards in February. Both storms brought around a foot of snow. A snow brick mold from Amazon arrived just a few weeks before these storms. Nicholas put it to very good use.

[[Image available at http://www.pmean.com/news/images/201302/nick01.jpg]]

Here is one of his forts with six layers of snow bricks. Nicholas is hugging his dog Cinnamon, who loves the snow almost as much as Nicholas does. These forts look impressive, but they melt pretty quickly.

--> Very bad joke: Some clinical trials are impossible to blind and you shouldn't even try. If one of your treatment arms is a bilateral orchiectomy, sooner or later the patient is going to notice that something is missing. This is an original joke of mine that I actually published in a peer reviewed article.

--> Tell me what you think. How did you like this newsletter? Give me some feedback by responding to this email. Unlike most newsletters where your reply goes to the bottomless bit bucket, a reply to this newsletter goes back to my main email account. Comment on anything you like, but I am especially interested in answers to the following three questions.
--> What was the most important thing that you learned in this newsletter?
--> What was the one thing that you found confusing or difficult to follow?
--> What other topics would you like to see covered in a future newsletter?

If you send a comment, I'll mention your name and summarize what you said in the next newsletter. It's a small thank you and acknowledgement to those who take the time to help me improve my newsletter. If you send feedback and you want to remain anonymous, please let me know.

I received feedback from several people. One anonymous reader liked the description of collinearity, but was confused by my comment about several of the variables adding up to close to a constant being a type of collinearity. Let me try to elaborate a bit. Suppose you have four variables that add up to exactly 100. That represents a pure redundancy in the data: if you know three of the variables, you can compute the fourth value with perfect precision. Note that these variables do not need to be all that strongly correlated for this to happen. What is perfectly correlated is the sum of the first three variables and the fourth variable. Now let's consider a case where the four variables don't add up to exactly 100. Sometimes they add up to 100.1 or 100.2 and sometimes they only add up to 99.8. This creates a near redundancy in the data: if you know three of the variables, you can estimate the fourth variable to within +/-0.2. This is a type of collinearity that is important, even though it is not reflected in a high correlation between any pair of variables. You won't see this type of collinearity just by looking at the correlation matrix.
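Here is a minimal sketch in Python of the exact-sum case. The data are made up: every low/high combination of three variables, with a fourth variable defined so that each row adds up to exactly 100. The pearson function is a plain textbook correlation written out so the example needs no extra libraries.

```python
from itertools import product
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Made-up data: eight rows, each adding up to exactly 100.
rows = [
    (x1, x2, x3, 100 - x1 - x2 - x3)
    for x1, x2, x3 in product((15, 25), (20, 30), (25, 35))
]
cols = list(zip(*rows))

# All six pairwise correlations, as you would see them in a correlation matrix.
pairwise = [pearson(cols[i], cols[j]) for i in range(4) for j in range(i + 1, 4)]
largest_pairwise = max(abs(r) for r in pairwise)

# Correlation between the sum of the first three variables and the fourth.
sum_of_three = [a + b + c for a, b, c, _ in rows]
r_sum = pearson(sum_of_three, cols[3])

print(round(largest_pairwise, 3), round(r_sum, 3))
```

In this toy data set, no pair of variables has a correlation stronger than about 0.58 in absolute value, yet the sum of the first three variables has a correlation of -1 with the fourth.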

Martha Behnke also liked the article on collinearity and shared on her Facebook page my wry comment in that article ("Coming up with four different technical terms for the same condition is one way that we statisticians keep our discipline mysterious and awe inspiring."). She mentioned how important outcomes analysis is with the people she works with. It would be well worth writing an article about, if I get a chance.

Another anonymous reader also liked the article on collinearity and shared a link to an article that discusses the interplay of birthweight and gestational age on mortality risk. He suggested an article on heterogeneity in meta-analysis.

Two other readers shared brief compliments on my newsletter. Thanks!

--> Join me on Facebook, LinkedIn, and Twitter. I'm just getting started with social media. My Facebook page is www.facebook.com/pmean, my page on LinkedIn is www.linkedin.com/in/pmean, and my Twitter feed name is @profmean. If you'd like to be a Facebook friend, LinkedIn connection (my email is mail (at) pmean (dot) com), or tweet follower, I'd love to add you. If you have suggestions on how I could use these social media better, please let me know.

--> Permission to re-use any of the material in this newsletter. This newsletter is published under the Creative Commons Attribution 3.0 United States License, http://creativecommons.org/licenses/by/3.0/us/. You are free to re-use any of this material, as long as you acknowledge the original source. A link to or a mention of my main website, www.pmean.com, is sufficient attribution. If your re-use of my material is at a publicly accessible webpage, it would be nice to hear about that link, but this is optional.

Creative Commons License This work is licensed under a Creative Commons Attribution 3.0 United States License. This page was written by Steve Simon and was last modified on 2010-12-31.