Is the randomized trial the gold standard for research? (2004-09-23)


I'm giving a talk next month at the Midwest Society for Pediatric Research, and here's a brief outline of what I will be saying.

Before I start this talk, I have to brag. I'm a new father. My wife and I just got back from Russia with a little boy, Nicholas Zhenya Simon, and he is the cutest little boy who ever lived. Normally, you would think that this is a rather biased statement, but since I am a statistician, I can vouch for the fact that I have removed all bias before making that statement.

I tell people that since I have a Ph.D. and my wife has an M.D., that means that she's the only one qualified to change diapers.

You might wonder what I have in store for you today. The old joke is that if I had an hour to live, I'd spend it in a Statistics class. It would last a lot longer that way. Don't worry about this talk this morning. I have not included any formulas and there won't be a pop quiz at the end.

What I want to talk about is a question that I've thought a lot about over the past few years.

What does it take to convince doctors to change their clinical practices?

Some doctors will say "I've done it this way for the past thirty years and I'm not going to change now" and others will say "It was published in JAMA so it must be true". Hopefully you find yourselves somewhere between these two extremes. You don't change your practice every time a new article is published, but you do change when sufficient evidence accumulates.

This is a question at the heart and soul of evidence-based medicine. When is the evidence in a journal article sufficiently compelling to cause you to change how you practice medicine?

This morning I want to look at one of the standards of evidence used in evidence-based medicine, the randomized trial. Many experts call the randomized trial the gold standard of research and place it at or near the top of the hierarchy of evidence. James Penston has written a book staking out the opposite viewpoint.

Every day, millions of patients throughout the world take treatment which is based on the results of large-scale randomised trials. But, how much do we really know about these studies? This book exposes the serious flaws in this method of medical research. Although making vast profits for the pharmaceutical industry, large-scale randomized trials do little to improve the lives of patients and are responsible for an enormous waste of scarce health care resources. From the back cover of Fiction and Fantasy in Medical Research. The Large-Scale Randomised Trial by James Penston (2003, The London Press, London England. ISBN: 0-9544636-1-7).

The truth is actually somewhere in between. Randomized trials have some weaknesses. In particular, researchers try to make them so squeaky clean that they no longer reflect what goes on in the real world.

I want to start by clarifying that I use the terms "randomized controlled trials" and "randomized clinical trials" interchangeably. There's probably a lawyer somewhere who knows the difference between these two terms and is standing ready to sue me when I use the wrong term. To keep things simple for this talk, I will leave the "C" out and just call them randomized trials.

I want to start by offering a brief definition of evidence-based medicine and contrasting it with anecdotal medicine. Then I want to talk about the sorts of people that volunteer for research studies. I also want to examine the claim that the randomized trial is reductionist. Finally, I want to draw a distinction between a cynical view of research and a skeptical view.

What is EBM? David Sackett provides a nice succinct definition of EBM.

Evidence-based medicine is the conscientious, explicit, and judicious use of current best evidence in making decisions about the care of individual patients. bmj.bmjjournals.com/cgi/content/full/312/7023/71

It's important to remember to include the values of the individual patient in any EBM decision. Suppose you are considering a treatment that, as a side effect, impairs the production of sperm and reduces your fertility. Some men would not even consider such a treatment. They have a strong desire to father their own children now or in the future, and they would sacrifice their own health in order to maintain their ability to produce children. Other people would be totally indifferent to this side effect. A man with a vasectomy does not worry too much about drugs that alter his sperm production. Others might actually perceive reduced fertility as a benefit rather than a side effect.

Why we can't rely on anecdotes. Human memory is fragile and selective. Thomas Gilovich has written extensively about this in a book, How We Know What Isn't So. The Fallibility of Reason in Everyday Life. He cites, as an example, the commonly held belief that infertile couples who adopt a child are likely to conceive a natural child shortly afterwards. This is taken, perhaps, as a sign that it was just stress that was causing the infertility.

This is a myth, and is a rather insensitive thing to say to a couple with fertility problems who is considering adoption. But it also is a story that resonates with a lot of people.

If you wanted to study this problem scientifically, you would need to count four groups of people.

women who adopt and then get pregnant,
women who adopt and don't get pregnant,
women who don't adopt and get pregnant, and
women who don't adopt and don't get pregnant.

It's the first group of women who make for the memorable anecdotes. They stand out in our minds far more strongly.
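
To make the four-group accounting concrete, here is a minimal Python sketch. The counts are invented purely for illustration (they do not come from Gilovich or from any study), and the function name pregnancy_rates is hypothetical. It compares the pregnancy rate among couples who adopt with the rate among couples who don't, which is the one comparison an anecdote can never give you.

```python
def pregnancy_rates(adopt_preg, adopt_no_preg, no_adopt_preg, no_adopt_no_preg):
    """Return the pregnancy rate in the adopting and non-adopting groups."""
    rate_adopt = adopt_preg / (adopt_preg + adopt_no_preg)
    rate_no_adopt = no_adopt_preg / (no_adopt_preg + no_adopt_no_preg)
    return rate_adopt, rate_no_adopt


# Hypothetical counts, made up for illustration only.
rate_adopt, rate_no_adopt = pregnancy_rates(
    adopt_preg=8, adopt_no_preg=92, no_adopt_preg=40, no_adopt_no_preg=460
)
print(f"adopted:       {rate_adopt:.0%} later conceived")
print(f"did not adopt: {rate_no_adopt:.0%} later conceived")
```

With these made-up numbers, the eight memorable couples in the first group are no more common, proportionally, than the pregnancies among couples who never adopted. You only see that once all four cells are counted.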

There are stories about how the emergency room gets crazy on nights with a full moon. Careful studies have shown that the phases of the moon do not affect suicides, crime, psychiatric admissions, or anything else. But people who work on suicide hot lines, in law enforcement, and in other areas will often tell stories that make you believe that a full moon makes people go crazy.

I suspect that much of this is subconscious. Certain events resonate with you and others just slip out of your memory so easily. You remember the night that you dreamed about Aunt Bertha and the very next day a letter from her is sitting in your mailbox. But the nights that you don't dream about Aunt Bertha and the days that you don't get a letter from her just don't seem important enough to be worth tallying.

I recall a letter to the editor opposing seat belt laws. This was many years ago, so I can't cite the source. The author had been in a car crash and was thrown out of the car because he was not wearing a seat belt. He claimed that if he had been wearing a seat belt, he would have been trapped in the car and would have died.

I thought it was quite fascinating. Being in a car accident makes you an expert in auto safety. I would have thought that the people who avoided getting into accidents would be the ones who are more expert in safety issues. But, here again, you need to count four groups of people involved in car crashes:

drivers who are ejected from the car and live,
drivers who are ejected from the car and die,
drivers who remain in the car and live,
drivers who remain in the car and die.

Only two of these groups write letters to the editor. You shouldn't trust anyone from either of these two groups until you've had a careful accounting of the two non-letter-writing groups.

Prince Charles of England tells an enthusiastic anecdote about Gerson therapy.

I know of one patient who turned to Gerson Therapy having been told she was suffering from terminal cancer and would not survive another course of chemotherapy. Happily, seven years later, she is alive and well. So it is vital that, rather than dismissing such experiences, we should further investigate the beneficial nature of these treatments. observer.guardian.co.uk/uk_news/story/0,6903,1248282,00.html

The problem, of course, is that the people who were killed by relying on Gerson therapy aren't available to offer a counterpoint. In fairness to Gerson therapy, the people who were killed by relying on traditional therapy are also unavailable for comment.

A famous statistician, Abraham Wald, was asked during World War II to help out the Air Force. A lot of bombers were being shot down over Germany, and they wanted to reinforce the planes with armor. You can't put armor everywhere, because the plane would be too heavy to get off the ground. So where, Abraham Wald was asked, should you place the armor? They had records of where the planes returning from Germany had been shot at, because there were big gaping holes. So Dr. Wald started tallying things like the planes that returned with holes in the left wing and the planes that returned without holes in the left wing. It soon became apparent what to do, because the data represented only two of the four groups. There was no data on bombers that didn't return from Germany. So Dr. Wald noted the few areas of the bombers where holes were NEVER found. These were the areas that needed heavy armor, because any bomber hit in those areas must not have been able to make it back to England.
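
The logic of the bomber story can be sketched with a small simulation in Python. Everything in it is an assumption for illustration: the four zones, the loss probabilities, and the rule that each plane takes exactly one hit are invented, not taken from Wald's actual data. The point is only that the zone where a hit is most dangerous ends up with the fewest holes among the planes that come back.

```python
import random

ZONES = ["wings", "fuselage", "tail", "engine"]
# Hypothetical chance that a single hit in each zone brings the plane down.
LOSS_PROB = {"wings": 0.1, "fuselage": 0.1, "tail": 0.1, "engine": 0.8}


def simulate(n_planes=10000, seed=0):
    """Tally hits by zone for all planes and for the planes that return."""
    rng = random.Random(seed)
    all_hits = {zone: 0 for zone in ZONES}
    returned_hits = {zone: 0 for zone in ZONES}
    for _ in range(n_planes):
        zone = rng.choice(ZONES)  # each plane takes one hit, uniformly at random
        all_hits[zone] += 1
        if rng.random() >= LOSS_PROB[zone]:  # the plane survives the hit
            returned_hits[zone] += 1
    return all_hits, returned_hits


def share(counts, zone):
    """Fraction of the tallied hits that fall in the given zone."""
    return counts[zone] / sum(counts.values())


all_hits, returned_hits = simulate()
for zone in ZONES:
    print(f"{zone:9s} all planes: {share(all_hits, zone):.0%}   "
          f"returned planes: {share(returned_hits, zone):.0%}")
```

Under these assumptions every zone is hit about equally often, yet engine holes are rare among the returning planes, because most planes hit there never made it home to be counted.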

Whenever you hear an anecdote, remember that it represents only one of four groups. Some of these other groups might be easily ignored or overlooked.

What sort of people volunteer for a randomized trial? When you stop and think about it, the fact that anyone participates in a randomized trial is rather amazing. These volunteers give up a considerable amount of autonomy and allow the choice of therapy that they get to be determined, not by them, not by another doctor, but by the random flip of a coin.

There are lots of areas where people are unwilling to cede this authority. Patients, for example, want to have a say in the type of surgery they get. This makes it hard to recruit patients for a randomized surgery trial. An example is laparoscopic surgery. If you offer people a choice between the laparoscopic surgery, which leaves a small scar, and the traditional surgery, which leaves a large scar, most people want the small scar (Neugebauer 1991).

In another study, surgeons had difficulty recruiting patients into a trial comparing extracorporeal shock wave lithotripsy with open surgery for treating gallstones (Plaisier 1994). Not only did patients want the shock wave, which was non-invasive, but the open surgery control group also looked very unattractive compared to a laparoscopic method that became available outside the randomized trial. In that study, only 8% of the pool of patients entered into the randomized trial.

Another example I like to cite is a randomized study of birth control methods. Most people are pretty fussy about the types of birth control that they want to use. And boy, do they get upset when you tell them that one of the arms in the randomized trial is a placebo. Think about it. What sort of couple would volunteer for a placebo controlled trial of birth control? They're the sort of people that don't get upset if they get pregnant and they don't get upset if they don't get pregnant. Most couples I know are either trying very very hard to have babies or trying very very hard not to have babies. I'm sure there are some people who are indifferent, but they are few and far between, and they probably aren't much like you or me.

Recruiting controls is especially troublesome in a study that involves a painful procedure. A Swedish study documents volunteer bias in a study of personality (Gustavsson 1997). In this study, the researchers wanted to analyze cerebrospinal fluid in order to "examine the associations between personality traits and biochemical variables."

Now, how do you get cerebrospinal fluid? The technical term is lumbar puncture, but it's also called a spinal tap. A spinal tap is rather painful, I'm told, and it carries a small risk of some serious side effects. What sort of person would volunteer to submit to a spinal tap?

In this study, the subjects they recruited had already completed a full personality profile in a previous research study. Of the 87 subjects, 48 declined to participate. There was one personality trait that was quite different between the "volunteers" and the "refusers". Can you guess what it is?

It turns out that the volunteers had scores roughly a half standard deviation higher on impulsiveness. They did not differ on other personality traits such as socialization and detachment. The large difference in the impulsiveness measurement would obviously cloud any attempt to correlate personality traits and biochemical measurements in spinal fluids among those who volunteered.

Many drug companies pay good money for healthy volunteers to test new drugs. If the study involves extensive observation and/or invasive procedures, the amount of money offered can add up. Some volunteers will return repeatedly for different studies. No one gets rich this way, and the amount of money offered cannot be so large as to be coercive. But serving as a research volunteer can still help pay a few bills and supplement your income.

Do these professional volunteers differ from you and me? You might suspect that these volunteers are poorer and less likely to have a full time job. There are some subtle differences, though, that are even more important.

When genetic testing was done on a group of professional volunteers, there were almost no instances of a genetic variation that was associated with slow metabolism of certain drugs (Chen 1997). This slow metabolism would tend to be associated with a greater chance of side effects. This may not be too surprising. If you have a bad outcome with your first research study, you'll probably not come back for the next study. Unfortunately, this means that studies on professional volunteers could understate the likelihood and severity of side effects, as compared to the general population.

Exclusions. It's not just the fact that some people don't volunteer. Some people are excluded from the study before they even get the chance to volunteer. The most obvious example of this is the exclusion of elderly patients from research studies.

If you are elderly, pat yourself on the back. Your demographic group drives the healthcare economy. You are, by far, the largest consumers of new medications and new therapies. Yet, far too often, these new medications and new therapies are tested on patients much younger (Bayer 2000).

There's a simple reason for this exclusion. When researchers design their experiments, they want a nice clean sample.

Researchers want patients who are ill with one and only one disease. But with older people, several things will break down at the same time (Schellevis 1993).
Researchers don't want patients who are taking a lot of other medications. But older people take so many different drugs that they often qualify for bulk discounts at Walgreen's.
Finally, researchers want patients who are likely to stay alive for the duration of the research study. But older people are likely to die from conditions unrelated to the disease being studied.

Although the reasons for excluding elderly patients are understandable, they are still not justifiable. Research done on younger patients cannot be easily generalized to older patients.

Another exclusion that is very troublesome is the use of a single blind run-in period. Another famous statistician, Stephen Senn, explains and then harshly criticizes this design.

Many trials, however, are preceded by a "placebo run in," in which all patients are given placebo. The practice is common within the pharmaceutical industry and recommended by standard texts as a means of weeding out non-compliers before randomisation, eliminating placebo responders, ensuring that patients are stable, washing out previous treatment, or simply to provide a period for baseline measurement. This is incompatible with informed consent, since a doctor is hardly likely to say: 'Take this ineffective substance for the next month and record your symptoms daily in this diary.' -- (Senn 1997).

It might make sense from the researcher's perspective to exclude non-compliers, but it makes it awfully hard to generalize the results. Who among you has the luxury of telling your patients that you won't treat them until they pass a test that shows they are compliant?

An example of the exclusion of non-compliers is a study of allergy shots (Adkinson 1997). Patients were randomized to receive either allergy shots or a placebo shot, in addition to the normal standard of care. But patients who did not comply well with the normal standard of care were not allowed to be randomized. The study showed that allergy shots were not effective above and beyond the normal standard of care, but that ignores the fact that allergy shots probably represent the one type of treatment that would work well when you have compliance problems. You stick them in the arm while they're in the office and you know that they're getting the medication. Most other treatments are used in an unsupervised setting.

Reductionism. Perhaps the sharpest criticisms of randomized trials, however, come from proponents of complementary and alternative medicine (CAM). They have considered randomized trials to be "reductionist" because they fail to look at the whole patient and reduce that patient to a single dimension. A balanced perspective on this controversy appears in (Mason 2002). They point out that:

"..many practitioners argue that research methods dissect their practice in a reductionist manner and fail to take into account complementary medicine's holistic nature."

They argue that randomized trials have to be adapted to the special features of CAM. In particular, they point out that the tendencies of randomized trials and CAM are often in conflict. Randomized trials:

focus on a single disease,
require tightly standardized treatment regimens,
attempt to remove practitioner effects from the design,
focus on a single intervention,
focus on easily quantifiable outcomes, and
focus on short term changes.

In contrast, CAM

is used for more general problems and conditions,
tailors the treatment to individual patients,
relies on the relationship between the patient and the practitioner,
uses multiple interventions simultaneously,
tries to produce more subtle effects such as spiritual change or personal growth, and
aims for long term healing.

Note that these are tendencies. Some randomized trials focus on more than one disease, but the tendency is to focus on a single disease. Some types of CAM are standardized, but the tendency is to offer individualized therapies.

It's not just CAM that exhibits these conflicts, though. The Medical Research Council wrote a report in April 2000 that discusses the evaluation of complex interventions where it is difficult to isolate the individual components of the intervention. They mention several examples.

Does a physiotherapist contribute significantly to the management of knee injuries? This role goes beyond a simple sequence of exercises.

The package of care to treat a knee injury may be quite straightforward and easily definable - and therefore reproducible: "This series of exercise in this order with this frequency for this long, with the following changes at the following stages". However, the physiotherapist may have, in addition to the exercises, a psychotherapy role in rebuilding the patient's confidence, a training role teaching their spouse how to help with care or rehabilitation, and potentially significant influence via advice on the future health behaviour of the patient. Each of these elements may be an important contribution to the effectiveness of a physiotherapy intervention.

How does a stroke unit improve the quality of care for stroke patients? The concept of a stroke unit is difficult to standardize.

For example, although research suggests that stroke units work, what, exactly, is a stroke unit? What are the active ingredients that make it work? The physical set-up? The mix of care providers? The skills of the providers? The technologies available? The organisational arrangements?

How does cognitive behavioral therapy work? This approach is highly individualistic.

Does success depend on the personality of the therapist? The personality, health status, social status, or other characteristic of the patient? The content of the therapy? The way it is delivered? The frequency of contact? The location of contact? The duration and the timing? What other components count?

Rather than arguing that randomized trials need to be adapted to the special needs of CAM, perhaps randomized trials should be adapted to meet the special needs of many types of medical interventions.

Furthermore, the claim that a practice is holistic should not be used as an excuse to blithely disregard evidence from an overly simplistic randomized trial. Perhaps the randomized trial can get to the heart of the issue by focusing on a single key dimension to the problem. A fourth grade student evaluated Therapeutic Touch (TT) for a science fair project. This project was highlighted on a Public Broadcasting Service show "Scientific American Frontiers" and published in the April 1, 1998 issue of JAMA (Rosa 1998) and received a lot of press coverage (CNN has a very nice story).

Therapeutic Touch is a therapy to improve health through the manipulation of the human energy field. There apparently is no physical touching. The official website on therapeutic touch describes it as:

"...an intentionally directed process of energy exchange during which the practitioner uses the hands as a focus to facilitate the healing process. It is a contemporary interpretation of several ancient healing practices. Therapeutic Touch is a scientifically-based practice founded on the premise that the human body, mind, emotions and intuition form a complex, dynamic energy field. The human energy field is governed by pattern and order. In health, the field is balanced, however in disease, the energy is characterized by imbalance and disorder."

Emily Rosa's experiment was very simple, perhaps too simple. If practitioners of Therapeutic Touch are able to manipulate energy fields, they must first be able to detect energy fields. She would hold her hand above either the left or right hand of the practitioner and ask him/her to tell which hand. The choice of hand was randomly determined by a coin flip. A screen with two holes in it prevented the practitioner from seeing what was going on.

Emily Rosa got 21 experienced practitioners to agree to the test. They were right only 44% of the time. Did this simple experiment disprove the healing power of TT? Perhaps not. TT is a complex intervention and this experiment only looked at a single aspect of it.
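
For those who want to see why 44% is so damaging, here is a minimal Python sketch of an exact binomial calculation. The text above gives only the percentage, so the trial count used here (100 trials, 44 correct) is an assumption chosen for illustration, and the function name prob_at_least is hypothetical. It asks how often pure guessing, a fair coin flip on every trial, would do at least this well.

```python
from math import comb


def prob_at_least(n_correct, n_trials):
    """Exact P(X >= n_correct) when each of n_trials guesses is a fair coin flip."""
    favorable = sum(comb(n_trials, k) for k in range(n_correct, n_trials + 1))
    return favorable / 2 ** n_trials


# Assumed for illustration: 100 trials with 44 correct (44%).
p_observed = prob_at_least(44, 100)
# For comparison, a hypothetical 60% success rate over the same number of trials.
p_sixty = prob_at_least(60, 100)
print(f"P(at least 44 of 100 by guessing) = {p_observed:.3f}")
print(f"P(at least 60 of 100 by guessing) = {p_sixty:.3f}")
```

Under these assumptions, a guesser would match or beat 44% most of the time, so a result like this offers no sign of any detection ability. A practitioner who could really sense the field would be expected to land in territory that guessing rarely reaches, such as the 60% comparison.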

The experiment does shift the burden of proof, however. Detection of energy fields is a fundamental aspect of TT that all other aspects of this therapy rely on. How can practitioners of TT manipulate energy fields that they cannot even detect? Any further research should be discontinued until practitioners of TT can demonstrate the ability to detect energy fields in a rigorous blinded study.

Larry Sarner (Emily Rosa's step-father) makes much the same point in an article on the Quackwatch web site that responds to criticisms of the Rosa study. In particular, he responds to the criticism of reductionism:

[Critical comment #5] This was not a test of TT, but a parlor game. What the practitioners were required to do during the experiment invalidated its applicability to TT, especially since TT is a holistic process and can't be validly analyzed in parts. Emily's test was not of efficacy or technique (or "healing"), but of raw ability. It's very much like testing a surgeon to see if he can tell, without looking, in which hand the scalpel is being held. In any event, there was some movement. Emily presented her hand after each coin flip, which required relative movement between her hands and the subject's. Both subjects and Emily had at least small movements of their hands during the trials, and some practitioners even wiggled their fingers or hands. Previous descriptions of the sensations of feeling an HEF state that the field itself is constantly in motion, and the literature states that such motion can be easily felt. Significantly, all of Emily's subjects agreed to the protocol and none voiced any concern that the test setup would pose a problem in demonstrating their ability. The argument about TT being "holistic" is a thinly disguised attempt to get back to "outcome" (i.e., clinical) testing, where it is easier to obfuscate, ignore negative results, or explain away nonconforming data. There have been numerous clinical trials on outcomes using TT. The results are highly mixed. Some tests do not have statistically significant results, others revealed slight positive effects (though statistically significant), and several actually reported statistically significant effects, but negative (i.e., the control group did better than the TT group). Holistic practitioners' prejudice against what they call "reductionism" (analyzing things in parts) is not shared by others in scientific medicine.

There is, by the way, a huge financial incentive to demonstrate the ability to detect energy fields. The James Randi Educational Foundation offers a one million dollar prize to anyone who can show, under carefully controlled conditions, evidence of any paranormal, supernatural, or occult power or event. James Randi himself says that TT as well as several other alternative medicine therapies (Iridology, Reiki, Homeopathy and Applied Kinesiology) would qualify for the challenge.

Cynicism versus skepticism. There's a tendency to approach research from a checklist, and too literal a reading of the EBM literature can get you in trouble. It's a philosophy that says

Read the methods section. If it isn't randomized, if it isn't double blind and if they didn't use Intention to Treat analysis, disregard the results of the study.

Instead, think about the flaws in research as a series of caution flags. Slow down, look carefully. If you see enough caution flags, it is reasonable to ask for some corroborating evidence: independent replication of the results, description of a biological mechanism, and so forth.

If you leave this talk thinking that you just can't trust any research, not even randomized trials, then I've created another cynic. The world doesn't need more cynics. If you leave this talk thinking that randomized trials aren't perfect and that you shouldn't accept their results without looking carefully at the details, then I've created a skeptic. Being skeptical is a good thing. You don't change your clinical practice on a whim, but you will if enough evidence accumulates. There's a fine line between cynicism and skepticism, and I want you to stay well on the skeptical side.

Bibliography

Conventional versus laparoscopic cholecystectomy and the randomized controlled trial. Cholecystectomy Study Group. Neugebauer E, Troidl H, Spangenberger W, Dietrich A, Lefering R. Br J Surg 1991: 78(2); 150-4.

Unexpected difficulties in randomizing patients in a surgical trial: a prospective study comparing extracorporeal shock wave lithotripsy with open cholecystectomy. Plaisier PW, Berger MY, van der Hul RL, Nijs HG, den Toom R, Terpstra OT, Bruining HA. World Journal of Surgery 1994: 18(5); 769-72; discussion 773.