P.Mean: How sample size calculations are reported in the literature (created 2012-02-23).


I am preparing a webinar on sample size calculations and wanted to examine some examples in the published literature. There were lots of interesting examples in an open source journal called Trials. I only included a few examples in my webinar, but I wanted to save the examples I found here in case I want to expand the talk.

This trial aims to recruit 128 participants with mild AD and their carers, with 64 patients being allocated to each study group. A study of this size will have 80% power to detect between group differences on the ADAS-COG associated with moderate effect size (Cohen's d = 0.5) and alpha of 5%. http://www.trialsjournal.com/content/12/1/47
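As a rough check on figures like these, the usual normal-approximation formula n = 2(z_alpha + z_beta)^2 / d^2 can be sketched in Python. The function name `n_per_group` is my own, not the trial's, and the approximation ignores the small t-distribution correction, so it comes out one participant short of the 64 per group reported above.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(d, alpha=0.05, power=0.80, two_sided=True):
    """Normal-approximation sample size per group for comparing two means.

    d is the standardized difference (Cohen's d). Hypothetical helper,
    not taken from any of the quoted protocols.
    """
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha / 2) if two_sided else z(1 - alpha)
    z_beta = z(power)
    return 2 * (z_alpha + z_beta) ** 2 / d ** 2


raw = n_per_group(0.5)   # about 62.8 before rounding up
n_normal = ceil(raw)     # 63; a t-test based calculation typically gives 64
print(round(raw, 2), n_normal)
```

The gap between 63 and 64 is the kind of difference you see between a hand formula and dedicated software, which usually iterates on the t distribution.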

We calculated power on an expected transition rate of 35 percent over eighteen months with a 50 percent reduction of transitions in the CBT-group. The sample we need for a 2-tailed test of the proportions with an alpha of .05 and a power of .80 is 2 × 93 for the reduction of the transition to psychosis and 2 × 82 for the persistence of ARMS and 2 × 91 for the transition into psychosis. A conservative estimate of the drop-out rate is twenty percent per year in schizophrenia research [24]. With an estimated 30 percent drop-out over 18 months, we decided to include 240 persons in the trial. Interventions to minimize drop-outs are flexibility to location of therapy (the appointment can be at their home-address or some times by telephone or webcam), sending Christmas- and Birthday cards every year. For the participants that end up in the CBT-treatment group there is also the possibility for web-cam therapy. All the participants that complete the study will have a financial compensation for expenses made. http://www.trialsjournal.com/content/11/1/30

Because patient function (or dysfunction) is generally considered the more consequential of our two primary outcome measures [27] (the other being bothersomeness of back pain), sample size calculations are based on the Roland-Morris Disability Questionnaire (RDQ). Our sample size is designed to ensure that we have good power to detect a clinically significant difference of 2.5 points for pairwise comparisons (yoga to self care and yoga to exercise) on the RDQ (Aim 1 and 2) and adequate power to detect a difference of 1.7 points between the yoga and exercise groups that would be of interest when exploring mechanisms of action (Aim 3). We think that a smaller detectable difference between yoga and exercise can be justified when examining mechanisms of action because in the Aim 3 comparison of yoga and exercise, we focus on the additional benefits of yoga compared with exercise, anticipating that a portion of yoga's clinical effects would actually result from movement.

We have accommodated these dual power needs by proposing a 2:2:1 randomization ratio (yoga: exercise: self care), with a total of 210 participants. Assuming 10% loss to follow-up (which is slightly higher than in our previous study [19]), there would be outcomes data for 75:75:38 participants in the yoga, exercise, and self care groups, respectively. To protect against multiple comparisons, we will use Fisher's protected least significant difference approach, which has been shown to have desirable properties when there are three groups [51]. This approach makes pair-wise comparisons between the three treatment groups only if the overall F-test is significant. The power of this omnibus F-test depends on how the means from the three treatment groups differ. We therefore assumed that the yoga group would be 1.7 RDQ points superior to the exercise group, which would, in turn, be 0.8 RDQ points better than the self care group (giving a difference of 2.5 points between the yoga and self care groups). We chose a 1.7-point difference between yoga and exercise because it is slightly more conservative (i.e., smaller) than that found in our previous study [19].

Our estimates of the standard deviations of our primary outcome measures adjusted for pre-randomization baseline values were derived from analyses of covariance of 12-week follow-up data estimated from the 101 study participants in our 3-arm pilot study: RDQ standard deviation (SD) = 3.68 and bothersomeness SD = 2.38. With our proposed sample size, the omnibus F-test for the RDQ score will have 92% power for detecting a statistically significant difference among the three treatment means with the distribution assumed above. If this omnibus test is statistically significant, we will address Aims 1 and 2 of the study by comparing the appropriate pairs of means, as discussed below. To detect a pair-wise difference of 2.5 RDQ points, we will have 92% power for the yoga (or exercise) to self care comparison and 98% power for the yoga to exercise comparison. For Aim 3 we will have 92% power to detect a pairwise difference of 2.5 RDQ points between yoga (or exercise) and self care and 80% power to detect a pairwise difference of 1.7 points between yoga and exercise.

Our sample size will also provide adequate power to detect a clinically important difference of 1.5 on our 0 to 10 bothersomeness measure. For the omnibus F-test, we will have 89% power for detecting a significant difference of 1.5 points among the three groups (if we assume a difference of 1.1 points between the yoga and exercise groups and of 0.4 points between exercise and self care). For a difference of 1.5 points on the bothersomeness measure, we will have 88% power for the yoga (or exercise) to self care comparison and 97% power for the yoga to exercise comparison. For Aim 3, we will have 88% power to detect a pairwise difference of 1.5 bothersomeness points between yoga (or exercise) and self care and 80% power to detect a pairwise difference of 1.1 points between yoga and exercise.

At each time point, both primary outcomes (function and symptoms) will be tested at the 0.05 level because they address separate scientific questions. Analyses of both outcomes at all follow-up times will be reported, imposing a more stringent requirement than simply reporting a sole significant outcome. Arguments against adjusting for multiple comparisons in this situation have been made by Rothman [52,53] and others [54].

The power calculations are based on simple comparisons of the follow-up scores at a single point in time with adjustment for baseline values using analysis of covariance. We also plan to adjust for other baseline characteristics (e.g., age, gender, and baseline covariates found predictive of 10-week outcomes). Inclusion of such baseline covariates can improve precision of the variance estimate and therefore increase power. http://www.trialsjournal.com/content/11/1/36

The GAD treatment trial aims to recruit 120 participants (40 per group). Most CBT-based treatment trials of GAD report an effect of approximately .6 SDs relative to a placebo and 0.8 SDs relative to minimal contact. Pharmacological treatments of GAD report an effect of approximately 0.3 SDs compared to placebo. Based on the primary outcome measure of differences in anxiety scale scores on the GAD-7, this sample size will have 80% power to detect a moderate effect size. Assuming a correlation of .7 between pre- and post-test measures, the study will have 80% power to detect differences in change from baseline of approximately .3 SDs in a priori contrasts of trial conditions conducted within the framework of an omnibus test. http://www.trialsjournal.com/content/11/1/48

Based on Scott et al [25] we expect a medium effect size (.50) [26] of bCBT as compared to GPC. With an alpha of 0.05 (one-sided) we will need 51 patients in each experimental condition to achieve a power of .80 [27]. When taking into account a refusal rate of .20 [25,28-30] we will need 61 patients per treatment condition, which implies a total sample size of 122 patients. http://www.trialsjournal.com/content/11/1/96
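The one-sided alpha in this excerpt changes the critical value, and the refusal adjustment can be applied in two different ways. The sketch below is my own check using the normal approximation, not the authors' software; the variable names are made up. It suggests the reported 61 is consistent with multiplying 51 by 1.20, whereas dividing by 0.80 (so that 80% of those approached yields 51) would give a slightly larger number.

```python
from math import ceil
from statistics import NormalDist

z = NormalDist().inv_cdf
d, alpha, power = 0.5, 0.05, 0.80

# One-sided test: use z(1 - alpha) rather than z(1 - alpha / 2).
raw = 2 * (z(1 - alpha) + z(power)) ** 2 / d ** 2
n_normal = ceil(raw)          # normal approximation; the excerpt reports 51

n_reported = 51               # figure taken from the excerpt
refusal = 0.20

# Two common ways to allow for refusals or drop-outs (both are assumptions
# about what the authors did; the excerpt does not say which was used).
n_multiply = round(n_reported * (1 + refusal))   # inflate by 20%
n_divide = ceil(n_reported / (1 - refusal))      # so that 80% retained leaves 51

print(n_normal, n_multiply, n_divide)
```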

The primary endpoint analysis will be conducted with linear mixed models (LMM, [35]). As software for sample size calculation for the analysis of longitudinal data using multilevel mixed models is not available, we calculated the sample size for classical ANOVA using nQuery 4.0.

The power calculation is based on published results about CBT for persistent positive symptoms. For the comparison between CBT and TAU Tarrier et al. [36] reported Effect Sizes (ES: (meanTAU-meanCBT)/SDTAU) of 0.33-0.66 for the 18 month follow-up. Tarrier et al. (1998) found an ES of 0.48, Kuipers et al. [37] an ES of 0.6, and Sensky et al. [19] an ES of 0.5. In a review Gould et al. [38] found a range of ES from 0.2 to 1.26. The reported variance differs to a great extent indicating considerable differences with regard to samples or treatments. A recent effect size analysis applied broader inclusion criteria for studies and resulted in an ES of .57 for acute patients (post-treatment analysis) and an ES of 0.27 in chronic patients [3]. These reviews show considerable efficacy of CBT when compared to treatment as usual. However, this study focuses on the difference between CBT and Supportive Treatment (ST). Unfortunately, the power calculation is more difficult for this comparison as fewer studies are available. According to a review of Tarrier et al. [36] the following studies have included ST-control groups: Tarrier et al. [18], Haddock et al. [28], Pinto et al. [39], Lewis et al. [40], and Durham et al. [41]. The effect sizes vary between -.49 in a study including only 21 patients [28] and .99 in a study with 37 patients [39]. In addition, sample characteristics and endpoints are different between the studies. Thus, it does not seem possible to make assumptions about the ES for the comparison between CBT and ST based on the literature.

Regarding drop out rates there is also much heterogeneity with a range between 0% and 36% [36,38]. The majority of studies reports drop out rates of less than 20%. As measures of quality control will be applied and monetary incentives for participation in the follow-up examinations will be offered we expect a drop out rate of about 20%.

On this background we aim to identify an effect size of more than .35 as significant given an anticipated drop out rate of 20%. An ES of .35 would be obtained if the PANSS Scores (Positive Syndrome) at the post treatment assessment were 12 for CBT and 14.5 for ST with a standard deviation of 7.14. An ES of less than .35 would be of limited clinical relevance.

This results in n = 130 per group for a power of 80% and a two-sided significance level of 5% (sample size calculated by nQuery 4.0, Panel MGT0). The confirmatory statistical analysis will be based on the intention-to-treat principle. Patients with missing PANSS-scores at T9 (post-treatment) will be included with the last observation carried forward (LOCF). In case of missing PANSS-scores at T9 the treatment effect will presumably be underestimated by using LOCF. To compensate for this underestimation the sample size should be adapted for drop out. Thus, we plan to include 330 patients (165 per study condition).

As assumptions about the real effect size cannot be based upon the literature we calculated different scenarios: in case of a lower effect size and/or drop out of more than 20% the statistical power will be reduced. For example, a reduced ES of 0.2 would result in a power of only 36% for the drop out of 20%. An increased drop out of 30% would reduce the power to 74% for the minimum ES of .35. On the other hand, a more favourable ES of 0.45 would increase the power to 85% for the maximum drop-out rate of 20%.
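Scenario calculations like these can be roughed out with a normal-approximation power function. This is only a sketch under my own assumptions (a two-sided z-test with 130 analysable patients per group); `power_two_means` is a hypothetical helper. It reproduces the 80% and 36% figures reasonably well. The 30% drop-out and ES 0.45 scenarios depend on details the excerpt does not spell out, so I have not attempted them here.

```python
from math import sqrt
from statistics import NormalDist


def power_two_means(es, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-sample z-test for effect size es."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)
    shift = es * sqrt(n_per_group / 2)   # expected value of the test statistic
    # Probability of rejecting in either tail.
    return nd.cdf(shift - z_crit) + nd.cdf(-shift - z_crit)


p_planned = power_two_means(0.35, 130)   # close to 0.80
p_small = power_two_means(0.20, 130)     # close to 0.36
print(round(p_planned, 3), round(p_small, 3))
```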

With a sample size of 330 individuals (165 each therapy group), ten assessments per patient, one primary analysis variable (therapy) and one covariable (center), the power should also be sufficient for a Mixed Model [42].

Table 3 gives an overview over the number of patients required in the different stages of the trial and the required effort for treatment and assessment. In order to successfully include 330 patients 6 study centers have been included which were committed to participate actively in this trial. http://www.trialsjournal.com/content/11/1/123

Based on the data from the earlier non randomized study [9], it was assumed that a difference of at least 20% reduction on the PANSS total score of 65 (sd 10) i.e. 13 points to 52, would be highly clinically significant. The estimated pre-post correlation was 0.4 (reducing the effective sd to 9 after adjusting for baseline). The intra-class correlation within the three sites in the control arm was set at 0.05; in the intervention arm (which would include within-CHW as well as within-site effects) it was set at a higher level of 0.1; alpha was set as 0.05. Using the method of Roberts and Roberts [28] with the Stata routine cluspower will give us 98% power to detect this difference. This is equivalent to a large standardised effect size, 1.44. For effect sizes of 1 (9 PANSS units) and 0.8 (7.2 PANSS units) the power would be approximately 90% and 80% respectively.

The required sample size of 241 was increased to 282 to allow for 15% attrition and rounded up to allow for a 2:1 randomisation within each site. The total number of participants was then divided slightly unequally between sites for reasons of feasibility. A total of 188 persons with schizophrenia will be allocated to the CCBC arm while 94 persons will be allocated to the FBC arm. http://www.trialsjournal.com/content/12/1/12
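The "effective sd" and standardised effect size in the PANSS excerpt above follow from simple arithmetic. The sketch below assumes the usual ANCOVA result that adjusting for a baseline with correlation rho shrinks the residual SD by sqrt(1 - rho^2); the excerpt does not name the formula, so that is my assumption, and the variable names are mine.

```python
from math import sqrt

sd_raw = 10.0        # PANSS total SD from the excerpt
rho = 0.4            # pre-post correlation from the excerpt
baseline_mean = 65.0

# Residual SD after adjusting for baseline (assumed ANCOVA formula).
sd_adjusted = sd_raw * sqrt(1 - rho ** 2)   # about 9.17, rounded to 9 in the excerpt

target_difference = 0.20 * baseline_mean    # 20% of 65 = 13 points
effect_size = target_difference / 9         # using the rounded SD of 9, as the excerpt does

# Effect sizes of 1 and 0.8 translated back into PANSS units.
units_for_es_1 = 1.0 * 9
units_for_es_08 = 0.8 * 9

print(round(sd_adjusted, 2), round(effect_size, 2), units_for_es_1, round(units_for_es_08, 1))
```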

Sample size for Step I. Assuming an intra-cluster correlation coefficient to be 0.05 [37,38], with alpha error at 0.05 and statistical power at 0.80, to detect a mean difference of 1 point on PHQ9 (SD = 5), i.e. to detect an effect size of 0.2, we need 66 patients at each of 30 sites. The total sample size is therefore 1980.

Sample size for Step II. The clinical question for Step II is the main hypothesis of this trial. Previous studies using PHQ9 in the acute phase treatment of major depression have shown that, on average, the PHQ9 scores will drop from 15 (SD = 5) at baseline to 10 (SD = 6) at end of treatment, with a mean change of 5 (SD = 5) [49-51]. We expect a difference of 20% (1 point) in the PHQ9 change scores among the intervention arms and consider this to be a clinically meaningful difference in effect. With alpha error set at 0.05 and statistical power at 0.80, in order to detect a between-group difference of 1 point (SD = 5) in the reduction of PHQ9 scores from baseline, we need 522 per group and 1566 in toto at Step II. Assuming a dropout rate of 20% and a remission rate of 10% at week 3, we need 2175 participants for Step I.
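The step from 1566 at Step II to 2175 at Step I can be reproduced if you assume that only participants who neither drop out (20%) nor remit (10%) by week 3 move on to Step II. That reading is my assumption, as the excerpt does not show the arithmetic. Exact fractions avoid floating-point rounding surprises.

```python
from fractions import Fraction

n_per_arm = 522
n_arms = 3
n_step2 = n_per_arm * n_arms                 # 1566 needed at Step II

dropout = Fraction(20, 100)
remission = Fraction(10, 100)

# Assumed: the proportion reaching Step II is (1 - dropout) * (1 - remission).
proportion_reaching_step2 = (1 - dropout) * (1 - remission)   # 18/25 = 0.72

n_step1 = n_step2 / proportion_reaching_step2
print(n_step2, n_step1)
```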

One point difference in the mean change score on PHQ9 corresponds with an effect size of 0.2. This is a small effect according to Cohen's rough rule of thumb for effect size interpretation [52]. However, because the present trial represents comparison among active treatments and because the true effect size of antidepressants over placebo appears to be around 0.3 [53] and the average effect size of all the health interventions examined in the Cochrane Library appears to be between 0.3 and 0.4 [54], we consider this to be a clinically meaningful difference in effectiveness worth detecting in a large clinical trial. As a matter of fact, an effect size of 0.2 will be translated into an NNT of 10 if the control event rate is around 50% (e.g. response as defined usually by 50% or greater reduction in depression severity from baseline) and 20 if the control event rate is around 20% (e.g. remission of depression) [55]. They therefore represent clinically meaningful difference in effect.

The sample size will be revisited after completion of the pilot study.

Sample size for Step III. Step III represents continuation treatment for Steps I and II, and will therefore be examined as exploratory studies. We therefore will not calculate sample size necessary to detect a significant difference. However, we will calculate the obtained statistical power post-hoc.

Sample size for pilot study. The pilot study is a feasibility study and needs no sample size calculation. The target sample size is 200. We will perform no statistical analyses looking at the comparison arms at the end of the pilot study, whose participants will therefore be included in the main study unless there is a major change in the study protocol. http://www.trialsjournal.com/content/12/1/116

For this study we assumed that the verum treatment is better than placebo by 2.7 ± 6.0 (mean ± standard deviation) HAM-D score points after 6 weeks (corresponding to a SMD = 0.45), that type II case history is better than type I by 2.7 ± 6.0 score points (SMD = 0.45), and that both effects do not interact. If so, a Bonferroni-adjusted F-Test (multiple significance level α = 0.05, two-sided) has a power of 83.5% to detect the difference between verum and placebo and a power of 85.0% to detect the difference in case history taking, if 68 patients are included in groups 1 and 3, and 34 patients are included in groups 2 and 4. This leads to a total number of 228 patients, if one allows for a 10% drop-out rate per group. http://www.trialsjournal.com/content/12/1/43

The participants will be randomized to two groups: intervention or treatment-as-usual. A total of thirty participants will be recruited to the pilot. This will ensure that, even after loss to follow-up, we will have at least 10 subjects per group for analysis (FDA guidance http://www.fda.gov/cder/guidance). http://www.trialsjournal.com/content/12/1/159

The proposed design of the trial will run in 6 CAMHS in each of the three centres, giving 18 clinics with a minimum of one therapist for each treatment modality in each clinic. Because IMPACT will involve more than two treatments, a number of comparisons can be made between treatment groups. We propose to make two: a) the two individual specialist treatments CBT and STPP will be compared; and b) the individual specialist treatments combined will be compared with SCC. A 2.5% two-tailed significance level has therefore been used in the power calculation. Additionally, power has been calculated according to each of the following hypothesis of comparisons: superiority, equivalence or non-inferiority. It has been assumed that 5 points on the self-report MFQ is the minimum clinically important difference. This is approximately 25% of the change in the MFQ scale from baseline to 28 weeks according to the results from the ADAPT trial [13] and is equivalent to a 1 point improvement on 5 of the 33 items of the scale [13]. The ADAPT trial [15] gave an SD for MFQ of 14.6 across follow-up assessments, so that 5 points corresponds to a standardised effect size of about a third (small to medium) and a non-overlap between treatments of approximately 25% [64]. In the ADAPT trial the correlation between baseline and follow-up at 28 weeks was approximately 0.5 and the intra-cluster correlation for therapist was less than 0.01 [15]. With 18 therapist in each treatment modality and 10 followed-up patients per clinic a superiority analysis comparing CBT and STPP will have a power of over 80% provided that the intra-cluster correlation for therapists does not exceed 0.025. By virtue of the increased sample size comparison of the specialist individual treatments (CBT and STPP) with SCC will have greater power. These power calculations assume a cross-sectional analysis, but statistical analysis will be based on a longitudinal data using a linear mixed model. 
The use of such model will increase the power of the statistical analysis as data is in effect shared across follow-up time points. ADAPT had 92% follow-up at 28 weeks suggesting a recruited sample size of approximately 540 patients. http://www.trialsjournal.com/content/12/1/175

Sample size calculations are performed to determine the number of participants needed to detect effect sizes similar to those that have been reported in recent ADHD medication trials [56]. This study is powered to detect a 7.04 difference between the groups. A sample size of 32 per group is required to achieve 80% power with a 2-tailed significance level of .05, assuming an equivalent standard deviation of 9.9 in both groups. Estimating a 20% dropout during the study, a minimum of 80 total participants are needed to reach the target 32 participants per group [μc - μt: mean difference = 7.04; σ²: assumed common variance of both treatment and control group (standard deviation = 9.9); n = 32]. http://www.trialsjournal.com/content/12/1/173
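This excerpt gives every input needed to redo the calculation: a raw difference, a common SD, alpha, power, and a drop-out rate. A minimal sketch using the normal approximation (my own choice of method, since the excerpt's formula did not survive) reproduces both the 32 per group and the 80 total.

```python
from math import ceil
from statistics import NormalDist

z = NormalDist().inv_cdf

diff, sd = 7.04, 9.9                 # mean difference and common SD from the excerpt
alpha, power, dropout = 0.05, 0.80, 0.20

# n per group = 2 * (z_alpha + z_beta)^2 * sd^2 / diff^2
raw = 2 * (z(1 - alpha / 2) + z(power)) ** 2 * sd ** 2 / diff ** 2
n_completers = ceil(raw)             # participants per group needed at analysis

# Enrol enough that 80% retention leaves n_completers; the tiny offset guards
# against floating-point noise when the division is exact.
n_enrolled_per_group = ceil(n_completers / (1 - dropout) - 1e-9)
n_enrolled_total = 2 * n_enrolled_per_group

print(round(raw, 2), n_completers, n_enrolled_total)
```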

The aim of TREC-SAVE is to pilot methods, to generate data relevant for sample size calculations and, if possible, clinically useful data regarding whether either procedure is better in terms of clinically measures restriction and perceived treatment success. TREC-SAVE is, in part, a proof of concept study. Two main factors determine the number of people who should be recruited to in order for the trial to provide clear answers. They are the frequency of the investigated event and the size of the effect of treatment. It is important to avoid results that are erroneous. The probability of producing so called 'false-positive' results (type I error-α) and 'false-negative' findings (type II error-β) is minimised by having adequate sample size. However, we have no data at all from trials even to estimate adequate sample size [64] and so we recruited ten people as a very preliminary test of methods and, thereafter, aim to randomise 100 in total. Concerning the primary outcome, 100 patients were needed to give 80% power at a two-sided 5% significant level to detect a reduction in an assumed value of 50% to 30% or of 25% to 10%. http://www.trialsjournal.com/content/12/1/180

We are planning a trial of the continuous response variable, GAF-F, from independent control and experimental participants with one control per experimental participant. A previous study involving psycho-education in a Danish community mental health centre showed that response within each participant group was normally distributed with a standard deviation of 15 [38]. In the few previous studies using IMR or elements of IMR the effectiveness has been assessed by using the total GAF score showing a difference of 6-10 points [21,39]. Based on this knowledge we will conservatively estimate the true difference in the experimental and control group means to be 6 points on the GAF-F score. Using this estimation will require a total of 200 participants to reject the null hypothesis that the population means of the experimental and control groups are equal with probability (power) 0.8. The type I error probability associated with this test of the null hypothesis is 0.05. The power and sample size calculations have been made using the PS Power and Sample Size Calculations program version 3.0.34 [40].

The power of some of the secondary measures has also been estimated with a total number of 200 participants. This showed that a sample size of 200 participants is sufficient to show a relevant effect size in PSP, PANSS and the IMR Scale corresponding to similar studies, see table 2. The remaining secondary measures have not been tested due to the fact that similar trials with these outcomes could not be obtained. Thus, the results of these analyses should be interpreted with caution, and the analyses should be considered as exploratory. http://www.trialsjournal.com/content/12/1/195

The primary outcome in this trial is the change of the Visual Analogue Score (VAS) of average pain in the menstrual period between the baseline (Visit 2) and after treatment (Visit 4). The hypothesis is $H_0: \delta = \mu_1 - \mu_2 = 0$ versus $H_1: \delta = \mu_1 - \mu_2 \neq 0$, where $\mu_1$ is the change of VAS score between Visit 4 and baseline in the GJBNH group and $\mu_2$ is the change of VAS score between Visit 4 and baseline in the placebo control group. As a reference, we have chosen the study of Choi et. al. [6] and Yeh LL.[9]. The mean (±standard deviation, SD) change of the VAS in GJBNH group (n=13) was 2.42 (±2.04) and those in placebo control group (n=38) was 1.08 (±2.14). We used the mean difference between groups, 1.34 (=2.42-1.08), as a clinically significant improvement worth to detect and 2.12 as its pooled standard deviation. With 5% of two-sided significance level and 80% of statistical power, we needed to randomly assign fifty participants in each group considering approximately 20% of drop-out rate. http://www.trialsjournal.com/content/pdf/1745-6215-13-3.pdf
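The pooled standard deviation of 2.12 in this excerpt can be recovered from the two group SDs and sample sizes, and the rest of the calculation follows from it. The sketch below uses the standard pooled-variance formula and a normal approximation; the authors' actual software and rounding are not stated, so treat this as a plausible reconstruction rather than their method.

```python
from math import ceil, sqrt
from statistics import NormalDist

# Reference data quoted in the excerpt.
n1, mean1, sd1 = 13, 2.42, 2.04   # GJBNH group
n2, mean2, sd2 = 38, 1.08, 2.14   # placebo control group

# Pooled SD: weight each variance by its degrees of freedom.
pooled_sd = sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2))
sd_used = round(pooled_sd, 2)     # 2.12, as in the excerpt

diff = mean1 - mean2              # about 1.34

z = NormalDist().inv_cdf
raw = 2 * (z(0.975) + z(0.80)) ** 2 * sd_used ** 2 / diff ** 2
n_completers = ceil(raw)                               # per group at analysis
n_randomised = ceil(n_completers / (1 - 0.20) - 1e-9)  # allow 20% drop-out

print(sd_used, round(raw, 2), n_completers, n_randomised)
```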

The planned number of patients to be included in this trial is 80, 40 in each arm. This is sufficient to show a statistically significant difference between the intervention arms of effect size (Cohen's d) ≥ 0.66 for the change in total PANSS score from baseline to last follow-up. It is based on an alpha of 0.05, a power (1-beta) of 0.8 and a two-sided unpaired t-test. Further, we accounted for 10% withdrawal. It should be noted that the power of the repeated measures analyses is considerably higher. http://www.trialsjournal.com/content/7/1/31
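Here the sample size is fixed and the authors report the smallest detectable effect, which is the same formula solved for d instead of n. A small sketch under the normal approximation (variable names are my own) lands very close to the reported 0.66 once the 10% withdrawal is taken off the 40 per arm.

```python
from math import sqrt
from statistics import NormalDist

z = NormalDist().inv_cdf

n_randomised = 40                                # per arm, from the excerpt
withdrawal = 0.10
n_completers = n_randomised * (1 - withdrawal)   # 36 per arm expected at analysis

# Solve n = 2 * (z_alpha + z_beta)^2 / d^2 for d.
d_min = (z(0.975) + z(0.80)) * sqrt(2 / n_completers)

print(round(d_min, 3))
```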

The trial aims to recruit 800 patients over a period of 2 years. Even allowing for a 25% loss to follow-up, this sample size will provide over 90% power to detect small (0.25 standard deviations) overall treatment effects on the primary outcome measures with 90% power at p < 0.01. It will also be sufficient to detect small differences (0.25 standard deviations) in the primary outcome measures at any one assessment point with 80% power at p < 0.05, and small to moderate differences (0.3 standard deviations) with 90% power at p < 0.05. These differences are equivalent to 1–2 point improvements in the SMMSE and BADLS, which are considered the minimal clinically relevant differences. The sample size calculations assume a correlation between serial measurements of the SMMSE/BADLS of around 0.6 (multiplying factor of 0.34 based on 1 baseline and 4 post-randomization measurements, as described by Machin et al [41]) anticipating an analysis of covariance with repeated measures. http://www.trialsjournal.com/content/10/1/57
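The "multiplying factor of 0.34" matches the variance factor for an analysis of covariance on the mean of several post-randomization measurements under compound symmetry, a formula usually attributed to Frison and Pocock. Whether this is exactly the route the authors took through Machin et al. is my assumption, and the function and argument names are mine.

```python
def ancova_variance_factor(rho, n_pre, n_post):
    """Factor by which a single-measurement sample size is multiplied.

    Assumes equal correlation rho between all pairs of measurements,
    n_pre baseline measurements used as a covariate, and the mean of
    n_post follow-up measurements as the outcome.
    """
    post_term = (1 + (n_post - 1) * rho) / n_post
    pre_term = n_pre * rho ** 2 / (1 + (n_pre - 1) * rho)
    return post_term - pre_term


factor = ancova_variance_factor(rho=0.6, n_pre=1, n_post=4)
print(round(factor, 2))
```

With a correlation of 0.6, one baseline and four follow-ups, the required sample size is about a third of what a single unadjusted follow-up measurement would need.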

At present there is no reliable data for calculating the sample size of the proposed trial. Currently available data suggests that older people living in the community lose 1.6 points per year on the CAMCOG [30]. Factoring in a possible 20% loss to follow up, with 64 people in each group we will have 80% power to detect a between-groups difference of 1.5 points on the CAMCOG. This assumes a decline that is twice as large in the educational compared with the cognitive intervention group, and although statistically this may be associated with moderate effect size (0.5), it is the minimum difference that one would consider clinically significant. http://www.trialsjournal.com/content/10/1/114

We based our sample size calculations to test hypothesis 1 on the results of the trial published by Coppen and Bailey, [22] which is the largest study of adjunctive antidepressant treatment with folic acid published to date. This study also enrolled a community-representative sample although recruited adults 18 years and over. The B-VITAGE trial will therefore have the added potential advantage of having participants with higher average tHcy given their older age (60 plus). We have previously shown that tHcy increases with age and that response to homocysteine-lowering vitamins is higher in people with higher tHcy [17]. Fluoxetine and citalopram are both selective serotonin reuptake inhibitors (SSRI's) and have been shown to have similar efficacy in the treatment of major depression [42]. The addition of vitamins B12 and B6 in our study may provide additional benefit (and therefore potentially bigger effect sizes) than folate supplementation alone given their respective roles in homocysteine metabolism.

Coppen and Bailey reported that 64.7% of their subjects treated with fluoxetine 20 mg and folic acid 0.5 mg were free of clinically significant depressive symptoms at the end of 10 weeks (Hamilton Depression Rating Scale score = 9), compared with 48.3% of those receiving fluoxetine 20 mg and placebo. A difference in remission rates of this magnitude is considered to be of meaningful clinical significance [43]. A study with 310 subjects (155 per group) would have 80% power to demonstrate a difference of this magnitude, with 2-tailed alpha set at 5%. We conservatively estimate a 12-week loss to follow-up of 15%, with an additional 10% lost between weeks 12 and 52 (total loss of 25%). Therefore, we will aim to recruit 388 older adults with a depressive episode into the trial (194 subjects per group).
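For a comparison of two proportions, the answer depends noticeably on whether a continuity correction is used. The sketch below is my own reconstruction, using the standard pooled-variance formula with the Fleiss-style continuity correction; the excerpt does not say which method was used. The corrected version reproduces the 155 per group, and the 194 per group is consistent with inflating 155 by 25%.

```python
from math import ceil, sqrt
from statistics import NormalDist


def n_two_proportions(p1, p2, alpha=0.05, power=0.80, continuity=True):
    """Approximate n per group for a two-sided test of two proportions."""
    z = NormalDist().inv_cdf
    z_alpha, z_beta = z(1 - alpha / 2), z(power)
    p_bar = (p1 + p2) / 2
    delta = abs(p1 - p2)
    n = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
         + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2 / delta ** 2
    if continuity:
        # Continuity-corrected sample size.
        n = n / 4 * (1 + sqrt(1 + 4 / (n * delta))) ** 2
    return n


n_plain = ceil(n_two_proportions(0.647, 0.483, continuity=False))
n_corrected = ceil(n_two_proportions(0.647, 0.483))

# Inflating by the 25% total loss quoted in the excerpt (assumed multiplicative).
recruit_per_group = ceil(155 * 1.25)
recruit_total = 2 * recruit_per_group

print(n_plain, n_corrected, recruit_per_group, recruit_total)
```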

To estimate the power of our study to test hypothesis 2, we used previously published data showing that about 35% of older adults with depression either remain depressed or relapse during the first year of treatment [44]. A study with 155 subjects per group by the end of 12 months will have 81% power to declare as significant a between group difference of 15% (35% vs 20% for placebo and vitamins, respectively). Finally, a study of this size will have 80% power to declare as significant a mean difference of MADRS scores between the groups of 2.3 points (standard deviation = 7, alpha = 0.05) [45]. Changes in the scores of the MADRS between baseline and week 12 will constitute a secondary outcome of interest for the study. http://www.trialsjournal.com/content/11/1/8

Under the assumption of four successful follow-up assessments and a within-subject correlation of r = 0.40, 72 participants in total would yield, at a two-sided alpha < 0.05, a power of 0.81 to detect a medium effect size of d = 0.35 for main effects [62], i.e. a difference of 0.35 of the pooled standard deviation. http://www.trialsjournal.com/content/11/1/19

The primary outcome measures are cognition (ADAS-Cog) and quality of life (QoL-AD) at 24 weeks follow up. In the pilot CST trial [6] which recruited people with mild/moderate dementia (MMSE 10-24), community and institutional participants had a similar level of cognitive impairment (mean MMSE 14.5 and 14.1). The RO review [5] found a moderate effect size of 0.58 between the RO and control groups though the studies had some differences in methodology, outcome measures, and length of treatment/follow up. The MCST pilot study found a large effect size of 1.0 compared with CST alone. To detect an effect size for MCST of 0.39 on the ADASCog with power of 80% using a 5% significance level and an estimated attrition of 15% needed, a sample size of 230 at T1 is required. If an estimated 60 participants will have Alzheimer's disease and are suitable/willing to take cholinesterase inhibitors (ACHEIs), this provides sufficient numbers for the maintenance CST/ACHEIs trial platform to estimate effect size and the feasibility of the trial. http://www.trialsjournal.com/content/11/1/46

Based on anticipated between-group absolute risk difference of 30% in number of people with BPSD symptoms measured by the NPI, a sample size of 146 participants will give 90% power to detect this difference with 95% confidence. This calculation includes a 20% attrition estimate. The sample size calculations have been described in Table 2. http://www.trialsjournal.com/content/11/1/53

Training for care providers has been associated with a 0.4 point mean fall in QOL scores (SD 7.6) over 6 months, while QOL fell by a mean of 3.5 points (SD 7.9) in facilities that did not train care providers [10]. We have powered the present study to detect an effect of at least this magnitude. We would need 100 subjects in each row of the two by two factorial table (total number 200) to have 0.8 power to detect a main effect of this magnitude (alpha = 0.05; two sided). Loss of subjects to follow up, based on our previous experience, was anticipated to be in the order of 30% per annum in this population [23]. We used an estimated intra-class correlation of 0.05 (to account for clustering) and estimated cluster size of 9, resulting in an inflation factor of 1.4. We thus expected to have to enrol 364 subjects in forty clusters to achieve adequate numbers of completing subjects. This estimate of the intraclass correlation was conservative, given that intraclass correlation coefficients are typically smaller than 0.02 [24]. The power calculation was reviewed prior to closing recruitment, because actual cluster size (which influences the required number of participants) can be difficult to predict. Having recruited 67 clusters with an average cluster size of 5.2, it was confirmed that, because cluster size was smaller than anticipated, recruitment could be closed. Analysis will be by "intention to treat" (that is, according to randomisation, rather than participation in education). Statistical analysis will be conducted using multilevel mixed-effects linear regression in Stata version 11.0 (StataCorp, College Station, Texas). The effect of clustering by both facility and GP will be accounted for by treating the facility and GP as random effects with GP nested within facility. 
For each outcome analysis, a model containing the GP intervention, the facility intervention, and the baseline values of outcome variables will be used to estimate the marginal effect of each intervention. Next, the confounding effects of other covariates will be examined by comparing the adjusted and unadjusted intervention effects. Any covariates that produce clinically important changes in the intervention effect estimates, and are therefore demonstrably confounding these effects, will be retained in the model. Secondary analyses will be conducted to test the significance of any interaction between the facility and GP interventions. The results reported will include the estimated mean of the outcome variable in each arm of both interventions and for each intervention, the mean difference in outcome between the intervention and control arms. The results of the secondary analysis of the interaction between interventions will also be presented. All results will be presented with their associated 95% confidence intervals. http://www.trialsjournal.com/content/11/1/63
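The clustering and attrition adjustments in the excerpt above can be traced with a few lines of arithmetic. This is only a sketch. The design effect formula, 1 + (m - 1) x ICC, is the standard one, but the excerpt does not say how the 30% loss to follow-up was applied, so the multiplication by 1.3 below is an assumption that happens to reproduce the quoted 364. The function and variable names are made up for illustration.

```python
def design_effect(cluster_size, icc):
    """Standard inflation factor for cluster randomisation: 1 + (m - 1) * ICC."""
    return 1 + (cluster_size - 1) * icc


# Figures quoted in the excerpt.
completers_needed = 200       # 100 per row of the two by two table
icc = 0.05
planned_cluster_size = 9
attrition = 0.30

deff_planned = design_effect(planned_cluster_size, icc)

# Assumption: attrition handled by multiplying by (1 + attrition).
enrol_multiply = completers_needed * deff_planned * (1 + attrition)

# Alternative convention: divide by the expected retention rate.
enrol_divide = completers_needed * deff_planned / (1 - attrition)

# Design effect recomputed with the observed mean cluster size of 5.2.
deff_observed = design_effect(5.2, icc)

print(f"planned design effect:       {deff_planned:.2f}")
print(f"enrolment, multiply by 1.3:  {enrol_multiply:.0f}")
print(f"enrolment, divide by 0.7:    {enrol_divide:.0f}")
print(f"observed design effect:      {deff_observed:.2f}")
```

Dividing by the retention rate instead of multiplying would call for 400 participants rather than 364, and the smaller observed clusters bring the design effect down from 1.40 to 1.21, which fits the authors' decision to close recruitment.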

In a pilot study conducted by this research team, the ADAS-Cog score of participants in the control group (usual care, n = 10) increased 4.47 ± 6.36 points in six months (indicating cognitive decline) whereas the patients in the intervention group (physical activity, n = 12) increased 2.21 ± 4.88 points. Based on this pilot data, recruiting 115 participants in each of the two groups (total = 230) at baseline will result in a power of 80% to detect a difference of 2.2 points between the groups (alpha = 0.05, 2-tailed). This sample size also allows for an estimated 15% drop out rate at follow-up. http://www.trialsjournal.com/content/11/1/120
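A calculation like the one above can be approximated with the usual normal-theory formula for comparing two means. The sketch below makes several assumptions the excerpt does not confirm: a pooled standard deviation weighted by the pilot degrees of freedom, the observed pilot difference of 4.47 - 2.21 = 2.26 points, the normal approximation rather than the t distribution, and dropout handled by dividing by 0.85. The function names are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist


def pooled_sd(sd1, n1, sd2, n2):
    """Pooled standard deviation, weighting each variance by its degrees of freedom."""
    return sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2))


def n_per_group(delta, sd, alpha=0.05, power=0.80):
    """Completers per group for a two-sided, two-sample comparison of means
    (normal approximation)."""
    z = NormalDist()
    z_total = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    return ceil(2 * (z_total * sd / delta) ** 2)


# Pilot figures quoted in the excerpt.
sd = pooled_sd(6.36, 10, 4.88, 12)
delta = 4.47 - 2.21

completers = n_per_group(delta, sd)
enrolled = ceil(completers / (1 - 0.15))   # assumption: inflate for 15% dropout

print(f"pooled SD: {sd:.2f}")
print(f"completers per group: {completers}")
print(f"enrolled per group:   {enrolled}")
```

Under these assumptions the sketch gives 97 completers and 115 enrolled per group, which agrees with the quoted figure, though that may be a coincidence of the choices made here. With the rounded difference of 2.2 points the same formula asks for about 102 completers per group.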

The primary goal of the proposed study is to identify a number of characteristics which are differentially associated with outcomes across various treatments. This extends the traditional randomized clinical trials which directly compare treatments or a study designed to specifically test the moderating effect of one or more baseline characteristics. The sample size has been selected to provide statistical power of at least 89% to detect small effects for predictors (odds ratio 1.3 per standard deviation change in the independent baseline measure) at an alpha level of p < .05; 94% power to detect medium effects for predictors (odds ratio of 1.5) at an alpha level of p < .001; and 94% power to detect medium effects for moderator interaction terms (odds ratio of 1.5) at an alpha level of p < .01. In addition to replication in the second 1000 participants, we aim to control type I error by applying effect size criteria for each logistic model that odds ratios for the main parameter of interest must exceed 1.3. http://www.trialsjournal.com/content/12/1/4

The original brief from the NHS HTA programme specified the outcome as rate of remission in those depressed at baseline. We estimated that we needed to recruit 77 RNHs and at least 418 residents with depression at baseline to detect an increase in remission rates from 25% to 40%. Our primary analysis will be a comparison of the difference in proportion of depressed residents at the end of the study (40% v 25%). Recruiting participants from 77 RNHs gives more than 97% power to detect this at the 5% significance level, even if we need to exclude residents recruited post-randomisation from this comparison. Our sample size estimate includes an inflation factor to account for clustering effects. Few previous studies are available to allow us to estimate the range of likely values for the intra-cluster correlation (ICC) needed to estimate this. We therefore used a conservative value of 0.05 for this, which is towards the upper end of the range seen in previous primary care studies. This should provide sufficient power to allow for likely variation in cluster sizes and for likely clustering effects due to different physiotherapists carrying out the exercise programme in different RNHs.

Our calculations also allow for anticipated loss to follow-up. Because few RNH residents move out of residential accommodation we anticipate good follow-up rates. This population has a high mortality, up to 34% per year; additionally, for some, their health will deteriorate so that they are no longer able to complete some, if not all, of the follow-up assessments. However, those residents with the poorest health are less likely to join the study, so that we can anticipate a smaller attrition rate. The nearest equivalent study collected data on 169/220 depressed residents after 9.5 months (77%) [22], equivalent to 71% at one year. We therefore anticipate a loss to follow-up rate of 30%, made up of those who have died and those no longer willing or able to complete the assessments. http://www.trialsjournal.com/content/12/1/27
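The step from 77% follow-up at 9.5 months to 71% at one year in the excerpt above is not explained. Two common conventions are sketched here: a straight-line extrapolation of the loss, and a constant-hazard (exponential) extrapolation. Which one the authors used is an assumption on my part, and the variable names are made up.

```python
# Figures quoted in the excerpt.
followed, recruited = 169, 220
months_observed = 9.5

retained = followed / recruited

# Linear extrapolation: the same number of losses every month.
loss_per_month = (1 - retained) / months_observed
retained_linear = 1 - 12 * loss_per_month

# Constant-hazard extrapolation: the same proportion lost every month.
retained_exponential = retained ** (12 / months_observed)

print(f"retained at 9.5 months:          {retained:.1%}")
print(f"retained at 12 months (linear):  {retained_linear:.1%}")
print(f"retained at 12 months (hazard):  {retained_exponential:.1%}")
```

The linear version rounds to the 71% quoted, while the constant-hazard version gives a slightly higher retention of about 72%. Either way, the anticipated 30% loss to follow-up is in line with the earlier study.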

We used data from the prior cohort study to estimate sample size and statistical power. The Days at Home variable has a negatively skewed distribution (median 62 days at home, inter-quartile range 1 to 82 days), with 24% of participants having zero days at home. We assumed that the Mann-Whitney U test would be used to test for differences between groups. We used a bootstrapping simulation method to investigate the power of the study to detect possible effects of the MMHU, on both the probability of zero days at home and numbers of days spent there, under a range of plausible assumptions. A study with 300 participants in each group has 80% power to detect a 5 day difference in days at home, if the proportion of patients with zero days at home is reduced to 20% (a 15% reduction). This represents a reasonable minimum clinically important difference. http://www.trialsjournal.com/content/12/1/123
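Simulation is a natural way to get power for a Mann-Whitney test when the outcome has a spike at zero, as in the excerpt above. The sketch below shows the general idea only. The authors resampled from their own cohort data, which is not available here, so this version swaps in draws from a made-up zero-inflated distribution (a triangular distribution on 1 to 90 days with a spike at zero). The distribution, the function names, and the number of simulations are all assumptions, so the estimated power should not be expected to match the 80% quoted.

```python
import random
from statistics import NormalDist


def mann_whitney_p(x, y):
    """Two-sided Mann-Whitney p-value (normal approximation, corrected for ties)."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    rank_sum_x, tie_term, i = 0.0, 0.0, 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2              # mean of ranks i+1 .. j
        n_from_x = sum(1 for k in range(i, j) if pooled[k][1] == 0)
        rank_sum_x += midrank * n_from_x
        tie_term += (j - i) ** 3 - (j - i)
        i = j
    u = rank_sum_x - n1 * (n1 + 1) / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u <= 0:                             # every observation tied
        return 1.0
    z = (u - n1 * n2 / 2) / var_u ** 0.5
    return 2 * (1 - NormalDist().cdf(abs(z)))


def draw_days(rng, p_zero, shift):
    """Hypothetical 'days at home': zero with probability p_zero, otherwise a
    negatively skewed value between 1 and 90, moved up by `shift` days."""
    if rng.random() < p_zero:
        return 0
    days = round(rng.triangular(1, 90, 80)) + shift
    return max(1, min(90, days))


def simulate_power(n_per_group, p_zero_control, p_zero_treated, shift,
                   n_sims=200, alpha=0.05, seed=1):
    """Proportion of simulated trials in which the Mann-Whitney test rejects."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        control = [draw_days(rng, p_zero_control, 0) for _ in range(n_per_group)]
        treated = [draw_days(rng, p_zero_treated, shift) for _ in range(n_per_group)]
        if mann_whitney_p(control, treated) < alpha:
            rejections += 1
    return rejections / n_sims


# Scenario loosely patterned on the excerpt: 24% versus 20% zeros, 5 day shift.
print(simulate_power(300, 0.24, 0.20, 5))
```

Replacing draw_days with resampling from real pilot or cohort data would turn this into the bootstrap approach described in the excerpt, and a few thousand simulations per scenario would give a steadier estimate than the 200 used here.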

We based sample size calculations on the BECCA [12] and REMCARE [18] trials. These predicted effect sizes, defined as average effect per participant divided by population standard deviation, of 0.42 for CSP and 0.35 for RYCT. In a 2 × 2 factorial design using a 2:1 allocation ratio in favour of groups receiving RYCT, a completed sample of 240 dyads would yield power of more than 90% to detect both main effects using a significance level of 5%. This design would also yield power of more than 80% to detect interaction between CSP and RYCT equivalent to an effect size of 0.4, using an analogous definition. As both the REMCARE trial platform and BECCA retained some 80% of participants, we aim to recruit 300 dyads in 13 rounds of 24 dyads to yield a final sample of 240 dyads. http://www.trialsjournal.com/content/12/1/205

The planned number of participants is 150, as determined by previous RCT results, the clustering effects of each site, and the possibility of exhaustion of our resources. An RCT of EIS for patients with FEP in Holland indicated that EIS improved participant GAF-F scores in 24 months [5]. Another RCT of CBT in England revealed that CBT improved patient readmission rates and GAF scores in 18 months [6,7]. The effect sizes calculated from their results were 0.26, 0.46, and 0.58, respectively. The estimated sample sizes using G*Power 3 [34] (alpha error = 0.05, beta error = 0.2) were 370, 150, and 96, respectively.

We will adopt 1 interim analysis and consider stopping the trial if the participants in the CAP group have unexpectedly effective outcomes compared with those in the SC group. We will conduct the interim analysis when half of the participants have completed the trial up to the first end point. Stopping rules will be planned on the basis of the O'Brien-Fleming method [35] for the GAF-F score, re-admission rate, lost to follow-up rate, self-harm and suicide attempt rate, and suicide rate at the first end point. We will consider stopping the trial on the basis of the stopping rule, baseline data, missing data, and site. Because of ethical issues, we will provide EIS to all participants until 18 months after allocation if the trial is halted. http://www.trialsjournal.com/content/12/1/156
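The step from effect sizes to sample sizes in the excerpt above can be roughly checked with the normal approximation for a two-group comparison. The excerpt does not say which test or how many tails were chosen in G*Power, so the two-sided, two-sample setup below is an assumption, and the function name is hypothetical. G*Power works from the noncentral t distribution, so its answers are typically a participant or two larger per group than this approximation.

```python
from math import ceil
from statistics import NormalDist


def total_n_for_effect_size(d, alpha=0.05, power=0.80):
    """Total sample size (two equal groups) needed to detect a standardized
    difference d with a two-sided test, using the normal approximation."""
    z = NormalDist()
    z_total = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    n_per_group = 2 * (z_total / d) ** 2
    return 2 * ceil(n_per_group)


# Effect sizes quoted in the excerpt.
totals = [total_n_for_effect_size(d) for d in (0.26, 0.46, 0.58)]
print(totals)   # [466, 150, 94]
```

Under these assumptions the totals are 466, 150, and 94. The second and third are close to the quoted 150 and 96, but the first is not close to the quoted 370, so the authors may have used different settings for that calculation.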

This page was written by Steve Simon and is licensed under the Creative Commons Attribution 3.0 United States License.