*Dear Professor Mean, We have a large dataset with about 400 million records. We need to randomly select a subsample from it. However, we need help in determining the sample size. What sample size do we need for the confidence interval calculations?* – Frantic Frank

Dear Frantic,

400 million records? I bet your fingers are tired from all that typing.

There are several approaches for determining the sample size. The simplest is to **estimate what sample size will provide confidence intervals that are narrow enough for your needs**. You might say, “I want the interval to be as narrow as possible,” but that’s not really true. There is a certain level of precision which, if you exceed it, becomes clinically irrelevant. You don’t need to know the average cholesterol levels, for example, with a precision of two or three decimal places. **The smallest difference that is still important to your clinicians will determine your sample size**.

By the way, don’t expect me (or any other statistician) to tell you how much of a difference is considered clinically relevant. I have enough trouble understanding the difference between good and bad cholesterol. The narrowness of the intervals should be determined by medical expertise.

**What else do I need to specify?**

Beyond specifying how narrow you really need the intervals to be, you need to have at least a rough idea about how variable your outcome measure is. You could randomly select a few hundred records from your database and estimate a standard deviation. You only need a rough estimate of the standard deviation, so anything more than a few hundred records is overkill.
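As a minimal sketch of that pilot step, the snippet below draws a few hundred values at random and computes their standard deviation. The `records` list is simulated purely for illustration (it is not your data), and the function name and the pilot size of 300 are assumptions, not a prescription.

```python
import random
import statistics


def pilot_standard_deviation(values, pilot_size=300, seed=None):
    """Estimate the standard deviation from a small random pilot sample."""
    rng = random.Random(seed)
    # Sample without replacement; never ask for more values than exist.
    pilot = rng.sample(values, min(pilot_size, len(values)))
    return statistics.stdev(pilot)


# Hypothetical stand-in for the real database: simulated cholesterol-like values.
simulator = random.Random(1)
records = [simulator.gauss(200, 50) for _ in range(100_000)]

rough_sd = pilot_standard_deviation(records, pilot_size=300, seed=2)
print(f"Rough standard deviation from 300 records: {rough_sd:.1f}")
```

With a real database you would pull the pilot records with whatever random selection your system supports; only the standard deviation of the pulled values matters here.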

If you can’t pull out a few hundred records in advance, try to find information about the variability of your outcome measure, perhaps in publications of similar research studies. You’re not going to find published research that is identical to what you are doing, but anything close should be fine.

When you find that publication, look for a standard deviation. If no standard deviation is given, sometimes you can estimate it using other measures of variability such as the standard error, the range, confidence limits, or even information about the percentiles of the data.
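The conversions mentioned above can be sketched roughly as follows. These are common approximations rather than exact results, and they rest on assumptions not stated in the text: the published standard error and confidence limits are for a mean based on `n` subjects, the limits were built with a normal (not t) multiplier, and the data are roughly normal. The function names are made up for this illustration.

```python
import math
from statistics import NormalDist


def sd_from_standard_error(se, n):
    """The standard error of a mean is SD / sqrt(n), so SD = SE * sqrt(n)."""
    return se * math.sqrt(n)


def sd_from_confidence_limits(lower, upper, n, confidence=0.95):
    """Back out the SD from confidence limits for a mean (normal multiplier assumed)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    se = (upper - lower) / (2 * z)  # half-width divided by the multiplier
    return se * math.sqrt(n)


def sd_from_range(minimum, maximum):
    """Crude rule of thumb: the range spans about four standard deviations."""
    return (maximum - minimum) / 4


def sd_from_quartiles(q1, q3):
    """For roughly normal data, the interquartile range is about 1.349 SD."""
    return (q3 - q1) / 1.349


print(sd_from_standard_error(5, 100))
print(round(sd_from_confidence_limits(190.2, 209.8, 100), 1))
print(sd_from_range(100, 300))
```

The range rule is the roughest of the four, since the range tends to grow with the number of subjects. Treat it as a last resort.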

**Example**

If you let D represent the minimum detectable difference, S represent the standard deviation, and Z represent the 1-alpha/2 percentile of a standard normal distribution, then the appropriate sample size would be

![n = (Z S / D)^2](http://www.pmean.com/images/confid37.gif){width="95" height="55"}

Suppose you wanted a confidence interval for average cholesterol level to have a precision of plus or minus 2 units. And let’s suppose that the standard deviation for cholesterol in a population similar to yours is 50 units. If we wanted a 99% confidence interval (let’s be extravagant, since we have 400 million data points to choose from!), then Z would be 2.576. Applying the formula, we get

![n = (2.576 × 50 / 2)^2 = 4147.36](http://www.pmean.com/images/confid38.gif){width="211" height="55"}

which we round up to 4,148.
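If you would rather let the computer do the arithmetic, here is a small sketch of the calculation, assuming the usual formula n = (Z × S / D)², which reproduces the numbers in this example. The function names are invented for illustration.

```python
import math
from statistics import NormalDist


def z_for_confidence(confidence):
    """The 1 - alpha/2 percentile of a standard normal distribution."""
    alpha = 1 - confidence
    return NormalDist().inv_cdf(1 - alpha / 2)


def sample_size_for_mean(d, s, z):
    """Sample size so the interval for a mean is about plus or minus d."""
    # Always round up: a fraction of a record is not much use.
    return math.ceil((z * s / d) ** 2)


print(round(z_for_confidence(0.99), 3))  # 2.576
n_mean = sample_size_for_mean(d=2, s=50, z=2.576)
print(n_mean)  # 4148
```

The example passes the rounded value Z = 2.576 so that the result matches the hand calculation. Feeding in the unrounded percentile can move the answer by a record or so, which is nothing to lose sleep over.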

**What if I am estimating a proportion?**

If you’re estimating a proportion rather than a mean, the process is similar except that instead of a standard deviation, you need a rough estimate of what you think the proportion might be. It doesn’t need to be all that accurate an estimate. A ballpark figure is fine.

If P is your guess at what the proportion should be, then the sample size needed would be

![n = Z^2 P (1 - P) / D^2](http://www.pmean.com/images/confid36a.gif){width="155" height="56"}

Suppose we wanted to estimate the proportion of adverse drug events to plus or minus 1.5% and we know that the proportion will be around 12%. Again, let’s use a 99% confidence level. Then the sample size would be

![n = 2.576^2 × 0.12 × 0.88 / 0.015^2 = 3114.39](http://www.pmean.com/images/confid35.gif){width="268" height="55"}

which we round up to 3,115.
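The same arithmetic for a proportion can be sketched in a few lines, assuming the usual formula n = Z² × P × (1 − P) / D², which reproduces the figure above. The function name is made up for illustration, and the precision is entered as a fraction (0.015) rather than a percentage.

```python
import math


def sample_size_for_proportion(d, p, z):
    """Sample size so the interval for a proportion is about plus or minus d."""
    # p * (1 - p) plays the role that the variance plays for a mean.
    return math.ceil(z ** 2 * p * (1 - p) / d ** 2)


n_prop = sample_size_for_proportion(d=0.015, p=0.12, z=2.576)
print(n_prop)  # 3115
```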

At this point, you might protest and say, “But I don’t know the proportion!” That’s probably true; if you already knew the proportion, you wouldn’t need to do the research. But I suspect that you have a rough idea of what the proportion might be, either from your intuition or from previously published research in the area.

If you really have no idea what the proportion might be, then use P=0.5. That gives you a worst case scenario, meaning the largest sample size. If your proportion is much bigger or much smaller than 0.5, then your interval will be narrower than you might expect, but hardly anyone ever complains if their interval is narrower than planned.
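To see why 0.5 is the worst case, the sketch below tabulates the required sample size for several guesses at P, keeping the plus or minus 1.5% precision and Z = 2.576 from the example. It assumes the usual formula n = Z² × P × (1 − P) / D². The list of guesses and the function name are chosen only for illustration.

```python
import math


def sample_size_for_proportion(d, p, z):
    """Sample size so the interval for a proportion is about plus or minus d."""
    return math.ceil(z ** 2 * p * (1 - p) / d ** 2)


guesses = [0.05, 0.12, 0.25, 0.5, 0.75, 0.95]
sizes = {p: sample_size_for_proportion(d=0.015, p=p, z=2.576) for p in guesses}

for p, n in sizes.items():
    print(f"P = {p:<4}  n = {n}")

# The product P * (1 - P) peaks at P = 0.5, so that guess needs the most records.
worst_p = max(sizes, key=sizes.get)
print(f"Largest sample size occurs at P = {worst_p}")
```

The sample size is symmetric around 0.5, so a guess of 0.25 and a guess of 0.75 call for the same number of records.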

**Summary**

Frantic Frank needs to randomly select some records from a database that has 400 million of them. He wants to know how many records he should select. Professor Mean suggests that confidence intervals would be a good way to summarize information from this type of random sample. He suggests selecting enough records so that the confidence intervals are reasonably narrow.

**Further reading**

**The case for confidence intervals in controlled clinical trials.** M. Borenstein. Controlled Clinical Trials 1994: 15(5); 411-28.

**The use of predicted confidence intervals when planning experiments and the misuse of power when interpreting results.** Steven Goodman. Annals of Internal Medicine 1994: 121(3); 200-206.

**Confidence limits and sample size in quarantine research.** HM Couey. Forum: Journal of Economic Entomology 1986: 79(4); 887-90.