library(tidyverse)I am working with someone who needs to compute age-adjusted rates. It is not too difficult, except if you’ve never done it before. Neither she nor I have, so I am taking the time to outline how this might be done in a simple case before we tackle her data.
Calculation formulas
The formula for age-adjustments is fairly simple. The CDC has a nice webpage that outlines this.
Age-adjusted rates are calculated by the direct method, as in:
\[\sum_{i=1}^{n}r_i (p_i/P)\]
where \(r_i\) = Rate in age group i in the population of interest, \(p_i\) = Standard population in age group i, and
\[P = \sum_{i=1}^{n}p_i\]
What is a standard population
So what, exactly is a “standard population”. It can be anything actually, but it usually is a distribution of ages that represents the distributions across an entire state or country or possibly across the whole world. A commonly used standard population is U.S. 2000 Standardized Population.
You can find links to several versions of this at the CDC webpage listed. The age breakdowns vary. One goes from
- Younger than 1
- 1–4
- 5–14
- 15–24
- 25–34
- 35–44
- 45–54
- 55–64
- 65–74
- 75–84
- 85 and older
for example. If your data has an age grouping different, then you can build your own table at the SEER (Surveillance, Epidemiology, and End Results) website, run by the National Cancer Institute. This page has a breakdown of the U.S. 2000 Standardized Population by single years.
Get data on COVID cases
To illustrate the calculations, you need a dataset. You can find a nice and fairly easy to use [dataset on COVID cases and deaths over time][ref-healthdata-nodate]. This is mostly used to look at trends for COVID, but we will use it to pull out a single date and region and calculate crude and adjusted rates by race.
The data is available as a comma-separated text file and you can feed the URL directly into the read_csv function.
f1 <- "https://data.cdc.gov/api/views/hrdz-jaxc/rows.csv"
covid_cases0 <- read_csv(
file=f1,
col_names=TRUE,
col_types="cccccnnn")
glimpse(covid_cases0)Rows: 407,126
Columns: 9
$ end_of_week <chr> "2022-11-19", "2022-02-19", "2022…
$ jurisdiction <chr> "Region 3", "Region 4", "Region 9…
$ age_group <chr> "40 - 49 Years", "65 - 74 Years",…
$ sex <chr> "Male", "Female", "Male", "Male",…
$ race_ethnicity_combined <chr> "White, NH", "Black, NH", "Hispan…
$ case_count_suppressed <dbl> 400, 625, 937, 1888, 10305, 3572,…
$ death_count_suppressed <dbl> NA, 19, NA, NA, 8, 40, NA, NA, NA…
$ case_crude_rate_suppressed_per_100k <dbl> 33.22, 99.55, 76.21, 85.67, 2305.…
$ death_crude_rate_suppressed_per_100k <dbl> NA, 3.03, NA, NA, 1.79, 1.66, NA,…
One quick note about the variable names. The need for privacy requires that you not see any very small groups in the dataset because that increases the risk that an individual person might be identifiable. In this dataset, anytime the count for cases or death goes below 5, the data are suppressed. That’s why the last variables have the word “suppressed” in their names. For the particular subset that you will use later, there are no suppressed values.
COVID data frequencies
Let’s get a few basic frequencies for the various categorical values.
covid_cases0 |>
count(jurisdiction)# A tibble: 11 × 2
jurisdiction n
<chr> <int>
1 Region 1 35057
2 Region 10 37079
3 Region 2 36435
4 Region 3 36025
5 Region 4 37771
6 Region 5 37867
7 Region 6 38114
8 Region 7 35078
9 Region 8 37096
10 Region 9 38229
11 US 38375
covid_cases0 |>
count(age_group)# A tibble: 11 × 2
age_group n
<chr> <int>
1 0 - 4 Years 36301
2 12 - 15 Years 35714
3 16 - 17 Years 35033
4 18 - 29 Years 37839
5 30 - 39 Years 37767
6 40 - 49 Years 37693
7 5 - 11 Years 36206
8 50 - 64 Years 37985
9 65 - 74 Years 37432
10 75+ Years 36853
11 Overall 38303
covid_cases0 |>
count(sex)# A tibble: 3 × 2
sex n
<chr> <int>
1 Female 135259
2 Male 134628
3 Overall 137239
covid_cases0 |>
count(race_ethnicity_combined)# A tibble: 6 × 2
race_ethnicity_combined n
<chr> <int>
1 AI/AN, NH 60047
2 Asian/PI, NH 67972
3 Black, NH 68824
4 Hispanic 69863
5 Overall 70281
6 White, NH 70139
Notice how the age groups did not sort properly. If you place a blank in front of the “0 - 4 Years” and “5 - 11 Years”, then they would sort nicely (as shown below).
Filtering down to a manageable subset
To get a snapshot of a single time point, you can use the filter function to limit the data to a single day. You can also limit the data to a single region, Region 4. This is an arbirtrary choice that represent the West North Central United States (ND, SD, NE, KS, MN, IA, MO).
You need to calculate the number of patients in each group. This is easy enough to calculate by inverting the formula
case_rate = case_count / size_per_100k
Do note that since the case rate is per 100K, the size is also computed per 100K. You could adjust this, if you wanted to, but keeping both at per 100K is actually simpler and better.
You do need to remove the “Overall” categories for age_group and race_ethnicity_combined. Keep only the “Overall” category for sex, though. In this example, you do not want any rates (crude or adjusted) separated into males and females.
You have to exclude the “AI/AN, NH” group because there is no “AI/AN, NH” group in the U.S. 2000 Standardized Population.
You should modify the age group labels with a leading blank for two of the categories. This insures that the sorted data will be in a logical order.
You should also NOT keep the variable case_crude_rate_suppressed_per_100k because you will be computing several different crude rates later.
covid_cases0 |>
filter(end_of_week == "2020-10-31") |>
filter(jurisdiction == "Region 4") |>
filter(sex == "Overall") |>
filter(age_group != "Overall") |>
filter(race_ethnicity_combined != "Overall") |>
filter(race_ethnicity_combined != "AI/AN, NH") |>
mutate(size_per_100k=case_count_suppressed/case_crude_rate_suppressed_per_100k) |>
mutate(age_group=str_replace(age_group, "0 - 4 Years", " 0 - 4 Years")) |>
mutate(age_group=str_replace(age_group, "5 - 11 Years", " 5 - 11 Years")) |>
select(
age_group,
race_ethnicity_combined,
case_count_suppressed,
size_per_100k) |>
arrange(age_group, race_ethnicity_combined) -> covid_cases1
covid_cases1# A tibble: 40 × 4
age_group race_ethnicity_combined case_count_suppressed size_per_100k
<chr> <chr> <dbl> <dbl>
1 " 0 - 4 Years" Asian/PI, NH 17 0.957
2 " 0 - 4 Years" Black, NH 273 9.26
3 " 0 - 4 Years" Hispanic 295 7.19
4 " 0 - 4 Years" White, NH 652 19.4
5 " 5 - 11 Years" Asian/PI, NH 35 1.50
6 " 5 - 11 Years" Black, NH 477 13.7
7 " 5 - 11 Years" Hispanic 570 10.1
8 " 5 - 11 Years" White, NH 1212 28.2
9 "12 - 15 Years" Asian/PI, NH 22 0.908
10 "12 - 15 Years" Black, NH 459 8.11
# ℹ 30 more rows
Calculate rates for each age group
Sum across all the race/ethnicity categories to get the numerator and denominator for the covid infection rates in each age group.
covid_cases1 |>
group_by(age_group) |>
summarize(
case_count_suppressed=sum(case_count_suppressed),
size_per_100k=sum(size_per_100k)) |>
mutate(crude_age_rate=case_count_suppressed / size_per_100k) -> age_rates
age_rates# A tibble: 10 × 4
age_group case_count_suppressed size_per_100k crude_age_rate
<chr> <dbl> <dbl> <dbl>
1 " 0 - 4 Years" 1237 36.8 33.6
2 " 5 - 11 Years" 2294 53.6 42.8
3 "12 - 15 Years" 2523 32.1 78.7
4 "16 - 17 Years" 1689 15.8 107.
5 "18 - 29 Years" 14561 103. 141.
6 "30 - 39 Years" 10402 84.2 123.
7 "40 - 49 Years" 10038 81.7 123.
8 "50 - 64 Years" 14473 129. 112.
9 "65 - 74 Years" 6235 68.5 91.0
10 "75+ Years" 4930 49.6 99.4
The rates appear to be highest for young to middle-aged adults.
Calculate rates for each race/ethnicity group
Sum across all age groups to get the numerator and denominator for race/ethnicity.
covid_cases1 |>
group_by(race_ethnicity_combined) |>
summarize(
case_count_suppressed=sum(case_count_suppressed),
size_per_100k=sum(size_per_100k)) |>
mutate(crude_race_rate=case_count_suppressed / size_per_100k) -> race_rates
race_rates# A tibble: 4 × 4
race_ethnicity_combined case_count_suppressed size_per_100k crude_race_rate
<chr> <dbl> <dbl> <dbl>
1 Asian/PI, NH 924 18.2 50.7
2 Black, NH 12211 142. 85.9
3 Hispanic 11925 89.4 133.
4 White, NH 43322 405. 107.
Hispanics have the highest COVID infection rate.
Read the U.S. Standardized Population
The age categories you need are not reflected in the CDC tables, so let’s use the population counts in the yearly dataset at SEER. Again, if you provide the URL, you can read the data directly.
f1 <- "https://seer.cancer.gov/stdpopulations/singleages.pops.tables.csv"
seer_population0 <- read_csv(
file=f1,
skip=2,
col_names=FALSE,
col_types="cnnnnnnnn")
glimpse(seer_population0)Rows: 95
Columns: 9
$ X1 <chr> "00 years", "01 years", "02 years", "03 years", "04 years", "05 yea…
$ X2 <dbl> 3794901, 3758562, 3773025, 3791001, 3869031, 3896081, 3917855, 3978…
$ X3 <dbl> 3794901, 3758562, 3773025, 3791001, 3869031, 3896081, 3917855, 3978…
$ X4 <dbl> 376321, 379990, 383179, 383741, 375833, 366757, 361038, 363440, 358…
$ X5 <dbl> 376321, 379990, 383179, 383741, 375833, 366757, 361038, 363440, 358…
$ X6 <dbl> 362725, 371898, 380119, 388730, 399411, 408927, 412709, 413664, 418…
$ X7 <dbl> 362725, 371898, 380119, 388730, 399411, 408927, 412709, 413664, 418…
$ X8 <dbl> 17917, 17802, 17701, 17613, 17536, 17470, 17414, 17366, 17327, 1729…
$ X9 <dbl> 17917, 17802, 17701, 17613, 17536, 17470, 17414, 17366, 17327, 1729…
You need to modify some of the variables so they will end up meshing well with the COVID cases data. Also, you need to remove any population count associated with other standardized population.
seer_population0 |>
rename(age_years=X1) |>
rename(population_count=X2) |>
filter(!is.na(age_years)) |>
filter(age_years != "Total") |>
mutate(age_years=str_sub(age_years, 1, 2)) |>
mutate(age_years=as.numeric(age_years)) |>
select(age_years, population_count) -> seer_population1Now calculate the proper age groups for the standardized population.
seer_population1 |>
mutate(age_group=case_when(
age_years <= 4 ~ " 0 - 4 Years",
age_years <= 11 ~ " 5 - 11 Years",
age_years <= 15 ~ "12 - 15 Years",
age_years <= 17 ~ "16 - 17 Years",
age_years <= 29 ~ "18 - 29 Years",
age_years <= 39 ~ "30 - 39 Years",
age_years <= 49 ~ "40 - 49 Years",
age_years <= 64 ~ "50 - 64 Years",
age_years <= 74 ~ "65 - 74 Years",
age_years > 74 ~ "75+ Years")) |>
group_by(age_group) |>
summarize(population_count=sum(population_count)) -> seer_age_groups
print(seer_age_groups, n=10)# A tibble: 10 × 2
age_group population_count
<chr> <dbl>
1 " 0 - 4 Years" 18986520
2 " 5 - 11 Years" 28178121
3 "12 - 15 Years" 15750921
4 "16 - 17 Years" 7865892
5 "18 - 29 Years" 43980495
6 "30 - 39 Years" 41691326
7 "40 - 49 Years" 42285022
8 "50 - 64 Years" 41185865
9 "65 - 74 Years" 18135514
10 "75+ Years" 16573966
Evaluate imbalances in age group sizes across race/ethnicity groups
You’ve done a lot, and it’s time to check to see how important the age adjustment is. Do the racial groups tend to have the same relative proportions in each age group? If so, there is less of a need for any adjustments.
covid_cases1 |>
filter(race_ethnicity_combined != "Overall") |>
filter(age_group != "Overall") |>
group_by(race_ethnicity_combined) |>
mutate(total=sum(size_per_100k)) |>
mutate(pct=100*size_per_100k/total) |>
ggplot() +
aes(x=race_ethnicity_combined, y=pct, fill=age_group) |>
geom_col(
position="stack",
color="black")There are some small differences here. The White, NH group tends to have a disproportionately larger proportion of people in the older age groups while the other groups, especially the Hispanic group, tend to skew a bit more to the younger age groups.
Calculate rates for age_groups within the Asian subgroup
To get an adjusted rate for the Asian subgroup, you need to compute a separate rate for every age group within the Asian subpopulation.
covid_cases1 |>
filter(race_ethnicity_combined=="Asian/PI, NH") |>
group_by(age_group) |>
summarize(
case_count_suppressed=sum(case_count_suppressed),
size_per_100k=sum(size_per_100k),
.groups="drop") |>
mutate(detailed_rates=case_count_suppressed / size_per_100k) -> asian_rates
print(asian_rates, n=10)# A tibble: 10 × 4
age_group case_count_suppressed size_per_100k detailed_rates
<chr> <dbl> <dbl> <dbl>
1 " 0 - 4 Years" 17 0.957 17.8
2 " 5 - 11 Years" 35 1.50 23.3
3 "12 - 15 Years" 22 0.908 24.2
4 "16 - 17 Years" 29 0.449 64.6
5 "18 - 29 Years" 251 3.16 79.4
6 "30 - 39 Years" 162 3.19 50.7
7 "40 - 49 Years" 152 2.93 51.9
8 "50 - 64 Years" 168 3.08 54.5
9 "65 - 74 Years" 62 1.30 47.6
10 "75+ Years" 26 0.747 34.8
Don’t forget what the overall rate is for the Asian subgroup.
race_rates |>
filter(race_ethnicity_combined=="Asian/PI, NH")# A tibble: 1 × 4
race_ethnicity_combined case_count_suppressed size_per_100k crude_race_rate
<chr> <dbl> <dbl> <dbl>
1 Asian/PI, NH 924 18.2 50.7
Re-combining across age groups
You already knew this, but if you tried to average the individual rates for the ten age_groups, you would get a different number than the overall rate.
asian_rates |>
summarize(simple_mean_rate=mean(detailed_rates))# A tibble: 1 × 1
simple_mean_rate
<dbl>
1 44.9
This discrepancy occurs because age_groups with a small number of patients get the same weight as age_groups with a large number of patients. To get the valid rates, you need to use a weighted mean which gives more weight to the age groups that have more patients.
asian_rates |>
summarize(weighted_mean_rate=weighted.mean(detailed_rates, size_per_100k))# A tibble: 1 × 1
weighted_mean_rate
<dbl>
1 50.7
Merge the data with the standardized population
Adjusting to the standardized population is just using weights associated with the standardized population instead of the weights found in existing data.
seer_age_groups |>
full_join(asian_rates) |>
summarize(adjusted_rate=weighted.mean(detailed_rates, population_count))Joining with `by = join_by(age_group)`
# A tibble: 1 × 1
adjusted_rate
<dbl>
1 48.7
This rate is going to be slightly different than the crude rate because the demographic proportions of the age groups in the Asian subpopulation differs from the demographic proportions of the age groups in the standardized population.
if you wanted to get adjusted rates for all four race groups at once, merge seer_age_groups with a dataset that has age_group rates for every race group. Be sure to add a group_by function before using summarize.
There is an R package, epiR, that will do the age adjustments for you. I have not tested it, but it looks pretty good.