Calculating age adjusted rates

I am working with someone who needs to compute age-adjusted rates. It is not too difficult, except if you’ve never done it before. Neither she nor I have, so I am taking the time to outline how this might be done in a simple case before we tackle her data.

Calculation formulas

The formula for age-adjustments is fairly simple. The CDC has a nice webpage that outlines this.

Age-adjusted rates are calculated by the direct method, as in:

\[\sum_{i=1}^{n}r_i (p_i/P)\]

where \(r_i\) = Rate in age group i in the population of interest, \(p_i\) = Standard population in age group i, and

\[P = \sum_{i=1}^{n}p_i\]

What is a standard population

So what, exactly is a “standard population”. It can be anything actually, but it usually is a distribution of ages that represents the distributions across an entire state or country or possibly across the whole world. A commonly used standard population is U.S. 2000 Standardized Population.

You can find links to several versions of this at the CDC webpage listed. The age breakdowns vary. One goes from

Younger than 1
1–4
5–14
15–24
25–34
35–44
45–54
55–64
65–74
75–84
85 and older

for example. If your data has an age grouping different, then you can build your own table at the SEER (Surveillance, Epidemiology, and End Results) website, run by the National Cancer Institute. This page has a breakdown of the U.S. 2000 Standardized Population by single years.

Get data on COVID cases

library(tidyverse)

To illustrate the calculations, you need a dataset. You can find a nice and fairly easy to use [dataset on COVID cases and deaths over time][ref-healthdata-nodate]. This is mostly used to look at trends for COVID, but we will use it to pull out a single date and region and calculate crude and adjusted rates by race.

The data is available as a comma-separated text file and you can feed the URL directly into the read_csv function.

f1 <- "https://data.cdc.gov/api/views/hrdz-jaxc/rows.csv"

covid_cases0 <- read_csv(
  file=f1,
  col_names=TRUE, 
  col_types="cccccnnn")

glimpse(covid_cases0)

Rows: 407,126
Columns: 9
$ end_of_week                          <chr> "2022-11-19", "2022-02-19", "2022…
$ jurisdiction                         <chr> "Region 3", "Region 4", "Region 9…
$ age_group                            <chr> "40 - 49 Years", "65 - 74 Years",…
$ sex                                  <chr> "Male", "Female", "Male", "Male",…
$ race_ethnicity_combined              <chr> "White, NH", "Black, NH", "Hispan…
$ case_count_suppressed                <dbl> 400, 625, 937, 1888, 10305, 3572,…
$ death_count_suppressed               <dbl> NA, 19, NA, NA, 8, 40, NA, NA, NA…
$ case_crude_rate_suppressed_per_100k  <dbl> 33.22, 99.55, 76.21, 85.67, 2305.…
$ death_crude_rate_suppressed_per_100k <dbl> NA, 3.03, NA, NA, 1.79, 1.66, NA,…

One quick note about the variable names. The need for privacy requires that you not see any very small groups in the dataset because that increases the risk that an individual person might be identifiable. In this dataset, anytime the count for cases or death goes below 5, the data are suppressed. That’s why the last variables have the word “suppressed” in their names. For the particular subset that you will use later, there are no suppressed values.

COVID data frequencies

Let’s get a few basic frequencies for the various categorical values.

covid_cases0 |>
  count(jurisdiction)

# A tibble: 11 × 2
   jurisdiction     n
   <chr>        <int>
 1 Region 1     35057
 2 Region 10    37079
 3 Region 2     36435
 4 Region 3     36025
 5 Region 4     37771
 6 Region 5     37867
 7 Region 6     38114
 8 Region 7     35078
 9 Region 8     37096
10 Region 9     38229
11 US           38375

covid_cases0 |>
  count(age_group)

# A tibble: 11 × 2
   age_group         n
   <chr>         <int>
 1 0 - 4 Years   36301
 2 12 - 15 Years 35714
 3 16 - 17 Years 35033
 4 18 - 29 Years 37839
 5 30 - 39 Years 37767
 6 40 - 49 Years 37693
 7 5 - 11 Years  36206
 8 50 - 64 Years 37985
 9 65 - 74 Years 37432
10 75+ Years     36853
11 Overall       38303

covid_cases0 |>
  count(sex)

# A tibble: 3 × 2
  sex          n
  <chr>    <int>
1 Female  135259
2 Male    134628
3 Overall 137239

covid_cases0 |>
  count(race_ethnicity_combined)

# A tibble: 6 × 2
  race_ethnicity_combined     n
  <chr>                   <int>
1 AI/AN, NH               60047
2 Asian/PI, NH            67972
3 Black, NH               68824
4 Hispanic                69863
5 Overall                 70281
6 White, NH               70139

Notice how the age groups did not sort properly. If you place a blank in front of the “0 - 4 Years” and “5 - 11 Years”, then they would sort nicely (as shown below).

Filtering down to a manageable subset

To get a snapshot of a single time point, you can use the filter function to limit the data to a single day. You can also limit the data to a single region, Region 4. This is an arbirtrary choice that represent the West North Central United States (ND, SD, NE, KS, MN, IA, MO).

You need to calculate the number of patients in each group. This is easy enough to calculate by inverting the formula

case_rate = case_count / size_per_100k

Do note that since the case rate is per 100K, the size is also computed per 100K. You could adjust this, if you wanted to, but keeping both at per 100K is actually simpler and better.

You do need to remove the “Overall” categories for age_group and race_ethnicity_combined. Keep only the “Overall” category for sex, though. In this example, you do not want any rates (crude or adjusted) separated into males and females.

You have to exclude the “AI/AN, NH” group because there is no “AI/AN, NH” group in the U.S. 2000 Standardized Population.

You should modify the age group labels with a leading blank for two of the categories. This insures that the sorted data will be in a logical order.

You should also NOT keep the variable case_crude_rate_suppressed_per_100k because you will be computing several different crude rates later.

covid_cases0 |>
  filter(end_of_week == "2020-10-31") |>
  filter(jurisdiction == "Region 4") |>
  filter(sex == "Overall") |>
  filter(age_group != "Overall") |>
  filter(race_ethnicity_combined != "Overall") |>
  filter(race_ethnicity_combined != "AI/AN, NH") |>
  mutate(size_per_100k=case_count_suppressed/case_crude_rate_suppressed_per_100k) |>
  mutate(age_group=str_replace(age_group, "0 - 4 Years", " 0 - 4 Years")) |>
  mutate(age_group=str_replace(age_group, "5 - 11 Years", " 5 - 11 Years")) |>
  select(
    age_group,
    race_ethnicity_combined,
    case_count_suppressed, 
    size_per_100k) |>
  arrange(age_group, race_ethnicity_combined) -> covid_cases1

covid_cases1

# A tibble: 40 × 4
   age_group       race_ethnicity_combined case_count_suppressed size_per_100k
   <chr>           <chr>                                   <dbl>         <dbl>
 1 " 0 - 4 Years"  Asian/PI, NH                               17         0.957
 2 " 0 - 4 Years"  Black, NH                                 273         9.26 
 3 " 0 - 4 Years"  Hispanic                                  295         7.19 
 4 " 0 - 4 Years"  White, NH                                 652        19.4  
 5 " 5 - 11 Years" Asian/PI, NH                               35         1.50 
 6 " 5 - 11 Years" Black, NH                                 477        13.7  
 7 " 5 - 11 Years" Hispanic                                  570        10.1  
 8 " 5 - 11 Years" White, NH                                1212        28.2  
 9 "12 - 15 Years" Asian/PI, NH                               22         0.908
10 "12 - 15 Years" Black, NH                                 459         8.11 
# ℹ 30 more rows

Calculate rates for each age group

Sum across all the race/ethnicity categories to get the numerator and denominator for the covid infection rates in each age group.

covid_cases1 |>
  group_by(age_group) |>
  summarize(
    case_count_suppressed=sum(case_count_suppressed),
    size_per_100k=sum(size_per_100k)) |>
  mutate(crude_age_rate=case_count_suppressed / size_per_100k) -> age_rates

age_rates

# A tibble: 10 × 4
   age_group       case_count_suppressed size_per_100k crude_age_rate
   <chr>                           <dbl>         <dbl>          <dbl>
 1 " 0 - 4 Years"                   1237          36.8           33.6
 2 " 5 - 11 Years"                  2294          53.6           42.8
 3 "12 - 15 Years"                  2523          32.1           78.7
 4 "16 - 17 Years"                  1689          15.8          107. 
 5 "18 - 29 Years"                 14561         103.           141. 
 6 "30 - 39 Years"                 10402          84.2          123. 
 7 "40 - 49 Years"                 10038          81.7          123. 
 8 "50 - 64 Years"                 14473         129.           112. 
 9 "65 - 74 Years"                  6235          68.5           91.0
10 "75+ Years"                      4930          49.6           99.4

The rates appear to be highest for young to middle-aged adults.

Calculate rates for each race/ethnicity group

Sum across all age groups to get the numerator and denominator for race/ethnicity.

covid_cases1 |>
  group_by(race_ethnicity_combined) |>
  summarize(
    case_count_suppressed=sum(case_count_suppressed),
    size_per_100k=sum(size_per_100k)) |>
  mutate(crude_race_rate=case_count_suppressed / size_per_100k) -> race_rates

race_rates

# A tibble: 4 × 4
  race_ethnicity_combined case_count_suppressed size_per_100k crude_race_rate
  <chr>                                   <dbl>         <dbl>           <dbl>
1 Asian/PI, NH                              924          18.2            50.7
2 Black, NH                               12211         142.             85.9
3 Hispanic                                11925          89.4           133. 
4 White, NH                               43322         405.            107.

Hispanics have the highest COVID infection rate.

Read the U.S. Standardized Population

The age categories you need are not reflected in the CDC tables, so let’s use the population counts in the yearly dataset at SEER. Again, if you provide the URL, you can read the data directly.

f1 <- "https://seer.cancer.gov/stdpopulations/singleages.pops.tables.csv"

seer_population0 <- read_csv(
  file=f1,
  skip=2,
  col_names=FALSE,
  col_types="cnnnnnnnn")

glimpse(seer_population0)

Rows: 95
Columns: 9
$ X1 <chr> "00 years", "01 years", "02 years", "03 years", "04 years", "05 yea…
$ X2 <dbl> 3794901, 3758562, 3773025, 3791001, 3869031, 3896081, 3917855, 3978…
$ X3 <dbl> 3794901, 3758562, 3773025, 3791001, 3869031, 3896081, 3917855, 3978…
$ X4 <dbl> 376321, 379990, 383179, 383741, 375833, 366757, 361038, 363440, 358…
$ X5 <dbl> 376321, 379990, 383179, 383741, 375833, 366757, 361038, 363440, 358…
$ X6 <dbl> 362725, 371898, 380119, 388730, 399411, 408927, 412709, 413664, 418…
$ X7 <dbl> 362725, 371898, 380119, 388730, 399411, 408927, 412709, 413664, 418…
$ X8 <dbl> 17917, 17802, 17701, 17613, 17536, 17470, 17414, 17366, 17327, 1729…
$ X9 <dbl> 17917, 17802, 17701, 17613, 17536, 17470, 17414, 17366, 17327, 1729…

You need to modify some of the variables so they will end up meshing well with the COVID cases data. Also, you need to remove any population count associated with other standardized population.

seer_population0 |>
  rename(age_years=X1) |>
  rename(population_count=X2) |>
  filter(!is.na(age_years)) |>
  filter(age_years != "Total") |>
  mutate(age_years=str_sub(age_years, 1, 2)) |>
  mutate(age_years=as.numeric(age_years)) |>
  select(age_years, population_count) -> seer_population1

Now calculate the proper age groups for the standardized population.

seer_population1 |>
  mutate(age_group=case_when(
    age_years <= 4  ~ " 0 - 4 Years",
    age_years <= 11 ~ " 5 - 11 Years",
    age_years <= 15 ~ "12 - 15 Years",
    age_years <= 17 ~ "16 - 17 Years",
    age_years <= 29 ~ "18 - 29 Years",
    age_years <= 39 ~ "30 - 39 Years",
    age_years <= 49 ~ "40 - 49 Years",
    age_years <= 64 ~ "50 - 64 Years",
    age_years <= 74 ~ "65 - 74 Years",
    age_years >  74 ~ "75+ Years")) |>
  group_by(age_group) |> 
  summarize(population_count=sum(population_count)) -> seer_age_groups

print(seer_age_groups, n=10)

# A tibble: 10 × 2
   age_group       population_count
   <chr>                      <dbl>
 1 " 0 - 4 Years"          18986520
 2 " 5 - 11 Years"         28178121
 3 "12 - 15 Years"         15750921
 4 "16 - 17 Years"          7865892
 5 "18 - 29 Years"         43980495
 6 "30 - 39 Years"         41691326
 7 "40 - 49 Years"         42285022
 8 "50 - 64 Years"         41185865
 9 "65 - 74 Years"         18135514
10 "75+ Years"             16573966

Evaluate imbalances in age group sizes across race/ethnicity groups

You’ve done a lot, and it’s time to check to see how important the age adjustment is. Do the racial groups tend to have the same relative proportions in each age group? If so, there is less of a need for any adjustments.

covid_cases1 |>
  filter(race_ethnicity_combined != "Overall") |>
  filter(age_group != "Overall") |>
  group_by(race_ethnicity_combined) |>
  mutate(total=sum(size_per_100k)) |>
  mutate(pct=100*size_per_100k/total) |>
  ggplot() +
  aes(x=race_ethnicity_combined, y=pct, fill=age_group) |>
  geom_col(
    position="stack",
    color="black")

There are some small differences here. The White, NH group tends to have a disproportionately larger proportion of people in the older age groups while the other groups, especially the Hispanic group, tend to skew a bit more to the younger age groups.

Calculate rates for age_groups within the Asian subgroup

To get an adjusted rate for the Asian subgroup, you need to compute a separate rate for every age group within the Asian subpopulation.

covid_cases1 |>
  filter(race_ethnicity_combined=="Asian/PI, NH") |>
  group_by(age_group) |>
  summarize(
    case_count_suppressed=sum(case_count_suppressed),
    size_per_100k=sum(size_per_100k),
    .groups="drop") |>
  mutate(detailed_rates=case_count_suppressed / size_per_100k) -> asian_rates

print(asian_rates, n=10)

# A tibble: 10 × 4
   age_group       case_count_suppressed size_per_100k detailed_rates
   <chr>                           <dbl>         <dbl>          <dbl>
 1 " 0 - 4 Years"                     17         0.957           17.8
 2 " 5 - 11 Years"                    35         1.50            23.3
 3 "12 - 15 Years"                    22         0.908           24.2
 4 "16 - 17 Years"                    29         0.449           64.6
 5 "18 - 29 Years"                   251         3.16            79.4
 6 "30 - 39 Years"                   162         3.19            50.7
 7 "40 - 49 Years"                   152         2.93            51.9
 8 "50 - 64 Years"                   168         3.08            54.5
 9 "65 - 74 Years"                    62         1.30            47.6
10 "75+ Years"                        26         0.747           34.8

Don’t forget what the overall rate is for the Asian subgroup.

race_rates |>
  filter(race_ethnicity_combined=="Asian/PI, NH")

# A tibble: 1 × 4
  race_ethnicity_combined case_count_suppressed size_per_100k crude_race_rate
  <chr>                                   <dbl>         <dbl>           <dbl>
1 Asian/PI, NH                              924          18.2            50.7

Re-combining across age groups

You already knew this, but if you tried to average the individual rates for the ten age_groups, you would get a different number than the overall rate.

asian_rates |>
  summarize(simple_mean_rate=mean(detailed_rates))

# A tibble: 1 × 1
  simple_mean_rate
             <dbl>
1             44.9

This discrepancy occurs because age_groups with a small number of patients get the same weight as age_groups with a large number of patients. To get the valid rates, you need to use a weighted mean which gives more weight to the age groups that have more patients.

asian_rates |>
  summarize(weighted_mean_rate=weighted.mean(detailed_rates, size_per_100k))

# A tibble: 1 × 1
  weighted_mean_rate
               <dbl>
1               50.7

Merge the data with the standardized population

Adjusting to the standardized population is just using weights associated with the standardized population instead of the weights found in existing data.

seer_age_groups |>
  full_join(asian_rates) |>
  summarize(adjusted_rate=weighted.mean(detailed_rates, population_count))

Joining with `by = join_by(age_group)`

# A tibble: 1 × 1
  adjusted_rate
          <dbl>
1          48.7

This rate is going to be slightly different than the crude rate because the demographic proportions of the age groups in the Asian subpopulation differs from the demographic proportions of the age groups in the standardized population.

if you wanted to get adjusted rates for all four race groups at once, merge seer_age_groups with a dataset that has age_group rates for every race group. Be sure to add a group_by function before using summarize.

There is an R package, epiR, that will do the age adjustments for you. I have not tested it, but it looks pretty good.