I have helped develop and have taught (along with other faculty in our department) three one credit hour pass/fail classes: Introduction to R, Introduction to SPSS
 and Introduction to SAS. These classes were developed back in 20142015 and they are in need of some serious updates. I will try to outline some of the updates that I think these classes need in this blog post.
The first big change is that all of these classes need to more closely adapt to the recent standards in reproducible research. We do some of this already
 but need to do more and need to make it more explicit. This means that for every data set that the students use
 they need to produce a data dictionary. And as they modify the data sets
 they need to produce a changelog file.
Most of the data sets that I use in these classes already have good data dictionaries

but I want the students to cutandpaste the relevant portions into their own text file. They should then add a bit of supplementary information

if needed. Here’s an example of a data dictionary for the first file that I use in my Introduction to R class. It is scarcely better than the data dictionary found at the second link described in this file

but it’s important for students to see the importance of creating a data dictionary that is stored locally.
~/introductiontorpart1/doc/data_dictionary_fd.txt Written by Steve Simon
See ~/introductiontorpart1/README.md for an overview of everything.
This file was downloaded from
http://www.amstat.org/publications/jse/datasets/fat.dat.txt.
A description of the file appears at
https://ww2.amstat.org/publications/jse/datasets/fat.txt.
It represents a study of body fat and body circumference measurements on 252 men. The data was first used in Penrose

K.

Nelson

A.

and Fisher

A. (1985)

“Generalized Body Composition Prediction Equation for Men Using Simple Measurement Techniques” (abstract)

Medicine and Science in Sports and Exercise, 17(2)

189.
and later described in a publication
Johnson (1996) “Fitting Percentage of Body Fat to Simple Body Measurements” Journal of Statistics Education.
and stored in the data archive for this journal. The data set is freely available for reuse for noncommercial purposes.
There are 252 rows and 19 columns in this data set. It is stored as a text file with tab delimiters. There are no missing value codes in this data set.
case

Case Number

a sequential number from 1 to 252. fat.b

Percent body fat using Brozek’s equation

457/Density  414.2 fat.s

Percent body fat using Siri’s equation

495/Density  450 dens

Density (gm/cm^3) age

Age (yrs) wt

Weight (lbs) ht

Height (inches) bmi

Adiposity index = Weight/Height^2 (kg/m^2) ffw

Fat Free Weight = (1  fraction of body fat) * Weight

using Brozek’s formula (lbs) neck

Neck circumference (cm) chest

Chest circumference (cm) abdomen

Abdomen circumference (cm) “at the umbilicus and level with the iliac crest” hip

Hip circumference (cm) thigh

Thigh circumference (cm) knee

Knee circumference (cm) ankle

Ankle circumference (cm) biceps

Extended biceps circumference (cm) forearm

Forearm circumference (cm) wrist

Wrist circumference (cm) “distal to the styloid processes”
There are no hard and fast rules about what goes in a data dictionary, but here are the elements that I included in this particular data dictionary.
 Where the data came from
 including urls and references
 if available
 A brief description of the data
 Licensing information for this data
 The format of the data
 including information about delimiters
 if appropriate
 The number of rows and columns in the data
 The codes for missing values
The data dictionary also includes an entry for each variable with the name of the variable
 a brief description
 and units of measurement
 if appropriate. There are no categorical variables in this data set
 but if there were
 you should include the values used and what they represent (e.g.
 1=male
 2=female).
A data dictionary does not have to include all of the things I listed above and it can include things that I did not include. Use your best judgement to decide how much to document in the data dictionary.
The other thing I want to emphasize in these classes is a changelog file. This file provides a historical record when new programs are written and when new data sets are created. Here’s a fictional example of a changelog file.
~/introductiontor/doc/changelog.txt
Written by Steve Simon
See ~/introductiontorpart1/README.md for an overview of everything.
## 20180220
Created src
 doc
 data
 results directories and moved files into them.
20160822
Enhanced documentation of part1a.Rmd.
20160818
Created part1a.Rmd through part1d.Rmd by splitting part1.Rmd.
20160808
Updated README.md Export fd to a text file
 body.txt
20160531
Created github repository. Created README.md Created part1.Rmd Imported data and stored it as fd in part1.RData Converted outlier in fd to NA and stored as fd1 in part1.RData Removed row with outlier from fd and stored as fd2 in part1.RData
Now
 I don’t have a changelog.txt file for any of my projects. I rely on git to produce something equivalent to a changelog file
 and pulled a small fraction of the changes documented by git into the file above.
Git does not work as well for SAS and SPSS as it does for R
 so it makes more sense to teach how to create a changelog file in all three classes.