I’m working in a team programming environment and I would have to characterize the quality of the documentation as uneven. I’m going to make the case for having detailed documentation standards at a meeting tomorrow. Here’s a general overview of what I will say.

I have been on the receiving end of lots of programs and data sets that are poorly documented and lots that are well documented. I also have helped set up and run a working group on secondary data analysis, so I have lots of experience seeing how the professionals at groups like the Centers for Disease Control and Prevention set up clear, easy-to-follow documentation for some pretty complex data sets.

So if you’re like me and you’re creating programs and data sets that other statisticians will use, you need (I will argue at my meeting) formal written programming standards. What are some of the things that you should put in your programming standards document?

Variable naming convention

One of the simplest things you need is a variable naming convention. This may seem trivial, and sometimes it is, but at a minimum a consistent naming convention will make a program easier to read. It also greatly simplifies program maintenance.

The story I tell when I talk about variable naming conventions is a true one. There was a group called Writer’s Exchange that wanted to set up a website. They named it www.writersexchange.com, which seems like a pretty good name except that you can accidentally read it as “writer sex change.”

So to avoid confusion, you need some way to easily separate words in a variable name. You can’t use blanks, because most programming languages won’t allow them, so the three common approaches are camelCaseNames, underscore_separated_names, and dot.separated.names. The actual choice varies. Microsoft likes camelCaseNames (and a closely related variant they call PascalCaseNames). Google likes underscore_separated_names for some of their programming languages, camelCaseNames for others, and a mix depending on the type of variable for still others. The Google Style Guide for R recommends dot.separated.names but says that camelCaseNames are acceptable. It recommends AGAINST underscore_separated_names, but other style guides recommend AGAINST dot.separated.names instead.

I like underscore_separated_names, but would be fine with any convention. R itself is terribly inconsistent. So, for example, the argument for handling missing values in the mean function is dot separated (na.rm). But in the table function, it is camel case (useNA).
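Here is a small sketch of that inconsistency. The ages vector is made-up data; na.rm and useNA are the actual argument names in mean and table.

```r
# Hypothetical data: four ages, one of them missing.
ages <- c(34, 51, NA, 47)

# mean() spells its missing-value argument with a dot.
mean_age <- mean(ages, na.rm = TRUE)

# table() spells its missing-value argument in camel case.
age_counts <- table(ages, useNA = "ifany")

print(mean_age)
print(age_counts)
```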

But don’t you be like R itself. If you adopt a naming convention, you should follow it religiously. Here are some of the reasons why.

First, a mix of naming conventions makes your code fragile. Suppose one programmer is accumulating events in event.counter and a second programmer starts accumulating events in event_counter. The best you can hope for is an error message telling you that one of the variables is uninitialized. If you don’t get that message, then some of your events will not get counted properly.
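A minimal sketch of both outcomes, assuming a made-up vector of event flags split between two programmers. Only the two counter names come from the example above; everything else is invented for illustration.

```r
events <- c(TRUE, FALSE, TRUE, TRUE)  # hypothetical event flags

# Programmer A counts the first half of the events.
event.counter <- 0
event.counter <- event.counter + sum(events[1:2])

# Programmer B uses an underscore. If event_counter was never created,
# R stops with an error, which is the lucky case.
first_attempt <- tryCatch({
  event_counter <- event_counter + sum(events[3:4])
  "no error"
}, error = function(e) conditionMessage(e))
print(first_attempt)

# The unlucky case: event_counter already exists from some earlier code,
# so the update succeeds and the counts are silently split in two.
event_counter <- 0
event_counter <- event_counter + sum(events[3:4])
print(c(event.counter = event.counter, event_counter = event_counter))
```

Neither counter holds the true total of three events, and nothing in the second case warns you about it.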

Second, a mix of naming conventions messes up an alphabetical listing. When you list your variables with ls(), the separator characters take part in the sort, so variables that belong together can end up far apart in the listing.
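As a hypothetical illustration (the variable names are invented for this sketch), here is how a mixed set of names sorts compared with a consistent set. The sketch calls sort() with method = "radix", which sorts character vectors in the C locale, because the order you get from ls() may depend on your locale settings.

```r
# Hypothetical names: the same four measurements, named two ways.
mixed_names <- c("visit_2_weight", "visit.1.weight",
                 "visit_1_height", "visit.2.height")
consistent_names <- c("visit_2_weight", "visit_1_weight",
                      "visit_1_height", "visit_2_height")

# In the C locale a dot sorts before an underscore, so the two
# visit 1 variables are separated by a visit 2 variable.
mixed_sorted <- sort(mixed_names, method = "radix")
print(mixed_sorted)

# With one separator throughout, each visit stays together.
consistent_sorted <- sort(consistent_names, method = "radix")
print(consistent_sorted)
```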

Finally, variable names in R can be manipulated as easily as data sets. If you name variables consistently, then you can more easily loop across variables, pull out subsets of variables, and re-order your variables.
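A short sketch of what that looks like in practice, assuming a made-up data frame whose columns all follow a measurement_visit_number pattern. The data frame and its column names are hypothetical.

```r
# Hypothetical data frame with consistently named columns.
patients <- data.frame(
  weight_visit_2 = c(71, 80),
  height_visit_1 = c(170, 182),
  weight_visit_1 = c(70, 82),
  height_visit_2 = c(170, 182)
)

# Pull out a subset of variables by matching on the shared prefix.
weight_columns <- grep("^weight_", names(patients), value = TRUE)
weights <- patients[, weight_columns]

# Re-order the variables alphabetically by name.
patients_sorted <- patients[, sort(names(patients), method = "radix")]

# Loop across the matching variables.
for (column_name in weight_columns) {
  cat(column_name, "mean:", mean(patients[[column_name]]), "\n")
}
```

If half of the columns had used dots instead of underscores, the pattern in grep would have missed them without any warning.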

Another convention that a lot of R programmers use is that they use nouns for variables and verbs for functions. An experienced R programmer will recognize a function because it is usually followed by a parenthesis, while most variables (except scalars) are followed by one or two square brackets. But less experienced R programmers, and even experienced R programmers who are sleep deprived, will get confused. So “age.in.years” might be a vector of patient ages and “calculate.age” might be a function that calculates an age from an exam date and a birth date. This also kind of makes sense because a function takes an action on one or more variables. Here’s an example:

age.in.years <- calculate.age(event.date, birth.date, units = "years")

It almost reads like an English sentence.
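To make that line runnable, here is one way calculate.age might be written. This is only a sketch: the function body, the 365.25 days-per-year approximation, and the sample dates are all my assumptions, not part of any existing package.

```r
# Hypothetical implementation: age as elapsed days, optionally
# converted to years using an average year of 365.25 days.
calculate.age <- function(event.date, birth.date, units = "years") {
  elapsed.days <- as.numeric(difftime(event.date, birth.date, units = "days"))
  switch(units,
         years = elapsed.days / 365.25,
         days = elapsed.days,
         stop("unsupported units: ", units))
}

# Made-up dates for two patients.
birth.date <- as.Date(c("1980-01-15", "2000-06-30"))
event.date <- as.Date(c("2020-01-15", "2020-06-30"))

age.in.years <- calculate.age(event.date, birth.date, units = "years")
print(age.in.years)
```

The noun on the left and the verb on the right make it obvious which name holds data and which one does the work.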

Some programmers use all upper case for constants, but I dislike this approach. It is especially bad to have upper case and lower case versions of the same name. R is case sensitive, but if you start relying on this, then you will get in trouble when you switch to coding in a language like SAS, which is not case sensitive.
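A tiny sketch of the trap, using two made-up significance levels. R happily keeps both objects.

```r
# Hypothetical values: two names that differ only by case.
alpha <- 0.05
ALPHA <- 0.01

# R treats these as two separate objects.
same_object <- identical(alpha, ALPHA)
print(same_object)
```

Code that depends on this distinction cannot be carried over name for name to a language that ignores case.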

Finally, avoid acronyms and abbreviations. A few acronyms like bmi for body mass index might be okay, but try to spell out anything that is not instantly recognizable. Abbreviations can be misread: does “temp” refer to temperature or a temporary variable? There’s also more than one way to abbreviate. Your temperature variable could easily be “temp” or “tmp” or even “t”.