P.Mean: Naming conventions for variables (created 2008-07-30)

For almost all statistical software programs, you can and should provide variable names for your data. A variable name is a short description of what resides in a column of data. You should choose a variable name that is concise and descriptive.

Names that are too short lead to terse output that is difficult to follow. Names that are too long create formatting and readability problems. In most situations, it is best to keep the variable name between 4 and 16 characters in length. There are exceptions, of course. If you only have two columns of data, and you're not sharing your output with anyone else, it's okay to label them x and y.
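The length guideline above is easy to check mechanically. Here is a minimal sketch in Python; the function name and the example names are my own illustrations, and the 4 and 16 limits are the rule of thumb from this page rather than a requirement of any software.

```python
def name_length_ok(name, shortest=4, longest=16):
    """Return True if the name falls inside the suggested length range."""
    return shortest <= len(name) <= longest


# Hypothetical variable names, chosen only to show both kinds of failure.
candidates = ["bw", "birthweight", "systolic_blood_pressure_at_enrolment"]

for name in candidates:
    verdict = "ok" if name_length_ok(name) else "take a second look"
    print(f"{name} ({len(name)} characters): {verdict}")
```

A name that fails this check is not necessarily wrong. Treat the result as a prompt to reconsider, not as an error.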

Frequently, a variable name will logically consist of two or more separate words, such as "birth weight" or "systolic blood pressure." You will have several choices for variable names.

Use spaces in the variable name. This is not a viable option in many statistical software systems. These systems will get confused by a variable with a name like "birth weight," especially if it appears in a list. Computer software tends to be simplistic and will tend to treat the above variable name as the names of two separate variables, "birth" and "weight." Some statistical software programs will allow blanks in a variable name, but may force you to use delimiters like quote marks around such a variable name.
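To see why a blank causes trouble, consider how a program might read a list of variable names. The sketch below is in Python and does not reproduce the parser of any particular statistics package; the two function names are hypothetical. The first function splits on blanks, the second honors quote marks by using the standard library's shlex module.

```python
import shlex


def parse_naive(variable_list):
    """Split a variable list on whitespace, as a simplistic parser would."""
    return variable_list.split()


def parse_quoted(variable_list):
    """Split a variable list, but keep quoted names together."""
    return shlex.split(variable_list)


# Without delimiters, "birth weight" is read as two separate variables.
print(parse_naive("id birth weight age"))

# With quote marks, the name survives as a single variable.
print(parse_quoted('id "birth weight" age'))
```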

Run the words together. You can create a compound variable name out of several separate words, and often this works well. Variable names like "birthweight" and "systolicbloodpressure" are reasonable choices. Sometimes, however, there are unintended effects. There's a story about a group called "Writers Exchange" that wanted to set up a website. As is commonly done for many websites, they just ran the words together to get www.writersexchange.com. Of course, this could easily be misread as "writer sex change." Other combinations can look awkward or cause momentary confusion. The variables "momage" and "dadage," for example, could accidentally be misconstrued as nonsense words rhyming with "homage" and "adage."
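The ambiguity in a run-together name can be made concrete by listing every way the name splits into known words. This is only a sketch: the function and the tiny vocabulary are my own, made up for the example above.

```python
def segmentations(text, vocabulary):
    """Return every way to split text into words from vocabulary."""
    if not text:
        return [[]]  # one way to split the empty string: no words at all
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in vocabulary:
            for rest in segmentations(text[end:], vocabulary):
                results.append([head] + rest)
    return results


# A hypothetical vocabulary, just large enough to show the problem.
words = {"writer", "writers", "sex", "exchange", "change"}

for reading in segmentations("writersexchange", words):
    print(" ".join(reading))
```

A compound name with more than one reading is a good candidate for one of the alternatives described on this page.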

Use mixed capitalization. Some statistical software programs will allow you to use a mixture of upper and lower case and this will allow you to create more readable versions of compound variable names. There is less confusion about "MomAge" and "DadAge" for example. Mixed capitalization is often called "CamelCase" because the words end up having several humps.
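If your names start out as separate words, the conversion to mixed capitalization can be automated. Here is a small Python sketch; the function name is hypothetical, and it assumes the words are separated by blanks.

```python
def to_camel_case(words):
    """Capitalize each blank-separated word and run the words together."""
    return "".join(word.capitalize() for word in words.split())


print(to_camel_case("mom age"))
print(to_camel_case("systolic blood pressure"))
```

Note that capitalize() also lowers the case of the remaining letters, so an abbreviation such as "BMI" would come out as "Bmi."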

Use punctuation symbols instead of spaces. You can replace blanks with special symbols that keep the meaning of a variable name clear. Two punctuation marks that I use frequently are the underscore character ("_") and the dot ("."). Some examples of this are "systolic.blood.pressure" and "birth_weight." Some symbols, such as the minus sign ("-") and the forward slash ("/"), may cause problems because they may falsely imply a subtraction or division calculation. Thus a program that encounters "systolic-blood-pressure" might try to subtract the "blood" and "pressure" variables from the "systolic" variable, and a program that encounters "birth/weight" may try to divide the "birth" variable by the "weight" variable.
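You can watch a parser make exactly this mistake. The sketch below asks Python's own parser, through the standard ast module, how it reads each candidate name; the describe function is my own helper. Other software has different rules. In particular, the dot is fine in R but Python reads it as attribute access, so check what your own software allows.

```python
import ast


def describe(name):
    """Report how Python's parser reads a would-be variable name."""
    tree = ast.parse(name, mode="eval").body
    if isinstance(tree, ast.Name):
        return "single name"
    if isinstance(tree, ast.BinOp):
        # Sub is subtraction, Div is division.
        return "arithmetic: " + type(tree.op).__name__
    return type(tree).__name__


for candidate in ["birth_weight", "systolic-blood-pressure",
                  "birth/weight", "systolic.blood.pressure"]:
    print(candidate, "->", describe(candidate))
```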

Here are some other issues to consider.

Please avoid all upper case when naming variables. This may be unavoidable in a few statistical software programs. On the Internet, all uppercase is described as the written equivalent of shouting and is considered annoying. On a more practical level, upper case is generally considered harder to read. "During repeated tests on adults, the studies indicated that the use of all caps lengthens the reading time by 9.5% to 19%. The average reader took about 12-13% more time to read all caps. That translates to 38 words/minute slower than using sentence case. Moreover, when the psychologists asked the participants for their opinion of legibility, 90% of the participants preferred lower case type." (Source: www.ca7.uscourts.gov/Rules/Painting_with_Print.pdf)

Do not let capitalization be the only feature that distinguishes between two variables. It is common in genetics to distinguish between dominant and recessive alleles using upper and lower case respectively. This can often allow complex genetic situations to be described succinctly. But visually, there is very little difference between "X" and "x," so the benefits of brevity may be outweighed by the greater risk of confusing the two variables.
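A quick scan can catch names that differ only in capitalization before they cause confusion. Here is one way to do it in Python; the function name and the list of names are hypothetical.

```python
from collections import defaultdict


def case_only_collisions(names):
    """Group names that become identical once case is ignored."""
    groups = defaultdict(set)
    for name in names:
        groups[name.lower()].add(name)
    # Keep only the groups holding more than one distinct spelling.
    return [sorted(group) for group in groups.values() if len(group) > 1]


print(case_only_collisions(["X", "x", "age", "Age", "weight"]))
```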

Do not try to squeeze every detail into a variable name. While a variable name can include details like units of measure (birth.weight.in.grams) or timing (age_at_enrolment), this can often be overkill. Most statistical software programs will provide additional documentation, such as variable labels, to help with information that does not easily fit into the variable name.
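If your software has no built-in variable labels, a simple lookup table gives much the same benefit. The Python sketch below keeps short names for analysis and stores the details separately; the names, the label text, and the label_for helper are all made up for illustration.

```python
# Short variable names mapped to longer, more detailed labels.
# These entries are hypothetical examples, not a real data dictionary.
variable_labels = {
    "birth_weight": "Birth weight in grams",
    "age": "Age at enrolment, in years",
}


def label_for(name):
    """Return the label for a variable, or the name itself if none exists."""
    return variable_labels.get(name, name)


print(label_for("birth_weight"))
print(label_for("gender"))  # no label stored, so the name is shown as is
```

The short name appears in your code, and the label appears in tables and graphs meant for other readers.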

This work is licensed under a Creative Commons Attribution 3.0 United States License. This page was written by Steve Simon and was last modified on 2010-04-01.