Blog post: Smart quotes, em dashes, and en dashes

Steve Simon

2020/03/02

If you work with text data a lot, you will encounter some characters that are sort of close to what you need, but sort of not. These include the smart quotes, em dashes, and en dashes.

Smart quotes

Programmers use a set of standard quotation marks in their work. These are called straight quotes by some and dumb quotes by others.

The straight double quote (") is one of the standard codes that works pretty much the same on any computer system.

The straight single quote (') is another standard code.

On early computer systems, these were all you had. You might also have a backwards slanting single quote (`), often called a backtick.

This was a step backwards from Gutenberg’s printing press. Actually, the regression occurred when the typewriter was invented. The limited number of keys that you could fit onto a typewriter prevented the use of a greater variety of quote marks.

Now you probably already know this, but if you want to assign a double quote to a variable, you surround it with single quotes or precede the double quote with a backslash.
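Here is a minimal sketch of both approaches. The variable names q1 and q2 are only for illustration.

```r
# Surround the double quote with single quotes...
q1 <- '"'
# ...or escape it with a backslash inside double quotes.
q2 <- "\""

# Both assignments store the same one-character string.
identical(q1, q2)   # TRUE
nchar(q1)           # 1

# print() shows the escaped form, while cat() writes the bare character.
print(q1)
cat(q1, "\n")
```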

Use the charToRaw function to see the underlying code for the double quote mark, for the single quote mark, and for the backtick.
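A short sketch of those three calls follows. The names dq, sq, and bt are illustrative only.

```r
# charToRaw() returns the bytes of a string, printed in hexadecimal.
dq <- charToRaw('"')   # double quote
sq <- charToRaw("'")   # single quote
bt <- charToRaw("`")   # backtick

dq   # 22
sq   # 27
bt   # 60
```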

The values that charToRaw returns are hexadecimal, so the 27 for the single quote is 2*16+7=39 in decimal.
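If you would rather let R do the arithmetic, something like the following should work. This is just one way to convert; the name hex_27 is made up for the example.

```r
# strtoi() parses a string in a given base; base 16 is hexadecimal.
hex_27 <- strtoi("27", base = 16L)
hex_27                        # 39

# R also accepts hexadecimal literals with a 0x prefix.
0x27 == 2 * 16 + 7            # TRUE

# Converting the raw byte for the single quote gives the same number.
as.integer(charToRaw("'"))    # 39
```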

If you know the hexadecimal code, you can convert it to the character equivalent using the \x prefix.
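In R string literals, a backslash followed by x and two hexadecimal digits inserts the byte with that code. A small sketch, with rawToChar shown as an alternative route (the name from_raw is illustrative):

```r
# "\x" followed by two hexadecimal digits gives the character with that code.
"\x22" == '"'   # TRUE
"\x27" == "'"   # TRUE
"\x60" == "`"   # TRUE

# rawToChar() does the same conversion starting from a raw value.
from_raw <- rawToChar(as.raw(0x27))
from_raw == "'"   # TRUE
```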

When personal computers started to expand beyond the limited character set, that allowed you to use the left double quote (“), the right double quote (”), the left single quote (‘), and the right single quote (’).

These quote marks are part of a larger character set known as Unicode. The charToRaw function provides a surprising result when you apply it to one of them: three bytes instead of one.
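A sketch of what that looks like for the left double quote. The string is written with an escape here only so the example does not depend on how the script file itself is encoded; left_dq is an illustrative name.

```r
# "\u201c" is an escape for the left double quote.
left_dq <- "\u201c"

charToRaw(left_dq)               # e2 80 9c: three bytes, not one
nchar(left_dq, type = "chars")   # 1 character
nchar(left_dq, type = "bytes")   # 3 bytes
```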

Surprise! When you open the world up to different typographic characters, you have to include characters with accents (é), cedillas (ç), and tildes (ñ). You have to have room for the sharp S in German (ß), the thorn in Icelandic (þ), and a whole host of new characters in Greek, Arabic, and Chinese. When you add various emojis, the list becomes quite long. The system that encodes all of these values is Unicode.
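To get a feel for how far these characters sit from the old single-byte codes, here is a rough sketch. The particular characters and the vector names are my own picks for illustration, not an official list.

```r
# One sample character from each group, written as escapes.
chars <- c(
  e_acute   = "\u00e9",
  c_cedilla = "\u00e7",
  n_tilde   = "\u00f1",
  sharp_s   = "\u00df",
  thorn     = "\u00fe",
  alpha     = "\u03b1",      # Greek
  alef      = "\u0627",      # Arabic
  zhong     = "\u4e2d",      # Chinese
  grin      = "\U0001F600"   # emoji
)

# Unicode code point of each character, shown in hexadecimal.
code_points <- sapply(chars, utf8ToInt)
format(as.hexmode(code_points))

# Number of bytes each character occupies when stored.
byte_counts <- nchar(chars, type = "bytes")
byte_counts
```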

You specify Unicode values with a \u prefix followed by the hexadecimal code, so the left double quote is \u201c.
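In R, a Unicode character can be written inside a string as a backslash, the letter u, and the hexadecimal code point. A brief sketch, using the base functions utf8ToInt and intToUtf8 to go back and forth (left_dq is an illustrative name):

```r
left_dq <- "\u201c"   # left double quote, code point 201C

# utf8ToInt() recovers the code point as an integer.
utf8ToInt(left_dq)                      # 8220
utf8ToInt(left_dq) == 0x201C            # TRUE

# intToUtf8() goes the other way.
intToUtf8(0x201C) == left_dq            # TRUE

# Format the code point in the usual U+XXXX style.
sprintf("U+%04X", utf8ToInt(left_dq))   # "U+201C"
```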

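Now you might wonder why the internal code for the left double quote (e2 80 9c) does not match the 201C shown above. The value 201C is the Unicode code point, while e2 80 9c is how that code point is stored in UTF-8, a system that spreads the bits of larger code points across several bytes. UTF-8 maintains storage efficiency and backwards compatibility with earlier coding systems, because the original 128 ASCII characters still take a single byte with their original codes.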
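To see where e2 80 9c comes from, here is a hand-rolled sketch of the three-byte UTF-8 pattern. It only covers code points from U+0800 to U+FFFF and is meant as an illustration, not a general encoder. The variable names are invented for the example.

```r
cp <- 0x201CL   # code point of the left double quote

# Three-byte pattern: 1110xxxx 10xxxxxx 10xxxxxx,
# with the x bits filled in from the code point, high bits first.
byte1 <- bitwOr(0xE0L, bitwShiftR(cp, 12L))                  # top 4 bits
byte2 <- bitwOr(0x80L, bitwAnd(bitwShiftR(cp, 6L), 0x3FL))   # middle 6 bits
byte3 <- bitwOr(0x80L, bitwAnd(cp, 0x3FL))                   # low 6 bits

manual <- as.raw(c(byte1, byte2, byte3))
manual                # e2 80 9c
charToRaw("\u201c")   # e2 80 9c

# Backwards compatibility: an ASCII character is still a single byte.
charToRaw('"')        # 22
```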

The other smart quote marks are the right double quote (\u201d), the left single quote (\u2018), and the right single quote (\u2019).
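If smart quotes have crept into text that needs straight quotes, one possible cleanup is sketched below. The helper straighten_quotes is hypothetical, written for this example; it relies on the base function chartr, which swaps characters one for one.

```r
# The four smart quotes and their code points.
smart <- c(left_double  = "\u201c", right_double = "\u201d",
           left_single  = "\u2018", right_single = "\u2019")
sapply(smart, function(ch) sprintf("U+%04X", utf8ToInt(ch)))

# Hypothetical helper: replace each smart quote with its straight counterpart.
straighten_quotes <- function(x) {
  chartr("\u201c\u201d\u2018\u2019", "\"\"''", x)
}

cleaned <- straighten_quotes("\u201cDon\u2019t panic\u201d")
cat(cleaned, "\n")
```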

There are a couple of additional codes that I should mention. Most programmers use the minus sign (-) in their coding, but there are two similar characters that you might see. The em dash (\u2014) is a longer dash. It has a width roughly equal to the letter “m” in most proportional width fonts. There is another dash, the en dash (\u2013), that is also longer than a minus sign, but about half the length of the em dash. It has a width that is roughly equal to the letter “n” in most proportional width fonts.
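The three characters may look alike on screen, but their underlying codes are quite different. A quick sketch (the names minus, en_dash, and em_dash are illustrative):

```r
minus   <- "-"
en_dash <- "\u2013"
em_dash <- "\u2014"

charToRaw(minus)     # 2d: a single byte
charToRaw(en_dash)   # e2 80 93
charToRaw(em_dash)   # e2 80 94

# Neither dash compares equal to the minus sign.
c(en_dash == minus, em_dash == minus)   # FALSE FALSE
```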

The em dash and en dash often cause confusion because they look so much like the minus sign. R does not treat them as a minus sign, though, so they cause errors when they slip into R code.
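This usually happens when code is pasted from a word processor or a web page. The sketch below shows the failure and one possible repair. The helper fix_dashes is hypothetical, and blindly replacing every dash is only sensible for code, not for prose.

```r
bad_code <- "5 \u2013 3"   # the "minus" here is actually an en dash

# parse() rejects the en dash; record whether an error occurred.
parse_fails <- tryCatch({
  parse(text = bad_code)
  FALSE
}, error = function(e) TRUE)
parse_fails   # TRUE

# Hypothetical helper: turn en and em dashes into ordinary minus signs.
fix_dashes <- function(x) chartr("\u2013\u2014", "--", x)

result <- eval(parse(text = fix_dashes(bad_code)))
result        # 2
```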

There’s a nice web page about the historical developments of computer codes for quote marks and dashes and another page that talks about computer codes in general from the perspective of an R programmer.

There are some other variants, such as the prime symbols, described in this Wikipedia page.