Smart quotes, em dashes, and en dashes

If you work with text data a lot, you will encounter some characters that are sort of close to what you need, but sort of not. These include the smart quotes, em dashes, and en dashes.

Warning: Your computer screen may or may not display some of these characters correctly. I tried to develop a system using graphic equivalents of these characters for stability, but I can’t guarantee that this page doesn’t change from one computer system to another.

I also expanded the size of the characters to try to make subtle differences more obvious.

Straight quotes

Programmers use a set of standard quotation marks in their work. These are called straight quotes by some and dumb quotes by others.

The straight double quote (") is one of the standard codes that works pretty much the same on any computer system.

The straight single quote (') is another standard code.

On early computer systems, this was all you had. You might also have had a backwards slanting single quote (`), often called a backtick.

This was a step backwards from Gutenberg’s printing press. Actually, the regression occurred when the typewriter was invented. The limited number of keys that you could fit into a typewriter prevented the use of a greater variety of quote marks.

The charToRaw function

Use the charToRaw function to see the underlying code for the double quote mark,

charToRaw('"')

the single quote mark,

charToRaw("'")

and the backtick.

charToRaw("`")

The three results are 22, 27, and 60. These values are hexadecimal, so 60 in hexadecimal is 6*16+0 = 96 in decimal.
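If you would rather let R do that arithmetic, here is a minimal sketch using base R. The variable names are just placeholders I made up for illustration.

```r
# Convert hexadecimal digits (as text) to decimal with strtoi()
backtick_decimal <- strtoi("60", base = 16L)   # 6 * 16 + 0

# Or convert the raw bytes straight to integers
backtick_from_raw <- as.integer(charToRaw("`"))

# The same conversion for all three marks at once:
# double quote, single quote, backtick
all_three <- as.integer(charToRaw("\"'`"))

print(backtick_decimal)    # 96
print(all_three)           # 34 39 96
```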

If you know the hexadecimal code, you can convert it to the character equivalent using the backslash x prefix. For example,

"\x22"

Unicode

In the 1990s, computers started to expand beyond this limited character set, using a new standard known as Unicode. The straight quote marks are part of this larger character set. In R (and in many other programming languages), you can access the larger character set with a backslash U prefix.

With Unicode, you can get characters with accents.

"\U00E9"

cedillas,

"\U00C7"

and tildes.

"\U00F1"

You have to have room for the sharp S in German,

"\U00DF"

the thorn in Icelandic,

"\U00FE"

and a whole host of new characters in Greek,

"\U03B1\U03B2\U03B3"

Arabic,

"\U062A\U0633\U062C\U0651\U0644"

and Chinese.

"\U4E2D\U6587"

When you add various emojis

"\U1F642\U1F622"

the list becomes quite long.
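The number after the backslash U is the Unicode code point, written in hexadecimal. As a rough sketch, the base R functions utf8ToInt and intToUtf8 let you move between a character and its code point. The variable names here are only for illustration.

```r
# From a character to its code point (a decimal integer)
e_acute_code <- utf8ToInt("\U00E9")

# Format the code point the way Unicode charts usually show it
e_acute_label <- sprintf("U+%04X", e_acute_code)

# A string with several characters gives one code point per character
greek_codes <- utf8ToInt("\U03B1\U03B2\U03B3")

# And back again, from code points to a string
greek_again <- intToUtf8(greek_codes)

print(e_acute_code)    # 233
print(e_acute_label)   # "U+00E9"
print(greek_codes)     # 945 946 947
```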

Unicode punctuation

Unicode also expanded punctuation. You can create the left double quote,

"\U201C"

the right double quote,

"\U201D"

the left single quote,

"\U2018"

and the right single quote.

"\U2019"

The UTF-8 representation of Unicode

The charToRaw function provides a surprising result with many Unicode characters.

charToRaw("\U201C")

Surprise!

Now you might wonder why the raw code for the left double quote (e2 80 9c) does not match the 201C shown above. It turns out that R stores this string using an encoding of Unicode called UTF-8. UTF-8 maintains storage efficiency and backwards compatibility with earlier coding systems.
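To see how the three bytes relate to 201C, here is a small sketch that undoes the UTF-8 packing by hand. It assumes the standard three-byte UTF-8 layout (1110xxxx 10xxxxxx 10xxxxxx), which is the layout used for code points from U+0800 through U+FFFF. You would not normally do this yourself, since utf8ToInt does it for you.

```r
# The raw UTF-8 bytes for the left double quote
bytes <- charToRaw("\U201C")
b <- as.integer(bytes)

# Strip the marker bits: keep the low 4 bits of the first byte
# and the low 6 bits of each continuation byte
high   <- bitwAnd(b[1], 15L)   # 15 is binary 00001111
middle <- bitwAnd(b[2], 63L)   # 63 is binary 00111111
low    <- bitwAnd(b[3], 63L)

# Reassemble the 16-bit code point
code_point <- high * 4096L + middle * 64L + low

print(bytes)                          # e2 80 9c
print(sprintf("U+%04X", code_point))  # "U+201C"
```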

This can cause a few snags. The string

"\U1F642\U1F622"

is only two characters, but the UTF-8 format stores it as

f0 9f 99 82 f0 9f 98 a2

which is eight bytes long. So be careful to know what you are asking for when you want the length of a string in Unicode.
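Here is a short sketch of the two kinds of length, using the two emojis from the Unicode section as the example string. The nchar function has a type argument that lets you say which one you want.

```r
faces <- "\U1F642\U1F622"

# Length in characters
n_chars <- nchar(faces, type = "chars")

# Length in bytes of the UTF-8 representation
n_bytes <- nchar(faces, type = "bytes")

# The byte count matches the number of raw values
n_raw <- length(charToRaw(faces))

print(n_chars)   # 2
print(n_bytes)   # 8
```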

Don’t let me scare you. The UTF-8 format is a good thing because it handles the basic English alphabet, the numbers, and most common symbols without a hitch. The UTF-8 format can easily handle older files, even those created 60 years ago. It’s when you need Greek letters, unusual symbols, emojis, etc. that things get a bit tricky.

The em dash and en dash

There are a couple of additional codes that I should mention. Most programmers use the minus sign (-) in their coding, but there are two similar characters that you might see. Here, first, is the em dash.

"\U2014"

It is called the em dash because it is about as wide as the letter “M” in a proportional width font.

The second variant is the en dash.

"\U2013"

It is also longer than a minus sign, but about half the length of the em dash. It has a width that is roughly equal to the letter “N” in most proportional width fonts.

The em dash and en dash will often cause confusion because they look so much like the minus sign. Watch for them because they can sometimes cause problems in R code.
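One way to protect yourself is to convert these characters back to their plain equivalents before you use the text. This is only a sketch: to_plain_ascii is a name I made up, and it covers just the six characters discussed on this page. I used the lowercase backslash u prefix here because it reads at most four hexadecimal digits, which keeps it from swallowing a following letter such as "a" or "d".

```r
# Hypothetical helper: swap smart quotes and long dashes for plain ones
to_plain_ascii <- function(x) {
  x <- gsub("[\u201C\u201D]", "\"", x)  # left and right double quotes
  x <- gsub("[\u2018\u2019]", "'", x)   # left and right single quotes
  x <- gsub("[\u2013\u2014]", "-", x)   # en dash and em dash
  x
}

pasted_code <- "x \u2013 1"             # an en dash posing as a minus sign
cleaned_code <- to_plain_ascii(pasted_code)

pasted_text <- "\u201CHello\u201D"      # a word wrapped in smart quotes
cleaned_text <- to_plain_ascii(pasted_text)

print(cleaned_code)   # "x - 1"
```

Whether you want to do this depends on the job. For code that was pasted from a word processor it is usually what you need, but for text meant for publication you may want to keep the typographic characters.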

If you want to learn more

There’s a nice web page about the historical developments of computer codes for quote marks and dashes and another page that talks about computer codes in general from the perspective of an R programmer.

There are some other variants, such as the prime symbols, described in this Wikipedia page.

An earlier version is here.