Character Encoding: What Is It and Why Is It Important?

Introduction

Why Do Characters Change After Translation?

When online digital content is translated from one language to another, an unfortunate and common side effect can occur when the translated content is moved to a different medium. Most website translation services watch for this while content is being created, but whether you handle translations internally or outsource them, it is important to understand this topic.

Simple sentences that contain accented letters or special formatting can appear malformed when copied from one file to another. Specific characters and punctuation elements are usually rendered as a series of question marks or random non-standard characters.

Why is this happening? Character encoding.

What is Character Encoding?

Character encoding tells computers how to interpret digital data as letters, numbers, and symbols. This process is essential for website translation and localization, as it ensures that text is displayed correctly across all devices and browsers in different languages.

Character encoding works by assigning a specific numeric value to each letter, number, or symbol. These characters are grouped together into specific “character sets” or “repertoires” that associate each one with a numerical value called a “code point.” These characters are then stored as one or more bytes. Without proper character encoding, multilingual websites may experience issues where translated content, especially with special characters, becomes unreadable or distorted.
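As a minimal sketch of the relationship between a character, its code point, and its stored bytes, the following Python snippet looks up one accented letter. The helper name describe is hypothetical and chosen only for illustration.

```python
def describe(char: str) -> dict:
    """Return the code point of one character and the bytes two encodings store for it."""
    return {
        "code_point": ord(char),          # the numeric value assigned to the character
        "utf8": char.encode("utf-8"),     # the bytes stored under UTF-8
        "latin1": char.encode("latin-1"), # the bytes stored under ISO-8859-1
    }


info = describe("ð")
print(info)  # {'code_point': 240, 'utf8': b'\xc3\xb0', 'latin1': b'\xf0'}
```

The same code point (240) is stored as one byte in ISO-8859-1 and as two bytes in UTF-8, which is why the encoding has to be known before bytes can be turned back into text.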

In order to properly render translated digital content, the correct character encoding must be used. Otherwise, text with special characters that should look like this:

Character Encoding 101 by Kaðlín Örvardóttir

may display like this:

Character Encoding 101 by Ka▯l?n ▯rvard?ttir
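A similar result can be reproduced in Python, as a rough sketch. It assumes the text is forced into an encoding (here ASCII) that has no slot for the accented letters, so each one is swapped for a placeholder. Real software may use other placeholder symbols.

```python
name = "Kaðlín Örvardóttir"

# ASCII cannot represent ð, í, Ö, or ó, so errors="replace"
# substitutes a question mark for each of them.
garbled = name.encode("ascii", errors="replace").decode("ascii")

print(garbled)  # Ka?l?n ?rvard?ttir
```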

Here’s some history on character sets, followed by some tips on how to properly leverage them for your website translation projects.

Types of Character Encoding

Until the early 1960s, computer programmers created ad-hoc conventions to represent characters internally. Some computers distinguished between upper- and lower-case letters, but most did not. The technique worked because the information was typically processed from end to end in a single machine. Hence, there was no need for standardized character encoding.

However, once information exchange became an important consideration, programmers needed a standard code that allowed data to move between different computer models. This led to the development of ASCII (American Standard Code for Information Interchange).

ASCII

In 1963, the ASCII character encoding scheme was established as a common code for representing English characters, with each character assigned a numeric value from 0 to 127.

Most modern character encodings are based on the ASCII character encoding scheme and extend it with additional characters.
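The 0 to 127 range can be checked directly. The snippet below is an illustrative sketch, and the function name fits_in_ascii is hypothetical.

```python
def fits_in_ascii(text: str) -> bool:
    """Return True when every character has a code point in the ASCII range 0-127."""
    return all(ord(ch) <= 127 for ch in text)


print(ord("A"))                      # 65
print(fits_in_ascii("Hello"))        # True
print(fits_in_ascii("Örvardóttir"))  # False
```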

ANSI/Windows-1252

When the Windows operating system emerged in 1985, a new standard known as the ANSI character set was quickly adopted. The term “ANSI” came to refer to the Windows code pages (such as Code Page 1252), even though they were never standardized by the American National Standards Institute.

Windows-1252 or CP-1252 (code page 1252) character encoding became popular with the advent of Microsoft Windows but was eventually superseded when Unicode was implemented within Windows. Unicode, which was first released in 1991, assigns a universal code to every character and symbol for all the languages in the world.

ISO-8859-1

The ISO-8859-1 (also known as Latin-1) character encoding set shares most of its characters with Windows-1252. The two differ in the byte range 0x80 to 0x9F, where Windows-1252 adds an extended subset of punctuation and business symbols (such as curly quotes and the euro sign) and ISO-8859-1 has no printable characters. This standard was easily transportable across multiple word processors and even the newly released HTML 4.

The first edition was published in 1987 and was a direct extension of the ASCII character set. While support was extensive for its time, the format was still limited.
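The relationship between ISO-8859-1 and Windows-1252 can be probed in Python, as a small sketch. It relies on Python's built-in codec names "latin-1" and "cp1252". The variable names are illustrative only.

```python
# Bytes 0xA0-0xFF decode to the same characters in both encodings.
shared = bytes(range(0xA0, 0x100))
same_upper_range = shared.decode("cp1252") == shared.decode("latin-1")

# Byte 0x80 is the euro sign in Windows-1252, but only a control
# code (not a printable character) in ISO-8859-1.
as_cp1252 = b"\x80".decode("cp1252")
as_latin1 = b"\x80".decode("latin-1")

# The euro sign therefore cannot be encoded as ISO-8859-1 at all.
try:
    "€".encode("latin-1")
    euro_fits_latin1 = True
except UnicodeEncodeError:
    euro_fits_latin1 = False

print(same_upper_range, as_cp1252, repr(as_latin1), euro_fits_latin1)
```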

UTF-8

After the debut of ISO-8859-1, the Unicode Consortium set out to develop a more universal standard for transportable character encoding.

UTF-8 (Unicode Transformation Format, 8-bit) is now the most widely used character encoding format on the web. It is one of the methods Unicode defines for mapping code points to bytes. UTF-8 was declared mandatory for website content by the Web Hypertext Application Technology Working Group (WHATWG), a community of people interested in evolving the HTML standard and related technologies.

UTF-8 was designed for full backward compatibility with ASCII.
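Backward compatibility here means that text made up only of ASCII characters produces identical bytes under both encodings. A minimal Python sketch, with illustrative variable names:

```python
ascii_text = "Character Encoding 101"

# Pure ASCII text yields byte-for-byte identical output in ASCII and UTF-8.
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")

# Characters outside ASCII take more than one byte in UTF-8.
byte_lengths = {ch: len(ch.encode("utf-8")) for ch in "Að€"}

print(same_bytes)    # True
print(byte_lengths)  # {'A': 1, 'ð': 2, '€': 3}
```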

Why is Character Encoding Important?

Each character set uses its own table of identification codes to present a specific character to a user. If you edited a document using the ISO-8859-1 character set and then saved it as a UTF-8 encoded document without declaring that the content was UTF-8, special characters and business symbols would render as unreadable text.
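The mismatch described above can be simulated in Python. This is a sketch under the assumption that the reader of the file wrongly treats UTF-8 bytes as ISO-8859-1. The variable names are illustrative.

```python
original = "Kaðlín Örvardóttir"

# The document is saved as UTF-8...
utf8_bytes = original.encode("utf-8")

# ...but read back as ISO-8859-1 because no encoding was declared.
# Every accented letter turns into two unrelated characters.
misread = utf8_bytes.decode("latin-1")
print(misread[:5])  # KaÃ°l

# Decoding with the encoding that was actually used restores the text.
correct = utf8_bytes.decode("utf-8")

# The damage is reversible only while the misread text is still intact.
repaired = misread.encode("latin-1").decode("utf-8")
```

In this particular case no information is lost, so the text can be repaired by reversing the wrong decoding. Once placeholder characters such as question marks have been substituted, the original letters cannot be recovered.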

Most modern web browsers support legacy character encodings, so a website can contain pages encoded in ISO-8859-1, Windows-1252, or any other type of encoding. The browser should properly render those characters based on the character encoding format reported by the server.

However, if the character set is not correctly declared at the time the page is rendered, the web server’s default is usually to deliver the page without specifying any character encoding format.

This forces your browser or mobile device to determine the page’s proper type of character encoding. The WHATWG specifications adopted by W3C suggest UTF-8 as the default fallback where the environment allows it. In practice, however, many browsers fall back to a legacy encoding chosen according to the user’s locale, such as Windows-1252.

Character Encoding Tips and Best Practices

To ensure your users always see the correct content on your HTML production pages, be sure that:

The content is saved in and encoded using UTF-8
The encoding type is declared within your page using a meta tag
Your server is delivering the correct data (even if the data on your page is correctly encoded in UTF-8 and declared on the page, your server may be serving the page with an HTTP header that tells the end user’s browser to read it as a different encoding)
The HTTP Content-Type header has UTF-8 specified as the encoding type
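The last three checks can be sketched in Python using only the standard library. The names charset_from_header, MetaCharsetParser, and declares_utf8 are hypothetical helpers invented for this illustration, and the sketch only recognizes the short meta charset form of the declaration.

```python
from email.message import Message
from html.parser import HTMLParser
from typing import Optional


def charset_from_header(content_type: str) -> Optional[str]:
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()  # lower-cased, or None when absent


class MetaCharsetParser(HTMLParser):
    """Record the first <meta charset="..."> declaration found in a page."""

    def __init__(self) -> None:
        super().__init__()
        self.charset: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag == "meta" and self.charset is None:
            value = dict(attrs).get("charset")
            if value:
                self.charset = value.lower()


def declares_utf8(content_type: str, html: str) -> bool:
    """Return True when both the HTTP header and the page declare UTF-8."""
    parser = MetaCharsetParser()
    parser.feed(html)
    return charset_from_header(content_type) == "utf-8" and parser.charset == "utf-8"


page = '<html><head><meta charset="utf-8"></head><body>Kaðlín</body></html>'
print(declares_utf8("text/html; charset=UTF-8", page))       # True
print(declares_utf8("text/html; charset=ISO-8859-1", page))  # False
```

The second call shows the situation described in the list: the page itself is declared correctly, but the header served alongside it names a different encoding.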

Following these specifications will facilitate website translation into various languages without the need to decode and re-encode into other character encodings across the multichannel media used on the web today.

Character Encoding and Website Localization

While character encoding is essential for website localization, it’s actually part of a process known as internationalization. Often shortened to i18n, internationalization enables applications to input, process, and output international text. For multilingual websites, it ensures web pages are successfully localized into the target languages.

In the 1990s, internationalization support meant that an application could input, store, and output data in different character sets and encodings. For example, an English-speaking user could converse with you in Latin-1 while a Russian-speaking user could do so in KOI8-R.

Yet this system presented a few problems, such as the inability to present data from two different character sets on the same page. Additionally, each piece of data needed to be tagged with the character set it was stored in. That meant the HTML and all text needed to be tagged properly.