3 Min. Read | Todd Michalik | November 18, 2020
When digital content is translated from one language to another, an unfortunate and common side effect can occur when the translated content is moved to a different medium.
Simple sentences that contain accented letters or special formatting can appear malformed when copied from one file to another. Accented characters and punctuation often render as a series of question marks or random non-standard characters.
Why is this happening?
Character encoding tells computers how to interpret digital data as letters, numbers and symbols. This is done by assigning a specific numeric value to each letter, number or symbol. There are a number of character encoding sets in use today, but the most common formats on the web are ASCII and UTF-8, an encoding of the Unicode standard.
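This mapping is easy to see in any language that exposes code points; here is a quick sketch in Python:

```python
# A character set assigns each character a numeric value (its code point).
assert ord("A") == 65    # 'A' is 65 in ASCII and every ASCII-compatible set
assert chr(233) == "é"   # value 233 is 'é' in both Unicode and ISO-8859-1
```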
In order to properly render translated digital content, the correct character set (aka character encoding) must be used.
Here’s some history on character sets, followed by some tips on how to properly leverage them for your website translation projects.
In 1963, the ASCII (American Standard Code for Information Interchange) character encoding scheme was established as a common code used to represent English characters, with each letter assigned a numeric value from 0 to 127.
Most modern character encoding subsets are based on the ASCII character encoding scheme, and support several additional characters.
When the Windows operating system emerged, a new standard was quickly adopted, known as the ANSI character set. The name “ANSI” referred to the Windows code pages (Code Page 1252), even though the character set had nothing to do with the American National Standards Institute.
Windows-1252 or CP-1252 (code page 1252) character encoding became popular with the advent of Microsoft Windows, but was eventually superseded when Unicode was implemented within Windows.
The ISO-8859-1 character encoding (also known as Latin-1) covers the characters needed for most Western European languages, including accented letters and common punctuation. The standard was easily transportable across multiple word processors, and it served as the default character set for HTML 4. Windows-1252 extends ISO-8859-1 by placing printable characters, such as curly quotes and the euro sign, in a range that ISO-8859-1 reserves for control codes.
ISO-8859-1 was a direct extension of the ASCII character set. While its support was extensive for its time, the format was still limited to 256 characters.
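That limit is easy to demonstrate: a character introduced later, such as the euro sign, simply has no slot in ISO-8859-1. A quick Python sketch:

```python
# ISO-8859-1 covers accented Latin letters...
assert "é".encode("iso-8859-1") == b"\xe9"

# ...but it has no code for the euro sign, so encoding it fails.
try:
    "€".encode("iso-8859-1")
    raise AssertionError("expected a UnicodeEncodeError")
except UnicodeEncodeError:
    pass
```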
After the debut of ISO-8859-1, the Unicode Consortium set out to develop a more universal standard for transportable character encoding.
UTF-8 is now the most widely used character encoding format on the web. UTF-8 was declared mandatory for website content by the Web Hypertext Application Technology Working Group (WHATWG), a community of people interested in evolving the HTML standard and related technologies.
UTF-8 was designed for full backward compatibility with ASCII.
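Backward compatibility here means that pure ASCII text produces byte-for-byte identical output under either encoding, as this Python sketch shows:

```python
text = "Plain ASCII text"
# ASCII characters occupy a single byte (below 0x80) in UTF-8,
# so the two encodings agree on them exactly.
assert text.encode("ascii") == text.encode("utf-8")
```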
So it’s clear that each character set uses a unique table of identification codes to present a specific character to a user. If you edit a document using the ISO-8859-1 character set and then save it as a UTF-8 encoded document without declaring that the content is UTF-8, special characters and business symbols will render as unreadable text.
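You can reproduce that kind of garbling directly by decoding bytes with the wrong character set; a Python sketch:

```python
# 'é' is two bytes in UTF-8 (0xC3 0xA9); read as ISO-8859-1,
# those two bytes become two separate characters.
garbled = "café".encode("utf-8").decode("iso-8859-1")
assert garbled == "cafÃ©"
```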
Most modern web browsers support legacy character encodings, so a website can contain pages encoded in ISO-8859-1, Windows-1252, or any other type of encoding. The browser should properly render those characters based on the character encoding format reported by the server.
However, if the character set is not correctly declared at the time the page is served, the web server usually falls back to sending the page without any specific character encoding information.
This forces your browser or mobile device to guess the page’s character encoding. Under the WHATWG specifications adopted by the W3C, the fallback encoding depends on the browser and the user’s locale; for Western-language locales it is often Windows-1252, though some browsers fall back to UTF-8 or plain ASCII.
To ensure your users are always seeing the correct content on your HTML production pages, be sure:
- Your source files are saved as UTF-8.
- The web server declares the encoding in its Content-Type header (for example, Content-Type: text/html; charset=utf-8).
- Each page declares the encoding itself with a <meta charset="utf-8"> tag in its <head>.
Following these guidelines makes it easy to translate your website into various languages without the need to decode and re-encode content into other character encodings across the multichannel media used on the web today.
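As a minimal end-to-end sketch, the snippet below writes a page whose bytes match its declared charset and reads it back losslessly (the file path and markup are illustrative only):

```python
import os
import tempfile

body = "café, 5 €"
html = f'<!DOCTYPE html>\n<meta charset="utf-8">\n<p>{body}</p>\n'

# Save the page as UTF-8, matching the charset declared in the markup.
path = os.path.join(tempfile.mkdtemp(), "index.html")
with open(path, "w", encoding="utf-8") as f:
    f.write(html)

# A browser honoring the <meta charset> tag decodes the bytes the same way.
with open(path, encoding="utf-8") as f:
    assert body in f.read()
```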