Character encoding

A system that assigns unique binary values to characters so computers can store, process, and display text.

Every character on screen, whether the letter “A”, the Arabic “ع”, or the emoji “🌍”, is stored as a number. Character encoding defines which number corresponds to which character, and how those numbers are stored as bytes. Without a shared encoding standard, text written in one system becomes unreadable in another.
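As a quick illustration (shown here in Python, though any language exposes the same idea), you can inspect the number behind each character and the bytes UTF-8 uses to store it:

```python
# Each character has a Unicode code point; UTF-8 stores it as 1-4 bytes.
for ch in ["A", "ع", "🌍"]:
    print(ch, hex(ord(ch)), ch.encode("utf-8"))
# A  0x41     b'A'                 (1 byte, same as ASCII)
# ع  0x639    b'\xd8\xb9'          (2 bytes)
# 🌍 0x1f30d  b'\xf0\x9f\x8c\x8d'  (4 bytes)
```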

For most of computing history, different regions used incompatible systems. ASCII handled English. ISO 8859-1 extended it for Western Europe. Shift JIS covered Japanese. None of them could represent multiple language families at once. Unicode was built to fix that, and UTF-8 is the encoding that implements it across the web today, accounting for around 99% of all web pages. UTF-16 remains common in certain programming environments like Java, C#, and Windows APIs, mostly for historical reasons.

🤔 Why it matters for localization

Encoding is one of the first things to get right before localization can work. Choosing the wrong encoding, or mixing encodings in the same system, causes garbled text, missing characters, or silent data corruption that can be hard to trace. This is sometimes called mojibake, the scrambled output that appears when content is decoded with the wrong encoding.

The most common mistake is failing to declare encoding explicitly, leaving systems to guess, and often guess wrong. Declare the encoding in HTML meta tags, HTTP headers, or the XML prolog; use Unicode-aware field types in databases; and specify the encoding explicitly in code that reads or writes files. Reading a UTF-8 file as ISO-8859-1, for example, produces corruption that looks fine for English but breaks as soon as accented or non-Latin characters appear.
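That UTF-8 vs. ISO-8859-1 mismatch is easy to reproduce. A minimal Python sketch: the same bytes decoded two ways, with ASCII surviving and everything else turning to mojibake:

```python
# Bytes as a UTF-8 file would store them on disk.
data = "Café".encode("utf-8")

print(data.decode("utf-8"))    # Café  (correct encoding)
print(data.decode("latin-1"))  # CafÃ©  (wrong encoding: mojibake)
```

Note that no error is raised in the second case; ISO-8859-1 happily maps every byte to *some* character, which is exactly why this corruption can stay silent until a user sees it.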

🔤 Key points about character encoding

  • UTF-8 covers the full Unicode character set while remaining backward compatible with ASCII, making it the safe default for any new project.
  • UTF-16 is used internally by Java, C#, and Windows APIs. It requires surrogate pairs for characters above U+FFFF, including most emoji.
  • A single Unicode character can sometimes be represented in multiple ways (for example, “é” as one precomposed character or as “e” plus a combining accent). Normalize to NFC at API boundaries to avoid subtle bugs in string comparison or translation search.
  • Most emoji use Unicode code points in the supplementary planes and can be composed of multiple code points joined with Zero Width Joiner (ZWJ) characters, meaning what looks like one character may be several code points and many bytes long.
  • The byte order mark (BOM) is a special Unicode character sometimes added to the start of a UTF-8 file as a signature, particularly on Windows. The Unicode standard does not recommend it for UTF-8 and it can break JSON parsing, shell scripts, or CLI tools that do not expect it. Skip it unless you have a specific reason to include it.
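The normalization pitfall above is easy to demonstrate. In Python, for instance, a precomposed “é” and “e” plus a combining accent render identically but compare as unequal until normalized:

```python
import unicodedata

precomposed = "\u00e9"    # "é" as a single code point
decomposed = "e\u0301"    # "e" + combining acute accent

print(precomposed == decomposed)  # False: different code point sequences
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
```

Applying NFC at API boundaries means string comparison and search see one canonical form regardless of how the input was typed.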
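The ZWJ point is also worth seeing concretely. The family emoji, for example, is three emoji stitched together with two Zero Width Joiners:

```python
# man + ZWJ + woman + ZWJ + girl renders as a single family glyph.
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"  # 👨‍👩‍👧

print(len(family))                  # 5 code points
print(len(family.encode("utf-8")))  # 18 bytes
```

Anything that truncates strings by byte or code point count, such as database column limits or SMS segmentation, can split such a sequence and break the glyph.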

⚠️ Common encoding issues

  • Mojibake (garbled symbols like “Ã©” in place of “é”) when file encoding does not match what the system expects
  • Missing characters appearing as boxes or question marks when fonts do not support a Unicode range
  • Database corruption from storing Unicode text in non-Unicode field types
  • UTF-8 BOM silently breaking JSON parsing or causing invisible characters in output
  • File transfer corruption when FTP or older tools use ASCII mode on UTF-8 files
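The BOM issue in particular is reproducible in one line. Python's `json.loads`, for example, rejects input that starts with a BOM, while decoding the raw bytes as `utf-8-sig` strips it transparently:

```python
import json

payload = "\ufeff" + '{"greeting": "hello"}'  # JSON text with a leading BOM

try:
    json.loads(payload)
except json.JSONDecodeError as exc:
    print("BOM breaks strict JSON parsing:", exc)

# Decoding the file's bytes as utf-8-sig removes the BOM if present.
raw = payload.encode("utf-8")
print(json.loads(raw.decode("utf-8-sig")))  # {'greeting': 'hello'}
```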

😉 How Localazy handles encoding

Localazy uses UTF-8 across all supported file formats. Translated strings in any language, including right-to-left scripts and CJK characters, are handled without special configuration. Upload and download workflows manage encoding consistently so garbled output in production is not something you need to debug manually.

See all supported file formats in Localazy.

Curious about software localization beyond the terminology?

⚡ Manage your translations with Localazy! 🌍