Character encoding

A system that assigns unique binary values to characters so computers can store, process, and display text.

Every character on screen, whether the letter “A”, the Arabic “ع”, or the emoji “🌍”, is stored as a number. Character encoding defines which number corresponds to which character, and how those numbers are stored as bytes. Without a shared encoding standard, text written in one system becomes unreadable in another.
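As a quick illustration (shown here in Python, though any language exposes the same idea), you can inspect the number behind each character and the bytes UTF-8 uses to store it:

```python
# Each character has a Unicode code point; UTF-8 stores it as 1-4 bytes.
for ch in ["A", "ع", "🌍"]:
    print(ch, hex(ord(ch)), ch.encode("utf-8"))
# A  0x41     b'A'                 (1 byte, same as ASCII)
# ع  0x639    b'\xd8\xb9'          (2 bytes)
# 🌍 0x1f30d  b'\xf0\x9f\x8c\x8d'  (4 bytes)
```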

For most of computing history, different regions used incompatible systems. ASCII handled English. ISO 8859-1 extended it for Western Europe. Shift JIS covered Japanese. None of them could represent multiple language families at once. Unicode was built to fix that, and UTF-8 is the encoding that implements it across the web today, accounting for around 99% of all web pages. UTF-16 remains common in certain programming environments like Java, C#, and Windows APIs, mostly for historical reasons.

🤔 Why it matters for localization

Encoding is one of the first things to get right before localization can work. Choosing the wrong encoding, or mixing encodings in the same system, causes garbled text, missing characters, or silent data corruption that can be hard to trace. This is sometimes called mojibake, the scrambled output that appears when content is decoded with the wrong encoding.

The most common mistake is failing to declare encoding explicitly, leaving systems to guess, and often guess wrong. Declare the encoding in HTML meta tags, HTTP headers, or the XML prolog; use Unicode-aware field types in databases; and specify the encoding explicitly in code that reads or writes files. Reading a UTF-8 file as ISO-8859-1, for example, produces corruption that looks fine for English but breaks as soon as accented or non-Latin characters appear.
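That UTF-8 vs. ISO-8859-1 mismatch is easy to reproduce. A minimal Python sketch: the same bytes decoded two ways, with ASCII surviving and everything else turning to mojibake:

```python
# Bytes as a UTF-8 file would store them on disk.
data = "Café".encode("utf-8")

print(data.decode("utf-8"))    # Café  (correct encoding)
print(data.decode("latin-1"))  # CafÃ©  (wrong encoding: mojibake)
```

Note that no error is raised in the second case; ISO-8859-1 happily maps every byte to *some* character, which is exactly why this corruption can stay silent until a user sees it.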

🔤 Key points about character encoding

  • UTF-8 covers the full Unicode character set while remaining backward compatible with ASCII, making it the safe default for any new project.
  • UTF-16 is used internally by Java, C#, and Windows APIs. It requires surrogate pairs for characters above U+FFFF, including most emoji.
  • A single Unicode character can sometimes be represented in multiple ways (for example, “é” as one precomposed character or as “e” plus a combining accent). Normalize to NFC at API boundaries to avoid subtle bugs in string comparison or translation search.
  • Most emoji use Unicode code points in the supplementary planes and can be composed of multiple code points joined with Zero Width Joiner (ZWJ) characters, meaning what looks like one character may be several code points and many bytes long.
  • The byte order mark (BOM) is a special Unicode character sometimes added to the start of a UTF-8 file as a signature, particularly on Windows. The Unicode standard does not recommend it for UTF-8 and it can break JSON parsing, shell scripts, or CLI tools that do not expect it. Skip it unless you have a specific reason to include it.
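The normalization pitfall above is easy to demonstrate. In Python, for instance, a precomposed “é” and “e” plus a combining accent render identically but compare as unequal until normalized:

```python
import unicodedata

precomposed = "\u00e9"    # "é" as a single code point
decomposed = "e\u0301"    # "e" + combining acute accent

print(precomposed == decomposed)  # False: different code point sequences
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
```

Applying NFC at API boundaries means string comparison and search see one canonical form regardless of how the input was typed.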
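The ZWJ point is also worth seeing concretely. The family emoji, for example, is three emoji stitched together with two Zero Width Joiners:

```python
# man + ZWJ + woman + ZWJ + girl renders as a single family glyph.
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"  # 👨‍👩‍👧

print(len(family))                  # 5 code points
print(len(family.encode("utf-8")))  # 18 bytes
```

Anything that truncates strings by byte or code point count, such as database column limits or SMS segmentation, can split such a sequence and break the glyph.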

⚠️ Common encoding issues

  • Mojibake (garbled symbols like “Ã©” in place of “é”) when file encoding does not match what the system expects
  • Missing characters appearing as boxes or question marks when fonts do not support a Unicode range
  • Database corruption from storing Unicode text in non-Unicode field types
  • UTF-8 BOM silently breaking JSON parsing or causing invisible characters in output
  • File transfer corruption when FTP or older tools use ASCII mode on UTF-8 files
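The BOM issue in particular is reproducible in one line. Python's `json.loads`, for example, rejects input that starts with a BOM, while decoding the raw bytes as `utf-8-sig` strips it transparently:

```python
import json

payload = "\ufeff" + '{"greeting": "hello"}'  # JSON text with a leading BOM

try:
    json.loads(payload)
except json.JSONDecodeError as exc:
    print("BOM breaks strict JSON parsing:", exc)

# Decoding the file's bytes as utf-8-sig removes the BOM if present.
raw = payload.encode("utf-8")
print(json.loads(raw.decode("utf-8-sig")))  # {'greeting': 'hello'}
```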

😉 How Localazy handles encoding

Localazy uses UTF-8 across all supported file formats. Translated strings in any language, including right-to-left scripts and CJK characters, are handled without special configuration. Upload and download workflows manage encoding consistently so garbled output in production is not something you need to debug manually.

See all supported file formats in Localazy.

Curious about software localization beyond the terminology?

⚡ Manage your translations with Localazy! 🌍