Sentence Splitting/Segmentation 

The process of dividing source content into smaller units, called segments, that a CAT tool handles individually for translation, review, and translation memory matching.

When you import a file into a CAT tool or translation management system, the tool does not present the entire document to the translator at once. Instead, it splits the text into segments (typically one sentence per segment) and displays them one by one in a side-by-side editor. The translator works through each segment individually, with the source on one side and the target on the other.

For example, a help article with five paragraphs might be split into 30 segments. Each segment gets translated, reviewed, and stored separately. If the same sentence appears again in a future project, the tool recognizes it and suggests the existing translation automatically.

⁉️ How segmentation rules work

The tool uses segmentation rules to decide where one segment ends and the next begins. The most common triggers are punctuation marks that signal the end of a sentence: full stops, question marks, and exclamation marks. Line breaks and paragraph breaks also create boundaries.

The tricky part is that punctuation does not always mean the same thing. A period can end a sentence, but it also appears in abbreviations like “Dr.” or “St.”, in decimal numbers like “3.14”, and in file names like “config.json”. A basic segmentation rule that splits on every period would incorrectly break “Dr. Smith lives in St. Paul” into three separate segments. Proper rules account for these exceptions.
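The difference between a basic rule and an exception-aware one can be sketched in a few lines of Python. This is an illustration only: the `ABBREVIATIONS` set and both function names are made up for this example, and the naive rule here splits on punctuation followed by whitespace, which is already slightly more forgiving than splitting on every period.

```python
import re

# Hypothetical exception list; real rule sets are much longer.
ABBREVIATIONS = {"Dr.", "St.", "Mr.", "Mrs."}

def naive_split(text):
    """Split after every '.', '?' or '!' that is followed by whitespace."""
    return [s for s in re.split(r"(?<=[.?!])\s+", text) if s]

def aware_split(text):
    """Like naive_split, but keep going when a piece ends in a known abbreviation."""
    segments = []
    carry = ""
    for piece in naive_split(text):
        candidate = f"{carry} {piece}" if carry else piece
        if candidate.split()[-1] in ABBREVIATIONS:
            carry = candidate          # not a real sentence end, keep collecting
        else:
            segments.append(candidate)
            carry = ""
    if carry:
        segments.append(carry)
    return segments

text = "Dr. Smith lives in St. Paul"
print(naive_split(text))   # ['Dr.', 'Smith lives in St.', 'Paul']
print(aware_split(text))   # ['Dr. Smith lives in St. Paul']
```

Because the naive rule requires whitespace after the period, "3.14" and "config.json" survive in this sketch; a rule that truly split on every period would break those too.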

Segmentation rules are stored in a standard file format called SRX (Segmentation Rules eXchange), which most CAT tools and TMS platforms support. Default rules handle the majority of content well, but projects with unusual content (legal contracts, software strings, medical documentation) sometimes need custom rules adjusted for that domain.
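An SRX rule pairs a "before break" pattern with an "after break" pattern and marks the position as a break or a no-break exception; rules are tried in order and the first one that matches decides. The Python sketch below imitates that model under simplifying assumptions. The `Rule` class, the `RULES` list, and the patterns in it are illustrative and not taken from any shipped SRX file.

```python
import re
from dataclasses import dataclass

@dataclass
class Rule:
    is_break: bool   # mirrors SRX break="yes" / break="no"
    before: str      # regex for the text just before the candidate position
    after: str       # regex for the text just after the candidate position

# Exceptions come first so they win over the general break rule.
RULES = [
    Rule(False, r"\b(?:Dr|St|Mr|Mrs)\.", r"\s"),   # no break after these abbreviations
    Rule(True, r"[.?!]", r"\s+"),                  # break after sentence-final punctuation
]

def segment(text, rules=RULES):
    segments, start = [], 0
    for pos in range(1, len(text)):
        for rule in rules:
            before_ok = re.search(rule.before + r"\Z", text[:pos])
            after_ok = re.match(rule.after, text[pos:])
            if before_ok and after_ok:
                if rule.is_break:
                    segments.append(text[start:pos].strip())
                    start = pos
                break   # the first matching rule decides this position
    tail = text[start:].strip()
    if tail:
        segments.append(tail)
    return segments

print(segment("Dr. Smith lives in St. Paul. He is 3.5 years older. Really?"))
# ['Dr. Smith lives in St. Paul.', 'He is 3.5 years older.', 'Really?']
```

Adjusting rules for a domain then amounts to adding more exception entries ahead of the general break rule, for example for legal or medical abbreviations.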

🔀 Key points about segmentation

  • Segmentation runs automatically when source files are imported, before any translation work begins.
  • In most document translation, one segment equals one sentence. In software localization, a segment is often an entire string, even if it contains multiple sentences.
  • Poor segmentation reduces translation memory leverage. If a line break splits a sentence that was previously translated as a whole unit, the tool will not recognize it as a match (even if the words are identical).
  • Segmentation rules apply to the source language. They should reflect how that language uses punctuation, not the target language.
  • Changing segmentation rules mid-project can break existing TM matches and create inconsistencies across the translated content.

🏧 The impact on costs and translation consistency

Every segment that gets translated is saved to the translation memory. The next time the same or a similar segment appears (in a new version of the same product, or in a different file from the same project), the tool surfaces the stored translation. Translators can accept it as-is or edit it, but they do not start from scratch.
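A toy version of that lookup might look like the following. It is a minimal sketch, not how any particular tool scores matches: the `tm` dictionary, the `lookup` function, and the 0.75 threshold are assumptions for illustration, and similarity is measured with Python's standard `difflib.SequenceMatcher`.

```python
from difflib import SequenceMatcher

# Toy translation memory: source segment -> stored translation (sample data).
tm = {
    "Your account has been created.": "Ihr Konto wurde erstellt.",
    "Click Save to continue.": "Klicken Sie auf Speichern, um fortzufahren.",
}

def lookup(segment, memory, threshold=0.75):
    """Return (match_type, score, translation) for the best entry, or None."""
    if segment in memory:
        return ("exact", 1.0, memory[segment])
    best_source, best_score = None, 0.0
    for source in memory:
        score = SequenceMatcher(None, segment, source).ratio()
        if score > best_score:
            best_source, best_score = source, score
    if best_source is not None and best_score >= threshold:
        return ("fuzzy", best_score, memory[best_source])
    return None   # nothing similar enough; translate from scratch

print(lookup("Click Save to continue.", tm))       # exact match
print(lookup("Click Save to continue now.", tm))   # fuzzy match, offered for editing
print(lookup("Delete this file?", tm))             # None
```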

This is why clean source content matters. A developer who uses an extra paragraph break to create visual spacing in a string file, or a writer who puts two unrelated sentences in the same bullet point, can unintentionally change where segments begin and end in ways that prevent TM reuse. Over a large project, those missed matches add up to real translation cost.
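The effect of a stray paragraph break is easy to reproduce. In this sketch the `segment` function is a deliberately simple stand-in for a real segmenter (paragraph breaks always end a segment, then sentence punctuation does), and `tm` is a hypothetical set of previously translated source segments.

```python
import re

def segment(text):
    """Toy segmenter: split on blank lines first, then on sentence punctuation."""
    segments = []
    for block in re.split(r"\n\s*\n", text):
        for sentence in re.split(r"(?<=[.?!])\s+", block.strip()):
            if sentence:
                segments.append(sentence)
    return segments

# Source segments that already have a stored translation (sample data).
tm = {"Tap the button to restart the device."}

def exact_matches(text):
    return [s for s in segment(text) if s in tm]

clean = "Tap the button to restart the device."
spaced = "Tap the button\n\nto restart the device."   # extra break added for visual spacing

print(segment(clean), exact_matches(clean))
# ['Tap the button to restart the device.'] ['Tap the button to restart the device.']
print(segment(spaced), exact_matches(spaced))
# ['Tap the button', 'to restart the device.'] []
```

The words are identical in both versions, but the second one produces two fragments that the memory has never seen, so both are treated as new text.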

🤖 Segmentation in software localization

Software strings behave differently from document sentences. A string like "Your account has been created. You can now log in." might look like two sentences, but in a resource file it is a single key. CAT tools and TMS platforms typically treat the whole string as one segment to keep it intact and give translators the full context. Splitting it would risk translating each sentence without knowing what comes before or after.
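As a rough illustration, the sketch below reads a small JSON resource file and compares one-segment-per-key handling with document-style sentence splitting. The keys, the file contents, and both function names are hypothetical.

```python
import json
import re

# Hypothetical resource file with two keys.
resource_file = """{
  "signup.success": "Your account has been created. You can now log in.",
  "signup.retry": "Something went wrong. Please try again."
}"""

def string_segments(raw_json):
    """One segment per key: the whole value stays together."""
    return list(json.loads(raw_json).items())

def sentence_segments(raw_json):
    """What document-style sentence splitting would produce instead."""
    out = []
    for key, value in json.loads(raw_json).items():
        for part in re.split(r"(?<=[.?!])\s+", value):
            out.append((key, part))
    return out

print(len(string_segments(resource_file)))     # 2 segments, one per key
print(len(sentence_segments(resource_file)))   # 4 fragments, each missing its neighbor
```

Keeping the value whole also means the translated string can be written back under the same key without having to reassemble fragments.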
