Data Category Repository (DCR)

A centralized database of standardized definitions for data categories used to describe, build, and exchange language resources.

A data category is a unit of information used to annotate linguistic data: a field such as “part of speech”, “gender”, “definition”, or “usage note”. Without a shared reference, the same field can mean different things in different tools. A Data Category Repository solves this by storing vetted, publicly accessible specifications that tools and organizations can reference consistently when exchanging terminology data.

Each data category in a DCR gets a unique persistent identifier (PID), so tools can confirm they are referencing the exact same definition regardless of what they call it internally. The primary public DCR for language resources is DatCatInfo, which replaced the earlier ISOcat registry. It serves as the reference repository for data categories used in standards from ISO/TC 37, the ISO technical committee responsible for terminology and language resources.

🔗 Why the DCR matters for localization

If a CAT tool, a terminology database, and a translation management system all reference the same DCR definitions, they can exchange data without bespoke integrations and with far less risk of losing fields along the way. Without standardized data categories, each tool uses its own proprietary structure, forcing teams to write custom converters or manually reformat data when moving between platforms.

This is especially relevant for TBX (TermBase eXchange, ISO 30042), the ISO standard for terminology exchange, which is built on data category selections drawn from a DCR. A term base exported from one CAT tool can land correctly in another with field mappings intact because both tools point to the same DCR definitions.
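
The mapping idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two tools, their internal field names, and the `example.org` PIDs are hypothetical placeholders, not real DatCatInfo identifiers or any vendor's API. Each tool declares which PID its local field names refer to, and a converter renames fields by matching PIDs.

```python
# Hypothetical PIDs: placeholders, not real DatCatInfo identifiers.
PID_PART_OF_SPEECH = "https://example.org/datcat/part-of-speech"
PID_DEFINITION = "https://example.org/datcat/definition"

# Each (hypothetical) tool maps its internal field names to the PIDs they refer to.
TOOL_A_FIELDS = {"pos": PID_PART_OF_SPEECH, "def": PID_DEFINITION}
TOOL_B_FIELDS = {"wordClass": PID_PART_OF_SPEECH, "meaning": PID_DEFINITION}


def convert_entry(entry, source_fields, target_fields):
    """Rename an entry's fields from the source tool's names to the
    target tool's names by matching their PIDs."""
    pid_to_target = {pid: name for name, pid in target_fields.items()}
    converted = {}
    for name, value in entry.items():
        pid = source_fields.get(name)
        if pid is None or pid not in pid_to_target:
            # No shared data category: fail loudly instead of dropping data.
            raise KeyError(f"No shared data category for field {name!r}")
        converted[pid_to_target[pid]] = value
    return converted


entry_from_a = {"pos": "noun", "def": "A unit of information used to annotate linguistic data."}
entry_for_b = convert_entry(entry_from_a, TOOL_A_FIELDS, TOOL_B_FIELDS)
print(entry_for_b)
```

Neither tool needs to know the other's field names; both only need to agree on the PIDs.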

🏗️ How a DCR is structured

A DCR is not a flat list. Each entry is a data category specification that includes a canonical name, a definition, permitted values, and a PID. Tools reference the PID rather than the name to avoid ambiguity across systems.
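
As a rough sketch, a data category specification could be modeled like this in Python. The class name, the field layout, the `example.org` PID, and the definition wording are all assumptions made for illustration; they do not reproduce the DatCatInfo data model or any official entry.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataCategorySpec:
    """Illustrative model of one repository entry (not an official schema)."""
    pid: str                      # persistent identifier that tools reference
    name: str                     # canonical name
    definition: str
    permitted_values: tuple = ()  # assumption: empty tuple means free text

    def accepts(self, value):
        """Return True if the value is allowed for this data category."""
        return not self.permitted_values or value in self.permitted_values


# Hypothetical entry with a placeholder PID and illustrative wording.
GENDER = DataCategorySpec(
    pid="https://example.org/datcat/grammatical-gender",
    name="grammatical gender",
    definition="Grammatical class of a word (illustrative wording only).",
    permitted_values=("masculine", "feminine", "neuter"),
)

# Entries are looked up by PID rather than by name.
registry = {GENDER.pid: GENDER}
spec = registry["https://example.org/datcat/grammatical-gender"]
print(spec.name, spec.accepts("feminine"), spec.accepts("plural"))
```

Keying the registry by PID mirrors the point above: two tools may label the field differently, but the identifier stays the same.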

Subsets of DCR entries can be grouped into data category selections. These selections, combined with a data model, define an application-specific language resource. In TBX, a selection of terminology-related data categories forms a module, and modules are combined to define a TBX dialect.
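
The relationship between selections, modules, and dialects can be sketched as nested sets. The module names, dialect name, and PIDs below are hypothetical and chosen only to show the shape of the idea; they are not taken from the TBX standard.

```python
from dataclasses import dataclass

BASE = "https://example.org/datcat/"  # placeholder PID prefix (assumption)


@dataclass(frozen=True)
class Module:
    """A named selection of data categories, referenced by PID."""
    name: str
    category_pids: frozenset


@dataclass(frozen=True)
class Dialect:
    """A dialect is defined by the modules it combines."""
    name: str
    modules: tuple

    def allowed_pids(self):
        # Union of every module's data category selection.
        return frozenset().union(*(m.category_pids for m in self.modules))


def unsupported_fields(dialect, used_pids):
    """Return the PIDs used in a resource that the dialect does not include."""
    return sorted(set(used_pids) - dialect.allowed_pids())


core = Module("core-example", frozenset({BASE + "term", BASE + "definition"}))
grammar = Module("grammar-example", frozenset({BASE + "part-of-speech", BASE + "grammatical-gender"}))
dialect = Dialect("Example-Dialect", (core, grammar))

print(unsupported_fields(dialect, [BASE + "term", BASE + "register"]))
```

A check like `unsupported_fields` is one way a tool could flag fields that fall outside the dialect it claims to support before exporting.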

📝 Common data category examples

  • Terminology metadata: term status, definition, context, usage notes, subject field
  • Translation memory fields: source segment, target segment, creation date, last modified date, translator ID
  • Lexical resource categories: part of speech, grammatical gender, register, domain
  • Linguistic annotation: token, lemma, morphological features, syntactic role

📌 Key points about the DCR

  • A DCR defines what data fields mean. It does not store terminology data itself.
  • Each entry has a PID that tools reference to confirm they are using the same definition.
  • TBX dialects are defined by modules, each of which bundles a set of data category entries drawn from a DCR.
  • Its scope covers terminology management, corpus annotation, NLP resources, and lexicography.
  • The ISO/TC 37 DCR was officially launched in 2008 under the name ISOcat and later transitioned to DatCatInfo.
  • The DCR is no longer normatively owned by ISO; it is now maintained as a community-driven reference.

🤔 DCR vs. glossary

A glossary stores the actual terms and translations a team uses in a project. A DCR defines the metadata structure used to describe those terms: what fields exist, what they mean, and what values they accept. A glossary is content; a DCR is the schema behind the tools that manage that content.
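
The content-versus-schema split can be shown with a small sketch. The field names, permitted values, and glossary entries are made up for illustration: `SCHEMA` plays the role of the definitions a DCR provides, and `GLOSSARY` is the content checked against it.

```python
# Schema side (what a DCR describes): which fields exist and what they accept.
# Field names and value lists are illustrative assumptions.
SCHEMA = {
    "term": None,  # None means free text
    "part_of_speech": {"noun", "verb", "adjective"},
    "term_status": {"preferred", "admitted", "deprecated"},
}

# Content side (what a glossary stores): the actual terms.
GLOSSARY = [
    {"term": "data category", "part_of_speech": "noun", "term_status": "preferred"},
    {"term": "datcat", "part_of_speech": "noun", "term_status": "slang"},
]


def check_entry(entry, schema):
    """Return a list of problems found when checking an entry against the schema."""
    problems = []
    for field, value in entry.items():
        if field not in schema:
            problems.append(f"unknown field: {field}")
        elif schema[field] is not None and value not in schema[field]:
            problems.append(f"invalid value for {field}: {value}")
    return problems


for entry in GLOSSARY:
    print(entry["term"], check_entry(entry, SCHEMA))
```

The glossary can grow or change without touching the schema, and the schema can be shared across tools without carrying any terms.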

📐 The standard behind the DCR

The DCR concept is governed by ISO 12620, which specifies how data category repositories should be created and maintained. ISO 12620 was split into two parts in 2022: Part 1 covers data category specifications, and Part 2 covers repository requirements. Earlier editions such as ISO 12620:2009 and ISO 12620:2019 have been withdrawn.

Note: Always check the official ISO catalogue for the current status of ISO 12620.
