Archives: April 26, 2019

Why am I losing context matches when I move my TM from one tool to another using TMX?

Current TM systems often save not only the segment you translated, but also some context for it. The context is used to improve the matching. This is why you might see matches called CM (context match), 101% or 102%, ICE (in-context exact) or similar. They show that not only the segment you are working on right now is the same as in the TM, but its context is the same as well.

Unfortunately, the way this context information is saved to the TM and what exactly is saved as context is not standardized.

This means that tool A will not be able to read, interpret and use the context information from tool B. You will only receive a 100% match instead of a context match.
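To make the effect concrete, here is a minimal sketch of a TM lookup that upgrades an exact match to a context match only when the stored context is readable and identical. The data layout, the `lookup` function and the use of 101 as the context match value are assumptions for illustration, not how any particular tool works internally.

```python
# Hypothetical in-memory TM: each entry stores the source, the target and
# the source segment that preceded it when it was translated ("prev").
TM_WITH_CONTEXT = [
    {"source": "This is sentence two.",
     "target": "Dies ist Satz zwei.",
     "prev": "This is sentence one."},
]

# The same entry after moving it to another tool that could not
# interpret the context information, so the context is effectively gone.
TM_WITHOUT_CONTEXT = [
    {"source": "This is sentence two.",
     "target": "Dies ist Satz zwei.",
     "prev": None},
]


def lookup(tm, source, prev_source):
    """Return 101 for a context match, 100 for an exact match, None for no exact match."""
    best = None
    for entry in tm:
        if entry["source"] != source:
            continue  # this sketch only handles exact matches
        if entry["prev"] is not None and entry["prev"] == prev_source:
            return 101  # same segment AND same preceding segment
        best = 100  # same segment, context missing or different
    return best


print(lookup(TM_WITH_CONTEXT, "This is sentence two.", "This is sentence one."))
print(lookup(TM_WITHOUT_CONTEXT, "This is sentence two.", "This is sentence one."))
```

The first call reports a context match, the second only a 100% match, although the segment and its translation are identical in both TMs.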

Here are examples of how some tools save their context:

The sample text was translated into the TMs and the TMs were exported to TMX (Translation Memory Exchange format).

Sample text:

This is sentence one.

This is sentence two.

This is sentence three.

What is in the TMs:

Context for the second sentence in the TMX file from memoQ

The source segments before and after the actual segment are saved as context.

Context for the second sentence in the TMX file from SDL Trados Studio 2017

A hash code with information about the previous segment and the structure of the current segment (heading, footnote, content of a cell…) is saved as context.

Context for the second sentence in the TMX file from SDL Trados Studio 2019

A hash code plus explicit text of source and target segment before the actual segment is saved as context.

Also, the place where the context information is saved can differ: within the tuv area (translation unit variant = language) or before it. And the names of the attributes of the prop element are also different (x-Context versus x-context-pre and x-context-post).

This shows why it is not possible to re-use context information between tools.
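If you want to check what context information a TMX export actually contains, a small script can list it. The following is a rough sketch using Python's standard XML parser. The two TMX fragments are heavily simplified and invented for illustration: only the prop names are taken from the text above, while the hash value, the placement of the props and the helper name `context_props` are assumptions.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical export with explicit neighbouring segments,
# stored inside the source-language <tuv>.
TMX_A = """<tmx version="1.4"><header/><body>
<tu>
  <tuv xml:lang="en">
    <prop type="x-context-pre">This is sentence one.</prop>
    <prop type="x-context-post">This is sentence three.</prop>
    <seg>This is sentence two.</seg>
  </tuv>
  <tuv xml:lang="de"><seg>Dies ist Satz zwei.</seg></tuv>
</tu>
</body></tmx>"""

# Simplified, hypothetical export with a hash code stored before the <tuv> elements.
# The hash value is a placeholder, not real tool output.
TMX_B = """<tmx version="1.4"><header/><body>
<tu>
  <prop type="x-Context">123456789</prop>
  <tuv xml:lang="en"><seg>This is sentence two.</seg></tuv>
  <tuv xml:lang="de"><seg>Dies ist Satz zwei.</seg></tuv>
</tu>
</body></tmx>"""


def context_props(tmx_text):
    """Collect every <prop> whose type mentions 'context', wherever it sits in a <tu>."""
    root = ET.fromstring(tmx_text)
    found = {}
    for tu in root.iter("tu"):
        for prop in tu.iter("prop"):  # searches the <tu> and its <tuv> children
            prop_type = prop.get("type", "")
            if "context" in prop_type.lower():
                found[prop_type] = (prop.text or "").strip()
    return found


print(context_props(TMX_A))
print(context_props(TMX_B))
```

The two exports share no prop name and store different kinds of values (readable text versus a hash), so a tool that expects one form has nothing it can use in the other.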

Why do match values differ?

After talking about the things that can produce different word counts, we should also look at what can cause different match values.

Even with the same file and the same TM, the analysis results can differ, because the settings that influence the match values are usually project-based settings.

Let us take penalties first. A penalty can be applied to matches that come from a specific TM, to matches whose metadata differs from that of your current project, or even to segments saved with a certain user name or user role. This means that the segment shows up with a lower match value than the “real” match value it would otherwise have.

There are many reasons to apply a penalty:

  • The TM has been provided by a client and has not been created by yourself, so you cannot vouch for its quality.
  • The material in the TM is old or comes from an alignment (most tools will apply a penalty for alignment segments automatically).
  • You have decided to start a new, fresh TM and use the existing TM as a reference in the background.
  • The content was saved to the TM by a certain person (maybe by an intern who did an alignment and was not very careful when aligning the segments) or with a certain role (you want to trust segments confirmed by a reviewer more than those confirmed by a translator).
  • The content was translated for a different subject matter area and this information was saved to the TM as metadata as well (the TM contains translations from marketing, but you now want to translate a contract).
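The way such penalties add up can be sketched in a few lines. Everything in this example is hypothetical: the setting names, the penalty sizes and the `penalized_match` function are made up to show the principle, and real tools differ in which penalties they offer and how they combine them.

```python
from dataclasses import dataclass


@dataclass
class TMHit:
    raw_match: int                  # match value before penalties, in percent
    tm_name: str                    # TM the hit comes from
    from_alignment: bool = False    # segment pair was created by an alignment
    saved_by_role: str = "translator"


# Hypothetical project settings (penalties in percentage points).
PENALTIES = {
    "tm": {"client_tm": 2},                     # TM provided by the client
    "alignment": 5,                             # aligned material
    "role": {"translator": 1, "reviewer": 0},   # trust reviewers more
}


def penalized_match(hit, settings=PENALTIES):
    """Subtract all applicable penalties from the raw match value."""
    penalty = settings["tm"].get(hit.tm_name, 0)
    if hit.from_alignment:
        penalty += settings["alignment"]
    penalty += settings["role"].get(hit.saved_by_role, 0)
    return max(hit.raw_match - penalty, 0)


aligned_client_hit = TMHit(100, "client_tm", from_alignment=True)
reviewed_own_hit = TMHit(100, "own_tm", saved_by_role="reviewer")
print(penalized_match(aligned_client_hit))  # 100 - (2 + 5 + 1) = 92
print(penalized_match(reviewed_own_hit))    # no penalty applies: 100
```

Two projects that use the same file and the same TM but different penalty settings would therefore report different match values for the same segment.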

Then, there are filter settings. Usually, applying a filter means applying a penalty. But it could also be that certain segments do not appear at all, because the filter does not permit segments with different metadata, from a TM with a specific name or from a specific user.

Still another reason could be that the segmentation rules don’t contain all abbreviations. The text is then split after the period of an unknown abbreviation, which results in two segments in the document where there might be just one segment in the TM (maybe the translator joined the segments during translation, creating one segment in the TM but not updating the segmentation rules).
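The effect of a missing abbreviation can be reproduced with a toy segmenter. This is only a sketch under simple assumptions: the `segment` function and its single rule (break after ".", "!" or "?" followed by whitespace, unless the preceding word is a listed abbreviation) are invented for illustration and are far simpler than the segmentation rules of real tools.

```python
import re


def segment(text, abbreviations=()):
    """Split text into segments after sentence-final punctuation,
    except when the word before the break is a listed abbreviation."""
    segments = []
    start = 0
    for match in re.finditer(r"[.!?]\s+", text):
        candidate = text[start:match.start() + 1]
        last_word = candidate.split()[-1]
        if last_word in abbreviations:
            continue  # known abbreviation: do not break here
        segments.append(candidate.strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        segments.append(tail)
    return segments


text = "Please send approx. ten copies."
print(segment(text))                             # abbreviation unknown
print(segment(text, abbreviations=("approx.",)))  # abbreviation listed
```

Without the abbreviation in the rules, the sentence is split into "Please send approx." and "ten copies.", and neither half gets a good match against a TM that contains the joined sentence as one segment.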

And another reason could be the use of different TM tools. As the way match values are calculated differs from tool to tool, an 82% match in tool 1 can very well be an 80% match in tool 2 and an 85% match in tool 3. The match values can actually differ quite a lot, depending on what the segments contain in the way of tags etc.
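The actual formulas used by TM tools are generally not published, so the following is only an illustration of the principle and not the method of any real tool. It computes a match value from the same edit distance algorithm in two ways, once counting changed words and once counting changed characters. The function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (of words or of characters)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution or match
        prev = curr
    return prev[-1]


def by_word(segment):
    return segment.split()


def by_char(segment):
    return list(segment)


def match_percent(source, tm_source, unit):
    """Similarity in percent, relative to the longer of the two segments."""
    a, b = unit(source), unit(tm_source)
    longest = max(len(a), len(b)) or 1
    return round(100 * (1 - edit_distance(a, b) / longest))


new_segment = "This is a short text."
tm_segment = "This is a nice text."
print(match_percent(new_segment, tm_segment, by_word))
print(match_percent(new_segment, tm_segment, by_char))
```

For this pair, the word-based variant gives 80 and the character-based variant gives 76, although both compare exactly the same two segments. Real tools additionally weigh tags, numbers, formatting and capitalization in their own ways, which widens the gap further.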

Here are some examples of differing match values:

Tool 1 shows 70%, tool 2 shows 89%.

The difference is one full word (short -> nice).

Tool 1 shows 95%, tool 2 shows 92%.

The differences are the number, the formatting, the capitalization and the spacing.

And to make it even more complex, the examples show that it is not necessarily the case that one tool always shows lower match values than the other 🙂