CAT Tools and NMT

Quality checks need to become a living system

Recently I was asked to evaluate the output of an NMT system to find out whether sentence-based or paragraph-based translation would provide better results.

The small number of sample texts I had suggested that paragraph-based translation could be slightly better, because of the larger context. But the inconsistencies in terminology even with larger segments showed that the client would need to invest some time and effort into setting up a good terminology database to check for their specific word usage (they were not using a system trained on their own material, but a general system).

In addition, there were some situations for which we didn’t have automated checks in the CAT tool yet and would have to create them.

One example was that the source segment contained a number. The number was translated correctly, but suddenly the currency EURO was appended to the number. In the source text there was no currency mentioned (and as it was a Swiss text, it probably would have had to be CHF not EURO).

But still, when you know what kind of mistakes can happen, you can come up with checking routines (most probably with regular expressions) for that.
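A check for the currency case described above could be sketched roughly as follows. This is only a minimal Python sketch, not a feature of any CAT tool: the function name added_currencies and the short currency list are assumptions made for illustration, and a real check would need a fuller list and would have to treat "EUR", "EURO" and "€" as equivalent.

```python
import re

# Hypothetical, deliberately short list of currency markers.
CURRENCY = re.compile(r"\b(?:EUR|EURO|CHF|USD)\b|[€$£]", re.IGNORECASE)

def added_currencies(source: str, target: str) -> set[str]:
    """Return currency markers found in the target but not in the source."""
    in_source = {m.upper() for m in CURRENCY.findall(source)}
    in_target = {m.upper() for m in CURRENCY.findall(target)}
    return in_target - in_source

# The target suddenly contains a currency that the source never mentioned.
print(added_currencies("Der Preis beträgt 500.", "The price is 500 EURO."))
```

A segment would be flagged for review whenever the returned set is not empty.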

But then, sometime afterwards I attended a session on neural machine translation (Thanks to Moni Höge, who did a great job explaining the workings of those systems). And one of the things she said made me a bit uneasy. What she said was that when you train an existing NMT system with new material, the type of mistakes the system makes can change.

That basically means that we will have to check NMT output again and again for new types of mistakes and create new types of checks to catch these mistakes. The QA check will have to become a living system that needs to adapt to the current output of the NMT system.

This could mean that the time spent on finding out what new mistakes the machine is making and defining them for Quality checking in TM tools takes up some of the time that we want to save by using machine translation.

Quality checking will then need to become a living system and adapt to the NMT output continuously.

What I find intriguing (and also a bit scary) about NMT is the unpredictability of the outcome, as we don’t know exactly what is happening inside that NMT black box. 🙂


Details matter – Review process outside a translation tool

Even when two tools have the same kind of feature, they do not necessarily work the same way.

A translation tool is used by the translator, but not necessarily by the person who reviews the translation. These might be subject matter experts who don’t have access to a translation tool of their own.

Because of this, translation tools may provide a format that can be handled outside of a translation tool. This could be a browser-based view of the source and target language segments, but it could also be a “simple” Word-type document.

But beware, these documents don’t always behave the same way even though they are created for the same purpose.

Let us compare the review document from SDL Trados Studio and the bilingual RTF table from memoQ.

They look pretty much the same, but here are some differences:

1. The file format

  • SDL Trados Studio produces a DOCX file, which needs to be handled in Microsoft Office 2007 or later. Anything else could destroy the XML header of the file and prevent you from importing the reviewed file again. The file name must not be changed and the file format needs to stay DOCX.
  • memoQ produces an RTF file (which can be handled in any text editor that can open and save RTF). The file name can be changed, but the file format needs to be RTF for back import.

2. The structure of the file

  • SDL Trados Studio shows a table with segment number, status, source segment and target segment. Comments use the commenting feature in Word.
  • memoQ shows a table with segment number, source segment, target segment, comment and status.

3. The process

  • In SDL Trados Studio, the file is a REVIEW file, i.e. changed text can only be imported back into the Studio project if the target language column already contains text. If you want to use the file for translation, the source text first needs to be copied to the target column, where it can then be overwritten with the translation.
  • In memoQ, the file can be used for REVIEW or TRANSLATION. The target column can be filled or empty. Either way, what is entered in the target column will be imported back into the memoQ project.

And these are just the most important differences between these two formats.


Historical knowledge – what is tw4win?

If you’ve never worked with Trados Workbench (Trados up to version 2007), you might wonder why tools like SDL Trados Studio or memoQ mention the styles tw4winExternal and tw4winInternal in the filter settings for importing Word files.

tw = Translator’s Workbench, 4 = for and win = Windows (yes, there were also DOS versions of the first tools)

The Trados versions up to Trados 2007 were using Word (and later also TagEditor) for translation. This meant that everything you wanted to translate needed to be in a Word format like DOC or RTF.

To show both the source and the target language within one document, the tool separated them with purple characters that had a special style (tw4winMark).

Anything that should not be touched during translation could be marked up with a style named tw4winExternal (Word paragraph style), while tw4winInternal (Word character style) would be used to mark up elements in the text that should be treated as tags (comparable to the inline tags we use nowadays).

So what does that mean for us today? Well, you could still use a style called tw4winExternal to hide text from a Word file from translation in our current translation tools.

Here are the settings in SDL Trados Studio and memoQ.

These styles were also used in Wordfast Classic and the bilingual Word format from Translator’s Workbench was de facto one of the first “standard” formats to exchange bilingual translation files between users and tools.

—–

Although it does not happen very often these days, I recently received a Word file that had probably been re-used and edited for more than 10 years. SDL Trados Studio refused to import the file and gave a message that the file had the styles of the bilingual Trados format, but did not seem to be set up correctly.

This was because the styles list still contained some tw4win styles from previous processing with Translator’s Workbench even though they were not used in the document any longer. By deleting those styles, we were able to open the file in our translation tool again.
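If you want to find such leftover styles before importing, a rough Python sketch could look like this. It assumes the file is a DOCX (a ZIP archive whose style definitions are stored in word/styles.xml); the function name find_tw4win_styles and the sample XML are made up for illustration. It only lists the styles and does not delete them.

```python
import xml.etree.ElementTree as ET

# Namespace used by WordprocessingML elements in word/styles.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

def find_tw4win_styles(styles_xml: str) -> list[str]:
    """Return the names of all styles whose name starts with 'tw4win'."""
    root = ET.fromstring(styles_xml)
    found = []
    for style in root.iter(f"{{{W_NS}}}style"):
        name = style.find(f"{{{W_NS}}}name")
        if name is not None:
            value = name.get(f"{{{W_NS}}}val", "")
            if value.lower().startswith("tw4win"):
                found.append(value)
    return found

# Hypothetical, heavily shortened styles part for demonstration only.
SAMPLE_STYLES = (
    f'<w:styles xmlns:w="{W_NS}">'
    '<w:style w:type="paragraph" w:styleId="Normal"><w:name w:val="Normal"/></w:style>'
    '<w:style w:type="character" w:styleId="tw4winMark"><w:name w:val="tw4winMark"/></w:style>'
    '<w:style w:type="paragraph" w:styleId="tw4winExternal"><w:name w:val="tw4winExternal"/></w:style>'
    '</w:styles>'
)

print(find_tw4win_styles(SAMPLE_STYLES))
```

For a real file, the XML could be read with zipfile.ZipFile(path).read("word/styles.xml").decode("utf-8") and passed to the same function.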


Why am I losing context matches when I move my TM from one tool to another using TMX?

Current TM systems often not only save the segment you translated, but also some context for the segment. The context is used to improve the matching. This is why you might see matches called CM (context match), 101% or 102%, ICE (in-context exact) or similar. They show that the segment you work on right now is not only the same as in the TM, but also the context is the same.

Unfortunately, the way this context information is saved to the TM and what exactly is saved as context is not standardized.

This means that tool A will not be able to read, interpret and use the context information from tool B. You will only receive a 100% match instead of a context match.

Here are examples of how some tools save their context:

The sample text was translated into the TMs and the TMs were exported to TMX (Translation Memory Exchange format).

Sample text:

This is sentence one.

This is sentence two.

This is sentence three.

What is in the TMs:

Context for the second sentence in the TMX file from memoQ

The source segments before and after the actual segment are saved as context.

Context for the second sentence in the TMX file from SDL Trados Studio 2017

A hash code with information about the previous segment and the structure of the current segment (heading, footnote, content of a cell…) is saved as context.

Context for the second sentence in the TMX file from SDL Trados Studio 2019

A hash code plus explicit text of source and target segment before the actual segment is saved as context.

Also, the place where the context information is saved can differ: inside the tuv element (translation unit variant = language) or before it. And the values of the type attribute of the prop element differ as well (x-Context versus x-context-pre and x-context-post).

This shows why it is not possible to re-use context information between tools.
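To see which kind of context information a TMX file carries, you could list the context-related prop elements and where they sit. The following Python sketch is an illustration only: the sample TMX is hand-written and simplified, not copied from any tool's export, the hash value is a placeholder, and the function name context_props is an assumption.

```python
import xml.etree.ElementTree as ET

# Hand-written, simplified sample. HASH-PLACEHOLDER stands in for a real hash.
SAMPLE_TMX = """<tmx version="1.4"><header/><body>
<tu>
  <prop type="x-context-pre">This is sentence one.</prop>
  <prop type="x-context-post">This is sentence three.</prop>
  <tuv xml:lang="en"><seg>This is sentence two.</seg></tuv>
  <tuv xml:lang="de"><seg>Dies ist Satz zwei.</seg></tuv>
</tu>
<tu>
  <tuv xml:lang="en"><prop type="x-Context">HASH-PLACEHOLDER</prop><seg>This is sentence two.</seg></tuv>
  <tuv xml:lang="de"><seg>Dies ist Satz zwei.</seg></tuv>
</tu>
</body></tmx>"""

def is_context(prop) -> bool:
    """True for prop elements whose type starts with 'x-context' (any case)."""
    return prop.get("type", "").lower().startswith("x-context")

def context_props(tmx_text: str) -> list[tuple[str, str, str]]:
    """Return (prop type, location, value) for every context prop found."""
    root = ET.fromstring(tmx_text)
    results = []
    for tu in root.iter("tu"):
        # Props stored directly in the translation unit, before the tuv elements.
        for prop in tu.findall("prop"):
            if is_context(prop):
                results.append((prop.get("type"), "tu", prop.text))
        # Props stored inside a translation unit variant.
        for tuv in tu.findall("tuv"):
            for prop in tuv.findall("prop"):
                if is_context(prop):
                    results.append((prop.get("type"), "tuv", prop.text))
    return results

for entry in context_props(SAMPLE_TMX):
    print(entry)
```

The output makes it obvious that one unit stores readable neighbouring segments while the other stores a single opaque value, which a different tool has no way to interpret.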


Why do match values differ?

After talking about the things that can produce different word counts, we should also look at what can be the reason for different match values.

Even with the same file and the same TM, the analysis results can differ, because the settings that influence the match values are usually project-based settings.

Let us take penalties first. A penalty can be applied to matches that come from a specific TM, whose metadata differs from that of your current project, or maybe even to segments saved with a certain user name or user role. This means that instead of the “real” match value the segment would have, it shows up with a lower match value.

There are many reasons to apply a penalty:

  • The TM has been provided by a client and has not been created by yourself, so you cannot vouch for its quality.
  • The material in the TM is old or comes from an alignment (most tools will apply a penalty for alignment segments automatically).
  • You have decided to start a new, fresh TM and use the existing TM as a reference in the background.
  • The content was saved to the TM by a certain person (maybe by an intern who did an alignment and was not very careful when aligning the segments) or with a certain role (you want to trust segments confirmed by a reviewer more than those confirmed by a translator).
  • The content was translated for a different subject matter area and this information was saved to the TM as metadata (the TM contains translations from marketing, but you now want to translate a contract).
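The effect of penalties on a match value is simple arithmetic, as this small Python sketch shows. The penalty names and values are invented for illustration, and the assumption that several penalties simply add up may not hold for every tool.

```python
# Hypothetical penalty values in percentage points.
PENALTIES = {"client_tm": 2, "alignment": 5, "other_subject": 3}

def effective_match(raw_match: int, reasons: list[str]) -> int:
    """Subtract all applicable penalties from the raw match value."""
    penalty = sum(PENALTIES[reason] for reason in reasons)
    return max(raw_match - penalty, 0)

# A "real" 100% match from an aligned TM shows up as a 95% match.
print(effective_match(100, ["alignment"]))
```

With these values, a 100% match that comes from a client TM and from another subject area would also be shown as 95%, so the same displayed value can have quite different causes.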

Then, there are filter settings. Usually, applying a filter means applying a penalty. But it could also be that certain segments do not appear at all, because the filter does not permit segments with different metadata, from a TM with a specific name or from a specific user.

Still another reason could be that the segmentation rules don’t contain all abbreviations. The text is then split after an unknown abbreviation, which results in two segments in the document where there might be just one segment in the TM (maybe the translator joined the segments during translation, creating one segment in the TM but not updating the segmentation rules).
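The abbreviation problem can be reproduced with a very naive segmenter. The Python sketch below is not how any particular tool segments text. The function name segment and its abbreviation list are assumptions, and real segmentation rules are far more detailed.

```python
import re

def segment(text: str, abbreviations: tuple[str, ...] = ()) -> list[str]:
    """Split after '.', '!' or '?' followed by whitespace, except after known abbreviations."""
    segments = []
    start = 0
    for match in re.finditer(r"[.!?]\s+", text):
        candidate = text[start:match.start() + 1]
        last_token = candidate.split()[-1]
        if last_token in abbreviations:
            continue  # Known abbreviation: do not split here.
        segments.append(candidate)
        start = match.end()
    tail = text[start:]
    if tail:
        segments.append(tail)
    return segments

text = "Please contact Dr. Smith today. He will help."
print(segment(text))                          # splits after "Dr."
print(segment(text, abbreviations=("Dr.",)))  # keeps "Dr. Smith" together
```

Without the exception, the document contains the segments "Please contact Dr." and "Smith today.", neither of which will get a good match against a TM segment that holds the complete sentence.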

And another reason could be the use of different TM tools. As the way match values are calculated differs from tool to tool, an 82% match in tool 1 can very well be an 80% match in tool 2 and an 85% match in tool 3. The match values can actually differ quite a lot, depending on what the segments contain in the way of tags etc.
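Why the same pair of segments can get different values becomes clearer when you compare two simple ways of measuring similarity. The Python sketch below uses difflib from the standard library, once on characters and once on words. Neither is the algorithm of any TM tool (those are not public as far as I know), the function names are assumptions, and the point is only that the unit of comparison changes the result.

```python
from difflib import SequenceMatcher

def char_similarity(a: str, b: str) -> float:
    """Similarity based on matching characters."""
    return SequenceMatcher(None, a, b).ratio()

def word_similarity(a: str, b: str) -> float:
    """Similarity based on matching whitespace-separated words."""
    return SequenceMatcher(None, a.split(), b.split()).ratio()

old = "This is a short text."
new = "This is a nice text."

print(round(char_similarity(old, new) * 100))
print(round(word_similarity(old, new) * 100))
```

Both functions look at exactly the same two sentences, yet they return different percentages. Add tags, numbers and formatting to the mix, each weighted differently, and larger gaps are easy to imagine.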

Here are some examples for differing match values:

Tool 1 shows 70%, tool 2 shows 89%.

The difference is one full word (short -> nice).

Tool 1 shows 95%, tool 2 shows 92%.

The differences are the number, the formatting, the capitalization and the spacing.

And to make it even more complex, the examples show that it is not necessarily the case that one tool always shows lower match values than the other 🙂


Word Count Differences (2)

How can word counts differ within the same tool on different machines?

Have you ever run a word count with the same document on two different machines and received different word counts?

Well, here is what can have an impact on the word count statistics:

  • The use of a TM on one machine and no TM on the other machine can produce different word counts. A project with no TM will use default settings for counting, which might have been adjusted in the TM you actually use. For example, the setting to count words with hyphens as one or two words.

Example: The same file in the same project gets analyzed without a TM and with a TM where the default settings had been adjusted (and here even the number of segments and characters changes).

without TM:      

 with TM:            

  • The filters you use to import the file have different settings. If a filter includes or excludes hidden text, hidden layers, comments, hidden rows or columns, embedded objects etc. this can have a big impact on the number of words that are counted. I remember one time when a Word document that had visibly only a few words, produced a very large word count because of extracting the content of an embedded Excel file on one machine, but not on the other.

Example: The same file (just with different names) was imported with the default XML filter and with a filter that also imports the content of an attribute for translation.

  • The use of different versions of the software. Believe it or not, the tool providers do tweak the way words are counted now and then. At one point, a number-measurement combination was counted as two words in one Trados version, but as one word in the next version. It took some time to figure that one out, believe me, as it was unfortunately not mentioned in the release notes.
  • The analysis settings you use. The analysis might have an option to ignore locked segments. If that is switched off on one machine, but switched on on the other, the word counts will differ as well (provided of course there are locked segments in the files, for example in XLIFF files from another tool, or if you run an analysis after file preparation).
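The impact of the filter settings mentioned above (an XML filter that does or does not import attribute content) can be illustrated with a few lines of Python. This is a simplified sketch: the sample XML, the function names and the attribute list are invented, and text that follows a closing tag (tail text) is ignored for brevity.

```python
import xml.etree.ElementTree as ET

# Invented sample file content.
SAMPLE_XML = '<doc><p title="Hover text here">Visible text</p><img alt="A small logo"/></doc>'

def extract_texts(xml_text: str, translatable_attributes: tuple[str, ...] = ()) -> list[str]:
    """Collect element text and, optionally, the values of selected attributes."""
    root = ET.fromstring(xml_text)
    texts = []
    for element in root.iter():
        if element.text and element.text.strip():
            texts.append(element.text.strip())
        for name in translatable_attributes:
            if element.get(name):
                texts.append(element.get(name))
    return texts

def word_count(texts: list[str]) -> int:
    """Count whitespace-separated words in all extracted texts."""
    return sum(len(text.split()) for text in texts)

print(word_count(extract_texts(SAMPLE_XML)))                    # element text only
print(word_count(extract_texts(SAMPLE_XML, ("title", "alt"))))  # plus attributes
```

The file is the same in both calls; only the filter configuration changes what ends up in the analysis.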


Word Count Differences

How can it be that the word count for the same file differs from (translation) tool to (translation) tool?

The way a translation tool counts words can differ from any other translation tool as well as from the word count in Word. The reason is the way words and word boundaries are defined in the tools. Some specify that a word with a hyphen (like “tool-related”) should be counted as one word, others see it as two words. The same is true for other delimiting characters, like slashes (/) or apostrophes (’). It can even happen that a character like a slash, if it is surrounded by spaces (like in “in / out”), is counted as a word on its own in one tool, but not at all in another.

Some tools recognize combinations of letters and numbers (alphanumeric items) as one word, but only as long as there is no slash or hyphen that separates numbers from letters (ABC123 = 1 word, but ABC-123 = 2 words).

Depending on the types of elements your file contains, the difference can be quite extensive. A recent example from a file preparation showed elements like these:

/content/legal/privacy?cid=cookieprivacy#cookies-policy

One tool counted that whole expression as 1 word, the other counted 4 words, using the slashes and the equal symbol as word delimiters. Imagine the word count difference if there are 1000 items like this one in the file.
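The two counts can be reproduced with a small Python sketch in which the set of delimiting characters is a parameter. The function name count_words is an assumption, and real tools use more elaborate rules than a single split.

```python
import re

def count_words(text: str, extra_delimiters: str = "") -> int:
    """Count tokens separated by whitespace and any extra delimiter characters."""
    pattern = r"[\s" + re.escape(extra_delimiters) + r"]+"
    return len([token for token in re.split(pattern, text) if token])

url = "/content/legal/privacy?cid=cookieprivacy#cookies-policy"

print(count_words(url))                          # whitespace only: 1
print(count_words(url, extra_delimiters="/="))   # slash and equal sign: 4
print(count_words("tool-related"))               # 1
print(count_words("tool-related", "-"))          # 2
```

With whitespace as the only delimiter, “in / out” gives 3, because the slash is counted as a word on its own.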

Of course it is debatable which part of the above expression needs to be translated, if any. That would be a nice exercise for the use of regular expressions, either to tag the whole thing or to extract the translatable part. 🙂

And although some tools let you influence the way they count by providing checkboxes to specify words with hyphens as one or two words, it is almost impossible to achieve the exact same word count with any two tools when your documents contain delimiting characters like slashes or equal symbols.

And here is the real-life comparison over 39 files:

Analysis tool A

Analysis tool B

Note that the segment count is quite close but the word count is very different.


REGEX – the hidden language for our translation tools

Our tools offer a lot of functionality, but in many places a knowledge of some simple regex (regular expressions) can enhance these functionalities a lot.

  • You can create your own filters, i.e. determine what parts of a (text-based) document get imported for translation.
  • You can convert text elements into tags, which is especially useful for placeholders like these {1}, ##NAME## or %sd.
  • You can use regex to search for a pattern, like a web address, a date, a combination of number and measurement…
  • You can use it to run a replace action (changing date formats or the sequence of elements, like 25% -> % 25).
  • You can use regex in the QA checkers to find specific things, like numbers and measurement units that are not separated by a non-breaking space.
  • You can use regex when specifying segmentation rules and segmentation exceptions.
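A few of the uses listed above can be tried out in Python. Keep in mind that this is a sketch in Python's regex flavor and the tools use their own regex engines, so details may differ (for example, many engines write $1 instead of \1 in the replacement). The function names and the short unit list are assumptions for illustration.

```python
import re

# Placeholders like {1} or ##NAME## that could be converted into tags.
PLACEHOLDER = re.compile(r"\{\d+\}|##[A-Z]+##")

def find_placeholders(text: str) -> list[str]:
    return PLACEHOLDER.findall(text)

def move_percent_sign(text: str) -> str:
    """Replace action: change the sequence of number and percent sign."""
    return re.sub(r"(\d+)%", r"% \1", text)

# QA check: number and unit joined by a normal space or by nothing at all.
# A non-breaking space (\u00a0) between them does not match.
UNIT_WITHOUT_NBSP = re.compile(r"\d+ ?(?:mm|cm|kg)\b")

def find_unit_spacing_issues(text: str) -> list[str]:
    return UNIT_WITHOUT_NBSP.findall(text)

print(find_placeholders("Hello {1}, your code is ##NAME##."))
print(move_percent_sign("25%"))
print(find_unit_spacing_issues("Use 10 mm or 5\u00a0cm."))
```

Only "10 mm" is reported by the last check, because "5 cm" already uses a non-breaking space.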

We all know that good preparation at the beginning of a project can save a lot of repair work (in all the target languages), and regex is definitely a good thing to include in your preparation considerations.

 

I often get asked where there is material to learn how to do regex. Well, there are a lot of very good tutorials on the internet (just use the search words “regex” and “tutorial”). But none of them focuses on the needs of the translation industry (hence my course on Regex for Translation on L10Ntrain).

 

From my experience, the regex you need to know starts with these few expressions:

  • Brackets and what they do: ( ) for grouping, [ ] for character ranges/lists and { , } for minimum and maximum numbers of characters.
  • Characters that have their own meaning in regex and need the backslash (escape character) before them, when you need to search for the actual character:
    • Dot (.), plus (+), asterisk (*), dollar ($), circumflex (^), backslash (\)
  • Searching for spaces in general: \s
  • Searching for digits: \d or [0-9]
  • Searching for letters: [a-z], [A-Z], \p{Lu}…
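Here is how these basic building blocks look in practice, again as a small Python sketch with made-up sample strings. One caveat: Python's built-in re module does not support \p{Lu}, so the sketch uses [A-Z] for uppercase letters. Regex engines with Unicode property support accept \p{Lu} and would also cover letters like Ä or É.

```python
import re

# Escaped special characters: a literal dollar sign and a literal dot.
PRICE = re.compile(r"\$\d+\.\d{2}")

# ( ) for grouping and { } for an exact number of digits.
DATE = re.compile(r"(\d{2})\.(\d{2})\.(\d{4})")

# [ ] for character ranges: one uppercase letter followed by lowercase letters.
CAPITALIZED = re.compile(r"[A-Z][a-z]+")

print(PRICE.findall("Price: $19.99 or $5.00"))
print(DATE.fullmatch("31.12.2020").groups())
print(CAPITALIZED.findall("Studio and memoQ"))
print(re.split(r"\s+", "one  two\tthree"))  # \s matches spaces and tabs alike
```

Without the backslash, the dot in the date pattern would match any character, so "31-12-2020" would be accepted as well.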

 

Of course, there are many more and you can do wonderful things with regex, but these few can get you started quite quickly.

For an overview and more examples, check out the introductory course on Regular Expressions in Translation 🙂


How would you calculate a repetition in the proposal/invoice?

This question comes up right after explaining what a repetition really is and it is not easily answered.

Technically, once the first occurrence of a repeated segment has been translated and saved to the TM, all other occurrences of this segment will appear as 100% matches to the translator. So, they could be invoiced the same way as 100% matches.
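If repetitions are invoiced like 100% matches, the calculation is a simple weighted word count. The Python sketch below uses invented rates and word counts purely to show the arithmetic. Actual rates are a matter of negotiation and are not taken from any price list.

```python
# Hypothetical rates in percent of the full word price.
RATES_PERCENT = {"no match": 100, "repetition": 30, "100% match": 30}

def weighted_word_count(word_counts: dict[str, int], rates: dict[str, int] = RATES_PERCENT) -> float:
    """Weight the words of each analysis category with its rate."""
    total = sum(words * rates[category] for category, words in word_counts.items())
    return total / 100

analysis = {"no match": 1000, "repetition": 300, "100% match": 200}
print(weighted_word_count(analysis))
```

With these numbers, 1500 words in the analysis turn into 1150 weighted words on the invoice.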

BUT, depending on the type of text you have to translate or the language pair you deal with, this might not be the case. For example, a catalog that is translated into German might deal with “gearboxes”, where in German the singular and plural of this word are identical (Getriebe). This means not all occurrences of this segment (maybe the headings in a table) have to be translated in the same way. Or, taking German as a target language again, one and the same sentence in English can be translated in 3 different ways, depending on the gender of the object you are talking about. Example: Connect the one with the other. (Admittedly, that is not good style in English, but it happens 🙂 ). This sentence could have 3 different translations in German, depending on the gender of the thing you are talking about.

 

This also brings up the question whether a 100% match can be left unchecked (as many clients seem to think). It is a 100% match after all, so it was there before, it is in the TM, it has been translated and paid for already… The thing is, a 100% match only tells you that the SOURCE segment has appeared before. It does not mean that the translation is complete, correct or fits the context. That is why translation vendors will usually tell you that even 100% matches should be checked for correctness in context.


Localization Tools – What is a Repetition?

Translation tools are easy enough to get started with, but there are many settings and features that are not so self-explanatory. One of these seems to be the definition for a repetition when doing an analysis of translation documents. When I ask what a repetition is, I get answers ranging from “when words are repeated” to “all sentences that are similar” to “segments that repeat” – where only the last one is partially true.

Here is the definition of a repetition from the tools I have dealt with so far: A repetition is a segment that comes up repeatedly (either inside a document or between documents), which DOES NOT have a 100% match from the TM. This last part is important. If it had a 100% match, it would be counted as such. So if a segment appears five times in exactly the same way AND has a 100% match from the TM, it would be counted as five 100% matches (usually, unless there is a setting to change this type of counting, but that is going too far here 🙂 ).

If it does not have a 100% match, it is counted as a “no match” or “fuzzy match” the first time around. The second time is counted as the first repetition and so forth.
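The counting logic described above can be written down in a few lines of Python. This is a simplified sketch, not the analysis of any specific tool: the function name analyze is an assumption, fuzzy matching is left out, and every segment without a 100% match is treated as "no match" on its first occurrence.

```python
from collections import Counter

def analyze(segments: list[str], tm_sources: set[str]) -> Counter:
    """Classify each segment as 100% match, no match or repetition."""
    counts = Counter()
    seen = set()
    for segment in segments:
        if segment in tm_sources:
            counts["100% match"] += 1   # in the TM: never counted as a repetition
        elif segment in seen:
            counts["repetition"] += 1   # second and later occurrences
        else:
            counts["no match"] += 1     # first occurrence
            seen.add(segment)
    return counts

document = ["Open the lid.", "Close the lid.", "Open the lid.", "Open the lid.", "Press Start."]
print(analyze(document, tm_sources={"Press Start."}))
```

"Open the lid." appears three times without a TM match, so it is counted once as "no match" and twice as a repetition. If it were in the TM, all three occurrences would be counted as 100% matches instead.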