Archives: March 18, 2019

Word Count Differences (2)

How can word counts differ within the same tool on different machines?

Have you ever run a word count with the same document on two different machines and received different word counts?

Well, here is what can have an impact on the word count statistics:

  • The use of a TM on one machine and no TM on the other can produce different word counts. A project with no TM uses the default counting settings, whereas those settings might have been adjusted in the TM you actually use, for example the setting that determines whether words with hyphens count as one or two words.

Example: The same file in the same project is analyzed without a TM and with a TM in which the default settings have been adjusted (and here even the number of segments and characters changes).

Without TM:

With TM:

  • The filters you use to import the file have different settings. If a filter includes or excludes hidden text, hidden layers, comments, hidden rows or columns, embedded objects, etc., this can have a big impact on the number of words that are counted. I remember a Word document that visibly contained only a few words but produced a very large word count, because the content of an embedded Excel file was extracted on one machine but not on the other.

Example: The same file (just with different names) was imported with the default XML filter and with a filter that also imports the content of an attribute for translation.

  • The use of different versions of the software. Believe it or not, the tool providers do tweak the way words are counted now and then. At one point, a number-measurement combination was counted as two words in one Trados version but as one word in the next. It took some time to figure that one out, believe me, as it was unfortunately not mentioned in the release notes.
  • The analysis settings you use. The analysis might have an option to ignore locked segments. If that option is switched off on one machine but switched on on the other, the word counts will differ as well (provided, of course, that there are locked segments in the files, for example in XLIFF files from another tool, or if you run an analysis after file preparation).
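To make the effect of such settings concrete, here is a minimal sketch in Python. It is not how any particular translation tool counts; the function `count_words` and its two switches, `hyphen_splits` and `ignore_locked`, are hypothetical stand-ins for the hyphen setting and the "ignore locked segments" option described above.

```python
def count_words(segments, hyphen_splits=False, ignore_locked=False):
    """Count words across (text, locked) pairs under two hypothetical settings."""
    total = 0
    for text, locked in segments:
        if ignore_locked and locked:
            continue  # locked segments are left out of the analysis
        if hyphen_splits:
            text = text.replace("-", " ")  # treat a hyphen as a word boundary
        total += len(text.split())
    return total


# Two made-up segments; the second one is locked.
segments = [
    ("A tool-related setting", False),
    ("Do not translate this line", True),
]

default_count = count_words(segments)
adjusted_count = count_words(segments, hyphen_splits=True, ignore_locked=True)
print(default_count, adjusted_count)  # prints: 8 4
```

The same two segments yield 8 words with one configuration and 4 with the other, which is the kind of gap you can see between two machines that only differ in their settings.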


Word Count Differences

How can it be that the word count for the same file differs from (translation) tool to (translation) tool?

The way a translation tool counts words can differ from any other translation tool, as well as from the word count you get in Word. The reason is the way words and word boundaries are defined in the tools. Some specify that a word with a hyphen (like “tool-related”) should be counted as one word; others see it as two words. The same is true for other delimiting characters, like slashes (/) or apostrophes (’). It can even happen that a character like a slash, if it is surrounded by spaces (as in “in / out”), is counted as a word on its own in one tool but not at all in another.

Some tools recognize combinations of letters and numbers (alphanumeric items) as one word, but only as long as there is no slash or hyphen that separates numbers from letters (ABC123 = 1 word, but ABC-123 = 2 words).
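The word boundary rules described above can be imitated with a small Python sketch. The function `count_tokens` is a hypothetical helper, not the algorithm of any real tool: it splits on whitespace plus whatever extra delimiter characters you pass in, and counts what is left.

```python
import re


def count_tokens(text, delimiters=""):
    """Split on whitespace plus any extra delimiter characters; count non-empty tokens."""
    pattern = r"[\s" + re.escape(delimiters) + r"]+"
    return len([token for token in re.split(pattern, text) if token])


# Alphanumeric items: one word until a hyphen is treated as a delimiter.
print(count_tokens("ABC123", "-/"))    # prints: 1
print(count_tokens("ABC-123", "-/"))   # prints: 2

# Hyphenated word: one or two words, depending on the rule.
print(count_tokens("tool-related"))        # prints: 1
print(count_tokens("tool-related", "-"))   # prints: 2

# A slash surrounded by spaces: a word of its own, or nothing at all.
print(count_tokens("in / out"))        # prints: 3
print(count_tokens("in / out", "/"))   # prints: 2
```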

Depending on the types of elements your file contains, the difference can be quite extensive. A recent example from a file preparation showed elements like these:

/content/legal/privacy?cid=cookieprivacy#cookies-policy

One tool counted that whole expression as 1 word; the other counted 4 words, using the slashes and the equal symbol as word delimiters. Imagine the word count difference if there are 1000 items like this one in the file.
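The 1-versus-4 result can be reproduced with a short Python sketch. The two splitting rules are assumptions that mimic the behavior described here; the actual rules inside the tools are not documented in this post.

```python
import re

expression = "/content/legal/privacy?cid=cookieprivacy#cookies-policy"

# Assumed rule of the first tool: only whitespace separates words.
words_whitespace = expression.split()

# Assumed rule of the second tool: slashes and equal symbols also separate words.
words_delimited = [w for w in re.split(r"[\s/=]+", expression) if w]

print(len(words_whitespace))  # prints: 1
print(words_delimited)
# prints: ['content', 'legal', 'privacy?cid', 'cookieprivacy#cookies-policy']

# Difference across 1000 such items in one file.
items = 1000
difference = items * (len(words_delimited) - len(words_whitespace))
print(difference)  # prints: 3000
```

With 1000 items like this one, the two counts would be 3000 words apart for these expressions alone.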

Of course, it is debatable which part of the above expression needs to be translated, if any. That would be a nice exercise for the use of regular expressions, either to tag the whole thing or to extract the translatable part. 🙂
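As a rough idea of the "tag the whole thing" variant, here is a sketch in Python. Everything in it is an assumption for illustration: the pattern (a token that starts with a slash and runs to the next whitespace), the helper `protect_paths`, and the `{1}`-style placeholders, which do not correspond to the tag format of any specific tool.

```python
import re

# Assumed pattern: a token that starts with "/" and runs to the next whitespace.
PATH_PATTERN = re.compile(r"(?<!\S)/\S+")


def protect_paths(text):
    """Replace each path-like token with a numbered placeholder.

    Returns the rewritten text and the list of protected originals.
    """
    protected = []

    def _swap(match):
        protected.append(match.group(0))
        return "{%d}" % len(protected)

    return PATH_PATTERN.sub(_swap, text), protected


source = "See /content/legal/privacy?cid=cookieprivacy#cookies-policy for details"
tagged, paths = protect_paths(source)
print(tagged)  # prints: See {1} for details
print(paths)   # prints: ['/content/legal/privacy?cid=cookieprivacy#cookies-policy']
```

Whether such a placeholder then counts as a word, as a tag, or not at all is again up to the tool doing the analysis.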

And although some tools let you influence the way they count by providing checkboxes to specify words with hyphens as one or two words, it is almost impossible to achieve the exact same word count with any two tools when your documents contain delimiting characters like slashes or equal symbols.

And here is the real-life comparison over 39 files:

Analysis tool A

Analysis tool B

Note that the segment count is quite close but the word count is very different.