Characters that look the same but don't sort the same

Question

If you use the Latin alphabet with diacritics over characters, you may encounter a puzzling phenomenon in the Wordlist. You may have multiple entries of what appears to be the same word.
This happens because the Unicode standard has two ways of encoding many letters with diacritics. For example, the é character can be the single Unicode character 00E9, or it can be two characters, 0065 followed by 0301 (lower case e followed by the combining acute accent). When both forms occur in a project, words containing 00E9 will not be identical to words containing 0065 + 0301, even if all the other letters are the same.
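To see the difference outside Paratext, here is a minimal Python sketch using only the standard library. The word "café" is just an illustration, and the helper name code_points is made up for this example.

```python
import unicodedata

composed = "caf\u00e9"      # é stored as the single code point 00E9
decomposed = "cafe\u0301"   # e (0065) followed by combining acute (0301)

def code_points(text):
    """Return the code points of text as space-separated hex values."""
    return " ".join(f"{ord(ch):04X}" for ch in text)

print(code_points(composed))     # 0063 0061 0066 00E9
print(code_points(decomposed))   # 0063 0061 0066 0065 0301

# A plain comparison treats the two spellings as different words.
same_raw = composed == decomposed

# After normalizing both to the same form (NFC here), they compare equal.
same_nfc = (unicodedata.normalize("NFC", composed)
            == unicodedata.normalize("NFC", decomposed))

print(same_raw, same_nfc)        # False True
```

Both strings display the same on screen, which is why the duplicate entries are so puzzling at first.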

You can verify this by looking at the Characters Inventory. This example is of a test project with two é characters, one 00E9 and one 0065 0301. You can see that the inventory breaks down the difference between the two é characters.
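If you want to check a piece of text outside Paratext, a rough equivalent of such an inventory can be sketched in Python. This is not how Paratext builds its inventory; the function names and the sample text are assumptions made for the illustration. It keeps each base character together with the combining marks that follow it, and lists the code points of each combination.

```python
import unicodedata
from collections import Counter

def character_inventory(text):
    """Count each base character together with any combining marks after it."""
    inventory = Counter()
    cluster = ""
    for ch in text:
        if unicodedata.combining(ch) and cluster:
            cluster += ch            # combining mark: attach to current base
        else:
            if cluster:
                inventory[cluster] += 1
            cluster = ch             # start a new base character
    if cluster:
        inventory[cluster] += 1
    return inventory

def describe(cluster):
    """Show the code points of a cluster as hex values."""
    return " ".join(f"{ord(ch):04X}" for ch in cluster)

# Sample text: "été" with one composed é and one decomposed é.
sample = "\u00e9te\u0301"
inventory = character_inventory(sample)
for cluster, count in sorted(inventory.items()):
    print(describe(cluster), count)
```

With this sample the two é characters show up as separate entries, one listed as 00E9 and one as 0065 0301.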

The Unicode standard does say that these sequences should be treated as identical by the software, but many programs have not managed to do that yet. Paratext does not, and neither does Microsoft Word.

If you have this issue, you can fix it by doing a search and replace, for instance searching for one kind of é and replacing it with the other kind. To keep it from occurring again, translators working on the same project should either use the same keyboard file, or at least keyboard files that output the same character sequence for letters with diacritics.
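The same search and replace can be sketched in Python for text handled outside Paratext. The function names are made up for this example, and the choice of the composed form (NFC) as the target is only an assumption; a project could equally settle on the decomposed form (NFD), as long as everyone uses the same one.

```python
import unicodedata

def replace_decomposed_e_acute(text):
    """Replace e + combining acute (0065 0301) with the single character 00E9."""
    return text.replace("e\u0301", "\u00e9")

def normalize_text(text, form="NFC"):
    """Convert every composed/decomposed pair in one pass.

    form="NFC" gives composed characters where they exist,
    form="NFD" gives base characters followed by combining marks.
    """
    return unicodedata.normalize(form, text)

mixed = "cafe\u0301 and caf\u00e9"
fixed = replace_decomposed_e_acute(mixed)
print(fixed == "caf\u00e9 and caf\u00e9")   # True
```

The first function mirrors a single search and replace for one letter. The second uses the standard library's unicodedata.normalize, which handles all affected letters at once.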

Paratext Jun 5, 2015 asked by [Expert]

sewhite (3.1k points)
Jun 11, 2019 reshown

4 Answers

drwww · Answer 1 · 2015-06-08T07:31:37+0000

Dear drwww,

Go to Project/Language Settings and choose the Other Characters Tab. There
you will see a little tick box which says, “Non-standard diacritics follow
base character”. Tick this, and your problems should go away.

You should also make sure that the alphabetic characters tab is filled in
properly according to the “More Help” information provided on the side
slideout. You may have problems with Paratext accepting the capital form of
some of your combining diacritic characters. If you do, just put in the
lower case form(s) of the offending character(s) with a space between them.
The capitalization will be properly predicted.

Also, you should make sure to view the Characters Inventory with “Show
combinations” ticked. It won’t look right unless the tick box in the
language settings is chosen first, but it is important for making sure that
you are approving your characters properly, joined, rather than separate
from their diacritics.

Blessings,

Shegnada J.

Language Technology & Publishing Coordinator, Nigeria

GPS Text Processing Specialist

Wycliffe


Skype:Shegnada.James.

Jun 8, 2015 commented by Shegnada (1.3k points)

Fool Running · Answer 2 · 2015-06-15T14:19:07+0000

This is actually fixed in 7.6. Even if the project is not fully normalized in the way @John+Wickberg suggests, the Wordlist will still only show one entry for these two normalized cases.
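As a rough illustration of the idea of showing one entry for both forms, here is a Python sketch that groups words by their normalized spelling. It is not Paratext's actual code, and the function name and word list are assumptions for the example.

```python
import unicodedata
from collections import defaultdict

def group_by_normalized_form(words):
    """Group words whose spellings are identical once normalized to NFC."""
    groups = defaultdict(list)
    for word in words:
        key = unicodedata.normalize("NFC", word)
        groups[key].append(word)
    return dict(groups)

words = ["caf\u00e9", "cafe\u0301", "the"]
groups = group_by_normalized_form(words)
print(len(groups))   # 2
```

The stored text is left untouched; only the key used for listing is normalized, so the composed and decomposed spellings end up under a single entry.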

wdavidhj · Answer 3 · 2019-06-08T03:50:22+0000

For additional experiences and other discussions, see (in chronological order):

What is Paratext's behaviour for composing/decomposing unicode characters/sequences?
Matches Not Found in Word List - Composed and Decomposed (whole conversation)
Guide: Checking > Characters Inventory (second post)

Adding keywords for search:
normalisation normalization
