+1 vote

If you use the Latin alphabet with diacritics over characters, you may encounter a puzzling phenomenon in the Wordlist. You may have multiple entries of what appears to be the same word.
This happens because the Unicode standard has two ways of encoding many letters with diacritics. For example, é can be the single Unicode character 00E9, or it can be a sequence of two characters, 0065 followed by 0301 (lowercase e followed by the combining acute accent). Software that compares text code point by code point will then treat a word containing 00E9 as different from a word containing 0065 + 0301, even if all the other letters are the same.
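To make the difference concrete, here is a minimal Python sketch using the standard `unicodedata` module. The sample word and the helper name `same_after_nfc` are just illustrations, not anything Paratext uses.

```python
import unicodedata

composed = "caf\u00e9"     # é stored as the single code point 00E9
decomposed = "cafe\u0301"  # e (0065) followed by combining acute accent (0301)

def same_after_nfc(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to the composed form (NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(composed == decomposed)                # False: the code points differ
print(len(composed), len(decomposed))        # 4 5
print(same_after_nfc(composed, decomposed))  # True
```

Both strings look the same on screen, but a plain comparison sees different sequences of code points.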

You can verify this by looking at the Characters Inventory. This example is from a test project with two é characters, one 00E9 and one 0065 0301. You can see that the inventory lists the two kinds of é separately.
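If you want to check a piece of text outside Paratext, a rough equivalent of such an inventory can be sketched in Python. This is only an illustration of the idea, not how Paratext builds its inventory, and `code_point_inventory` is a made-up helper name.

```python
from collections import Counter
import unicodedata

def code_point_inventory(text: str) -> Counter:
    """Count every non-space code point, keyed by hex value and Unicode name."""
    inventory = Counter()
    for ch in text:
        if ch.isspace():
            continue
        name = unicodedata.name(ch, "UNNAMED")
        inventory[f"U+{ord(ch):04X} {name}"] += 1
    return inventory

# The same word typed twice: once with 00E9, once with 0065 + 0301.
sample = "\u00e9t\u00e9 e\u0301te\u0301"
for key, count in sorted(code_point_inventory(sample).items()):
    print(count, key)
```

A separate COMBINING ACUTE ACCENT line in the listing is a sign that the text contains decomposed sequences.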

image

The Unicode standard does say that these sequences should be treated as identical by the software, but many programs have not managed to do that yet. Paratext does not, and neither does Microsoft Word.

If you have this issue, you can fix it by doing a search and replace, for instance searching for one kind of é and replacing it with the other kind. To keep it from occurring again, translators working on the same project should either use the same keyboard file, or at least keyboard files that output the same character sequence for letters with diacritics.
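The search and replace described here amounts to swapping one code point sequence for the other. A minimal sketch in Python, handling only é and using hypothetical names, could look like this:

```python
# Assumption: we standardize on the composed form (00E9) for é only.
DECOMPOSED_E_ACUTE = "e\u0301"  # 0065 followed by 0301
COMPOSED_E_ACUTE = "\u00e9"     # 00E9

def unify_e_acute(text: str) -> str:
    """Replace every decomposed é with the composed é."""
    return text.replace(DECOMPOSED_E_ACUTE, COMPOSED_E_ACUTE)

fixed = unify_e_acute("e\u0301te\u0301")
print(fixed == "\u00e9t\u00e9")  # True
```

Each letter and diacritic combination in the project needs its own replacement, so this gets tedious for alphabets with many accented letters.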

Paratext by [Expert]
(3.1k points)


4 Answers

0 votes
Best answer

In Paratext 7.6 (still in development), new projects will by default use
"Normal Form Composed" (NFC) when saving text data. This will prevent the
differences in text that are usually caused by using different keyboards.
“Normal Form Decomposed” (NFD) can be used instead; this may be required to
make text display correctly in some fonts.

There is also a Convert Project tool in Paratext 7.6 that can be used to
change all text in a project to either NFC or NFD. The Convert Project tool
does create a new project in the process, but project history is preserved.
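For text that lives outside Paratext, the same kind of conversion can be sketched with Python's standard `unicodedata.normalize`. This is not the Convert Project tool itself, only an illustration of what converting to NFC or NFD does; `convert_text` is a hypothetical helper.

```python
import unicodedata

def convert_text(text: str, form: str = "NFC") -> str:
    """Normalize text to NFC (composed) or NFD (decomposed)."""
    if form not in ("NFC", "NFD"):
        raise ValueError("form must be 'NFC' or 'NFD'")
    return unicodedata.normalize(form, text)

mixed = "\u00e9cole e\u0301cole"   # one composed é, one decomposed é
nfc = convert_text(mixed, "NFC")   # both words now use 00E9
nfd = convert_text(mixed, "NFD")   # both words now use 0065 + 0301
print(len(mixed), len(nfc), len(nfd))
```

Either form works as long as the whole project uses the same one.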

by [Administrator]
(3.1k points)
0 votes

The problem we’re running into is that accent marks are being classified on their own, or grouped with the character that follows them rather than with the character they modify.

image

What should we be doing differently?

by (412 points)

Dear drwww,

Go to Project/Language Settings and choose the Other Characters Tab. There
you will see a little tick box which says, “Non-standard diacritics follow
base character”. Tick this, and your problems should go away.

You should also make sure that the alphabetic characters tab is filled in
properly according to the “More Help” information provided on the side
slideout. You may have problems with Paratext accepting the capital form of
some of your combining diacritic characters. If you do, just put in the
lower case form(s) of the offending character(s) with a space between them.
The capitalization will be properly predicted.

Also, you should make sure to view the Characters Inventory with “Show
combinations” ticked. It won’t look right unless the tick box in the
language settings is chosen first, but it is important for making sure that
you are approving your characters properly, joined, rather than separate
from their diacritics.
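The idea that a diacritic follows its base character can be sketched in Python as below. This only illustrates the grouping rule, under the assumption that every Unicode mark (general category starting with "M") attaches to the character before it; it is not Paratext's own logic, and `group_with_base` is a made-up name.

```python
import unicodedata

def group_with_base(text: str) -> list:
    """Split text into clusters, attaching each combining mark to the
    character that precedes it."""
    clusters = []
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        if is_mark and clusters:
            clusters[-1] += ch   # mark joins the preceding base character
        else:
            clusters.append(ch)  # a base character starts a new cluster
    return clusters

print(len(group_with_base("ne\u0301o\u0302")))  # 3 clusters: n, e+0301, o+0302
```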

Blessings,

Shegnada J.

Language Technology & Publishing Coordinator, Nigeria

GPS Text Processing Specialist

Wycliffe

+[Phone Removed]

Skype:Shegnada.James.

0 votes

This is actually fixed in 7.6. Even if the project is not fully normalized in the way @John+Wickberg suggests, the Wordlist will still show only one entry for the composed and decomposed forms of the same word.
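One way a word list can merge such entries is to use the normalized form of each word as the key when counting. The sketch below shows that idea in Python. How Paratext does it internally is not stated here, so treat this purely as an assumption-laden illustration with a hypothetical `wordlist` function.

```python
from collections import Counter
import unicodedata

def wordlist(text: str) -> Counter:
    """Count whitespace-separated words, keyed by their NFC form."""
    return Counter(unicodedata.normalize("NFC", word) for word in text.split())

# "été" three times: composed, decomposed, composed.
text = "\u00e9t\u00e9 e\u0301te\u0301 \u00e9t\u00e9"
raw_entries = Counter(text.split())
merged_entries = wordlist(text)
print(len(raw_entries), len(merged_entries))  # 2 1
```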

by [Expert]
(16.2k points)
0 votes

For additional experiences and other discussions, see (in chronological order):

  • What is Paratext's behaviour for composing/decomposing unicode characters/sequences?
  • Matches Not Found in Word List - Composed and Decomposed (whole conversation)
  • Guide: Checking > Characters Inventory (second post)

Adding keywords for search:
normalisation normalization

by (1.4k points)
