0 votes

I’m noticing what seems to be a binary difference between some term IDs that appear in Major Biblical Terms and those used in project rendering files. The matching term IDs all look identical if you open the XML in an editor, but I’m prototyping a plug-in and my code noticed the difference. Perhaps it is something like composed vs. decomposed forms, but I don’t know Hebrew at all. I’m posting links to two versions of the term ID for Shem: one is from Major Biblical Terms, the other is from TermRenderings.xml on one of my projects. Both IDs look the same, but a binary file viewer shows them to be different.
Thanks for any suggestions.
stevepence

Shem from a Renderings file
Shem from Major Terms

Paratext by (127 points)

4 Answers

0 votes
Best answer

This is probably more detail than you want to have, but here is my take on the Hebrew normalization issues. Unicode defined a canonical ordering for diacritics that is not ideal linguistically. Since the order of non-interacting diacritics can be arbitrary, the Unicode Consortium declined to change the canonical order when asked. But there are situations in Hebrew where the order of interacting diacritics matters, and that order is lost by normalization. To work around this, people insert the CGJ (U+034F, COMBINING GRAPHEME JOINER) to preserve a different diacritic ordering. The CGJ should be considered part of the spelling of the word.

While all this was being thrashed out, fonts were often not designed to handle the Unicode canonical order. But that has long been fixed and fonts can handle Unicode canonically ordered text just fine.

All this to say: I see no reason why data shouldn’t be stored in either Unicode normal form (NFC or NFD, which I believe are identical for Hebrew), but if you have some legacy (old Unicode) data, then care should be taken in the normalization. In particular, check the word for Jerusalem (if my foggy memory of words with diacritic-ordering issues is right), which should contain a CGJ; or look at the data carefully with a visual representation in front of you.

As to Lorna’s example, there should be no difficulty with those two diacritics since they are non-interacting (one above one below).
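To make the CGJ point concrete, here is a small Python sketch using the standard library’s `unicodedata` module. The particular mark sequence (lamed + qamats + hiriq) is illustrative only, not a real word:

```python
import unicodedata

# Lamed + qamats (combining class 18) + hiriq (combining class 14):
# canonical reordering sorts the marks by combining class,
# so the original mark order is lost under normalization.
no_cgj = "\u05DC\u05B8\u05B4"
print(unicodedata.normalize("NFC", no_cgj) == "\u05DC\u05B4\u05B8")  # True

# Inserting CGJ (U+034F, combining class 0) between the marks blocks
# reordering, so normalization leaves the spelling order intact.
with_cgj = "\u05DC\u05B8\u034F\u05B4"
print(unicodedata.normalize("NFC", with_cgj) == with_cgj)  # True
```

Because CGJ has combining class 0, it splits the run of combining marks into two separate runs, and canonical reordering never moves a mark across that boundary.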

by (366 points)
0 votes

To find out what characters are in a string, you could use UniView 14. Paste your text into the box labelled “text area”, then click the downward-pointing arrow just below the text area box and you’ll see a list of the characters (character shape, Unicode value and description).

If your suspicion is correct and the difference is due to normalization form, you may want to include a normalization step in your plugin.
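If you’d rather inspect the characters programmatically than paste them into UniView, Python’s standard library can produce the same kind of listing (a minimal sketch):

```python
import unicodedata

def dump_codepoints(s):
    # Print each code point as its U+ value plus its official Unicode name.
    for ch in s:
        print(f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}")

# The Shem term ID: shin, tsere, shin dot, final mem.
dump_codepoints("\u05E9\u05B5\u05C1\u05DD")
```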

by (296 points)
0 votes

ShemFrompmcdblRendering.txt encoding: U+05E9 U+05B5 U+05C1 U+05DD (shin, tsere, shin dot, final mem)
ShemFromMajorTerms.txt encoding: U+05E9 U+05C1 U+05B5 U+05DD (shin, shin dot, tsere, final mem)

If we look at the Unicode properties, I believe the encoding in ShemFromMajorTerms.txt is not in canonical order (tsere, combining class 15, should precede shin dot, combining class 24), and the Major Terms document may need normalizing.
Caveat: I know Hebrew has some Unicode normalization issues, and I’m not up to speed on where normalization should not be followed.
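The two sequences above can be checked directly in Python; `unicodedata` agrees that the renderings-file order is the canonically ordered one (a sketch):

```python
import unicodedata

renderings = "\u05E9\u05B5\u05C1\u05DD"   # shin, tsere, shin dot, final mem
major_terms = "\u05E9\u05C1\u05B5\u05DD"  # shin, shin dot, tsere, final mem

print(renderings == major_terms)  # False: byte-for-byte different

# Both normalize to the same string, and that string is the
# renderings-file order (tsere, class 15, sorts before shin dot, class 24).
nfc = unicodedata.normalize("NFC", major_terms)
print(nfc == renderings)  # True
```

Note that NFC leaves Hebrew decomposed: the precomposed shin-with-shin-dot character exists in the Alphabetic Presentation Forms block but is excluded from composition, which is why NFC and NFD coincide here.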

by (329 points)

stevepence,
I suspect that as more organizations write their own plugins, others may run into your problem. It just has not impacted ordinary Paratext users, but it apparently does impact you as a plugin author. If the Major Biblical Terms list is not using best practices for Hebrew encoding, then maybe you could discuss your needs with anon291708.

0 votes

Thanks to all who replied. Everyone’s response was extremely helpful in understanding this issue. I am very grateful!

I would not presume to comment on which file has “correct” encodings - certainly a complex topic. My very limited concern is being able to use the IDs unambiguously as keys to data sets. Obviously the Biblical Terms tool is able to correctly link the IDs in different files despite differences in their binary representations, presumably by an on-the-fly normalization.

Since I am currently only prototyping, this is as deep as I need to go for now. Once I get to coding the plugin itself, I will need to dig into the details of how PT handles this area.

I now understand the problem much better. Thanks to all!

stevepence

by (127 points)

In Paratext, all term IDs and rendering IDs are normalized to NFC format when loaded to get around this problem. As a matter of fact, almost all data is normalized internally when comparing two strings because of this issue. I consider it a design problem of Unicode, but that’s a different topic. :stuck_out_tongue_winking_eye:

EDIT: Also, if you’re making a Paratext plugin, you should be using IProject.GetBiblicalTermRenderings and IPluginHost.GetBiblicalTermList / IProject.BiblicalTermList to handle the Biblical Terms - which should avoid this problem.
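If a prototype keeps its own lookup tables keyed by term IDs, the same normalize-on-load trick works. Here is a Python sketch of the idea (Paratext itself is not Python, and `TermMap` is a hypothetical helper, not part of any Paratext API):

```python
import unicodedata

def nfc(s):
    return unicodedata.normalize("NFC", s)

class TermMap:
    """Dictionary keyed by NFC-normalized term IDs, so two byte-different
    spellings of the same ID resolve to a single entry."""

    def __init__(self):
        self._data = {}

    def __setitem__(self, term_id, value):
        self._data[nfc(term_id)] = value

    def __getitem__(self, term_id):
        return self._data[nfc(term_id)]

term_map = TermMap()
term_map["\u05E9\u05B5\u05C1\u05DD"] = "Shem"   # renderings-file order
# Lookup with the other diacritic ordering still finds the same entry:
print(term_map["\u05E9\u05C1\u05B5\u05DD"])      # Shem
```

Normalizing at the boundary (every insert and every lookup) means the rest of the code never has to think about which form a given file used.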

Thanks. Currently I am still working in VBA finalizing a prototype and getting user feedback. I haven’t started actually doing real work with the api. I am certainly hoping that the api will hide all this detail, but I appreciate everyone’s help in understanding the issues - even if I don’t ever have to go there.

stevepence
