Matches Not Found in Word List - Composed and Decomposed

Question

I have recently come across an issue that I am now recognizing in more and more projects. The symptom appears when you try to correct a word containing a character with a diacritic (e.g. á or Á). You make the correction, and it comes back with the message, No Matches Found.

This is often because (1) no normalization is being applied to the project, and
(2) the characters á and Á in this project are decomposed (a base letter plus a combining mark), not composed (one unit).

I am told Paratext 8 stores all its data in the Word List, wherever possible, as composed. So when you click on a word in the word list, which has a correction, it then looks in the data, and does not find it, because out in the data, the á character is not composed, but decomposed.
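To make the mismatch concrete, here is a minimal Python sketch using only the standard unicodedata module. It is not Paratext's code; the sample word and text are made up for illustration. It shows a composed Word List entry failing an exact search against decomposed text, and the same search succeeding once both sides are normalized to the same form.

```python
import unicodedata

composed = "\u00e1"      # á as one code point (U+00E1)
decomposed = "a\u0301"   # a (U+0061) followed by combining acute accent (U+0301)

# Hypothetical data: the Word List holds the composed form,
# while the project text holds the decomposed form.
wordlist_entry = "m" + composed + "s"
project_text = "uno m" + decomposed + "s dos"

def contains(text: str, word: str) -> bool:
    """Code-point-exact substring search, with no normalization."""
    return word in text

def contains_normalized(text: str, word: str, form: str = "NFC") -> bool:
    """Substring search after normalizing both sides to the same form."""
    return unicodedata.normalize(form, word) in unicodedata.normalize(form, text)

print(contains(project_text, wordlist_entry))             # False
print(contains_normalized(project_text, wordlist_entry))  # True
```

The two strings look identical on screen, which is why the No Matches Found message is so confusing.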

The most robust solution for this issue is to Convert the Project - and implement a Normalization - preferably Composed.

The other solution is not to normalize the project by converting it, but to change the autocorrect keyboard, so that, instead of entering decomposed á, it enters composed á etc. And also to do a find and replace across the data, finding all the decomposed characters, and replacing them with composed characters.
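For the find-and-replace part of that second solution, a rough sketch of what composing the data outside Paratext could look like is below. The function name and the sample text are my own assumptions. Note that NFC does more than join a letter with its accent (it also puts stacked marks into a canonical order), so this should only be tried on a copy of the data, never on live project files.

```python
import unicodedata

def compose_text(text: str) -> tuple[str, int]:
    """Return the NFC (composed) form of text and the number of lines that changed."""
    changed = 0
    out = []
    for line in text.splitlines(keepends=True):
        fixed = unicodedata.normalize("NFC", line)
        if fixed != line:
            changed += 1
        out.append(fixed)
    return "".join(out), changed

# Hypothetical two-verse sample; only the first line holds a decomposed á.
sample = "\\v 1 ma\u0301s\n\\v 2 mas\n"
fixed, changed_lines = compose_text(sample)
print(changed_lines)  # 1
```

Reporting the number of changed lines gives a quick sanity check before anything is written back.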

Please, Paratext experts, comment on the above two solutions, as this feature is probably causing not a little frustration out there.

Paratext May 23, 2018 asked by muckles (536 points)
Jun 11, 2019 reshown

11 Answers

Best answer

When I sent a problem report on this previously, Steve White said that this is probably a bug, but that it may be fixed in newer versions. But I’m still seeing the issue in 8.0.100.59. Maybe someone could clarify how spelling corrections are supposed to work when a project sets Normalization to None.

The Bungu (wun) project is in the situation that anon758749 mentioned. This project was migrated to PT8 with normalization set to ‘None’. Currently we use decomposed characters, and we don’t have any problems keeping all uses decomposed. But of course, setting normalization to decomposed would be better.

But if Paratext knows that we have set Normalization to None, then why does it assume some sort of normalization when adjusting spelling via the Wordlist / Spell checking? Is this a bug or intended functionality? If someone has set normalization to None, wouldn’t that imply that they want Paratext to distinguish between the composed and decomposed uses? But in fact Paratext seems to be treating all uses as composed when it tries to correct spelling, which makes our decomposed characters incapable of being corrected via spell-check.

May 28, 2018 answered by Stephen Katt (1.3k points)
May 28, 2018 reshown

I have been told this decision is under review. But for now Paratext 8 will always store all data, to the extent possible, as composed. So for data that is already composed, this creates no problems in the Word List. But if your data is decomposed, and Normalization is set to none, then the problems start.
I think this may be deliberate on Paratext’s part, because they want you to apply a Normalization where possible. If they left the data as Unnormalized, then it would be possible for the very same word to appear in two or three or four different forms in the Word List (depending on whether the vowels were composed or decomposed), and this is a very undesirable outcome. I think this is why the Paratext developers made this decision.
You could make a suggestion to the Paratext developers, Stephen Katt, but it’s all part of switching to this new version of Paratext.
The best solution is to normalize - though of course the keyboard will have to be fixed to enter composed data.

May 28, 2018 commented by muckles (536 points)
May 28, 2018 reshown

When the normalization mode is None, the data is always stored as-is. It is never normalized in any way. However, because most users do not understand normalization and because of complaints about seeing the same word multiple times in the Wordlist (as you said), we made the (probably bad) decision to normalize the words in the wordlist as composed so that there would only be one occurrence of each “word”. Unfortunately, this had the undesirable side effect of messing up the find/replace done from the Wordlist (you can still do a normal find/replace and it should work).
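The trade-off described here can be sketched in a few lines of Python. This is only an illustration of the idea, not Paratext's implementation; the function names and sample words are assumptions. Grouping by a composed key gives one Wordlist entry per "word", but a replace then has to search for every stored variant, not for the key itself.

```python
import unicodedata
from collections import defaultdict

def group_by_composed(words):
    """Map each NFC key to the set of forms actually found in the text."""
    groups = defaultdict(set)
    for word in words:
        groups[unicodedata.normalize("NFC", word)].add(word)
    return dict(groups)

def replace_all_variants(text, variants, correction):
    """Replace every stored variant (longest first) with the correction.

    Plain substring replacement; a real tool would also respect word boundaries.
    """
    for variant in sorted(variants, key=len, reverse=True):
        text = text.replace(variant, correction)
    return text

# Hypothetical words: decomposed, composed, and unaccented.
words_in_text = ["ma\u0301s", "m\u00e1s", "mas"]
groups = group_by_composed(words_in_text)
print(len(groups))  # 2 entries instead of 3

text = "ma\u0301s y m\u00e1s"
print(replace_all_variants(text, groups["m\u00e1s"], "mas"))  # mas y mas
```

Searching the text for the key alone would miss the decomposed occurrence, which matches the symptom reported in this thread.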

We have had some internal discussions about the best way forward, but the problem is, unfortunately, a little complex because, as I said before, most users don’t understand what normalization is or why it might matter.

Note that when the normalization mode is set to None, normalization is also applied in other places like Biblical Terms so that matches happen consistently there as well.

May 29, 2018 commented by [Expert]

Fool Running (16.3k points)
May 29, 2018 reshown

“I’m not sure what makes you think it will be inconsistent more than it was before. Before Paratext even had the option to normalize data, users could enter the data in whatever form they pleased and that form would be retained in the data on-disk. This allowed many projects to contain data that was neither composed nor decomposed, but some horrible combination between the two. This is still the case when the normalization form of the project is set to None.”

A late response to this comment in the thread due to traveling and teaching at a workshop.

I may have pinpointed a problem in communication between the users and developers in this issue. When I say consistency, I do not actually care whether my teams’ data is composed or decomposed or mixed. I care that there is only one underlying Unicode value for each character. I.e., falling tone nasalized dotted I should not be found in the data with underlyingly different combinations of Unicode values. This applies to data inside and outside of Paratext. For all the teams in Nigeria that I serve (about 70 or more on paper; it’s hard to keep count), I have worked hard to keep this true by insisting that a team always use the same keyboard, and most have complied. The problem of rogue keyboards popping up comes from new people who think they are helping a team without consulting the supporters. I see that showing up and have to deal with it, but I have been diligent. Now my diligence in this is being undermined by Paratext itself, which is presenting a solution to an apparently different problem.
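This definition of consistency (one code point sequence per character, whatever that sequence is) can be checked mechanically. The sketch below is an assumption-laden illustration: it splits text into a base character plus its following combining marks, then reports any character that occurs in more than one encoding. The sample uses a dotted i with a tilde, written three different ways.

```python
import unicodedata
from collections import defaultdict

def clusters(text):
    """Yield each base character together with the combining marks that follow it."""
    current = ""
    for ch in text:
        if current and unicodedata.category(ch).startswith("M"):
            current += ch
        else:
            if current:
                yield current
            current = ch
    if current:
        yield current

def inconsistent_encodings(text):
    """Return {NFD key: set of encodings} for characters encoded more than one way."""
    seen = defaultdict(set)
    for cluster in clusters(text):
        seen[unicodedata.normalize("NFD", cluster)].add(cluster)
    return {key: forms for key, forms in seen.items() if len(forms) > 1}

# Hypothetical sample: precomposed dotted i + tilde, i + dot below + tilde,
# and i + tilde + dot below (marks typed in the other order).
sample = "\u1ecb\u0303 i\u0323\u0303 i\u0303\u0323"
for key, forms in inconsistent_encodings(sample).items():
    for form in sorted(forms):
        print(" ".join(f"U+{ord(ch):04X}" for ch in form))
```

Run against exported text, an empty result would mean every character has exactly one encoding, regardless of whether that encoding is composed, decomposed, or mixed.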

The person who wrote the comment at the top of this email seems to define consistency differently. They care that the data be decomposed or composed and never be some “horrible combination”. I freely admit that most likely our data does have some horrible combination since that part of how to avoid this was not absorbed in my training in 2002 if it was indeed taught. That was when we converted the data from legacy to Unicode and wrote new keyboards for Unicode data.

What I don’t get is why it matters. And especially why it matters enough to make Paratext cause me such a terrible problem in the middle of migrating all our projects one by one for a December 31 deadline. I’m half done and I’ve been very diligent about it. No matter what choice I make for the teams, their data will now be inconsistent by my definition. I chose to not normalize because it kept our data inside and outside of Paratext consistently the same with their underlying Unicode values and because I got absolutely no answers when I asked which was better, composed or decomposed, or for assistance in how to change our keyboards to match those choices so that I could somehow find a path forward in repairing the problem with a long term solution. Not that that path would be welcome; changing the keyboards is the small step. Fixing all the data outside Paratext to match Paratext is a very big job I don’t have time for. I know because I did it in 2002 when we had a lot fewer teams.

So, I beg the Paratext developers, please make fixing the wordlist BUG your highest priority and, since you really care about your definition of normalization, explain it to us clearly and give us assistance in fixing our keyboards. An easier solution than the one from 2002 for then converting our data outside Paratext to match the data inside Paratext would also be greatly appreciated. Surely there is one since it takes just a click in Paratext. I promise to look at the whole issue seriously and even try to comply, but not while I’m migrating all my teams and teaching them the new software. Maybe I can fit it in before you roll out a new interface.

Respectfully and desperately requested,

Shegnada

Language Technology and Publishing Coordinator, SIL Nigeria

Complex Script Layout Specialist, GPS Dallas

Jun 14, 2018 commented by Shegnada (1.3k points)
Jun 14, 2018 reshown


Kent Spielmann · Answer 1 · 2018-05-23T17:58:46+0000

I would do a convert project, however this requires some careful planning. Based on the above I assume the project has already been migrated, or was started, in PT8. I also assume you are an administrator on this project with the authority to do this.

First realize that you will need to change the short name of the project in order to convert it. If you want to keep the same short name you will need to convert it twice. After the first conversion you must delete the original project, then convert it again to the original project name.
Second, I recommend you do a test conversion. This will tell you two things: 1) how long it will take to convert and 2) whether there are any irregularities in the project repository that will prevent the project from being converted. The length of time it takes to do a conversion depends on several factors, including the number of revisions in the project, the number of users, how often they did Send/Receives and the speed of your processor. This will tell you how long the project will need to be offline and whether it is a reasonable thing to do.

It is actually a bit easier from a project management point of view to convert the project to the same short name since you can be certain that the old project has been eradicated from all users machines and that it will not pop up on your machine unexpectedly. This is what you do if you want to keep the same name:

Contact all users on the project. Tell them to do a S/R and then delete the project from their machines. Tell them you will contact them after the conversion and they will need to do a S/R to receive the project back.
Do a S/R on your machine to receive all the changes.
Convert the project to a temporary name. Be sure you choose composed (NFC) normalization. All of the users with their permissions will be retained in the converted project.
Do not register the project or do a S/R.
Save the old project to a file on your computer for backup insurance.
Delete the old project and registration for all users.
Convert the temporary project back to its original name.
Delete the temporary project.
Register the new project.
Do an S/R.
Tell all users to do a S/R and that they can start working again.

Fool Running · Answer 2 · 2018-05-30T13:35:25+0000

Questions not meant to be obstreperous but really really needing answers.

1. Why does Paratext normalize to composed in the wordlist instead of decomposed? Pros and cons please of the two choices including current and future use in cellphones.
2. How can normalizing to decomposed sort out the problem with the wordlist if it is normalizing to composed?
3. Why does no one respond to the data outside Paratext dilemma when normalizing? Am I just being stupid and there is no problem somehow? Is someone willing to provide a way to normalize outside data also? This is not trivial; I had to do this years ago when we moved from legacy to Unicode and we have 10 times the number of projects now. It daunts me exceedingly.
4. If I am forced to redo keyboards and convert outside data for all the projects I need a way to know exactly the encoding the characters are being normalized to. I don’t intend to spend hours of trial and error with Paratext figuring out the first step.
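On knowing exactly which encoding a character is normalized to: the composed (NFC) and decomposed (NFD) forms are defined by the Unicode standard, so they can be inspected outside Paratext. A small sketch, assuming Python's standard unicodedata module and helper names of my own choosing, that prints the code points for any character typed from a project keyboard:

```python
import unicodedata

def describe(s):
    """List each code point in s as 'U+XXXX NAME'."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in s]

def show_forms(s):
    """Print the code points of s under composed (NFC) and decomposed (NFD) normalization."""
    for form in ("NFC", "NFD"):
        print(form, describe(unicodedata.normalize(form, s)))

show_forms("a\u0301")
# NFC ['U+00E1 LATIN SMALL LETTER A WITH ACUTE']
# NFD ['U+0061 LATIN SMALL LETTER A', 'U+0301 COMBINING ACUTE ACCENT']
```

Whether Paratext's output matches these forms in every case is something the developers would need to confirm, but a mismatch would be worth reporting.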

Desperately,

Shegnada


May 30, 2018 commented by Shegnada (1.3k points)
May 30, 2018 reshown

Shegnada, I am probably missing something, but I don’t understand your insistence that all the data within a language has to be consistent. I could understand why this would be important if Paratext data were being exported to other media outside of Paratext, or vice versa (data composed in other media being imported into Paratext). Is this sort of thing common in the case of the language groups you work with? In my experience, Scripture stays in the Paratext environment until it is published or distributed in other forms (Scripture App, audio). If two-way exportation and importation is necessary, then the problem is very big.

Jun 14, 2018 commented by muckles (536 points)

Fool Running · Answer 3 · 2018-05-30T14:25:52+0000

Composed was chosen because we had already done composed in other places inside Paratext. It was a 50/50 choice.
Normalization to composed is only done if the project normalization is set to None or NFC. The data will be normalized to decomposed if the project is set to NFD.
I don’t understand how data outside Paratext relates to data inside Paratext. If data is pasted into Paratext and the project has a normalization set, then it will be normalized to the value.
You should not need to change keyboards or any data outside Paratext.

Fool Running · Answer 4 · 2018-05-30T15:36:51+0000

Yes

I’m not sure what makes you think it will be inconsistent more than it was before. Before Paratext even had the option to normalize data, users could enter the data in whatever form they pleased and that form would be retained in the data on-disk. This allowed many projects to contain data that was neither composed nor decomposed, but some horrible combination between the two. This is still the case when the normalization form of the project is set to None.

Also, the normalization of data even when the project is set to None is only used in certain cases to keep undesirable things from happening, like words that don’t match in the Biblical Terms and duplicate/identical words showing up in the Wordlist. This does not affect the scripture data at all and has been done behind the scenes for years before this new option was even added inside Paratext (e.g. in Biblical Terms matching).

If you don’t like the way that normalization is being done in a project with the normalization set to None, you can use Convert Project to change the project to use a specific normalization so that you can be assured that it is in a certain normalized mode. From that point on, all data will be normalized to the specified mode by Paratext for that project every time it is saved.

anon421222 · Answer 5 · 2018-06-01T20:15:37+0000

In Paratext 8.0, there are two topics related to the above:

How do I resolve problems with the Wordlist?
How do I make my Paratext project data consistent?

Tim · Answer 6 · 2018-06-13T21:44:15+0000

In one project, this has created shock and frustration. The Normalization setting in project properties advanced is (and was) set to “None”.

The project has always kept all its data decomposed. And recently they noticed that PT8 has (on its own) composed at least some of the spelling data. No time yet for a full analysis. This makes the PT data useless (or painfully needing lots of extra work) for all other external work, analysis, statistics, searches, scripts, … because the entire language project depends on consistent data!

Also the PT8 spell checking is very messed up on all words which first happen into PT8 as capitalized (typically at the beginning of a sentence) and later also happen in texts in mid-sentence. This is possibly a separate problem(?) but hard to tell, when many words carry diacritics and do not behave right.

It has made the user experience in the office very frustrating, if the tool does not respond in a reliable way. User right-clicks a marked-as-wrong word and confirms or wants to correct - and sees a window with green ticks (what? already correct?). And on other words, an error message and nothing happens.

The users need consistent data and a working spell checker, especially for emerging orthographies, where the team is building up lists of new words and needs a reliable lookup, not buggy behaviour.

Tim · Answer 7 · 2018-06-14T11:31:00+0000

I did some research today into specific project spellcheck data:

migration was successful, now PT8 latest version, Normalisation “None”

data is now completely mixed between composed and decomposed; before it was 100% decomposed

even worse: Why would PT even start “composing” for a language where it makes no sense at all:

This language is using ten vowels and three combining diacritics:

there are some pre-composed glyphs defined in Unicode for five of those vowels
but there is nothing for the other five vowels (and probably there never will be; that is the entire point of the combining diacritics, to avoid cluttering the Unicode tables with thousands of trivial combinations)
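The half-composed result can be reproduced with standard Unicode composition. In the sketch below the two vowels are my own stand-ins (the thread does not name the project's vowels): plain e has a precomposed acute form, while open e (U+025B) does not, so composing leaves the second one as two code points.

```python
import unicodedata

ACUTE = "\u0301"  # combining acute accent

def code_points_after_nfc(vowel: str, mark: str = ACUTE) -> int:
    """Number of code points left after composing vowel + mark."""
    return len(unicodedata.normalize("NFC", vowel + mark))

# Hypothetical vowels: "e" has a precomposed acute form, open e (U+025B) does not.
for vowel in ["e", "\u025b"]:
    print(f"U+{ord(vowel):04X}: {code_points_after_nfc(vowel)} code point(s) after NFC")
# U+0065: 1 code point(s) after NFC
# U+025B: 2 code point(s) after NFC
```

For an orthography like this one, composed data therefore still contains combining diacritics on some vowels.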

So PT should ask users four times and get a signature before ever starting to turn decomposed data (half) into composed data. We use a backup tool (I believe it is called Clonezilla or related) where the user gets asked three times (and the consequences are super explicitly spelled out) before they can even write stuff on any partition.

In my screenshot you can see bad data (red, composed) and good data (green, decomposed) and light highlighting for simple vowels for double-checking.

[Image: screenshot30]
Also you can see that some of the language data is happening inside XML tags (as part of tag attribute value) which made it rather hard for me to quickly analyse. Normally I would have expected the language data to be between the XML tags (like for the Correction example). But this is just sharing, not a complaint.

Tim · Answer 8 · 2018-06-14T21:01:50+0000

I would please need help for an immediate fix:

One translation project, five staff, four computers in sync, more data every working day.

All machines are PT8 latest version, Normalisation “None”, send/receive via USB-drive + chorus-hub + online server + shared folder on local NAS (depending on who and where and when).

Data is now completely mixed between composed and decomposed; before it was 100% decomposed.

I want to fix our data, so that entry, spell checking, everything is working, and we can export to Flex for more postprocessing.

We would prefer “None” i.e. PT will never alter our data. But that is reported as broken at the moment.

What are safe steps, to create a copy of the projects (one translation, one back translation, some private consultant notes) to take all data to fully decomposed (for just now) status, using a set of vowels I will provide (our project has got strict standards on what Unicode positions make up our orthography) and a set of combining diacritics I will provide.

If PT cannot do it presently, and if someone will send me the complete list of files that need fixing, I can probably do it via regexes in my power editor.
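As an alternative to hand-written regexes, a targeted decomposition could be sketched in Python as below. The vowel and diacritic sets are placeholders for the project's real lists, and the function name is my own. It decomposes only characters that break down into a project vowel plus project diacritics and leaves everything else alone. It does not reorder marks that are already decomposed, and it says nothing about which Paratext files are safe to edit.

```python
import unicodedata

# Placeholder sets; substitute the project's actual orthography.
VOWELS = set("aeiou\u025b\u0254")
DIACRITICS = {"\u0300", "\u0301", "\u0303"}  # grave, acute, tilde

def decompose_orthography(text: str) -> str:
    """Decompose only characters made of a project vowel plus project diacritics."""
    out = []
    for ch in text:
        nfd = unicodedata.normalize("NFD", ch)
        if len(nfd) > 1 and nfd[0].lower() in VOWELS and set(nfd[1:]) <= DIACRITICS:
            out.append(nfd)
        else:
            out.append(ch)
    return "".join(out)

# á is decomposed; ç is left composed because c is not a project vowel.
result = decompose_orthography("\u00e1 \u00e7")
print([f"U+{ord(ch):04X}" for ch in result])
# ['U+0061', 'U+0301', 'U+0020', 'U+00E7']
```

Restricting the change to the project's own character sets is a design choice that keeps unrelated text from being rewritten.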

I further need clear instructions on how to avoid a mess with send/receive for all machines. I guess I need to create a fresh set of projects/files. Then turn off PT. Do all the editing and fixing. Turn PT on again and check that all is working. And then roll those out to the other computers as “new stuff”.

Are there safety features (hash-notes) in PT or in send/receive which would notice and protest/undo when files are altered outside of PT?

This language is still emerging from an unwritten-stage. Many words happen for the first time in writing in PT. So spell checking is not a luxury, it is essential to the work here. And features like Biblical key terms make no sense until the spelling is correct and consistent.

We can wait for a proper fix, which is stable and user-friendly. But we also need a quick fix to continue our language work. I am willing to do the work, I just need enough info so that I will not make it even worse. Thank you.

Phil_Leckrone · Answer 9 · 2018-06-14T21:23:11+0000

The current fix for your situation is to do a project conversion (Tools > Advanced > Convert Project). You must give the project a new short name and also set the normalization to Decomposed.

The process will create a completely new project by converting your data and history. All of the settings will be kept so this new project can be sent. You will need to register the project in order to S/R via the Internet.

Please examine the project after conversion to verify that your data is correct. You could use Tool > Compare Projects as one way to verify.

Please note that this does not change your current project. You should do this for each project that needs to be changed. You would need to relink your back translation to the new project.

This process can take a while depending on how many history points you have in your project.

Jun 14, 2018 answered by Phil_Leckrone (8.9k points)

I am right now in the process of the project conversion. Now I get stuck on the registration website: I cannot see the old project next to the converted project. // Update: I had to log in twice to finally have both projects next to each other. So I discovered this:

PT has filled in some of the fields according to the original project but has left many details blank, even though they are “Fields required for registration are shown with a red asterisk (*)”

Some fields were even changed to wrong content: the rightsholder was changed wrongly, and even worse, the confidential status was changed. Luckily the migration is not long ago, so I remembered that after clicking on the big green button, one needs to open the registration again (EDIT) and find many more tabs which all need work.

I will report later on how the conversion went technically. But I wanted to register first, so that I can connect the converted back-translation to “something existing”.

Jun 18, 2018 commented by Tim (855 points)

Dear anon848905,

I just did the conversion for the main translation, the back translation and the private consultant notes project. That worked. I did check the data (spot checks on several SFM and XML files) and found no more composed characters.
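A spot check like this can be automated. The sketch below is my own assumption of how such a checker might look, not the self-made checker mentioned in this thread. It lists every character that decomposed normalization would split, with its line number, so an empty result means no composed characters remain in that text.

```python
import unicodedata

def composed_characters(text: str):
    """Return (line number, character) pairs for characters that NFD would split."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for ch in line:
            if unicodedata.normalize("NFD", ch) != ch:
                hits.append((lineno, ch))
    return hits

# Hypothetical sample: line 1 is fully decomposed, line 2 still has two composed é.
sample = "\\v 1 a\u0301\n\\v 2 \u00e9t\u00e9\n"
for lineno, ch in composed_characters(sample):
    print(lineno, f"U+{ord(ch):04X}")
# 2 U+00E9
# 2 U+00E9
```

The same function could be run over the text of each SFM or XML file after reading it as UTF-8.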

It seems that many of our settings, users, user-roles, style sheets, spell checking data, etc. have not been damaged in this conversion.

Funny enough, all our French entries where vowels carry accents (for example in the consultant notes XML file) now showed up “green” in my self-made checker as also decomposed characters. That was not the case before.

So our initial choice of Normalization in the migration to PT8 for “None”, i.e. do not change any data was the right choice, except there is this bug… I guess, having slightly messed up French notes, is the lesser weevil compared to messed up translation data.

I did register online (see my previous post) but now my freshly converted projects do not show up in send/receive via Internet server. We have three more options for send/receive, but send/receive via Internet also feels like a nice way of doing backup out-of-country in case of whatever. So do I need to do anything on this end to activate send/receive?

I guess we can never demand “to have no bugs”. But this 3-projects-conversions hack has probably given us a reasonable temporary fix. I have mentioned the two eyebrow-raisers: French vowels get affected, as the XML files do not have ways to distinguish what I write in my notes, so PT is not to blame; and no send/receive via web yet.

In conclusion, I feel glad that I work a lot every day with PT, and still I was hesitant about what to do in what order. I did the main translation first and registered online. Only then did I convert the back translation and assign it to the new translation project. Then I did the private consultant notes.

I do not know how we can have a final test for data-integrity. I do not know if or when I need to switch our old (buggy) project on the online registry from active to something else (we should keep it for a few weeks, just in case).

So, since this conversion is helpful, maybe somebody who knows much more, could write a step-by-step instruction for other supporters (not click by click, just step by step).

Thanks again.

Jun 18, 2018 commented by Tim (855 points)
