0 votes

I have recently come across an issue that I am now recognizing in more and more projects. Its particular symptom appears when you try to correct a word containing a character with a diacritic (e.g. á or Á). You make the correction, and it comes back with the message “No Matches Found”.

This is often because (1) no normalization is being applied to the project, and
(2) the characters á and Á in this project are decomposed (a base letter plus a combining mark), not composed (one unit).

I am told Paratext 8 stores all its data in the Word List, wherever possible, as composed. So when you click on a word in the word list, which has a correction, it then looks in the data, and does not find it, because out in the data, the á character is not composed, but decomposed.
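
For anyone who wants to see the mismatch concretely, here is a minimal Python sketch. It is not Paratext code, and the sample words are made up; it only shows that a composed á and a decomposed á are different strings, so a plain search for one does not find the other unless both sides are normalized first.

```python
import unicodedata

# The same visible letter "á" in two encodings.
composed = "\u00E1"       # U+00E1 LATIN SMALL LETTER A WITH ACUTE (one code point)
decomposed = "a\u0301"    # U+0061 + U+0301 COMBINING ACUTE ACCENT (two code points)

# Hypothetical data: the word list holds the composed form,
# while the Scripture text holds the decomposed form.
wordlist_entry = "m" + composed + "ma"
scripture_text = "na m" + decomposed + "ma yake"

# A plain substring search compares code points, so it misses the word.
found_raw = wordlist_entry in scripture_text

# Normalizing both sides to the same form makes the search succeed.
found_normalized = (
    unicodedata.normalize("NFC", wordlist_entry)
    in unicodedata.normalize("NFC", scripture_text)
)

print("lengths:", len(composed), len(decomposed))
print("raw search finds the word:", found_raw)
print("normalized search finds the word:", found_normalized)
```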

The most robust solution for this issue is to Convert the Project - and implement a Normalization - preferably Composed.

The other solution is not to normalize the project by converting it, but to change the autocorrect keyboard, so that, instead of entering decomposed á, it enters composed á etc. And also to do a find and replace across the data, finding all the decomposed characters, and replacing them with composed characters.
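
If you go the find-and-replace route, the list of pairs to replace can be generated rather than typed by hand. The sketch below is an assumption-laden illustration in Python: the letters and accents are a made-up orthography, and it only handles a base letter followed by a single combining mark. Stacked diacritics would need full NFC normalization instead.

```python
import unicodedata

def build_replacements(base_letters, combining_marks):
    """Map each decomposed sequence to its single precomposed character,
    for the pairs where Unicode defines one."""
    table = {}
    for base in base_letters:
        for mark in combining_marks:
            sequence = base + mark
            composed = unicodedata.normalize("NFC", sequence)
            if len(composed) == 1:   # a precomposed character exists
                table[sequence] = composed
    return table

def replace_all(text, table):
    """Apply every find-and-replace pair in the table to text."""
    for sequence, composed in table.items():
        text = text.replace(sequence, composed)
    return text

# Hypothetical orthography: a/e (both cases) with acute and grave accents.
table = build_replacements("aeAE", ["\u0301", "\u0300"])
fixed = replace_all("ba\u0301ba e\u0300", table)
print(len(table), "replacement pairs")
print(fixed)
```

Pairs with no precomposed character are simply left out of the table, so those sequences stay decomposed.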

Please, Paratext experts, comment on the above two solutions, as this feature is probably causing not a little frustration out there.

Paratext by (536 points)

11 Answers

+2 votes
Best answer

When I sent a problem report on this previously, Steve White said that this is probably a bug, but that it may be fixed in newer versions. But I’m still seeing the issue in 8.0.100.59. Maybe someone could clarify how spelling corrections are supposed to work when a project sets Normalization to None.

The Bungu (wun) project is in the situation that anon758749 mentioned. This project was migrated to PT8 with normalization set to ‘None’. Currently we use decomposed characters, and we don’t have any problems keeping all uses decomposed. But of course, setting normalization to decomposed would be better.

But if Paratext knows that we have set Normalization to None, then why does it assume some sort of normalization when adjusting spelling via the Wordlist / Spell checking? Is this a bug or intended functionality? If someone has set normalization to None, wouldn’t that imply that they want Paratext to distinguish between the composed and decomposed uses? But in fact Paratext seems to be treating all uses as composed when it tries to correct spelling, which makes our decomposed characters incapable of being corrected via spell-check.

by (1.2k points)

I have been told this decision is under review. But for now Paratext 8 will always store all data, to the extent possible, as composed. So for data that is already composed, this creates no problems in the Word List. But if your data is decomposed, and Normalization is set to none, then the problems start.
I think this may be deliberate on Paratext’s part, because they want you to apply a Normalization where possible. If they left the data as Unnormalized, then it would be possible for the very same word to appear in two or three or four different forms in the Word List (depending on whether the vowels were composed or decomposed), and this is a very undesirable outcome. I think this is why the Paratext developers made this decision.
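
The duplicate-entry effect is easy to reproduce outside Paratext. This is only a sketch with invented words, not how Paratext builds its Word List: counting words without normalization gives the same word two entries, and normalizing first merges them.

```python
import unicodedata
from collections import Counter

# Hypothetical text in which the same word was typed with two keyboards:
# twice with a composed á and once with a decomposed á.
words = ["b\u00E1ba", "ba\u0301ba", "b\u00E1ba", "mama"]

# Without normalization the two encodings count as different words.
raw_counts = Counter(words)

# Composing every word first merges them into one entry.
merged_counts = Counter(unicodedata.normalize("NFC", w) for w in words)

print("distinct entries, unnormalized:", len(raw_counts))
print("distinct entries, normalized:", len(merged_counts))
```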
You could make a suggestion to the Paratext developers, Stephen+Katt, but it’s all part of switching to this new version of Paratext.
The best solution is to normalize - though of course the keyboard will have to be fixed to enter composed data.

That, indeed is the problem, fixing the keyboards and updating all non-paratext data to match what Paratext has done. It is a monumental task that we have no time for.

Blessings,

Shegnada James

Language Technology and Publishing Coordinator, SIL Nigeria

Text Processing Specialist – Complex Script, GPS, SIL Intl

Skype: Shegnada.james.

[Email Removed]

+1 972 974 8146

When the normalization mode is None, the data is always stored as-is. It is never normalized in any way. However, because most users do not understand normalization and because of complaints about seeing the same word multiple times in the Wordlist (as you said), we made the (probably bad) decision to normalize the words in the wordlist as composed so that there would only be one occurrence of each “word”. Unfortunately, this had the undesirable side effect of messing up the find/replace done from the Wordlist (you can still do a normal find/replace and it should work).

We have had some internal discussions about the best way forward, but the problem is, unfortunately, a little complex because, as I said before, most users don’t understand what normalization is or why it might matter.

Note that when the normalization mode is set to None, normalization is also applied in other places like Biblical Terms so that matches happen consistently there as well.

Well, I think the situation as it is is not good. There are entities that have yet to migrate to Paratext 8. They are using Decomposed character keyboards (e.g. Eastern Congo Keyman keyboard), and Normalization will by default be set to None, since Normalization was not an option in Paratext 7. When they do migrate, they will find this situation in the Word List, and it will be an unpleasant surprise for them.

What is your best advice for such entities, anon291708? Should they normalize? If so, to decomposed? Will that sort out the problem with the Word List, while at the same time allowing them to keep using their existing keyboards?

Yes, I have seen problems for days. And even doubting some of my own work. And now I am reading that this is a major issue. Should have been mentioned in the migration documentation for all projects.

Where is the place, where the PT team will explain the situation and will provide solutions, when they are ready? Is it here in this forum? In this thread?

Can we do something right now to avoid further corruption of data?

“I’m not sure what makes you think it will be inconsistent more than it was before. Before Paratext even had the option to normalize data, users could enter the data in whatever form they pleased and that form would be retained in the data on-disk. This allowed many projects to contain data that was neither composed nor decomposed, but some horrible combination between the two. This is still the case when the normalization form of the project is set to None.”

A late response to this comment in the thread, due to traveling and teaching at a workshop.

I may have pinpointed a problem in communication between the users and developers in this issue. When I say consistency, I do not actually care whether my teams’ data is composed or decomposed or mixed. I care that there is only one underlying Unicode value for each character. I.e., a falling-tone nasalized dotted I should not be found in the data with underlyingly different combinations of Unicode values. This applies to data inside and outside of Paratext. For all the teams in Nigeria that I serve (about 70 or more on paper; it’s hard to keep count), I have worked hard to keep this true by insisting that a team always use the same keyboard, and most have complied. The problem of rogue keyboards popping up comes from new people who think they are helping a team without consulting the supporters. I get that showing up and have to deal with it, but I have been diligent. Now my diligence in this is being undermined by Paratext itself, which is presenting a solution to an apparently different problem.
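
This kind of consistency (one underlying encoding per character, whatever that encoding is) can be checked mechanically. The following Python sketch is one possible approach, not an existing tool: it groups each base letter with its combining marks and reports any character that turns up in the text with more than one raw encoding. The sample string is invented.

```python
import unicodedata
from collections import defaultdict

def clusters(text):
    """Split text into base characters, each with its trailing combining marks."""
    result = []
    for ch in text:
        if result and unicodedata.combining(ch):
            result[-1] += ch       # attach the mark to the preceding base
        else:
            result.append(ch)
    return result

def inconsistent_encodings(text):
    """Return {canonical form: set of raw encodings} for every character
    that occurs in the text with more than one underlying encoding."""
    seen = defaultdict(set)
    for cluster in clusters(text):
        # NFD is used only as a common key; the raw form is what gets reported.
        seen[unicodedata.normalize("NFD", cluster)].add(cluster)
    return {key: forms for key, forms in seen.items() if len(forms) > 1}

# Hypothetical sample: á typed both composed and decomposed,
# and ị with an acute accent typed in one way only.
sample = "b\u00E1ba ba\u0301ba b\u1ECB\u0301"
report = inconsistent_encodings(sample)
for key, forms in report.items():
    print([[f"U+{ord(c):04X}" for c in form] for form in sorted(forms)])
```

A partly composed form such as U+1ECB followed by U+0301 is not flagged as long as it is the only encoding in use, which matches the definition of consistency given above.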

The person who wrote the comment at the top of this email seems to define consistency differently. They care that the data be decomposed or composed and never be some “horrible combination”. I freely admit that most likely our data does have some horrible combination since that part of how to avoid this was not absorbed in my training in 2002 if it was indeed taught. That was when we converted the data from legacy to Unicode and wrote new keyboards for Unicode data.

What I don’t get is why it matters. And especially why it matters enough to make Paratext cause me such a terrible problem in the middle of migrating all our projects one by one for a December 31 deadline. I’m half done and I’ve been very diligent about it. No matter what choice I make for the teams, their data will now be inconsistent by my definition. I chose to not normalize because it kept our data inside and outside of Paratext consistently the same with their underlying Unicode values and because I got absolutely no answers when I asked which was better, composed or decomposed, or for assistance in how to change our keyboards to match those choices so that I could somehow find a path forward in repairing the problem with a long term solution. Not that that path would be welcome; changing the keyboards is the small step. Fixing all the data outside Paratext to match Paratext is a very big job I don’t have time for. I know because I did it in 2002 when we had a lot fewer teams.

So, I beg the Paratext developers, please make fixing the wordlist BUG your highest priority and, since you really care about your definition of normalization, explain it to us clearly and give us assistance in fixing our keyboards. An easier solution than the one from 2002 for then converting our data outside Paratext to match the data inside Paratext would also be greatly appreciated. Surely there is one, since it takes just a click in Paratext. I promise to look at the whole issue seriously and even try to comply, but not while I’m migrating all my teams and teaching them the new software. Maybe I can fit it in before you roll out a new interface.

Respectfully and desperately requested,

Shegnada

Language Technology and Publishing Coordinator, SIL Nigeria

Complex Script Layout Specialist, GPS Dallas
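
On the request above for an easier way to convert data outside Paratext: for plain-text UTF-8 files, a short script may be enough. This is a rough sketch under several assumptions: the function name, folder layout and `*.txt` pattern are hypothetical, the target folder should sit outside the source folder, and word-processor formats or legacy encodings are not handled at all. It writes normalized copies and leaves the originals untouched.

```python
import unicodedata
from pathlib import Path

def normalize_tree(source_dir, target_dir, form="NFD", pattern="*.txt"):
    """Copy every matching UTF-8 file from source_dir to target_dir,
    normalizing its text to the given Unicode form.
    Returns the relative names of the files whose content changed."""
    source_dir, target_dir = Path(source_dir), Path(target_dir)
    changed = []
    for path in sorted(source_dir.rglob(pattern)):
        text = path.read_text(encoding="utf-8")
        normalized = unicodedata.normalize(form, text)
        out_path = target_dir / path.relative_to(source_dir)
        out_path.parent.mkdir(parents=True, exist_ok=True)
        out_path.write_text(normalized, encoding="utf-8")
        if normalized != text:
            changed.append(str(path.relative_to(source_dir)))
    return changed
```

Calling `normalize_tree("texts", "texts_nfd", form="NFD")` would produce decomposed copies; pass `form="NFC"` for composed. Comparing a few files before and after is a sensible check before trusting the output.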

0 votes

I would do a convert project, however this requires some careful planning. Based on the above I assume the project has already been migrated, or was started, in PT8. I also assume you are an administrator on this project with the authority to do this.

  • First realize that you will need to change the short name of the project in order to convert it. If you want to keep the same short name you will need to convert it twice. After the first conversion you must delete the original project, then convert it again to the original project name.
  • Second I recommend you do a test conversion. This will tell you two things: 1) How long it will take to convert and 2) if there are any irregularities in the project repository that will prevent the project from being converted. The length of time it takes to do a conversion depends on several factors including the number of revisions in the project, the number of users, how often they did Send/Receives and the speed of your processor. This will tell you how long the project will need to be off line and if it is a reasonable thing to do.

It is actually a bit easier from a project management point of view to convert the project to the same short name since you can be certain that the old project has been eradicated from all users machines and that it will not pop up on your machine unexpectedly. This is what you do if you want to keep the same name:

  1. Contact all users on the project. Tell them to do a S/R and then delete the project from their machines. Tell them you will contact them after the conversion and they will need to do a S/R to receive the project back.
  2. Do a S/R on your machine to receive all the changes.
  3. Convert the project to a temporary name. Be sure you choose composed (NFC) normalization. All of the users with their permissions will be retained in the converted project.
    Do not register the project or do a S/R.
  4. Save the old project to a file on your computer for backup insurance.
  5. Delete the old project and registration for all users.
  6. Convert the temporary project back to its original name.
  7. Delete the temporary project.
  8. Register the new project.
  9. Do an S/R.
  10. Tell all users to do a S/R and that they can start working again.
by (1.8k points)
0 votes

The best advice is to use Convert Project to normalize the project into whichever form (composed or decomposed) they need (usually depends on the script and/or keyboard used). That will sort out the problem with the Wordlist and should not affect what keyboard they use.

by [Expert]
(16.2k points)

Questions not meant to be obstreperous but really really needing answers.

  1. Why does Paratext normalize to composed in the wordlist instead of decomposed? Pros and cons, please, of the two choices, including current and future use in cellphones.

  2. How can normalizing to decomposed sort out the problem with the wordlist if it is normalizing to composed?

  3. Why does no one respond to the dilemma of data outside Paratext when normalizing? Am I just being stupid and there is no problem somehow? Is someone willing to provide a way to normalize outside data also? This is not trivial. I had to do this years ago when we moved from legacy to Unicode, and we have 10 times the number of projects now. It daunts me exceedingly.

  4. If I am forced to redo keyboards and convert outside data for all the projects, I need a way to know exactly the encoding the characters are being normalized to. I don’t intend to spend hours of trial and error with Paratext figuring out the first step.
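
On the last question, the exact code points that each normalization form produces can be listed without any trial and error in Paratext, because NFC and NFD are defined by the Unicode standard. A small Python sketch follows; the example character is an illustration, and any character from a project's orthography can be substituted.

```python
import unicodedata

def describe(text, form):
    """List (code point, Unicode name) for text after normalizing to form."""
    normalized = unicodedata.normalize(form, text)
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "UNNAMED"))
            for ch in normalized]

# Example: i with dot below and acute accent.
typed = "\u1ECB\u0301"
for form in ("NFC", "NFD"):
    print(form)
    for code, name in describe(typed, form):
        print("  ", code, name)
```

For this character the composed form keeps U+1ECB followed by U+0301 (Unicode has no single code point for the full combination), while the decomposed form is U+0069, U+0323, U+0301.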

Desperately,

Shegnada


Thank you so much for the reply. So just to make sure I understand correctly, if we normalize a project that already has decomposed data, to a Decomposed Normalization, the Word List will store the data in Decomposed form, so that filtering on decomposed forms will find them, the decomposing keyboard will work OK, etc etc. ?

Once the project is normalized to decomposed, the word list too will be decomposed, and so that should solve the problem, at least internally within Paratext, between the Word List and the Scripture books. It is a solution, provided that all the data in the Scripture files is decomposed. It DOES NOT solve the problem of a keyboard that enters a mixture of composed and decomposed data outside of Paratext.

Shegnada, I am probably missing something, but I don’t understand your insistence that all the data within a language has to be consistent. I could understand why this would be important if Paratext data were being exported to other media outside of Paratext, or vice versa, if data composed in other media were being imported into Paratext. Is this sort of thing common in the case of the language groups you work with? In my experience, Scripture stays in the Paratext environment until it is published or distributed in other forms (Scripture App, audio). If two-way exportation and importation is necessary, then the problem is very big.

While importing data into Paratext is rare, the data in Paratext is one of the biggest sources of data for our languages. Up to this point, if a team used the same keyboard for all work, the data was consistent across all software. Now we have a problem.

Blessings,

Shegnada James


0 votes
  1. Composed was chosen because we had already done composed in other places inside Paratext. It was a 50/50 choice.
  2. Normalization to composed is only done if the project normalization is set to None or NFC. The data will be normalized to decomposed if the project is set to NFD.
  3. I don’t understand how data outside Paratext relates to data inside Paratext. If data is pasted into Paratext and the project has a normalization set, then it will be normalized to the value.
  4. You should not need to change keyboards or any data outside Paratext.
by [Expert]
(16.2k points)
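
The rule in point 2 of the answer above can be modelled in a few lines. To be clear, this is a reader's sketch of the behaviour as described in this thread, not Paratext source code; the function names and the setting labels "None", "NFC" and "NFD" are assumptions made for illustration.

```python
import unicodedata

def wordlist_form(project_setting):
    """Unicode form applied to Wordlist entries, as described in this thread:
    composed for None or NFC, decomposed for NFD."""
    if project_setting not in ("None", "NFC", "NFD"):
        raise ValueError(f"unknown normalization setting: {project_setting!r}")
    return "NFD" if project_setting == "NFD" else "NFC"

def wordlist_entry(word, project_setting):
    """Return the word as it would be stored in the modelled Wordlist."""
    return unicodedata.normalize(wordlist_form(project_setting), word)

# A word typed with a decomposing keyboard.
decomposed_word = "ba\u0301ba"
for setting in ("None", "NFC", "NFD"):
    entry = wordlist_entry(decomposed_word, setting)
    print(setting, "-> entry matches the typed word:", entry == decomposed_word)
```

Under this model only the NFD setting leaves a decomposed word unchanged, which is why converting a decomposed project to NFD brings the Wordlist and the text back into agreement.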

Paratext is not a world unto itself. Data inside Paratext is used outside Paratext in a variety of ways and with a variety of software and internet tools. But now it will be inconsistent with all the other data. The only way to keep all the data consistent is to change the data outside Paratext and from then on type the data so that the encodings match the Paratext data.

What about the above statement is untrue?

Blessings,

Shegnada James


+1 vote

Yes

I’m not sure what makes you think it will be inconsistent more than it was before. Before Paratext even had the option to normalize data, users could enter the data in whatever form they pleased and that form would be retained in the data on-disk. This allowed many projects to contain data that was neither composed nor decomposed, but some horrible combination between the two. This is still the case when the normalization form of the project is set to None.

Also, the normalization of data even when the project is set to None is only used in certain cases to keep undesirable things from happening, like words that don’t match in the Biblical Terms and duplicate/identical words showing up in the Wordlist. This does not affect the scripture data at all and has been done behind the scenes for years before this new option was even added inside Paratext (e.g. in Biblical Terms matching).

If you don’t like the way that normalization is being done in a project with the normalization set to None, you can use Convert Project to change the project to use a specific normalization so that you can be assured that it is in a certain normalized mode. From that point on, all data will be normalized to the specified mode by Paratext for that project every time it is saved.

by [Expert]
(16.2k points)

0 votes

In Paratext 8.0, there are two topics related to the above:

  1. How do I resolve problems with the Wordlist?
  2. How do I make my Paratext project data consistent?
by [Expert]
(733 points)
0 votes

In one project, this has created shock and frustration. The Normalization setting in project properties advanced is (and was) set to “None”.

The project has always kept all its data decomposed. And recently they noticed that PT8 has (on its own) composed at least some of the spelling data. No time yet for a full analysis. This makes the PT data useless (or painfully needing lots of extra work) for all other external work, analysis, statistics, searches, scripts, … because the entire language project depends on consistent data!

Also the PT8 spell checking is very messed up on all words that first enter PT8 capitalized (typically at the beginning of a sentence) and later also occur in texts mid-sentence. This is possibly a separate problem(?) but hard to tell, when many words carry diacritics and do not behave right.

It has made the user experience in the office very frustrating, if the tool does not respond in a reliable way. User right-clicks a marked-as-wrong word and confirms or wants to correct - and sees a window with green ticks (what? already correct?). And on other words, an error message and nothing happens.

The users need consistent data and a working spell checker, especially for emerging orthographies, where the team is building up lists of new words and needs a reliable lookup, not buggy behaviour.

by (842 points)
0 votes

I did some research today into specific project spellcheck data:

migration was successful, now PT8 latest version, Normalisation “None”

data is now completely mixed between composed and decomposed; before it was 100% decomposed

even worse: Why would PT even start “composing” for a language where it makes no sense at all:

This language is using ten vowels and three combining diacritics:

  • there are some pre-composed glyphs defined in Unicode for five of those vowels
  • but there is nothing for the other five vowels (and probably there never will be; that is the entire point of the combining diacritics, to avoid cluttering the Unicode tables with thousands of trivial combinations)
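
The half-and-half effect described in the two bullets above can be shown with a short Python sketch. The ten vowels here are a hypothetical inventory chosen for illustration (the project's actual vowels are not listed in this post): composing works for the five plain Latin vowels and leaves the other five as base letter plus combining mark.

```python
import unicodedata

ACUTE = "\u0301"  # COMBINING ACUTE ACCENT

# Hypothetical ten-vowel inventory: five plain Latin vowels
# plus five IPA-style vowels (ɛ ɔ ɪ ʊ ə).
vowels = ["a", "e", "i", "o", "u",
          "\u025B", "\u0254", "\u026A", "\u028A", "\u0259"]

# Length of each vowel + acute after composing (NFC):
# 1 means a precomposed character exists, 2 means it stays decomposed.
results = {v: len(unicodedata.normalize("NFC", v + ACUTE)) for v in vowels}

for vowel, length in results.items():
    status = "precomposed exists" if length == 1 else "stays decomposed"
    print(f"U+{ord(vowel):04X} + U+0301 -> {length} code point(s): {status}")
```

So for an orthography like this, composed normalization cannot give every accented vowel a single code point; only decomposed normalization treats all ten vowels the same way.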

So PT should ask users four times and get a signature before ever starting to turn decomposed data (half) into composed data. We use a backup tool (I believe it is called Clonezilla or related) where the user gets asked three times (and the consequences are super explicitly spelled out) before they can even write stuff on any partition.

In my screenshot you can see bad data (red, composed) and good data (green, decomposed) and light highlighting for simple vowels for double-checking.

[Image: screenshot30]
Also you can see that some of the language data is happening inside XML tags (as part of tag attribute value) which made it rather hard for me to quickly analyse. Normally I would have expected the language data to be between the XML tags (like for the Correction example). But this is just sharing, not a complaint.

by (842 points)
0 votes

I would please need help for an immediate fix:

One translation project, five staff, four computers in sync, more data every working day.

All machines are PT8 latest version, Normalisation “None”, send/receive via USB-drive + chorus-hub + online server + shared folder on local NAS (depending on who and where and when).

Data is now completely mixed between composed and decomposed; before it was 100% decomposed.

I want to fix our data, so that entry, spell checking, everything is working, and we can export to Flex for more postprocessing.

We would prefer “None” i.e. PT will never alter our data. But that is reported as broken at the moment.

What are safe steps, to create a copy of the projects (one translation, one back translation, some private consultant notes) to take all data to fully decomposed (for just now) status, using a set of vowels I will provide (our project has got strict standards on what Unicode positions make up our orthography) and a set of combining diacritics I will provide.

If PT cannot do it presently, and if someone will send me the complete list of files that need fixing, I can probably do it via regexes in my power editor.

I further need clear instructions on how to avoid a mess with send/receive for all machines. I guess I need to create a fresh set of projects/files. Then turn off PT. Do all the editing and fixing. Turn PT on again and check that all is working. And then roll those out to the other computers as “new stuff”.

Are there safety features (hash-notes) in PT or in send/receive which would notice and protest/undo when files are altered outside of PT?

This language is still emerging from an unwritten-stage. Many words happen for the first time in writing in PT. So spell checking is not a luxury, it is essential to the work here. And features like Biblical key terms make no sense until the spelling is correct and consistent.

We can wait for a proper fix, which is stable and user-friendly. But we also need a quick fix to continue our language work. I am willing to do the work, I just need enough info so that I will not make it even worse. Thank you.

by (842 points)
+1 vote

The current fix for your situation is to do a project conversion (Tools > Advanced > Convert Project). You must give the project a new short name and also set the normalization to Decomposed.

The process will create a completely new project by converting your data and history. All of the settings will be kept so this new project can be sent. You will need to register the project in order to S/R via the Internet.

Please examine the project after conversion to verify that your data is correct. You could use Tool > Compare Projects as one way to verify.

Please note that this does not change your current project. You should do this for each project that needs to be changed. You would need to relink your back translation to the new project.

This process can take a while depending on how many history points you have in your project.

by (8.0k points)

thank you anon848905, I will try that as soon as possible

I believe I can do all that you have listed. I guess there will be a new folder in the PT-projects folder. Will all files get copied/converted during “project conversion” or are there elements like custom style-sheets and shared stuff that I would need to copy or sync manually?

Is there documentation about how to

?

Yes, when you do a project conversion the new projects are created in their own folders. All of the elements of the project should be converted and custom stylesheets and settings should be moved to the new project folder.

To relink a back-translation you would click in the back-translation project and then go to Project > Project Properties and Settings. For the option as to which project the back-translation is “Based On” you would choose the name of the new translation.

I am right now in the process of the project conversion. Now I get stuck on the registration website: I cannot see the old project next to the converted project. // Update: I had to login twice to finally have both projects next to each other. So I discovered this:

PT has filled in some of the fields according to the original project but has left many details blank, even though they are “Fields required for registration are shown with a red asterisk (*)”

Some fields were even changed to wrong content: the rightsholder was changed wrongly, and even worse, the confidential status was changed. Luckily the migration is not long ago, so I remembered that after clicking on the big green button, one needs to open the registration again (EDIT) and find many more tabs which all need work.

I will report later on how the conversion went technically. But I wanted to register first, so that I can connect the converted back-translation to “something existing”.

Dear anon848905,

I just did the conversion for the main translation, the back translation and the private consultant notes project. That worked. I did check the data (spot checks on several SFM and XML files) and found no more composed characters.
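
A spot check like the one described above can also be scripted. This is a minimal Python sketch, not the self-made checker mentioned in this thread: it lists the characters in a string that would change under decomposed normalization, and reports whether the string is already fully decomposed. The sample strings are invented.

```python
import unicodedata

def composed_characters(text):
    """Return the distinct characters in text that have a canonical
    decomposition, i.e. characters that would change under NFD."""
    return sorted({ch for ch in text if unicodedata.normalize("NFD", ch) != ch})

def is_fully_decomposed(text):
    """True if normalizing to NFD would leave the text unchanged."""
    return unicodedata.normalize("NFD", text) == text

# Hypothetical samples: one with a composed French é, one fully decomposed.
mixed = "ba\u0301ba caf\u00E9"
clean = "ba\u0301ba cafe\u0301"

print(composed_characters(mixed), is_fully_decomposed(mixed))
print(composed_characters(clean), is_fully_decomposed(clean))
```

To check a file, read it as UTF-8 and pass its contents to these functions. Note that `is_fully_decomposed` is the stricter test, since it also catches combining marks stored in a non-canonical order.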

It seems that many of our settings, users, user-roles, style sheets, spell checking data, etc. have not been damaged in this conversion.

Funny enough, all our French entries where vowels carry accents (for example in the consultant notes XML file) now showed up “green” in my self-made checker as also decomposed characters. That was not the case before.

So our initial choice of Normalization in the migration to PT8 for “None”, i.e. do not change any data was the right choice, except there is this bug… I guess, having slightly messed up French notes, is the lesser weevil compared to messed up translation data.

I did register online (see my previous post) but now my freshly converted projects do not show up in send/receive via Internet server. We have three more options for send/receive, but send/receive via Internet also feels like a nice way of doing backup out-of-country in case of whatever. So do I need to do anything on this end to activate send/receive?

I guess we can never demand “to have no bugs”. But this 3-projects-conversions hack has probably given us a reasonable temporary fix. I have mentioned the two eyebrow-raisers: French vowels get affected, as the XML files do not have ways to distinguish what I write in my notes, so PT is not to blame; and no send/receive via web yet.

In conclusion, I feel glad that I work a lot every day with PT, and still I was hesitant about what to do in what order. I did the main translation first and registered online. Only then did I convert the back translation and assign it to the new translation project. Then I did the private consultant notes.

I do not know how we can have a final test for data integrity. I do not know if or when I need to switch our old (buggy) project on the online registry from active to something else (we should keep it for a few weeks, just in case).

So, since this conversion is helpful, maybe somebody who knows much more, could write a step-by-step instruction for other supporters (not click by click, just step by step).

Thanks again.

The send/receive has also been fixed, it is working now. That was an “Africa problem”: We got powercuts for days, and our internet was badly erratic. Just found out that a cable on our router was badly connected, sorry.
