+3 votes

I have a number of foreign words in my Paratext Wordlist that come from the front page (e.g. ‘Illustrations’, ‘permission’, or the names of artists) or from between \tl…\tl*.
Is there a way to exclude such words so that they don’t appear in the Wordlist, or to mark them somehow as not belonging to the language?
Thanks!

Paratext by (234 points)

8 Answers

0 votes
Best answer

This issue has been on our minds for a long time (I think I say that a lot on this forum) and it relates to approval of not just words, but also characters and punctuation.

As you all have noted, the Wordlist and Inventories currently only let teams approve words “everywhere” or “nowhere”, but things are rarely that simple. The solution we have planned shows the status of words (or characters) in scripture and non-scripture locations in separate columns in the Wordlist and Inventories.

Specifying approval according to location will be available to those who want it. It will not be required for those who do not need it.

The early signs of these changes can be seen in Find in PT9.2. This allows users to search in “Verse text” or “All text” (‘non-verse text’ coming "soon"™️). Also, the Wordlist no longer has three buttons in every row for setting approval of words. Instead, status is set in the column header according to the active row. This is because when we add a column for non-verse text (and in Study Bible projects, another column shows for study bible material), the “wall” of 6-9 buttons per row becomes a bit overwhelming. A mockup of the Characters Inventory below gives a good idea of how this would look in Wordlist. By default, approving a word in verse text will also approve it in non-verse text; approving a word in non-verse text does not approve that word in verse text.
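
To make that default concrete, here is a rough sketch in Python. It is only an illustration of the rule described above, not Paratext's actual code; the names `Status`, `WordStatus` and `approve` are made up for this example.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    UNKNOWN = "unknown"
    CORRECT = "correct"
    INCORRECT = "incorrect"


@dataclass
class WordStatus:
    """Hypothetical per-location status for one Wordlist entry."""
    verse: Status = Status.UNKNOWN
    non_verse: Status = Status.UNKNOWN


def approve(entry: WordStatus, location: str) -> None:
    """Approve a word in 'verse' or 'non_verse' text.

    Approving in verse text also approves the word in non-verse text.
    Approving in non-verse text leaves the verse-text status untouched.
    """
    if location == "verse":
        entry.verse = Status.CORRECT
        entry.non_verse = Status.CORRECT
    elif location == "non_verse":
        entry.non_verse = Status.CORRECT
    else:
        raise ValueError(f"unknown location: {location!r}")


# A front-matter word approved only for non-verse text.
illustrations = WordStatus()
approve(illustrations, "non_verse")

# A vernacular word approved in verse text is approved in both places.
vernacular_word = WordStatus()
approve(vernacular_word, "verse")
```

After this runs, `illustrations` is still unknown in verse text, so a stray foreign word in a verse would continue to show up for review.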

We hadn’t thought of there being a need for a separate hyphenation list. That’s something for us to bear in mind.

You’re probably all wondering when this will be implemented. It may be in 9.4 (no guarantees).

by [Moderator]
(1.1k points)
0 votes

There is currently no way to exclude these words from the Wordlist. The Wordlist is not simply a list of words in the language, but a list of words that occur in the text of the project. It is important to verify all of those words since it is possible to misspell foreign words.

by (8.4k points)
+1 vote

Paratext actually does exclude nonpublishable words from the Wordlist. For example, words in remarks do not appear in the Wordlist. To keep the reference text in map tables out of the Wordlist, I created a custom character style called \zrm …\zrm* with \TextProperties nonpublishable and added it to my project’s frtbak.sty file:

# Map Table Reference text
# Reference text in Map tables marked with \zrm ...\zrm* does not appear in the Wordlist or other checks
\Marker zrm
\Endmarker zrm*
\Name zrm...zrm* - Remark span
\Description For Comments and remarks
\OccursUnder ip im ipi imi ipq imq ipr iq iq1 iq2 iq3 io io1 io2 io3 io4 ms ms1 ms2 s s1 s2 s3 s4 cd sp d lh li li1 li2 li3 li4 lf lim lim1 lim2 lim3 lim4 ili ili1 ili2 ili3 ili4 m mi nb p pc ph phi pi pi1 pi2 pi3 pr pmo pm pmc pmr po q q1 q2 q3 q4 qc qr qd qm qm1 qm2 qm3 cls tr th1 th2 th3 th4 thr1 thr2 thr3 thr4 tc1 tc2 tc3 tc4 tcr1 tcr2 tcr3 tcr4 f fe ef rem NEST
\TextType Other
\TextProperties nonpublishable 
\StyleType Character
\FontSize 12
\Color 16711680 # Same color as \rem

This is an example of how it is used in a project’s BAK book:

\periph Map 02_03 The World of the Patriarchs

\tr \thc1 \zrm The world of the Patriarchs \zrm* \thc2 পিতৃপুরুষদের জগৎ
\tr \th1 \zrm Legend \zrm* \th2 ~
\tr \tc1 \zrm km\zrm* \tc2 কিমি
\tr \tc1 \zrm Journeys of Abraham\zrm* \tc2 অব্রাহামের যাত্রা

The words in the first column are excluded from the Wordlist, but the words in the second column are spell-checked.
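
For anyone who wants to see the effect outside Paratext, here is a small Python sketch that mimics it on the table rows above. It is an assumption-laden illustration, not how Paratext builds the Wordlist: the function name `wordlist_candidates` and both regular expressions are invented for this example, and it only handles simple, non-nested \zrm ...\zrm* spans.

```python
import re

# Matches a whole \zrm ... \zrm* span, markers included.
ZRM_SPAN = re.compile(r"\\zrm\b.*?\\zrm\*", re.DOTALL)
# Matches any remaining USFM marker such as \tr, \tc1 or \th2.
USFM_MARKER = re.compile(r"\\[a-z]+\d*\*?")


def wordlist_candidates(usfm: str) -> list[str]:
    """Return the words left after dropping \\zrm spans and markers."""
    text = ZRM_SPAN.sub(" ", usfm)
    text = USFM_MARKER.sub(" ", text)
    # "~" is a no-break space placeholder in USFM, not a word.
    return [word for word in text.split() if word != "~"]


sample = r"""\tr \th1 \zrm Legend \zrm* \th2 ~
\tr \tc1 \zrm km\zrm* \tc2 কিমি
\tr \tc1 \zrm Journeys of Abraham\zrm* \tc2 অব্রাহামের যাত্রা"""

remaining = wordlist_candidates(sample)
```

Here `remaining` contains only the Bengali words from the second column; "Legend", "km" and "Journeys of Abraham" are gone.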

Let me add that I do not recommend excluding from the Wordlist any words that will be published. Names and words in the copyright statement should also be checked for spelling.

by (1.8k points)

Keeping in mind the caveats that @CrazyRocky mentions, I also use the nonpublishable property quite a bit for the exact reasons listed here.

One downside is that PT also ignores nonpublishable characters when doing certain searches and giving results. For example, when doing parentheses searches and some other checks, I often get strange results such as something which looks like a set of parentheses marks with nothing inside, but which actually has only nonpublishable characters inside. As a result I actually edit my custom.sty sheet every once in a while to make PT deal with foreign words as I want it to at the moment.

I add “nonpublishable” in the custom.sty stylesheet, and then change it back to “publishable” in PrintDraft-mods.sty and/or any relevant ptxprint.sty sheets.
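
Since I flip this setting back and forth, a small script can do the edit. The Python below is only a sketch under assumptions: `set_publishable` is a made-up helper, it works on stylesheet text you pass in, and it assumes each marker definition starts with a \Marker line and has its own \TextProperties line, as in the \zrm example above. It does not know anything about how Paratext or PTXprint load their stylesheets.

```python
import re

# Matches "publishable" or "nonpublishable" as a whole word.
PUBLISHABLE_WORD = re.compile(r"\b(?:non)?publishable\b")


def set_publishable(sty_text: str, marker: str, publishable: bool) -> str:
    """Rewrite the \\TextProperties line of one marker definition."""
    new_value = "publishable" if publishable else "nonpublishable"
    out = []
    in_target = False
    for line in sty_text.splitlines(keepends=True):
        stripped = line.strip()
        if stripped.startswith("\\Marker "):
            # A new definition begins; check whether it is the one we want.
            in_target = stripped.split()[1] == marker
        elif in_target and stripped.startswith("\\TextProperties"):
            line = PUBLISHABLE_WORD.sub(new_value, line)
        out.append(line)
    return "".join(out)


custom_sty = (
    "\\Marker zrm\n"
    "\\TextProperties nonpublishable\n"
    "\\Marker rem\n"
    "\\TextProperties nonpublishable\n"
)

for_printing = set_publishable(custom_sty, "zrm", True)
```

Only the \zrm definition changes; the \rem definition keeps its nonpublishable setting.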

+1 vote

I have come across this many times. I support translation teams that are getting ready to publish by working through their final checks, including the Wordlist. Our goal for the Wordlist is always “0 unknown or incorrect words in the NT”, but our count always includes those 100-200 French words (because these teams are in CAR, a Francophone country) that they have decided to include in their text. It’s unreasonable to expect that they would ONLY use their MT, because many of their community members speak and read at least some French. It’s mostly one word here or there, usually in italics: “This is how you write ‘X French word’ in our language.”

So I want to make a suggestion for the developers: Could we make a way to exclude foreign words from the wordlist, or provide a way to remove them from the “unknown words” count? I have two ideas for how that could be done.

  1. Add an optional fourth checkbox in the Wordlist, i.e. blue question mark, green checkmark, red X, and, say, an orange star (or some fourth icon in a different color that means “foreign word”). The trouble is that the global validation that happens in the Wordlist is not appropriate for foreign words, since each occurrence of a foreign word needs to be reviewed and validated separately. So I guess this fourth box would mean “this word is not MT orthography but is not ‘unknown’; it can be ignored for Wordlist purposes, but each occurrence will need to be reviewed before publishing to make sure it is still valid.”

  2. Another option is to change \tl or some other marker to be a “foreign word” marker. It would not change any formatting, just how PTX treats it: “this word can be ignored in the wordlist.” Unlike the “nonpublishable” ideas explained above, it would still have to be publishable. This would solve the problem of “reviewing each case” that I mentioned above, but would require special markers for each and every case of every word.

Either option will probably need some sort of inventory + basic check pair to allow for validation, but I’m not sure how this would be done on a case-by-case basis.

So what do you all think? Would this be a useful thing to ask of the developers?

Thanks,
~anon550568

by (114 points)

There would be an additional advantage to option 2. In a few cases, a certain string of characters can both be a (correctly spelled) foreign word and a vernacular word (either correctly or incorrectly spelled). One may want to judge these cases case by case.

In such cases, the ideal solution would be to be able to tell PT to ignore individual occurrences of words – just like we can deny individual errors in the basic checks (and PT will remember this choice). If that is too complex to implement, ignoring a certain marker for spell checking may be a good alternative.

By the way, when it comes to the final spell check, I usually define a filter including all Biblical books (and possibly the glossary and introduction) and excluding the front matter. But this does not solve the issue if the INT book contains both a vernacular and a LWC introduction.

Paulus+Kieviet

PS Another example where I’d like to exclude individual occurrences is this: in a few books, the Outline of Contents contains 2nd level items, which are labelled with letters alphabetically. The vernacular alphabet is “a e f” etc. Now “a” and “e” are also misspellings of vernacular words (should be ꞌ****a and ꞌ****e, respectively), so it would be nice to be able to ignore those occurrences. I know, it’s a tiny detail, but it just doesn’t allow us to get to that very satisfactory stage of zero incorrect or undecided words.

Paulus+Kieviet

The English image descriptions are often left in the figure captions alongside the vernacular captions (though not shown in the output). This would be another section to selectively ignore.

0 votes

As I’ve been working with another team on their Wordlist today, I thought of another benefit that would come from this change, particularly from the “ignore this word” marker idea: the French words would be removed from the Wordlist, so the user wouldn’t have to keep skipping over them. As it stands right now, anyone running the orthography checks, or even doing a manual check, has to mentally skip over the foreign words because there is currently no way to exclude them. That often happens many times in one session, because they are everywhere! Removing them would save us all a lot of mental rework.

by (114 points)

\tl … \tl* would be a good candidate for words to be ignored. By default it puts the words in italics, but one could override that.

Paulus+Kieviet

0 votes

I don’t know. The \tl marker already has a particular use actually marking foreign words in the text, so they may have to create a new one for this specifically that doesn’t do the format change, just the “ignoring in the wordlist” feature…though it wouldn’t be a bad idea to ALSO take \tl words out of the wordlist, too.

by (114 points)
0 votes

Just in case anyone is planning to use nonpublishable … Be warned that the second part of mnjames’ comment above is really important (put them to publishable in a local style sheet): nonpublishable is used by PTXprint to say ‘this is a comment, don’t put anything on the page here’. I.e. it will be silently skipped.

Paratext actually already has \tl with a \TextProperties of nonvernacular, i.e. it should be treated differently from other words.

With the translation we have been working on, we have quite a lot of national-language words as synonyms / translations in our glossary and footnotes. Being in the national language they break all sorts of rules for the vernacular, but are common misspellings/typos when people are tired. It would be very good to be able to exclude these from the character inventories, etc.

I would suggest that text in any ‘nonvernacular’ style be:

  1. Searchable
  2. Entirely ignored for character and punctuation inventories (unless someone wants to implement a second non-vernacular set of inventories).
  3. Given a distinct and separate entry in the hyphenation list, so that it can be hyphenated and spell-checked differently (i.e. an OK spelling in nonvernacular contexts should not affect vernacular contexts, and vice versa).
by (294 points)

For the projects I work with, I definitely want the non-vernacular words to be published, so I would not go the route of using any non-publishable markers. I just want a way to exclude these words from the wordlist…or like you suggested, anon542642, a way to move them to a separate interface.

As I’ve been discussing with Matthew_Lee, we are wondering if it could be a right-click or other optional action, like when we refuse errors in the basic checks. We would thus “refuse” a word, and then could choose a filter for and review “all refused words”.

0 votes

Instead of “ignoring” or “refusing” a foreign word, it would make more sense to me to mark a correctly spelled foreign word as such: Right-click > Mark as correctly spelled foreign word (or some other option). Then Paratext would be able to recognize, in other contexts, that this word IS correct but doesn’t belong in the vernacular Wordlist. It could be displayed in a different list or a sub-list, or there could be a filter to show ALL words, just vernacular words, or just foreign words.
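
To illustrate the kind of filter I have in mind, here is a small Python sketch. It is purely hypothetical and does not reflect any existing Paratext feature; `Kind` and `filter_words` are names invented for this example.

```python
from enum import Enum


class Kind(Enum):
    VERNACULAR = "vernacular"
    FOREIGN = "foreign"


def filter_words(words: dict[str, Kind], show: str = "all") -> list[str]:
    """Return the sorted words for one view: 'all', 'vernacular' or 'foreign'."""
    if show == "all":
        return sorted(words)
    wanted = Kind(show)  # raises ValueError for an unknown view name
    return sorted(word for word, kind in words.items() if kind is wanted)


# Each word carries a mark saying which list it belongs to.
marked_words = {
    "permission": Kind.FOREIGN,
    "Illustrations": Kind.FOREIGN,
    "কিমি": Kind.VERNACULAR,
}

foreign_view = filter_words(marked_words, "foreign")
vernacular_view = filter_words(marked_words, "vernacular")
```

The foreign words stay known to the project as correctly spelled, but the vernacular view (and any count of unknown vernacular words) no longer includes them.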

by [Moderator]
(2.0k points)
