Renderings for key terms - could be defined by regexes in the future

Question

Hi forum,

the Biblical Terms tool is very very helpful - it does improve the quality of translations.

I am dreaming of having a PT version one day, where the users can use regexes to better define the renderings for the key terms. I am working with a language, where the PT-wildcards are certainly better than “no wildcards” but often I find myself thinking “this could be expressed so briefly and elegantly with a regex”…

So this is a feature request (could not find that category to tick, sorry) and a call to other supporters to give their feedback and input. Thank you for your consideration.

====
For those who want “evidence” here is what triggered my proposal:

For a certain isolated language, the renderings-wildcards are rather too limited:

Just a few examples:

Many words as they appear in texts take a prefix, so we use *stem. But sometimes the prefix is [blank], which is a valid part of the language, and it seems that the * is not-optional.
Punctuation throws the wildcards off, but sub-clauses are legitimate and need to be allowed. The language expresses many key terms by phrases, and sub-clauses occur inside those for adjectives or other reasons. So there would be very long lists of “text-examples” in the renderings-window if every example involving punctuation had to be listed; at the moment key terms involving punctuation are “broken”.
Many words in the language stack several prefixes, and that seems not to work with the existing tools. At the moment the team is using a loooong list of word-prefixes (namely all existing combinations of several prefixes).
The language has many short one-syllable stems (which also exist as homograph particles), so just opting for “match based on stems” would give many false positives.
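
To make the first and third points concrete, here is a minimal sketch in Python of what a regex-based rendering could look like. It is not Paratext syntax, and the stem `toko` and the prefixes `ma`, `bu`, `ni` are invented placeholders, not data from the language in question.

```python
import re

# Hypothetical prefixes and stem, invented for illustration only.
PREFIXES = ["ma", "bu", "ni"]
STEM = "toko"

# (?:...)* allows zero, one, or several stacked prefixes before the stem,
# so the bare stem and every prefix combination share one pattern.
prefix_part = "(?:" + "|".join(PREFIXES) + ")*"

# \b on both sides keeps the match to whole words only.
word_pattern = re.compile(r"\b" + prefix_part + STEM + r"\b")

def find_forms(text):
    """Return every whole word made of the stem plus any stack of known prefixes."""
    return word_pattern.findall(text)

sample = "toko, matoko mabutoko nimabutoko xtoko tokoma"
print(find_forms(sample))
```

In this sketch the one pattern replaces the long list of prefix combinations, and the word boundaries keep it from firing on unrelated words that merely contain the stem.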

Paratext Sep 21, 2018 asked by Tim (855 points)

5 Answers

Dear @anon044949, thank you for your input and interest. I watched your video and liked it, learnt some useful little tricks like right-click and get a shortcut to the relevant entry in the wordlist.

I am well aware of the prefix and suffix tools and wildcards, as I have mentioned in my initial post. I do not declare them bad, just too limited for certain languages.

Now you got me interested in the morphology option again. The language that gives me trouble has super-short stems (often just CV syllables) and is happily stacking multiple affixes onto those. Guess a root has to be short in a happy-stacking environment…

Now I found this in the documentation:

“If other words contain the stem you just approved and those words have their morphology specified, Paratext also approves them as renderings for the Biblical Term.”

I am mainly afraid of false positives on all those short stems, as we have also many short TAMs and other bits and pieces floating around like CV pronouns, negation markers etc. People never get confused because of precise word-order. They would never take an object noun-root for a verb-modifier particle (not attached to the verb stem).

So if I wanted to try this out, and convert a project to start using the “Match based on stem” option, what would happen to all existing defined key terms renderings? And what about short words?

Example: Noun stem is bo (means hat). And it will typically show up in a sentence like mabubo (means POSS class-marker-class-“bu”-singular; roughly “my hat”).

There is another word in the word list also spelled bo and there is nothing specific marked about morphemes. I do not know a PT syntax for informing the system “this word will never take an affix”. This version of bo shows up as a single (not-attached) TAM in front of certain verbs in many many sentences and means “irrealis” (future or unsure or imperative etc.)

Now, according to the PT8 rules, is the second entry of bo in the wordlist considered as “having its morphology specified”? And will it produce false hits on the noun-stem entry of bo?

If I wanted to test this on a real project, I would be careful about it: marking a Point in the Project History, doing an extra backup and a send-receive beforehand. Are there other things to do or consider before ticking that harmless-looking box “Match based on stem”? Is this a once-for-all change? Can it be undone if later found not helpful? Are there side effects?

I know the morphology-feature in the Wordlist rather well. We are using that a lot for purposes of back-translation, where it helps the interlinearizer to keep all possessives apart etc. I have just never considered clicking through thousands of entries, where we mainly have class-marker-prefixes and other “harmless stuff” which has never been an obstacle for the Biblical key-terms tool. For the interlinearizer it is a benefit to keep certain affixes attached, because it is less work and it helps to back-translate “shoe” versus “shoes” precisely and semi-automatically. We have many homographs as prefixes, serving as POSS, class markers, adjective-accord-markers - and if we gave them all full treatment, it might be great for the Biblical key-terms but it would be a nightmare for the interlinear work, because the user would constantly need to make useless choices: there are four different prefixes “a-” but only one can show up on a noun-stem and only one can show up on an adjective stem. This will turn into a “can’t have them all” puzzle of benefits versus drawbacks.

Maybe I could only mark the “complex” entries in the wordlist which have extra affixes; that might concern only a few hundred entries…

Your video reminds me that the only two places to mark affixes are the wordlist and the interlinear window. Are there users or consultants who have experience with the actual workflow? How bad is it to jump between the key-terms window and the word-list window? Or do you have dedicated sessions where you first prep the word-list and then later treat the key-terms? Please share. (Yep, I just checked, the wordlist can be filtered by book, chapter, verse just like the key-terms-tool; never needed so far, but could be part of a possible solution.)

I still dream about an optional Regex feature, but I respect your challenge to consider all existing tools first.

And I need to say it after such a post: I personally like PT a lot. I have to spend a lot of time working with other tools which have a much worse GUI and less consideration for the users. So if I am promoting a new feature here, or I share what is not working so far, please do not take it as being too negative. Just trying to find a working solution for a specific language with very short stems, many short words and a very keen tendency to stack affixes and make compound words.

Nov 23, 2018 commented by Tim (855 points)

Quite familiar, spent many days on it. Early in our PT work, we had the proposed link with FW set up. But we have moved away for several reasons. Firstly, the products we treat in PT are different from the products we treat in FW (although we also copy PT chapters into our text corpus, but with our own scripts). Our FW is a more dynamic tool and has been moved to a new name and status whenever there was major technical trouble or other major changes inside FW or in our context. So the automatic link between the two has created more work (for us) than benefits.

Still good that you mention your considerations. We want to keep our PT “nice and clean and simple and self-contained, and even discreet…”.

Nov 24, 2018 commented by Tim (855 points)

Tim,

I don’t think I can answer all of your questions because they are specific to the language. I worked on a Bantu language in Mozambique so I am confident that using morphology in Paratext will help you. I also know Fieldworks fairly well and parsed a large part of the New Testament with Fieldworks/Flex. The difference between the two is that Paratext uses a statistically based Artificial Intelligence approach while Flex uses a rules-based approach. In Flex, one has to be able to define all the morphemes and their allomorphs as well as the grammar rules for every morpheme in every word class. Paratext gives usable results much faster than Flex, and is a better approach (in my opinion) if one is not planning to write a grammar of the language or does not expect great linguistic precision.

I am very familiar with lots of homophonous morphemes. It is a lot of work to “untangle” these in Flex but it has the tools to do so. Paratext does not have the same linguistic power, but that is a strength if you just want the Biblical terms tool to correctly identify all the forms of an approved rendering. When you parse words in Paratext you mark the affixes and roots so it will not apply an affix where a root is required. If you were to try it, I would be willing to gamble that there will not be very many false positives. I say this because Paratext does what a native speaker does, namely only works with patterns that actually are used in the data and not with patterns that are theoretically possible.

Regarding workflow, what I imagine teams doing is using the interlinearizer to gloss the words and parse them as they go along. If the person doing the glossing is not able to do the parsing then another method would have to be found. There is actually great value in parsing in the wordlist tool because you see all the related words together and this helps you make better decisions and be more consistent.

I hope this short email helps you decide what direction you should take.

jeffh

SIL International

Language Technology Consultant

Dallas, TX USA

Cell: [Phone Removed]

Dec 1, 2018 commented by [Expert]

Jeff_Shrum (2.9k points)
Dec 1, 2018 reshown

Alex W. · Answer 1 · 2018-09-25T04:02:35+0000

I think this is a great idea, and the reasons listed by @Tim are applicable to the languages I work with as well.

Regex-enabled renderings would also provide the function of case sensitivity. Case-sensitive renderings would helpfully identify capitalization mistakes for proper names. Currently, the Biblical Terms tool does not recognize these capitalization discrepancies. This would be hugely helpful in correctly identifying distinct vocabulary items which are differentiated in the target language text by their capitalization alone (i.e., “lord” vs. “Lord” vs. “LORD”), as discussed in a previous post.
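
As a small illustration of the case-sensitivity point, here is a sketch in Python (not a Paratext feature). Python regexes are case-sensitive by default, so the three spellings can be counted separately. The sample sentence is made up for the example.

```python
import re

# One case-sensitive pattern per spelling; \b keeps matches to whole words.
FORMS = {
    "LORD": re.compile(r"\bLORD\b"),
    "Lord": re.compile(r"\bLord\b"),
    "lord": re.compile(r"\blord\b"),
}

def count_forms(text):
    """Count how often each capitalization variant occurs in the text."""
    return {name: len(pattern.findall(text)) for name, pattern in FORMS.items()}

# Made-up sample sentence, not a quotation from any translation.
sample = "The LORD said to my Lord: no lord is like the LORD."
print(count_forms(sample))
```

A checker built along these lines could flag a verse where the expected variant is missing while another variant is present.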

jeffh · Answer 2 · 2018-09-28T11:44:18+0000

I think this is an interesting idea, but may be a fairly high bar for non-programmers. Certainly you would want to provide help and maybe some simple “templates” that users could use. For example, if you want to allow a variety (but limited set) of prefixes, you could specify an equivalent like:

(ni|ti|yi)hibb

And it seems like you would need to allow classes in some way. Otherwise you would have to specify a (potentially long) list of candidates in each equivalent. So define @PREF = ni|ti|yi, and then use it to define your equivalents: (@PREF)hibb. To make it less geeky, you could define your classes in a dialog user interface where the user just adds all of the allowed elements of the class, and the vertical bars are added automatically. Anyway, just some thoughts…
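
The class idea above could be sketched roughly as follows in Python. The `@NAME` notation comes from the suggestion itself; the `CLASSES` dictionary and the `expand` helper are hypothetical names for this illustration and do not exist in Paratext.

```python
import re

# Hypothetical named classes, e.g. as collected from a dialog where the
# user simply lists the allowed elements.
CLASSES = {"PREF": ["ni", "ti", "yi"]}

def expand(rendering, classes=CLASSES):
    """Replace each @NAME with an alternation of that class's members."""
    def substitute(match):
        members = classes[match.group(1)]
        # The vertical bars are added automatically; members are escaped
        # so that they are always treated as literal text.
        return "(?:" + "|".join(re.escape(m) for m in members) + ")"
    return re.sub(r"@([A-Z]+)", substitute, rendering)

expanded = expand("(@PREF)hibb")
pattern = re.compile(r"\b" + expanded + r"\b")

print(expanded)
print(bool(pattern.search("nihibb")), bool(pattern.search("hibb")))
```

With this kind of expansion the user never types the vertical bars, and a change to the class is picked up by every rendering that refers to it.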

By the way, I don’t believe your first example is accurate - from my experience the text matched by * IS optional, so *hibb* can match hibb and Hibb.

Sep 28, 2018 answered by jeffh (1.4k points)

Thank you @jeffh for your input. It will depend on the PT developers whether they will even consider yet-another-tool. I had in mind a gentle approach, where users could opt for classic-limited-wildcards or for full-regex-power.

I can envisage teams who would gladly have a consultant or supporter write them some templates and then use those, rather than typing/clicking endless lists of all possible permutations of certain terms. Some users will even see the benefits of regexes in this practical context and will want to grab some book and learn more.

And just to keep this interesting, I would humbly submit my perspective that geeky is good. So advanced tools like the Edit Rendering window should be more geeky, not less. I believe normal users are not even going there. I am hearing about projects in our region which do not even really use the key terms at all…

So people who get the present setup with those “extended wildcards” will probably also get regexes - and will love the possibilities. Also learning regexes will give users more, because they work in many places.

And a personal confession with that: I am using RegexBuddy all the time (am not affiliated), to check my stuff and to benefit from a library where my heroes have prepared templates and where I can store useful bits which have worked in the past. And to make it explicit: yes, your idea to help the users with a not-frightening GUI is great. Maybe a link-over to the inbuilt Regex Pal would help; this could maybe have some templates for the keyterms renderings.

I will investigate my problem with the not-optional prefix and let you know where it had hit me. Thank you.

Oct 5, 2018 commented by Tim (855 points)

Hi jeffh, I just took some time to find the problem with the “obligatory” * (I believe I had called them not-optional in my first post.)

Consider that I have a list of allowed prefixes: un- re- retro-

And allowed suffixes: -box -boss -team

Now when I put “match” in my renderings, it correctly finds “matchbox” as expected.

But in one language we have occasional “retromatchbox”, so I put “*match” in my renderings. And that one no longer finds the simple “matchbox” it only finds “retromatchbox”. This is why I get frustrated and have discovered PT8 wildcards to be obligatory (in certain contexts) and somewhat under-documented.

If there is a rule like “thou shalt either use * or depend on prefixes or suffixes but never mix” then that would explain it. But it would be very sad to “lose all affixes” as a penalty whenever you use a wildcard.

I have never seen such a rule documented but that could well be my mistake. I do not find it in the “Guide” right next to the “Edit Renderings window”, where I often have to look to remind myself. Regexes are just very well documented, there are expensive books and tools about them. And I find that money well spent.
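
For comparison, the behaviour hoped for in the matchbox example can be written as a single regex. This is a minimal Python sketch of the general technique, using the affix lists given above; it says nothing about how Paratext matches internally.

```python
import re

PREFIXES = ["un", "re", "retro"]
SUFFIXES = ["box", "boss", "team"]

# The ? after each group makes that affix optional, so one pattern
# covers the bare stem and every allowed affix combination.
pattern = re.compile(
    r"\b(?:" + "|".join(PREFIXES) + r")?match(?:" + "|".join(SUFFIXES) + r")?\b"
)

words = ["match", "matchbox", "retromatchbox", "rematch", "matchless"]
matched = [w for w in words if pattern.fullmatch(w)]
print(matched)
```

Here both “matchbox” and “retromatchbox” are found by the same rendering, while “matchless” is rejected because “less” is not in the suffix list.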

Another major frustration is the fact that the PT8 ** is meant to match “any number of words” but gets thrown off by a simple comma. So if a language regularly uses “hungry puppy” but sometimes has constructs like “hungry very, very cute puppy” then the present tool does not work. And it is such “phrases” which are the worst for cluttering up the renderings-window. Prime customers for wildcards.

There will also be “hungry muchly, muchly cute puppy”. I am basically saying that PT should not assume “easy” or “clean” language. Minorities have the right to create funny orthographies and use them. And regexes could easily accommodate such real-language “features”. This is why I made my proposal: from hitting the limits of the present system in real texts.
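
A regex for the puppy phrases could look roughly like the Python sketch below. The limit of four intervening words (`MAX_GAP`) and the choice of comma and semicolon as allowed punctuation are assumptions made for the example; a real project would tune both.

```python
import re

# Assumption for this sketch: at most four words may stand between the
# two fixed words of the phrase.
MAX_GAP = 4

# A separator is whitespace, optionally preceded by a comma or semicolon.
# A full stop is deliberately not allowed, so sentence ends break the match.
SEP = r"[,;]?\s+"

phrase = re.compile(
    r"\bhungry" + SEP + r"(?:\w+" + SEP + r"){0," + str(MAX_GAP) + r"}puppy\b"
)

examples = [
    "hungry puppy",
    "hungry very, very cute puppy",
    "hungry muchly, muchly cute puppy",
    "I am hungry. The puppy sleeps.",
]
for text in examples:
    print(text, "->", bool(phrase.search(text)))
```

The comma is simply part of the allowed separator, so the sub-clause no longer breaks the rendering, while the gap limit keeps unrelated words far apart from pairing up.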

Nov 23, 2018 commented by Tim (855 points)

Tim · Answer 3 · 2019-02-01T16:58:57+0000

Testimony: Thank you all who have given input into my idea about regexes for finding key-term-renderings.

Based on your collective input, I have taken a deep breath and have moved that project to the system of stem-based matches a few days ago.

I can report that the change-over was painless and I did not notice any loss of data so far. Interestingly it seems that most glosses in the interlinear back-translation tool have survived that change - and are surviving the adaptations I am now doing to the morphology. This is giving me real joy, I had expected to lose many connections.

I have deleted all prefixes and suffixes from the key-terms-tool and it will take time until all the morphology is updated in the wordlist. But the few chapters I have treated so far are looking good. I have not had any more false-positives, which is muchly helpful.

For those who want more detail: For this specific language I have invented the strategy of “longer is better”. So I do not declare the full morphology but rather keep all singular and plural nouns as “complete words” in the wordlist like a-soro and ba-soro.

And likewise all three possible “conjugations” of each verb as in base-form, irrealis-form and infinitive-form. Like toŋo, a-toŋo and tôŋo. This way, I have significantly less work in the interlinearizer, because many prefixes have several jobs (see the a- examples above on nouns and verbs) and treating entire words needs less manual assigning.

The only “disadvantage” is that we are listing a few more renderings in the key-terms-window. But the number of renderings is not a problem normally. And showing full words rather than morphological “crumbs” will probably be helpful to visiting consultants who do not speak that language and who have not memorized all the affixes. Even for the local team, it is easier seeing real words rather than crumbs of words.

So I use the morphology as such mainly for the possessive-prefixes and some suffixes, as there are some ten+ noun classes and keeping each class-marker with each adjective-stem feels like cluttering the system.

In summary, for this type of language, the stem-based approach is more suited and gives better results with not-more-work. Pity that this option was less obvious when that project was started a few years ago. This online-group did not even exist back then, I believe.

The documentation is helpful, I am thinking of the help article “How do I make Paratext recognise all renderings of a Biblical Term?” and the paragraph “Comparison of the three ways to recognise a word stem and the possible affixes”.

So I will do my part and start promoting the existence of the stem-based option in coffee breaks and when meeting other PT users and supporters in our region. Thank you all.
