0 votes

I am looking for morphological segmentation data for translation languages, both languages of wider communication and translation languages.

For instance, in English, I would like to break a word like “unthinkable” down into [un [[think] able]] or at least un-think-able.

I know the kinds of information found in FLEx, but I’m wondering if any of the checking tools in Paratext contain this information in a useful way for languages that do not have FLEx models.

Is this information there?

Thanks!

anon892024

Paratext by (448 points)

5 Answers

0 votes
Best answer

Here’s how the XML looks in the file (might be a little bit easier to read):

  <Entry Word="heals">
    <Analysis>
      <Lexeme>Stem:heal</Lexeme>
      <Lexeme>Suffix:s</Lexeme>
    </Analysis>
  </Entry>
  <Entry Word="healing">
    <Analysis>
      <Lexeme>Stem:heal</Lexeme>
      <Lexeme>Suffix:ing</Lexeme>
    </Analysis>
  </Entry>
  <Entry Word="heal">
    <Analysis>
      <Lexeme>Stem:heal</Lexeme>
    </Analysis>
  </Entry>
  <Entry Word="healed">
    <Analysis>
      <Lexeme>Stem:heal</Lexeme>
      <Lexeme>Suffix:ed</Lexeme>
    </Analysis>
  </Entry>
  <Entry Word="headed">
    <Analysis>
      <Lexeme>Stem:head</Lexeme>
      <Lexeme>Suffix:ed</Lexeme>
    </Analysis>
  </Entry>
  <Entry Word="heading">
    <Analysis>
      <Lexeme>Stem:head</Lexeme>
      <Lexeme>Suffix:ing</Lexeme>
    </Analysis>
  </Entry>

Note that having this information in a project depends on someone having done the morphology breakup in the Wordlist or in the Interlinear.
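If you want the breakdown in the un-think-able style from the question, a short script can pull it out of the XML. This is only a sketch based on the excerpt above: the wrapping root element (`WordAnalyses` here) is my assumption, since the excerpt does not show it, and I am assuming the first `Analysis` of each `Entry` is the one you want. The function name `read_segmentations` is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Sample data copied from the excerpt above. The <WordAnalyses> root
# element is an assumption; the excerpt does not show the real root.
SAMPLE = """<WordAnalyses>
  <Entry Word="heals">
    <Analysis>
      <Lexeme>Stem:heal</Lexeme>
      <Lexeme>Suffix:s</Lexeme>
    </Analysis>
  </Entry>
  <Entry Word="heal">
    <Analysis>
      <Lexeme>Stem:heal</Lexeme>
    </Analysis>
  </Entry>
</WordAnalyses>"""


def read_segmentations(xml_text):
    """Map each word to a list of (kind, form) pairs, e.g. ("Stem", "heal")."""
    root = ET.fromstring(xml_text)
    result = {}
    # iter() finds Entry elements at any depth, so the root name does not matter.
    for entry in root.iter("Entry"):
        analysis = entry.find("Analysis")  # first analysis only (assumption)
        if analysis is None:
            continue
        morphemes = []
        for lexeme in analysis.findall("Lexeme"):
            # "Suffix:ing" -> kind "Suffix", form "ing"
            kind, _, form = (lexeme.text or "").partition(":")
            morphemes.append((kind, form))
        result[entry.get("Word")] = morphemes
    return result


segs = read_segmentations(SAMPLE)
for word, morphemes in segs.items():
    print(word, "->", "-".join(form for _, form in morphemes))
```

To run it on a real project you would read the file's text and pass it in place of `SAMPLE`.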

by [Expert]
(16.2k points)


Is this all done by hand, or is there a way to create a “first guess” using NLP?

0 votes

If the words in a Paratext project have been broken down into morphemes in the Wordlist, the breakdown is kept in the file WordAnalyses.xml. Is this something you can use? I have tried to attach a screen shot, but am not sure if it comes through. For instance, the word ācāmciintōōsii has one prefix and 3 suffixes ā-cām-ciin-tōōs-ii.

Iver+Larsen

by (869 points)
0 votes

Perfect. This is exactly what I need.

Thanks!

anon892024

by (448 points)
0 votes

It’s done by hand, but Paratext does attempt to learn how the morphology is done, so after a while the “by hand” work might just be accepting what Paratext has guessed. You can see this behavior in the Wordlist or in the Interlinear.

EDIT: So in my example XML above, I might enter “healed” as “heal +ed” and then “healing” as “heal +ing”, which might make Paratext guess that “heals” is “heal +s” and “headed” is “head +ed”, etc.
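To make the idea concrete, here is a toy guesser. It is NOT Paratext's actual algorithm (I don't know how that works internally); it is just a minimal sketch of guessing by analogy from approved parses, and the name `guess_parse` is made up for illustration.

```python
def guess_parse(word, approved):
    """Guess a (stem, suffix) split for word from already approved parses.

    approved maps a word to its (stem, suffix) pair. Toy illustration only,
    not Paratext's real method.
    """
    stems = {stem for stem, _ in approved.values()}
    suffixes = {suffix for _, suffix in approved.values() if suffix}

    # 1. A known stem at the start of the word: the rest is a new suffix.
    for stem in sorted(stems, key=len, reverse=True):
        if word.startswith(stem) and len(word) > len(stem):
            return (stem, word[len(stem):])

    # 2. A known suffix at the end of the word: the rest is a new stem.
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) > len(suffix):
            return (word[:-len(suffix)], suffix)

    # 3. Nothing matches: treat the whole word as a stem.
    return (word, "")


approved = {"healed": ("heal", "ed"), "healing": ("heal", "ing")}
for w in ["heals", "headed", "heading", "health"]:
    print(w, guess_parse(w, approved))
```

Note that this toy splits “health” as “heal +th”, which is the kind of wrong guess a person still has to reject by hand.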

by [Expert]
(16.2k points)


In the Hyphenation Tool, by-hand work starts with accepting the correct guesses and correcting the wrong ones. However, once you see that Paratext is guessing correctly, you do NOT have to keep accepting the correct guesses. It is good to go. I expect that the Morphology Tool works the same way: start training it by filtering for words with your most common affixes, and once it guesses them correctly, filter for another one until you’ve dealt with your known affixes and don’t see wrong guesses. It is quite rewarding to watch Paratext’s “aha moment” when it starts getting it right.

There can be some benefit in approving many at a time to affirm your choices, and both tools allow you to Shift-select many instances and then approve them all via the menu. MUCH better than falling asleep click-clicking.

Blessings,

0 votes

Does anyone have a good feeling for how much work it is to create a morphology that covers a translation well? What skill level is required?

by (448 points)

Johnathan,

It takes a fair amount of time to divide all of the words in a project. For those who want to do this, I recommend starting in the Wordlist tool and not the Interlinear. That way, words with the same root or stem can be made to occur together. I would sort the list for the most common verbs and parse all the surface forms for a few very common verb roots/stems. This teaches Paratext’s parser the morphology, and it will begin guessing correctly very quickly for other verbs. Approving a correctly guessed parse is much faster than manually parsing, so teaching Paratext to parse correctly is where the fastest gains can be made. Once it is guessing well, you could switch to using the project interlinearizer and working verse by verse. If you want to do this, I would also recommend that you do not link Paratext to FLEx. Doing so complicates the process unless you are ready to correctly set up the AMPLE parser in FLEx, which takes a lot of time for most languages.

Is “a fair amount of time to divide all the words in a project” a matter of weeks? Months? Longer?

If a project has 20,000 unique words and if someone who knows the language structure well is working on it, they can easily segment 2 words / minute. At the end of the first day they will have segmented almost 1,000 words. At that point Paratext should have enough data to start making intelligent guesses so on the second day they can easily do 4 words / minute, bringing them to 3,000 words. From there on in they should be able to do 8 words / minute for another 4,000 words per day.

Seven working days.
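For anyone who wants to plug in their own numbers, the estimate can be checked with a few lines of Python. This is a rough sketch: the 8-hour working day is my assumption (it matches the “almost 1,000 words” on day one at 2 words / minute), and `days_to_segment` is just a name I made up.

```python
def days_to_segment(total_words, rates_per_minute, hours_per_day=8):
    """Count working days needed to segment total_words.

    rates_per_minute[i] is the speed on day i+1; the last rate is reused
    for every later day. hours_per_day=8 is an assumption.
    """
    minutes_per_day = hours_per_day * 60
    done = 0
    days = 0
    while done < total_words:
        rate = rates_per_minute[min(days, len(rates_per_minute) - 1)]
        done += rate * minutes_per_day
        days += 1
    return days


# 2 words/min on day 1, 4 on day 2, 8 from day 3 onward
print(days_to_segment(20_000, [2, 4, 8]))
```

With these rates the loop reaches 960 words after day one, 2,880 after day two, and adds 3,840 per day after that, so 20,000 words are covered on the seventh day.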

It depends on the morphology and if someone understands what the morphemes are. The good thing (in this situation) is that Paratext does not need to know what a morpheme signifies, or even that it is an allomorph of another morpheme. Paratext just needs to know if a character string is a morpheme or not.
So if you could find someone who knows how to identify the morphemes, I would think they could get Paratext guessing the breaks in a NT correctly most of the time in one week. They would have to use the method I have already described. One could probably go through the entire NT in a month, but since it is challenging to do this kind of work day after day, I would double the time and schedule lots of breaks to reduce mental mistakes. If the person doing it has lots of questions about the morphology as they get into the task, it will also take longer, because they will need time to work out the parts of the morphology that they do not know. Paratext will start positing patterns that may or may not actually be there, and the person will have to be able to tell Paratext “no” sometimes.

I am working in a language with complex morphology, so I have spent days dividing words into a stem and many affixes. We did set up AMPLE in a related language and used CARLA to adapt to another related language, but that was before the advent of FLEx and Paratext. This only has historical interest.

It complicates things that many affixes change the spelling of the stem. I see at least two advantages to doing this morphological parsing. It helps me to spot words that a translator has marked as correctly written when in fact they are not. When I filter for the words with no approved morphology, I can often find misspelled words. Another advantage is that I can use the stem option in the Biblical Terms tool. None of the translators we work with has adequate knowledge of the morphology to do this task.

Iver+Larsen

It depends on the complexity of the morphology. The one I am working on would have about 50,000 unique words in the Bible. I take one book at a time, and even after having done two thirds of the Bible, I still find words that the program cannot parse correctly.

Iver+Larsen
