Need segmentation data for translation languages


Best answer

Here’s how the XML looks in the file (might be a little bit easier to read):

  <Entry Word="heals">
    <Analysis>
      <Lexeme>Stem:heal</Lexeme>
      <Lexeme>Suffix:s</Lexeme>
    </Analysis>
  </Entry>
  <Entry Word="healing">
    <Analysis>
      <Lexeme>Stem:heal</Lexeme>
      <Lexeme>Suffix:ing</Lexeme>
    </Analysis>
  </Entry>
  <Entry Word="heal">
    <Analysis>
      <Lexeme>Stem:heal</Lexeme>
    </Analysis>
  </Entry>
  <Entry Word="healed">
    <Analysis>
      <Lexeme>Stem:heal</Lexeme>
      <Lexeme>Suffix:ed</Lexeme>
    </Analysis>
  </Entry>
  <Entry Word="headed">
    <Analysis>
      <Lexeme>Stem:head</Lexeme>
      <Lexeme>Suffix:ed</Lexeme>
    </Analysis>
  </Entry>
  <Entry Word="heading">
    <Analysis>
      <Lexeme>Stem:head</Lexeme>
      <Lexeme>Suffix:ing</Lexeme>
    </Analysis>
  </Entry>

Note that a project only has this information if someone has already done the morphological breakdown in the Wordlist or in the Interlinear.
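For anyone who wants to pull the segmentation out of such a file, here is a minimal Python sketch. It only assumes the `Entry` / `Analysis` / `Lexeme` shape shown above; the `<WordAnalyses>` root element and the function name `read_analyses` are my own assumptions, so check them against your actual file.

```python
import xml.etree.ElementTree as ET

# Sample in the shape shown above. The <WordAnalyses> root element is an
# assumption; the excerpt in this thread does not show the real root.
SAMPLE = """\
<WordAnalyses>
  <Entry Word="heals">
    <Analysis>
      <Lexeme>Stem:heal</Lexeme>
      <Lexeme>Suffix:s</Lexeme>
    </Analysis>
  </Entry>
  <Entry Word="heal">
    <Analysis>
      <Lexeme>Stem:heal</Lexeme>
    </Analysis>
  </Entry>
</WordAnalyses>
"""

def read_analyses(xml_text):
    """Map each word to the (kind, form) pairs of its first <Analysis>."""
    root = ET.fromstring(xml_text)
    result = {}
    for entry in root.iter("Entry"):
        analysis = entry.find("Analysis")
        if analysis is None:
            continue  # word listed without a breakdown
        lexemes = []
        for lexeme in analysis.findall("Lexeme"):
            # "Stem:heal" -> ("Stem", "heal")
            kind, _, form = (lexeme.text or "").partition(":")
            lexemes.append((kind, form))
        result[entry.get("Word")] = lexemes
    return result

analyses = read_analyses(SAMPLE)
print(analyses["heals"])  # [('Stem', 'heal'), ('Suffix', 's')]
```

For a real project you would read the file from disk with `ET.parse(path).getroot()` instead of parsing a string.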

Jul 7, 2022 answered by [Expert]

Fool Running (16.2k points)


Iver Larsen · Answer 1 · 2022-07-07T05:36:48+0000

If the words in a Paratext project have been broken down into morphemes in the Wordlist, the breakdown is kept in the file WordAnalyses.xml. Is this something you can use? I have tried to attach a screenshot, but am not sure if it comes through. For instance, the word ācāmciintōōsii has one prefix and three suffixes: ā-cām-ciin-tōōs-ii.
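As a rough illustration of what that breakdown amounts to as data, the sketch below represents the example word as (kind, form) pairs and checks that the pieces join back into the word. Two things are assumptions on my part: that the stem is the second piece (cām), and that a prefix would be labelled `Prefix` in the same way the excerpt in this thread labels `Stem` and `Suffix`. The helper names are hypothetical too.

```python
import unicodedata

def segment(lexemes):
    """Join the morpheme forms with hyphens for display."""
    return "-".join(form for _, form in lexemes)

def matches_word(word, lexemes):
    """True if the morphemes concatenate back to the surface word.

    Both sides are normalized to NFC so that precomposed and combining
    macrons compare equal.
    """
    joined = "".join(form for _, form in lexemes)
    return (unicodedata.normalize("NFC", joined)
            == unicodedata.normalize("NFC", word))

# Assumed labelling: one prefix, then the stem, then three suffixes.
lexemes = [
    ("Prefix", "ā"),
    ("Stem", "cām"),
    ("Suffix", "ciin"),
    ("Suffix", "tōōs"),
    ("Suffix", "ii"),
]

print(segment(lexemes))                         # ā-cām-ciin-tōōs-ii
print(matches_word("ācāmciintōōsii", lexemes))  # True
```

The round-trip check only makes sense where the stored forms are the surface spellings; it would not hold if an affix changes the spelling of the stem and the stem is stored in its base form.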

Iver Larsen

Fool Running · Answer 2 · 2022-07-08T12:48:10+0000

In the Hyphenation Tool, manual work starts with approving the correct guesses and fixing the wrong ones. However, once you see that Paratext is guessing correctly, you do NOT have to keep approving the correct guesses; it is good to go. I expect the Morphology Tool works the same way: start training it by filtering for words with your most common affix, and once it guesses those correctly, filter for another affix, until you’ve dealt with your known affixes and no longer see wrong guesses. It is quite rewarding to watch Paratext’s “aha moment” when it starts getting it right.

There can be some benefit in approving many at a time to affirm your choices. Both tools let you Shift-select many instances and then approve them all via the menu. MUCH better than falling asleep click-clicking.

Blessings,

Jul 8, 2022 commented by Shegnada (1.3k points)

anon892024 · Answer 3 · 2022-07-08T14:37:43+0000

Johnathan,

It takes a fair amount of time to divide all of the words in a project. For those who want to do this, I recommend starting in the Wordlist tool and not the Interlinearizer, because there words with the same root or stem can be made to appear together. I would sort the list to find the most common verbs and parse all the surface forms of a few very common verb roots/stems. This teaches Paratext’s parser the morphology, and it will very quickly begin guessing correctly for other verbs. Approving a correctly guessed parse is much faster than parsing manually, so teaching Paratext to parse correctly is where the fastest gains can be made. Once it is guessing well, you could switch to the project interlinearizer and work verse by verse. If you want to do this, I would also recommend that you do not link Paratext to Flex. Doing so complicates the process, unless you are ready to set up the AMPLE parser in Flex correctly, which takes a lot of time for most languages.
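To make the "teach it on a few common stems, then let it guess" idea concrete, here is a toy Python sketch. It is NOT Paratext's parser and says nothing about how Paratext actually guesses; it is a deliberately naive longest-suffix matcher with hypothetical function names, only meant to show why a handful of approved parses goes a long way.

```python
def learn_suffixes(approved):
    """Collect every suffix seen in the approved analyses."""
    suffixes = set()
    for lexemes in approved.values():
        for kind, form in lexemes:
            if kind == "Suffix":
                suffixes.add(form)
    return suffixes

def guess(word, suffixes):
    """Propose stem + suffix using the longest known suffix.

    Returns None when no known suffix fits. Handles a single suffix only.
    """
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) > len(suffix):
            return [("Stem", word[:-len(suffix)]), ("Suffix", suffix)]
    return None

# Parses approved by hand for one common stem.
approved = {
    "heals": [("Stem", "heal"), ("Suffix", "s")],
    "healing": [("Stem", "heal"), ("Suffix", "ing")],
    "healed": [("Stem", "heal"), ("Suffix", "ed")],
}
suffixes = learn_suffixes(approved)

print(guess("heading", suffixes))  # [('Stem', 'head'), ('Suffix', 'ing')]
print(guess("headed", suffixes))   # [('Stem', 'head'), ('Suffix', 'ed')]
print(guess("heal", suffixes))     # None
```

The same toy also over-guesses: `guess("bed", suffixes)` proposes the stem "b" plus the suffix "ed". That is the kind of guess a person has to reject, which is why someone who knows the morphology still needs to review the results.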

Jul 8, 2022 commented by [Expert]

Jeff_Shrum (2.9k points)

It depends on the morphology and on whether someone understands what the morphemes are. The good thing (in this situation) is that Paratext does not need to know what a morpheme signifies, or even that it is an allomorph of another morpheme. Paratext just needs to know whether a character string is a morpheme or not.
So if you could find someone who knows how to identify the morphemes, I would think they could get Paratext guessing the breaks in a NT correctly most of the time within one week, using the method I have already described. One could probably go through the entire NT in a month, but since it is challenging to do this kind of work day after day, I would double the time and schedule lots of breaks to reduce mental mistakes. If the person doing it has lots of questions about the morphology as they get into the task, it will take longer, because they will need time to work out the parts of the morphology that they do not know. Paratext will start positing patterns that may or may not actually be there, and the person will have to be able to tell Paratext “no” sometimes.

Jul 8, 2022 commented by [Expert]

Jeff_Shrum (2.9k points)

I am working in a language with complex morphology, so I have spent days dividing words into a stem and many affixes. We did set up AMPLE for a related language and used CARLA to adapt to another related language, but that was before the advent of Flex and Paratext, so it is only of historical interest.

It complicates things that many affixes change the spelling of the stem. Still, I see at least two advantages to doing this morphological parsing. First, it helps me spot words that a translator has marked as correctly spelled when in fact they are not: when I filter for words with no approved morphology, I can often find misspelled words. Second, I can use the stem option in the Biblical Terms tool. None of the translators we work with has adequate knowledge of the morphology to do this task.
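The "filter for words with no approved morphology" check can also be approximated outside Paratext once the analyses are available as data. The Python sketch below assumes you already have the project word list and a word-to-analysis mapping; both variables, the function name, and the misspelling "haeled" are made up for illustration.

```python
def unanalyzed(wordlist, analyses):
    """Return the words that have no (or an empty) analysis, sorted."""
    return sorted(word for word in wordlist if not analyses.get(word))

# Hypothetical data: three analyzed words and one that was never parsed.
wordlist = ["heal", "healed", "haeled", "heading"]
analyses = {
    "heal": [("Stem", "heal")],
    "healed": [("Stem", "heal"), ("Suffix", "ed")],
    "heading": [("Stem", "head"), ("Suffix", "ing")],
}

print(unanalyzed(wordlist, analyses))  # ['haeled']
```

A word left over by this filter is not necessarily misspelled; it may simply not have been parsed yet. It is a short list of candidates to look at, not a verdict.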

Iver Larsen

Jul 8, 2022 commented by Iver Larsen (869 points)
