0 votes

I have a colleague with a published NT, but no dictionary. They are interested in creating one, but would like to be spared as much ‘heavy lifting’ as possible. Is there a way to get the ParaTExt Wordlist to FLEx (e.g., ‘Export to XML’)? Is this recommended? Are there other recommendations? Thanks for any advice.

Paratext by (615 points)

8 Answers

+1 vote
Best answer

Paul - there are lots of thoughts on this. Some people say that you should
not build a dictionary from translated text. Having said that, you could
export the wordlist to XML and then try to import it into FLEx. However,
this would give you a mess: you would have multiple tenses and inflected
forms of the same word as separate entries. I’d suggest you look at the
Rapid Word Collection site:
http://rapidwords.net/

anon848905
Americas Area Language Technology Coordinator

by (8.4k points)

Thanks anon848905. We’re well aware of RWC (I’ve gathered ~ 4,000 words this way and helped a few colleagues get started with it as well). The particular colleagues in question have spent over twenty years working in the language and are looking for a way to get wordforms into FLEx so they could start using the Bulk Edit tools available there. Rapid Word Collection Lite? I guess I’m trying to spare them the drudgery of typing in what will likely be over 5,000 stems. Right now this is the hurdle that prevents them moving ahead… I’ll keep looking for other options.

What if you extracted the word list from Paratext and created a text
document (or maybe 50 documents of 100 words each) and then gave each
document to FLEx and added to the dictionary only the wordforms that
belong there? (Depending on the language, you could take the time to set
up some parsing, so that “acted” would be parsed as “act” + “-ed” etc.)
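
A minimal sketch of the batching idea above, assuming the wordforms are already in a plain Python list. The helper names `chunk_words` and `chunk_to_text` are hypothetical, not part of Paratext or FLEx; each resulting text could be saved as its own document and brought into FLEx as a text.

```python
def chunk_words(words, size=100):
    """Split a list of words into consecutive chunks of at most `size` items."""
    return [words[i:i + size] for i in range(0, len(words), size)]


def chunk_to_text(chunk):
    """Render one chunk as plain text, one wordform per line."""
    return "".join(word + "\n" for word in chunk)


if __name__ == "__main__":
    # Hypothetical input: in practice this would be the exported wordlist.
    words = ["abu", "abua", "abughami", "abughamu", "abui"]
    for number, chunk in enumerate(chunk_words(words, size=2), start=1):
        print("--- document", number, "---")
        print(chunk_to_text(chunk), end="")
```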

Here’s an excerpt from my ParaTExt Wordlist exported to XML:

I don’t think it would be difficult to do a few Find / Replace passes to make it useful:
Find: <item word=" -> \lx[space]
Find: spelling=“Correct” -> [nothing]
I’m not sure what to do with the morphs, though… Maybe that would have to be done in FLEx.

I’ll run this past them and see if it’s the start they’re looking for. Thanks.
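
For anyone who would rather script it than do Find / Replace passes, here is a rough sketch of the same conversion using Python's standard XML parser. It assumes each entry is an `item` element with `word` and `spelling` attributes, as in the export; the root element name `wordlist` in the sample and the function name `wordlist_to_sfm` are my own placeholders, not something Paratext defines.

```python
import xml.etree.ElementTree as ET

# Small stand-in for the exported file; the <wordlist> root name is an assumption.
SAMPLE = """<wordlist>
  <item word="abu" spelling="Correct" morph="abu" />
  <item word="abua" spelling="Correct" morph="abu +a" />
  <item word="abughinia" spelling="Unknown" morph="abu +ghini +a" />
</wordlist>"""


def wordlist_to_sfm(xml_text, only_correct=True):
    """Return one '\\lx word' line per <item>, optionally skipping unapproved spellings."""
    root = ET.fromstring(xml_text)
    lines = []
    for item in root.iter("item"):
        if only_correct and item.get("spelling") != "Correct":
            continue
        lines.append("\\lx " + item.get("word"))
    return "".join(line + "\n" for line in lines)


if __name__ == "__main__":
    print(wordlist_to_sfm(SAMPLE), end="")
```

Passing `only_correct=False` keeps every wordform, including those whose spelling status is still "Unknown".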

Ummm… Can’t one ‘paste’ into this forum? I pasted a section of text in this window and it showed up while composing, but it doesn’t appear once I’ve posted…

0 votes

You probably pasted text that was interpreted in a special way by the Markdown formatter or that was invalid and was thus removed.
Make sure to pay attention to the preview pane when composing.

by [Expert]
(16.2k points)

Let me try again, removing the leading < from each line:

item word=“abu” spelling=“Correct” morph=“abu” />
item word=“abua” spelling=“Correct” morph=“abu +a” />
item word=“abughami” spelling=“Correct” morph=“abu +ghami” />
item word=“abughamu” spelling=“Correct” morph=“abu +ghamu” />
item word=“abughinia” spelling=“Unknown” morph=“abu +ghini +a” />
item word=“abughita” spelling=“Correct” morph=“abu +ghita” />
item word=“abugho” spelling=“Correct” morph=“abu +gho” />
item word=“abui” spelling=“Correct” morph=“abu +i” />
item word=“abukoira” spelling=“Correct” morph=“abu +ko +ira” />
item word=“abukolu” spelling=“Unknown” morph=“abu +kolu” />
item word=“abura” spelling=“Correct” morph=“abu +ra” />
item word=“aburara” spelling=“Unknown” morph=“abu +ra +ra” />
item word=“abutughami” spelling=“Correct” morph=“abu +tu +ghami” />
item word=“abutuira” spelling=“Correct” morph=“abu +tu +ira” />
item word=“abuu” spelling=“Correct” morph=“abu +u” />
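
If the `morph` attribute is filled in like this, it might be possible to pull out candidate stems rather than every surface form. The sketch below is only that: it assumes that pieces written with a leading `+` are suffixes, pieces with a trailing `+` are prefixes, and anything unmarked is a stem. The function names are hypothetical, and the regular expression accepts both straight and curly quotes because the forum converted the quotes in the paste.

```python
import re

# Matches name="value" pairs, tolerating straight or curly double quotes.
ATTR = re.compile(r'(\w+)=["“”]([^"“”]*)["“”]')


def parse_item(line):
    """Return the attributes of one <item .../> line as a dict."""
    return dict(ATTR.findall(line))


def candidate_stems(lines):
    """Collect unique unmarked morphs (assumed to be stems), in first-seen order."""
    stems = []
    for line in lines:
        morph = parse_item(line).get("morph", "")
        for part in morph.split():
            is_affix = part.startswith("+") or part.endswith("+")
            if not is_affix and part not in stems:
                stems.append(part)
    return stems


if __name__ == "__main__":
    sample = [
        'item word=“abu” spelling=“Correct” morph=“abu” />',
        'item word=“abukoira” spelling=“Correct” morph=“abu +ko +ira” />',
    ]
    print(candidate_stems(sample))
```

Each stem could then be written out with an `\lx` marker, which would shrink the list considerably compared with one entry per wordform.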

0 votes

So I’ve done a Find & Replace to change < item word=" to \lx[space]. Is there a regular expression I could use to get rid of everything following the word, i.e., from the first " to > at the end:

\lx abu" spelling=“Correct” morph=“abu” />

by (615 points)
0 votes

The following should work:
Find: ".+?>
Replace: (nothing)
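
The same pattern can be tried outside an editor, for example with Python's `re` module. This is just an illustration of how the lazy match behaves on one line; `strip_attributes` is a made-up helper name.

```python
import re

# A quote, then as few characters as possible, up to the first '>'.
TRAILING_ATTRS = re.compile(r'".+?>')


def strip_attributes(line):
    """Remove everything from the first double quote through the closing '>'."""
    return TRAILING_ATTRS.sub("", line)


if __name__ == "__main__":
    print(strip_attributes('\\lx abu" spelling="Correct" morph="abu" />'))
```

Because `.+?` is lazy, the match stops at the first `>` after the quote, so each line is trimmed independently as long as there is one item per line.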

P.S. I strongly recommend this website to help create regular expressions. It’s the best one I’ve found. :smile:

by [Expert]
(16.2k points)

Worked perfectly. Thank you! I think this will be enough to get my colleagues started. :wink:

0 votes

There is a utility that will extract glosses from Paratext Interlinear as either an SFM file or an Excel Spreadsheet. Might try that. http://lingtransoft.info/apps/extract-paratext-interlinear-glosses

by [Expert]
(2.9k points)
0 votes

I found another utility for viewing the glosses in the Paratext Interlinearizer. http://lingtransoft.info/apps/glossy

by [Expert]
(2.9k points)

Glossy can be used to extract the glosses in a way that could be imported into FLEx, but it’s not exactly set up for that at the moment. It’s for interactive lexicon browsing. I think I could figure out a way to produce an SFM file if folk were interested.

Has there been any progress on getting Glossy to extract the glosses from PT Interlineariser and import them into something like an SFM file? I have tried dragging the file lexicon.xml onto ParatextLexiconSetup program. But when I run the ParatextLexicon program, I don’t know what to enter in the slot labeled ‘SFM File’. I see a dialogue box entitled ‘Convert Paratext Lexicon to SFM’. Then it prompts you to enter file names for two fields:
Paratext Lexicon, and
SFM File.
I can easily enter the lexicon.xml file. But I don’t know what to put in the SFM file. Is there a list of different SFM formatting styles to choose from?

anon088806,

The SFM File is the name of the output file of the selected lexicon.xml. You can put whatever full path name you want. This will become the desired sfm file.

kent_schroeder

Hi kent_schroeder,

Thanks so much! I had no idea the solution could be that simple… Much appreciated! anon088806

0 votes

Seems we could use a utility to work with the Paratext wordlist file. Anyone out there interested in writing one?

by [Expert]
(2.9k points)

What do you want the tool to do?

kent_schroeder
Software Developer/Language Technology Consultant
SIL – Africa-Focused Shared Services

Nairobi, Kenya

kent_schroeder, it seems that they want to harvest words from the wordlist tool for making a lexicon in FieldWorks. The interlinearizer data is better since it also has the gloss, but the wordlist tool would most likely have more words. If the vernacular words could be stripped out of the html file and marked with the \lx marker, then that file could be imported into an existing FLEx database. The user would then have to go through each entry, adjust the surface form to be a true lexeme, and enter the gloss and part of speech himself.

kent_schroeder, can you give your software the ability to convert the Biblicaltermsxyz.xml file to standard format? Someone in another post is wanting to do that.

0 votes

Here is a script I’ve written for python3 (I use Wasta Linux 18.04). Note that I have hard-coded the path for the input and output in the code…

#!/usr/bin/python3
# run with python3 convertLexicon2SFM.py3

# BEST TO IMPORT INTO TEXT AND WORDS - not into the dictionary itself.
#   -Import into Standard Format Words and Glosses

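#
#  MODIFIED AND ONLY MILDLY TESTED!!!!  Lexicon entries of type Word are written
# only if they appear in the wordlist (i.e. in the text) and are marked as
# correctly spelled; correctly spelled words with no Lexicon entry are appended
# at the end as bare \lx records without glosses.
#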

import codecs
from lxml import etree
xmldoc = etree.parse("/home/justin/Desktop/Lexicon.xml")
#Paratext Wordlist Export as XML 
wordlist = etree.parse("/home/justin/Desktop/wordlist.xml")
outfile=codecs.open("/home/justin/Desktop/PT7_dictionary_py3.sfm", mode="w", encoding='utf-8')
# Backslashes are doubled so the SFM header markers are written literally
outfile.write ("\\_sh v3.0  400  MDF\n\\_DateStampHasFourDigitYear\n\n")
#correctWords=spellings.getroot().findall("Status")
correctWords=wordlist.getroot().findall("item")
wordlistTotal=len(correctWords)
approvedWords=[]
#for index,word in reversed( list( enumerate(correctWords) ) ) :
for index,word in reversed( list( enumerate(correctWords) ) ) :
  if word.attrib['spelling'] == "Correct" :
  #if word.attrib['State'] == "W" :
    del correctWords[index]
    approvedWords.append(word.attrib['word'])
    #approvedWords.append(word.attrib['Word'])

itemList = xmldoc.getroot().findall("Entries/item")
for item in itemList :
  Lexeme=next( item.iter("Lexeme") )
  if (Lexeme.get('Type') == 'Word') and (Lexeme.get('Form') not in approvedWords) :
    print ( 'unused', Lexeme.get('Type'), Lexeme.get('Form'), "wordlist", wordlistTotal, "incorrect", len(correctWords) )
    continue
  print ( 'good', Lexeme.get('Type'), Lexeme.get('Form'), 'count', approvedWords.count(Lexeme.get('Form')), 'unglossed remaining', len(approvedWords) )
  if approvedWords.count(Lexeme.get('Form')) :
    approvedWords.remove( Lexeme.get('Form') )
  outfile.write ("\n\\lx ")
  if Lexeme.get("Type") == "Suffix" :
    outfile.write ("-")
    #outfile.write ("-", end='')
  outfile.write ("%s" % Lexeme.get("Form"))
  #outfile.write ("%s" % Lexeme.get("Form"), end='')
  if Lexeme.get("Type") == "Prefix" :
    outfile.write ("-")
    #outfile.write ("-", end='')
  outfile.write ("\n")
  outfile.write   ("\\co_Eng %s\n" % next( item.iter("Lexeme") ).get("Type"))
  entryList = item.iter("Gloss")
  sense=1
  for element in entryList :
    if element.get("Language") == "English" :
      outfile.write ("\\sn %s\n" % sense)
      outfile.write ("\\ge %s\n" %  element.text )
      sense+=1
    if element.get("Language") == "Korean" :
      outfile.write ("\\sn %s\n" % sense)
      outfile.write ("\\g_Kor %s\n" %  element.text )
      sense+=1

for Lexeme in approvedWords :
  outfile.write ("\n\\lx %s\n" % Lexeme)
  print ("adding unglossed words", len(approvedWords), Lexeme )

outfile.close()
by (105 points)
