ParaTExt Wordlist to FLEx?

Question

8 Answers

Fool Running · Answer 1 · 2015-09-11T12:54:49+0000

Let me try again, removing the leading < from each line:

item word=“abu” spelling=“Correct” morph=“abu” />
item word=“abua” spelling=“Correct” morph=“abu +a” />
item word=“abughami” spelling=“Correct” morph=“abu +ghami” />
item word=“abughamu” spelling=“Correct” morph=“abu +ghamu” />
item word=“abughinia” spelling=“Unknown” morph=“abu +ghini +a” />
item word=“abughita” spelling=“Correct” morph=“abu +ghita” />
item word=“abugho” spelling=“Correct” morph=“abu +gho” />
item word=“abui” spelling=“Correct” morph=“abu +i” />
item word=“abukoira” spelling=“Correct” morph=“abu +ko +ira” />
item word=“abukolu” spelling=“Unknown” morph=“abu +kolu” />
item word=“abura” spelling=“Correct” morph=“abu +ra” />
item word=“aburara” spelling=“Unknown” morph=“abu +ra +ra” />
item word=“abutughami” spelling=“Correct” morph=“abu +tu +ghami” />
item word=“abutuira” spelling=“Correct” morph=“abu +tu +ira” />
item word=“abuu” spelling=“Correct” morph=“abu +u” />

Sep 11, 2015 commented by Paul (627 points)

Paul · Answer 2 · 2015-09-11T21:15:12+0000

So I’ve done a Find & Replace to change < item word=" to \lx[space]. Is there a regular expression I could use to get rid of everything following the word, i.e., from the first " to > at the end:

\lx abu" spelling=“Correct” morph=“abu” />

Fool Running · Answer 3 · 2015-09-14T13:28:36+0000

The following should work:
Find: ".+?>
Replace: (nothing)
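
For anyone who would rather script this than use an editor, here is a minimal sketch of the same Find/Replace in Python. It assumes the real export uses straight double quotes (the curly quotes in this thread look like forum formatting), and the sample lines are copied from the examples above. The second substitution is only a suggestion for doing both steps in one pass; it is not something from the original answers.

```python
import re

# A line as it looks after the first Find & Replace (straight quotes assumed).
line = '\\lx abu" spelling="Correct" morph="abu" />'

# Same pattern as above: a quote, then as few characters as possible up to the first ">".
cleaned = re.sub(r'".+?>', '', line)
print(cleaned)

# Optional one-pass variant working on the untouched XML line.
xml_line = '<item word="abua" spelling="Correct" morph="abu +a" />'
one_pass = re.sub(r'<item word="([^"]*)".*?/>',
                  lambda m: '\\lx ' + m.group(1),
                  xml_line)
print(one_pass)
```

The lazy `.+?` matters if several items ever end up on one line: a greedy `.+` would run to the last `>` on the line and swallow the items in between.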

P.S. I strongly recommend this website to help create regular expressions. It’s the best one I’ve found.

Jeff_Shrum · Answer 4 · 2015-09-28T22:34:22+0000

There is a utility that extracts glosses from the Paratext Interlinear as either an SFM file or an Excel spreadsheet. You might try that: http://lingtransoft.info/apps/extract-paratext-interlinear-glosses

anon172528 · Answer 5 · 2018-11-13T11:51:49+0000

Here is a script I’ve written for Python 3 (I use Wasta Linux 18.04). Note that I have hard-coded the paths for the input and output files in the code…

#!/usr/bin/python3
# run with python3 convertLexicon2SFM.py3

# BEST TO IMPORT INTO TEXT AND WORDS - not into the dictionary itself.
#   -Import into Standard Format Words and Glosses

#
#  MODIFIED AND ONLY MILDLY TESTED!!!!  It only includes words that 1.) have glosses
# and 2.) are marked as correctly spelled and 3.) exist in the text.
#

import codecs
from lxml import etree
xmldoc = etree.parse("/home/justin/Desktop/Lexicon.xml")
#Paratext Wordlist Export as XML 
wordlist = etree.parse("/home/justin/Desktop/wordlist.xml")
outfile=codecs.open("/home/justin/Desktop/PT7_dictionary_py3.sfm", mode="w", encoding='utf-8')
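# Backslashes are doubled: "\_" is not a valid Python escape sequence and triggers a warning in Python 3.
outfile.write ("\\_sh v3.0  400  MDF\n\\_DateStampHasFourDigitYear\n\n")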
#correctWords=spellings.getroot().findall("Status")
correctWords=wordlist.getroot().findall("item")
wordlistTotal=len(correctWords)
approvedWords=[]
# Walk the list backwards so that deleting by index does not shift the items still to be visited.
for index,word in reversed( list( enumerate(correctWords) ) ) :
  if word.attrib['spelling'] == "Correct" :
  #if word.attrib['State'] == "W" :
    del correctWords[index]
    approvedWords.append(word.attrib['word'])
    #approvedWords.append(word.attrib['Word'])

itemList = xmldoc.getroot().findall("Entries/item")
for item in itemList :
  Lexeme=next( item.iter("Lexeme") )
  if (Lexeme.get('Type') == 'Word') and (Lexeme.get('Form') not in approvedWords) :
    print ( 'unused', Lexeme.get('Type'), Lexeme.get('Form'), "wordlist", wordlistTotal, "incorrect", len(correctWords) )
    continue
  print ( 'good', Lexeme.get('Type'), Lexeme.get('Form'), 'count', approvedWords.count(Lexeme.get('Form')), 'unglossed remaining', len(approvedWords) )
  if approvedWords.count(Lexeme.get('Form')) :
    approvedWords.remove( Lexeme.get('Form') )
  outfile.write ("\n\\lx ")
  if Lexeme.get("Type") == "Suffix" :
    outfile.write ("-")
    #outfile.write ("-", end='')
  outfile.write ("%s" % Lexeme.get("Form"))
  #outfile.write ("%s" % Lexeme.get("Form"), end='')
  if Lexeme.get("Type") == "Prefix" :
    outfile.write ("-")
    #outfile.write ("-", end='')
  outfile.write ("\n")
  outfile.write ("\\co_Eng %s\n" % Lexeme.get("Type"))
  entryList = item.iter("Gloss")
  sense=1
  for element in entryList :
    if element.get("Language") == "English" :
      outfile.write ("\\sn %s\n" % sense)
      outfile.write ("\\ge %s\n" %  element.text )
      sense+=1
    if element.get("Language") == "Korean" :
      outfile.write ("\\sn %s\n" % sense)
      outfile.write ("\\g_Kor %s\n" %  element.text )
      sense+=1

for Lexeme in approvedWords :
  outfile.write ("\n\\lx %s\n" % Lexeme)
  print ("adding unglossed words", len(approvedWords), Lexeme )

outfile.close()
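
To try the wordlist-filtering step without touching the hard-coded paths, the following is a small sketch that works on an in-memory sample using only the standard library. The function names `correct_words` and `to_sfm` and the `wordlist` root element are hypothetical, chosen for illustration. The only thing taken from the script above is that the root has `item` children with `word` and `spelling` attributes.

```python
import xml.etree.ElementTree as ET

def correct_words(xml_text):
    """Return the word attribute of every <item> marked spelling="Correct"."""
    root = ET.fromstring(xml_text)
    return [item.get("word")
            for item in root.findall("item")
            if item.get("spelling") == "Correct"]

def to_sfm(words):
    """Write one bare \\lx record per word, as the script does for unglossed words."""
    return "".join("\n\\lx %s\n" % w for w in words)

# Hypothetical sample; the root element name is an assumption.
sample = """<wordlist>
<item word="abu" spelling="Correct" morph="abu" />
<item word="abukolu" spelling="Unknown" morph="abu +kolu" />
<item word="abura" spelling="Correct" morph="abu +ra" />
</wordlist>"""

words = correct_words(sample)
sfm = to_sfm(words)
print(sfm)
```

Passing the XML text (or a path read by the caller) into a function like this keeps the file locations out of the conversion logic.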
