A manual restore of interlinear data after migration

Question

Paratext 8 seems to have a problem migrating interlinear glosses if you used the Interlinearizer in Paratext 7 to gloss text with the NT Greek as your model text. A user asked for help in this situation, and I saw that after migration the language code in lexicon.xml and in the per-book gloss files was “el”, while the code for NT Greek should be “grc”. (In a test project, I glossed a few words in Greek and then migrated; in the migrated project, the code for NT Greek became “lbj”, the code for a language from India. I have reported this issue to the developers.)

The language code is used in three places in the interlinearizer data.

Inside the lexicon.xml file, in the “Gloss Language” field. This field occurs for every word that is given a gloss in that language.
In the subfolder name and the file name of the per-book Interlinearizer file. For example, “Interlinear_el_MAT.xml” is the file for glosses in the “el” language for Matthew.
Inside each per-book file, in the GlossLanguage field in the second line of the file.

How did I find out that “grc” was the right code for NT Greek? I glossed one word in Paratext 8 with Greek as the model, saved the change, then looked at the files.
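That check can also be scripted instead of opening each file by hand. The following is a minimal Python sketch, assuming the layout described in this post (`lexicon.xml` in the project folder and `Interlinear_<code>` subfolders holding the per-book files). The function name `find_gloss_codes` is made up for illustration and is not part of Paratext.

```python
import re
from pathlib import Path


def find_gloss_codes(project_dir):
    """Report the language codes found in the three places described above."""
    project = Path(project_dir)
    found = {"lexicon": set(), "folders": set(), "book_files": set()}

    # Place 1: <Gloss Language="..."> attributes inside lexicon.xml
    lexicon = project / "lexicon.xml"
    if lexicon.is_file():
        text = lexicon.read_text(encoding="utf-8-sig")
        found["lexicon"].update(re.findall(r'<Gloss Language="([^"]*)"', text))

    # Place 2: Interlinear_<code> subfolders (naming assumed from this post)
    for folder in project.glob("Interlinear_*"):
        if not folder.is_dir():
            continue
        found["folders"].add(folder.name[len("Interlinear_"):])
        # Place 3: GlossLanguage="..." inside each per-book file
        for book in folder.glob("*.xml"):
            text = book.read_text(encoding="utf-8-sig")
            found["book_files"].update(
                re.findall(r'GlossLanguage="([^"]*)"', text)
            )
    return found
```

Running it on a copy of the project before and after glossing one word in Paratext 8 should show which code the new version writes.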

So to manually convert this data, I did the following:
0) Close Paratext if it is open.

1) Do a search-and-replace in lexicon.xml, replacing “el” with “grc”. For example:
```
<Gloss Language="el">δέ</Gloss> 
```
becomes
```
<Gloss Language="grc">δέ</Gloss>
```

To limit the change to the codes and avoid changing any “el” strings inside a larger word, include the quote marks (straight double quotes) in both the search string and the replace string.

2a) Change the name of the “Interlinear_el” folder inside the project folder to “Interlinear_grc”. (If you created a test file with the desired code, delete that folder and its file first.)

2b) Change the file names inside this folder from “Interlinear_el_[Bookcode].xml” to “Interlinear_grc_[Bookcode].xml”.

3) Change the GlossLanguage code in the second line of each per-book file to the desired code. For instance:

 <InterlinearData ScrTextName="MP8" GlossLanguage="el" BookId="MAT">

 becomes

 <InterlinearData ScrTextName="MP8" GlossLanguage="grc" BookId="MAT">

4) Start Paratext and see if it worked.
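The manual steps above could be scripted roughly as follows. This is only a sketch under assumptions: the files are UTF-8, the folder and file names follow the `Interlinear_<code>` pattern from steps 2a and 2b, and the function name `convert_gloss_language` is invented here. It narrows the lexicon replacement to `<Gloss Language="el"` rather than the bare `"el"` string. Run it on a copy of the project first, with Paratext closed.

```python
from pathlib import Path


def convert_gloss_language(project_dir, old="el", new="grc"):
    """Script the manual conversion. Run only while Paratext is closed."""
    project = Path(project_dir)
    old_dir = project / f"Interlinear_{old}"
    new_dir = project / f"Interlinear_{new}"
    if new_dir.exists():
        # As in 2a: a leftover test folder has to be deleted by hand first.
        raise FileExistsError(f"{new_dir} already exists; delete it first")

    def swap(path, attribute):
        # Byte-level replace of attribute="old", quotes included, so an
        # "el" inside a longer word is never touched. Assumes UTF-8 files.
        before = f'{attribute}="{old}"'.encode("utf-8")
        after = f'{attribute}="{new}"'.encode("utf-8")
        path.write_bytes(path.read_bytes().replace(before, after))

    # lexicon.xml: <Gloss Language="el"> becomes <Gloss Language="grc">
    swap(project / "lexicon.xml", "<Gloss Language")

    renamed = []
    if old_dir.is_dir():
        old_dir.rename(new_dir)  # 2a: rename the folder
        prefix = f"Interlinear_{old}_"
        for book in list(new_dir.glob(prefix + "*.xml")):
            swap(book, "GlossLanguage")  # code inside the per-book file
            target = book.with_name(f"Interlinear_{new}_" + book.name[len(prefix):])
            book.rename(target)  # 2b: rename the per-book file
            renamed.append(target.name)
    return sorted(renamed)
```

The function returns the new per-book file names, so an empty list is a hint that the folder name or the old code did not match what is on disk.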

When editing the XML files, make sure you don’t change any of the <, >, </ or /> codes; these are like backslashes in USFM. If you make a mistake, Paratext may reject your edited lexicon.xml, rename it to lexicon.xmlcorrupt, and start creating a new one. If you save a copy of your lexicon.xml file in another location before editing, you can restore it if you hit this problem and cannot identify what went wrong in your edited file.
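The backup and a basic sanity check can be sketched in Python as well. The helper names `backup_file` and `is_well_formed` are hypothetical. The check only confirms that the file still parses as XML, which catches damaged tags but not every reason Paratext might reject a lexicon.

```python
import shutil
import xml.etree.ElementTree as ET
from pathlib import Path


def backup_file(path, backup_dir):
    """Copy a file (for example lexicon.xml) to another folder before editing."""
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy2(str(path), str(backup_dir)))


def is_well_formed(path):
    """Return True if the file still parses as XML after editing."""
    try:
        ET.parse(str(path))
        return True
    except ET.ParseError:
        return False
```

A typical use would be to call `backup_file` before the search-and-replace and `is_well_formed` on the edited file before starting Paratext again.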

Paratext Nov 3, 2017 asked by [Expert]

sewhite (3.1k points)
Mar 1, 2021 reshown

4 Answers

Jeff_Shrum · Answer 1 · 2017-11-14T05:10:11+0000

Update: I had the same problem anon044949 had with the orthography change project. The task was a search/replace on five vowels in the language to replace them with different characters. It turned out that a few glosses had already been done in the new orthography in the lexicon, after the conversion. When I converted all the old entries, this produced a few duplicates: two instances of the same word, each with a different sense ID or gloss ID. When Paratext loaded this file into memory, it protested and marked the lexicon file as corrupt. So besides damaging the < > and </ > codes, there is a second way to “corrupt” a lexicon: ending up with duplicate words. But this was a different situation from changing the language codes; it required changing the words and morphemes inside the lexicon file to match the new orthography.
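Duplicates like these can be looked for before Paratext loads the file. The sketch below is heavily hedged: this thread does not show how words are stored in lexicon.xml, so the element name `Lexeme` and the attributes `Type`, `Form` and `Homograph` are assumptions that should be checked against your own file and adjusted through the parameters. The function name is also invented.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def find_duplicate_entries(lexicon_path, tag="Lexeme",
                           key_attrs=("Type", "Form", "Homograph")):
    """Return attribute combinations that occur on more than one <tag> element.

    The default tag and attribute names are assumptions, not confirmed
    by this thread; pass your own if the file is structured differently.
    """
    root = ET.parse(str(lexicon_path)).getroot()
    counts = Counter(
        tuple(elem.get(attr) for attr in key_attrs)
        for elem in root.iter(tag)
    )
    return [key for key, count in counts.items() if count > 1]
```

An empty list means no element with that tag repeats the same attribute values; anything else is a candidate to merge by hand.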

Nov 29, 2017 commented by [Expert]

sewhite (3.1k points)
Nov 29, 2017 reshown

Phil_Leckrone · Answer 2 · 2018-03-02T14:17:56+0000

Yesterday I ran into a situation where in PT7 the language had been “Spanish” for the RV60, and in PT8 the language for RVR1960 is “spa”. I followed Steven’s instructions in an earlier post to make the appropriate changes, but the glosses still did not appear as approved (as they were in PT7).

Tim S. pointed out to me that in PT8 there are certain languages that display the three-letter code (in this case spa) but internally use the two-letter code (in this case es) for matching the interlinear data. Once I made the appropriate changes and used “es”, the glosses appeared as approved.

So, if you try using the language of the model text and it doesn’t work, you might try the appropriate two-letter code.

A chart of these codes can be found at: https://www.loc.gov/standards/iso639-2/php/code_list.php
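As a small illustration of trying both forms, here is a Python sketch with a handful of ISO 639-1 and ISO 639-2 pairs. The table is deliberately tiny and the function name `codes_to_try` is made up; which form Paratext uses internally for a given language is not something this sketch knows, so it only lists candidates to test.

```python
# A few ISO 639-1 (two-letter) codes with their ISO 639-2/T (three-letter) forms.
TWO_TO_THREE = {"es": "spa", "en": "eng", "el": "ell", "fr": "fra"}
THREE_TO_TWO = {three: two for two, three in TWO_TO_THREE.items()}


def codes_to_try(code):
    """Return the code itself, then its two- or three-letter counterpart if known."""
    candidates = [code]
    other = TWO_TO_THREE.get(code) or THREE_TO_TWO.get(code)
    if other:
        candidates.append(other)
    return candidates
```

For example, `codes_to_try("spa")` gives `["spa", "es"]`, while `codes_to_try("grc")` gives only `["grc"]`, since Ancient Greek has no two-letter code.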

jwagner · Answer 3 · 2022-06-17T09:16:06+0000

The process sewhite describes was necessary to relink our BT project with the Interlinearizer after upgrading to PT9.2 (we had originally set the language code to the same code as the main project rather than (en), and PT couldn’t identify it).
The only step I would add is that in the BT project’s menu under “project settings” in PT, you can switch the language code even after the project has been created (in a standard project you cannot do this). We had multiple teams with BT projects that were created with the same language code as their main project. Changing the code to English (eng) allowed us to select our BT project in the Interlinearizer setup.

(NOTE: we also figured out that you cannot use “Create Glosses for ZZZ with no model text” and choose “output to” to select a BT project. We don’t use a model text for our BTs, so this seemed ideal, but it will only let you choose a Standard Project for output. So the option to create a back translation using the BT project AS THE MODEL was the option we needed. It’s all working great now!!)
