The problem is, I believe, what Paratext’s XML specifies and what it doesn’t. It specifies chunk of text, gloss-id (cross-referencing the Lexicon.xml
file) and the position of the original text in some representatoin of the input stream. It does not preserve the order of occurrence. The verses are not in sequence, either.
A given word or phrase may be glossed as word(s) or stem and morphemes, (of course with different glosses) and the only way to tell which homograph is which or which level of glossing the user expects is via the positioning data. The glossing file seems to ignore case, punctuation and SFM marks, and probably other things too.
Thus if the glossing file matches the input, then assuming that Martin has understood the undocumented way in which Paratext counts characters in its internal representation of the file, then the right thing to do is cut out the word/phrase unit by position and replace with the gloss, possibly doing some case transposition as appropriate.
If the verse has changed, then that doesn’t work. If the internal representation of the file is wrong, then it doesn’t work.
Any unconstrained search and replace is going to get false matches on homographs, so that can’t be used as a general approach. There could be some guessing done, looking for fuzzy-matches plus or minus a few characters, breaking at word-gaps, for instance. That would cost a fairly large investment of programmer time, but there could be problems. If the spacing has changed because someone has reordered the words in the verse, the homograph offset by 3 characters might not be the same word. E.g. Romani has fairly free word order:
O Del te del tut haro
(the) God SUBJ give you grace
"May God give you grace"
vs
Te del o Del tut haro
SUBJ give (the) God you grace
"May God give you grace"
PTXprint can’t go into the linguistics rules of the language. Not even Paratext remembers / asks what parts of speech things are. The only reliable way to get trustworthy interlinear data is for the user to approve glosses.