0 votes

Does anyone know of a command line tool to convert USFM format to USX (or any other xml flavor)?

I’d like to write a script to grab a few different .sfm files from different projects and mass-convert them, which is quite difficult with the built-in Tools -> Advanced -> Export Project to USX.

Also, USX makes some strange design decisions that seem to be direct reflections of the USFM code instead of true xml. At least they seem strange to me who has very little background in this area… but I would have expected the tag to close at the end of the chapter and not immediately after the chapter number, e.g.
<chapter number="1" style="c">
    all the verses
</chapter>

instead of
<chapter number="1" style="c" />

I have the exact same question about verse tags.

Can anyone explain why the xml works this way, or suggest other xml formats that are more logical? These files will be used internally in a publishing project and thus we’re not too concerned about external standards.

Note I’ve tried the Haiola usfm2usfx.exe tool and can’t get it to work, but if people say that’s the best option I can try to figure out what’s wrong.

Paratext by (1.8k points)

6 Answers

0 votes
Best answer

Sorry, this will not work. You can read the result of an already set-up module that Paratext has run the module generation on, but you can’t write a module to Paratext and have it recognize it as a module and run the generation.

EDIT: Apparently I missed some questions from over two years ago: :grimacing:

Yes, it should output 3.0 format now.

Unfortunately no. rdwrtp8 is designed to read/write the raw files so that other applications can read/write Paratext data. This requires that the data be round-trip-able and the changes defined in PrintDraftChanges.txt are not round-trip-able.

by [Expert]
(16.2k points)

0 votes

I don’t know of such a tool. The best way seems to me to use a standardized location scheme for exporting the USX files, e.g. C:\My Paratext 8 Projects\[project name]\local\USX and get them from there.
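If you go the fixed-folder route, a short script can gather the exported files from each project afterwards. This is only a sketch: it assumes the folder layout from the example path above, assumes the exports carry a `.usx` extension, and the names `usx_export_dir` and `collect_usx` are made up for illustration.

```python
from pathlib import Path


def usx_export_dir(projects_root, project):
    """Return the assumed export folder: <root>/<project>/local/USX."""
    return Path(projects_root) / project / "local" / "USX"


def collect_usx(projects_root, projects):
    """Map each project name to a sorted list of the .usx files found for it."""
    found = {}
    for project in projects:
        folder = usx_export_dir(projects_root, project)
        # A project that has not been exported yet simply yields an empty list.
        found[project] = sorted(folder.glob("*.usx")) if folder.is_dir() else []
    return found
```

The export itself still has to be run by hand from the Paratext menu; the script only picks up whatever is already in the folders.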
Regarding the chapters and verse markers, these are called milestones and are one way to handle overlapping markup.
Is this for a print publication or digital only?

by (834 points)

This is for a print-only publication. The publishing program we’re using, SILE, takes xml inputs. It would actually take a TeX-like input (i.e. USFM), but the programmer I’m working with would prefer to use xml.

I’m assuming the format of USX will work perfectly fine for the programmer, so these other questions are for my own understanding.

I guess it’s just a design decision, but I’m wondering why whoever created USX decided that <chapter> meant “the number at the beginning of a chapter” and not “chapter”. My understanding of XML is that it typically uses a format like
<tag attributes="aaa">content</tag>

The way things currently are,
<chapter number="1" style="c" /> has no content,
where I would expect
<chapter number="1" style="c">the actual content of the chapter</chapter>

Maybe I’m missing something, but I don’t see how this is caused by overlapping markup issues.

Here’s another issue: following the style of religious books in our region, our project plans to mark verse numbers at the end of verses, not the beginning. The way the <verse> marker currently works, we’ll need to use scripting to move the tag to a different position in the plain-text XML document, but if USX used the format I think it ought to:
<verse number="1" style="v">verse text</verse>
then it would be quite simple to have the program reading the xml interpret that line and say “print out the attribute ‘number’ at the end of ‘verse text’ instead of the beginning”.
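For what it’s worth, the milestone form does not necessarily require editing the file itself. In Python’s `xml.etree.ElementTree`, the text that follows an empty element is exposed as that element’s `tail`, so a reader can pair each verse number with its text. The sketch below assumes USX 2.5-style start milestones with a `number` attribute, plain text directly after each marker, and a single paragraph; the sample content and the function name are invented, and nested character styles or verses that continue into another paragraph would need more work.

```python
import xml.etree.ElementTree as ET

# Invented sample paragraph using the milestone form shown in this thread.
SAMPLE = (
    '<para style="p">'
    '<verse number="1" style="v" />In the beginning. '
    '<verse number="2" style="v" />And the earth was void.'
    '</para>'
)


def verses_with_trailing_numbers(para):
    """Return the paragraph text with each verse number placed after its verse."""
    parts = []
    for verse in para.iter("verse"):
        number = verse.get("number")
        if number is None:
            # Skip markers without a number (e.g. end milestones in newer USX).
            continue
        # The verse text is the milestone's tail, not its content.
        text = (verse.tail or "").strip()
        parts.append(f"{text} ({number})")
    return " ".join(parts)


para = ET.fromstring(SAMPLE)
print(verses_with_trailing_numbers(para))
# prints: In the beginning. (1) And the earth was void. (2)
```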

0 votes

A year and a half later, I still want this functionality.

The built-in Paratext “export project to USX” is exactly what I need, but I’d like it automated so I could incorporate it into a command-line build script. The USFM -> USX conversion code must be in PT. Is there a way to get that code from the developers (understanding that a bit of modification would be necessary on our part to get it to run under whatever language we happen to use)? If so, how would I go about doing that?

by (1.8k points)
0 votes

It sounds like you want a stand-alone tool, but if all you need is the text of a project in USX format, Paratext 8+ ships with a small command-line tool that can extract the text for a project in USX format:
rdwrtp8 -r projectShortName bookCode chapterNum resultFileName -x

The -x is to specify USX format instead of USFM format.
If chapterNum is 0 (zero), then the whole book is returned.

This cannot be used on resources for obvious reasons.

It would be difficult to split out the parsing code outside of Paratext since it depends a lot on the project settings and stylesheet information to produce the USX from the USFM.
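To run the rdwrtp8 command above over several projects and books, a small wrapper along these lines might do. It is a sketch under assumptions: it uses only the arguments quoted above (`-r`, project short name, book code, chapter number, result file, `-x`), it assumes `rdwrtp8` can be found on the PATH (otherwise pass the full path as `exe`), and the function names and output file naming are invented here.

```python
import subprocess
from pathlib import Path


def build_rdwrtp8_command(project, book, out_file, chapter=0, usx=True, exe="rdwrtp8"):
    """Build the argument list for one read (-r) call, following the usage quoted above."""
    cmd = [exe, "-r", project, book, str(chapter), str(out_file)]
    if usx:
        cmd.append("-x")  # request USX instead of USFM
    return cmd


def export_books(jobs, out_dir, run=subprocess.run):
    """Run one export per (project, book) pair and return the output paths.

    `run` is injectable so the command list can be inspected without Paratext installed.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for project, book in jobs:
        out_file = out_dir / f"{project}_{book}.usx"  # hypothetical naming scheme
        run(build_rdwrtp8_command(project, book, out_file), check=True)
        written.append(out_file)
    return written
```

Calling `export_books([("PROJ", "GEN"), ("PROJ", "MAT")], "usx_out")` would then request whole books (chapter 0) for those two book codes, with `PROJ` standing in for a real project short name.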

by [Expert]
(16.2k points)
0 votes

When I saw your recent post, I was going to suggest Haiola as well, but I see that you tried that before. Admittedly, I have not used it myself. I would suggest contacting the author with any issues you found, if rdwrtp8 does not provide a solution.

As for overlapping markup, I would guess an example would be Nehemiah 8. In my printed NIV, chapter 8 starts in the middle of a paragraph. And the verse number (1 in this case) is printed next to the chapter number (which is in drop caps). In other chapters, the verse number is omitted if it is 1.

In Paratext the style sheet defines a hierarchy of markers, such as id > ide > c > p > v. For most chapters and paragraphs, this could map to XML in the way you expect (with the c element containing a p element, etc.). But for Nehemiah 8, the Paratext hierarchy breaks down, which is why I suspect USX has a c element that has no content.
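The same problem shows up with verses, and it can be checked with any XML parser. The sketch below uses simplified, invented markup (not complete USX) for a verse that continues across a paragraph break: written as a container, the verse would have to open in one paragraph and close in the next, which is not well-formed XML, while the milestone version parses fine.

```python
import xml.etree.ElementTree as ET

# Container style: <verse> opens in the first <para> and closes in the second.
CONTAINERS = (
    '<usx>'
    '<para style="p"><verse number="1">first half of the verse</para>'
    '<para style="p">second half of the verse</verse></para>'
    '</usx>'
)

# Milestone style: the empty <verse /> only marks where the verse starts.
MILESTONES = (
    '<usx>'
    '<para style="p"><verse number="1" style="v" />first half of the verse</para>'
    '<para style="p">second half of the verse</para>'
    '</usx>'
)


def is_well_formed(xml_text):
    """Return True if the text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


print(is_well_formed(CONTAINERS))  # prints: False
print(is_well_formed(MILESTONES))  # prints: True
```

So a format has to pick one hierarchy to express as nested elements (paragraphs here) and mark the other (chapters and verses) with empty elements.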

by (185 points)
0 votes

Thanks, @anon291708, I think rdwrtp8 will do exactly what I need. I do have a few questions about it, having just run it.
It outputs USX in version 2.5 format, which is what PT8 used. Will it ever be updated to output in 3.0 format like PT9 does?
It doesn’t appear to use the PrintDraftChanges.txt file during the export process, which the “Export project to USX” menu option in PT8/9 does use. Is there a way to incorporate those changes?

@Bobby, thanks for suggesting Haiola. Yes, I had tried it unsuccessfully. As for the overlapping markers, after reading through this post a year or two ago and thinking through the issues, I think I’ve come to a better understanding of why it’s important (and why any system will have inherent shortcomings).

by (1.8k points)

I would now like to write a script which takes a Bible Module spec file and creates the output file from it. I was told here that rdwrtp8.exe should do this, but I can’t get it to work.

Should it? Should I be able to take a Bible Module spec file, load it into PT externally as an XXA-XXG book, then export that, and have the resulting file be the output of the Bible Module with all the scripture filled in?

The commands I’m using are:
rdwrtp8 -w PROJ XXA 0 "D:\My-Modules\module1.sfm"
and
rdwrtp8 -r PROJ XXA 0 "D:\My-Expanded-Modules\module1.sfm"
