0 votes

Is there a way to get the expressions on the User menu in RegEx Pal as a text file? The list of user expressions is very long, and it would be useful to be able to search it, for example.

I’ve had problems adding my own expressions: it seems that you’re allowed to add a search/replace pair of expressions, but when I did this a while back, the replace expression got munged.

A question I have about regexes is: what is the definition of a word-forming character (\w)? Does it include every alphabet, or only Latin? What’s the best way to search for characters in one alphabet only? For Cyrillic, I could keep the string [абвг…АБВГ…] in a test file, and copy it into a regex, but that would result in some very long strings. Is there not a better way? Can you take a step back and define strings that can be referenced in a regex with codes as short as \w ?

Despite the large number of user expressions that are included, I’m sure there are loads more that users might like. For some things I’m working on, I could do with one to replace straight quotes with curly quotes, and choose the right one for opening or closing a quotation.

Maybe there could be two categories added to this forum: one for questions like “somebody help me write an expression to do this”; and another where people could post expressions that have been tested on real project data so that they’re watertight. The latter could help us build a library of useful expressions.

Paratext by (1.4k points)

10 Answers

0 votes
Best answer

Yes, they are stored in My Paratext Projects\userMenu.txt. If that does not exist, then the menu is created with the default list which is in Program Files (x86)\Paratext 7\userMenuStd.txt. The format of the file is very specific, so beware if you edit it.

by [Expert]
(16.2k points)

I have lots of expressions added to my usermenu.txt file that I use
for training. However, I have not recently validated them and I’m midway
through cleaning up my file and have not completed this. I am teaching
next week, but after that I will try to clean them up.

I am looking to make sets of markers in separate files that can be added
to a usermenu as desired by an end user. I will also explain a bit about
the syntax of the usermenu file.

Once done I will post them as a file(s) in a Typesetters Community of
Practice (TCoP) google group. I am trying to make them sort of self
documenting. I have over 150 menu selections in groups like:

  • [VARIANT TEXT] (IMPLIED TEXT)
  • POSSIBLE TRANSLITERATIONS to get multiple words (up to 30 chars)
    REMOVE space after the [^
  • INTROS \mt, \toc and OUTLINES
  • DASHES BETWEEN #'s versus WORDS
  • SCRIPTURE REF PREPARATION
  • BOOK NAMES
  • CH/VS SYNTAX
  • MARKERS
  • \r PARALLEL PASSAGE
  • QUOTES MARKS
  • \f FOOTNOTE BEFORE MAKING CHANGES MARK POINT IN PROJECT HISTORY IN
    PARATEXT FIRST
  • \x XREFs

These all help with analysis, cleanup, and common conversions.

For example: footnotes/xrefs have to do with cleanup and standardizing
the syntax.

I guess these are a teaser of something that could be available in October.

anon467281
Global Publishing Services
Scripture Typesetting trainer & Regular Expression Specialist
Dallas TX

I don’t have either of these files in 7.5, and only the latter in 7.6.

I do have “c:\Program Files (x86)\Paratext 7\ParatextRegExPal\userMenuStd.txt” in both, but no userMenu.txt file in that folder.

And those files that I list contain very little text (20 lines, excluding blank lines) – nowhere near what was in the User menu when I first installed PT. Or maybe those 20 lines are the default set?

So is there something strange with my PT installation? Where is it getting the 100+ items it displays on the User menu?

Paratext looks in “My Paratext Projects” folder to see if the
usermenu.txt exists. If it does, it uses it, otherwise it uses
usermenustd.txt.

The set that installs with Paratext has always been in the neighborhood
of 20 entries.

I have a usermenu.txt with about 150 entries. Some time after our
training (ends Oct 2) I will clean it up and post it to this Google
group as well as the TCoP (Typesetting Community of Practice).

anon467281

Global Publishing Services
Scripture Typesetting trainer & Regular Expression Specialist
Dallas TX

In the end, I searched all my drives, and found usermenu.txt in “My Paratext Projects”. I had misread your posting, @anon291708, and thought you meant they were both in the program folder – profound apologies.

Hi anon467281,

I was wondering if I could receive a copy of your list of regular expressions, the usermenu.txt file you referred to with 150+ entries. We are having a training topic at the Eurasia Language Technology Workshop on using Regular Expressions and some of these could be useful to participants.

Thanks,

Stevan Vanderwerf
Language Technology Consultant

Here’s the latest. I am still not totally finished testing out all of
the selections.

D anon467281

Global Publishing Services
Scripture Typesetting trainer & Regular Expression "specialist"
Dallas, TX

Doesn’t look like the attachment was added. I have placed the contents of my UserMenu.txt file after the line of equal signs at the bottom of this reply.

———CH/VS SYNTAX—————————#f#
———\v VS NUMBER SYNTAX—————————#f#
CV-01 –IlO— - find invalid characters in \v verse number (found letters I l O endash emdash)#ei#\v\s?\S*:::\S*([–IlO—]|–+)\S*
CV-02 L->1 - change letter lowercase “L” and uppercase “I” in verse number to number “1”#r#\v\s?\S+:::l#1
CV-03 o->0 - change letter uppercase “o” in verse number to number “0”#r#\v\s?\S+:::O#0
CV-04 en/em->dash - change en/em dash and duplicate dashes verse range character in verse number to singledash “-”#r#\v\s?\S+:::[\u2013\u2014]|–+#-
———CH:VS SEPARATOR—————————#f#
CV-11 a- ANALYZE ch:vs separators#csd#\d[:.] ?\d
CV-12 :- STANDARDIZE to : colon ch:vs separators#r#(\d)[:.] ?(\d)#\1:\2
CV-13 .- STANDARDIZE to . period/full stop as ch:vs separators#r#(\d)[:.] ?(\d)#\1.\2
———CHAPTER SEPARATORS “;”—————————#f#
CV-21 ; c- COUNT missing chapter separator#csd#([:.]\d+)( \d+[:.]\d+)
CV-22 ; a- ADD missing chapter separator#r#([:.]\d+)( \d+[:.])#\1;\2
CV-23 9;9 c- COUNT missing spaces after chapter separators “;”#csd#(\d+[:.][\d, \p{Pd}abc\p{Lu}]+)(\d?;)\d[^\]+
CV-24 9; 9 a- ADD missing space after chapter separator “;”–>"; "#r#(\d+[:.][\d, \p{Pd}abc\p{Lu}]+)(\d?;)(?=\d[^\]+)#\1\2
———CHAPTER BRIDGE—————————#f#
CV-31 a- ANALYZE chapter bridge separators#csd#(\d+[.:][\dabc]+(, ?)?[\dabc])\s[\p{Pd}]+\s*(\d+[.:]\d)
CV-32 n- STANDARDIZE to N-dash (u2013) as chapter bridge separator#r#(\d+[.:][\dabc]+(, ?)?[\dabc])\s[\p{Pd}]+\s*(\d+[.:]\d)#\1\u2013\3
CV-33 m- STANDARDIZE to M-dash (u2014) as chapter bridge separator#r#(\d+[.:][\dabc]+(, ?)?[\dabc])\s[\p{Pd}]+\s*(\d+[.:]\d)#\1\u2014\3
———BOOK SEPARATORS/BRIDGES—————————#f#
BS-41 ; L- List invalid book sep - then run CV-42 1st then CV-43#edi#(([12] ?)?\p{Lu}[\w/~]+ \d+[:.][\d,-—a-f]+)[^;\p{Pd}] ?([\w/~]+ \d+([:.][\d,-—a-f]+))
BS-42 ,_ 1st- DO FIRST-Change book separator from ", " -> "; "#r#([123] ?)?(\p{Lu}[\w/~]+ \d+[:.][\d,-—a-f]+), ?(?=([123] ?)?\p{Lu}[\w/~]+ \d+([:.][\d,-—a-f]+))#\1\2;
BS-43 ,|_ 2nd- DO SECOND-Add "; " missing book separator#r#([123] ?)?(\p{Lu}[\w/~]+ \d+[:.][\d,-—a-f]+) (?=([123] ?)?\p{Lu}[\w/~]+ \d+([:.][\d,-—a-f]+))#\1\2;
BS-44 +; - insert missing ; bk sep between vs-no. and \xdc#r#(?<=\x .?\x(dc|ot|nt|t) ).?\x*:::(\d+)( \x(dc|ot|nt|t))#\1;\2
BB-45 – - List all book bridges#edi#(?<=[12] ?\p{Lu}[\w]* \d+[:.][\d,-a-f]+)\p{Pd}+(?=[12] ?\p{Lu}[\w]+ \d+[:.])
BB-46 – - Make all book bridges be – en-dash (u2013) as in 2Sa 9.9–1Ki 9.9#r#(?<=[12] ?\p{Lu}[\w]* \d+[:.][\d,-a-f]+)\p{Pd}+(?=[12] ?\p{Lu}[\w]+ \d+[:.])#\u2013
BB-47 — - Make all book bridges be — em-dash (u2014) as in 2Sa 9.9—1Ki 9.9#r#(?<=[12] ?\p{Lu}[\w]* \d+[:.][\d,-a-f]+)\p{Pd}+(?=[12] ?\p{Lu}[\w]+ \d+[:.])#\u2014
———VERSE SEPARATOR—————————#f#
CV-51 c- ANALYZE verse separators ", " or “,” inside ch/vs refs–Are there more ", " or “,”? Go with majority.#cd#[:.]([\d, -:.abc]+):::,.
CV-52 ,- REMOVE space in ", " vs. sep.#r#[:.]([\d, \p{Pd}:.abc]+):::,\s#,
CV-53 , - ADD space after “,” in vs. sep.#r#[:.]([\d, :.abc]+):::,(?=\S)#,
———DASHES BETWEEN #'s—————————#f#
BV-61 - list verse bridges#csd#[\S-[(]]+\d[-]\d[\S-[\);]]+
BC-62 - list chapter bridges#csd#\S+\d[\u2013]\d[\S-[\);]]+

x

———MARKERS: ADD MISSING 1 TO LEVEL 1————————#f#
M-01 - identify level markers—such as \q,\q1,q2—in order to see level-1 inconsistencies#cs#(\(q|qm|li|mt|mte|ms|s|imt|is|iq|io))(\b|\d)
M-02 - change \q to \q1 (only if there are \q2)–modify to \mt and \s and rerun as needed#r#(\q\b)#\11
M-03 - Move section head & parallel passage ref from BEFORE to AFTER chapter numbers#r#(\c .?\r\n)(\s .?\r\n)(\r .?\r\n)?#\2\3\1
M-04 - Move section head & parallel passage ref from AFTER to BEFORE chapter numbers#r#(\s .
?\r\n)(\r .?\r\n)?(\c .?\r\n)#\3\1\2

x

———DASHES BETWEEN WORDS—————————#f#
D-01 - analyze dashes between words#cs#[\p{Pd}]+
D-02 - analyze dashes between numbers#cs#(?<=\d)[\p{Pd}]+(?=\d)

x

———QUOTE MARKS—————————#f#
QT-00 all - view common QUOTATION marks#cu#[’<">\p{Pi}\p{Pf}]
QT-01 seq - list quote mark sequences (check for Valid/Invalid white space)#csd#[`’<">\p{Pi}\p{Pf}]+\s*[’<">\p{Pi}\p{Pf}]*
QT-02 mid - quotes used mid-word#csd#(?<=[\p{L}\p{M}])’<">\p{Pi}\p{Pf}
———LEGACY ENCODING—————————#f#
QT-10 <<<? - <<< must change these manually to either << < or < << BEFORE converting to curly quotes#f#<<<|>>>
QT-11 << to “#r#<<#“
QT-12 >> to ”#r#>>#”
QT-13 < to ‘#r#<#‘
QT-14 > to ’#r#>#’
——FIX DOUBLE QUOTES ENTERED AS “—————————#f#
QT-20 " - view open double inch mark#cs#(.)”([^\\s])|" (’)
QT-21 “->“ - fix open double inch mark#r#(.)”([^\\s])|" (’)#\1“\2
QT-30 " - view close double inch mark#cs#(.)"
QT-31 “->” - fix close double inch mark#r#(.)”#\1”
——FIX GLOTTALS ENTERED AS APOSTROPHES ꞌ—————————#f#
QT-40 a’a - find midword apostrophe/curly close#cs#(\w)’’
QT-41 ‘->ꞌ - chg midword apostrophe/curly close to curly close#r#(\w)’’#\1\uA78C\2
——FIX QUOTES ENTERED AS ‘—————————#f#
QT-50 ‘’=’ - apostrophes that are quote marks#cs# ‘([^’]?)’(\W)
QT-51 ‘’->’ - change apostrophes behaving like single quotes to single quotes#r# ‘([^’]
?)’(\W)# ‘\1’\2
QT-52 ‘->‘ - convert word initial straight apostrophe ’ to open curly quote ‘#r#(\s)’(\p{L}*)#\1\u2018\2
——SPACES IN BETWEEN—————————#f#
QT-60 “ ‘ - add space between “‘#r#“‘#“ ‘
QT-61 “ ‘ - add space between ’”#r#’”#’ ”
QT-62 “ ‘ - add space between ‘“#r#‘“#‘ “
QT-63 “ ‘ - add space between ”’#r#”’#” ’
———APOSTROPHES—————————#f#
AP-70 wd’wd - display mid-word straight apostrophe ’ words#csd#(\p{L}+)’(\p{L}+)
AP-71 ‘->ʼ - convert mid-word straight apostrophe ’ to curly apostrophe ʼ \u02bc#r#(\p{L}+)’(\p{L}+)#\1\u02bc\2
AP-71 wd’->ʼ - convert word ending straight apostrophe ’ to curly apostrophe ʼ \u02bc#r#(\p{L}+)’(\W)#\1\u02bc\2

x

———\f FOOTNOTE————————— BEFORE MAKING CHANGES MARK POINT IN PROJECT HISTORY IN PARATEXT FIRST —————————#f#
.have you discovered and changed footnotes (\f) that are really cross refs to \x markup?#f#
.if not, under ———\x XREF——— below, run ——IS FOOTNOTE AN XREF?—— steps first.#f#

x

  ——CALLER ID—————————#f#

A \f + examine \f caller ids (prefer +)#csd#\f [^\ ]+
B add missing space after fn caller#r#(\f \S+)(\\w+)#\1 \2
C make \f caller + (auto generated) for all#r#(?<=\f )[^\+ ]+ ?#+
D1 \fr find original references missing \fr #f#(?<=\f \S )([\d:.,\p{Pd}a-d]+)
D2 \fr add missing \fr #r#(?<=\f \S )([\d:.,\p{Pd}a-d]+)#\fr \1
D3 no \fr find missing \fr and missing origin ref#f#(?<=\f \S )\f[^r]
D4 no \fr 9.9 add missing \fr with origin reference cv-sep . and ending :#r#(?s)(\c )(\d+)(.?)(\v )(\S+)([^\r]\f \S )(\f[^r])#\1\2\3\4\5\6\fr \2.\5: \7

x

x

  ——REMOVE UNNEEDED FOOTNOTE MARKERS AND SPACES—————————#f#

E \f?* remove unneeded embedded close markers followed by open embedded marker#r#\f[a-uw-z]*(\f.)(?!*)#\1

E \f?* remove unnecessary footnote closing markers (keep \f*)#r#\f[\w-[iv]]+*#

F \f?…\f? remove repeated \f? (duplicate with text in between)#r#(\f\w )([^\])\1(([^\])\1)?#\1\2\4
G sp\f* remove space from end of \f* closing marker#r# (\f*)#\1
H sp\f remove space before a footnote#r#(?s)(?<!\v \S+)\s+(\f\s)#\1
L \fk…\ft -> …\fk* replace closing fnote key markup with closing " \fk*"#r#(?<=\fk [^\])(\S) ?(\\S+)( \ft)?#\1\fk

x

  ——ORIGIN REF \fr—————————#f#

I \fr 9.9\ add missing ending space to end of \fr#r#(\fr \S[^\ ]\S)(\)#\1 \2
J1 9.9? examine \fr ch:vs syntax (just stuff before the following #csd#\fr \d+\D[^ \]
+
J2 CV : make \fr ch/vs separator : (colon)#rd#(\fr \d+)[^:\d]([^ :\]+)#\1:\2
J3 CV . make \fr ch/vs separator . (period/full stop)#rd#(\fr \d+)[^.\d]([^ \]+)#\1.\2
K \f x \f*… examine footnote marker patterns “\f + \fr … \ft … \f*”#csn#\f .*?\f*

x

———SCRIPTURE REF PREPARATION—————————#f#
———BOOK NAMES——————possible short name and abbreviations for Scripture Reference Settings… in Paratext 7#f#
.Since \TOC2 most often matches \h, you are looking for abbrev. to use in \TOC3.#f#
.Extract pos

Correction: Program Files (x86)\Paratext 7\ParatextRegExPal\userMenuStd.txt .

I wonder where I got this userMenu.txt – maybe it’s of use to some readers, since it seems to cover different ground to anon467281’s one.

It starts after the line of equal signs at the bottom of this reply.

================================================
1-Create TOC1 for 4 part book titles#r#\mt(\d* )(.?)\r\n\mt(\d )(.?)\r\n\mt(\d )(.?)\r\n\mt(\d )(.?)\r\n#\toc1 \2 \4 \6 \8\r\n\a\1 \2\r\n\a\3 \4\r\n\a\5 \6\r\n\a\7 \8\r\n
2-Create TOC1 for 3 part book titles#r#\mt(\d
)(.?)\r\n\mt(\d )(.?)\r\n\mt(\d )(.?)\r\n#\toc1 \2 \4 \6\r\n\a\1 \2\r\n\a\3 \4\r\n\a\5 \6\r\n
3-Create TOC1 for 2 part book titles#r#\mt(\d
)(.?)\r\n\mt(\d )(.?)\r\n#\toc1 \2 \4\r\n\a\1 \2\r\n\a\3 \4\r\n
4-Create TOC1 for 1 part book titles#r#\mt(\d
)(.?)\r\n#\toc1 \2\r\n\a\1 \2\r\n
5-Create TOC1 - restore \mt’s from temporary \a’s#r#\a(?<=\d
)#\mt
7-cleanup TOC1 extra spaces#r#( +)#
8-Create TOC2 and TOC3#r#(\h )(.?)(\r\n)(\toc1.?\r\n)#\1\2\3\4\toc2 \2\3\toc3 \3
9-swap \toc1 & \toc2 contents & add empty \toc3#r#(\toc1 )(.?\r\n)(\toc2 )(.?\r\n)#\1\4\3\2\toc3 \r\n
10-create \toc2 from \h & an empty \toc3#r#(?s)(\h )(.?)(\r\n)(\.?)(?=\mt)#\1\2\3\4\toc2 \2\3\toc3 \3
99-remove TOC’s#r#\toc\d.?\r\n#
————————————#f##
11-extract \r booknames (to be \toc2)#cu#(?<=\r ((
|.?; ))[123][\p{L} ]{2,99}
12-extract \f, \ft book names (to be \toc3)#csu#(?<=\f + )(\d |\p{L})[^\;\d\s]|(?<=\f + (\w |\p{L})[^\;\d\s][^\;]; )(\w )\w{3,99}|(?<=\ft [^\]{1,40}; )(\d )\p{L}{3,99}|(?<=\ft )(\d )\p{L}{3,99}(?= \d)
Find lines that do not start with backslash code#f#\r\n[^\]
Find close codes preceded by a space# \\w+* ?
Find long poetic lines#f#\q(\s+\v )?[^\r]{70,}
Find non-word characters before footnote callers#f#[^\w]\f\s
Find missing capitals after period#f#.\s+["¿¡’?][a-z]
Count SFM clusters#c#\\S+
Count all cap words#cr#\b[A-Z][A-Z]+\b
Count footnote marker patterns#cni#\f .
\f*
Count cross reference marker patterns#cni#\x .\x*
Count book reference abbreviations#c#[A-Z]\w\w?.(?=\s
\d)
Count verse number patterns#ci#\v\s+\S+
Count chapter/verse patterns#crd#\d[-\d.:;, ]+
Extract and sort all lines#es#\.*
Extract outlines#e#(\id …|\io.)
Replace missing space after \v#r#\v(\d)#\v \1
Reformat paragraphs#r#\r\n(?!\)#
Extract (…)#e#([^)]+)|<[^>]+>
Extract parallel refs#ei#\r .

Extract cross references#e#\x .\x*
Extract all footnotes#e#\f .
\f*
Change verse bridge , to -#r#(\v \d+),\s|,#\1-\2
Remove italics in intros#r#\ip \it (.)\it*#\ip \1
Convert hyphen to n-dash in chapter range#r#(.\d+)-(\d+.)#\1–\2
Add ID info#r#(\id …).
\r\n#\1 - ??? NT [???] -Papua New Guinea 19?? (web version -2013 bd) \r\n
Add DBL ID info#r#(\id …).\r\n#\1 - ??? NT -country 19?? (DBL -2013)\r\n
Add tocs from \h \mt1#r#\h (.
)\r\n\mt1 (.)\r\n#\h \1\r\n\toc1 \2\r\n\toc2 \1\r\n\toc3 \r\n\mt1 \2\r\n
Add tocs from \h \mt1 \mt2#r#\h (.
)\r\n\mt1 (.)\r\n\mt2 (.)\r\n#\h \1\r\n\toc1 \2 \3\r\n\toc2 \1\r\n\toc3 \r\n\mt1 \2\r\n\mt2 \3\r\n
Add tocs from \h \mt2 \mt1#r#\h (.)\r\n\mt2 (.)\r\n\mt1 (.)\r\n#\h \1\r\n\toc1 \2 \3\r\n\toc2 \1\r\n\toc3 \r\n\mt2 \2\r\n\mt1 \3\r\n
Add tocs from \h \mt2 \mt1 \mt2#r#\h (.
)\r\n\mt2 (.)\r\n\mt1 (.)\r\n\mt2 (.)\r\n#\h \1\r\n\toc1 \2 \3 \4\r\n\toc2 \1\r\n\toc3 \r\n\mt2 \2\r\n\mt1 \3\r\n\mt2 \4\r\n
Add tocs from \h \mt1 \mt2 \mt1#r#\h (.
)\r\n\mt1 (.)\r\n\mt2 (.)\r\n\mt1 (.)\r\n#\h \1\r\n\toc1 \2 \3 \4\r\n\toc2 \1\r\n\toc3 \r\n\mt1 \2\r\n\mt2 \3\r\n\mt1 \4\r\n
Add tocs from \h \toc1 \mt1#r#\h (.
)\r\n\toc1 (.)\r\n\mt1 (.)\r\n#\h \1\r\n\toc1 \3\r\n\toc2 \2\r\n\toc3 \r\n\mt1 \3\r\n
Add tocs from \h \toc1 \toc2 \mt1#r#\h (.)\r\n\toc1 (.)\r\n\toc2 (.)\r\n\mt1 (.)\r\n#\h \1\r\n\toc1 \4\r\n\toc2 \2\r\n\toc3 \3\r\n\mt1 \4\r\n
Add tocs from \h \toc1 \toc2 \mt1 \mt2#r#\h (.)\r\n\toc1 (.)\r\n\toc2 (.)\r\n\mt1 (.)\r\n\mt2 (.)\r\n#\h \1\r\n\toc1 \4 \5\r\n\toc2 \2\r\n\toc3 \3\r\n\mt1 \4\r\n\mt2 \5\r\n
Change Mdash to Ndash#r#(\d)—(\d)#\1–\2
Change \qr > \rq…\rq
#r#\r\n\qr (.)\r\n# \rq \1\rq\r\n
Extract \r booknames#cu#(?<=\r ((|.?; ))[123][\p{L} ]{2,99}
Convert \ft to \x#r#(\f )(+ )(\fr )(\S+ )(\ft )(?!Kiñeke |Tati )([^\]
\d)( LXX)?.?\f*#\x * \xo \4\xt \6\7\8.\x*
Convert \f to \x where book abbrevs are less than 4 letters long#r#(\f )(+ )(\fr )(\S+ )(\ft )(?!([a-z]|[A-Z]){4})([^\]{1,50}\d).?\f*#\x * \xo \4\xt \6\7\8\x*
Convert \f to \x where the lines are less than 50 chars#r#(\f )(+ )(\fr )(\S+ )(\ft )([^\]{1,50}\d).?\f*#\x * \xo \4\xt \6\7\8.\x*
——— most common DBL checks——————#f##
* \mt --> \mt1 #r#(\mt\b)#\11
* \q --> \q1 #r#(\q\b)#\11
* Find repeated \s(9) and \r (repeated marker used for a line break?)#cd#\(s\d?|r) .+\r\n\\1
* Replace repeated \s(9) or \r marker with a space#r#(\[rs]\d? [^\r]?)\s\r\n\s*(?=[^\])#\1
* Find line break in text following a \s or \r#ei#\[rs]\d? [^\r]?(?=\r\n[^\$])
* Remove hard line break in text of \s or \r#r#(\[rs]\d? [^\r]
?)\s*\r\n(?=[^\])#
———White Space———#f##
* CNT _ \f*_ SPACE before closing note marker#c# +\[fx]+*
DELE \f* #r# +(?=\[fx]+*)#
* CNT _ \r_ SPACE before linebreak#c# +(?=\r)
DELE \r #r# +(?=\r)#
* CNT _ _ LINE INITIAL SPACE before any SFM#c#(?<=\r\n) +
DELE __ #r#(?<=\r\n) +#
* CNT “\r\nA” HARD LINE BREAKS in marker text#c# ?\r\n (?=[^\])
REPL " A" #r#\s
\r\n(?=[^\])#
* CNT ~ PARATEXT NOBREAK SPACE)#c#~
REPL space #r#~#
* CNT // PARATEXT SOFT RETURN#c#\s*//\s*
REPL space #r#\s*//\s*#
* CNT \u00A0 & ~ NOBREAK SPACES#c#[\u00a0~]
REPL space #r#[\u00a0~]#
———\f FOOTNOTE—————————#f##
* Examine footnote marker patterns [\f + \fr … \ft … \f*]#csn#\f .?\f*
* Extract all footnotes#e#\f .
?\f*
* Examine \f callers (prefer +)#c#\f \S+
* Make fn caller + (when it is something else)#r#(?<=\f )[^+]\S+ #+
* Find \f callers with missing space before next #c#\f \S+\\w+
* Examine \fr ch:vs patterns#cd#\fr [^\]*
* Find \fr with missing space before next #c#\fr \S+\
* Add missing \fr when reference already exists#r#(?<=\f + )([\d:.,-a-z]+)#\fr \1
* Examine how fn ends (with or without a “.”)#csd#.\f*
* Add the missing . when fn ends with mostly “.\f*” #r#([^.])(?=\f*)#\1.
* Remove space at end of footnote " \f*"–> “\f*#r# (\f*)#\1
* Count footnotes that are NOT after a word#c#[^\p{L}\p{M}]\f\s
———\r PARALLEL PASSAGE—————————#f##
* Count standalone \r’s versus \s \r sequences\r#ei#(?<!\s\d? .\r\n)\r .(?=\r\n)
* Remove hard line break from \r#r#(?<=\r .?)\r\n([^\r])# \1
———\x XREF—————————#f##
* Extract all cross references#e#\x .
?\x*
* Examine cross reference marker patterns [\x + \xo … \xt … \x*]#csn#\x .?\x*
* Examine \x callers#c#\x \S+
Make xref caller + (when it is something else)#r#(?<=\x )[^+]\S
#+
Make xref caller - (when it is something else)#r#(?<=\x )[^-]\S* #-
* Find \x callers with missing space before next #c#\x \S+\\w+
* Examine \xo ch:vs patterns#cd#\xo [^\]*
* Examine how \xt’s end (with or without a “.”)#csd#(?<=\xt [^\]).\x*
Remove space at end of xref " \x
”–> “\x*#r# (\x*)#\1
———DASHES BETWEEN #‘s VS WORDS—————————#f##
* Find dashes#cu#[–-—\u2011]+
* Find dashes between numbers#cu#(?<=\d)[–-—\u2011]+(?=\d)
———QUOTES MARKS—————————#f##
common QUOTATION marks#cu#[’<”>\p{Pi}\p{Pf}]
list quote mark sequences (check for Valid/Invalid white space)#csd#[’<">\p{Pi}\p{Pf}]+\s*[’<">\p{Pi}\p{Pf}]+
quotes used mid-word#csd#(?<=[\p{L}\p{M}])’<">\p{Pi}\p{Pf}
<<< must change these to either << < or < << BEFORE converting to curly quotes#f#<<<
<< to “#r#<<#“
>> to ”#r#>>#”
< to ‘#r#<#‘
> to ’#r#>#’
add space between “‘#r#“‘#“ ‘
add space between ‘“#r#‘“#‘ “
add space between ”’#r#”’#” ’
add space between ’”#r#’”#’ ”
———INTROS \mt, \toc and OUTLINES—————————#f##
review outlines#e#(\id …|\io.)
Show outlines that don’t start with a \iot (maybe a \is)#ei#\r\n\[^i][^o][^t].
?\r\n\io1.*
extract refs missing \ior at end of outline#ei#\io\d? .:::(\S+\d(?=\r))
find \ior type references in outlines#csd# \io\d? .
\r:::[(]?(\ior )?(\d[\d.:-\u2013\u2014,abc]+)(\ior*)?[)]?
find \ior type references in outline without closing )#csd#\io\d .\r\n:::(\d+[;.]\d+[\d.:abc-\u2013\u2014]+)\r\n
add MISSING \ior markup around references in outlines#r#\io\d .
:::(\S+\d+)(?=\r)#\ior \1\ior*
add MISSING ( ) to outline references#r#(?<=\io\d ).\r\n:::[(]?((\d+[;.]\d+[\d.:abc-\u2013\u2014]+)|\d+(-\d+)?)\r\n#(\1)\r\n
———ETEN (not found through ParaTExt checks————————#f##
unmarked text following a chapter (avoids schema

Can you tell us the codes for specifying RegEx Pal’s modus operandi in this file?

Clearly:

  • #f# → Find
  • #r# → Replace
  • #c# → Count

But what are these, for example?:

  • #ci#
  • #ei#

Many of the things I tried to add to the menu ended up with one of the above two codes.

The first character determines the operation to be done (as you have figured out):

  • ‘f’ = Find
  • ‘r’ = Replace
  • ‘c’ = Count
  • ‘e’ = Extract

There can also be a set of one or more characters that define options that affect how a count or an extract will be done:

  • ‘s’ = Sort
  • ‘u’ = Unique
  • ‘d’ = Combine digits
  • ‘n’ = Combine non-marker text
  • ‘i’ = Include references

So for your two examples, “#ci#” = Count while including references, and “#ei#” = Extract while including references.
You could also have “#esdi#” which would be a sorted extraction that combines digits and includes references.
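For anyone who wants to search a long userMenu.txt outside RegEx Pal, here is a minimal Python sketch that splits an entry into its parts. The layout `label#codes#find` (plus `#replace` for replace entries) is inferred from the entries posted in this thread, not from any official specification, and the names `MenuEntry` and `parse_entry` are made up for illustration.

```python
import re
from typing import List, NamedTuple, Optional

# Code letters as described in the reply above.
OPERATIONS = {"f": "Find", "r": "Replace", "c": "Count", "e": "Extract"}
OPTIONS = {
    "s": "Sort",
    "u": "Unique",
    "d": "Combine digits",
    "n": "Combine non-marker text",
    "i": "Include references",
}

# Assumed layout: label, then #codes#, then the rest of the line.
# The label is matched lazily because labels may themselves contain "#".
ENTRY = re.compile(r"^(?P<label>.*?)#(?P<codes>[a-z]+)#(?P<rest>.*)$")


class MenuEntry(NamedTuple):
    label: str
    operation: str
    options: List[str]
    find: str
    replace: Optional[str]


def parse_entry(line: str) -> Optional[MenuEntry]:
    """Return the parsed entry, or None if the line has no recognisable codes."""
    m = ENTRY.match(line.rstrip("\r\n"))
    if m is None:
        return None
    codes = m["codes"]
    if codes[0] not in OPERATIONS or any(c not in OPTIONS for c in codes[1:]):
        return None
    find, replace = m["rest"], None
    # Assumption: in a replace entry the last "#" separates find from replace.
    if codes[0] == "r" and "#" in find:
        find, _, replace = find.rpartition("#")
    return MenuEntry(
        label=m["label"],
        operation=OPERATIONS[codes[0]],
        options=[OPTIONS[c] for c in codes[1:]],
        find=find,
        replace=replace,
    )


entry = parse_entry(r"Count verse number patterns#ci#\v\s+\S+")
print(entry.operation, entry.options, entry.find)
```

Reading the file line by line and keeping the entries whose label contains a search term would then give a searchable list. The sketch only reads entries; it does not attempt to write the file back in the strict format that the best answer warns about.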

0 votes

NOTE: Some things that work in one tool don’t work in another tool. Some regex features that work in RegEx Pal do not work in Paratext searches.

I’m not sure about expressions getting “munged”.

\w Matches any word character. Equivalent to the Unicode character categories [\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}\p{Pc}]. If ECMAScript-compliant behavior is specified with the ECMAScript option, \w is equivalent to [a-zA-Z_0-9].

\p{name} Matches any character in the named character class specified by {name}. Supported names are Unicode groups and block ranges. For example, Ll, Nd, Z, IsGreek, IsBoxDrawing.

So for Cyrillic you could search for \p{Cyrillic}

The following website gives a list of the groups and block ranges: http://www.regular-expressions.info/unicode.html
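To see what “word character” means in practice, here is a small sketch using Python's standard `re` module. This is an illustration of the concept only: the standard `re` module does not support `\p{...}` classes, so the Cyrillic class below is written as the code point range of the basic Cyrillic Unicode block (U+0400 to U+04FF), which is an assumption about what you would want to match. RegEx Pal itself may behave differently.

```python
import re

sample = "Слово word 123 мир_2"

# For str patterns in Python 3, \w is Unicode-aware by default:
# letters from any script, digits and the underscore all count.
all_words = re.findall(r"\w+", sample)

# One script only: the basic Cyrillic block as a code point range.
CYRILLIC = r"[\u0400-\u04FF]"
cyrillic_runs = re.findall(CYRILLIC + "+", sample)

# With re.ASCII, \w shrinks to [a-zA-Z0-9_], much like the
# ECMAScript behaviour described above.
ascii_words = re.findall(r"\w+", sample, flags=re.ASCII)

print(all_words)      # ['Слово', 'word', '123', 'мир_2']
print(cyrillic_runs)  # ['Слово', 'мир']
print(ascii_words)    # ['word', '123', '_2']
```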

by (8.4k points)

Regex Pal was developed using Python, so it uses the Python regular
expression engine.

Other places in Paratext that use regular expressions rely on the .Net
regular expression engine.

That’s why there is a difference between how expressions work in different
places.

John+Wickberg
Paratext Support

Brian Renes asked me why things like Unicode patterns would work in Regex Pal if the Python regex engine was being used.

I found that while Regex Pal is Python code, it is using IronPython - a .Net implementation of Python - and IronPython converts regular expressions into .Net regular expressions.

So, it seems that Regex Pal is using .Net regular expressions. I did double check the IronPython source to see that this was being done.

I haven’t found a clear guide of what changes are made to the regular expressions when they are processed in IronPython.

Sorry to have been misleading in the first answer.

John+Wickberg
Paratext Support

The website you reference says:

But neither PT nor RegEx Pal seem to work without the “Is” prefix on the group name.

anon848905 wrote: “The following website gives a list of the groups and block ranges: http://www.regular-expressions.info/unicode.html”

The bit that gives you that list is quite a way down the page at http://www.regular-expressions.info/unicode.html#script (far enough down that I thought I was on an introductory page, and went on a wild goose chase elsewhere in the site, looking for the list – maybe I was just tired.) The sections immediately above and below this section are also worth reading.

That site has very good explanations of the various regex codes. But for the list of scripts, I found a site with more – with the character ranges also listed: https://msdn.microsoft.com/en-us/library/20bw873z(v=vs.110).aspx .

BUT … what about restricting it to the alphabet for a particular language? Can I define my own named class for our language somehow? Or would I need some third party regex app like RegexBuddy to do that?

Have you thought of using ranges for character classes? E.g. \b[a-df-z]{3}\b (find all three letter words without “e”).
I tried it in RegExPal with Greek, which gets a little bit messy because of the diacritics: Then you get something like \b[ά-ώἀ-ῷ]{5}\b (unfortunately this expression includes some upper case letters).
The ranges work according to the Unicode code points, so you may need an overview of the order of the desired characters in the Unicode tables. Maybe something like [a-z] just works for you; if you have letters with diacritics which are not composed (as for polytonic Greek) it gets more challenging.
For navigating Unicode code points I use BabelMap.

Pretty much every language has a different alphabet: English has 26 letters, and they all fall in two contiguous blocks in most character sets (EBCDIC excluded). Polish uses K, Q & X only for loan words, but has 9 extra letters, so if you’re looking for pure Polish words, you have 8 blocks from the Latin 26-letter alphabet (lower- and upper-case), and 18 individual letters, so your set becomes:

[a-jl-pr-wyząćęłńóśżźA-JL-PR-WYZĄĆĘŁŃÓŚŻŹ]

Now imagine that you write a regex where that set is used many times – it will become very unwieldy.

So, when I’m talking about sets for a particular language, I’m talking about tight ones, ones that exclude characters that are not used. For our (Cyrillic-alphabet) language, the set would look similar to the above, and I could write a different one for every language that uses Cyrillic.

I presume that you can add both Find and Find-and-replace items to the User menu. Just now tried to add:

\[typed in Find box\]

… and got the error “Invalid userMenu.txt file entry” when I tried to use it.

If I try to add a Find-and-replace item and then use it, Pal takes me back to the Find function and enters a completely different string in the box to the one I saved.

0 votes

Why does ^ behave normally in RegEx Pal, but three of them together don’t? The three seem to capitalise the first letter following them:

image

by (1.4k points)

No-one’s answered this one – is it a mystery?

0 votes

It is a mystery - looks like it only handles the first character. I can’t find any reference to the use of ^^^ in documentation on the web.

by (8.4k points)

wdavidhj

Just curious as to what are you trying to do?

A while ago Nathan made a post on how to capitalize a letter in the
replacement in RegExPal. Using three ^^^ was the answer. I tried to find
his old email but cannot. So sorry.

D anon467281

Global Publishing Services
Scripture Typesetting trainer & Regular Expression "specialist"
Dallas, TX

0 votes

Are any of you able to come up with a RegEx Pal syntax for changing all \it and \it* markers in the text to \add and \add* while ignoring \it and \it* markers in the footnotes?

by (346 points)

While we wait for the gurus, perhaps an easy way to do that would be in Paratext directly rather than RegexPal. There is a tick box in the Paratext Find menu which restricts searches to the text. Click on the More box and then tick “Match only in Verse Text”. Then just do a replace for \it > \add (no space or * will do it one pass and it should be safe since it has )

I would test it past a known footnote \it location, though, I’ve never actually used that tick and so can’t quite attest to its exact behavior.

Blessings,

Shegnada

Language Technology and Publishing Coordinator, SIL Nigeria

Text Processing Specialist – Complex Script, GPS, SIL Intl


Unfortunately that trick does not work for replacing markers in the text, just the text itself.

If you don’t have \add markers in the footnotes then you could do something like this:

  1. Change all \it > \add with a find and replace.

  2. In RegExPal

Find:

\f .*?\f*:::\add

Replace:

\it

This would look for \add in footnotes and change it back to \it. If you had \add in your footnotes then it gets more complicated.

0 votes

Sorry - I always forget that the system strips markers. Should be:
Find:
\\f .?\\f\:::\\add
Replace:
\\it

by (8.4k points)

I cannot see that \\f .?\\f\:::\\it will find anything in footnotes using the Paratext RegEx Pal before making any changes. Is this syntax really correct?

0 votes

Sorry again. The * after the period got clobbered.

The format should be:

\\f .*?\\f\:::\\add

So if you want to find the \it you could search for:

\\f .*?\\f\:::\\it
by (8.4k points)
0 votes

I give up! I notice that the * before the colon got clobbered too.

\\f .*?\\f\*:::\\add

I guess the important part to notice is that you can use ::: to separate what is being searched for (on the right) from a context (on the left).
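The idea behind `context:::target` can be sketched in plain Python as a two-step substitution: find each context match, then replace the target only inside it. This is an emulation of the behaviour as described in this thread, not RegEx Pal's actual implementation, and `replace_in_context` is a hypothetical helper name.

```python
import re


def replace_in_context(text, context, target, replacement):
    """Replace `target` with the literal `replacement`, but only
    inside stretches of `text` that match `context`."""
    target_re = re.compile(target)

    def fix_one(match):
        # A function is used as the replacement so that backslashes in
        # `replacement` (such as \it) are not treated as escapes.
        return target_re.sub(lambda _m: replacement, match.group(0))

    return re.sub(context, fix_one, text)


usfm = r"\v 1 In \add the\add* beginning \f + \fr 1.1 \ft Or \add when\add* \f* God"

# Context: a whole footnote. Target: \add. Replacement: \it.
fixed = replace_in_context(usfm, r"\\f .*?\\f\*", r"\\add", r"\it")
print(fixed)
```

In the result, the `\add ...\add*` pair in the verse text is left alone, while the pair inside the footnote becomes `\it ...\it*`.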

by (8.4k points)

I was converting it from a very small print 2-page landscape document to 4-page portrait. Should be a little easier to read. Not sure if it is valid to call a 4-page document a cheat sheet, but it lives on as such.

I have attached the Word document & PDF of the Regular Expression Cheat sheet.

Great! Thank you! It would be really great to have all this info available under Help in the RegEx Pal tool. I wonder if that would be a possibility.

Here is another attempt to attach the docs to the email to the PT Support Site.

Somehow the attachment did not make it on the copy of the email to [PT Support Site]. So I have created a google doc & pdf at the following links:

PDF: https://drive.google.com/open?id=1bXP0j6Mv_UrD678QGXCdlp6XzdjqUUhX

DOC:

Everyone with an SIL/Wycliffe email should be able to access these links. If you do not have an SIL/Wycliffe email let me know if clicking on the link works or fails.

In the “Share” settings for the file, click “Advanced” and
change the permission type to “On - Anyone with the link”

0 votes

This site only allows files with the following extensions: jpg, jpeg, png, gif, zip

I guess when sending a response via e-mail you don’t get a message about that restriction.

by [Expert]
(16.2k points)
0 votes

Here is some RegEx code in userMenu.txt format you may find useful:
———Contexts—————————#r##
In footnotes#f#(?<=\\f\s).*?(?=\\f\*):::
Not in footnotes#f#(?<=\A|\\f\*)(?s).*?(?=\\f\s|\Z):::
In text (ignores markers and headings) #f#(?<=((\\\+?(i[^de]\w*|[smlpq]\w*|r|d|nb|cl|s|tr|tc\w+|v\s+\S+|f[^r]\w*|x\w+|add|bk|tl|sc|nd)[\*\s])|(f|x|fe|ef|c|fr)\s+\S\s*(?=\\)|\\[fx]\*))[^\\]*?(?=\\|\Z):::
In Ref fields#f#(?<=(\\\+?(xt|ior|fig|rq|zpa-xb)\s|\\(r|mr|sr|ipr)\s))[^\\]*?(?=(\\\+?(xt|ior|fig|rq|zpa-xb)\*|\s\\)):::
Not in Ref fields#f#(?<=\A|\\\+?(xt|ior|fig|rq)\*|\\(?!(fr|cl|r|mr|sr|ipr|toc\d|(\+?(xt|ior|fig|rq)))\s))(?s).*?(?=(\\\+?(xt|ior|fig)\s|(\\(fr|r|mr|sr|rq|ipr|toc\d)\s[^\\]*?(?=\\)))|\Z):::

by (1.8k points)
