0 votes

Can someone answer definitively:

  • What flavour of regex is used in RegEx Pal?

“Scope” in my title is to encourage people to comment on what is missing from the scope of regexes used in this context. See next post for an example.

Paratext by (1.4k points)

8 Answers

0 votes
Best answer

Today I was trying to use these codes:

  • \L  Causes all subsequent characters to be output in lower case, until a \E is found.
  • \U  Causes all subsequent characters to be output in upper case, until a \E is found.
  • \E  Terminates a \L or \U sequence.

… but they don’t seem to work. Are they unsupported in RegEx Pal?

If unsupported, how can I change the case of captured groups in my replacement string?

by (1.4k points)

No, these codes are not supported in RegExPal.

You can convert something to uppercase with the following:
^^^\1

But the last thing I saw written was that there is no way in RegExPal to convert to lowercase.

They are not supported in RegExPal.

D anon467281

Global Publishing Services
Scripture Typesetting trainer & Regular Expression "specialist"
Dallas, TX

Sad, that, since our translators erroneously put capitalised suffixes on the capitalised word “LORD”.

Is it possible in the future that PT use a flavour of RegEx that is more comprehensive than the one currently used? It seems a pity to have metacharacters like this unavailable.

It is possible, but very unlikely.

Neither .Net nor IronPython (which provide Paratext’s regular expressions) supports \L, \U, or \E. It’s unlikely we’re going to switch our base technology to one that does (like Perl).

So is it not a free choice which implementation you use? You’re implying that it’s not straightforward, so I guess it’s not just a case of switching out one library for another, no?

So, is there another way to do what I want? For example, can I say: if you find a letter in [АБВГҒ…], then replace it with [абвгғ…] ?

It is switching out the library for another library. However, this is, by no means, a simple thing to do. No two libraries have the exact same API and they will each have their own bugs/idiosyncrasies that will need to be worked around. Our current .Net implementation is used because, well, it’s built-in to .Net and doesn’t require anything new. We could also use Python (which Paratext uses for other things), but it also does not support what is wanted.

Theoretically, we could use a Regex-only library built for .Net that does this type of replacement, but I haven’t been able to find one (probably because .Net has Regex functionality built-in).
Using Perl would also theoretically work, but there is no good library for .Net to interact with Perl scripts (that I could find in my quick search).

So, at least for now, there is no good way to change Paratext to work the way that’s wanted unless we added in our own parsing into the mix - which would be a lot of work for not much gain.

Unfortunately, I think what you’re left with is creating a regular expression that finds the errors and then having to fix them manually.

Can you not simply interact with the .sfm files outside of Paratext, in an editor that handles the kind of Regex you need? You lose the ability to save a history of that exact find/replace, but as long as you save a point in history before closing Paratext and working with the text files, it shouldn’t be too dangerous.
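For what it is worth, here is a minimal sketch of that outside-Paratext approach in Python, which handles case changes through a replacement function rather than \L or \U. The word "LORD", the pattern, and the function name `lowercase_suffix` are assumptions for illustration only; substitute the word and suffix letters your project actually uses.

```python
import re

# Assumption: the capitalised word is literally "LORD" and any letters glued
# directly onto it are the suffix. Adjust the pattern to your own language.
SUFFIXED = re.compile(r"\b(LORD)([^\W\d_]+)")

def lowercase_suffix(text: str) -> str:
    """Keep the captured word as-is and lower-case the letters attached to it."""
    return SUFFIXED.sub(lambda m: m.group(1) + m.group(2).lower(), text)

print(lowercase_suffix("the LORDҒА said"))  # the LORDға said
```

Run something like this only on a copy of the text, and only after saving a point in the Project History as described above.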

Great suggestion :slight_smile:. Thanks!

Well, you don’t get any automatic History when you use Regex Pal either. So, when using it, I do exactly what you say, and save two points in the history, e.g.:

  1. Before doing replace with RegEx_Pal tool.
  2. With RegEx_Pal tool: updated deprecated markers \s, \q, \ms, \pi.
0 votes

I’ve noticed that RegEx Pal does not accept:

(?:<string>)

… and today, I could not get this to work:

(?-i)

… i.e. turn off case insensitivity.

by (1.4k points)

(?:) is a named group. These do not work in Paratext or RegExPal.

It appears that negating case insensitivity with (?-i) does not work either,
which means you cannot turn it off later in an expression where you have
enabled it with (?i). However, by default matches are case sensitive,
as if (?-i) were set.

Hope this helps.

The reference sites call it “a non-capturing group”. Useful sometimes.

These non-capturing groups work for me in RegExPal (Paratext 7.5 and 7.6). I just tried it with this nonsensical replacement:

Search:(?:day|night)( \w+ )(\w+)
Replace: \2\1\2
(Replaces “day” and “night” with the second word after it.)
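The same search and replace can be checked outside Paratext. This is only a sketch using Python's `re` module, which is not the .Net engine that RegExPal uses, but a simple pattern like this one should behave the same way in both. The sample sentence is made up.

```python
import re

pattern = r"(?:day|night)( \w+ )(\w+)"
sample = "night falls quickly"

# (?:...) groups without capturing, so only two numbered groups exist.
groups = re.search(pattern, sample).groups()
print(groups)  # (' falls ', 'quickly')

# \1 and \2 therefore refer to the two words after "day" or "night".
result = re.sub(pattern, r"\2\1\2", sample)
print(result)  # quickly falls quickly
```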

0 votes

In Find mode, is there a way to stop RegEx Pal updating the results box while you’re editing your regex? It seems to do it for every character you type, resulting in a delay each time.

by (1.4k points)

A trick you might try: when you start typing, immediately run a count.
You will end up with a screen that shows the results of the count
rather than the scripture text. Then go ahead and type away.

One thing to note is that any time you type the OR character, the vertical
bar “|”, the system will almost stop, since the empty alternative that
follows the OR initially matches everything.

Hope this helps.
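To illustrate the point about the vertical bar, here is a small sketch in Python (not the .Net engine behind RegExPal, though an empty alternative behaves the same way in both). The text and pattern are made up.

```python
import re

text = "xyz"

# "cat|" means "cat" OR the empty string, and the empty string
# matches at every position in the text.
matches = re.findall(r"cat|", text)
print(len(matches))  # 4: one empty match at each of positions 0 to 3
```

On a whole project, that is one match per character position, which is why the results box slows to a crawl until you type the second alternative.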

Yes, if it had got really slow, I’d have done something like that. A very helpful suggestion.

Still, are you able to explain why it is programmed to search immediately? This seems to me to render the button labelled First redundant.

Surely it should let you type your expression, and then search when you click the First button, no?

It is very common for regex software to start searching immediately because
that allows the user, especially a learner, to build the expression piece
by piece seeing the immediate result and correcting it as needed. I find
that helpful myself, but do bemoan the inevitable slow down. I appreciate
anon467281’s suggestion which will be useful in the future. It would be nice if
we could limit the search to a specific chapter where we knew the targeted
text existed until we built the expression properly.

If you are doing a lot of regexes, you might find Regex Buddy a nice extra
tool. I often paste the problem text from Paratext or other data into Regex
Buddy and build my expression there. It has an excellent regex
building/teaching tool, very extensive helps, and you can create a
searchable library of regexes. I build my regexes there and paste them into
Regex Pal or Paratext itself once I have tested them. It is a one-time paid
program but not unreasonable given its abilities.

Blessings,

Shegnada J.

Language Technology and Publishing Coordinator, Nigeria

Text Processing Specialist GPS Dallas

To limit the search, use Tool > Choose Books. This will at least keep the search process from looking at the entire project. Note that each time you change the tool, RegExPal reverts to All Books.

Yes, I use that.

But, talking of books, why does RegEx Pal take much longer to search books that don’t contain biblical text (FRT, GLO, XXA, etc.) than it takes for the Bible books, even though the content of my XXA amounts to much less than any Bible book (bar Obadiah or 3 John)? Its progress bar shows it searching from chapter 1 to chapter ≈300 in each of these!

I don’t know how it compares with Regex Buddy (which Shegnada recommends),
but https://regex101.com/ is an online tool that does something similar.
You can paste in your text then build your regex and get immediate
feedback on how your regex is being interpreted.

anon806807

The tools in Paratext use verses to limit the search. For books like the Glossary where there is no verse the tool takes longer to process. I’m sure the developers could give a more technical answer.

Peripheral material takes a long time because RegExPal only searches one chapter at a time. Peripheral books have no defined maximum chapter number, so our only option is to go through all possible chapters looking for text (well, 998 of them anyway). Theoretically, we could search them by book instead, but that would slow down the text views considerably.

This has been my favorite site for creating regular expressions as it also allows you to debug them when they don’t work. You can see step-by-step what it does. :smiley:

I’m a bit confused: the data for one book is in a single .SFM file – isn’t that what RegEx Pal is searching?

I can confirm Shegnada’s testimony. I am not affiliated but a very happy user of RegexBuddy and two other paid products from JGS: a powerful editor and a clipboard-booster (which integrate very nicely).

RB works 100% offline, good for rural projects.

RegexBuddy helps an ordinary linguist to build almost any search, and lets you build a personal library for later use or sharing.

You can pull a copy of your data (say a book from PT) and do very powerful testing within RegexBuddy before you ever paste your regex into PT. This is useful for anything complex which is meant to alter data.

RB comes with excellent documentation and a very friendly and super competent closed user forum.

0 votes

Cross-posting this: it defines the flavour

From PT regex flavour and scope (click link to see the rest of this thread):

by (1.4k points)
0 votes

Cross-posting this, since it’s about RegEx Pal.

From PT regex flavour and scope (click link to see the thread, which has more discussion of marking Replace operations in the Project History):

by (1.4k points)
0 votes

(?s) – i.e. DOTALL mode (the dot (.) matches new line characters (\r\n)) – does not seem to work in PT8. Is this inline modifier not supported?

by (1.4k points)
0 votes

It should be (?-s) to match newline in the Paratext find window. This is not needed in RegExPal since it is the default there.

by (8.4k points)

So is rexegg.com wrong when it says:

This implies that (?-s) would disable DOTALL.

So why does this regex to find blank lines after Hebrew titles in the Psalms not work (I want to delete the blank lines)?:

(\\d.*)\\b

In the raw .SFM text, there is a newline after the \d, and the .* won’t match anything after the \d .

0 votes

Sorry - I mis-read your message. (?-s) is used in Paratext to turn off the match newline (which is on by default). In RegExPal there is an option to have . match newline; alternatively, the (?s) code should work.

Try this for the search:
regex:(?<=\\d.*?)\\b

The (?s) is not needed since it is the default in Paratext find. The ? after the * says “don’t be greedy”.
This will find the \b, and if the replace is blank it will remove the \b.
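As a rough illustration of what (?s) changes, here is a sketch in Python using a made-up fragment of raw SFM text. Python's `re` is not the .Net engine used by Paratext, and it does not accept the variable-length lookbehind shown above, so this uses the simpler pattern from the question.

```python
import re

# Made-up sample of raw SFM text: a \d title, a newline, then a \b marker.
sample = "\\d A psalm of David.\n\\b\n\\q1 Blessed is the one\n"

pattern = r"\\d.*\\b"

# Without (?s) the dot stops at the newline, so the \b on the
# next line is never reached and nothing is found.
without_dotall = re.search(pattern, sample)
print(without_dotall)  # None

# With the inline (?s) flag the dot also matches newlines.
with_dotall = re.search(r"(?s)" + pattern, sample)
print(repr(with_dotall.group(0)))
```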

by (8.4k points)

It works now, but it didn’t seem to before. The most obvious thing would be that I was switched to Cyrillic keyboard when typing (?s), but it can’t be that, because there’s no letter in Cyrillic that can be confused with “s”.

Thanks. I’m going to have to refine this so that it only replaces a blank line that comes immediately after a Hebrew title. But, for that question, I’ll move the conversation to a more appropriate thread, here: Expressions for RegEx Pal .
