Regex help to find issues at the end of a sentence

Question

I received this email from a translator. I could write a basic regex, but I believe I would miss some of the possible sentence-ending punctuation. Can somebody help?

I am hoping to do a final punctuation check that is not part of the standard checks in Paratext. Specifically, I want to make sure that we haven’t left off any periods at the end of verses. (I have caught a few of those as we have been recording.) So I am basically looking for some REGEX code that would identify places where a verse starts with a capital letter but the prior verse does NOT end with a period. (I know this may identify some false positives, like if a verse starts with a proper noun, but it will at least give me a shorter list to check than checking all verses.) If you are able to write some REGEX code to help me, I would greatly appreciate it.

Thank you,

james_post

Paratext Mar 30, 2023 asked by [Moderator]

james_post (2.0k points)

2 Answers

Best answer

First some feedback on Fool Running's regex, with compliments for writing it in ten minutes:

(?<=\p{L}) => This is good, but I propose including the letter in the actual match, so that the user sees where the regex spotted a potential end of sentence. Compliments for catching both upper-case and lower-case letters. So I replace ?<= with ?: in my proposal.
(?<!\\[\w+]) => I believe that, for your intention of avoiding false positives inside a marker, this part has to come to the left of your letter. Also, the + should be outside the square brackets, since many markers have more than one character, like \s1.
Certain markers also include the *, for example closing markers like \em*. I vaguely remember that certain markers can or even must include +. If that is the case, you can easily add it to the character list, as in [\w*+].
\s*(?:\\\w+)?\s* => I replace this one with a part that skips over any number of potential markers and any whitespace between the end of the verse and the start of the next verse.
\\v [\d\w]+ \p{Lu} => Find a verse followed by an upper-case letter. You seem confident that there is a plain space between \v and the verse number and after the verse number, so I kept that syntax.
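
To make the marker-skipping point concrete, here is a small Python sketch that compares only the two skipping fragments. It is an illustration under assumptions, not something Paratext runs: the names SINGLE, MULTI and skips_to_verse are made up for this example, and the \p{L} and lookbehind parts are left out because Python's built-in re module does not support them.

```python
import re

# Original fragment: skips at most one marker, and \w+ cannot consume the '*' in \em*.
SINGLE = re.compile(r"\s*(?:\\\w+)?\s*\\v ")
# Proposed fragment: skips any number of markers, and allows '*' inside a marker.
MULTI = re.compile(r"\s*(?:\\[\w*]+\s+)*\\v ")

def skips_to_verse(pattern, gap):
    """True if the pattern can consume the whole gap, ending at the verse marker."""
    return pattern.fullmatch(gap) is not None

# Each gap is the stretch between the last letter of a verse and the next verse number.
for gap in (r" \v ", r"\em* \v ", r" \rem \any \v "):
    print(repr(gap), skips_to_verse(SINGLE, gap), skips_to_verse(MULTI, gap))
```

Both fragments handle a plain space before the verse marker, but only the proposed one gets past a closing marker such as \em* or a run of several markers.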

I would like to submit a modified regex, not claiming it is the final solution, just hopefully covering more cases:

(?<!\\[\w\*]*)(?:\p{L})\s*(?:\\[\w\*]+\s+)*\\v [\d\w]+ \p{Lu}

Here is the blurb for my regex, as per my tool:

(?<!\\[\w\*]*)(?:\p{L})\s*(?:\\[\w\*]+\s+)*\\v [\d\w]+ \p{Lu}

Options: Case insensitive; Exact spacing; Dot matches line breaks

Assert that it is impossible to match the regex below backwards at this position (negative lookbehind)
- Match the backslash character
- Match a single character present in the list below
  - Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
  - A “word character” (Unicode; any letter or ideograph, digit, letter number, connector punctuation)
  - The literal character “*”
Match the regular expression below
- Match a character from the Unicode category “letter” (any kind of letter from any language)
Match a single character that is a “whitespace character” (any Unicode separator, tab, line feed, carriage return, vertical tab, form feed, next line, zero-width space)
- Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
Match the regular expression below
- Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
- Match the backslash character
- Match a single character present in the list below
  - Between one and unlimited times, as many times as possible, giving back as needed (greedy)
  - A “word character” (Unicode; any letter or ideograph, digit, letter number, connector punctuation)
  - The literal character “*”
- Match a single character that is a “whitespace character” (any Unicode separator, tab, line feed, carriage return, vertical tab, form feed, next line, zero-width space)
  - Between one and unlimited times, as many times as possible, giving back as needed (greedy)
Match the backslash character
Match the character string “v ” literally (case insensitive)
Match a single character present in the list below
- Between one and unlimited times, as many times as possible, giving back as needed (greedy)
- A “digit” (any decimal number in any Unicode script)
- A “word character” (Unicode; any letter or ideograph, digit, letter number, connector punctuation)
Match the character “ ” literally
Match a character from the Unicode category “uppercase letter” (an uppercase letter that has a lowercase variant)

Created with RegexBuddy

And here is some sample text that I used for testing. It is most likely incomplete. At the end of each example, I mark whether it is a catch or not:

\v 8 Some text without \em punctuation\em* \rem \any \v 9 A new verse which is a new sentence, starting with a capital.  [catch]

\v 8 Some text without \em punctuation\em*  \v 9 A new verse which is a new sentence, starting with a capital.  [catch]

\v 8 Some text without punctuation \v 9a A new verse which is a new sentence, starting with a capital.  [catch]

\v 8 Some text with proper \em punctuation.\em* \just \one \more \v 9 A new verse which is a new sentence, starting with a capital.  [punctuation present, so no catch]

\v 8 Some text ending in an all-cap word like \em LORD\em* \v 9 A new verse which is a new sentence, starting with a capital.  [catch]

\v 10 A long sentence \whatever \v 11a  over two verses or more.  [No punctuation needed, so no catch]
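
For anyone who wants to try these samples outside Paratext, the same check can be sketched in plain Python. This is not a port of the regex: Python's built-in re module supports neither \p{L} nor variable-length lookbehind, so the sketch peels trailing markers off in a loop and uses str.isalpha() and str.isupper() instead. The function name missing_period_candidates and the marker character set [\w*+] are assumptions made for this illustration.

```python
import re

# A marker token at the end of the text, such as \em, \em* or \s1.
TRAILING_MARKER = re.compile(r"\\[\w*+]+$")
# A verse start: \v, a verse number such as 9 or 9a, one space, first character.
VERSE_START = re.compile(r"\\v (\w+) (\S)")

def missing_period_candidates(usfm):
    """Return the numbers of verses that start with a capital letter while
    the text before them ends in a letter rather than punctuation."""
    hits = []
    for match in VERSE_START.finditer(usfm):
        number, first_char = match.groups()
        if not first_char.isupper():
            continue
        before = usfm[:match.start()].rstrip()
        # Peel off any markers sitting between the verse text and the next verse.
        while True:
            marker = TRAILING_MARKER.search(before)
            if marker is None:
                break
            before = before[:marker.start()].rstrip()
        if before and before[-1].isalpha():
            hits.append(number)
    return hits

samples = [
    r"\v 8 Some text without \em punctuation\em* \rem \any \v 9 A new verse.",
    r"\v 8 Some text with proper \em punctuation.\em* \just \one \more \v 9 A new verse.",
    r"\v 10 A long sentence \whatever \v 11a  over two verses or more.",
]
for sample in samples:
    print(missing_period_candidates(sample))
```

Run against the six samples above, this sketch flags the same four as the regex: verse 9 in the first, second and fifth samples, and verse 9a in the third. Like the regex, it will still report false positives such as verses that begin with a proper noun.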

So, this was fun. Hope I am not too far off. I can sadly never remember what regex-flavour PT uses. If there are issues with syntax because of flavours, I could easily apply a different one like .NET or Java and re-run my tool.

Again: Not the solution but maybe useful to get a dialog going.

hth

Apr 20, 2023 answered by Tim (855 points)
Apr 20, 2023 reshown

Fool Running · Answer 1 · 2023-04-03T13:19:18+0000

The best I could come up with in 10 minutes was the following:

(?<=\p{L})(?<!\\[\w+])\s*(?:\\\w+)?\s*\\v [\d\w]+ \p{Lu}

I’m not sure how accurate it is (it passed all my testing data, but it certainly wasn’t extensive) so someone can probably do better.

Brief explanation of the expression:

(?<=\p{L}) => Look behind for a letter (i.e. the end of the previous verse)
(?<!\\[\w+]) => Make sure the letter isn’t part of a marker (negative look behind)
\s*(?:\\\w+)?\s* => Skip over any single marker and any whitespace between the end of the verse and the start of the next verse.
\\v [\d\w]+ \p{Lu} => Find a verse followed by an upper-case letter.
