First some feedback, with compliments for writing that in ten minutes:
- (?<=\p{L}) => This is good, but I propose to include “the letter” in the actual match, so that the user will see where the regex spotted a potential “end of sentence”. Compliments for catching both upper-case and lower-case letters. So I replace ?<= with ?: in my proposal.
- (?<!\\[\w+]) => I believe that, for your intention of avoiding false positives on a marker, this part has to be to the left of your “letter”. Also, the + should be outside the square brackets, as many markers have multiple characters, like \s1.
- Certain markers also include the *, for example end-of-something markers like \em*. I vaguely remember that certain markers can or must even include +. If that is the case, you can easily add it to the list of characters, like [\w*+].
- \s*(?:\\\w+)?\s* => This one I replace with “skip over any number of potential markers and any whitespace between the end of the verse and the start of the next verse”.
- \\v [\d\w]+ \p{Lu} => Find a verse marker followed by an upper-case letter. It seems you are confident here about a “classic space” between \v and the verse number, and again after the verse number, so I kept that syntax.
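The point about the + inside versus outside the square brackets can be checked with a few lines of code. This is only a sketch in TypeScript, not the flavour PT uses. The names plusInside, plusOutside and comparison are made up for illustration, and \+em* is an assumed example of a marker that contains a +.

```typescript
// Sample marker names. "\\+em*" is an assumed example of a marker containing "+".
const markers: string[] = ["\\p", "\\s1", "\\em*", "\\+em*"];

// "+" inside the brackets: a backslash followed by exactly ONE character,
// which may be a word character or a literal "+".
const plusInside = new RegExp("^\\\\[\\w+]$");

// "+" outside the brackets, with "*" and "+" added to the class:
// a backslash followed by ONE OR MORE word characters, "*" or "+".
const plusOutside = new RegExp("^\\\\[\\w*+]+$");

// For each marker, record whether each variant covers the whole marker.
const comparison = markers.map((marker) => ({
  marker,
  plusInside: plusInside.test(marker),
  plusOutside: plusOutside.test(marker),
}));

console.log(comparison);
```

With the + inside the brackets, only the one-letter marker \p is covered completely. With the quantifier outside and * and + added to the class, all four sample markers are covered.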
I would like to submit a modified regex, not claiming it is the final solution, just hopefully covering more cases:
(?<!\\[\w\*]*)(?:\p{L})\s*(?:\\[\w\*]+\s+)*\\v [\d\w]+ \p{Lu}
Here is the blurb for my regex, as per my tool:
(?<!\\[\w\*]*)(?:\p{L})\s*(?:\\[\w\*]+\s+)*\\v [\d\w]+ \p{Lu}
Options: Case insensitive; Exact spacing; Dot matches line breaks
- Assert that it is impossible to match the regex below backwards at this position (negative lookbehind)
  - Match the backslash character
  - Match a single character present in the list below
    - Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
    - A “word character” (Unicode; any letter or ideograph, digit, letter number, connector punctuation)
    - The literal character “*”
- Match the regular expression below
  - Match a character from the Unicode category “letter” (any kind of letter from any language)
- Match a single character that is a “whitespace character” (any Unicode separator, tab, line feed, carriage return, vertical tab, form feed, next line, zero-width space)
  - Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
- Match the regular expression below
  - Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
  - Match the backslash character
  - Match a single character present in the list below
    - Between one and unlimited times, as many times as possible, giving back as needed (greedy)
    - A “word character” (Unicode; any letter or ideograph, digit, letter number, connector punctuation)
    - The literal character “*”
  - Match a single character that is a “whitespace character” (any Unicode separator, tab, line feed, carriage return, vertical tab, form feed, next line, zero-width space)
    - Between one and unlimited times, as many times as possible, giving back as needed (greedy)
- Match the backslash character
- Match the character string “v ” literally (case insensitive)
- Match a single character present in the list below
  - Between one and unlimited times, as many times as possible, giving back as needed (greedy)
  - A “digit” (any decimal number in any Unicode script)
  - A “word character” (Unicode; any letter or ideograph, digit, letter number, connector punctuation)
- Match the character “ ” literally
- Match a character from the Unicode category “uppercase letter” (an uppercase letter that has a lowercase variant)
And here is some sample text that I used for testing. It is most likely incomplete. At the end of each example, I mark whether it is a catch or not:
\v 8 Some text without \em punctuation\em* \rem \any \v 9 A new verse which is a new sentence, starting with a capital. [catch]
\v 8 Some text without \em punctuation\em* \v 9 A new verse which is a new sentence, starting with a capital. [catch]
\v 8 Some text without punctuation \v 9a A new verse which is a new sentence, starting with a capital. [catch]
\v 8 Some text with proper \em punctuation.\em* \just \one \more \v 9 A new verse which is a new sentence, starting with a capital. [punctuation present, so no catch]
\v 8 Some text ending in an all-cap word like \em LORD\em* \v 9 A new verse which is a new sentence, starting with a capital. [catch]
\v 10 A long sentence \whatever \v 11a over two verses or more. [No punctuation needed, so no catch]
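The sample lines can also be turned into a small regression check. What follows is only a sketch in TypeScript, not PT's own engine, and it rests on a few assumptions: an ES2018 engine (needed for variable-length lookbehind and \p{...}), no case-insensitive option so that \p{Lu} only accepts capitals, and ASCII-only \w and \d as in this flavour. The names missingPunctuation, samples and isCatch are made up for illustration, and the sample texts are shortened after the first words of the new verse.

```typescript
// The proposed pattern, written as a string, so every backslash is doubled.
// Raw form: (?<!\\[\w\*]*)(?:\p{L})\s*(?:\\[\w\*]+\s+)*\\v [\d\w]+ \p{Lu}
const missingPunctuation = new RegExp(
  "(?<!\\\\[\\w\\*]*)(?:\\p{L})\\s*(?:\\\\[\\w\\*]+\\s+)*\\\\v [\\d\\w]+ \\p{Lu}",
  "u" // Unicode mode enables \p{...}; no "i" flag, so \p{Lu} stays case sensitive
);

interface Sample {
  text: string;
  expectCatch: boolean; // the [catch] / [no catch] label from the post
}

// Shortened versions of the six sample lines, in the same order.
const samples: Sample[] = [
  { text: "\\v 8 Some text without \\em punctuation\\em* \\rem \\any \\v 9 A new verse.", expectCatch: true },
  { text: "\\v 8 Some text without \\em punctuation\\em* \\v 9 A new verse.", expectCatch: true },
  { text: "\\v 8 Some text without punctuation \\v 9a A new verse.", expectCatch: true },
  { text: "\\v 8 Some text with proper \\em punctuation.\\em* \\just \\one \\more \\v 9 A new verse.", expectCatch: false },
  { text: "\\v 8 Some text ending in an all-cap word like \\em LORD\\em* \\v 9 A new verse.", expectCatch: true },
  { text: "\\v 10 A long sentence \\whatever \\v 11a over two verses or more.", expectCatch: false },
];

// True when the pattern finds a verse boundary that follows a bare letter.
function isCatch(text: string): boolean {
  return missingPunctuation.test(text);
}

const results: boolean[] = samples.map((s) => isCatch(s.text));

samples.forEach((s, i) => {
  const verdict = results[i] === s.expectCatch ? "as expected" : "MISMATCH";
  console.log(`sample ${i + 1}: catch=${results[i]} (${verdict})`);
});
```

If the flavour in use differs (for example .NET or Java), the pattern may need adjusting, so any mismatch reported here is a hint to look at flavour differences first.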
So, this was fun. Hope I am not too far off. Sadly, I can never remember which regex flavour PT uses. If there are syntax issues because of the flavour, I could easily apply a different one, like .NET or Java, and re-run my tool.
Again: Not the solution but maybe useful to get a dialog going.
hth