0 votes

I received this email from a translator. I could write an elementary regular expression, but I believe I would miss some cases of sentence-ending punctuation. Can somebody else help?

I am hoping to do a final punctuation check that is not part of the standard checks in Paratext. Specifically, I want to make sure that we haven’t left off any periods at the end of verses. (I have caught a few of those as we have been recording.) So I am basically looking for some REGEX code that would identify places where a verse starts with a capital letter but the prior verse does NOT end with a period. (I know this may identify some false positives, like if a verse starts with a proper noun, but it will at least give me a shorter list to check than checking all verses.) If you are able to write some REGEX code to help me, I would greatly appreciate it.

Thank you,

james_post

Paratext by [Moderator]
(2.1k points)

2 Answers

+1 vote
Best answer

First some feedback, with compliments for writing that in ten minutes:

  • (?<=\p{L}) => Is good, but I propose to include “the letter” in the actual catch, so that the user will see where the regex spotted a potential “end of sentence”. Compliments for catching upper case and lower case letters. So I replace ?<= with ?: in my proposal.

  • (?<!\\[\w+]) => I believe that, for your intention of avoiding false positives inside a marker, this part has to be to the left of your “letter”. Also, the + should be outside the square brackets, as many markers have multiple characters, like \s1.

  • And certain markers include the *, for example end-of-something like \em*. I vaguely remember that certain markers can or must even include +. If that is the case, you can easily add to the list of characters like [\w*+].

  • \s*(?:\\\w+)?\s* => This one I replace with “skip over any number of potential markers and any whitespace between the end of the verse and the start of the next verse”.

  • \\v [\d\w]+ \p{Lu} => Find a verse followed by an upper-case letter. Seems you are confident here about a “classic space” between \v and the verse number and after the verse number, so I kept that syntax.

I would like to submit a modified regex, not claiming it is the final solution, just hopefully covering more cases:

(?<!\\[\w\*]*)(?:\p{L})\s*(?:\\[\w\*]+\s+)*\\v [\d\w]+ \p{Lu}
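For anyone who wants to try the modified expression outside Paratext, here is a minimal TypeScript sketch. It assumes a JavaScript engine with lookbehind and Unicode property escapes (ES2018 or later), which may not behave exactly like the regex flavour Paratext uses. The helper name `findSuspects`, the `Suspect` shape, and the sample text are made up for illustration.

```typescript
// Regex copied from the post. The "u" flag enables \p{...}; there is no "i"
// flag, so \p{Lu} only matches upper-case letters (an assumption of this sketch).
const missingPunctuation =
  /(?<!\\[\w\*]*)(?:\p{L})\s*(?:\\[\w\*]+\s+)*\\v [\d\w]+ \p{Lu}/u;

// Hypothetical result shape, not part of any Paratext API.
interface Suspect {
  index: number; // offset of the last letter before the verse break
  match: string; // matched text, from that letter up to the capital letter
}

// Lists every place where a verse seems to end on a letter (no punctuation)
// and the next verse starts with an upper-case letter.
function findSuspects(usfm: string): Suspect[] {
  const suspects: Suspect[] = [];
  // A fresh global RegExp per call keeps lastIndex from leaking between calls.
  const re = new RegExp(missingPunctuation.source, "gu");
  let m: RegExpExecArray | null;
  while ((m = re.exec(usfm)) !== null) {
    suspects.push({ index: m.index, match: m[0] });
  }
  return suspects;
}

// Made-up sample text; String.raw keeps the USFM backslashes literal.
const sample = String.raw`\v 1 First verse without a period \v 2 Second verse with one. \v 3 Third verse`;
for (const s of findSuspects(sample)) {
  console.log(s.index, JSON.stringify(s.match));
}
```

On the made-up sample this reports a single spot: the final letter of "period" just before `\v 2`. The break before `\v 3` is not reported because the preceding verse ends with a period.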

Here is the blurb for my regex, as per my tool:

(?<!\\[\w\*]*)(?:\p{L})\s*(?:\\[\w\*]+\s+)*\\v [\d\w]+ \p{Lu}


Options: Case insensitive; Exact spacing; Dot matches line breaks

  • Assert that it is impossible to match the regex below backwards at this position (negative lookbehind)
      • Match the backslash character
      • Match a single character present in the list below
          • Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
          • A “word character” (Unicode; any letter or ideograph, digit, letter number, connector punctuation)
          • The literal character “*”
  • Match the regular expression below
      • Match a character from the Unicode category “letter” (any kind of letter from any language)
  • Match a single character that is a “whitespace character” (any Unicode separator, tab, line feed, carriage return, vertical tab, form feed, next line, zero-width space)
      • Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
  • Match the regular expression below
      • Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
      • Match the backslash character
      • Match a single character present in the list below
          • Between one and unlimited times, as many times as possible, giving back as needed (greedy)
          • A “word character” (Unicode; any letter or ideograph, digit, letter number, connector punctuation)
          • The literal character “*”
      • Match a single character that is a “whitespace character” (any Unicode separator, tab, line feed, carriage return, vertical tab, form feed, next line, zero-width space)
          • Between one and unlimited times, as many times as possible, giving back as needed (greedy)
  • Match the backslash character
  • Match the character string “v ” literally (case insensitive)
  • Match a single character present in the list below
      • Between one and unlimited times, as many times as possible, giving back as needed (greedy)
      • A “digit” (any decimal number in any Unicode script)
      • A “word character” (Unicode; any letter or ideograph, digit, letter number, connector punctuation)
  • Match the character “ ” literally
  • Match a character from the Unicode category “uppercase letter” (an uppercase letter that has a lowercase variant)

          Created with RegexBuddy

          And here is some sample text, that I used for testing. Most likely incomplete (at the end of each example, I mark, whether it is a catch or not):

          \v 8 Some text without \em punctuation\em* \rem \any \v 9 A new verse which is a new sentence, starting with a capital.  [catch]
          
          \v 8 Some text without \em punctuation\em*  \v 9 A new verse which is a new sentence, starting with a capital.  [catch]
          
          \v 8 Some text without punctuation \v 9a A new verse which is a new sentence, starting with a capital.  [catch]
          
          \v 8 Some text with proper \em punctuation.\em* \just \one \more \v 9 A new verse which is a new sentence, starting with a capital.  [punctuation present, so no catch]
          
          \v 8 Some text ending in an all-cap word like \em LORD\em* \v 9 A new verse which is a new sentence, starting with a capital.  [catch]
          
          \v 10 A long sentence \whatever \v 11a  over two verses or more.  [No punctuation needed, so no catch]
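The sample lines above can be turned into a small regression check. This is only a sketch in TypeScript, assuming a JavaScript engine (ES2018 or later) rather than Paratext's own regex flavour, and assuming case-sensitive matching so that \p{Lu} really means "upper case". The names `verseEndCheck`, `samples`, and `results` are made up for illustration.

```typescript
// Modified regex from the post, with only the "u" flag (no "i").
const verseEndCheck =
  /(?<!\\[\w\*]*)(?:\p{L})\s*(?:\\[\w\*]+\s+)*\\v [\d\w]+ \p{Lu}/u;

// [sample text, expected catch?] pairs taken from the list above.
// String.raw keeps the USFM backslashes literal.
const samples: [string, boolean][] = [
  [String.raw`\v 8 Some text without \em punctuation\em* \rem \any \v 9 A new verse which is a new sentence, starting with a capital.`, true],
  [String.raw`\v 8 Some text without \em punctuation\em*  \v 9 A new verse which is a new sentence, starting with a capital.`, true],
  [String.raw`\v 8 Some text without punctuation \v 9a A new verse which is a new sentence, starting with a capital.`, true],
  [String.raw`\v 8 Some text with proper \em punctuation.\em* \just \one \more \v 9 A new verse which is a new sentence, starting with a capital.`, false],
  [String.raw`\v 8 Some text ending in an all-cap word like \em LORD\em* \v 9 A new verse which is a new sentence, starting with a capital.`, true],
  [String.raw`\v 10 A long sentence \whatever \v 11a  over two verses or more.`, false],
];

// Run the regex on each sample and record expected vs. actual.
const results = samples.map(([text, expected]) => ({
  expected,
  actual: verseEndCheck.test(text),
}));

results.forEach((r, i) => {
  const status = r.actual === r.expected ? "ok" : "MISMATCH";
  console.log(`sample ${i + 1}: expected=${r.expected} actual=${r.actual} ${status}`);
});
```

New edge cases (for example markers containing `+`) can be added as further pairs before changing the expression.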
          

          So, this was fun. Hope I am not too far off. I can sadly never remember what regex-flavour PT uses. If there are issues with syntax because of flavours, I could easily apply a different one like .NET or Java and re-run my tool.

          Again: Not the solution but maybe useful to get a dialog going.

          hth

          by (842 points)
          0 votes

          The best I could come up with in 10 minutes was the following:

          (?<=\p{L})(?<!\\[\w+])\s*(?:\\\w+)?\s*\\v [\d\w]+ \p{Lu}
          

          I’m not sure how accurate it is (it passed all my testing data, but it certainly wasn’t extensive) so someone can probably do better. :grin:

          Brief explanation of the expression:

          • (?<=\p{L}) => Look behind for a letter (i.e. the end of the previous verse)
          • (?<!\\[\w+]) => Make sure the letter isn’t part of a marker (negative look behind)
          • \s*(?:\\\w+)?\s* => Skip over any single marker and any whitespace between the end of the verse and the start of the next verse.
          • \\v [\d\w]+ \p{Lu} => Find a verse followed by an upper-case letter.
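To see where this first attempt holds up and where it does not, here is a small TypeScript sketch that probes it with four made-up fragments. It assumes JavaScript regex semantics (ES2018 or later), which may differ from Paratext's engine, and case-sensitive matching. The names `firstAttempt` and `probes` are hypothetical.

```typescript
// The expression from this answer, with the "u" flag for \p{...}.
const firstAttempt =
  /(?<=\p{L})(?<!\\[\w+])\s*(?:\\\w+)?\s*\\v [\d\w]+ \p{Lu}/u;

// Made-up fragments; String.raw keeps the USFM backslashes literal.
const probes: [string, string][] = [
  ["plain verse break", String.raw`no punctuation \v 9 A new verse`],
  ["one simple marker", String.raw`no punctuation \p \v 9 A new verse`],
  ["closing marker with *", String.raw`no \em punctuation\em* \v 9 A new verse`],
  ["period, then a marker", String.raw`with punctuation. \more \v 9 A new verse`],
];

for (const [label, text] of probes) {
  console.log(label, "=>", firstAttempt.test(text));
}
```

Under these assumptions the first two fragments are flagged as intended. The third is missed, because `\w+` cannot consume the `*` in `\em*`. The fourth is flagged even though the period is present, because the lookbehind only inspects the two characters before the position, so the last letter of a longer marker such as `\more` is treated like verse text.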
          by [Expert]
          (16.2k points)
