0 votes

I received this email from a translator. I could write an elementary regular expression, but I believe I would miss some cases of possible sentence-ending punctuation. Can somebody else help?

I am hoping to do a final punctuation check that is not part of the standard checks in Paratext. Specifically, I want to make sure that we haven’t left off any periods at the end of verses. (I have caught a few of those as we have been recording.) So I am basically looking for some REGEX code that would identify places where a verse starts with a capital letter but the prior verse does NOT end with a period. (I know this may identify some false positives, like if a verse starts with a proper noun, but it will at least give me a shorter list to check than checking all verses.) If you are able to write some REGEX code to help me, I would greatly appreciate it.

Thank you,

james_post

Paratext by [Moderator]
(2.0k points)

2 Answers

+1 vote
Best answer

First some feedback, with compliments for writing that in ten minutes:

  • (?<=\p{L}) => Is good, but I propose to include “the letter” in the actual catch, so that the user will see where the regex spotted a potential “end of sentence”. Compliments for catching upper case and lower case letters. So I replace ?<= with ?: in my proposal.

  • (?<!\\[\w+]) => I believe that, for your intention of avoiding false positives with some marker, this part has to be to the left of your “letter”. Also, the + should be outside the square brackets, as many markers have multiple characters, like \s1.

  • And certain markers include the *, for example end-of-something like \em*. I vaguely remember that certain markers can or must even include +. If that is the case, you can easily add to the list of characters like [\w*+].

  • \s*(?:\\\w+)?\s* => This one I replace with “skip over any number of potential markers and any whitespace between the end of the verse and the start of the next verse”.

  • \\v [\d\w]+ \p{Lu} => Find a verse followed by an upper-case letter. Seems you are confident here about a “classic space” between \v and the verse number and after the verse number, so I kept that syntax.
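
To illustrate the point about the + inside the square brackets, here is a small Python sketch. Python is only my choice for illustration; the regex engine in Paratext may differ in other details. Inside a character class, + and * are literal characters, so [\w+] matches exactly one character, while [\w*]+ matches a whole marker such as \s1 or \em*. The helper name marker_prefix is made up for this example.

```python
import re

# '+' inside the brackets is a literal plus sign: the class matches ONE character.
ONE_CHAR = re.compile(r"\\[\w+]")
# '+' outside the brackets repeats the class; '*' inside is a literal asterisk.
WHOLE_MARKER = re.compile(r"\\[\w*]+")


def marker_prefix(pattern, text):
    """Return the part of text that pattern matches at position 0, or None."""
    m = pattern.match(text)
    return m.group(0) if m else None


print(marker_prefix(ONE_CHAR, r"\s1"))       # stops after the first marker character
print(marker_prefix(WHOLE_MARKER, r"\s1"))   # covers the full marker
print(marker_prefix(WHOLE_MARKER, r"\em*"))  # closing markers with '*' are covered too
```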

I would like to submit a modified regex, not claiming it is the final solution, just hopefully covering more cases:

(?<!\\[\w\*]*)(?:\p{L})\s*(?:\\[\w\*]+\s+)*\\v [\d\w]+ \p{Lu}

Here is the blurb for my regex, as per my tool:

(?<!\\[\w\*]*)(?:\p{L})\s*(?:\\[\w\*]+\s+)*\\v [\d\w]+ \p{Lu}

Options: Case insensitive; Exact spacing; Dot matches line breaks

  • Assert that it is impossible to match the regex below backwards at this position (negative lookbehind)
      • Match the backslash character
      • Match a single character present in the list below
          • Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
          • A “word character” (Unicode; any letter or ideograph, digit, letter number, connector punctuation)
          • The literal character “*”
  • Match the regular expression below
      • Match a character from the Unicode category “letter” (any kind of letter from any language)
  • Match a single character that is a “whitespace character” (any Unicode separator, tab, line feed, carriage return, vertical tab, form feed, next line, zero-width space)
      • Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
  • Match the regular expression below
      • Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
      • Match the backslash character
      • Match a single character present in the list below
          • Between one and unlimited times, as many times as possible, giving back as needed (greedy)
          • A “word character” (Unicode; any letter or ideograph, digit, letter number, connector punctuation)
          • The literal character “*”
      • Match a single character that is a “whitespace character” (any Unicode separator, tab, line feed, carriage return, vertical tab, form feed, next line, zero-width space)
          • Between one and unlimited times, as many times as possible, giving back as needed (greedy)
  • Match the backslash character
  • Match the character string “v ” literally (case insensitive)
  • Match a single character present in the list below
      • Between one and unlimited times, as many times as possible, giving back as needed (greedy)
      • A “digit” (any decimal number in any Unicode script)
      • A “word character” (Unicode; any letter or ideograph, digit, letter number, connector punctuation)
  • Match the character “ ” literally
  • Match a character from the Unicode category “uppercase letter” (an uppercase letter that has a lowercase variant)

          Created with RegexBuddy

          And here is some sample text that I used for testing; it is most likely incomplete. At the end of each example, I mark whether it is a catch or not:

          \v 8 Some text without \em punctuation\em* \rem \any \v 9 A new verse which is a new sentence, starting with a capital.  [catch]
          
          \v 8 Some text without \em punctuation\em*  \v 9 A new verse which is a new sentence, starting with a capital.  [catch]
          
          \v 8 Some text without punctuation \v 9a A new verse which is a new sentence, starting with a capital.  [catch]
          
          \v 8 Some text with proper \em punctuation.\em* \just \one \more \v 9 A new verse which is a new sentence, starting with a capital.  [punctuation present, so no catch]
          
          \v 8 Some text ending in an all-cap word like \em LORD\em* \v 9 A new verse which is a new sentence, starting with a capital.  [catch]
          
          \v 10 A long sentence \whatever \v 11a  over two verses or more.  [No punctuation needed, so no catch]
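
For anyone who wants to try the sample lines above outside Paratext, here is a rough Python sketch. It is an approximation, not the regex itself: Python's built-in re module supports neither \p{L} / \p{Lu} nor a variable-length lookbehind, so [^\W\d_] stands in for "any letter", and the "not inside a marker" and "upper case" conditions are checked in code. The names find_missing_punctuation, BOUNDARY, INSIDE_MARKER and SAMPLES are made up for this example; the expected flags are the [catch] / no-catch notes from the sample text.

```python
import re

# Approximates: (?<!\\[\w\*]*)(?:\p{L})\s*(?:\\[\w\*]+\s+)*\\v [\d\w]+ \p{Lu}
# Group 1 = last letter of the previous verse, group 2 = first letter of the next.
BOUNDARY = re.compile(r"([^\W\d_])\s*(?:\\[\w*]+\s+)*\\v [\d\w]+ ([^\W\d_])")
# Stand-in for the negative lookbehind: text that ends in the middle of a marker.
INSIDE_MARKER = re.compile(r"\\[\w*]*\Z")


def find_missing_punctuation(text):
    """Return the matched spans where a verse ends in a letter and the next starts upper case."""
    hits = []
    for m in BOUNDARY.finditer(text):
        if INSIDE_MARKER.search(text[:m.start(1)]):
            continue  # the letter belongs to a marker such as \just
        if not m.group(2).isupper():
            continue  # the next verse continues the sentence in lower case
        hits.append(m.group(0))
    return hits


# (sample line, expected catch) pairs taken from the test text above.
SAMPLES = [
    (r"\v 8 Some text without \em punctuation\em* \rem \any \v 9 A new verse.", True),
    (r"\v 8 Some text without \em punctuation\em*  \v 9 A new verse.", True),
    (r"\v 8 Some text without punctuation \v 9a A new verse.", True),
    (r"\v 8 Some text with proper \em punctuation.\em* \just \one \more \v 9 A new verse.", False),
    (r"\v 8 Some text ending in an all-cap word like \em LORD\em* \v 9 A new verse.", True),
    (r"\v 10 A long sentence \whatever \v 11a  over two verses or more.", False),
]

for line, expected in SAMPLES:
    caught = bool(find_missing_punctuation(line))
    print("catch" if caught else "no catch", "| expected:", "catch" if expected else "no catch")
```

Since the two conditions are checked in code rather than in the pattern, this only shows the intended behaviour on the samples; it says nothing about how a particular regex flavour handles the original expression.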
          

          So, this was fun. Hope I am not too far off. I can sadly never remember what regex-flavour PT uses. If there are issues with syntax because of flavours, I could easily apply a different one like .NET or Java and re-run my tool.

          Again: Not the solution but maybe useful to get a dialog going.

          hth

          by (855 points)
          0 votes

          The best I could come up with in 10 minutes was the following:

          (?<=\p{L})(?<!\\[\w+])\s*(?:\\\w+)?\s*\\v [\d\w]+ \p{Lu}
          

          I’m not sure how accurate it is (it passed all my testing data, but it certainly wasn’t extensive) so someone can probably do better. :grin:

          Brief explanation of the expression:

          • (?<=\p{L}) => Look behind for a letter (i.e. the end of the previous verse)
          • (?<!\\[\w+]) => Make sure the letter isn’t part of a marker (negative look behind)
          • \s*(?:\\\w+)?\s* => Skip over any single marker and any whitespace between the end of the verse and the start of the next verse.
          • \\v [\d\w]+ \p{Lu} => Find a verse followed by an upper-case letter.
          by [Expert]
          (16.2k points)

          We have the same issue for TkLy - Toka-Ley Bible Translation project. As noted in the question above, “I want to make sure that we haven’t left off any periods at the end of verses.”
          I copied and pasted the REGEX code into the Find window. Paratext found 73 items. But unfortunately, all but five were verses followed by a Section Heading or at a Chapter Number. Four of the five exceptions were valid errors that had no period.
          Is there a way to modify the REGEX to omit verses followed by Section Headings?
          Thank you,
          Mark
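
One possible workaround for the section-heading question, sketched in Python: this is not a change to the regex itself and is untested inside Paratext. The idea is to blank out heading lines in an exported copy of the text before checking verse boundaries, so that the last verse before a heading sits directly in front of the next \v. It assumes each heading sits on its own line, and the marker list (\s, \ms, \mt with an optional digit, plus \r and \c) is only my guess at what a project might want to ignore. The names strip_heading_lines and HEADING_LINE are made up for this example.

```python
import re

# Assumed list of heading-type markers: \s, \s1.., \ms, \mt, \r and \c.
# Adjust it to the markers your project actually uses.
HEADING_LINE = re.compile(
    r"^\\(?:s|ms|mt)\d?\b.*$|^\\(?:r|c)\b.*$",
    re.MULTILINE,
)


def strip_heading_lines(usfm):
    """Blank out whole lines that start with a heading-type marker."""
    return HEADING_LINE.sub("", usfm)


example = (
    "\\v 8 Some text without punctuation\n"
    "\\s1 A Section Heading\n"
    "\\p\n"
    "\\v 9 A new verse which is a new sentence."
)
# The heading text is gone; the verse text and the \p marker remain.
print(strip_heading_lines(example))
```

The \b after the marker name keeps longer markers such as \sp or \rem from being treated as headings.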