0 votes

Hello,

I was just checking for incorrect use of capitalization. In particular, I know that the translators have sometimes accidentally written “LORD” instead of \nd Lord\nd*. But I was surprised that Run Basic Checks…>Capitalization didn’t catch instances of all caps.

This regex finds any word with two capital letters (adjacent or not), but it also returns a lot of noise, such as book name abbreviations, language codes, and image file names.
regex:\w*[A-ZĆÄÏÜ]\w*[A-ZĆÄÏÜ]\w
(For the regex-uninitiated: [A-ZĆÄÏÜ] means any character from A to Z, or any of the special characters Ć, Ä, Ï, and Ü, uppercase only.)
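
For anyone who wants to try the pattern outside Paratext, here is a minimal Python sketch. It assumes Python's built-in `re` module handles this particular pattern the same way Paratext does, and the sample line is made up for illustration.

```python
import re

# The pattern from the post, without Paratext's "regex:" prefix.
PATTERN = re.compile(r"\w*[A-ZĆÄÏÜ]\w*[A-ZĆÄÏÜ]\w")

def find_candidates(text):
    """Return every substring the pattern matches in text."""
    return PATTERN.findall(text)

# Made-up sample: a book code, a real all-caps word, and an image file name.
sample = r"\id GEN The LORD spoke to Moses (see CN01628B.tif)"
print(find_candidates(sample))
```

Because the pattern ends in a single `\w`, a match needs at least three word characters and stops one character after the second capital, so a two-letter word such as "NO" is not reported at all.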

I looked for places to change the capitalization rules but could only find a place to approve or disapprove words with mixed capitalization (in which words in ALL CAPS do not appear).

I also searched the Paratext help files and this forum, but couldn’t find any information.

Is there currently an easy way to find words in ALL CAPS? If not, could the functionality be added? Perhaps this could show up in the mixed caps search results, even though they are technically all-caps. Or, maybe it would be better for them to have their own category.

Thanks so much!

Paratext by (364 points)

6 Answers

0 votes
Best answer

I love using regex but it always makes my head spin. Here is how I’ve managed to break it down, for anyone who is interested. I used underscores as spaces to keep everything aligned.

Breakdown of regex that finds words in ALL CAPS that are not preceded by certain markers.
(?<!\\(id|h|toc|mt|fig).*)\b\p{Lu}+\b

  1. Look for a word boundary
    __________________________\b_________
  2. Look for any capital letter at the start of a word
    __________________________\b\p{Lu}___
  3. Look for one or more capital letters, starting at the start of a word
    __________________________\b\p{Lu}+__
  4. Look for a word of any length that is all capital letters
    __________________________\b\p{Lu}+\b
  5. After finding an all-caps word, check if there was an optional grouping just before it
    (?_______________________)\b\p{Lu}+\b
  6. After finding an all-caps word, make sure there wasn’t a backslash just before it
    (?<!\\___________________)\b\p{Lu}+\b
  7. After finding an all-caps word, make sure there wasn’t an \id or \h or \toc or \mt or \fig just before it
    (?<!\\(id|h|toc|mt|fig)__)\b\p{Lu}+\b
  8. After finding an all-caps word, make sure there wasn’t an \id or \h or \toc or \mt or \fig anywhere earlier in the line
    (?<!\\(id|h|toc|mt|fig).*)\b\p{Lu}+\b
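
The breakdown above can also be mirrored step by step in ordinary code. This is only a sketch under a few assumptions: Python's `re` module supports neither `\p{Lu}` nor a lookbehind containing `.*`, so the "all capitals" test is done with `unicodedata` and the lookbehind is emulated by searching the part of the line before each word. The function names and the sample text are hypothetical.

```python
import re
import unicodedata

# Markers to skip, copied from the lookbehind in the regex above.
SKIP_MARKER = re.compile(r"\\(id|h|toc|mt|fig)")

def is_all_caps(word):
    """True if every character is an uppercase letter (Unicode category Lu)."""
    return all(unicodedata.category(ch) == "Lu" for ch in word)

def find_all_caps_outside_markers(usfm):
    """Return all-caps words not preceded by a skipped marker on their line."""
    hits = []
    for line in usfm.splitlines():
        for m in re.finditer(r"\w+", line):
            if not is_all_caps(m.group()):
                continue
            # Emulates (?<!\\(id|h|toc|mt|fig).*): reject the word if one of
            # the markers appears anywhere earlier on the same line.
            if SKIP_MARKER.search(line[:m.start()]):
                continue
            hits.append(m.group())
    return hits

sample = "\n".join([
    r"\id GEN",
    r"\mt1 GENESIS",
    r"\v 1 The LORD said",
    r"\v 2 \fig MAP|MAP.TIF\fig* Then GOD spoke",
])
print(find_all_caps_outside_markers(sample))
```

In this sample only the word in verse 1 is reported. The all-caps word after the figure in verse 2 is skipped, because the `\fig` marker sits earlier on the same line.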

I think personally I will remove |mt because I want to check for all caps in book titles too.

It does occur to me: what if a figure is inserted at the start or in the middle of a verse? Any words in ALL CAPS that follow it on the same line would not be detected by this regex.

(\\fig(.*?)\\fig\*) captures the entire figure tag, but that ended up not helping. This is what I came up with:
(?<!\\(id|h|toc).*)\b\p{Lu}+\b.*?(?!\\fig\*)

Here is the breakdown:

  1. After finding an all-caps word, make sure there wasn’t an \id or \h or \toc anywhere earlier in the line.
    (?<!\\(id|h|toc).*)\b\p{Lu}+\b______________
  2. After finding an all-caps word, make sure there wasn’t an \id or \h or \toc anywhere earlier in the line. Then look ahead any number of characters.
    (?<!\\(id|h|toc).*)\b\p{Lu}+\b.*?___________
  3. After finding an all-caps word, make sure there wasn’t an \id or \h or \toc anywhere earlier in the line. Then look ahead any number of characters to make sure \fig* isn’t found anywhere afterward in the line.
    (?<!\\(id|h|toc).*)\b\p{Lu}+\b.*?(?!\\fig\*)

For some reason, this regex didn’t catch “\mt1 GÉNESIS”. Any ideas on why?
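
One possible cause, offered only as a guess and not verified in Paratext: if the text is stored in decomposed Unicode form (NFD), the É is a plain E followed by a combining acute accent, and that accent is a mark rather than an uppercase letter, so `\p{Lu}+` cannot run through the whole word. The Python sketch below just shows the difference between the two forms; the helper name is hypothetical.

```python
import unicodedata

def describe(word):
    """List each character's code point and Unicode general category."""
    return [(f"U+{ord(ch):04X}", unicodedata.category(ch)) for ch in word]

# The same word in composed (NFC) and decomposed (NFD) form.
composed = unicodedata.normalize("NFC", "G\u00c9NESIS")
decomposed = unicodedata.normalize("NFD", composed)

print(len(composed), len(decomposed))
print(describe(decomposed)[:3])
```

If the decomposed form turns out to be the culprit, allowing combining marks after each capital (for example `\p{M}*` in .NET-style syntax) might be worth trying, though that is untested here.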

by (364 points)

A \fig should never be inside a verse or paragraph because it will break up the text and formatting in all sorts of bad ways, depending on your output software. It needs to be placed at the end of a paragraph. Of course, we run regexes at all sorts of times when the text is not especially clean, but I wanted to mention that restriction.

0 votes

Hi anon094061.
You can put the following regular expression in the Paratext search engine to see the words that are all uppercase: regex:\b\p{Lu}+\b

where:
“regex:” is the command to tell Paratext that it is a regular expression.
“\b” is a word boundary.
“\p{Lu}” matches any uppercase letter.
“+” means one or more of the preceding item.
That is, it looks for one or more consecutive uppercase letters with a word boundary at each end, for example where the word meets a space or a comma.
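
To see the same idea outside Paratext, here is a rough Python equivalent. It is a sketch, not a drop-in replacement: Python's built-in `re` module has no `\p{Lu}`, so this version splits the text into words and checks each character's Unicode category instead. The function names and the sample sentence are made up.

```python
import re
import unicodedata

def is_all_caps(word):
    """True if every character is an uppercase letter (Unicode category Lu)."""
    return all(unicodedata.category(ch) == "Lu" for ch in word)

def find_all_caps(text):
    """Return the words in text that consist only of uppercase letters."""
    # \w+ picks out whole words, standing in for the \b ... \b pair.
    return [w for w in re.findall(r"\w+", text) if is_all_caps(w)]

print(find_all_caps("Then the LORD said, I AM. GÉNESIS 1"))
```

Note that single capital letters standing alone, such as "I", also count as all-caps words under this rule, so expect some of those in the results.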


Indeed, the Mixed capitalization inventory… only considers unusual combinations of upper and lower case, so it does not list words in all capital letters.

by (844 points)
0 votes

It’s a bit more complicated, but you can also use:
regex:(?<!\\(id|h|toc|mt|fig).*)\b\p{Lu}+\b

This disregards the content of certain markers that we already know will be capitalized, such as “id” or “mt”.
You can exclude other markers by adding a “|” and the marker name inside the inner parentheses.

by (844 points)

Wow, this is really great! Thanks so much, anon689242. This is just what I need.

I think this check would make for a great addition to “Assignments and Progress.”

0 votes

Ah, I just realized that my \fig \fig* solution filters out ALL CAPS results occurring in a verse before the figure. So, something more advanced is needed. I’ll research how to tell regex to ignore a group.

by (364 points)
+1 vote

I have found https://regex101.com/ helpful in the analysis of regular expressions, though I understand that an Internet connection may be an issue for some. I’m not sure which “flavor” of regex corresponds to what Paratext uses.

by (296 points)

According to my notes, Paratext 9, RegEx Pal, PrintDraft Changes.txt and DBLchanges.txt all use
C# (.NET 2.0-4.8.1 & .NET Core 1.0-3.0)

I would be happy to have that confirmed.

0 votes

Really good observations. Thank you @Shegnada and @anon806807.

I was just thinking that we might want this regex to check the alt text of figures too. I think it’s just the image filename that is offending, so why don’t we just filter that? Something like
\b.*\.(TIF|GIF|JPG|JPEG)\b?

Maybe with a negative lookahead? Something like this?

(?:\b.*\.(TIFF?|GIF|JPE?G|PNG|EPS|BMP|WEBP)\b)

It should ignore something like CN01628B.tif, too, though.
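
In the meantime, here is the idea as a small Python sketch rather than a single Paratext regex: blank out anything that looks like an image file name, then check what is left for all-caps words. The extension list, the function names, and the sample line are all assumptions made for illustration, and the check is case-insensitive so that lowercase extensions are covered too.

```python
import re
import unicodedata

# Hypothetical list of extensions; adjust to the image types in your project.
IMAGE_FILE = re.compile(
    r"[\w-]+\.(?:tiff?|gif|jpe?g|png|eps|bmp|webp)\b", re.IGNORECASE
)

def is_all_caps(word):
    """True if every character is an uppercase letter (Unicode category Lu)."""
    return all(unicodedata.category(ch) == "Lu" for ch in word)

def all_caps_ignoring_images(line):
    """Return all-caps words in line, ignoring image file names."""
    # Replace each file name with a space, then check the remaining words.
    cleaned = IMAGE_FILE.sub(" ", line)
    return [w for w in re.findall(r"\w+", cleaned) if is_all_caps(w)]

line = r"\fig THE FLOOD|MAP.TIF|col|||Noah|7.1\fig* CN01628B.tif"
print(all_caps_ignoring_images(line))
```

With this approach the descriptive text of the figure is still checked, while both file names in the sample are ignored.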

I’m not sitting at my computer as I write this but I hope to test it soon.

by (364 points)