How to find incorrect use of ALL CAPS

Question

Hello,

I was just checking for incorrect use of capitalization. In particular, I know that the translators have sometimes accidentally written “LORD” instead of \nd Lord\nd*. But I was surprised that Run Basic Checks…>Capitalization didn’t catch instances of all caps.

This regex finds words containing at least two capital letters, but it also returns a lot of noise, such as book name abbreviations, language codes, and image file names.
regex:\w*[A-ZĆÄÏÜ]\w*[A-ZĆÄÏÜ]\w
(For the regex-uninitiated, [A-ZĆÄÏÜ] means any character from A to Z, or any of the special characters Ć, Ä, Ï, and Ü, uppercase only.)
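
To see the kind of noise this pattern produces, here is a small sketch that runs it through Python's `re` module rather than Paratext itself. The sample line is made up, and Python's engine may differ from Paratext's in small ways.

```python
import re

# The pattern from the question, without Paratext's "regex:" prefix.
PATTERN = re.compile(r"\w*[A-ZĆÄÏÜ]\w*[A-ZĆÄÏÜ]\w")

# Hypothetical sample line: a book code plus a verse containing an all-caps word.
sample = r"\id GEN \v 1 the LORD said"

matches = PATTERN.findall(sample)
print(matches)  # ['GEN', 'LORD']
```

The book code `GEN` is reported alongside `LORD`, which is the noise described above.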

I looked for places to change the capitalization rules but could only find a place to approve or disapprove words with mixed capitalization (in which words in ALL CAPS do not appear).

I also searched the Paratext help files and this forum, but couldn’t find any information.

Is there currently an easy way to find words in ALL CAPS? If not, could the functionality be added? Perhaps this could show up in the mixed caps search results, even though they are technically all-caps. Or, maybe it would be better for them to have their own category.

Thanks so much!

Paratext May 17, 2021 asked by alex_larkin (370 points)
May 18, 2021 reshown

6 Answers

Best answer

I love using regex but it always makes my head spin. Here is how I’ve managed to break it down, for anyone who is interested. I used underscores as spaces to keep everything aligned.

Breakdown of regex that finds words in ALL CAPS that are not preceded by certain markers.
(?<!\\(id|h|toc|mt|fig).*)\b\p{Lu}+\b

Look for a word boundary
__________________________\b_________
Look for any capital letter at the start of a word
__________________________\b\p{Lu}___
Look for one or more capital letters, starting at the start of a word
__________________________\b\p{Lu}+__
Look for a word of any length that is all capital letters
__________________________\b\p{Lu}+\b
After finding an all-caps word, open a group that checks the text just before it
(?_______________________)\b\p{Lu}+\b
After finding an all-caps word, make sure there wasn’t a backslash just before it
(?<!\\___________________)\b\p{Lu}+\b
After finding an all-caps word, make sure there wasn’t an \id or \h or \toc or \mt or \fig just before it
(?<!\\(id|h|toc|mt|fig)__)\b\p{Lu}+\b
After finding an all-caps word, make sure there wasn’t an \id or \h or \toc or \mt or \fig anywhere earlier in the line
(?<!\\(id|h|toc|mt|fig).*)\b\p{Lu}+\b

I think personally I will remove |mt because I want to check for all caps in book titles too.

It does occur to me, what if a figure is inserted in the start or in the middle of a verse? Following words in ALL CAPS would not be detected by this regex.
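
This blind spot can be reproduced outside Paratext. The following is a minimal sketch in Python. It emulates the lookbehind by hand, because Python's built-in `re` module supports neither `\p{Lu}` nor a lookbehind containing `.*`. The function names and the sample text are made up for illustration.

```python
import re
import unicodedata

# Markers whose presence earlier in a line suppresses matches, as in the regex above.
EXCLUDED = re.compile(r"\\(id|h|toc|mt|fig)")

def is_all_caps(word):
    """True if every character is an uppercase letter (Unicode category Lu)."""
    return all(unicodedata.category(ch) == "Lu" for ch in word)

def find_all_caps(usfm):
    """Return (line number, word) pairs, ignoring text after an excluded marker."""
    hits = []
    for number, line in enumerate(usfm.splitlines(), start=1):
        marker = EXCLUDED.search(line)
        # Only the part of the line before the first excluded marker is checked.
        searchable = line[:marker.start()] if marker else line
        for word in re.findall(r"\w+", searchable):
            if is_all_caps(word):
                hits.append((number, word))
    return hits

sample = "\n".join([
    r"\id GEN",
    r"\v 1 the LORD spoke",
    r"\v 2 he \fig caption|pic.tif\fig* said STOP",
])
print(find_all_caps(sample))  # [(2, 'LORD')]
```

`STOP` on the third line goes unreported because it follows `\fig` on the same line.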

(\\fig(.*?)\\fig\*) captures the entire figure tag, but that ended up not helping. This is what I came up with:
(?<!\\(id|h|toc).*)\b\p{Lu}+\b.*?(?!\\fig\*)

Here is the breakdown:

After finding an all-caps word, make sure there wasn’t an \id or \h or \toc anywhere earlier in the line.
(?<!\\(id|h|toc).*)\b\p{Lu}+\b______________
After finding an all-caps word, make sure there wasn’t an \id or \h or \toc anywhere earlier in the line. Then look ahead any number of characters.
(?<!\\(id|h|toc).*)\b\p{Lu}+\b.*?___________
After finding an all-caps word, make sure there wasn’t an \id or \h or \toc anywhere earlier in the line. Then look ahead any number of characters to make sure \fig* isn’t found anywhere afterward in the line.
(?<!\\(id|h|toc).*)\b\p{Lu}+\b.*?(?!\\fig\*)

For some reason, this regex didn’t catch “\mt1 GÉNESIS”. Any ideas on why?
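
One possibility worth ruling out, offered as an assumption since nothing in this thread confirms it, is that the É in the project text is stored as a plain E followed by a combining acute accent. A combining accent is not an uppercase letter, so a run of `\p{Lu}` would stop at it. The sketch below uses Python's `unicodedata` module only to show the two ways the same word can be encoded. It says nothing about how Paratext itself stores the text.

```python
import unicodedata

def categories(text):
    """Return the Unicode general category of each character in text."""
    return [unicodedata.category(ch) for ch in text]

# The same visible word in composed (NFC) and decomposed (NFD) form.
precomposed = unicodedata.normalize("NFC", "G\u00c9NESIS")
decomposed = unicodedata.normalize("NFD", "G\u00c9NESIS")

print(len(precomposed), categories(precomposed))  # 7 characters, all 'Lu'
print(len(decomposed), categories(decomposed))    # 8 characters, third one is 'Mn'
```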

May 17, 2021 answered by alex_larkin (370 points)
May 17, 2021 reshown

Pepe · Answer 1 · 2021-05-17T16:51:56+0000

Hi anon094061.
You can put the following regular expression in the Paratext search engine to see the words that are all uppercase: regex:\b\p{Lu}+\b

where:
“regex:” tells Paratext that what follows is a regular expression.
“\b” matches a word boundary.
“\p{Lu}” matches any uppercase letter.
“+” means one or more of the preceding item.
That is, it looks for one or more consecutive uppercase letters with a word boundary at each end, such as a space or a comma.
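
As a rough cross-check outside Paratext, the same idea can be sketched in Python. The built-in `re` module does not understand `\p{Lu}`, so this version (an approximation, with a made-up function name and sample text) splits the text into words and then tests each character's Unicode category.

```python
import re
import unicodedata

def all_caps_words(text):
    """Return the words made up entirely of uppercase letters (category Lu)."""
    return [
        word
        for word in re.findall(r"\w+", text)
        if all(unicodedata.category(ch) == "Lu" for ch in word)
    ]

print(all_caps_words("the LORD said to Abram, GO"))  # ['LORD', 'GO']
# Accented capitals count as uppercase; a word containing digits does not qualify.
print(all_caps_words("G\u00c9NESIS and CN01628B"))   # ['GÉNESIS']
```

A single capital such as “I” is also reported, since “+” accepts a run of length one.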

Indeed, the Mixed capitalization inventory… only considers unusual combinations of upper and lower case, so it does not list words in all capital letters.

Pepe · Answer 2 · 2021-05-17T17:36:52+0000

It’s a bit more complicated, but you can also use:
regex:(?<!\\(id|h|toc|mt|fig).*)\b\p{Lu}+\b

This disregards the content of certain markers that we already know will be capitalized, such as “id” or “mt”.
You can exclude other markers by adding an “|” and the marker name inside the larger parentheses, following the pattern of the existing ones.
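
If the list of excluded markers changes often, the search string can be assembled rather than edited by hand. This is only a convenience sketch in Python. The function name is made up, and `rem` is just an example of an extra marker someone might choose to exclude.

```python
def build_all_caps_search(markers):
    """Build the Paratext search string that ignores text after the given markers."""
    alternatives = "|".join(markers)
    return r"regex:(?<!\\(" + alternatives + r").*)\b\p{Lu}+\b"

print(build_all_caps_search(["id", "h", "toc", "mt", "fig"]))
# regex:(?<!\\(id|h|toc|mt|fig).*)\b\p{Lu}+\b
print(build_all_caps_search(["id", "h", "toc", "mt", "fig", "rem"]))
# regex:(?<!\\(id|h|toc|mt|fig|rem).*)\b\p{Lu}+\b
```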

alex_larkin · Answer 3 · 2021-05-17T23:10:37+0000

Ah, I just realized that my \fig \fig* solution filters out ALL CAPS results occurring in a verse before the figure. So, something more advanced is needed. I’ll research how to tell regex to ignore a group.
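
One way to think about “ignoring a group” is to blank out each `\fig ... \fig*` span first and then check whatever is left, so that words before and after the figure are both seen. The sketch below does this in Python, outside Paratext. The function name and sample line are made up, and it assumes a figure opens and closes on the same line.

```python
import re
import unicodedata

# A whole figure, from the opening \fig to the closing \fig* on the same line.
FIGURE = re.compile(r"\\fig\b.*?\\fig\*")

def all_caps_outside_figures(line):
    """Return all-caps words in a line after removing any figure spans."""
    without_figures = FIGURE.sub(" ", line)
    return [
        word
        for word in re.findall(r"\w+", without_figures)
        if all(unicodedata.category(ch) == "Lu" for ch in word)
    ]

line = r"\v 2 he SAID \fig caption|CN01628B.TIF\fig* and LEFT"
print(all_caps_outside_figures(line))  # ['SAID', 'LEFT']
```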

anon806807 · Answer 4 · 2021-05-18T03:21:29+0000

I have found https://regex101.com/ helpful in the analysis of regular expressions, though I understand that an Internet connection may be an issue for some. I’m not sure which “flavor” of regex corresponds to what Paratext uses.

alex_larkin · Answer 5 · 2021-05-18T15:25:02+0000

Really good observations. Thank you @Shegnada and @anon806807.

I was just thinking that we might want this regex to check the alt text of figures too. I think it’s just the image filename that is offending, so why don’t we just filter that? Something like
\b.*\.(TIF|GIF|JPG|JPEG)\b?

Maybe with a negative lookahead? Something like this?

(?!\b.*\.(TIF(F)?|GIF|JP(E)?G|PNG|EPS|BMP|WEBP)\b)

It should ignore something like CN01628B.tif, too, though.

I’m not sitting at my computer as I write this but I hope to test it soon.
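
The idea of filtering out only the image filename, so that the rest of the figure text is still checked, can be tried outside Paratext first. Below is a minimal Python sketch under stated assumptions: the extension list is taken from the post above, matching is case-insensitive so that CN01628B.tif is covered, and the function name and sample line are made up.

```python
import re
import unicodedata

# Anything that looks like an image filename, in upper or lower case.
IMAGE_FILE = re.compile(
    r"[\w-]+\.(?:tiff?|gif|jpe?g|png|eps|bmp|webp)\b", re.IGNORECASE
)

def all_caps_ignoring_image_files(line):
    """Return all-caps words in a line, skipping image filenames only."""
    without_files = IMAGE_FILE.sub(" ", line)
    return [
        word
        for word in re.findall(r"\w+", without_files)
        if all(unicodedata.category(ch) == "Lu" for ch in word)
    ]

line = r"\fig NOAH builds|CN01628B.tif\fig* the LORD"
print(all_caps_ignoring_image_files(line))  # ['NOAH', 'LORD']
```

The caption word `NOAH` is still reported, while the filename is dropped before the check runs.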
