I found an instance of a capitalised word following a comma that shouldn’t be capitalised, and now I’m wondering if there are more. Obviously, many names fit these criteria and would also be found with the simple expression:
\,\s[A-Z]
Is there any way to exclude even the more common names from this, e.g., ‘God’?

Paratext by (615 points)

1 Answer

Best answer

Hi Paul

I know of no easy way to “ignore” capitalized names. Might I suggest
using an expression in RegExPal to find whole capital-initial words
rather than just the capital letter. With the list sorted, you could
look through it and easily pick out the occurrences that are NOT
names.

The following looks for a comma followed by a space, then a capital
letter, followed by a string of lowercase letters with possible diacritics.

, \p{Lu}[\p{Ll}\p{M}]+

The matches look like:

[Image Removed]

The sorted list looks like:

[Image Removed]
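For anyone who wants to try the same extract-and-sort idea outside Paratext, here is a minimal TypeScript sketch. It is only an illustration: the helper name extractSorted and the sample sentence are made up, and the backslashes are doubled only because the pattern sits inside a string literal.

```typescript
// The expression from the post: comma, space, one uppercase letter, then
// one or more lowercase letters or combining marks.
const CAPITAL_AFTER_COMMA = ", \\p{Lu}[\\p{Ll}\\p{M}]+";

// Hypothetical helper: return every hit, without the leading ", ", sorted.
function extractSorted(text: string): string[] {
  // "g" finds every match; "u" enables the \p{...} Unicode properties.
  const re = new RegExp(CAPITAL_AFTER_COMMA, "gu");
  const hits: string[] = text.match(re) || [];
  return hits.map((hit) => hit.slice(2)).sort();
}

// Made-up sample text, not taken from any real project.
const sampleVerse =
  "In the beginning, God spoke, And it was so, Then Moses went, and they saw.";
console.log(extractSorted(sampleVerse)); // [ 'And', 'God', 'Then' ]
```

In a sorted list like this, the names (God) sit next to the suspicious entries (And, Then), which is what makes them quick to pick out by eye.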

Hope this helps.

D anon467281 Global Publishing Services WBT Central (Africa, Europe,
Eurasia thru India) DBL Curator, Scripture Typesetting trainer & Regular
Expression “specialist” Dallas, TX

Beautiful (once I found Tools > Count/Extract…)! Thanks, anon467281.

Paul

Follow on question: I’m trying to include the next three or four words in
the sorted list, i.e., some context (this would be helpful to determine if
a result is part of a Technical Term). This works:
, \p{Lu}[\p{Ll}] \p{Ll}[\p{Ll}]+
and returns a capitalised word followed by a space and any lowercase word;
but this doesn’t:
, \p{Lu}[\p{Ll}] \p{Lu}[\p{Ll}]+
and neither does this:
, \p{Lu}[\p{Ll}] \p{Ll}[\p{Ll}] \p{Lu}[\p{Ll}]+
Is it not possible to look up two capitalised words in a row?

Paul

Paul,

Make sure you have the + (one or more) at the end of each “word”. In your
examples you only have the + at the end of the last word. This means the
earlier words must be exactly two letters long: an uppercase letter, a
single lowercase letter, and then the space before the next word.

anon848905
Americas Area Language Technology Coordinator
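A small TypeScript sketch of the difference the + makes, in case it helps to see it run. The sample sentences are invented, and the backslashes are doubled only because the patterns are written as string literals.

```typescript
// Paul's second attempt: the first word has no +, so it must be exactly
// one uppercase letter plus one lowercase letter.
const withoutPlus = new RegExp(", \\p{Lu}[\\p{Ll}] \\p{Lu}[\\p{Ll}]+", "u");

// The same pattern with + after the first word as well.
const withPlus = new RegExp(", \\p{Lu}[\\p{Ll}]+ \\p{Lu}[\\p{Ll}]+", "u");

// Made-up sample sentences.
const longFirstWord = "He left, Then Moses spoke.";
const twoLetterFirstWord = "So, Oh Lord";

console.log(withoutPlus.test(longFirstWord));      // false: "Then" is too long
console.log(withPlus.test(longFirstWord));         // true
console.log(withoutPlus.test(twoLetterFirstWord)); // true: "Oh" is two letters
```

So two capitalised words in a row can be found; the original attempts only failed because the first word was limited to two letters.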

Thanks, anon848905! I didn’t understand the significance of +. But something
about your response actually helped me come up with something more flexible:
, \p{Lu}[\p{Ll}]+ \w+ \w+ \w+
As long as the first condition is met–comma space Uppercase.word–it will
return any kind of word following, Capital or lowercase. And I can extend
the ‘context’ as far as I want–even before the comma, if so desired. Fun
fun!

Thanks again,

Paul

Hi Paul

Some explanations of regular expression syntax, to help you learn:

  • Technically \w matches letters, digits, and the underscore, while
    \p{L} matches letters only.
  • The + matches one or more occurrences of the preceding item. An *
    matches zero or more.
  • The [ ] are your own definition of a class that will match any
    single character listed inside the brackets.
    o When there is only one item in the brackets, the brackets are not
      needed: [\p{Ll}] is the same as \p{Ll}. The square brackets do no
      harm, but they are unnecessary.
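The points above can be checked with a short TypeScript sketch. The helper fullMatch is hypothetical and simply tests whether a whole string matches a pattern. One caveat: engines differ on exactly what \w covers (in JavaScript it is limited to ASCII letters, digits, and the underscore), so treat the first line as illustrative only.

```typescript
// Hypothetical helper: true when the whole candidate matches the pattern.
function fullMatch(pattern: string, candidate: string): boolean {
  // ^(?:...)$ anchors the pattern to the start and end of the candidate.
  return new RegExp("^(?:" + pattern + ")$", "u").test(candidate);
}

// \w accepts digits and the underscore; \p{L} accepts letters only.
console.log(fullMatch("\\w+", "abc_123"));    // true
console.log(fullMatch("\\p{L}+", "abc_123")); // false

// + needs at least one lowercase letter after the capital; * accepts none.
console.log(fullMatch("\\p{Lu}\\p{Ll}+", "A")); // false
console.log(fullMatch("\\p{Lu}\\p{Ll}*", "A")); // true

// Brackets around a single item change nothing.
console.log(fullMatch("[\\p{Ll}]+", "word")); // true
console.log(fullMatch("\\p{Ll}+", "word"));   // true
```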

D anon467281

* would be better than + for a language like English, because you need to allow one-letter words, like “a”.

@Paul: @anon467281’s expression was matching one-or-more of lowercase-or-modifier. The or in my sentence means you need the square brackets – you’re listing more than one thing that can match each time – each of the multiple times the + looks for the “one or more” it wants to match.

A modifier is typically a diacritic (an accent on a letter) and is needed if you might have decomposed diacritics.

I’d suggest you might need to include word-medial punctuation for many languages, too. So for English (and with some context, as you suggested):

, \p{Lu}[\p{Ll}\p{M}-'’]* [\p{L}-'’]+ [\p{L}-'’]+

This allows hyphens and either straight or curly apostrophes. The \p{M} allows for loan words that have diacritics. But a flaw is that, by grabbing some context, it won’t match if there are fewer than three words between the comma and the next punctuation mark (you could add more regex code to fix that).
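Here is a rough TypeScript sketch of that expression, including the flaw. It makes one change that is my own assumption rather than part of the expression above: the hyphen is moved to the end of each bracket, where it is taken literally, because JavaScript's Unicode mode is stricter about a hyphen placed after \p{...}. The helper name firstHit and the sample sentences are invented.

```typescript
// Capitalised word after a comma, plus two words of context. Hyphens and
// straight or curly apostrophes are allowed inside words.
const WITH_CONTEXT =
  ", \\p{Lu}[\\p{Ll}\\p{M}'’-]* [\\p{L}'’-]+ [\\p{L}'’-]+";

// Hypothetical helper: return the first match, or null if there is none.
function firstHit(text: string): string | null {
  const m = new RegExp(WITH_CONTEXT, "u").exec(text);
  return m ? m[0] : null;
}

console.log(firstHit("He said, Don't over-think it.")); // ", Don't over-think it"
console.log(firstHit("He said, A well-known man."));    // ", A well-known man"
// The flaw: only two words before the full stop, so nothing is found.
console.log(firstHit("He said, Go now."));              // null
```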

You might want to allow for more than one capital in the first word. e.g. if “eVisa” is a word, and you want it capitalised “EVisa” at the start of a sentence, but not after a comma:

, \p{Lu}[\p{L}\p{M}-'’]* [\p{L}-'’]+ [\p{L}-'’]+

– the only change is that \p{Ll} (lowercase letter) has become \p{L} (any letter) in the first word: just one lowercase l removed from the second character class (character classes are the \p things).
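A quick TypeScript sketch of the difference, under the same assumption as before that the hyphen sits at the end of each bracket so it is taken literally. The helper findWithContext and the sample sentence are made up for illustration.

```typescript
// Hypothetical helper: firstWordTail is the bracketed class used for the
// rest of the first word; two words of context follow it.
function findWithContext(firstWordTail: string, text: string): string | null {
  const pattern =
    ", \\p{Lu}" + firstWordTail + "* [\\p{L}'’-]+ [\\p{L}'’-]+";
  const m = new RegExp(pattern, "u").exec(text);
  return m ? m[0] : null;
}

const LOWERCASE_TAIL = "[\\p{Ll}\\p{M}'’-]"; // lowercase letters only
const ANY_LETTER_TAIL = "[\\p{L}\\p{M}'’-]"; // any letter, so "EVisa" fits

// Made-up sample sentence.
const visaLine = "They applied, EVisa forms were ready.";
console.log(findWithContext(LOWERCASE_TAIL, visaLine));  // null
console.log(findWithContext(ANY_LETTER_TAIL, visaLine)); // ", EVisa forms were"
```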

wdavidhj & Paul & others

First, these types of discussions are fun and educational. They can
present the potential power of using regular expressions.

The expression can be as complicated as you want to make it. Its
complexity depends on the features of your language, such as mid-word
capitalization, word-medial characters such as a hyphen or apostrophe,
and other variables unique to your language.

Paul introduced some great considerations. I have responded to some of
those and added a new potential regular expression to use in the dialog
that follows.

D anon467281

wdavidhj, January 18:

    * would be better than + for a language like English, because you
    need to allow one-letter words, like “a”.

    @Paul: @anon467281’s expression was matching one-or-more of
    lowercase-or-modifier†. The “or” in my sentence means you need the
    square brackets – you’re listing more than one thing that can match
    each time the + looks for the “one or more” it wants to match.

    † A modifier is typically a diacritic (an accent on a letter) and is
    needed if you might have decomposed diacritics.

    I’d suggest you might need to include word-medial punctuation for
    many languages, too. So for English (and with some context, as you
    suggested):

    , \p{Lu}[\p{Ll}\p{M}-'’]* [\p{L}-'’]+ [\p{L}-'’]+

    This allows hyphens and either straight or curly apostrophes. The
    \p{M} allows for loan words that have diacritics.

Inside the square brackets, a - between two items is a range character,
matching anything from the character before thru the character after, as
in a-z matching the letters a thru z. To be sure a hyphen is taken
literally, place it first or last inside the brackets.

wdavidhj:

    But a flaw is that, by grabbing some context, it won’t match if
    there are fewer than three words between the comma and the next
    punctuation mark (you could add more regex code to fix that).

Simply placing the space and the 2nd word in parentheses, with an *
(asterisk) after the closing parenthesis, would match zero to an
unlimited number of following words:

( [\p{L}-'’]+)*

But that is probably far more context than you need to evaluate.

A slight variation of the expression limits the number of words to 0-4
after the initial comma and capitalized word. I might suggest also
matching an optional punctuation character at the end of each word. That
would make the final expression look like:

, \p{Lu}[\p{L}\p{M}-'’]*\p{P}?( [\p{L}\p{M}-'’]+\p{P}?){0,4}

The {0,4} repeats what precedes it (in this case a space and a word with
possible ending punctuation) as few as zero times and as many as 4 times.
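To see the {0,4} repeat in action, here is a minimal TypeScript sketch. It rests on a few assumptions of mine: each following word is one or more characters long (so there is a + inside the group), the optional punctuation follows each word directly, and the hyphen is placed last in the brackets so it cannot be read as a range. The helper name contextHit and the sample sentences are invented.

```typescript
// Capitalised word after a comma, then zero to four more words, each with
// an optional trailing punctuation mark.
const UP_TO_FOUR_WORDS =
  ", \\p{Lu}[\\p{L}\\p{M}'’-]*\\p{P}?( [\\p{L}\\p{M}'’-]+\\p{P}?){0,4}";

// Hypothetical helper: return the first match, or null if there is none.
function contextHit(text: string): string | null {
  const m = new RegExp(UP_TO_FOUR_WORDS, "u").exec(text);
  return m ? m[0] : null;
}

// Zero following words is still a match.
console.log(contextHit("Yes, Lord")); // ", Lord"
// A short clause is matched up to and including its full stop.
console.log(contextHit("He said, Go now.")); // ", Go now."
// A long clause is cut off after four words of context.
console.log(contextHit("Then, Moses went up the mountain of God."));
// ", Moses went up the mountain"
```

Because the repeat may be zero, short clauses are no longer skipped, which addresses the flaw of the fixed three-word version.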
