Regex to find capitals following commas

Best answer

Hi Paul

No easy way to “ignore” capitalized names that I know of. Might I
suggest using an expression in RegExPal to find all capital-initial
words rather than just the capital letter? With the list sorted, you
could look thru the list and easily pick out the occurrences that are
NOT names.

The following looks for a comma followed by a space, then a capital
letter, followed by a string of lowercase letters with possible diacritics.

, \p{Lu}[\p{Ll}\p{M}]+

The original post showed two screenshots at this point: the matches, and
the same list sorted.
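
For anyone who wants to try the find-and-sort idea outside RegExPal, here is a minimal sketch in Python. Two assumptions to flag: Python's built-in re module does not understand \p{...}, so uclass below is a hypothetical helper that builds an equivalent character list from unicodedata (Basic Multilingual Plane only), and the sample sentence is made up for illustration.

```python
import re
import unicodedata

def uclass(*cats):
    # Hypothetical helper (not part of any library): Python's built-in re
    # has no \p{...}, so list every BMP character whose Unicode general
    # category starts with one of the given prefixes.
    return "".join(
        re.escape(chr(cp))
        for cp in range(0x10000)
        if unicodedata.category(chr(cp)).startswith(cats)
    )

LU, LL, M = uclass("Lu"), uclass("Ll"), uclass("M")

# Stand-in for:  , \p{Lu}[\p{Ll}\p{M}]+
PATTERN = re.compile(f", [{LU}][{LL}{M}]+")

def capitals_after_commas(text):
    """Return each distinct match once, in sorted order."""
    return sorted(set(PATTERN.findall(text)))

# Made-up sample text; \u00c9 is a precomposed capital E with acute.
sample = "We saw Peter, James, and John. He said, Go now, But wait, \u00c9lise."
for hit in capitals_after_commas(sample):
    print(hit)
```

Sorting and de-duplicating puts repeated names next to each other, which is what makes the non-names easy to spot by eye.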

Hope this helps.

D anon467281 Global Publishing Services WBT Central (Africa, Europe,
Eurasia thru India) DBL Curator, Scripture Typesetting trainer & Regular
Expression “specialist” Dallas, TX

Jan 14, 2018 answered by anon467281 (571 points)
Jan 14, 2018 reshown

Hi Paul

Some explanations of regular expression syntax to help your knowledge
acquisition.

* Technically, \w matches letters, digits, and the underscore, while
  \p{L} matches letters only.
* The + matches 1 or more occurrences of the previous item. An *
  matches zero or more.
* The [ ] are your self-made definition of a class that will match any
  one character listed inside the brackets.
  o When you have only one item in the brackets, there is no need for
    the brackets.
  o [\p{Ll}] is the same as \p{Ll}.
  o It does not hurt to have the square brackets; they are, however,
    unnecessary.
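
A small sketch of these points in Python, for anyone who wants to experiment. Assumptions: Python's built-in re module has \w and \d but not \p{...}, so uclass is a hypothetical helper that emulates \p{L}, \p{Lu} and \p{Ll} from unicodedata (BMP only), and the bracket equivalence is shown with \d, which re does support.

```python
import re
import unicodedata

def uclass(*cats):
    # Hypothetical helper: emulate \p{...} by listing every BMP character
    # whose Unicode general category starts with one of the prefixes.
    return "".join(
        re.escape(chr(cp))
        for cp in range(0x10000)
        if unicodedata.category(chr(cp)).startswith(cats)
    )

L, LU, LL = uclass("L"), uclass("Lu"), uclass("Ll")

sample = "a1_\u00e9"
word_chars = re.findall(r"\w", sample)   # letters, digits and underscore
letters = re.findall(f"[{L}]", sample)   # letters only, like \p{L}

# + demands at least one lowercase letter after the capital; * allows none.
plus_form = re.compile(f"[{LU}][{LL}]+")
star_form = re.compile(f"[{LU}][{LL}]*")

# Brackets around a single item change nothing: [\d] behaves like \d.
same = re.findall(r"[\d]", "a1b22") == re.findall(r"\d", "a1b22")

print(word_chars, letters)
print(plus_form.fullmatch("A"), star_form.fullmatch("A") is not None, same)
```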


Jan 15, 2018 commented by anon467281 (571 points)
Jan 15, 2018 reshown

* would be better than + for a language like English, because you need to allow one-letter words, like “a”.

@Paul: @anon467281’s expression was matching one-or-more of lowercase-or-modifier†. The “or” in my sentence means you need the square brackets: you’re listing more than one thing that can match each time, that is, each of the multiple times the + looks for the “one or more” it wants to match.

† Modifier (\p{M}, the Unicode “Mark” category) is typically a diacritic, an accent on a letter, and is needed if you might have decomposed diacritics.

I’d suggest you might need to include word-medial punctuation for many languages, too. So for English (and with some context, as you suggested):

, \p{Lu}[\p{Ll}\p{M}-'’]* [\p{L}-'’]+ [\p{L}-'’]+

This allows hyphens and either straight or curly apostrophes. The \p{M} allows for loan words that have diacritics. But a flaw is that, by grabbing some context, it won’t match if there are fewer than three words between the comma and the next punctuation mark (you could add more regex code to fix that).

You might want to allow for more than one capital in the first word. e.g. if “eVisa” is a word, and you want it capitalised “EVisa” at the start of a sentence, but not after a comma:

, \p{Lu}[\p{L}\p{M}-'’]* [\p{L}-'’]+ [\p{L}-'’]+

That is just one lowercase l removed from the second character class, so \p{Ll} becomes \p{L} (character classes are the \p things).
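
To see how the two variants differ, and to see the three-word flaw in action, here is a rough Python sketch. Assumptions: uclass is a hypothetical helper standing in for \p{...} (which Python's built-in re lacks, BMP only), the sample sentences are invented, and the hyphen is written last in each bracket because Python would otherwise try to read it as a range.

```python
import re
import unicodedata

def uclass(*cats):
    # Hypothetical helper: emulate \p{...} by listing every BMP character
    # whose Unicode general category starts with one of the prefixes.
    return "".join(
        re.escape(chr(cp))
        for cp in range(0x10000)
        if unicodedata.category(chr(cp)).startswith(cats)
    )

L, LU, LL, M = uclass("L"), uclass("Lu"), uclass("Ll"), uclass("M")

# First word may continue with lowercase letters only (\p{Ll}).
LOWER_FIRST = re.compile(
    f", [{LU}][{LL}{M}'\u2019-]* [{L}'\u2019-]+ [{L}'\u2019-]+"
)
# First word may continue with any letters (\p{L}), e.g. "EVisa".
ANY_CASE_FIRST = re.compile(
    f", [{LU}][{L}{M}'\u2019-]* [{L}'\u2019-]+ [{L}'\u2019-]+"
)

def find(pattern, text):
    """Return the first match as text, or None if nothing matches."""
    m = pattern.search(text)
    return m.group(0) if m else None

print(find(LOWER_FIRST, "Yes, EVisa holders may enter."))     # None
print(find(ANY_CASE_FIRST, "Yes, EVisa holders may enter."))  # , EVisa holders may
print(find(ANY_CASE_FIRST, "Then, Stop."))                    # None: too few words
```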

Jan 18, 2018 commented by wdavidhj (1.4k points)

wdavidhj & Paul & others

First, these types of discussions are fun and educational. They can
present the potential power of using regular expressions.

The expression can be as complicated as you want to make it. Its
complexity depends on your language's words: mid-word capitalization,
word-medial characters such as a hyphen or apostrophe, and other
variables unique to your language.

Paul introduced some great considerations. I have responded to some of
those and added a new potential regular expression to use in the dialog
that follows.


On January 18, wdavidhj wrote:

> I’d suggest you might need to include word-medial punctuation for
> many languages, too. So for English (and with some context, as you
> suggested):
>
> , \p{Lu}[\p{Ll}\p{M}-'’]* [\p{L}-'’]+ [\p{L}-'’]+
>
> This allows hyphens and either straight or curly apostrophes.

Inside the square brackets, a - between two items is a range character,
matching anything from the character before thru the character after,
as in a-z matching the letters a thru z.

> The \p{M} allows for loan words that have diacritics. But a flaw is
> that, by grabbing some context, it won’t match if there are fewer
> than three words between the comma and the next punctuation mark (you
> could add more regex code to fix that).

Simply placing the space and the 2nd word in parentheses, and adding an
* (asterisk) after the closing parenthesis, as in the following, would
match anywhere from zero to an unlimited number of following words. But
that is probably far more context than you need to evaluate.

( [\p{L}-'’]+)*
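
Both points, the range meaning of - inside brackets and the open-ended group of following words, can be tried out with a short Python sketch. Assumptions: uclass is a hypothetical helper standing in for \p{...} (Python's built-in re does not support it, BMP only), the sample sentences are invented, and the hyphen is placed last in each bracket so Python treats it as a literal.

```python
import re
import unicodedata

def uclass(*cats):
    # Hypothetical helper: emulate \p{...} by listing every BMP character
    # whose Unicode general category starts with one of the prefixes.
    return "".join(
        re.escape(chr(cp))
        for cp in range(0x10000)
        if unicodedata.category(chr(cp)).startswith(cats)
    )

L, LU, M = uclass("L"), uclass("Lu"), uclass("M")

# Between two characters, - is a range; written last, it is a plain hyphen.
as_range = re.findall(r"[a-c]", "abc-")     # ['a', 'b', 'c']
as_literal = re.findall(r"[ac-]", "abc-")   # ['a', 'c', '-']

# Capitalized word after a comma, then zero or more following words.
OPEN_ENDED = re.compile(f", [{LU}][{L}{M}'\u2019-]*(?: [{L}'\u2019-]+)*")

def context(text):
    """Return the first match as text, or None if nothing matches."""
    m = OPEN_ENDED.search(text)
    return m.group(0) if m else None

print(as_range, as_literal)
print(context("Then, Stop."))
print(context("After that, He went home quickly, then slept."))
```

With the * on the group, a lone capitalized word before a full stop now matches, and a longer run of words is captured up to the next punctuation mark.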

A slight variation of the expression limits the number of words after
the initial comma and capitalized word to between 0 and 4.

I might also suggest matching an optional punctuation character at the
end of each word. That would make the final expression look like:

, \p{Lu}[\p{L}\p{M}-'’]*\p{P}?( [\p{L}\p{M}-'’]+\p{P}?){0,4}

The {0,4} repeats what precedes it (in this case, a space and a word
with optional ending punctuation) as few as zero times and as many as
4 times.
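
Here is a hedged Python sketch of that idea: a comma, a capitalized word, then up to four more words, each with optional trailing punctuation. Assumptions: uclass is a hypothetical helper standing in for \p{...} (not available in Python's built-in re, BMP only), the sample sentences are invented, the hyphen sits last in each bracket so it is read literally, and each following word is written with a + so that it consumes the whole word.

```python
import re
import unicodedata

def uclass(*cats):
    # Hypothetical helper: emulate \p{...} by listing every BMP character
    # whose Unicode general category starts with one of the prefixes.
    return "".join(
        re.escape(chr(cp))
        for cp in range(0x10000)
        if unicodedata.category(chr(cp)).startswith(cats)
    )

L, LU, M, P = uclass("L"), uclass("Lu"), uclass("M"), uclass("P")

# {{0,4}} is how a literal {0,4} is written inside a Python f-string.
UP_TO_FOUR = re.compile(
    f", [{LU}][{L}{M}'\u2019-]*[{P}]?(?: [{L}{M}'\u2019-]+[{P}]?){{0,4}}"
)

def context(text):
    """Return the first match as text, or None if nothing matches."""
    m = UP_TO_FOUR.search(text)
    return m.group(0) if m else None

print(context("After that, He went home quickly and then slept well."))
print(context("Then, Stop."))
print(context("So, Yes, we can."))
```

Capping the repeat at four keeps each match short enough to scan in a sorted list while still showing some context.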

Jan 18, 2018 commented by anon467281 (571 points)
Jan 18, 2018 reshown
