wdavidhj & Paul & others
First, these types of discussions are fun and educational. They can
present the potential power of using regular expressions.
The expression can be as complicated as you want to make it. Its
complexity depends on the words of your language: mid-word
capitalization, word-medial characters such as a hyphen or apostrophe,
and other variables unique to your language.
Paul introduced some great considerations. I have responded to some of
those and added a new potential regular expression to use in the dialog
that follows.
D anon467281
Global Publishing Services
WBT Central (Africa, Europe, Eurasia thru India) DBL Curator, Scripture Typesetting trainer & Regular Expression "specialist"
Dallas, TX
wdavidhj
January 18
anon467281:
* The + matches one or more occurrences of the preceding item. An
asterisk (*) matches zero or more.
The * would be better than + for a language like English, because you need
to allow one-letter words, like “a”.
* The [ ] are your self-made definition of a class that will
match any character inside the brackets.
When you have one item in the brackets there is no need for
the brackets.
[\p{Ll}] is the same as \p{Ll}.
It does not hurt to have the square brackets, but they are
unnecessary.
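The two points above (+ versus *, and brackets around a single item) can be tried out in a short TypeScript sketch. This is only an illustration: the variable names and sample words are made up here, and the patterns are anchored with ^ and $ so that each test looks at one whole word. Because the patterns are built from strings, each backslash is doubled; the "u" flag is what enables \p{Lu} and \p{Ll} in JavaScript.

```typescript
// Sketch only: names and sample words are illustrative, not from the thread.
// Capital letter followed by ONE or more lowercase letters.
const wordPlus = new RegExp("^\\p{Lu}\\p{Ll}+$", "u");
// Capital letter followed by ZERO or more lowercase letters.
const wordStar = new RegExp("^\\p{Lu}\\p{Ll}*$", "u");

// "+" needs at least one lowercase letter, so the one-letter word fails.
const plusMatchesA: boolean = wordPlus.test("A");
// "*" accepts the one-letter word.
const starMatchesA: boolean = wordStar.test("A");
// Longer words match either way.
const bothMatchAnd: boolean = wordPlus.test("And") && wordStar.test("And");

// A class holding a single item behaves the same with or without brackets.
const bare = new RegExp("^\\p{Ll}+$", "u");
const bracketed = new RegExp("^[\\p{Ll}]+$", "u");
const sameOnWord: boolean = bare.test("word") === bracketed.test("word");
const sameOnUpper: boolean = bare.test("Word") === bracketed.test("Word");
```

Here plusMatchesA is false while starMatchesA is true, which is the one-letter-word case described above.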
@Paul: @anon467281’s expression was matching
one-or-more of /lowercase-or-modifier/†. The /or/ in my sentence
means you need the square brackets – you’re listing more than one
thing that can match each time – each of the multiple times the +
looks for the “one or more” it wants to match.
† A modifier is typically a diacritic – an accent on a letter – and is
needed if you might have decomposed diacritics.
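To see why the brackets matter once there are two things to match, here is a small TypeScript sketch with a decomposed diacritic. The sample word and variable names are assumptions made for illustration; the patterns are anchored so the whole word must match.

```typescript
// Sketch only: the sample word and names are illustrative.
// "é" stored as plain "e" followed by U+0301 COMBINING ACUTE ACCENT.
const decomposedWord = "cafe\u0301";
// The same word with a single precomposed "é" (U+00E9).
const composedWord = "caf\u00e9";

// Lowercase letters only.
const lowerOnly = new RegExp("^\\p{Ll}+$", "u");
// Lowercase letter OR combining mark, one or more times.
const lowerOrMark = new RegExp("^[\\p{Ll}\\p{M}]+$", "u");

// The combining accent is a mark, not a letter, so this fails.
const lowerOnlyOnDecomposed: boolean = lowerOnly.test(decomposedWord);
// Adding \p{M} inside the brackets lets the accent match.
const lowerOrMarkOnDecomposed: boolean = lowerOrMark.test(decomposedWord);
// The precomposed form is a single lowercase letter, so it matches without \p{M}.
const lowerOnlyOnComposed: boolean = lowerOnly.test(composedWord);
```

If your text is always precomposed, \p{M} is harmless; if it might be decomposed, leaving it out makes words with accents fail to match.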
I’d suggest you might need to include word-medial punctuation for many
languages, too. So for English (and with some context, as you suggested):
, \p{Lu}[\p{Ll}\p{M}-'’]* [\p{L}-'’]+ [\p{L}-'’]+
This allows hyphens and either straight or curly apostrophes. The
\p{M} allows for loan words that have diacritics. But a flaw is that, by
grabbing some context, it won’t match if there are fewer than three words
between the comma and the next punctuation mark (you could add more
regex code to fix that).
Inside the square brackets a - between two items is a range character,
matching anything from the character before thru the character after, as
in a-z matching the letters a thru z. To match a literal hyphen, escape
it as \- (or, in most regex engines, place it first or last inside the
brackets).
Simply placing the space followed by the 2nd word in parentheses and
adding an * (asterisk) after the closing parenthesis, like the
following, would match zero to an unlimited number of following words. But
that is probably far more context than you need to evaluate.
( [\p{L}\-'’]+)*
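The hyphen-as-range behaviour and the repeated-group idea can both be checked in a TypeScript sketch. Everything named here (the variables, the sample sentence) is invented for illustration, and the hyphen is written as \- so that it is always taken literally. Engines differ on an unescaped hyphen after \p{M}; JavaScript in Unicode mode rejects it outright, which the last few lines demonstrate.

```typescript
// Sketch only: names and the sample sentence are illustrative.
// Inside [ ], an unescaped "-" between two characters is a range.
const rangeClass = new RegExp("^[a-z]$", "u");     // any letter a thru z
const literalClass = new RegExp("^[a\\-z]$", "u"); // only "a", "-" or "z"

const rangeHitsM: boolean = rangeClass.test("m");
const rangeHitsHyphen: boolean = rangeClass.test("-");
const literalHitsHyphen: boolean = literalClass.test("-");
const literalHitsM: boolean = literalClass.test("m");

// Building blocks: comma, space, capitalized word; then a following word.
const start = ", \\p{Lu}[\\p{Ll}\\p{M}\\-'’]*";
const word = "[\\p{L}\\p{M}\\-'’]+";

// Fixed context: two more words are required after the capitalized word.
const fixedContext = new RegExp(start + " " + word + " " + word, "u");
// Repeated group: zero or more following words.
const openContext = new RegExp(start + "( " + word + ")*", "u");

const shortText = "He answered, Not yet.";
// Only two words follow the comma, so the fixed-context pattern fails.
const fixedMatchesShort: boolean = fixedContext.test(shortText);
// The repeated group takes as many words as are available.
const openMatch = openContext.exec(shortText);
const openMatchedText: string | null = openMatch === null ? null : openMatch[0];

// An unescaped hyphen straight after \p{M} is a syntax error in Unicode mode.
let unescapedRejected = false;
try {
  new RegExp("[\\p{M}-’]", "u");
} catch (e) {
  unescapedRejected = true;
}
```

With the sample sentence, fixedMatchesShort is false and openMatchedText is ", Not yet", which is the flaw and the fix described above.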
A slight variation of the expression, as follows, limits the number of
words after the initial comma and capitalized word to 0-4.
I might also suggest matching an optional punctuation character at the
end of each word. That would make the final expression look like:
, \p{Lu}[\p{L}\p{M}\-'’]*\p{P}?( [\p{L}\p{M}\-'’]+\p{P}?){0,4}
The {0,4} repeats what precedes it (in this case a space and a word with
possible ending punctuation) as few as zero times and as many as 4 times.
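As a rough check of the {0,4} idea, here is a TypeScript sketch of a comma, a capitalized word, and up to four following words, each with optional trailing punctuation. It assumes a + after the word class inside the group and an escaped hyphen; the function name firstMatch and the sample sentences are made up for illustration.

```typescript
// Sketch only: firstMatch and the sample sentences are illustrative.
// Comma, space, capitalized word, optional punctuation, then zero to four
// groups of: space, word, optional punctuation.
const contextPattern = new RegExp(
  ", \\p{Lu}[\\p{L}\\p{M}\\-'’]*\\p{P}?( [\\p{L}\\p{M}\\-'’]+\\p{P}?){0,4}",
  "u"
);

// Returns the first matched stretch of text, or null when nothing matches.
function firstMatch(text: string): string | null {
  const m = contextPattern.exec(text);
  return m === null ? null : m[0];
}

// Six words follow the capitalized word, but only four are taken.
const longResult = firstMatch("He said, Peace be with you all, my friends.");
// No following words at all: the group repeats zero times.
const shortResult = firstMatch("Then, Yes.");
// A lowercase word after the comma is not matched.
const lowerResult = firstMatch("first, second and third");
```

For these samples, longResult is ", Peace be with you all," (the comma after "all" is the optional punctuation of the fourth word), shortResult is ", Yes.", and lowerResult is null.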