I’m a regular expressions nerd. While I was writing a regex to look for verses with no ending punctuation:
[^\\][^\p{P}\d]\r\n\\v
or
[^\\l][^\p{P}\d]\r\n\\v \d+ \p{Lu}
I realized that the C# flavor of regex in RegexPal (and in FLEx filtering/Process) supports the shorthand names for Unicode character categories, such as \p{Lu}, but not the longhand ones, such as \p{Uppercase_Letter}.
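If you want to try the same check outside RegexPal, here is a rough sketch in Python. Python's built-in `re` module does not understand `\p{...}`, so the sketch looks up each character's category with `unicodedata.category` instead. The function name and the sample text are made up for illustration, and the sketch skips the leading `[^\\]` part of the regex.

```python
import unicodedata


def verses_missing_end_punctuation(text):
    # Approximates [^\p{P}\d]\r\n\\v : flag every line that is followed
    # by a \v marker and does not end in punctuation or a decimal digit.
    lines = text.splitlines()
    flagged = []
    for current, following in zip(lines, lines[1:]):
        if not following.startswith("\\v"):
            continue  # only line ends followed by a verse marker matter
        if not current:
            continue  # an empty line has no last character to test
        category = unicodedata.category(current[-1])
        # \p{P} covers every category starting with "P"; .NET's \d is Nd.
        if not (category.startswith("P") or category == "Nd"):
            flagged.append(current)
    return flagged


# Hypothetical sample: verse 2 has no ending punctuation.
sample = "\\v 1 In the beginning.\r\n\\v 2 And the earth\r\n\\v 3 Then light came!\r\n"
print(verses_missing_end_punctuation(sample))
```

Like the regex, this only checks a line that is followed by another \v marker, so the very last verse in a file is never flagged.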
Some of the most useful are:
- \p{L}: Letter (nearly any orthography)
- \p{Lu}: Uppercase letter
- \p{Ll}: Lowercase letter
- \p{P}: Punctuation
- \p{M}: Any diacritic
If the letter after the backslash is a capital P, as in \P{L}, the class is negated: it matches any character that is NOT in that category.
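To see what the lowercase and capital forms mean in practice, here is a small Python sketch. It is not regex; it is a stand-in built on `unicodedata.category`, and the two helper names are my own invention. A one-letter code like L is treated as "any category starting with L", which is how \p{L} behaves.

```python
import unicodedata


def in_category(ch, code):
    # Rough stand-in for \p{code}: "L" matches Lu, Ll, Lt, Lm and Lo,
    # while a two-letter code such as "Lu" matches only that category.
    return unicodedata.category(ch).startswith(code)


def not_in_category(ch, code):
    # Rough stand-in for \P{code} (capital P): the negation of the above.
    return not in_category(ch, code)


text = "Ab1, \u00e9"  # \u00e9 is a precomposed e with acute accent
print([ch for ch in text if in_category(ch, "L")])      # letters
print([ch for ch in text if in_category(ch, "Lu")])     # uppercase only
print([ch for ch in text if not_in_category(ch, "L")])  # everything else
```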
The whole list:
- \p{L}: any kind of letter from any language.
- \p{Ll}: a lowercase letter that has an uppercase variant.
- \p{Lu}: an uppercase letter that has a lowercase variant.
- \p{Lt}: a letter that appears at the start of a word when only the first letter of the word is capitalized.
- \p{L&}: a letter that exists in lowercase and uppercase variants (combination of Ll, Lu and Lt). This one is not in .NET's list of supported categories, so in the C# flavor use [\p{Ll}\p{Lu}\p{Lt}] instead.
- \p{Lm}: a special character that is used like a letter.
- \p{Lo}: a letter or ideograph that does not have lowercase and uppercase variants.
- \p{M}: a character intended to be combined with another character (e.g. accents, umlauts, enclosing boxes, etc.).
- \p{Mn}: a character intended to be combined with another character without taking up extra space (e.g. accents, umlauts, etc.).
- \p{Mc}: a character intended to be combined with another character that takes up extra space (vowel signs in many Eastern languages).
- \p{Me}: a character that encloses the character it is combined with (circle, square, keycap, etc.).
- \p{Z}: any kind of whitespace or invisible separator.
- \p{Zs}: a whitespace character that is invisible, but does take up space.
- \p{Zl}: line separator character U+2028.
- \p{Zp}: paragraph separator character U+2029.
- \p{S}: math symbols, currency signs, dingbats, box-drawing characters, etc.
- \p{Sm}: any mathematical symbol.
- \p{Sc}: any currency sign.
- \p{Sk}: a combining character (mark) as a full character on its own.
- \p{So}: various symbols that are not math symbols, currency signs, or combining characters.
- \p{N}: any kind of numeric character in any script.
- \p{Nd}: a digit zero through nine in any script except ideographic scripts.
- \p{Nl}: a number that looks like a letter, such as a Roman numeral.
- \p{No}: a superscript or subscript digit, or a number that is not a digit 0–9 (excluding numbers from ideographic scripts).
- \p{P}: any kind of punctuation character.
- \p{Pd}: any kind of hyphen or dash.
- \p{Ps}: any kind of opening bracket.
- \p{Pe}: any kind of closing bracket.
- \p{Pi}: any kind of opening quote.
- \p{Pf}: any kind of closing quote.
- \p{Pc}: a punctuation character such as an underscore that connects words.
- \p{Po}: any kind of punctuation character that is not a dash, bracket, quote or connector.
- \p{C}: invisible control characters and unused code points.
- \p{Cc}: an ASCII or Latin-1 control character: 0x00–0x1F and 0x7F–0x9F.
- \p{Cf}: invisible formatting indicator.
- \p{Co}: any code point reserved for private use.
- \p{Cs}: one half of a surrogate pair in UTF-16 encoding.
- \p{Cn}: any code point to which no character has been assigned.
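When I'm not sure which category a character belongs to, I find it easiest to just ask. The Python sketch below (the function name is hypothetical) counts the two-letter categories in a string using `unicodedata.category`. It also shows why \p{M} matters: the same é is a single Ll character when precomposed, but an Ll plus an Mn once it is decomposed with NFD normalization.

```python
import unicodedata
from collections import Counter


def category_counts(text):
    # Count how many characters fall in each two-letter general category.
    return Counter(unicodedata.category(ch) for ch in text)


composed = "\u00e9"                                  # é as one code point
decomposed = unicodedata.normalize("NFD", composed)  # e followed by U+0301

print(category_counts(composed))    # one lowercase letter (Ll)
print(category_counts(decomposed))  # a lowercase letter plus a mark (Mn)

# Curly quotes fall in Pi (opening) and Pf (closing), the colon in Po.
print(category_counts("Verse 12: \u201cAmen\u201d"))
```

If your data might contain decomposed characters, a pattern like \p{L}\p{M}* is a safer way to match "one letter as the reader sees it" than \p{L} alone.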
Full Documentation:
https://www.regular-expressions.info/unicode.html#category