Some Advanced Regex Options for RegexPal

Question

I’m a regular expressions nerd. While writing a regex to find verses with no ending punctuation:
[^\\][^\p{P}\d]\r\n\\v
or
[^\\l][^\p{P}\d]\r\n\\v \d+ \p{Lu}
I realized that the C# flavor of regex used in RegexPal (and in FLEx filtering/Process) supports the shorthand (but not the longhand) Unicode character categories.
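For anyone who wants to see what the first pattern is checking, here is a rough sketch of the same test in Python. Python's built-in re module does not understand \p{...}, so this version looks up each character's category with the standard unicodedata module instead. The function name and the sample text are made up for illustration, and this is not how RegexPal works internally.

```python
import unicodedata

MARKER = "\r\n\\v"  # CRLF followed by the literal text \v


def unpunctuated_verse_ends(text):
    """Return the offset of each CRLF + \\v preceded by text with no ending punctuation."""
    hits = []
    pos = text.find(MARKER)
    while pos != -1:
        if pos >= 2:
            before, last = text[pos - 2], text[pos - 1]
            cat = unicodedata.category(last)
            # Mirrors [^\\][^\p{P}\d]: the second-to-last character is not a
            # backslash, and the last character is neither punctuation (any P*
            # category) nor a decimal digit (Nd, which is what \d means in .NET).
            if before != "\\" and not cat.startswith("P") and cat != "Nd":
                hits.append(pos)
        pos = text.find(MARKER, pos + 1)
    return hits


if __name__ == "__main__":
    sample = "\\v 1 In the beginning\r\n\\v 2 And the earth.\r\n\\v 3 Light"
    for offset in unpunctuated_verse_ends(sample):
        print("missing punctuation before offset", offset)
```

With the sample above, only the line break after verse 1 is reported, because verse 2 ends in a full stop.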

Some of the most useful are:

\p{L} Letter (nearly any orthography)
\p{Lu} Uppercase letter
\p{Ll} Lowercase letter
\p{P} Punctuation
\p{M} Any combining mark (diacritics and similar)

If the P after the backslash is capital, the class is negated: it matches any character NOT in that category.

\P{L} Any non-letter
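If you are unsure whether a character would match one of these classes, you can check its category outside of RegexPal. The sketch below uses Python's standard unicodedata module; the two helper names are made up for this example, and Python's built-in re module itself does not accept \p{...} or \P{...}.

```python
import unicodedata


def in_category(ch, code):
    """True if ch belongs to the general category `code` ("L", "Lu", "P", ...)."""
    # A one-letter code covers every two-letter subcategory that starts with it,
    # so "L" accepts Lu, Ll, Lt, Lm and Lo.
    return unicodedata.category(ch).startswith(code)


def not_in_category(ch, code):
    """Counterpart of \\P{...}: True when ch is outside the category."""
    return not in_category(ch, code)


if __name__ == "__main__":
    for ch in ["A", "a", "!", "\u0301", "1"]:
        print(repr(ch), unicodedata.category(ch), in_category(ch, "L"))
```

Here in_category(ch, "L") plays the role of \p{L} and not_in_category(ch, "L") plays the role of \P{L}.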

The whole list:

\p{L}: any kind of letter from any language.
- \p{Ll}: a lowercase letter that has an uppercase variant.
- \p{Lu}: an uppercase letter that has a lowercase variant.
- \p{Lt}: a letter that appears at the start of a word when only the first letter of the word is capitalized.
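- \p{L&}: a letter that exists in lowercase and uppercase variants (combination of Ll, Lu and Lt). The .NET documentation does not list L& among its supported categories, so in RegexPal the equivalent is likely [\p{Ll}\p{Lu}\p{Lt}].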
- \p{Lm}: a special character that is used like a letter.
- \p{Lo}: a letter or ideograph that does not have lowercase and uppercase variants.
\p{M}: a character intended to be combined with another character (e.g. accents, umlauts, enclosing boxes, etc.).
- \p{Mn}: a character intended to be combined with another character without taking up extra space (e.g. accents, umlauts, etc.).
- \p{Mc}: a character intended to be combined with another character that takes up extra space (vowel signs in many Eastern languages).
- \p{Me}: a character that encloses the character it is combined with (circle, square, keycap, etc.).
\p{Z}: any kind of whitespace or invisible separator.
- \p{Zs}: a whitespace character that is invisible, but does take up space.
- \p{Zl}: line separator character U+2028.
- \p{Zp}: paragraph separator character U+2029.
\p{S}: math symbols, currency signs, dingbats, box-drawing characters, etc.
- \p{Sm}: any mathematical symbol.
- \p{Sc}: any currency sign.
- \p{Sk}: a combining character (mark) as a full character on its own.
- \p{So}: various symbols that are not math symbols, currency signs, or combining characters.
\p{N}: any kind of numeric character in any script.
- \p{Nd}: a digit zero through nine in any script except ideographic scripts.
- \p{Nl}: a number that looks like a letter, such as a Roman numeral.
- \p{No}: a superscript or subscript digit, or a number that is not a digit 0–9 (excluding numbers from ideographic scripts).
\p{P}: any kind of punctuation character.
- \p{Pd}: any kind of hyphen or dash.
- \p{Ps}: any kind of opening bracket.
- \p{Pe}: any kind of closing bracket.
- \p{Pi}: any kind of opening quote.
- \p{Pf}: any kind of closing quote.
- \p{Pc}: a punctuation character such as an underscore that connects words.
- \p{Po}: any kind of punctuation character that is not a dash, bracket, quote or connector.
\p{C}: invisible control characters and unused code points.
- \p{Cc}: an ASCII or Latin-1 control character: 0x00–0x1F and 0x7F–0x9F.
- \p{Cf}: invisible formatting indicator.
- \p{Co}: any code point reserved for private use.
- \p{Cs}: one half of a surrogate pair in UTF-16 encoding.
- \p{Cn}: any code point to which no character has been assigned.
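With this many categories, it can help to take a sample of your own text and see which categories actually occur in it before deciding which class to put in a regex. A minimal sketch, again using Python's standard unicodedata module (the function name is only illustrative):

```python
import unicodedata
from collections import Counter


def category_census(text):
    """Count how many characters of each two-letter category appear in text."""
    return Counter(unicodedata.category(ch) for ch in text)


if __name__ == "__main__":
    sample = "\u00abAb 1.\u00bb"  # opening quote, letters, space, digit, full stop, closing quote
    for code, count in sorted(category_census(sample).items()):
        print(code, count)
```

The two-letter codes it reports are the same ones used inside \p{...} in the list above.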

Full Documentation:
https://www.regular-expressions.info/unicode.html#category

Paratext Feb 5, 2023 asked by Matthew_Lee (231 points)
Feb 6, 2023 reshown
