I’m a regular expressions nerd. While writing a regex to find verses with no ending punctuation:
[^\\][^\p{P}\d]\r\n\\v
or
[^\\l][^\p{P}\d]\r\n\\v \d+ \p{Lu}
I realized that the C# flavor of regex in RegexPal (and in FLEx filtering/Process) supports the shorthand Unicode character categories such as \p{Lu}, but not the longhand names such as \p{Uppercase_Letter}.
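
For anyone who wants to run a similar check outside RegexPal, here is a rough sketch in Python. Python's built-in re module does not understand \p{...}, so this is not the regex itself but my approximation of it: it looks up each character's category with unicodedata instead. The function names and the sample text are made up for illustration, and it accepts any line break rather than only \r\n.

```python
import unicodedata


def is_punct_or_digit(ch):
    # Stand-in for the class [\p{P}\d]: any punctuation category (Pd, Ps, Po, ...)
    # or a decimal digit (Nd), which is what \d matches in .NET by default.
    category = unicodedata.category(ch)
    return category.startswith("P") or category == "Nd"


def lines_missing_end_punctuation(text):
    # Hypothetical helper: returns every line that is directly followed by a
    # \v marker but does not end in punctuation or a digit.
    lines = text.splitlines()
    hits = []
    for line, following in zip(lines, lines[1:]):
        if not following.startswith("\\v"):
            continue  # the regex only looks at line breaks followed by \v
        if len(line) < 2 or line[-2] == "\\":
            continue  # mirrors [^\\]: skips one-letter markers such as \p
        if not is_punct_or_digit(line[-1]):
            hits.append(line)
    return hits


# Made-up sample text in the same backslash-marker style as the regex expects.
sample = (
    "\\v 1 In the beginning\r\n"
    "\\v 2 And the earth was void.\r\n"
    "\\p\r\n"
    "\\v 3 Let there be light\r\n"
    "\\v 4 The end."
)
print(lines_missing_end_punctuation(sample))
```

With this sample, verses 1 and 3 are reported. Verse 2 ends in a full stop, and the \p line is skipped because its last letter follows a backslash.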

Some of the most useful are:

  • \p{L} Letter (nearly any orthography)
  • \p{Lu} Uppercase letter
  • \p{Ll} Lowercase letter
  • \p{P} Punctuation
  • \p{M} Any diacritic

If the P after the backslash is a capital, the class is negated: it matches any character that is NOT in that category.

  • \P{L} Any non-letter
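
A quick way to see the difference between \p{L} and \P{L} is to emulate both. This is only a sketch in Python (whose built-in re module lacks \p{...}), so it uses unicodedata rather than a real regex, and the helper names are made up.

```python
import unicodedata


def in_category(ch, code):
    # True when the character's general category starts with `code`,
    # so "L" covers Lu, Ll, Lt, Lm and Lo, much like \p{L}.
    return unicodedata.category(ch).startswith(code)


def remove_non_letters(text):
    # Emulates replacing every \P{L} match with nothing: only letters survive.
    return "".join(ch for ch in text if in_category(ch, "L"))


def keep_non_letters(text):
    # The opposite: emulates removing every \p{L} match.
    return "".join(ch for ch in text if not in_category(ch, "L"))


print(remove_non_letters("Jn 3:16, \u00f1and\u00fa!"))  # Jnñandú
print(keep_non_letters("Jn 3:16, \u00f1and\u00fa!"))
```

Note that the accented letters here are precomposed. If the text were decomposed, the combining marks would be \p{M}, not \p{L}, so removing \P{L} would drop them as well.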

The whole list:

  • \p{L}: any kind of letter from any language.
    • \p{Ll}: a lowercase letter that has an uppercase variant.
    • \p{Lu}: an uppercase letter that has a lowercase variant.
    • \p{Lt}: a letter that appears at the start of a word when only the first letter of the word is capitalized.
    • \p{L&}: a letter that exists in lowercase and uppercase variants (combination of Ll, Lu and Lt).
    • \p{Lm}: a special character that is used like a letter.
    • \p{Lo}: a letter or ideograph that does not have lowercase and uppercase variants.
  • \p{M}: a character intended to be combined with another character (e.g. accents, umlauts, enclosing boxes, etc.).
    • \p{Mn}: a character intended to be combined with another character without taking up extra space (e.g. accents, umlauts, etc.).
    • \p{Mc}: a character intended to be combined with another character that takes up extra space (vowel signs in many Eastern languages).
    • \p{Me}: a character that encloses the character it is combined with (circle, square, keycap, etc.).
  • \p{Z}: any kind of whitespace or invisible separator.
    • \p{Zs}: a whitespace character that is invisible, but does take up space.
    • \p{Zl}: line separator character U+2028.
    • \p{Zp}: paragraph separator character U+2029.
  • \p{S}: math symbols, currency signs, dingbats, box-drawing characters, etc.
    • \p{Sm}: any mathematical symbol.
    • \p{Sc}: any currency sign.
    • \p{Sk}: a combining character (mark) as a full character on its own.
    • \p{So}: various symbols that are not math symbols, currency signs, or combining characters.
  • \p{N}: any kind of numeric character in any script.
    • \p{Nd}: a digit zero through nine in any script except ideographic scripts.
    • \p{Nl}: a number that looks like a letter, such as a Roman numeral.
    • \p{No}: a superscript or subscript digit, or a number that is not a digit 0–9 (excluding numbers from ideographic scripts).
  • \p{P}: any kind of punctuation character.
    • \p{Pd}: any kind of hyphen or dash.
    • \p{Ps}: any kind of opening bracket.
    • \p{Pe}: any kind of closing bracket.
    • \p{Pi}: any kind of opening quote.
    • \p{Pf}: any kind of closing quote.
    • \p{Pc}: a punctuation character such as an underscore that connects words.
    • \p{Po}: any kind of punctuation character that is not a dash, bracket, quote or connector.
  • \p{C}: invisible control characters and unused code points.
    • \p{Cc}: an ASCII or Latin-1 control character: 0x00–0x1F and 0x7F–0x9F.
    • \p{Cf}: invisible formatting indicator.
    • \p{Co}: any code point reserved for private use.
    • \p{Cs}: one half of a surrogate pair in UTF-16 encoding.
    • \p{Cn}: any code point to which no character has been assigned.
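
If you are not sure which category a character falls under, you can look it up. Here is a small sketch using Python's unicodedata module, which reports the same two-letter general category codes as the list above. The sample characters are my own picks, and the function name is made up. L& is a combined shorthand, so it is never returned by the lookup.

```python
import unicodedata


def categories(text):
    # Returns (code point, general category) pairs, e.g. ("U+0041", "Lu").
    return [(f"U+{ord(ch):04X}", unicodedata.category(ch)) for ch in text]


# A few hand-picked characters: capital A, combining acute accent,
# no-break space, euro sign, Roman numeral four, left guillemet, tab.
for code_point, category in categories("A\u0301\u00a0\u20ac\u2163\u00ab\t"):
    print(code_point, category)
```

The first letter of each code is the parent class, so a character reported as Mn is matched by both \p{Mn} and \p{M}.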

Full Documentation:
https://www.regular-expressions.info/unicode.html#category
