
I’m a regular-expressions nerd. While writing a regex to look for verses with no ending punctuation, either
[^\\][^\p{P}\d]\r\n\\v
or
[^\\l][^\p{P}\d]\r\n\\v \d+ \p{Lu}
I realized that the C# (.NET) flavor of regex used in RegexPal (and in FLEx filtering/Process) supports the shorthand Unicode character categories such as \p{Lu}, but not the longhand names such as \p{Uppercase_Letter}.
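For anyone who wants to try the idea outside RegexPal, here is a rough sketch in Python of what the first pattern checks. Python's built-in re module does not accept \p{...}, so this uses unicodedata.category instead of a regex. The function names and the sample text are made up for illustration, and the scan only approximates the regex (it does not reproduce every detail of how a regex engine skips past earlier matches).

```python
import unicodedata

MARKER = "\r\n\\v"  # a Windows line break followed by the \v verse marker


def is_punct_or_digit(ch):
    # Approximates the class [\p{P}\d]: any punctuation category (P*)
    # or a decimal digit (Nd), which is what \d covers in .NET by default.
    cat = unicodedata.category(ch)
    return cat.startswith("P") or cat == "Nd"


def unpunctuated_line_ends(text):
    """Return each line that sits right before a verse marker
    and does not end in punctuation or a digit."""
    hits = []
    pos = text.find(MARKER)
    while pos != -1:
        # Like the regex, require two characters before the line break:
        # one that is not a backslash, then one that is not punctuation or a digit.
        if pos >= 2 and text[pos - 2] != "\\" and not is_punct_or_digit(text[pos - 1]):
            line_start = text.rfind("\n", 0, pos) + 1
            hits.append(text[line_start:pos])
        pos = text.find(MARKER, pos + 1)
    return hits


# Hypothetical sample: verse 2 is missing its final punctuation.
sample = "\\v 1 In the beginning.\r\n\\v 2 And the earth\r\n\\v 3 Then God said:\r\n"
print(unpunctuated_line_ends(sample))
```

Because the check is by Unicode category, a closing curly quote or a non-Latin full stop counts as ending punctuation just as an ASCII period does.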

Some of the most useful are:

  • \p{L} Letter (nearly any orthography)
  • \p{Lu} Uppercase letter
  • \p{Ll} Lowercase letter
  • \p{P} Punctuation
  • \p{M} Mark (any combining diacritic)

If the P after the backslash is capital, it matches any character that is NOT in that category.

  • \P{L} Any non-letter
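To see which characters these shorthands cover, a small Python sketch can help. It is not a regex: in_category and not_in_category are hypothetical helpers that stand in for \p{...} and \P{...} by looking up each character's Unicode general category.

```python
import unicodedata


def in_category(ch, shorthand):
    # Stand-in for \p{...}: "L" covers Lu, Ll, Lt, Lm and Lo; "Lu" covers only Lu.
    return unicodedata.category(ch).startswith(shorthand)


def not_in_category(ch, shorthand):
    # Stand-in for \P{...}, the negated form.
    return not in_category(ch, shorthand)


print(in_category("\u0416", "Lu"))    # Cyrillic capital Zhe
print(in_category("\u00e9", "L"))     # precomposed e with acute accent
print(not_in_category("3", "L"))      # a digit is a non-letter
print(in_category("$", "P"))          # "$" is a currency symbol (Sc), not punctuation

# A precomposed accented letter is one letter with no separate mark.
# Only after decomposition (NFD) is there a combining character for \p{M} to match.
decomposed = unicodedata.normalize("NFD", "\u00e9")
print([unicodedata.category(c) for c in decomposed])
```

Two things worth noticing: symbols such as $ and + belong to \p{S}, so \p{P} will not catch them, and \p{M} only finds diacritics that are stored as separate combining characters.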

The whole list:

  • \p{L}: any kind of letter from any language.
    • \p{Ll}: a lowercase letter that has an uppercase variant.
    • \p{Lu}: an uppercase letter that has a lowercase variant.
    • \p{Lt}: a letter that appears at the start of a word when only the first letter of the word is capitalized.
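    • \p{L&}: a letter that exists in lowercase and uppercase variants (combination of Ll, Lu and Lt). Not every flavor accepts this one; as far as I can tell the .NET flavor does not, so use [\p{Ll}\p{Lu}\p{Lt}] there.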
    • \p{Lm}: a special character that is used like a letter.
    • \p{Lo}: a letter or ideograph that does not have lowercase and uppercase variants.
  • \p{M}: a character intended to be combined with another character (e.g. accents, umlauts, enclosing boxes, etc.).
    • \p{Mn}: a character intended to be combined with another character without taking up extra space (e.g. accents, umlauts, etc.).
    • \p{Mc}: a character intended to be combined with another character that takes up extra space (vowel signs in many Eastern languages).
    • \p{Me}: a character that encloses the character it is combined with (circle, square, keycap, etc.).
  • \p{Z}: any kind of whitespace or invisible separator.
    • \p{Zs}: a whitespace character that is invisible, but does take up space.
    • \p{Zl}: line separator character U+2028.
    • \p{Zp}: paragraph separator character U+2029.
  • \p{S}: math symbols, currency signs, dingbats, box-drawing characters, etc.
    • \p{Sm}: any mathematical symbol.
    • \p{Sc}: any currency sign.
    • \p{Sk}: a combining character (mark) as a full character on its own.
    • \p{So}: various symbols that are not math symbols, currency signs, or combining characters.
  • \p{N}: any kind of numeric character in any script.
    • \p{Nd}: a digit zero through nine in any script except ideographic scripts.
    • \p{Nl}: a number that looks like a letter, such as a Roman numeral.
    • \p{No}: a superscript or subscript digit, or a number that is not a digit 0–9 (excluding numbers from ideographic scripts).
  • \p{P}: any kind of punctuation character.
    • \p{Pd}: any kind of hyphen or dash.
    • \p{Ps}: any kind of opening bracket.
    • \p{Pe}: any kind of closing bracket.
    • \p{Pi}: any kind of opening quote.
    • \p{Pf}: any kind of closing quote.
    • \p{Pc}: a punctuation character such as an underscore that connects words.
    • \p{Po}: any kind of punctuation character that is not a dash, bracket, quote or connector.
  • \p{C}: invisible control characters and unused code points.
    • \p{Cc}: an ASCII or Latin-1 control character: 0x00–0x1F and 0x7F–0x9F.
    • \p{Cf}: invisible formatting indicator.
    • \p{Co}: any code point reserved for private use.
    • \p{Cs}: one half of a surrogate pair in UTF-16 encoding.
    • \p{Cn}: any code point to which no character has been assigned.
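If you are not sure which entry in the list a particular character falls under, you can look it up. This is a minimal sketch in Python; shorthand_for is a made-up helper name, and the answer depends on the Unicode version your Python ships with, although the characters below have had stable categories for a long time.

```python
import unicodedata


def shorthand_for(ch):
    # Build the \p{..} shorthand from the character's two-letter general category.
    return "\\p{" + unicodedata.category(ch) + "}"


# A few characters from less obvious categories.
for ch in ["\u01c5", "\u2163", "\u00b2", "_", "\u00ab", "\u20ac", "\u200b"]:
    print(f"U+{ord(ch):04X}", shorthand_for(ch))
```

Drop the second letter to get the broader class, for example \p{P} instead of \p{Pi}.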

Full Documentation:
https://www.regular-expressions.info/unicode.html#category

Paratext by (231 points)

