Character class: [...], [^...]

A character class matches any character in or not in a custom set of characters. When the v flag is enabled, it can also be used to match finite-length strings.

Syntax

[]
[abc]
[A-Z]

[^]
[^abc]
[^A-Z]

// `v` mode only
[operand1&&operand2]
[operand1--operand2]
[\q{substring}]

Parameters

operand1, operand2: Can be a single character, a character escape, a character class escape, another square-bracket-enclosed character class, or a string written with \q{...}.

substring: A literal string.

Description

A character class specifies a list of characters between square brackets and matches any character in the list. The v flag drastically changes how character classes are parsed and interpreted. The following syntaxes are available in both v mode and non-v mode: single characters (such as a), character ranges written with - (such as A-Z), and escape sequences (such as \s).

These syntaxes can occur any number of times, and the character sets they represent are unioned. For example, /[a-zA-Z0-9]/ matches any letter or digit.
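
A minimal sketch of the union behavior, using the class from the example above. The helper name isAlphanumericChar and the sample inputs are illustrative assumptions, not part of the original reference.

```javascript
// Hypothetical helper: tests whether a single character is an ASCII letter or digit.
function isAlphanumericChar(ch) {
  // Three ranges in one class; the sets they represent are unioned.
  return /^[a-zA-Z0-9]$/.test(ch);
}

console.log(isAlphanumericChar("q")); // true
console.log(isAlphanumericChar("Q")); // true
console.log(isAlphanumericChar("7")); // true
console.log(isAlphanumericChar("_")); // false
```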

The ^ prefix in a character class creates a complement class. For example, [^abc] matches any character except a, b, or c. The ^ character is a literal character when it appears in the middle of a character class — for example, [a^b] matches the characters a, ^, and b.
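
The two roles of ^ can be checked directly. In this short sketch the variable names are illustrative; the patterns are the ones from the paragraph above.

```javascript
const notAbc = /[^abc]/; // a leading ^ creates a complement class
const caretInside = /[a^b]/; // a ^ in the middle is a literal character

console.log(notAbc.test("a")); // false
console.log(notAbc.test("d")); // true
console.log(caretInside.test("^")); // true
console.log(caretInside.test("c")); // false
```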

The lexical grammar does a very rough parse of regex literals, so that it does not end the regex literal at a / character which appears within a character class. This means /[/]/ is valid without needing to escape the /.
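
As a small illustration, the unescaped form can be used like any other regex literal. The variable names and the sample path below are assumptions made for this sketch.

```javascript
// An unescaped / inside a character class does not end the regex literal.
const slashClass = /[/]/;
// The escaped form is also accepted and means the same thing.
const escapedSlashClass = /[\/]/;

console.log(slashClass.test("a/b")); // true
console.log(escapedSlashClass.test("a/b")); // true
console.log("usr/local/bin".split(slashClass)); // [ 'usr', 'local', 'bin' ]
```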

The boundaries of a character range must not specify more than one character, which happens if you use a character class escape. For example:

/[\s-9]/u; // SyntaxError: Invalid regular expression: Invalid character class

In Unicode-unaware mode, a character range where one boundary is a character class escape makes the - a literal character. This is a deprecated syntax kept for web compatibility, and you should not rely on it.

/[\s-9]/.test("-"); // true

In Unicode-unaware mode, regexes are interpreted as a sequence of BMP characters. Therefore, surrogate pairs in character classes represent two characters instead of one.

/[😄]/.test("\ud83d"); // true
/[😄]/u.test("\ud83d"); // false

/[😄-😛]/.test("😑"); // SyntaxError: Invalid regular expression: /[😄-😛]/: Range out of order in character class
/[😄-😛]/u.test("😑"); // true

Even if the pattern ignores case, the case of the two ends of a range is significant in determining which characters belong to the range. For example, the pattern /[E-F]/i only matches E, F, e, and f, while the pattern /[E-f]/i matches all uppercase and lowercase ASCII letters (because it spans over E–Z and a–f), as well as [, \, ], ^, _, and `.
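
The following sketch probes both ranges with a few sample characters. The variable names and the anchors (^ and $, added so that each test checks exactly one character) are assumptions of this illustration.

```javascript
const narrowRange = /^[E-F]$/i;
const wideRange = /^[E-f]$/i;

console.log(narrowRange.test("e")); // true
console.log(narrowRange.test("g")); // false
console.log(wideRange.test("g")); // true, because G lies in E to Z
console.log(wideRange.test("A")); // true, because a lies in a to f
console.log(wideRange.test("_")); // true, _ sits between Z and a
console.log(wideRange.test("@")); // false, @ comes before E
```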

Non-v-mode character class

Non-v-mode character classes interpret most characters literally and have fewer restrictions about the characters they can contain. For example, . is the literal dot character, not the wildcard. The only characters that cannot appear literally are \, ], and -.
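
A brief sketch of this behavior, with illustrative variable names: the first class lists several characters that are special elsewhere in a pattern, and the second shows the three characters named above in escaped form.

```javascript
// Inside a non-v-mode class these are all plain characters.
const literalPunctuation = /^[.*+?(){}|$]$/;
console.log(literalPunctuation.test(".")); // true
console.log(literalPunctuation.test("x")); // false, the dot is not a wildcard here

// Backslash, closing bracket, and hyphen are written with escapes.
const escapedSpecials = /^[\\\]\-]$/;
console.log(escapedSpecials.test("\\")); // true
console.log(escapedSpecials.test("]")); // true
console.log(escapedSpecials.test("-")); // true
console.log(escapedSpecials.test("a")); // false
```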

v-mode character class

The basic idea of character classes in v mode remains the same: you can still use most characters literally, use - to denote character ranges, and use escape sequences. One of the most important features of the v flag is set notation within character classes. As previously mentioned, normal character classes can express unions by concatenating two ranges, such as using [A-Z0-9] to mean “the union of the set [A-Z] and the set [0-9]”. However, there’s no easy way to represent other operations with character sets, such as intersection and difference.

With the v flag, intersection is expressed with &&, and subtraction with --. The absence of both implies union. The two operands of && or -- can be a character, character escape, character class escape, or even another character class. For example, to express “a word character that’s not an underscore”, you can use [\w--_]. You cannot mix operators on the same level. For example, [\w&&[A-z]--_] is a syntax error. However, because you can nest character classes, you can be explicit by writing [\w&&[[A-z]--_]] or [[\w&&[A-z]]--_] (which both mean [A-Za-z]). Similarly, [AB--C] is invalid and you need to write [A[B--C]] (which just means [AB]).

In v mode, the Unicode character class escape \p can match finite-length strings, such as emojis. For symmetry, regular character classes can also match more than one character. To write a “string literal” in a character class, you wrap the string in \q{...}. The only regex syntax supported here is disjunction — apart from this, \q must completely enclose literals (including escaped characters). This ensures that character classes can only match finite-length strings with finitely many possibilities.

Because the character class syntax is now more sophisticated, more characters are reserved and forbidden from appearing literally.

Complement character classes [^...] cannot match strings longer than one character. For example, [\q{ab|c}] is valid and matches the string "ab", but [^\q{ab|c}] is invalid because it’s unclear how many characters should be consumed. The check verifies that all \q contain single characters and all \p specify character properties: for unions, all operands must be purely characters; for intersections, at least one operand must be purely characters; for subtraction, the leftmost operand must be purely characters. The check is syntactic and does not look at the actual character set being specified, which means that although /[^\q{ab|c}--\q{ab}]/v is equivalent to /[^c]/v, it’s still rejected.

Complement classes and case-insensitive matching

Case-insensitive matching works by case-folding both the expected character set and the matched string. When specifying complement classes, the order in which JavaScript performs case-folding and complementing is important. In brief, [^...] in u mode matches allCharacters - caseFold(original), while in v mode it matches caseFold(allCharacters) - caseFold(original). The v-mode behavior ensures that all complement class syntaxes, including [^...], \P, \W, etc., cancel each other out.

Consider the following two regexes (to simplify things, let’s assume that Unicode characters are one of three kinds: lowercase, uppercase, and caseless, and each uppercase letter has a unique lowercase counterpart, and vice versa):

const r1 = /\p{Lowercase_Letter}/iu;
const r2 = /[^\P{Lowercase_Letter}]/iu;

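r2 is a double negation and seems to be equivalent to r1. But in fact, r1 matches all lower- and uppercase ASCII letters, while r2 matches none. Under the simplifying assumptions above, the u-mode rule explains why: \P{Lowercase_Letter} contains the uppercase and caseless characters, and case-folding that set yields the lowercase and caseless characters. Its complement is therefore the set of uppercase characters, which a case-folded input (always lowercase or caseless) can never belong to.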

The main observation here is that after [^...] negates the match, the expected character set may not be a subset of the set of case-folded Unicode characters, causing the case-folded input to not be in the expected character set. In v mode, the set of all characters is also case-folded. The \P character class itself also works slightly differently in v mode (see Unicode character class escape). All of these ensure that [^\P{Lowercase_Letter}] and \p{Lowercase_Letter} are strictly equivalent.

Examples

Matching hexadecimal digits

The following function determines whether a string is a valid hexadecimal number:

function isHexadecimal(str) {
  return /^[0-9A-F]+$/i.test(str);
}

isHexadecimal("2F3"); // true
isHexadecimal("beef"); // true
isHexadecimal("undefined"); // false

Using intersection

The following function matches Greek letters.

function greekLetters(str) {
  return str.match(/[\p{Script_Extensions=Greek}&&\p{Letter}]/gv);
}

// 𐆊 is U+1018A GREEK ZERO SIGN
greekLetters("π𐆊P0零αAΣ"); // [ 'π', 'α', 'Σ' ]

Using subtraction

The following function matches all non-ASCII numbers.

function nonASCIINumbers(str) {
  return str.match(/[\p{Decimal_Number}--[0-9]]/gv);
}

// 𑜹 is U+11739 AHOM DIGIT NINE
nonASCIINumbers("𐆊0零1𝟜𑜹a"); // [ '𝟜', '𑜹' ]

Matching strings

The following function matches all line terminator sequences, including the line terminator characters and the sequence \r\n (CRLF).

function getLineTerminators(str) {
  return str.match(/[\r\n\u2028\u2029\q{\r\n}]/gv);
}

// The input contains a lone CR, a CRLF pair, and a lone LF.
getLineTerminators("A poem\rIs split\r\nInto many\nStanzas");
// [ '\r', '\r\n', '\n' ]

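This example is equivalent to /(?:\r\n|\r|\n|\u2028|\u2029)/gu or /(?:\r\n|[\r\n\u2028\u2029])/gu, except shorter. In the disjunction forms, \r\n must be listed first because alternatives are tried in order, whereas a v-mode character class always tries longer strings first.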

The most useful case of \q{} is when doing subtraction and intersection. Previously, this was possible with multiple lookaheads. The following function matches flags that are not one of the American, Chinese, Russian, British, and French flags.

function notUNSCPermanentMember(flag) {
  return /^[\p{RGI_Emoji_Flag_Sequence}--\q{🇺🇸|🇨🇳|🇷🇺|🇬🇧|🇫🇷}]$/v.test(flag);
}

notUNSCPermanentMember("🇺🇸"); // false
notUNSCPermanentMember("🇩🇪"); // true

This example is mostly equivalent to /^(?!🇺🇸|🇨🇳|🇷🇺|🇬🇧|🇫🇷)\p{RGI_Emoji_Flag_Sequence}$/v, except perhaps more performant.
