-\1f
-File: lispref.info, Node: Syntax of Regexps, Next: Regexp Example, Up: Regular Expressions
-
-Syntax of Regular Expressions
------------------------------
-
- Regular expressions have a syntax in which a few characters are
-special constructs and the rest are "ordinary". An ordinary character
-is a simple regular expression that matches that character and nothing
-else. The special characters are `.', `*', `+', `?', `[', `]', `^',
-`$', and `\'; no new special characters will be defined in the future.
-Any other character appearing in a regular expression is ordinary,
-unless a `\' precedes it.
-
- For example, `f' is not a special character, so it is ordinary, and
-therefore `f' is a regular expression that matches the string `f' and
-no other string. (It does _not_ match the string `ff'.) Likewise, `o'
-is a regular expression that matches only `o'.
-
- Any two regular expressions A and B can be concatenated. The result
-is a regular expression that matches a string if A matches some amount
-of the beginning of that string and B matches the rest of the string.
-
- As a simple example, we can concatenate the regular expressions `f'
-and `o' to get the regular expression `fo', which matches only the
-string `fo'. Still trivial. To do something more powerful, you need
-to use one of the special characters. Here is a list of them:
-
-`. (Period)'
- is a special character that matches any single character except a
- newline. Using concatenation, we can make regular expressions
- like `a.b', which matches any three-character string that begins
- with `a' and ends with `b'.
-
-`*'
- is not a construct by itself; it is a quantifying suffix operator
- that means to repeat the preceding regular expression as many
- times as possible. In `fo*', the `*' applies to the `o', so `fo*'
- matches one `f' followed by any number of `o's. The case of zero
- `o's is allowed: `fo*' does match `f'.
-
- `*' always applies to the _smallest_ possible preceding
- expression. Thus, `fo*' has a repeating `o', not a repeating `fo'.
-
- The matcher processes a `*' construct by matching, immediately, as
- many repetitions as can be found; it is "greedy". Then it
- continues with the rest of the pattern. If that fails,
- backtracking occurs, discarding some of the matches of the
- `*'-modified construct in case that makes it possible to match the
- rest of the pattern. For example, in matching `ca*ar' against the
- string `caaar', the `a*' first tries to match all three `a's; but
- the rest of the pattern is `ar' and there is only `r' left to
- match, so this try fails. The next alternative is for `a*' to
- match only two `a's. With this choice, the rest of the regexp
- matches successfully.
-
- Nested repetition operators can be extremely slow if they specify
- backtracking loops. For example, it could take hours for the
- regular expression `\(x+y*\)*a' to try to match the sequence
- `xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxz'. The slowness is because
- Emacs must try each imaginable way of grouping the 35 `x's before
- concluding that none of them can work. To make sure your regular
- expressions run fast, check nested repetitions carefully.
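-
- The greedy matching and backtracking described above can be
- checked with `string-match' and `match-end'. This is only a small
- sketch; the sample strings are illustrations chosen for this
- example.
-
-     ;; `a*' first takes all three `a's, then gives one back so
-     ;; that the trailing `ar' can match.
-     (string-match "ca*ar" "caaar")
-          => 0
-     (match-end 0)
-          => 5
-
-     ;; Zero repetitions are allowed.
-     (string-match "fo*" "f")
-          => 0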
-
-`+'
- is a quantifying suffix operator similar to `*' except that the
- preceding expression must match at least once. It is also
- "greedy". So, for example, `ca+r' matches the strings `car' and
- `caaaar' but not the string `cr', whereas `ca*r' matches all three
- strings.
-
-`?'
- is a quantifying suffix operator similar to `*', except that the
- preceding expression can match either once or not at all. For
- example, `ca?r' matches `car' or `cr', but does not match anything
- else.
-
-`*?'
- works just like `*', except that rather than matching the longest
- match, it matches the shortest match. `*?' is known as a
- "non-greedy" quantifier, a regexp construct borrowed from Perl.
-
- This construct is very useful when you want to match the text
- inside a pair of delimiters. For instance, `/\*.*?\*/' will match
- C comments in a string. This could not be achieved without the
- use of a non-greedy quantifier.
-
- This construct was not available before XEmacs 20.4. It is not
- available in FSF Emacs.
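-
- As a rough illustration of the difference, assuming a version that
- supports `*?' and using a made-up sample string, the greedy form
- runs to the last `*/' while the non-greedy form stops at the first:
-
-     (string-match "/\\*.*\\*/" "/* a */ x /* b */")
-          => 0
-     (match-end 0)
-          => 17
-
-     (string-match "/\\*.*?\\*/" "/* a */ x /* b */")
-          => 0
-     (match-end 0)
-          => 7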
-
-`+?'
- is the `+' analog to `*?'.
-
-`\{n,m\}'
- serves as an interval quantifier, analogous to `*' or `+', but
- specifies that the expression must match at least N times and no
- more than M times. This syntax is supported by most Unix regexp
- utilities, and was introduced in XEmacs 20.3.
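-
- For example, under the assumption that the interval syntax is
- available, a Lisp string must double each backslash of the
- construct (the sample strings here are arbitrary):
-
-     ;; At least two and at most three `a's; as many as allowed
-     ;; are taken.
-     (string-match "a\\{2,3\\}" "caaaar")
-          => 1
-     (match-end 0)
-          => 4
-
-     ;; A single `a' is not enough.
-     (string-match "a\\{2,3\\}" "car")
-          => nil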
-
-`[ ... ]'
- `[' begins a "character set", which is terminated by a `]'. In
- the simplest case, the characters between the two brackets form
- the set. Thus, `[ad]' matches either one `a' or one `d', and
- `[ad]*' matches any string composed of just `a's and `d's
- (including the empty string), from which it follows that `c[ad]*r'
- matches `cr', `car', `cdr', `caddaar', etc.
-
- The usual regular expression special characters are not special
- inside a character set. A completely different set of special
- characters exists inside character sets: `]', `-' and `^'.
-
- `-' is used for ranges of characters. To write a range, write two
- characters with a `-' between them. Thus, `[a-z]' matches any
- lower case letter. Ranges may be intermixed freely with individual
- characters, as in `[a-z$%.]', which matches any lower case letter
- or `$', `%', or a period.
-
- To include a `]' in a character set, make it the first character.
- For example, `[]a]' matches `]' or `a'. To include a `-', write
- `-' as the first character in the set, or put it immediately after
- a range. (You can replace one individual character C with the
- range `C-C' to make a place to put the `-'.) There is no way to
- write a set containing just `-' and `]'.
-
- To include `^' in a set, put it anywhere but at the beginning of
- the set.
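-
- A few character sets can be tried with `string-match'. In this
- sketch `case-fold-search' is bound to `nil' so that letter case is
- significant; the sample strings are arbitrary.
-
-     (let ((case-fold-search nil))
-       (list (string-match "c[ad]*r" "caddaar")   ; plain set
-             (string-match "[a-z$%.]" "A%")       ; range plus chars
-             (string-match "[]a]" "x]")           ; `]' placed first
-             (string-match "[-a]" "x-")))         ; `-' placed first
-          => (0 1 1 1)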
-
-`[^ ... ]'
- `[^' begins a "complement character set", which matches any
- character except the ones specified. Thus, `[^a-z0-9A-Z]' matches
- all characters _except_ letters and digits.
-
- `^' is not special in a character set unless it is the first
- character. The character following the `^' is treated as if it
- were first (thus, `-' and `]' are not special there).
-
- Note that a complement character set can match a newline, unless
- newline is mentioned as one of the characters not to match.
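-
- The point about newlines can be seen by comparing a complement
- set with `.' (a minimal sketch with made-up strings):
-
-     (string-match "." "\n")
-          => nil
-     (string-match "[^a]" "\n")
-          => 0
-     (string-match "[^a\n]" "\n")
-          => nil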
-
-`^'
- is a special character that matches the empty string, but only at
- the beginning of a line in the text being matched. Otherwise it
- fails to match anything. Thus, `^foo' matches a `foo' that occurs
- at the beginning of a line.
-
- When matching a string instead of a buffer, `^' matches at the
- beginning of the string or after a newline character `\n'.
-
-`$'
- is similar to `^' but matches only at the end of a line. Thus,
- `x+$' matches a string of one `x' or more at the end of a line.
-
- When matching a string instead of a buffer, `$' matches at the end
- of the string or before a newline character `\n'.
-
-`\'
- has two functions: it quotes the special characters (including
- `\'), and it introduces additional special constructs.
-
- Because `\' quotes special characters, `\$' is a regular
- expression that matches only `$', and `\[' is a regular expression
- that matches only `[', and so on.
-
- Note that `\' also has special meaning in the read syntax of Lisp
- strings (*note String Type::), and must be quoted with `\'. For
- example, the regular expression that matches the `\' character is
- `\\'. To write a Lisp string that contains the characters `\\',
- Lisp syntax requires you to quote each `\' with another `\'.
- Therefore, the read syntax for a regular expression matching `\'
- is `"\\\\"'.
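-
- The two levels of quoting can be checked directly. This sketch
- uses arbitrary sample strings; note that the Lisp string written
- as "a\\b" has three characters, the middle one a single backslash.
-
-     (length "\\\\")                 ; the regexp `\\'
-          => 2
-     (string-match "\\\\" "a\\b")
-          => 1
-     (string-match "\\$" "a$b")      ; the regexp `\$'
-          => 1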
-
- *Please note:* For historical compatibility, special characters are
-treated as ordinary ones if they are in contexts where their special
-meanings make no sense. For example, `*foo' treats `*' as ordinary
-since there is no preceding expression on which the `*' can act. It is
-poor practice to depend on this behavior; quote the special character
-anyway, regardless of where it appears.
-
- For the most part, `\' followed by any character matches only that
-character. However, there are several exceptions: characters that,
-when preceded by `\', are special constructs. Such characters are
-always ordinary when encountered on their own. Here is a table of `\'
-constructs:
-
-`\|'
- specifies an alternative. Two regular expressions A and B with
- `\|' in between form an expression that matches anything that
- either A or B matches.
-
- Thus, `foo\|bar' matches either `foo' or `bar' but no other string.
-
- `\|' applies to the largest possible surrounding expressions.
- Only a surrounding `\( ... \)' grouping can limit the grouping
- power of `\|'.
-
- Full backtracking capability exists to handle multiple uses of
- `\|'.
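-
- The scope of `\|' can be illustrated as follows (a sketch with
- arbitrary strings). Without grouping, the alternatives are `foo'
- and `barx'; with grouping, the `x' is required after either one.
-
-     (string-match "foo\\|barx" "foo")
-          => 0
-     (string-match "\\(foo\\|bar\\)x" "foo")
-          => nil
-     (string-match "\\(foo\\|bar\\)x" "a barx")
-          => 2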
-
-`\( ... \)'
- is a grouping construct that serves three purposes:
-
- 1. To enclose a set of `\|' alternatives for other operations.
- Thus, `\(foo\|bar\)x' matches either `foox' or `barx'.
-
- 2. To enclose an expression for a suffix operator such as `*' to
- act on. Thus, `ba\(na\)*' matches `bananana', etc., with any
- (zero or more) number of `na' strings.
-
- 3. To record a matched substring for future reference.
-
- This last application is not a consequence of the idea of a
- parenthetical grouping; it is a separate feature that happens to be
- assigned as a second meaning to the same `\( ... \)' construct
- because there is no conflict in practice between the two meanings.
- Here is an explanation of this feature:
-
-`\DIGIT'
- matches the same text that matched the DIGITth occurrence of a `\(
- ... \)' construct.
-
- In other words, after the end of a `\( ... \)' construct, the
- matcher remembers the beginning and end of the text matched by that
- construct. Then, later on in the regular expression, you can use
- `\' followed by DIGIT to match that same text, whatever it may
- have been.
-
- The strings matching the first nine `\( ... \)' constructs
- appearing in a regular expression are assigned numbers 1 through 9
- in the order that the open parentheses appear in the regular
- expression. So you can use `\1' through `\9' to refer to the text
- matched by the corresponding `\( ... \)' constructs.
-
- For example, `\(.*\)\1' matches any newline-free string that is
- composed of two identical halves. The `\(.*\)' matches the first
- half, which may be anything, but the `\1' that follows must match
- the same exact text.
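-
- A short sketch of a back reference, using `match-string' to fetch
- the recorded text (the sample string is arbitrary):
-
-     (string-match "\\(.*\\)\\1" "abcabc")
-          => 0
-     (match-end 0)
-          => 6
-     (match-string 1 "abcabc")
-          => "abc"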
-
-`\(?: ... \)'
- is called a "shy" grouping operator, and it is used just like `\(
- ... \)', except that it does not cause the matched substring to be
- recorded for future reference.
-
- This is useful when you need a lot of grouping `\( ... \)'
- constructs, but only want to remember one or two. Then you can use
- `\(?: ... \)' for the groupings whose matches you do not want to
- remember for later use with `match-string'.
-
- Using `\(?: ... \)' rather than `\( ... \)' when you don't need
- the captured substrings ought to speed up your programs some,
- since it shortens the code path followed by the regular expression
- engine, as well as the amount of memory allocation and string
- copying it must do. The actual performance gain to be observed
- has not been measured or quantified as of this writing.
-
- The shy grouping operator was borrowed from Perl. It was not
- available before XEmacs 20.3, and it is not available in FSF
- Emacs.
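-
- The effect on group numbering can be sketched as follows,
- assuming a version that supports shy groups; the strings are
- arbitrary examples.
-
-     ;; Ordinary grouping: the digits are group 2.
-     (string-match "\\(foo\\|bar\\)\\([0-9]+\\)" "bar42")
-          => 0
-     (match-string 2 "bar42")
-          => "42"
-
-     ;; Shy grouping: the digits become group 1.
-     (string-match "\\(?:foo\\|bar\\)\\([0-9]+\\)" "bar42")
-          => 0
-     (match-string 1 "bar42")
-          => "42"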
-
-`\w'
- matches any word-constituent character. The editor syntax table
- determines which characters these are. *Note Syntax Tables::.
-
-`\W'
- matches any character that is not a word constituent.
-
-`\sCODE'
- matches any character whose syntax is CODE. Here CODE is a
- character that represents a syntax code: thus, `w' for word
- constituent, `-' for whitespace, `(' for open parenthesis, etc.
- *Note Syntax Tables::, for a list of syntax codes and the
- characters that stand for them.
-
-`\SCODE'
- matches any character whose syntax is not CODE.
-
- The following regular expression constructs match the empty
-string--that is, they don't use up any characters--but whether they
-match depends on the context.
-
-`\`'
- matches the empty string, but only at the beginning of the buffer
- or string being matched against.
-
-`\''
- matches the empty string, but only at the end of the buffer or
- string being matched against.
-
-`\='
- matches the empty string, but only at point. (This construct is
- not defined when matching against a string.)
-
-`\b'
- matches the empty string, but only at the beginning or end of a
- word. Thus, `\bfoo\b' matches any occurrence of `foo' as a
- separate word. `\bballs?\b' matches `ball' or `balls' as a
- separate word.
-
-`\B'
- matches the empty string, but _not_ at the beginning or end of a
- word.
-
-`\<'
- matches the empty string, but only at the beginning of a word.
-
-`\>'
- matches the empty string, but only at the end of a word.
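-
- The following sketch contrasts some of these constructs with `^'
- and `$' when matching against strings. It assumes a syntax table
- in which letters are word constituents and spaces are not; the
- sample strings are arbitrary.
-
-     (string-match "\\bfoo\\b" "a foo b")
-          => 2
-     (string-match "\\bfoo\\b" "afoob")
-          => nil
-
-     ;; `^' also matches after a newline; `\`' does not.
-     (string-match "^foo" "bar\nfoo")
-          => 4
-     (string-match "\\`foo" "bar\nfoo")
-          => nil
-
-     ;; `$' also matches before a newline; `\'' does not.
-     (string-match "foo$" "foo\nbar")
-          => 0
-     (string-match "foo\\'" "foo\nbar")
-          => nil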
-
- Not every string is a valid regular expression. For example, a
-string with unbalanced square brackets is invalid (with a few
-exceptions, such as `[]]'), and so is a string that ends with a single
-`\'. If an invalid regular expression is passed to any of the search
-functions, an `invalid-regexp' error is signaled.
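-
- One way to observe the error is to catch it with `condition-case'.
-This is only a sketch; the patterns are deliberately malformed
-examples.
-
-     (condition-case err
-         (string-match "[a" "abc")      ; unbalanced `['
-       (invalid-regexp (car err)))
-          => invalid-regexp
-
-     (condition-case err
-         (string-match "\\" "abc")      ; ends with a single `\'
-       (invalid-regexp (car err)))
-          => invalid-regexp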
-
- - Function: regexp-quote string
- This function returns a regular expression string that matches
- exactly STRING and nothing else. This allows you to request an
- exact string match when calling a function that wants a regular
- expression.
-
- (regexp-quote "^The cat$")
- => "\\^The cat\\$"
-
- One use of `regexp-quote' is to combine an exact string match with
- context described as a regular expression. For example, this
- searches for the string that is the value of `string', surrounded
- by whitespace:
-
- (re-search-forward
- (concat "\\s-" (regexp-quote string) "\\s-"))