+File: lispref.info, Node: Syntax of Regexps, Next: Regexp Example, Up: Regular Expressions
+
+Syntax of Regular Expressions
+-----------------------------
+
+ Regular expressions have a syntax in which a few characters are
+special constructs and the rest are "ordinary". An ordinary character
+is a simple regular expression that matches that character and nothing
+else. The special characters are `.', `*', `+', `?', `[', `]', `^',
+`$', and `\'; no new special characters will be defined in the future.
+Any other character appearing in a regular expression is ordinary,
+unless a `\' precedes it.
+
+ For example, `f' is not a special character, so it is ordinary, and
+therefore `f' is a regular expression that matches the string `f' and
+no other string. (It does _not_ match the string `ff'.) Likewise, `o'
+is a regular expression that matches only `o'.
+
+ Any two regular expressions A and B can be concatenated. The result
+is a regular expression that matches a string if A matches some amount
+of the beginning of that string and B matches the rest of the string.
+
+ As a simple example, we can concatenate the regular expressions `f'
+and `o' to get the regular expression `fo', which matches only the
+string `fo'. This is still trivial. To do something more powerful, you
+need to use one of the special characters. Here is a list of them:
+
+`. (Period)'
+ is a special character that matches any single character except a
+ newline. Using concatenation, we can make regular expressions
+ like `a.b', which matches any three-character string that begins
+ with `a' and ends with `b'.
+
+`*'
+ is not a construct by itself; it is a quantifying suffix operator
+ that means to repeat the preceding regular expression as many
+ times as possible. In `fo*', the `*' applies to the `o', so `fo*'
+ matches one `f' followed by any number of `o's. The case of zero
+ `o's is allowed: `fo*' does match `f'.
+
+ `*' always applies to the _smallest_ possible preceding
+ expression. Thus, `fo*' has a repeating `o', not a repeating `fo'.
+
+ The matcher processes a `*' construct by matching, immediately, as
+ many repetitions as can be found; it is "greedy". Then it
+ continues with the rest of the pattern. If that fails,
+ backtracking occurs, discarding some of the matches of the
+ `*'-modified construct in case that makes it possible to match the
+ rest of the pattern. For example, in matching `ca*ar' against the
+ string `caaar', the `a*' first tries to match all three `a's; but
+ the rest of the pattern is `ar' and there is only `r' left to
+ match, so this try fails. The next alternative is for `a*' to
+ match only two `a's. With this choice, the rest of the regexp
+ matches successfully.
+
+ Nested repetition operators can be extremely slow if they specify
+ backtracking loops. For example, it could take hours for the
+ regular expression `\(x+y*\)*a' to try to match the sequence
+ `xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxz'. The slowness is because
+ Emacs must try each imaginable way of grouping the 35 `x's before
+ concluding that none of them can work. To make sure your regular
+ expressions run fast, check nested repetitions carefully.
+
+`+'
+ is a quantifying suffix operator similar to `*' except that the
+ preceding expression must match at least once. It is also
+ "greedy". So, for example, `ca+r' matches the strings `car' and
+ `caaaar' but not the string `cr', whereas `ca*r' matches all three
+ strings.
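+
+ As a sketch of how `*' and `+' behave, the following calls use
+ `string-match', which returns the index where the match starts, or
+ `nil' if there is no match. The sample strings are illustrative
+ only.
+
+          (string-match "ca+r" "caaaar")
+               => 0
+          (string-match "ca+r" "cr")
+               => nil
+          (string-match "ca*r" "cr")
+               => 0
+          ;; Backtracking lets `a*' give up one `a' for the `ar'.
+          (string-match "ca*ar" "caaar")
+               => 0
+          (match-end 0)
+               => 5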
+
+`?'
+ is a quantifying suffix operator similar to `*', except that the
+ preceding expression can match either once or not at all. For
+ example, `ca?r' matches `car' or `cr', but does not match anything
+ else.
+
+`*?'
+ works just like `*', except that rather than matching the longest
+ match, it matches the shortest match. `*?' is known as a
+ "non-greedy" quantifier, a regexp construct borrowed from Perl.
+
+ This construct is useful when you want to match the text inside a
+ pair of delimiters. For instance, `/\*.*?\*/' will match a C
+ comment in a string, provided the comment does not contain a
+ newline (since `.' does not match one). This could not easily be
+ achieved without the use of a non-greedy quantifier.
+
+ This construct was not available before XEmacs 20.4. It is not
+ available in FSF Emacs.
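+
+ The difference from the greedy form can be sketched as follows,
+ assuming an XEmacs recent enough to support `*?'. `match-string'
+ is given the same string that was searched; the sample text is
+ made up for illustration.
+
+          (let ((s "<a><b>"))
+            (string-match "<.*>" s)
+            (match-string 0 s))
+               => "<a><b>"
+          (let ((s "<a><b>"))
+            (string-match "<.*?>" s)
+            (match-string 0 s))
+               => "<a>"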
+
+`+?'
+ is the non-greedy version of `+'.
+
+`??'
+ is the non-greedy version of `?'.
+
+`\{n,m\}'
+ serves as an interval quantifier, analogous to `*' or `+', but
+ specifies that the expression must match at least N times and no
+ more than M times. This syntax is supported by most Unix regexp
+ utilities, and was introduced in XEmacs 20.3.
+
+ Unfortunately, there is currently no non-greedy version of this
+ quantifier, although Perl has one.
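+
+ A brief sketch, assuming a version that supports intervals. Each
+ `\' is doubled because the regexp is written as a Lisp string. The
+ quantifier is greedy, so three `a's are taken when available.
+
+          (string-match "a\\{2,3\\}" "caaaar")
+               => 1
+          (match-end 0)
+               => 4
+          (string-match "a\\{2,3\\}" "car")
+               => nil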
+
+`[ ... ]'
+ `[' begins a "character set", which is terminated by a `]'. In
+ the simplest case, the characters between the two brackets form
+ the set. Thus, `[ad]' matches either one `a' or one `d', and
+ `[ad]*' matches any string composed of just `a's and `d's
+ (including the empty string), from which it follows that `c[ad]*r'
+ matches `cr', `car', `cdr', `caddaar', etc.
+
+ The usual regular expression special characters are not special
+ inside a character set. A completely different set of special
+ characters exists inside character sets: `]', `-' and `^'.
+
+ `-' is used for ranges of characters. To write a range, write two
+ characters with a `-' between them. Thus, `[a-z]' matches any
+ lower case letter. Ranges may be intermixed freely with individual
+ characters, as in `[a-z$%.]', which matches any lower case letter
+ or `$', `%', or a period.
+
+ To include a `]' in a character set, make it the first character.
+ For example, `[]a]' matches `]' or `a'. To include a `-', write
+ `-' as the first character in the set, or put it immediately after
+ a range. (You can replace one individual character C with the
+ range `C-C' to make a place to put the `-'.) There is no way to
+ write a set containing just `-' and `]'.
+
+ To include `^' in a set, put it anywhere but at the beginning of
+ the set.
+
+`[^ ... ]'
+ `[^' begins a "complement character set", which matches any
+ character except the ones specified. Thus, `[^a-z0-9A-Z]' matches
+ all characters _except_ letters and digits.
+
+ `^' is not special in a character set unless it is the first
+ character. The character following the `^' is treated as if it
+ were first (thus, `-' and `]' are not special there).
+
+ Note that a complement character set can match a newline, unless
+ newline is mentioned as one of the characters not to match.
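+
+ The following sketch exercises a few of the character-set rules
+ above; the sample strings are illustrative only.
+
+          (string-match "c[ad]*r" "caddaar")
+               => 0
+          ;; `]' placed first is an ordinary member of the set.
+          (string-match "[]a]" "x]")
+               => 1
+          (string-match "[^a-z0-9A-Z]" "abc def")
+               => 3
+          ;; A complement set can match a newline.
+          (string-match "[^a]" "a\nb")
+               => 1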
+
+`^'
+ is a special character that matches the empty string, but only at
+ the beginning of a line in the text being matched. Otherwise it
+ fails to match anything. Thus, `^foo' matches a `foo' that occurs
+ at the beginning of a line.
+
+ When matching a string instead of a buffer, `^' matches at the
+ beginning of the string or after a newline character `\n'.
+
+`$'
+ is similar to `^' but matches only at the end of a line. Thus,
+ `x+$' matches a string of one `x' or more at the end of a line.
+
+ When matching a string instead of a buffer, `$' matches at the end
+ of the string or before a newline character `\n'.
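+
+ When matching against strings, the behavior of `^' and `$' around
+ newlines can be sketched as follows (sample strings are
+ illustrative):
+
+          (string-match "^foo" "bar\nfoo")
+               => 4
+          (string-match "x+$" "axx\nb")
+               => 1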
+
+`\'
+ has two functions: it quotes the special characters (including
+ `\'), and it introduces additional special constructs.
+
+ Because `\' quotes special characters, `\$' is a regular
+ expression that matches only `$', and `\[' is a regular expression
+ that matches only `[', and so on.
+
+ Note that `\' also has special meaning in the read syntax of Lisp
+ strings (*note String Type::), and must be quoted with `\'. For
+ example, the regular expression that matches the `\' character is
+ `\\'. To write a Lisp string that contains the characters `\\',
+ Lisp syntax requires you to quote each `\' with another `\'.
+ Therefore, the read syntax for a regular expression matching `\'
+ is `"\\\\"'.
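+
+ A short sketch of the doubling rule; the strings being searched
+ are illustrative. The second searched string contains the three
+ characters `a', `\' and `b'.
+
+          (string-match "\\$" "cost: $5")
+               => 6
+          (string-match "\\\\" "a\\b")
+               => 1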
+
+ *Please note:* For historical compatibility, special characters are
+treated as ordinary ones if they are in contexts where their special
+meanings make no sense. For example, `*foo' treats `*' as ordinary
+since there is no preceding expression on which the `*' can act. It is
+poor practice to depend on this behavior; quote the special character
+anyway, regardless of where it appears.
+
+ For the most part, `\' followed by any character matches only that
+character. However, there are several exceptions: characters that,
+when preceded by `\', are special constructs. Such characters are
+always ordinary when encountered on their own. Here is a table of `\'
+constructs:
+
+`\|'
+ specifies an alternative. Two regular expressions A and B with
+ `\|' in between form an expression that matches anything that
+ either A or B matches.
+
+ Thus, `foo\|bar' matches either `foo' or `bar' but no other string.
+
+ `\|' applies to the largest possible surrounding expressions.
+ Only a surrounding `\( ... \)' grouping can limit the grouping
+ power of `\|'.
+
+ Full backtracking capability exists to handle multiple uses of
+ `\|'.
+
+`\( ... \)'
+ is a grouping construct that serves three purposes:
+
+ 1. To enclose a set of `\|' alternatives for other operations.
+ Thus, `\(foo\|bar\)x' matches either `foox' or `barx'.
+
+ 2. To enclose an expression for a suffix operator such as `*' to
+ act on. Thus, `ba\(na\)*' matches `bananana', etc., with any
+ (zero or more) number of `na' strings.
+
+ 3. To record a matched substring for future reference.
+
+ This last application is not a consequence of the idea of a
+ parenthetical grouping; it is a separate feature that happens to be
+ assigned as an additional meaning to the same `\( ... \)' construct
+ because there is no conflict in practice between the meanings.
+ Here is an explanation of this feature:
+
+`\DIGIT'
+ matches the same text that matched the DIGITth occurrence of a `\(
+ ... \)' construct.
+
+ In other words, after the end of a `\( ... \)' construct, the
+ matcher remembers the beginning and end of the text matched by that
+ construct. Then, later on in the regular expression, you can use
+ `\' followed by DIGIT to match that same text, whatever it may
+ have been.
+
+ The strings matching the first nine `\( ... \)' constructs
+ appearing in a regular expression are assigned numbers 1 through 9
+ in the order that the open parentheses appear in the regular
+ expression. So you can use `\1' through `\9' to refer to the text
+ matched by the corresponding `\( ... \)' constructs.
+
+ For example, `\(.*\)\1' matches any newline-free string that is
+ composed of two identical halves. The `\(.*\)' matches the first
+ half, which may be anything, but the `\1' that follows must match
+ the same exact text.
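+
+ Grouping and back references can be sketched together as follows;
+ the sample strings are illustrative.
+
+          (string-match "\\(foo\\|bar\\)x" "barx")
+               => 0
+          ;; Group 1 holds the first of the two identical halves.
+          (let ((s "abcabc"))
+            (string-match "\\(.*\\)\\1" s)
+            (match-string 1 s))
+               => "abc"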
+
+`\(?: ... \)'
+ is called a "shy" grouping operator, and it is used just like `\(
+ ... \)', except that it does not cause the matched substring to be
+ recorded for future reference.
+
+ This is useful when you need a lot of grouping `\( ... \)'
+ constructs, but only want to remember one or two, or if you have
+ more than nine groupings and need to use backreferences to refer to
+ the groupings at the end.
+
+ Using `\(?: ... \)' rather than `\( ... \)' when you don't need
+ the captured substrings ought to speed up your programs somewhat,
+ since it shortens the code path followed by the regular expression
+ engine, as well as the amount of memory allocation and string
+ copying it must do. The actual performance gain to be observed
+ has not been measured or quantified as of this writing.
+
+ The shy grouping operator was borrowed from Perl. It was not
+ available before XEmacs 20.3, and it is not available in FSF
+ Emacs.
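+
+ A minimal sketch, assuming a version that supports shy groups.
+ Only the second pair of parentheses is numbered, so `\(x+\)' is
+ group 1.
+
+          (let ((s "barxx"))
+            (string-match "\\(?:foo\\|bar\\)\\(x+\\)" s)
+            (match-string 1 s))
+               => "xx"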
+
+`\w'
+ matches any word-constituent character. The editor syntax table
+ determines which characters these are. *Note Syntax Tables::.
+
+`\W'
+ matches any character that is not a word constituent.
+
+`\sCODE'
+ matches any character whose syntax is CODE. Here CODE is a
+ character that represents a syntax code: thus, `w' for word
+ constituent, `-' for whitespace, `(' for open parenthesis, etc.
+ *Note Syntax Tables::, for a list of syntax codes and the
+ characters that stand for them.
+
+`\SCODE'
+ matches any character whose syntax is not CODE.
+
+ The following regular expression constructs match the empty
+string--that is, they don't use up any characters--but whether they
+match depends on the context.
+
+`\`'
+ matches the empty string, but only at the beginning of the buffer
+ or string being matched against.
+
+`\''
+ matches the empty string, but only at the end of the buffer or
+ string being matched against.
+
+`\='
+ matches the empty string, but only at point. (This construct is
+ not defined when matching against a string.)
+
+`\b'
+ matches the empty string, but only at the beginning or end of a
+ word. Thus, `\bfoo\b' matches any occurrence of `foo' as a
+ separate word. `\bballs?\b' matches `ball' or `balls' as a
+ separate word.
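+
+ For instance, assuming the standard syntax table, in which letters
+ are word constituents and the space character is not:
+
+          (string-match "\\bballs?\\b" "two balls")
+               => 4
+          (string-match "\\bballs?\\b" "footballs")
+               => nil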
+
+`\B'
+ matches the empty string, but _not_ at the beginning or end of a
+ word.
+
+`\<'
+ matches the empty string, but only at the beginning of a word.
+
+`\>'
+ matches the empty string, but only at the end of a word.
+
+ Not every string is a valid regular expression. For example, a
+string with unbalanced square brackets is invalid (with a few
+exceptions, such as `[]]'), and so is a string that ends with a single
+`\'. If an invalid regular expression is passed to any of the search
+functions, an `invalid-regexp' error is signaled.
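+
+ A sketch of trapping that error with `condition-case'; the symbol
+returned by the handler, `bad-regexp', is arbitrary and chosen only for
+illustration.
+
+     (condition-case nil
+         (string-match "[a" "abc")
+       (invalid-regexp 'bad-regexp))
+          => bad-regexp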
+
+ - Function: regexp-quote string
+ This function returns a regular expression string that matches
+ exactly STRING and nothing else. This allows you to request an
+ exact string match when calling a function that wants a regular
+ expression.
+
+ (regexp-quote "^The cat$")
+ => "\\^The cat\\$"
+
+ One use of `regexp-quote' is to combine an exact string match with
+ context described as a regular expression. For example, this
+ searches for the string that is the value of `string', surrounded
+ by whitespace:
+
+ (re-search-forward
+ (concat "\\s-" (regexp-quote string) "\\s-"))