Perl Regular Expressions

Examples and Guidelines

Isaac Lin

version 1

If you are using the Opera web browser, use full screen mode to see the presentation in slide show format.

Location of presentation: http://www.carrotpatch.org/reference/regexp-tutorial/

Introduction

What is a regular expression?
- A little language to describe a pattern of characters.
- Used to find matches in a text string and extract subpatterns, as well as substitute different text.
This presentation will give examples of Perl regular expressions and tips for creating efficient expressions.
For a whirlwind overview of Perl, including regular expressions: see "Overview of Perl Basics"

Outline

Review of basic syntax
Separating a string in two
Substitution
Backreferences
- Matching paired quotes
Never-ending regexps (Backtracking)
- Matching single-quoted strings
- Avoiding backtracking
- Unrolling the loop
Tips

Outline

Review of basic syntax
Separating a string in two
Substitution
Backreferences
- Matching paired quotes
Never-ending regexps (Backtracking)
- Matching single-quoted strings
- Avoiding backtracking
- Unrolling the loop
Tips

Review of basic syntax

Match a pattern:
/match_pattern/mods
m{match_pattern}mods

Substitute:
s/match_pattern/replacement_text/mods
s{match_pattern}{replacement_text}mods

any non-alphanumeric character can be used as delimiters
- Recommend sticking with // (default) or {} (like standard block statements).
mods : list of one-letter modifiers

In Perl documentation, when a regular expression modifier is discussed, it is usually written with a / in front of it. For example, "using the /x modifier..."

Review of basic syntax (cont.)

mods : list of one-letter modifiers
- using the x modifier is recommended so you can put whitespace and # comments in the match_pattern.

Basic regexp syntax

foo matches foo
foo|bar matches either foo or bar
[abcd] matches either a, b, c, or d
[^abcd] matches any character other than a, b, c, or d
. matches any character (Perl: other than newline)
^ at the start of the pattern means the pattern must start matching from the beginning of the string.
$ at the end of the pattern means the pattern must finish matching at the end of the string (Perl: or just before newline at end).

Regexp quantifiers

a* matches zero or more occurrences of a
a+ matches one or more occurrences of a
a? matches zero or one occurrence of a
a{1,3} matches one, two, or three occurrences of a
a{3} matches aaa
() are used for grouping. Example: foo(bar)* matches foo followed by zero or more occurrences of bar
Quantifier matching is greedy: from left to right, as much as possible of each quantifier is matched, as long as the entire pattern can still match.

Greedy matching

Example: regexp (foo){1,2}(foo)?(foo)+(foo)*
- matching against "foofoofoofoofoo"
  - (foo){1,2} matches the first and second foo's
  - (foo)? matches the third foo
  - (foo)+ matches the fourth and fifth foo's
  - (foo)* matches zero foo's
- matching against "foofoofoo"
  - (foo){1,2} matches the first and second foo's
  - (foo)? matches zero foo's
  - (foo)+ matches the third foo
  - (foo)* matches zero foo's

Character classes

\s matches any whitespace character (e.g. space, tab, newline). \S matches any non-whitespace character.
\d matches any digit character. \D matches any non-digit character.
\w matches any alphanumeric character or _ (“word” characters). \W matches any non-word character.
\b matches either a transition from a word character to a non-word character, or vice-versa (word boundary).
\ followed by a non-alphanumeric character matches that character.

Character classes (cont.)

\x{abcd} matches the codepoint designated by hexadecimal abcd.
\Q automatically escapes any following metacharacters in the pattern up to \E

Outline

Review of basic syntax
Separating a string in two
Substitution
Backreferences
- Matching paired quotes
Never-ending regexps (Backtracking)
- Matching single-quoted strings
- Avoiding backtracking
- Unrolling the loop
Tips

Separating a string in two

Separate "999(ACK)" into two parts: message ID, message name
Components:
- Stuff before (
- Stuff inside )
- )

Separating a string in two: regex

Separate "999(ACK)" into two parts: message ID, message name
```
if ($msgstring =~ /^(.+)$(.+)$$/)
{
  $messageID = $1;
  $messageName = $2;
}
```
- ^: match at start of string
- .: match any character (other than newline)
- *: the preceding subpattern must match zero or more times
- $, $: match (, ) — \ is used to escape metacharacters

Separating a string in two: regex (cont.)

Separate "999(ACK)" into two parts: message ID, message name
```
if ($msgstring =~ /^(.+)$(.+)$$/)
{
  $messageID = $1;
  $messageName = $2;
}
```
- ( ): identifies subpattern; the matching result is captured
- $: match end of string (or just before newline at end of string)
- $messageID is set to 999 (content of 1^st capture group)
- $messageName is set to ACK (content of 2^nd capture group)

Unexpected use case

Regexp: /^(.+)$(.+)$$/

What if we want to separate 999(Some(special)message)?

Unexpected use case

Regexp: /^(.+)$(.+)$$/

What if we want to separate 999(Some(special)message)?
Regexps match from left to right and are greedy.
- .+ will match as many characters as possible while still allowing the entire expression to match
- Patterns are satisfied from left to right, so the first .+ gets the first opportunity to be greedy and match as many characters as possible.
999(Some(special)message)
- $messageID is set to 999(Some
  - The leftmost .+ matched as many characters as possible.
- $messageName is set to special)message

Separating a string, revisited

General rule for fixing problems: be more specific in what you want to match

if ($msgstring =~ m{
      ^
      ([^()]+)    # $1 - anything but ()
      \(
        (.+)      # $2 - anything
      \)
      $
    }x)

Will match 999(Some(special)message)
- x modifier: ignore whitespace and comments (everything from # to the next newline is ignored)

Is this regexp good enough?

If just for a quick-and-dirty job, a regexp that matches all the use cases you can think of is often good enough.
For something that needs to work long term, it is best to work out a complete specification of the data you are trying to match, and then reflect this in your regexp.
Example: a message string consists of
- numeric message ID
- left parenthesis
- alphanumeric characters, spaces, or hyphens
- right parenthesis
- ^(\d+)$([- \w]+)$$

An example of parsing an HTML tag is given later, with a quick-and-dirty regexp, and a more complete regexp afterwards.

Non-greedy match

Adding a ? after a quantifier (e.g. *?, +?, ??) will match the fewest times possible while still allowing the entire expression to match.

if ($msgstring =~ m{
      ^
      (.+?)    # $1 (non-greedy)
      \(
        (.+)   # $2 (greedy by default)
      \)
      $
    }x)

Also will match 999(Some(special)message) with the desired results.

Outline

Review of basic syntax
Separating a string in two
Substitution
Backreferences
- Matching paired quotes
Never-ending regexps (Backtracking)
- Matching single-quoted strings
- Avoiding backtracking
- Unrolling the loop
Tips

Global substitution

Swap each pair of characters:
$text =~ s{(.)(.)}{$2$1}gxs;
g : global substitution — keep replacing until no more matches. Each new search picks up where the last match ended.
s : . matches any character, including newline

Note the /m modifier is not required for this particular substitution. /s is required so that . will match any character.

Outline

Review of basic syntax
Separating a string in two
Substitution
Backreferences
- Matching paired quotes
Never-ending regexps (Backtracking)
- Matching single-quoted strings
- Avoiding backtracking
- Unrolling the loop
Tips

Backreferences

Within a regular expression, the result of a previous capture group can be accessed.
Example: /My (favou?rite) colou?r is also Tom's \1./
- Matches My favourite colour is also Tom's favourite.
- Matches My favorite color is also Tom's favorite.
- Fails to match My favorite color is also Tom's favourite.

Modifying HTML on Web pages

Sample problem: the images from a web server directory are being moved to a new location.
- Any HTML files in that directory with relative references to the images such as <img src="someimage.jpg"> must be changed to <img src="/new_location/someimage.jpg">.
- Tags like <img src="/path/to/someimage.jpg"> must be left alone.
- Note absolute references such as <img src="/old_location/someimage.jpg"> must also be changed — this is left as an exercise.

Description of problem

Find patterns in the following format:
- starts with <img
- ends with > or />
- <img is followed by one or more instances of attribute_name="attribute value", with either matching double quotes or matching single quotes around the attribute value
- attribute_name consists of alphanumeric characters
- attribute value consists of any characters other than the quote character surrounding the value
- the = sign can optionally have one or more spaces on either side of it
If the value for the src attribute does not begin with a /, then prepend /new_location/

Outline of required regexp

Verbal description of desired regexp:

<img
followed by zero or more attributes
followed by the src attribute
followed by zero or more attributes
followed by > or />

Simplified description

Less accurate but simplified description:

<img
followed by ~~zero or more attributes~~ stuff
followed by the src attribute
followed by ~~zero or more attributes~~ stuff
followed by > or />

First attempt (simplified version)

<img
stuff other than closing > plus trailing whitespace: [^>]*\s+
src attribute
- src
- equals: = optionally surrounded by whitespace: \s*=\s*
- open quote, non-/ nor quote, non-quote characters: ['"][^/'"][^'"]*
- close quote, matching the opening quote: need a backreference
stuff other than closing >: [^>]*
/?>

Assembling the regex

¹   $newString =~ s{
²      (<img
³         [^>]* \s+
⁴         src \s*=\s*)       # $1 - stuff up to quote
⁵         (['"])             # $2 - quote
⁶         ([^/'"]
⁷          [^'"]*)           # $3 - attribute value
⁸         (\2
⁹         [^>]* /?>
¹⁰        )                  # $4 - rest
¹¹  }{$1$2$docroot$3$4}gx;

Shortcomings:

Doesn't support quotes contained within the value for src
Doesn't support > character within attribute values

Checking for non-quote character

Introducing lookahead assertion:

(?=some pattern): checks if some pattern matches, without advancing the current position
(?!some pattern): checks if some pattern does not match, without advancing the current position
Do not match /: (?!/)
Do not match previous group 2: (?!\2)

Assembling the regex (2)

¹   $newString =~ s{
²      (<img
³         [^>]* \s+
⁴         src \s*=\s*)       # $1 - stuff up to quote
⁵         (['"])             # $2 - quote
⁶         ((?!/)             # Not a /
⁷          (?!\2).*)         # $3 - attribute value
⁸         (\2
⁹         [^>]* /?>
¹⁰        )                  # $4 - rest
¹¹  }{$1$2$docroot$3$4}gx;

Shortcomings:

Doesn't support > character within attribute values

More shortcomings

Easy-to-miss use case: <img src="foo.jpg" title="date='today' src='me'" />
- What will the result be?

More shortcomings

Easy-to-miss use case: <img src="foo.jpg" title="date='today' src='me'" />
- The second src='me' will get altered, but the real src attribute will not.
Need a regexp in the following form:
< tagname ( attributeSetting )* src = value restOfTag >
- In first two attempts, [^>]*\s+ was used for ( attributeSetting )* but this is too primitive to handle the special case above.
- [^>]*> is used for the restOfTag; this also may not be robust enough for some input
For well-known problems, such as parsing HTML, it is often better to reuse existing Perl modules from CPAN instead of writing your own code.

Revised regex

¹   $newString =~ s{
²      (<img
³       (?: \s+ \w+ \s*=\s* (['"])(?!\2).* \2 )* #  $2
⁴         src \s*=\s*)       # $1 - stuff up to quote
⁵         (['"])             # $3 - quote
⁶         ((?!/)             # Not a /
⁷          (?!\3).*)         # $4 - attribute value
⁸         (\3
⁹       (?: \s+ \w+ \s*=\s* (['"])(?!\6).* \6 )* #  $6
¹⁰          \s*/?>
¹¹        )                  # $5 - rest
¹²  }{$1$3$docroot$4$5}gx;

lines 3 and 9 are modified with better patterns to match attribute settings
(?: ) does not capture

Outline

Review of basic syntax
Separating a string in two
Substitution
Backreferences
- Matching paired quotes
Never-ending regexps (Backtracking)
- Matching single-quoted strings
- Avoiding backtracking
- Unrolling the loop
Tips

Never-ending regexps (backtracking)

Avoid subpatterns that can match many different ways.
- Example: match vacation spots with a triple-A rating and above
```
m{ A*AAA \s+ (?:Hawaii|Bermuda) }x
```
- Will spend a long time trying to match AAAAAA.....AAA Fort Lauderdale
- Each time the destination fails to match, Perl will backtrack and try the match again, with a different matching pattern for the first part:
  - AAAAAA.....AAA
  - AAAAA.....AAA
  - AAA.....AAA
  - ...
- When possible, write patterns that can only match one way.
```
m{ \b AAAA* \s+ (?:Hawaii|Bermuda) }x
```
  - have specific patterns to anchor the match

\b : boundary between non-word character and word character. This anchors the expression to a specific starting point, so the rest can only match one way.
Putting A* at the end is not necessary, but as a general rule, where possible, put specific matches at the start of a pattern. This can often help reduce the amount of backtracking by providing earlier anchor points for pattern matches.

Never-ending combinations

Nested quantifiers quickly multiply the number of potential matches.
```
'hihihihihihi' =~ / (?: (?: hi )+ )* /x
```
How many different ways can the regexp match at the start of the string of hi's?
If the first + is replaced with a *, how many ways now?

Ways that each parenthesized expression can match at the start of the string:

+ can match 1 hi, and then * can match 0-6 times: 7 ways
+ can match 2 hi's, and then * can match 0-3 times: 4 ways
+ can match 3 hi's, and then * can match 0-2 times: 3 ways
+ can match 4 hi's, and then * can match 0-1 times: 2 ways
+ can match 5 hi's, and then * can match 0-1 times: 2 ways
+ can match 6 hi's, and then * can match 0-1 times: 2 ways

There are least 20 different ways to match at the start of the string.

Note the whole pattern can match zero times anywhere in the string.

Matching single-quoted strings

Specification:
- String begins and ends with an '.
- String consists of any non-quote or non-\ character.
- Exception: string can contain a \' or \\ pair
Straightforward encoding of specification:
```
qr/^' (?: [^\\\'] | \\[\\\'])* '$/x,
```
- [^\\\'] : any character except ' or \
- \\[\\\'] : \, followed by \ or '
Note the (?: ... | ... )* construct results in backtracking when one part of the alternation fails.

Avoiding backtracking

Write a pattern that can only match one way.
- Many scenarios have a pattern to match normal cases, and a different pattern to match special cases. The most obvious regexp pattern to use is the following:
  ( normal | special )+
- This can be thought of as a loop where in each iteration, you either match one case or the other.

Note the cases do not have to be literally "normal" and "special"; the important aspect is that there are two different cases that are matched by two distinct regexp patterns.

Note the regexp for a single-quoted string given earlier follows this format.

Unrolling the loop

"Unrolling the loop" pattern:
normal * ( special normal * )*
- A matching string consists of normal cases, interspersed with special cases. This can be described as
  - zero or more normal cases; followed by
  - zero or more instances of
    - one special case; followed by
    - zero or more normal cases
- Note you can think of the cases as "case 1" and "case 2" instead of "normal" and "special". This pattern is most efficient however when case 2 is indeed less frequent than case 1.
- As written, this pattern matches zero length cases. It should be anchored with a specific match at either end (if possible, at the start).

Note this is just the general outline of the unrolled loop pattern; it should be tailored for each particular regexp.

Unrolled regexp requirements

"Unrolling the loop" pattern:
normal * ( special normal * )*
The patterns for normal and special must not be able to match the same content, to avoid a never-ending match.
special must match at least one character, in order to anchor the match at that point and prevent backtracking.
special should not have a + or * after it, because (?: something + )* or (?: something * )* can match in multiple ways, leading to the never-ending match.
- Beware of nested *'s and +'s: they result in never-ending matches.

Single-quoted strings, unrolled

For legibility, the unrolled version is broken down into segments:

my $chars = qr/ [^\\\'] /x;
my $escapedChars = qr/ \\[\\\'] /x;
my $re =
  qr{^' $chars*
        (?: $escapedChars $chars* )*
      '
    }x;

Note $escapedChars+ is not used, to avoid a never-ending match. The trailing * after (?: ) is relied upon to match on multiple escaped characters.

Outline

Review of basic syntax
Separating a string in two
Substitution
Backreferences
- Matching paired quotes
Never-ending regexps (Backtracking)
- Matching single-quoted strings
- Avoiding backtracking
- Unrolling the loop
Tips

Tips

Writing and debugging:
- Be specific in your pattern to avoid unexpected matches.
- Break up your pattern into separate components to work on.
  - Use qr{} syntax to assign a pattern to a variable.
- Use /x modifier to enable comments within the regex.
Efficiency
- Only use capturing parentheses when necessary; prefer using (?: ) instead.
- Avoid never-ending matches by anchoring the regex to patterns that can only match one way.
- "Unrolling the loop" pattern:
  normal * ( special normal * )*

Tips

Writing and debugging:
- Be specific in your pattern to avoid unexpected matches.
- Break up your pattern into separate components to work on.
  - Use qr{} syntax to assign a pattern to a variable.
- Use /x modifier to enable comments within the regex.
Efficiency
- Only use capturing parentheses when necessary; prefer using (?: ) instead.
- Avoid never-ending matches by anchoring the regex to patterns that can only match one way.
- "Unrolling the loop" pattern:
  normal * ( special normal * )*
Don't Panic!

Perl Regular Expressions: Tutorial

February 9, 2015

Perl Regular Expressions

Examples and Guidelines

Isaac Lin

version 1

Introduction

Outline

Outline

Review of basic syntax

Review of basic syntax (cont.)

Basic regexp syntax

Regexp quantifiers

Greedy matching

Character classes

Character classes (cont.)

Outline

Separating a string in two

Separating a string in two: regex

Separating a string in two: regex (cont.)

Unexpected use case

Unexpected use case

Separating a string, revisited

Is this regexp good enough?

Non-greedy match

Outline

Global substitution

Outline

Backreferences

Modifying HTML on Web pages

Description of problem

Outline of required regexp

Simplified description

First attempt (simplified version)

Assembling the regex

Checking for non-quote character

Assembling the regex (2)

More shortcomings

More shortcomings

Revised regex

Outline

Never-ending regexps (backtracking)

Never-ending combinations

Matching single-quoted strings

Avoiding backtracking

Unrolling the loop

Unrolled regexp requirements

Single-quoted strings, unrolled

Outline

Tips

Tips