Perl Regular Expressions

Examples and Guidelines

Isaac Lin

version 1

If you are using the Opera web browser, use full screen mode to see the presentation in slide show format.

Location of presentation: http://www.carrotpatch.org/reference/regexp-tutorial/

Introduction

  • What is a regular expression?
    • A little language to describe a pattern of characters.
    • Used to find matches in a text string and extract subpatterns, as well as substitute different text.
  • This presentation will give examples of Perl regular expressions and tips for creating efficient expressions.
  • For a whirlwind overview of Perl, including regular expressions: see "Overview of Perl Basics"

Outline

  • Review of basic syntax
  • Separating a string in two
  • Substitution
  • Backreferences
    • Matching paired quotes
  • Never-ending regexps (Backtracking)
    • Matching single-quoted strings
    • Avoiding backtracking
    • Unrolling the loop
  • Tips

Outline

  • Review of basic syntax
  • Separating a string in two
  • Substitution
  • Backreferences
    • Matching paired quotes
  • Never-ending regexps (Backtracking)
    • Matching single-quoted strings
    • Avoiding backtracking
    • Unrolling the loop
  • Tips

Review of basic syntax

Match a pattern:
/match_pattern/mods
m{match_pattern}mods

Substitute:
s/match_pattern/replacement_text/mods
s{match_pattern}{replacement_text}mods

  • any non-alphanumeric character can be used as delimiters
    • Recommend sticking with // (default) or {} (like standard block statements).
  • mods : list of one-letter modifiers
In Perl documentation, when a regular expression modifier is discussed, it is usually written with a / in front of it. For example, "using the /x modifier..."

Review of basic syntax (cont.)

  • mods : list of one-letter modifiers
    • using the x modifier is recommended so you can put whitespace and # comments in the match_pattern.

Basic regexp syntax

  • foo matches foo
  • foo|bar matches either foo or bar
  • [abcd] matches either a, b, c, or d
  • [^abcd] matches any character other than a, b, c, or d
  • . matches any character (Perl: other than newline)
  • ^ at the start of the pattern means the pattern must start matching from the beginning of the string.
  • $ at the end of the pattern means the pattern must finish matching at the end of the string (Perl: or just before newline at end).

Regexp quantifiers

  • a* matches zero or more occurrences of a
  • a+ matches one or more occurrences of a
  • a? matches zero or one occurrence of a
  • a{1,3} matches one, two, or three occurrences of a
  • a{3} matches aaa
  • () are used for grouping. Example: foo(bar)* matches foo followed by zero or more occurrences of bar
  • Quantifier matching is greedy: from left to right, as much as possible of each quantifier is matched, as long as the entire pattern can still match.

Greedy matching

  • Example: regexp (foo){1,2}(foo)?(foo)+(foo)*
    • matching against "foofoofoofoofoo"
      • (foo){1,2} matches the first and second foo's
      • (foo)? matches the third foo
      • (foo)+ matches the fourth and fifth foo's
      • (foo)* matches zero foo's
    • matching against "foofoofoo"
      • (foo){1,2} matches the first and second foo's
      • (foo)? matches zero foo's
      • (foo)+ matches the third foo
      • (foo)* matches zero foo's

Character classes

  • \s matches any whitespace character (e.g. space, tab, newline). \S matches any non-whitespace character.
  • \d matches any digit character. \D matches any non-digit character.
  • \w matches any alphanumeric character or _ (“word” characters). \W matches any non-word character.
  • \b matches either a transition from a word character to a non-word character, or vice-versa (word boundary).
  • \ followed by a non-alphanumeric character matches that character.

Character classes (cont.)

  • \x{abcd} matches the codepoint designated by hexadecimal abcd.
  • \Q automatically escapes any following metacharacters in the pattern up to \E

Outline

  • Review of basic syntax
  • Separating a string in two
  • Substitution
  • Backreferences
    • Matching paired quotes
  • Never-ending regexps (Backtracking)
    • Matching single-quoted strings
    • Avoiding backtracking
    • Unrolling the loop
  • Tips

Separating a string in two

  • Separate "999(ACK)" into two parts: message ID, message name
  • Components:
    • Stuff before (
    • Stuff inside )
    • )

Separating a string in two: regex

  • Separate "999(ACK)" into two parts: message ID, message name
    if ($msgstring =~ /^(.+)\((.+)\)$/)
    {
      $messageID = $1;
      $messageName = $2;
    }
    
    • ^: match at start of string
    • .: match any character (other than newline)
    • *: the preceding subpattern must match zero or more times
    • \(, \): match (, ) — \ is used to escape metacharacters

Separating a string in two: regex (cont.)

  • Separate "999(ACK)" into two parts: message ID, message name
    if ($msgstring =~ /^(.+)\((.+)\)$/)
    {
      $messageID = $1;
      $messageName = $2;
    }
    
    • ( ): identifies subpattern; the matching result is captured
    • $: match end of string (or just before newline at end of string)
    • $messageID is set to 999 (content of 1st capture group)
    • $messageName is set to ACK (content of 2nd capture group)

Unexpected use case

Regexp: /^(.+)\((.+)\)$/
  • What if we want to separate 999(Some(special)message)?

Unexpected use case

Regexp: /^(.+)\((.+)\)$/
  • What if we want to separate 999(Some(special)message)?
  • Regexps match from left to right and are greedy.
    • .+ will match as many characters as possible while still allowing the entire expression to match
    • Patterns are satisfied from left to right, so the first .+ gets the first opportunity to be greedy and match as many characters as possible.
  • 999(Some(special)message)
    • $messageID is set to 999(Some
      • The leftmost .+ matched as many characters as possible.
    • $messageName is set to special)message

Separating a string, revisited

  • General rule for fixing problems: be more specific in what you want to match
    if ($msgstring =~ m{
          ^
          ([^()]+)    # $1 - anything but ()
          \(
            (.+)      # $2 - anything
          \)
          $
        }x)
    
  • Will match 999(Some(special)message)
    • x modifier: ignore whitespace and comments (everything from # to the next newline is ignored)

Is this regexp good enough?

  • If just for a quick-and-dirty job, a regexp that matches all the use cases you can think of is often good enough.
  • For something that needs to work long term, it is best to work out a complete specification of the data you are trying to match, and then reflect this in your regexp.
  • Example: a message string consists of
    • numeric message ID
    • left parenthesis
    • alphanumeric characters, spaces, or hyphens
    • right parenthesis
    • ^(\d+)\(([- \w]+)\)$

An example of parsing an HTML tag is given later, with a quick-and-dirty regexp, and a more complete regexp afterwards.

Non-greedy match

  • Adding a ? after a quantifier (e.g. *?, +?, ??) will match the fewest times possible while still allowing the entire expression to match.
    if ($msgstring =~ m{
          ^
          (.+?)    # $1 (non-greedy)
          \(
            (.+)   # $2 (greedy by default)
          \)
          $
        }x)
    
  • Also will match 999(Some(special)message) with the desired results.

Outline

  • Review of basic syntax
  • Separating a string in two
  • Substitution
  • Backreferences
    • Matching paired quotes
  • Never-ending regexps (Backtracking)
    • Matching single-quoted strings
    • Avoiding backtracking
    • Unrolling the loop
  • Tips

Global substitution

  • Swap each pair of characters:
    $text =~ s{(.)(.)}{$2$1}gxs;
  • g : global substitution — keep replacing until no more matches. Each new search picks up where the last match ended.
  • s : . matches any character, including newline
Note the /m modifier is not required for this particular substitution. /s is required so that . will match any character.

Outline

  • Review of basic syntax
  • Separating a string in two
  • Substitution
  • Backreferences
    • Matching paired quotes
  • Never-ending regexps (Backtracking)
    • Matching single-quoted strings
    • Avoiding backtracking
    • Unrolling the loop
  • Tips

Backreferences

  • Within a regular expression, the result of a previous capture group can be accessed.
  • Example: /My (favou?rite) colou?r is also Tom's \1./
    • Matches My favourite colour is also Tom's favourite.
    • Matches My favorite color is also Tom's favorite.
    • Fails to match My favorite color is also Tom's favourite.

Modifying HTML on Web pages

  • Sample problem: the images from a web server directory are being moved to a new location.
    • Any HTML files in that directory with relative references to the images such as <img src="someimage.jpg"> must be changed to <img src="/new_location/someimage.jpg">.
    • Tags like <img src="/path/to/someimage.jpg"> must be left alone.
    • Note absolute references such as <img src="/old_location/someimage.jpg"> must also be changed — this is left as an exercise.

Description of problem

  • Find patterns in the following format:
    • starts with <img 
    • ends with > or />
    • <img is followed by one or more instances of  attribute_name="attribute value", with either matching double quotes or matching single quotes around the attribute value
    • attribute_name consists of alphanumeric characters
    • attribute value consists of any characters other than the quote character surrounding the value
    • the = sign can optionally have one or more spaces on either side of it
  • If the value for the src attribute does not begin with a /, then prepend /new_location/

Outline of required regexp

Verbal description of desired regexp:

  • <img
  • followed by zero or more attributes
  • followed by the src attribute
  • followed by zero or more attributes
  • followed by > or />

Simplified description

Less accurate but simplified description:

  • <img
  • followed by zero or more attributes stuff
  • followed by the src attribute
  • followed by zero or more attributes stuff
  • followed by > or />

First attempt (simplified version)

  • <img
  • stuff other than closing > plus trailing whitespace: [^>]*\s+
  • src attribute
    • src
    • equals: = optionally surrounded by whitespace: \s*=\s*
    • open quote, non-/ nor quote, non-quote characters: ['"][^/'"][^'"]*
    • close quote, matching the opening quote: need a backreference
  • stuff other than closing >: [^>]*
  • /?>

Assembling the regex

1   $newString =~ s{
2      (<img
3         [^>]* \s+
4         src \s*=\s*)       # $1 - stuff up to quote
5         (['"])             # $2 - quote
6         ([^/'"]
7          [^'"]*)           # $3 - attribute value
8         (\2
9         [^>]* /?>
10        )                  # $4 - rest
11  }{$1$2$docroot$3$4}gx;
  • Shortcomings:
    • Doesn't support quotes contained within the value for src
    • Doesn't support > character within attribute values

Checking for non-quote character

Introducing lookahead assertion:

  • (?=some pattern): checks if some pattern matches, without advancing the current position
  • (?!some pattern): checks if some pattern does not match, without advancing the current position
  • Do not match /: (?!/)
  • Do not match previous group 2: (?!\2)

Assembling the regex (2)

1   $newString =~ s{
2      (<img
3         [^>]* \s+
4         src \s*=\s*)       # $1 - stuff up to quote
5         (['"])             # $2 - quote
6         ((?!/)             # Not a /
7          (?!\2).*)         # $3 - attribute value
8         (\2
9         [^>]* /?>
10        )                  # $4 - rest
11  }{$1$2$docroot$3$4}gx;
  • Shortcomings:
    • Doesn't support > character within attribute values

More shortcomings

  • Easy-to-miss use case: <img src="foo.jpg" title="date='today' src='me'" />
    • What will the result be?

More shortcomings

  • Easy-to-miss use case: <img src="foo.jpg" title="date='today' src='me'" />
    • The second src='me' will get altered, but the real src attribute will not.
  • Need a regexp in the following form:
    < tagname ( attributeSetting )* src = value restOfTag >
    • In first two attempts, [^>]*\s+ was used for ( attributeSetting )* but this is too primitive to handle the special case above.
    • [^>]*> is used for the restOfTag; this also may not be robust enough for some input
  • For well-known problems, such as parsing HTML, it is often better to reuse existing Perl modules from CPAN instead of writing your own code.

Revised regex

1   $newString =~ s{
2      (<img
3       (?: \s+ \w+ \s*=\s* (['"])(?!\2).* \2 )* #  $2
4         src \s*=\s*)       # $1 - stuff up to quote
5         (['"])             # $3 - quote
6         ((?!/)             # Not a /
7          (?!\3).*)         # $4 - attribute value
8         (\3
9       (?: \s+ \w+ \s*=\s* (['"])(?!\6).* \6 )* #  $6
10          \s*/?>
11        )                  # $5 - rest
12  }{$1$3$docroot$4$5}gx;
  • lines 3 and 9 are modified with better patterns to match attribute settings
  • (?: ) does not capture

Outline

  • Review of basic syntax
  • Separating a string in two
  • Substitution
  • Backreferences
    • Matching paired quotes
  • Never-ending regexps (Backtracking)
    • Matching single-quoted strings
    • Avoiding backtracking
    • Unrolling the loop
  • Tips

Never-ending regexps (backtracking)

  • Avoid subpatterns that can match many different ways.
    • Example: match vacation spots with a triple-A rating and above
      m{ A*AAA \s+ (?:Hawaii|Bermuda) }x
      
    • Will spend a long time trying to match AAAAAA.....AAA Fort Lauderdale
    • Each time the destination fails to match, Perl will backtrack and try the match again, with a different matching pattern for the first part:
      • AAAAAA.....AAA
      • AAAAA.....AAA
      • AAA.....AAA
      • ...
    • When possible, write patterns that can only match one way.
      m{ \b AAAA* \s+ (?:Hawaii|Bermuda) }x
      
      • have specific patterns to anchor the match
  • \b : boundary between non-word character and word character. This anchors the expression to a specific starting point, so the rest can only match one way.
  • Putting A* at the end is not necessary, but as a general rule, where possible, put specific matches at the start of a pattern. This can often help reduce the amount of backtracking by providing earlier anchor points for pattern matches.

Never-ending combinations

  • Nested quantifiers quickly multiply the number of potential matches.
    'hihihihihihi' =~ / (?: (?: hi )+ )* /x
    
  • How many different ways can the regexp match at the start of the string of hi's?
  • If the first + is replaced with a *, how many ways now?

Ways that each parenthesized expression can match at the start of the string:

  • + can match 1 hi, and then * can match 0-6 times: 7 ways
  • + can match 2 hi's, and then * can match 0-3 times: 4 ways
  • + can match 3 hi's, and then * can match 0-2 times: 3 ways
  • + can match 4 hi's, and then * can match 0-1 times: 2 ways
  • + can match 5 hi's, and then * can match 0-1 times: 2 ways
  • + can match 6 hi's, and then * can match 0-1 times: 2 ways

There are least 20 different ways to match at the start of the string.

Note the whole pattern can match zero times anywhere in the string.

Matching single-quoted strings

  • Specification:
    • String begins and ends with an '.
    • String consists of any non-quote or non-\ character.
    • Exception: string can contain a \' or \\ pair
  • Straightforward encoding of specification:
    qr/^' (?: [^\\\'] | \\[\\\'])* '$/x,
    
    • [^\\\'] : any character except ' or \
    • \\[\\\'] : \, followed by \ or '
  • Note the (?: ... | ... )* construct results in backtracking when one part of the alternation fails.

Avoiding backtracking

  • Write a pattern that can only match one way.
    • Many scenarios have a pattern to match normal cases, and a different pattern to match special cases. The most obvious regexp pattern to use is the following:
      ( normal | special )+
    • This can be thought of as a loop where in each iteration, you either match one case or the other.

Note the cases do not have to be literally "normal" and "special"; the important aspect is that there are two different cases that are matched by two distinct regexp patterns.

Note the regexp for a single-quoted string given earlier follows this format.

Unrolling the loop

  • "Unrolling the loop" pattern:
    normal * ( special normal * )*
    • A matching string consists of normal cases, interspersed with special cases. This can be described as
      • zero or more normal cases; followed by
      • zero or more instances of
        • one special case; followed by
        • zero or more normal cases
    • Note you can think of the cases as "case 1" and "case 2" instead of "normal" and "special". This pattern is most efficient however when case 2 is indeed less frequent than case 1.
    • As written, this pattern matches zero length cases. It should be anchored with a specific match at either end (if possible, at the start).

Note this is just the general outline of the unrolled loop pattern; it should be tailored for each particular regexp.

Unrolled regexp requirements

  • "Unrolling the loop" pattern:
    normal * ( special normal * )*
  • The patterns for normal and special must not be able to match the same content, to avoid a never-ending match.
  • special must match at least one character, in order to anchor the match at that point and prevent backtracking.
  • special should not have a + or * after it, because (?: something + )* or (?: something * )* can match in multiple ways, leading to the never-ending match.
    • Beware of nested *'s and +'s: they result in never-ending matches.

Single-quoted strings, unrolled

  • For legibility, the unrolled version is broken down into segments:
    my $chars = qr/ [^\\\'] /x;
    my $escapedChars = qr/ \\[\\\'] /x;
    my $re =
      qr{^' $chars*
            (?: $escapedChars $chars* )*
          '
        }x;
    
  • Note $escapedChars+ is not used, to avoid a never-ending match. The trailing * after (?: ) is relied upon to match on multiple escaped characters.

Outline

  • Review of basic syntax
  • Separating a string in two
  • Substitution
  • Backreferences
    • Matching paired quotes
  • Never-ending regexps (Backtracking)
    • Matching single-quoted strings
    • Avoiding backtracking
    • Unrolling the loop
  • Tips

Tips

  • Writing and debugging:
    • Be specific in your pattern to avoid unexpected matches.
    • Break up your pattern into separate components to work on.
      • Use qr{} syntax to assign a pattern to a variable.
    • Use /x modifier to enable comments within the regex.
  • Efficiency
    • Only use capturing parentheses when necessary; prefer using (?: ) instead.
    • Avoid never-ending matches by anchoring the regex to patterns that can only match one way.
    • "Unrolling the loop" pattern:
      normal * ( special normal * )*

Tips

  • Writing and debugging:
    • Be specific in your pattern to avoid unexpected matches.
    • Break up your pattern into separate components to work on.
      • Use qr{} syntax to assign a pattern to a variable.
    • Use /x modifier to enable comments within the regex.
  • Efficiency
    • Only use capturing parentheses when necessary; prefer using (?: ) instead.
    • Avoid never-ending matches by anchoring the regex to patterns that can only match one way.
    • "Unrolling the loop" pattern:
      normal * ( special normal * )*
  • Don't Panic!