1 Finding Patterns in Text

The most common use for re is to search for patterns in text. This example looks for two literal strings, 'this' and 'that', in a text string.

#!/usr/bin/env python2
import re

patterns = [ 'this', 'that' ]
text = 'Does this text match the pattern?'

for pattern in patterns:
    print 'Looking for "%s" in "%s" ->' % (pattern, text),

    if re.search(pattern, text):
        print 'found a match!'
    else:
        print 'no match'

search() takes the pattern and text to scan, and returns a Match object when the pattern is found. If the pattern is not found, search() returns None.

The Match object returned by search() holds information about the nature of the match, including the original input string, the regular expression used, and the location within the original string where the pattern occurs.

import re

pattern = 'this'
text = 'Does this text match the pattern?'

match = re.search(pattern, text)

s = match.start()
e = match.end()

print 'Found "%s" in "%s" from %d to %d ("%s")' % \
    (match.re.pattern, match.string, s, e, text[s:e])
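
Because search() returns None when nothing matches, calling start() on the result of a failed search raises an AttributeError. A minimal sketch of guarding against that follows; describe_match() is a hypothetical helper written for this note, not part of re, and print() is called with a single argument so the block runs on Python 2 and 3.

```python
import re


def describe_match(pattern, text):
    """Describe the first match of pattern in text, or return None.

    describe_match is a hypothetical helper for illustration only.
    """
    match = re.search(pattern, text)
    if match is None:
        # No match: there is no Match object to ask for start() or end().
        return None
    s, e = match.start(), match.end()
    return 'Found "%s" from %d to %d ("%s")' % (match.re.pattern, s, e, text[s:e])


text = 'Does this text match the pattern?'
print(describe_match('this', text))
print(describe_match('that', text))
```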

2 Compiling Expressions

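The module-level functions in re accept expressions as text strings, but it is more efficient to compile the expressions a program uses frequently. The compile() function converts an expression string into a RegexObject.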

import re

# Pre-compile the patterns
regexes = [re.compile(p) for p in ['this', 'that']]
text = 'Does this text match the pattern?'

for regex in regexes:
    print 'Looking for "%s" in "%s" ->' % (regex.pattern, text),

    if regex.search(text):
        print 'found a match!'
    else:
        print 'no match'
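
A compiled expression can also carry flags and be reused across many inputs. The sketch below assumes a case-insensitive search is wanted; the IGNORECASE flag and the sample texts are chosen only for illustration.

```python
import re

# Compile once, with a flag, then reuse the object for every input.
regex = re.compile('this', re.IGNORECASE)

texts = ['Does this text match?', 'THIS one is upper case', 'no match here']

results = [bool(regex.search(t)) for t in texts]
for t, found in zip(texts, results):
    print('%-25s -> %s' % (t, 'found a match!' if found else 'no match'))
```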

3 Multiple Matches

So far the example patterns have all used search() to look for single instances of literal text strings.

3.1 findall()

The findall() function returns all of the substrings of the input that match the pattern without overlapping.

import re

text = 'abbaaabbbbaaaaa'

pattern = 'ab'

for match in re.findall(pattern, text):
    print 'Found "%s"' % match

There are two instances of ab in the input string.

python re_findall.py

Found "ab"
Found "ab"

3.2 finditer()

finditer() returns an iterator that produces Match instances instead of the strings returned by findall().

import re

text = 'abbaaabbbbaaaaa'

pattern = 'ab'

for match in re.finditer(pattern, text):
    s = match.start()
    e = match.end()
    print 'Found "%s" at %d:%d' % (text[s:e], s, e)

This example finds the same two occurrences of ab, and the Match instance shows where they are in the original input.
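
The two functions can be compared side by side. This sketch collects the strings from findall() and the (start, end) spans from finditer() for the same input, and adds a second, assumed input ('aaaaa') to show what "without overlapping" means in practice.

```python
import re

text = 'abbaaabbbbaaaaa'
pattern = 'ab'

# findall() gives the matched strings, finditer() gives Match objects.
strings = re.findall(pattern, text)
spans = [m.span() for m in re.finditer(pattern, text)]
print(strings)
print(spans)

# Matches never overlap: scanning resumes where the previous match ended.
overlap_spans = [m.span() for m in re.finditer('aa', 'aaaaa')]
print(overlap_spans)
```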

4 Character Sets

A character set is a group of characters, any one of which can match at that point in the pattern. For example, [ab] would match either a or b.

from re_test_patterns import test_patterns

test_patterns('abbaaabbbbaaaaa',
              [ '[ab]',    # either a or b
                'a[ab]+',  # a followed by one or more a or b
                'a[ab]+?', # a followed by one or more a or b, not greedy
                ])


test_patterns('This is some text -- with punctuation.',
              [ '[^-. ]+',  # sequences without -, ., or space
                ])

5 Repetition

5.1 *

* doesn’t match the literal character *; instead, it specifies that the previous character can be matched zero or more times, instead of exactly once.

For example, ca*t will match ct (0 a characters), cat (1 a), caaat (3 a characters), and so forth.

from re_test_patterns import test_patterns

test_patterns('abbaaabbbbaaaaa',
              [ 'ab*',     # a followed by zero or more b
                'ab+',     # a followed by one or more b
                'ab?',     # a followed by zero or one b
                'ab{3}',   # a followed by three b
                'ab{2,3}', # a followed by two to three b
                ])
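
To see the effect of each repetition form without the helper module, the same patterns can be passed straight to findall(). This is a minimal sketch; the word list used for ca*t is made up for illustration.

```python
import re

text = 'abbaaabbbbaaaaa'

# Collect the substrings each repetition pattern matches.
repetition = {}
for pattern in ['ab*', 'ab+', 'ab?', 'ab{3}', 'ab{2,3}']:
    repetition[pattern] = re.findall(pattern, text)
    print('%-8s %s' % (pattern, repetition[pattern]))

# ca*t accepts any number of a characters between c and t.
words = ['ct', 'cat', 'caaat', 'cut']
matches = [w for w in words if re.match(r'ca*t$', w)]
print(matches)
```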

6 Matching characters

6.1 Metacharacters

. ^ $ * + ? { } [ ] \ | ( )

  • The first metacharacters we’ll look at are [ and ]. They’re used for specifying a character class, which is a set of characters that you wish to match. Characters can be listed individually, or a range of characters can be indicated by giving two characters and separating them by a '-'. For example, [abc] will match any of the characters a, b, or c; this is the same as [a-c], which uses a range to express the same set of characters. If you wanted to match only lowercase letters, your RE would be [a-z].

6.1.1 Anchors

^

The match must occur at the beginning of the line.

$

The match must occur at the end of the line.

\<

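The match must occur at the beginning of a word. This is GNU grep and Vim syntax; Python's re does not support it, so use \b there. To anchor at the beginning of the string in Python, use \A.

\>

The match must occur at the end of a word. This is also GNU grep and Vim syntax. To anchor at the end of the string in Python, use \Z.

\b

The match must occur at a word boundary: the position between a word character (\w) and a non-word character (\W), or between a word character and the start or end of the string.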
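
The anchors can be tried out in Python as follows. This sketch uses only anchors that Python's re supports (^, $, \A, \Z, \b) and an assumed two-line sample string; the MULTILINE flag is what makes ^ and $ work per line.

```python
import re

text = 'first line\nsecond line'

# Without MULTILINE, ^ anchors only at the start of the whole string.
start_default = re.findall(r'^\w+', text)
# With MULTILINE, ^ and $ also match at the start and end of each line.
start_multi = re.findall(r'^\w+', text, re.MULTILINE)
end_multi = re.findall(r'\w+$', text, re.MULTILINE)
# \A anchors to the start of the string regardless of flags.
start_string = re.findall(r'\A\w+', text, re.MULTILINE)

# \b matches only at word boundaries, so 'cat' inside 'concatenate' is skipped.
whole_words = re.findall(r'\bcat\b', 'cat concatenate cat.')

for result in (start_default, start_multi, end_multi, start_string, whole_words):
    print(result)
```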

6.1.2 Pattern Matching

  • ?

Indicates zero or one occurrences of the preceding element.

  • You can match the characters not listed within the class by complementing the set. This is indicated by including a '^' as the first character of the class. A '^' anywhere else inside a character class simply matches the '^' character, while outside a class it is the beginning-of-line anchor. For example, [^5] will match any character except '5'.

  • \

As in Python string literals, the backslash can be followed by various characters to signal various special sequences. It’s also used to escape all the metacharacters so you can still match them in patterns; for example, if you need to match a [ or \, you can precede them with a backslash to remove their special meaning: \[ or \\.

  • \d

Matches any decimal digit; this is equivalent to the class [0-9].

  • \D

Matches any non-digit character; this is equivalent to the class [^0-9].

  • \s

Matches any whitespace character; this is equivalent to the class [ \t\n\r\f\v].

  • \S

Matches any non-whitespace character; this is equivalent to the class [^ \t\n\r\f\v].

  • \w

Matches any alphanumeric character; this is equivalent to the class [a-zA-Z0-9_].

  • \W

Matches any non-alphanumeric character; this is equivalent to the class [^a-zA-Z0-9_].

  • \(

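Matches a literal opening parenthesis; the backslash removes its special grouping meaning.

  • [\u4e00-\u9fa5]+

Matches one or more Chinese characters (CJK ideographs in the range U+4E00 to U+9FA5).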
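
A short sketch of the classes above in use. The sample strings are made up; the Chinese text is written with \u escapes in a u'' literal so the block runs unchanged on Python 2 and 3 (u'\u6c49\u5b57' is the word 汉字).

```python
import re

text = 'Call 555-0199 or 555-0100, room_7 is free'

digits = re.findall(r'\d+', text)               # runs of decimal digits
words = re.findall(r'\w+', 'room_7 is free')    # letters, digits, underscore
not_five = re.findall(r'[^5]', '5a5b')          # anything except '5'
parens = re.findall(r'\(', 'f(x) and g(y)')     # literal opening parenthesis

# Chinese characters embedded in otherwise ASCII text.
mixed = u'regex \u6c49\u5b57 test'
chinese = re.findall(u'[\u4e00-\u9fa5]+', mixed)

for result in (digits, words, not_five, parens, chinese):
    print(result)
```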

6.1.3 Quantification

  • *

Indicates zero or more occurrences of the preceding element.

  • +

Indicates one or more occurrences of the preceding element.

  • ?

Indicates zero or one occurrences of the preceding element.

  • {m,n}

Matches between m and n occurrences of the preceding element. For example, w{3,5} matches between three and five w's.
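
The {m,n} form can be checked against runs of different lengths. In this sketch the input string is an assumption chosen to include runs that are too short, in range, and too long; it also shows the non-greedy variant and the use of \b to reject runs that are too long.

```python
import re

text = 'w ww www wwww wwwwww'

# Greedy: take as many w's as allowed, up to five.
greedy = re.findall(r'w{3,5}', text)
# Non-greedy: stop as soon as the minimum of three is reached.
lazy = re.findall(r'w{3,5}?', text)
# Word boundaries reject the run of six, which is longer than five.
exact = re.findall(r'\bw{3,5}\b', text)

print(greedy)
print(lazy)
print(exact)
```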

6.2 Example patterns

  • url:

https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{2,256}\.[a-z]{2,6}\b([-a-zA-Z0-9@:%_\+.~#?&//=]*)

  • date:

01 Jan 2003

^\d{2}\s(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s\d{4}$

Jan 31, 2018

(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s\d{1,2},\s\d{4}

  • time:
  • a whitespace character followed by a whole number:

\s\b\d+\b

  • content to Org mode strings (pattern -> replacement):

SECTION\s\w+: -> *
(\t){1}CHAPTER(\s)[\d]{1,} -> **
(\t){1}[\d]{1,} ->

(\t){2} -> **
(\t){1} -> *
  • Chinese:

[\u4e00-\u9fa5]

  • Chinese punctuation:

[\u3002\uff1b\uff0c\uff1a\u201c\u201d\uff08\uff09\u3001\uff1f\u300a\u300b]

  • English punctuation:

[!#$%&'()*+,-./:;<=>?@[\]^_`{|}~]
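
A sketch of the URL and date patterns applied in Python. The month alternation here covers all twelve months, and the sample strings (including the example.com address) are made up for illustration; the patterns are heuristics and are not a full validator for either format.

```python
import re

MONTHS = 'Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec'

# "01 Jan 2003" style: the whole string must be a date.
day_first = re.compile(r'^\d{2}\s(%s)\s\d{4}$' % MONTHS)
# "Jan 31, 2018" style: may appear anywhere in the text.
month_first = re.compile(r'(%s)\s\d{1,2},\s\d{4}' % MONTHS)

# URL pattern from the list above, split over two lines for readability.
url = re.compile(
    r'https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{2,256}\.[a-z]{2,6}\b'
    r'([-a-zA-Z0-9@:%_\+.~#?&//=]*)')

date_hit = month_first.search('Released on Jan 31, 2018.')
url_hit = url.search('see https://www.example.com/path?q=1 for details')

print(bool(day_first.match('01 Aug 2003')))
print(date_hit.group(0))
print(url_hit.group(0))
```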

7 Replacing content with Org mode strings
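
The tab-to-star mappings listed under the example patterns could be applied with re.sub(). This is only a sketch under stated assumptions: tabs_to_org_headings() is a hypothetical helper, two leading tabs are taken to mean a second-level heading and one leading tab a first-level heading, and a space is added after the stars because Org mode requires one.

```python
import re


def tabs_to_org_headings(text):
    """Turn leading tabs into Org mode heading stars.

    Hypothetical helper: assumes two leading tabs mark a second-level
    heading (**) and one leading tab a first-level heading (*).
    """
    # Handle the longer run first so a two-tab line is not
    # caught by the one-tab rule.
    text = re.sub(r'^\t{2}', '** ', text, flags=re.MULTILINE)
    text = re.sub(r'^\t', '* ', text, flags=re.MULTILINE)
    return text


outline = '\tChapter one\n\t\tFirst section\nBody text'
converted = tabs_to_org_headings(outline)
print(converted)
```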

8 Linux

  • grep: Invalid range end

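This happens because the hyphen sits between other characters inside the bracket expression, so grep reads it as a range. Inside a bracket expression the backslash is a literal character rather than an escape, so \-' is taken as the range from \ to ', which runs backwards and is therefore invalid.

You are basically doing

grep "[\-']" file

To match a literal hyphen, put it first or last in the bracket expression:

grep "[-']" file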
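
Python's re rejects a backwards range in a similar way, which makes the mistake easy to reproduce. This sketch is an analogy, not a description of how grep works internally: unlike grep, Python does treat a backslash inside a character class as an escape, so the example uses an explicitly reversed range instead.

```python
import re

# A reversed range is rejected at compile time.
try:
    re.compile(r'[z-a]')
    error = None
except re.error as exc:
    error = str(exc)
print(error)

# With the hyphen first in the class, it is a literal character.
hits = re.findall(r"[-']", "it's a well-known fix")
print(hits)
```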