reading-notes

Class 19: Automation

Python Regular Expressions Tutorial


Search engines and the search-and-replace tools of word processors and text editors all use regular expressions to complete their tasks.

Regular expressions (regex) help in manipulating textual data, which is often a prerequisite for data science projects involving text mining.


The following topics are covered:

Quick Look Summary Table


Dictionary:


Start with importing the Python library that supports Regex:

import re

Basic Patterns: Ordinary Characters

Ordinary characters are the simplest regular expressions. They match themselves exactly and have no special meaning in regular expression syntax.

Ordinary characters include the standard alphanumeric characters: 0-9, a-z, and A-Z.


The match() function returns a match object if the beginning of the text matches the pattern. Otherwise, it returns None.

Example:
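A minimal sketch of match() on ordinary characters. The pattern and sample strings are made up for illustration and are not from the tutorial. Note that match() only looks at the start of the string.

```python
import re

# Hypothetical pattern and inputs, chosen only for illustration.
pattern = r"Cookie"

hit = re.match(pattern, "Cookie jar")      # pattern occurs at the start
miss = re.match(pattern, "My Cookie jar")  # pattern occurs, but not at the start

print(hit.group())  # the text that matched
print(miss)         # None, because match() anchors at the beginning
```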

When a string literal is prefixed with r, a backslash (\) is kept as a literal backslash rather than being interpreted as the start of an escape sequence.

Regular expression syntax often involves backslash-escaped characters. To prevent Python from interpreting these as string escape sequences, use the raw (r) prefix on the pattern.
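The effect of the raw prefix can be sketched as follows. The sample text is an assumption made for illustration; \b is used because it means "backspace" in a normal string but "word boundary" in a regular expression.

```python
import re

# "\n" is one character (a newline); r"\n" is two characters (backslash, n).
plain_len = len("\n")
raw_len = len(r"\n")
print(plain_len, raw_len)

text = "a cat sat"  # hypothetical sample text

# With the raw prefix, the regex engine receives \b and treats it as a word boundary.
with_raw = re.search(r"\bcat\b", text)

# Without it, Python turns "\b" into a backspace character before re ever sees it.
without_raw = re.search("\bcat\b", text)

print(with_raw.group())
print(without_raw)
```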


Wild Card Characters: Special Characters

Special characters do not match themselves literally; instead, they have a special meaning when used in a regular expression.

With the search() function, you scan through the given string/sequence, looking for the first location where the regular expression produces a match.

The group() function returns the string matched by the regular expression. You will see both these functions in more detail later.
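A short sketch of search() and group() with one special character, the dot, which matches any character except a newline. The sample sentence is invented for illustration.

```python
import re

text = "Eat cake and then more cake"  # hypothetical sample text

# "." is a wildcard, so "c.ke" matches "cake" (and would also match "coke").
found = re.search(r"c.ke", text)

matched_text = found.group()  # the substring that matched
first_index = found.start()   # where the first match begins

print(matched_text, first_index)
```

Only the first occurrence is reported, even though the text contains the word twice.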

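TIP: ^ and \A are effectively the same, and so are $ and \Z, except in MULTILINE mode, where ^ and $ also match at the start and end of each line. (Even outside MULTILINE mode, $ also matches just before a trailing newline, while \Z matches only at the very end of the string.)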
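The difference between the anchors can be checked with a small sketch. The two-line sample text is an assumption made for illustration.

```python
import re

text = "first line\nsecond line\n"  # made-up sample text

# Default mode: ^ anchors only at the very start of the string.
default_starts = re.findall(r"^\w+", text)

# MULTILINE mode: ^ also anchors right after each newline.
multiline_starts = re.findall(r"^\w+", text, re.MULTILINE)

# \A ignores MULTILINE and still anchors only at the start of the string.
absolute_starts = re.findall(r"\A\w+", text, re.MULTILINE)

# $ can match just before a trailing newline; \Z matches only at the true end.
dollar_end = re.search(r"line$", text)
z_end = re.search(r"line\Z", text)

print(default_starts, multiline_starts, absolute_starts)
print(dollar_end is not None, z_end is not None)
```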


Repetitions

The (+) and (*) quantifiers are said to be greedy. See below for an explanation of greedy matching.
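As a rough illustration of the common repetition quantifiers, the sketch below applies *, + and ? to the same made-up string. The sample string is an assumption, not taken from the tutorial.

```python
import re

sample = "a ab abbb"  # hypothetical sample text

# * means zero or more of the preceding item.
star = re.findall(r"ab*", sample)

# + means one or more of the preceding item.
plus = re.findall(r"ab+", sample)

# ? means zero or one of the preceding item.
optional = re.findall(r"ab?", sample)

print(star)
print(plus)
print(optional)
```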


Grouping in Regular Expressions

The group feature of regular expressions allows you to pick out parts of the matching text. Parts of a regular expression pattern bounded by parentheses () are called groups. The parentheses do not change what the expression matches, but rather form groups within the matched sequence.

Example:

<--- INPUT
statement = 'Please contact us at: support@datacamp.com'
match = re.search(r'([\w\.-]+)@([\w\.-]+)', statement) # <-- Two groups: username and host
if match: # Only read the groups if the search succeeded
  print("Email address:", match.group()) # The whole matched text
  print("Username:", match.group(1)) # The username (group 1)
  print("Host:", match.group(2)) # The host (group 2)
OUTPUT ---> 
Email address: support@datacamp.com
Username: support
Host: datacamp.com

Another way of doing the same is to use named groups, which are written with angle brackets < >. Named groups make your code more readable. The syntax for creating a named group is: (?P<name>...). Replace the name part with the name you want to give to your group.

The ... represents the rest of the matching syntax. See this in action using the same example as before…
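A sketch of the earlier email example rewritten with named groups. The group names username and host are assumptions chosen for this illustration; any valid identifier would work.

```python
import re

statement = 'Please contact us at: support@datacamp.com'

# Same pattern as before, but each group now carries a (hypothetical) name.
match = re.search(r'(?P<username>[\w\.-]+)@(?P<host>[\w\.-]+)', statement)

if match:
    print("Email address:", match.group())          # the whole matched text
    print("Username:", match.group('username'))     # access by name
    print("Host:", match.group('host'))
    print("Host by number:", match.group(2))        # numbers still work
```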

TIP: You can always access the named groups using numbers instead of the name. But as the number of groups increases, it gets harder to handle them using numbers alone. So, always make it a habit to use named groups instead.


Greedy vs. Non-Greedy Matching

When a special character matches as much of the search sequence (string) as possible, it is said to be a "Greedy Match".

It is the normal behavior of a regular expression, but sometimes this behavior is not desired.

Adding ? after the quantifier makes it perform the match in a non-greedy or minimal fashion; that is, as few characters as possible will be matched.
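The difference is easiest to see on a string with repeated delimiters. This is a minimal sketch; the heading string is a made-up example.

```python
import re

heading = "<h1>TITLE</h1>"  # hypothetical sample text

# Greedy: .* grabs as much as it can, so the match runs to the LAST ">".
greedy = re.match(r"<.*>", heading).group()

# Non-greedy: .*? grabs as little as it can, so the match stops at the FIRST ">".
lazy = re.match(r"<.*?>", heading).group()

print(greedy)
print(lazy)
```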


Functions provided by re

Regular expressions are handled as strings by Python. However, with compile(), you can compile a regular expression pattern into a regular expression object.

When you need to use an expression several times in a single program, using compile() and saving the resulting regular expression object for reuse is more efficient than passing the pattern as a string each time. The gain is often small, though, because the compiled versions of the most recent patterns passed to compile() and the module-level matching functions are cached.
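A small sketch of compiling a pattern once and reusing it. The pattern and the list of lines are assumptions made for illustration.

```python
import re

# Hypothetical pattern: whole words that end in "ing".
ing_word = re.compile(r"\b\w+ing\b")

lines = ["reading notes", "no match here", "testing and parsing"]

# The compiled object exposes the same methods as the module-level functions.
results = [ing_word.findall(line) for line in lines]

print(results)
```

Keeping the compiled object in a well-named variable also documents what the pattern is for, which can matter more than the speed difference.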

TIP: finditer() might be an excellent choice when you want more information about each match. It returns an iterator of match objects, each of which holds not only the text that matched but also its position in the original text.
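A minimal sketch of finditer(), collecting the matched text together with its start and end positions. The sample text is made up for illustration.

```python
import re

text = "cat bat cat"  # hypothetical sample text

# Each item yielded by finditer() is a match object, not a plain string.
hits = [(m.group(), m.start(), m.end()) for m in re.finditer(r"cat", text)]

print(hits)
```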

Compilation Flags

Flag Modifiers:


Cleaning Data in Python



shutil

The shutil module includes high-level file operations such as copying and archiving.

copyfile() copies the contents of the source file to the destination file. It raises IOError if the destination is not writable (in Python 3, IOError is an alias of OSError).
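A sketch of copyfile() in use. It works inside a temporary directory so it can be run safely; the file names source.txt and copy.txt are assumptions made for illustration.

```python
import shutil
import tempfile
from pathlib import Path

copied_text = None

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file names inside a throwaway directory.
    source = Path(tmp) / "source.txt"
    destination = Path(tmp) / "copy.txt"
    source.write_text("hello shutil\n")

    try:
        # Copies file contents only; metadata such as permissions is not copied.
        shutil.copyfile(source, destination)
    except OSError as err:
        # Raised, for example, when the destination cannot be written.
        print("Copy failed:", err)
    else:
        copied_text = destination.read_text()

print(copied_text)
```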

Automation Ideas

Automating Your Browser and Desktop Apps

Watchdog