Regular Expressions
Overview
Teaching: 25 min
Exercises: 20 min
Questions
What are regular expressions?
Objectives
Use grep to select lines from text files that match simple patterns.
In the same way that many of us now use “Google” as a verb meaning “to find”, Unix programmers often use the word “grep”. “grep” is a contraction of “global/regular expression/print”, a common sequence of operations in early Unix text editors. It is also the name of a very useful command-line program.
grep finds and prints lines in files that match a pattern.
For our examples,
we will use a file that contains three haikus taken from a
1998 competition in Salon magazine.
cat haiku.txt
The Tao that is seen
Is not the true Tao, until
You bring fresh toner.

With searching comes loss
and the presence of absence:
"My Thesis" not found.

Yesterday it worked
Today it is not working
Software is like that.
Forever, or Five Years
We haven’t linked to the original haikus because they don’t appear to be on Salon’s site any longer. As Jeff Rothenberg said, “Digital information lasts forever — or five years, whichever comes first.” Luckily, popular content often has backups.
Let’s find lines that contain the word “not”:
grep not haiku.txt
Is not the true Tao, until
"My Thesis" not found.
Today it is not working
Here, not is the pattern we’re searching for. The grep command searches through the file, looking for matches to the pattern specified. To use it, type grep, then the pattern we’re searching for, and finally the name of the file (or files) we’re searching in.
The output is the three lines in the file that contain the letters “not”.
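As a small illustration of the "file (or files)" part, the sketch below searches two files at once. The file names and the contents of notes.txt are made up for this example. When more than one file is given, grep prefixes each matching line with the name of the file it came from.

```shell
# Work in a scratch directory so nothing existing is overwritten.
workdir=$(mktemp -d)

# haiku3.txt holds the third haiku; notes.txt is a hypothetical second file.
printf 'Yesterday it worked\nToday it is not working\nSoftware is like that.\n' > "$workdir/haiku3.txt"
printf 'toner ordered\nprinter not fixed yet\n' > "$workdir/notes.txt"

# Pattern first, then one or more file names.
matches=$(cd "$workdir" && grep not haiku3.txt notes.txt)
echo "$matches"
# haiku3.txt:Today it is not working
# notes.txt:printer not fixed yet
```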
Let’s try a different pattern: “The”.
grep The haiku.txt
The Tao that is seen
"My Thesis" not found.
This time, two lines that include the letters “The” are printed. However, one instance of those letters is contained within a larger word, “Thesis”.
To restrict matches to lines containing the word “The” on its own, we can run grep with the -w flag. This limits matches to word boundaries.
grep -w The haiku.txt
The Tao that is seen
Note that a “word boundary” includes the start and end of a line, so a word does not have to be surrounded by spaces to match.
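To make the idea of a word boundary concrete, here is a minimal sketch using a handful of made-up test lines (they are not part of haiku.txt). It relies on the documented behaviour that letters, digits, and the underscore count as word characters, so punctuation such as a quote mark ends a word but an underscore does not.

```shell
# Made-up test lines: "The" at the start, at the end, inside quotes,
# inside a longer word, and joined to another word by an underscore.
sample=$(printf 'The Tao\nfound The\n"The" quoted\nThesis\nThe_end\n')

whole=$(printf '%s\n' "$sample" | grep -w The)
echo "$whole"
# The Tao
# found The
# "The" quoted
```

The last two test lines are not printed, because in them “The” is immediately followed by another word character.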
Sometimes we don’t want to search for a single word, but a phrase. This is also easy to do with grep by putting the phrase in quotes.
grep -w "is not" haiku.txt
Today it is not working
We’ve now seen that single words don’t need quotes around them, but quotes are required when searching for a phrase of several words. Quotes also make it easier to distinguish the search term or phrase from the file being searched. We will use quotes in the remaining examples.
Another useful option is -n, which numbers the lines that match:
grep -n "it" haiku.txt
5:With searching comes loss
9:Yesterday it worked
10:Today it is not working
Here, we can see that lines 5, 9, and 10 contain the letters “it”.
We can combine options (i.e. flags) as we do with other Unix commands.
For example, let’s find the lines that contain the word “the”. We can combine the option -w, to match “the” only as a whole word, with -n, to number the lines that match:
grep -n -w "the" haiku.txt
2:Is not the true Tao, until
6:and the presence of absence:
Now we want to use the option -i to make our search case-insensitive:
grep -n -w -i "the" haiku.txt
1:The Tao that is seen
2:Is not the true Tao, until
6:and the presence of absence:
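As with many Unix commands, single-letter options can usually be bundled behind one dash. The sketch below recreates haiku.txt in a temporary file (the blank lines between haikus are assumed from the line numbers shown above) and checks that the bundled form gives the same result as the separate flags.

```shell
haiku=$(mktemp)
# The eleven lines of haiku.txt, including the blank lines between haikus.
printf '%s\n' \
    'The Tao that is seen' \
    'Is not the true Tao, until' \
    'You bring fresh toner.' \
    '' \
    'With searching comes loss' \
    'and the presence of absence:' \
    '"My Thesis" not found.' \
    '' \
    'Yesterday it worked' \
    'Today it is not working' \
    'Software is like that.' > "$haiku"

separate=$(grep -n -w -i "the" "$haiku")   # flags given one by one
bundled=$(grep -nwi "the" "$haiku")        # same flags behind a single dash
echo "$bundled"
# 1:The Tao that is seen
# 2:Is not the true Tao, until
# 6:and the presence of absence:
```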
Now, we want to use the option -v to invert our search, i.e., we want to output the lines that do not contain the word “the”.
grep -n -w -v "the" haiku.txt
1:The Tao that is seen
3:You bring fresh toner.
4:
5:With searching comes loss
7:"My Thesis" not found.
8:
9:Yesterday it worked
10:Today it is not working
11:Software is like that.
grep has lots of other options. To find out what they are, see the manual (man grep).
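One such option is -c, which prints the number of matching lines instead of the lines themselves. The sketch below is only an illustration: it rebuilds haiku.txt in a temporary file, assuming the blank lines implied by the numbered output above, and counts lines rather than printing them.

```shell
haiku=$(mktemp)
# The eleven lines of haiku.txt, including the blank lines between haikus.
printf '%s\n' \
    'The Tao that is seen' \
    'Is not the true Tao, until' \
    'You bring fresh toner.' \
    '' \
    'With searching comes loss' \
    'and the presence of absence:' \
    '"My Thesis" not found.' \
    '' \
    'Yesterday it worked' \
    'Today it is not working' \
    'Software is like that.' > "$haiku"

with_not=$(grep -c "not" "$haiku")            # lines containing "not"
without_the=$(grep -c -w -v "the" "$haiku")   # lines without the word "the"
echo "$with_not $without_the"
# 3 9
```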
Using grep
Which command would result in the following output:
and the presence of absence:
1. grep "of" haiku.txt
2. grep -E "of" haiku.txt
3. grep -w "of" haiku.txt
4. grep -i "of" haiku.txt
Solution
The correct answer is 3, because the -w flag looks only for whole-word matches. The other options will all match “of” when part of another word.
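The reasoning can be checked with a short sketch. It uses just the two lines of haiku.txt that contain the letters “of”, written to a temporary file.

```shell
haiku=$(mktemp)
# The only two lines of haiku.txt that contain the letters "of".
printf 'and the presence of absence:\nSoftware is like that.\n' > "$haiku"

plain=$(grep "of" "$haiku")      # also matches the "of" inside "Software"
whole=$(grep -w "of" "$haiku")   # matches "of" only as a whole word
echo "$whole"
# and the presence of absence:
```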
grep’s real power doesn’t come from its options, though; it comes from the fact that patterns can include wildcards. (The technical name for these is regular expressions, which is what the “re” in “grep” stands for.)
Regular Expression Libraries
All popular programming languages have at least one regular expression library. These libraries do not always share the same default behaviour, so you need to be careful.
The regular expressions that we have seen so far didn’t make use of any operators or metacharacters. The characters that act as operators are
^ : matches the beginning of a string (for grep, the beginning of a line)
$ : matches only at the end of a string (for grep, the end of a line)
. : matches any single character
[...] : matches any one of the characters enclosed in the square brackets
[^...] : matches any one character except those in the square brackets
| : is used to specify alternatives
(...) : are used for grouping in regular expressions
* : the preceding regular expression is repeated zero or more times
+ : the preceding regular expression must be matched at least once
? : the preceding regular expression can be matched either once or not at all
{n,m} : the preceding regular expression is repeated n to m times
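The operators listed above can be tried out on a few made-up test strings. The sketch below uses grep -E (extended regular expressions) rather than -P, on the assumption that -E is more widely available. The match helper is a hypothetical convenience function defined only for this example. The last two commands illustrate the point about differing defaults: without -E, grep treats + as an ordinary character.

```shell
# A few made-up test strings, one per line.
words=$(printf 'cat\ncot\ncoat\nct\ncaat\ndog\ncats\n')

# Hypothetical helper: apply an extended regular expression to the test
# strings and join the matching lines with spaces for compact display.
match() {
    printf '%s\n' "$words" | grep -E "$1" | paste -s -d ' ' -
}

dot=$(match '^c.t$')           # cat cot      (. is exactly one character)
class=$(match '^c[ao]t$')      # cat cot      (one of a or o)
negated=$(match '^c[^a]t$')    # cot          (any one character except a)
either=$(match '^(cat|dog)$')  # cat dog      (grouped alternatives)
star=$(match '^ca*t$')         # cat ct caat  (zero or more a)
plus=$(match '^ca+t$')         # cat caat     (one or more a)
optional=$(match '^cats?$')    # cat cats     (s is optional)
counted=$(match '^ca{2,3}t$')  # caat         (two or three a)

# Default (basic) syntax versus extended syntax for the same pattern.
basic=$(printf 'aa\na+\n' | grep 'a+')                             # a+
extended=$(printf 'aa\na+\n' | grep -E 'a+' | paste -s -d ' ' -)   # aa a+

echo "$star | $plus | $basic | $extended"
```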
The power of regular expressions is more visible with structured data.
head snap.txt
Mass x y z vx vy vz
M_sol Parsecs Parsecs Parsecs km/s km/s km/s
0.11 1.15 0.6 1.3 -0.01 0.89 1.12
0.13 -0.87 0.25 -1.11 0.24 0.98 1.23
0.22 1.18 -0.38 0.7 -0.01 1.04 1.08
0.16 1.17 -0.37 0.72 0.13 0.96 0.97
0.15 0.22 -0.7 -0.35 -0.25 0.29 0.84
0.34 0.9 0.65 1.73 0.13 1.95 0.42
0.27 -0.04 0.01 0.07 -0.05 1.16 0.69
If we want to find particles with a mass of 0.13, we can’t simply use
grep 0.13 snap.txt
0.13 -0.87 0.25 -1.11 0.24 0.98 1.23
0.16 1.17 -0.37 0.72 0.13 0.96 0.97
0.34 0.9 0.65 1.73 0.13 1.95 0.42
0.47 0.39 -0.54 -0.7 -0.13 0.93 1.16
0.3 0.17 -0.69 -0.37 0.13 0.45 0.78
0.37 -0.13 -0.03 0.08 0.77 0.65 1.57
0.13 0.0 -0.1 -0.32 -0.49 0.46 1.16
0.13 1.09 -0.66 0.41 0.21 0.24 0.02
0.13 1.16 0.45 1.37 0.3 0.08 0.87
0.13 0.72 -0.36 0.9 0.34 0.59 0.94
0.13 0.21 -0.53 -0.74 0.12 0.61 1.38
0.13 -1.19 -0.05 -0.82 0.73 0.32 1.36
0.16 0.81 0.13 0.81 0.25 0.09 1.01
0.13 -0.03 -0.22 -1.32 0.35 0.4 1.1
0.25 0.53 -0.8 -0.4 -0.13 0.42 0.75
0.13 -1.31 -0.18 -0.73 0.23 0.79 0.1
0.13 0.58 0.69 1.96 0.13 0.39 0.42
0.1 0.13 -0.67 -0.44 0.03 0.7 0.72
0.28 0.1 -0.13 0.35 -0.28 0.8 0.61
because it also returns lines in which 0.13 appears in another column, such as a position or a velocity. With regular expressions, we can restrict the search to the first “column”.
grep -P '^0\.13' snap.txt
0.13 -0.87 0.25 -1.11 0.24 0.98 1.23
0.13 0.0 -0.1 -0.32 -0.49 0.46 1.16
0.13 1.09 -0.66 0.41 0.21 0.24 0.02
0.13 1.16 0.45 1.37 0.3 0.08 0.87
0.13 0.72 -0.36 0.9 0.34 0.59 0.94
0.13 0.21 -0.53 -0.74 0.12 0.61 1.38
0.13 -1.19 -0.05 -0.82 0.73 0.32 1.36
0.13 -0.03 -0.22 -1.32 0.35 0.4 1.1
0.13 -1.31 -0.18 -0.73 0.23 0.79 0.1
0.13 0.58 0.69 1.96 0.13 0.39 0.42
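The effect of the ^ anchor can be seen on a tiny stand-in for snap.txt. This is a sketch under a few assumptions: the columns are separated by tabs, the first two rows are copied from the output above, and the third row (mass 0.135) is hypothetical, added only to show that '^0\.13' would also match a longer number that starts with 0.13. Plain grep is used instead of -P, with a literal tab stored in a shell variable.

```shell
tab=$(printf '\t')   # a literal tab character
snap=$(mktemp)
{
    printf '0.13\t-0.87\t0.25\t-1.11\t0.24\t0.98\t1.23\n'   # mass 0.13
    printf '0.16\t1.17\t-0.37\t0.72\t0.13\t0.96\t0.97\n'    # vx 0.13
    printf '0.135\t0.5\t0.5\t0.5\t0.1\t0.1\t0.1\n'          # hypothetical row
} > "$snap"

anywhere=$(grep -c '0\.13' "$snap")        # 0.13 in any column
first_col=$(grep -c '^0\.13' "$snap")      # line starts with 0.13
exact=$(grep -c "^0\.13${tab}" "$snap")    # first field is exactly 0.13
echo "$anywhere $first_col $exact"
# 3 2 1
```

Requiring the tab after 0.13 only matters if masses can have more than two decimal places; for data like the rows shown above, the shorter pattern gives the same result.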
If we want to find particles with a mass of 3 solar masses, we could try
grep -P '^3' snap.txt
3.01 0.18 -0.22 -1.25 0.1 0.42 1.0
34.38 -1.15 -0.25 -0.9 0.14 0.55 1.23
Note that this also returns a particle whose mass merely starts with the digit 3:
34.38 -1.15 -0.25 -0.9 0.14 0.55 1.23
We need to refine our regular expression. A more reliable one is
grep -P '^3\.' snap.txt
3.01 0.18 -0.22 -1.25 0.1 0.42 1.0
If we want to find particles with vx of 0.16, we could try
grep -P '^.*\t.*\t.*\t.*\t0\.16' snap.txt
0.35 -0.16 0.02 -0.05 0.16 1.15 1.11
0.14 0.82 0.47 1.15 0.16 0.91 0.64
0.11 1.16 -0.5 0.52 0.11 0.68 0.16
0.18 -1.36 -0.27 -0.73 0.16 0.81 0.93
0.18 0.17 -0.51 -0.76 0.16 0.3 1.42
0.25 1.45 0.21 0.32 0.16 0.45 1.11
0.36 0.84 0.1 0.99 0.16 0.38 1.02
0.11 0.22 -0.5 -0.71 0.16 -0.02 0.5
0.18 0.85 0.04 0.82 0.16 0.74 0.89
0.41 -0.01 -0.32 -1.36 0.16 -0.01 1.13
1.83 0.23 -0.51 -0.79 0.06 0.16 1.28
Note that the output includes the lines
0.11 1.16 -0.5 0.52 0.11 0.68 0.16
and
1.83 0.23 -0.51 -0.79 0.06 0.16 1.28
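In these lines 0.16 is in the vz or vy column rather than vx, because .* can also match tab characters and so can span several columns. We could fix this by limiting each field to between 3 and 5 characters: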
grep -P '^(.{3,5}\t){4}0\.16' snap.txt
0.35 -0.16 0.02 -0.05 0.16 1.15 1.11
0.14 0.82 0.47 1.15 0.16 0.91 0.64
0.18 -1.36 -0.27 -0.73 0.16 0.81 0.93
0.18 0.17 -0.51 -0.76 0.16 0.3 1.42
0.25 1.45 0.21 0.32 0.16 0.45 1.11
0.36 0.84 0.1 0.99 0.16 0.38 1.02
0.11 0.22 -0.5 -0.71 0.16 -0.02 0.5
0.18 0.85 0.04 0.82 0.16 0.74 0.89
0.41 -0.01 -0.32 -1.36 0.16 -0.01 1.13
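Limiting the field width works for this file, but it depends on every value being 3 to 5 characters long. An alternative sketch, not taken from the lesson, is to describe a field as "any run of characters that are not tabs". The example assumes tab-separated columns, uses three rows copied from the output above, and uses grep -E with a literal tab in a shell variable instead of -P and \t.

```shell
tab=$(printf '\t')   # a literal tab character
snap=$(mktemp)
{
    printf '0.35\t-0.16\t0.02\t-0.05\t0.16\t1.15\t1.11\n'   # vx = 0.16
    printf '0.11\t1.16\t-0.5\t0.52\t0.11\t0.68\t0.16\n'     # vz = 0.16
    printf '1.83\t0.23\t-0.51\t-0.79\t0.06\t0.16\t1.28\n'   # vy = 0.16
} > "$snap"

# One field: zero or more non-tab characters followed by a tab.
field="[^${tab}]*${tab}"

# Greedy version: .* may swallow tabs, so all three rows match.
greedy=$(grep -c -E "^.*${tab}.*${tab}.*${tab}.*${tab}0\.16" "$snap")

# Field-based version: skip exactly four fields, then require 0.16
# to be the whole fifth field (followed by a tab or the end of the line).
by_field=$(grep -E "^(${field}){4}0\.16(${tab}|\$)" "$snap")

echo "$greedy"
# 3
echo "$by_field"
```

Only the first row, where vx is 0.16, is printed by the field-based pattern.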
Alternatives
Select the lines of gapminder_data.csv that contain Europe or Asia.
Solution
grep -P "(Europe|Asia)" gapminder_data.csv
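The same selection can be written in other ways. The sketch below runs on an invented miniature CSV (two columns only, not the real gapminder_data.csv) and compares the | operator under -E with giving grep several patterns through repeated -e options.

```shell
csv=$(mktemp)
# Invented miniature file; the real gapminder_data.csv has more columns and rows.
printf 'country,continent\nNorway,Europe\nJapan,Asia\nKenya,Africa\n' > "$csv"

with_alternation=$(grep -E 'Europe|Asia' "$csv")     # one pattern with |
with_two_patterns=$(grep -e 'Europe' -e 'Asia' "$csv")   # two separate patterns
echo "$with_alternation"
# Norway,Europe
# Japan,Asia
```

Both forms select a line if it matches either alternative. The parentheses in the solution are optional when the alternatives make up the whole pattern.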
Key Points
grep selects lines in files that match patterns.