Standard Regular Expressions

In Linux and other Unix-like operating systems, there are two types of regular expressions that you can use: “standard” regular expressions (also called “basic” regular expressions) and “extended” regular expressions.

Standard regular expressions are the default type of regular expression in utilities such as grep and sed (awk, by contrast, uses extended regular expressions). They treat fewer characters as special than extended regular expressions do, so some features need extra backslashes. In a standard regular expression, a backslash (\) placed before a metacharacter such as . (period) or * (asterisk) removes its special meaning, and a backslash placed before certain ordinary characters, such as { and }, gives them a special meaning.

Extended regular expressions are a more convenient type of regular expression that is supported by utilities such as grep, sed, and awk. They use a backslash (\) to escape special characters, just like standard regular expressions, but they also treat additional characters as metacharacters without a backslash, such as “{}” to specify repetition counts, “+” and “?” for one-or-more and zero-or-one repetition, and “|” to specify alternatives.

To use extended regular expressions in grep or sed, you typically need to use a command-line option. In grep, you can use the “-E” option to specify that you want to use extended regular expressions. In sed, the “-E” option does the same in both GNU and BSD sed; GNU sed also accepts “-r” as a synonym.

Here’s an example of using extended regular expressions in grep:

$ grep -E '^[0-9]+$' file.txt

This command searches for lines in “file.txt” that consist entirely of one or more digits (0-9). The “^” and “$” characters are used to match the beginning and end of the line, respectively. The “[0-9]+” expression specifies one or more digits. The “-E” option tells grep to use extended regular expressions.
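As a rough illustration of the difference between the two syntaxes, the sketch below runs the same search both ways. The sample lines and variable names are made up for this example. The extended pattern uses “+”, while the standard pattern spells out “one or more digits” as one digit followed by zero or more digits.

```shell
# Hypothetical sample input: one candidate per line (the third line is empty).
sample=$(printf '%s\n' '12345' '12a45' '' '007')

# Extended syntax: + means "one or more".
ere_matches=$(printf '%s\n' "$sample" | grep -E '^[0-9]+$')

# Standard syntax: one digit, then zero or more further digits.
bre_matches=$(printf '%s\n' "$sample" | grep '^[0-9][0-9]*$')

printf 'ERE:\n%s\n' "$ere_matches"
printf 'BRE:\n%s\n' "$bre_matches"
```

Both commands should print the same two lines, “12345” and “007”; the empty line and “12a45” are rejected.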

Standard regular expressions

Literal matching is the process of matching a specific string of characters as they appear, rather than interpreting the characters as special metacharacters.

To match a literal string in a regular expression, you simply write the string: any character that is not a metacharacter matches itself. For example, the regular expression hello matches the text “hello” wherever it appears in a line. On the command line, enclose the pattern in single quotation marks (for example, 'hello') so that the shell passes it to the utility unchanged; the quotation marks are not part of the regular expression.

You can also use the \ character to escape any special metacharacters within the string, so that they are treated as literals rather than special characters. For example, the regular expression hello\. matches “hello.” with a literal period, whereas hello. would also match “hellos” or “hello!”.
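The following sketch shows how escaping changes what a pattern with a period matches. The input lines are invented for the example. It also uses grep's “-F” option, which treats the whole pattern as a fixed string, as one alternative to escaping each metacharacter by hand.

```shell
# Hypothetical input lines.
lines=$(printf '%s\n' 'hello.' 'hellos' 'say hello. now')

# Unescaped period: any character may follow "hello".
loose=$(printf '%s\n' "$lines" | grep -c 'hello.')

# Escaped period: only a real "." may follow "hello".
strict=$(printf '%s\n' "$lines" | grep -c 'hello\.')

# -F disables regular expression syntax, so no escaping is needed.
fixed=$(printf '%s\n' "$lines" | grep -c -F 'hello.')

echo "loose=$loose strict=$strict fixed=$fixed"   # loose=3 strict=2 fixed=2
```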

In standard regular expressions (also called “basic” regular expressions), certain characters have special meaning and are used to match patterns in text. Here is a list of some of the most common special characters used in standard regular expressions:

  • . (period) – matches any single character
  • * (asterisk) – matches zero or more occurrences of the preceding character or expression
  • ^ (caret) – matches the beginning of a line
  • $ (dollar sign) – matches the end of a line
  • [] (square brackets) – match any single character inside the brackets
  • [^ ] (square brackets with caret) – match any single character not inside the brackets
  • \ (backslash) – escapes the special meaning of the following character

Character classes in regular expressions using ([ ])

Character classes are used to match any single character from a specific set of characters. You can define a character class by enclosing a set of characters in square brackets [].

For example, the character class [0123456789] would match any digit, and the character class [a-z] would match any lowercase letter.

Here are a few more examples of character classes in regular expressions:

  • [A-Z]: This character class would match any uppercase letter
  • [a-zA-Z]: This character class would match any letter, uppercase or lowercase
  • [0-9]: This character class would match any digit
  • [^0-9]: This character class, which is negated using the ^ metacharacter, would match any character that is not a digit

You can also use character classes in combination with other metacharacters and operators. For example:

  • [A-Z][a-z]*: This regular expression would match an uppercase letter followed by zero or more lowercase letters
  • [0-9]{3}-[0-9]{3}-[0-9]{4}: This regular expression would match a phone number in the format 555-555-5555 (this is extended syntax; in a standard regular expression, write the counts as \{3\} and \{4\})
  • [^a-zA-Z]: This regular expression would match any character that is not a letter
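To try the phone number pattern, a minimal sketch such as the one below can help. The contact lines are hypothetical, and the “-o” option (print only the matched text) is supported by GNU and BSD grep but is not part of POSIX. The second command shows the same pattern written in standard syntax with escaped braces.

```shell
# Hypothetical contact list.
contacts=$(printf '%s\n' 'Alice 555-555-5555' 'Bob 5555-555-555' 'Carol 123-456-7890 ext 4')

# Extended syntax: {n} repetition counts need no backslashes.
phones=$(printf '%s\n' "$contacts" | grep -oE '[0-9]{3}-[0-9]{3}-[0-9]{4}')

# Standard syntax: the braces must be escaped.
phones_bre=$(printf '%s\n' "$contacts" | grep -o '[0-9]\{3\}-[0-9]\{3\}-[0-9]\{4\}')

printf '%s\n' "$phones"
```

Bob's entry is skipped because its final group has only three digits.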

Match any single character using period (.)

In regular expressions, the period (.) is a special character that matches any single character. For example:

  • “a.c” – matches “abc”, “a1c”, “a#c”, etc.
  • “^.ello$” – matches “hello” and “mello”, but not “ello” or “hell”

You can use the period in combination with other special characters, such as the asterisk (*) and the square brackets []. For example:

  • “a.*z” – matches any string that starts with “a” and ends with “z”, with any number of characters in between
  • “b[a-z]*d” – matches any string that starts with “b”, ends with “d”, and has zero or more lowercase letters in between

To match a literal period, precede it with a backslash (\.). For example:

  • “a\.c” – matches “a.c” (the backslash escapes the special meaning of the period)
  • “a\\.c” – matches “a\.c” or “a\xc” (the first backslash escapes the second, so the pattern means a literal backslash, then any single character, then “c”)

Here are some examples of using these special characters in standard regular expressions:

  • “a.*z” – matches any string that starts with “a” and ends with “z”, with any number of characters in between
  • “^[A-Z]” – matches any line that starts with an uppercase letter
  • “^[^A-Z]” – matches any line whose first character is not an uppercase letter
  • “.*@example\.com$” – matches any line that ends with “@example.com” (the backslash is used to escape the special meaning of the period)
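The period patterns above can be checked with a short sketch like this one. The word lists are invented for the example. The first search requires exactly one character before “ello”, and the second allows any number of characters between “a” and “z”.

```shell
# Hypothetical word list.
words=$(printf '%s\n' 'hello' 'mello' 'ello' 'hell' 'hellos')

# Exactly one character, then "ello", and nothing else on the line.
dot_matches=$(printf '%s\n' "$words" | grep '^.ello$')

# Hypothetical phrases for the ".*" combination.
phrases=$(printf '%s\n' 'abcz' 'az' 'a to z!' 'zebra')

# An "a" followed, anywhere later on the line, by a "z".
span_count=$(printf '%s\n' "$phrases" | grep -c 'a.*z')

printf '%s\n' "$dot_matches"
echo "span_count=$span_count"   # span_count=3
```

Only “hello” and “mello” pass the first search; “zebra” fails the second because no “z” follows its “a”.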

Using anchors in Regular Expressions with (^) and ($)

“Anchors” are special characters that match a position in the input string rather than a specific character. The two most common anchors are the caret (^) and the dollar sign ($).

The caret (^) is an anchor that matches the beginning of a line. For example:

  • “^hello” – matches “hello” at the beginning of a line
  • “^[A-Z]” – matches any line that starts with an uppercase letter
  • “^[^A-Z]” – matches any line whose first character is not an uppercase letter

The dollar sign ($) is an anchor that matches the end of a line. For example:

  • “hello$” – matches “hello” at the end of a line
  • “.*@example\.com$” – matches any line that ends with “@example.com” (the backslash is used to escape the special meaning of the period)

You can use anchors in combination with other special characters and patterns to match specific strings or patterns in the input. For example:

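  • “^[0-9]+$” – matches any line that consists entirely of one or more digits (0-9); the “+” requires extended syntax, and the standard equivalent is “^[0-9][0-9]*$”
  • “^The\s[A-Z][a-z]+$” – matches any line that consists of “The” followed by a whitespace character and a capitalized word (e.g., “The Cat”, “The Dog”, etc.); “\s” is a GNU extension, and a literal space works in every tool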
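To see how the anchors narrow a search, consider this small sketch. The input lines are made up for the example, and each count shows how many of them a given pattern accepts.

```shell
# Hypothetical input lines.
lines=$(printf '%s\n' 'hello world' 'say hello' 'hello' 'oh hello there')

anywhere=$(printf '%s\n' "$lines" | grep -c 'hello')     # no anchor
starts=$(printf '%s\n' "$lines" | grep -c '^hello')      # at the start of the line
ends=$(printf '%s\n' "$lines" | grep -c 'hello$')        # at the end of the line
whole=$(printf '%s\n' "$lines" | grep -c '^hello$')      # the entire line

echo "anywhere=$anywhere starts=$starts ends=$ends whole=$whole"
# anywhere=4 starts=2 ends=2 whole=1
```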

Using OR logic in regular expressions with (|)

The | symbol is known as the “pipe” character, and is used to match either the expression before the | or the expression after the |. This is known as an “alternation” or a “logical OR” operator. It belongs to extended regular expressions, so grep needs the “-E” option to recognize it; in a standard regular expression, a plain | matches a literal pipe character.

For example, the regular expression apple|orange would match either the string “apple” or the string “orange”.

Here are a few more examples of how the | operator can be used in regular expressions:

  • cat|dog: This regular expression would match either the string “cat” or the string “dog”.
  • red|blue|green: This regular expression would match either the string “red”, “blue”, or “green”.

You can also use the | operator in combination with other metacharacters and operators. For example:

  • ^A|B$: This regular expression would match any string that starts with the letter A or ends with the letter B
  • cat|dog|bird\.: This regular expression would match either the string “cat”, “dog”, or “bird.” (note the use of the \ character to escape the . metacharacter)
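The alternation examples can be tried with a sketch along these lines. The input lines are hypothetical, and “-E” is used because | is an extended metacharacter.

```shell
# Hypothetical input lines.
animals=$(printf '%s\n' 'my cat' 'hotdog' 'birds' 'a bird.' 'fish')

# Any of three alternatives; the last one ends in a literal period.
pets=$(printf '%s\n' "$animals" | grep -E 'cat|dog|bird\.')

names=$(printf '%s\n' 'Apple' 'B' 'Bob' 'cab B' 'kiwi')

# Lines that start with "A" OR end with "B".
a_or_b=$(printf '%s\n' "$names" | grep -cE '^A|B$')

printf '%s\n' "$pets"
echo "a_or_b=$a_or_b"   # a_or_b=3
```

The line “birds” is not printed because the third alternative requires a literal period after “bird”. In the second search, each anchor applies only to its own side of the |, which is why “Apple” and “cab B” both match.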

Zero or more occurrences in regular expressions with (*)

The * symbol is used to match zero or more occurrences of the preceding character or expression.

For example, the regular expression a* would match any string that contains zero or more occurrences of the letter “a”, such as “”, “a”, “aa”, “aaa”, etc.

Here are a few more examples of how the * operator can be used in regular expressions:

  • [0-9]*: This regular expression would match any string that contains zero or more digits
  • ^A.*B$: This regular expression would match any string that starts with the letter “A” and ends with the letter “B”, with any number of characters in between
  • cat.*dog: This regular expression would match any string that contains the word “cat” followed by zero or more characters, followed by the word “dog” (e.g. “catdog”, “catdogcatdog”, etc.)

You can also use the * operator in combination with other metacharacters and operators. For example:

  • ^A.*B|C$: This regular expression would match any string that starts with the letter “A” and ends with the letter “B”, with any number of characters in between, or any string that ends with the letter “C”
  • ^[0-9]*\.?[0-9]*$: This regular expression would match any string that consists of zero or more digits, followed by an optional decimal point, followed by zero or more digits (e.g. “123”, “123.456”, “0.0”, etc.); the ? is extended syntax, so grep needs the “-E” option for this pattern
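Because * also accepts zero occurrences, a pattern such as a* matches every line, including empty ones. The sketch below, using invented input, contrasts it with aa* (at least one “a”) and then tries the decimal number pattern.

```shell
# Hypothetical input; the second line is empty.
fruits=$(printf '%s\n' 'banana' '' 'kiwi')

# "a*" can match nothing at all, so every line counts.
any_line=$(printf '%s\n' "$fruits" | grep -c 'a*')

# "aa*" needs one "a" before the optional repeats.
at_least_one=$(printf '%s\n' "$fruits" | grep -c 'aa*')

numbers=$(printf '%s\n' '123' '123.456' '0.0' '1.2.3' 'abc')

# Digits, an optional literal period, then more digits (extended syntax).
decimals=$(printf '%s\n' "$numbers" | grep -cE '^[0-9]*\.?[0-9]*$')

echo "any_line=$any_line at_least_one=$at_least_one decimals=$decimals"
# any_line=3 at_least_one=1 decimals=3
```

Note that the decimal pattern would also accept an empty line, since every part of it is optional.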

Negated character set in regular expressions using [^ ]

A negated character class is a character class that is defined using the ^ metacharacter as the first character inside the square brackets []. It matches any single character that is not in the class.

For example, the regular expression [^0-9] would match any character that is not a digit.

The ^ metacharacter has a different meaning outside the brackets, where it anchors the match to the beginning of a line, and the two uses can be combined. For example, the regular expression ^[^A-Z] would match any line whose first character is not an uppercase letter.

Here are a few more examples of negated character classes in regular expressions:

  • [^a-z]: This regular expression would match any character that is not a lowercase letter
  • [^A-Za-z]: This regular expression would match any character that is not a letter (uppercase or lowercase)
  • [^0-9a-zA-Z]: This regular expression would match any character that is not a letter or a digit
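Here is a minimal sketch of negated classes in use, with made-up input lines. It sets LC_ALL=C for each grep call, on the assumption that you want ranges such as A-Z to mean plain ASCII letters; in other locales a range may cover different characters.

```shell
# Hypothetical input lines.
lines=$(printf '%s\n' '12345' '12a45' 'Hello' 'Zed')

# Lines containing at least one character that is not a digit.
non_digit=$(printf '%s\n' "$lines" | LC_ALL=C grep -c '[^0-9]')

# Lines whose first character is not an uppercase letter.
# The first ^ is the anchor; the second ^ negates the class.
not_upper_first=$(printf '%s\n' "$lines" | LC_ALL=C grep -c '^[^A-Z]')

echo "non_digit=$non_digit not_upper_first=$not_upper_first"
# non_digit=3 not_upper_first=2
```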

Escape character in regular expression (\)

The escape character in regular expressions is the backslash (\). It is used to escape the special meaning of certain characters, so that they are treated as literal characters rather than metacharacters.

For example, in extended regular expressions (EREs), the asterisk (*) is a metacharacter that is used to match zero or more occurrences of the preceding character. To match a literal asterisk in an ERE, you would need to use the escape character: \*.

Here are a few more examples of how the escape character can be used in regular expressions:

  • \.: This regular expression would match a literal period (.)
  • \[abc\]: This regular expression would match the literal string “[abc]”, because the escaped brackets no longer define a character class
  • ^A\*B$: This regular expression would match a string that starts with the letter “A”, followed by an asterisk, and ends with the letter “B” (e.g. “A*B”)

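Inside a character class, the rules are different in POSIX tools such as grep and sed: the backslash is an ordinary character there, and special characters are made literal by their position instead. For example:

  • [abc^]: This regular expression would match any single character that is either an “a”, a “b”, a “c”, or a caret (^), because a caret that is not first in the class is literal
  • []a-z]: This regular expression would match any lowercase letter or a closing square bracket (]), because a ] placed first in the class is literal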
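The effect of the backslash outside a character class can be observed with a sketch like the following, which uses invented input lines and compares each escaped pattern with its unescaped form.

```shell
# Hypothetical input lines.
lines=$(printf '%s\n' 'A*B' 'AB' 'AAB' '[abc]' 'a')

# Escaped: a literal asterisk between "A" and "B".
star_literal=$(printf '%s\n' "$lines" | grep -c 'A\*B')

# Unescaped: zero or more "A" characters followed by "B".
star_meta=$(printf '%s\n' "$lines" | grep -c 'A*B')

# Escaped brackets: the five-character text "[abc]".
brackets_literal=$(printf '%s\n' "$lines" | grep -c '\[abc\]')

# Unescaped brackets: any single "a", "b", or "c".
brackets_class=$(printf '%s\n' "$lines" | grep -c '[abc]')

echo "$star_literal $star_meta $brackets_literal $brackets_class"   # 1 3 1 2
```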

Summary:

In Unix-like operating systems, there are two types of regular expressions that can be used: standard (also called basic) and extended. Standard regular expressions are the default in utilities such as grep and sed, and they treat fewer characters as special than extended regular expressions do. In both types, a backslash (\) escapes a metacharacter such as . (period) or * (asterisk) so that it is matched literally. Extended regular expressions are supported by utilities such as grep, sed, and awk, and add unescaped metacharacters for repetition counts with “{}” and alternatives with “|”. To use extended regular expressions, a command-line option is typically required, such as “-E” for grep and “-E” (or “-r” in GNU sed) for sed.

A string of ordinary characters matches itself literally, and a backslash (\) escapes any special metacharacters inside it. Special characters in standard regular expressions include . (period), * (asterisk), ^ (caret), $ (dollar sign), [] (square brackets), and \ (backslash). Character classes, defined by enclosing a set of characters in square brackets [], can also be used to match any character from a specific set. Repetition can be specified using the * (asterisk) to match zero or more occurrences and, in extended regular expressions, the + (plus sign) to match one or more occurrences and the ? (question mark) to match zero or one occurrence of the preceding character or expression. Alternatives can be specified in extended regular expressions using the “|” (vertical bar) symbol to match either the preceding or following expression.

Questions for review:

  1. What are the differences between standard and extended regular expressions?
  2. How do you specify that you want to use extended regular expressions in grep and sed?
  3. How do you match a literal string in a regular expression?
  4. What are some common special characters used in standard regular expressions?
  5. How do you define a character class in a regular expression?
  6. How can character classes be used in combination with other metacharacters and operators?
  7. How do you specify repetition in a regular expression using the *, +, and ? metacharacters?
  8. How do you specify alternatives in a regular expression using the “|” symbol?

 

License

Developers ultimate guide: Linux Bash scripting Copyright © 2022 by Matin Maleki. All Rights Reserved.
