
Regular Expressions: What They Are, How They Are Used, & Why You Need To Learn Them



Introduction


There is nothing a cyber professional loves more than a good shortcut. It's just the way our brains work: if an idea saves time and still gets the job done, we are sold on it. Regular expressions are one such idea, and they are highly satisfying to use. For anyone who has spent hours typing directory paths (you know, with all of those backslashes "\" and finicky special characters), regular expressions can be a huge time-saver. Not only does regex save you the time and frustration of typing long strings, it is also essential for operating most sophisticated cybersecurity tools.


What Are Regular Expressions (Regex)?


A regular expression is a string of text that describes a search pattern. Instead of spelling out every string or path you care about, you write one pattern that matches all of them. Regular expressions are most commonly used to match a sequence of characters, but they apply in many other settings: you can locate specific strings in a document as a way of filtering, validate that a string has an expected format, or perform exclusions.
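As a rough illustration of those three uses (locating, validating, and excluding), here is a minimal Python sketch built on the standard re module. The log lines, the patterns, and every name in it are invented for this example and do not come from any particular tool.

```python
import re

# Hypothetical log lines, invented for this illustration.
log_lines = [
    "login ok user=alice src=10.0.0.5",
    "login failed user=bob src=203.0.113.7",
    "healthcheck ok src=10.0.0.9",
]

# 1. Locate: keep only the lines that contain a failed login.
failed = [line for line in log_lines if re.search(r"login failed", line)]

# 2. Validate: check that a whole string has the expected shape
#    (here, an assumed rule: 3 to 16 lowercase letters).
def is_valid_username(name):
    return re.fullmatch(r"[a-z]{3,16}", name) is not None

# 3. Exclude: drop lines that match a known-benign pattern.
without_healthchecks = [
    line for line in log_lines if not re.search(r"^healthcheck", line)
]

print(failed)
print(is_valid_username("alice"), is_valid_username("Al1ce!"))
print(len(without_healthchecks))
```

Note the difference between re.search, which finds the pattern anywhere in the text, and re.fullmatch, which requires the entire string to fit the pattern. Validation usually wants the second.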


Exclusions are used to reduce the number of false positive alerts received by a SIEM (security information and event management) platform. When implementing an exclusion, we tell the detection system to ignore activity that matches a known, benign pattern so that it stops cluttering future results. With that noise out of the way, the alerts that remain are more likely to be genuine IOAs (Indicators of Attack), the techniques that attackers use to plan or carry out a cyber attack. These are just some examples of IOAs that a well-tuned detection system should surface:

  • Internal hosts communicating with countries outside of normal range or user logins from strange locations

  • Connections on unusual ports (not the standard port 80 for HTTP or port 443 for HTTPS)

  • A sharp increase in SMTP (mail server) traffic, which can indicate an attack such as a DDoS

  • Internal hosts communicating back and forth with public-facing external servers

  • Multiple Honeytoken alerts from one host (indicates an attacker attempting to gain unauthorized access)

Based on the above examples, you can think of an IOA as a piece of solid evidence that a cybercriminal is planning to attack your system. No wonder we want those alerts to stand out instead of being buried under false positives! This is why regular expressions are particularly useful to know in security.


Other Applications (Cybersecurity Tools)


Splunk and CrowdStrike are two prominent tools that use regex day to day. Splunk is a Security Information and Event Management (SIEM) system that specializes in the collection, management, and interpretation of large amounts of data. Splunk uses regular expressions to sort through text and find patterns that help identify important trends and security insights. CrowdStrike is a cloud-based endpoint detection and response (EDR) platform designed to hunt and monitor security threats. CrowdStrike uses regex to perform the exclusions described above, and it can also be integrated with Splunk data to identify malicious activity.


How To Speak Regex


A variety of special characters are needed to speak fluent regex, but this guide to the most common symbols and quantifiers will help you get a feel for it. A quantifier specifies how many times the character before it may repeat in a match.


Regex Character Guide

Symbol: .
Name: Wildcard (Period)
Definition: Matches any single character (by default, any character except a line break). When paired with an asterisk as '.*', it matches any run of characters, so it can stand in for a long string.
Example: 'h.t' matches 'hat', 'hot', 'h9t'. '.*smith@gmail\.com' matches 'john.smith@gmail.com', 'dan.smith@gmail.com', 'katie.smith@gmail.com' (etc.)

Symbol: *
Name: Asterisk
Definition: Matches the preceding character 0 or more times. Known as the "repeater" symbol.
Example: 'ha*t' matches 'ht', 'hat', 'haat', 'haaat' (etc.)

Symbol: +
Name: Plus Sign
Definition: Matches the preceding character 1 or more times. (Also a repeater symbol.)
Example: 'ha+t' matches 'hat', 'haat', 'haaat', but NOT 'ht' because the preceding character (a) must appear at least once.

Symbol: ?
Name: Question Mark
Definition: Makes the preceding character optional: it may appear 0 or 1 times.
Example: 'ha?t' matches 'ht' and 'hat', but NOT 'haaaat' because the preceding character (a) may appear at most once.

Symbol: [ ]
Name: Square Brackets
Definition: Matches any single character from the set or range enclosed in the brackets.
Example: 'b[eau]d' matches 'bed', 'bad', 'bud'. '[0-9]' matches any one digit.

Symbol: { }
Name: Curly Brackets
Definition: Specifies the number of times the preceding character must repeat.
Example: 'lo{2}ve' matches 'loove', while 'lo{1,3}ve' matches 'love', 'loove', 'looove'.

Symbol: |
Name: Vertical Bar
Definition: Represents the OR symbol. Matches any one of the options it separates.
Example: 'th(is|at|ere|eir)' matches 'this', 'that', 'there', 'their'.

Symbol: ^
Name: Caret
Definition: Anchors the match to the start of the string or line. Nothing may come before the pattern that follows the caret.
Example: '^z' matches 'zebra' and 'zoo', but NOT 'xyz', because the 'z' is not at the start.

Symbol: $
Name: Dollar Sign
Definition: Anchors the match to the end of the string or line.
Example: '[0-5]{3}$' matches strings that end in three digits between 0 and 5, such as '123', '145', '345' (etc.)

Symbol: \
Name: Backslash
Definition: The escape character. Makes the special character that follows it match literally.
Example: '\.' matches a single literal period. '\\' matches a single literal backslash.

Symbol: /
Name: Forward Slash
Definition: Placed at the beginning and end of an expression to designate the expression's boundaries. The slashes are delimiters, not part of the pattern.
Example: '/Microsoft\\Programs/' is the pattern 'Microsoft\\Programs' written between delimiters; it matches the literal text 'Microsoft\Programs'.

Symbol: \d
Name: Any digit
Definition: Matches any single digit from 0 to 9.
Example: '\d' matches each digit in '1234567'; '\d+' matches the whole number.

Symbol: \w
Name: Any word character
Definition: Matches any single letter, digit, or underscore.
Example: '\w+' matches 'red', 'blue', 'green' in a sentence.

Symbol: \s
Name: White space
Definition: Matches any single white space character (space, tab, line break).
Example: '\s+abc' matches ' abc'.

Symbol: egrep
Name: Search
Definition: A Linux command that searches text with a regular expression. It is the "grep" command with extended regex syntax enabled (the same as "grep -E").
Example: strings file.sys | egrep 'PATTERN'

Symbol: sort
Name: Sorts output
Definition: A Linux command that sorts the lines of your output into order.
Example: strings file.sys | egrep 'PATTERN' | sort

Symbol: uniq
Name: Remove duplicates
Definition: A Linux command that removes duplicate lines from your output. It only removes duplicates that sit next to each other, so it is usually run after sort.
Example: strings file.sys | egrep 'PATTERN' | sort | uniq

Try to memorize the recurring symbols and writing expressions will come more easily to you. One caution: some security tools accept glob-style path patterns for exclusions in addition to regex. In those patterns, a double star "**" means that the pattern spans multiple directory levels instead of just one. It is not a regex quantifier.
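One way to make the chart stick is to test each symbol against strings that should and should not match. The following is a small Python sketch of that idea using the standard re module; the whole_match helper and the list of checks are made up for this illustration.

```python
import re

def whole_match(pattern, text):
    """True when the entire text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

# (pattern, text, expected result)
checks = [
    (r"ha*t", "ht", True),                  # * allows zero 'a' characters
    (r"ha+t", "ht", False),                 # + needs at least one 'a'
    (r"ha?t", "hat", True),                 # ? allows zero or one 'a'
    (r"ha?t", "haat", False),               # two 'a' characters is too many
    (r"b[eau]d", "bud", True),              # one character from the set
    (r"lo{1,3}ve", "looove", True),         # between one and three 'o'
    (r"lo{2}ve", "love", False),            # exactly two 'o' required
    (r"th(is|at|ere|eir)", "there", True),  # | chooses between options
    (r"\d+", "1234567", True),              # one or more digits
    (r"\s+abc", "  abc", True),             # white space, then abc
    (r"a\.b", "a.b", True),                 # \. is a literal period
    (r"a\.b", "axb", False),                # so any other character fails
]

for pattern, text, expected in checks:
    result = whole_match(pattern, text)
    print(f"{pattern!r:24} {text!r:12} {result}")
    assert result == expected

# Anchors are easiest to see with re.search, which scans the whole text.
assert re.search(r"^z", "zebra") is not None   # 'z' at the start
assert re.search(r"^z", "xyz") is None         # 'z' is not at the start
assert re.search(r"[0-5]{3}$", "id 345") is not None
```

Changing a pattern or a test string and re-running the script is a quick way to check your own reading of a symbol.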


The Best Way To Learn Is To Practice!

Like any new skill, regex is easiest to learn by doing. Regular expressions can be tricky to write off the top of your head, so it is a good idea to refer to the symbol chart above as you go. Use these practice exercises to get a better understanding of regex, then check your work against the answers at the bottom to see how well you did! We've worked through a few examples first to help guide you through the practice exercises below. We strongly recommend writing out these practice exercises on paper to help absorb the material.


Examples:

1. You are given a scenario in which you need to write a regular expression that filters out everything except internet traffic URLs. How do we write a regex to find all URL traffic?


Answer: /http:|https:/

Now let's break this down. First, we can recognize the OR symbol right away by the vertical bar "|", so we know this expression finds strings that contain either "http:" or "https:". The colon ":" is not a special character in regex, so it matches itself and needs no backslash in front of it. Finally, the expression is enclosed in forward slashes, designating the beginning and end of the expression. Avoid putting spaces inside the slashes, because a space in a regex matches a literal space. You could also write this expression more compactly as /https?:/, where the "?" makes the "s" optional.
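To see the pattern from example 1 in action, here is a short Python sketch. Python's re module takes the pattern as a plain string, so the surrounding slash delimiters are dropped and the colon is written without a backslash. The sample lines are invented for this illustration.

```python
import re

# Pattern from example 1, without the surrounding / delimiters.
scheme = re.compile(r"http:|https:")

# Invented sample lines for illustration.
lines = [
    "GET https://example.com/index.html",
    "open file C:\\temp\\notes.txt",
    "redirect to http://example.org/login",
]

# Keep only the lines that mention a web URL.
url_lines = [line for line in lines if scheme.search(line)]
print(url_lines)

# The same match written more compactly: the ? makes the 's' optional.
compact = re.compile(r"https?:")
assert [line for line in lines if compact.search(line)] == url_lines
```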


2. You are given another scenario in which you need to find all instances of internet traffic in a .sys log file. How do you search the file for internet traffic with a regular expression and eliminate duplicate records?


Answer: strings file.sys | egrep -i "http:|https:" | sort | uniq

You would use this command to search a file with regex on a Linux machine. The strings command prints the readable text found inside file.sys, and the pipe "|" passes that text to egrep, which keeps only the lines that match the regex. As with grep, the "-i" option makes the search case-insensitive, so it finds both upper and lower case occurrences of http and https. We then use sort to order the output and uniq to remove duplicate records. uniq only removes duplicates that sit next to each other, which is why sort comes first. On newer systems, "grep -E" is the preferred spelling of egrep.
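For readers without a Linux shell at hand, the filtering, sorting, and de-duplicating steps of that pipeline can be approximated in Python. This is only a sketch: it does not reproduce what the strings command does to a binary file, the sample text is invented, and Python orders lines by character code, which may differ from the ordering that sort uses on a given system.

```python
import re

# Invented sample of text lines, standing in for the output of `strings`.
extracted = """\
http://example.org/a
some unrelated text
HTTPS://example.com/b
http://example.org/a
ftp://example.net/c
"""

# re.IGNORECASE plays the role of the -i option.
pattern = re.compile(r"http:|https:", re.IGNORECASE)

# Keep only the matching lines, as egrep would.
matching = [line for line in extracted.splitlines() if pattern.search(line)]

# sorted(set(...)) plays the role of `sort | uniq`:
# duplicates are removed and the remaining lines are put in order.
unique_sorted = sorted(set(matching))

for line in unique_sorted:
    print(line)
```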


3. Write a regular expression to find all paths (directories and files) that start with a drive letter, such as "c:\".


Answer: /^[a-z]{1}[:]{1}[\\]{1}/

We realize this regex example can be confusing at first glance, but let's break it down. The caret "^" anchors the match to the beginning of the string, so nothing may come before what follows. Next, the range "[a-z]" matches a single lowercase letter from "a" through "z"; here that is the drive letter. The number 1 enclosed in the curly brackets "{1}" says that the preceding item must appear exactly once. (It is optional, since once is the default.)


Next, "[:]{1}" matches exactly one colon and "[\\]{1}" matches exactly one backslash. So in plain English, this regular expression reads as: "Match any string that starts with one lowercase letter, followed by one colon, followed by one backslash." Whatever comes after the backslash is not checked. An example match would look like the following:

"c:\Home\Desktop\passwords.txt"

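The pattern in example 3 can be tried out in Python as follows. This sketch reads the pattern as "one lowercase letter, one colon, one backslash at the start of the string", with the spaces removed and the slash delimiters dropped; the sample paths are invented for illustration.

```python
import re

# Example 3's pattern with the spaces removed and no / delimiters.
drive_path = re.compile(r"^[a-z]{1}[:]{1}[\\]{1}")

# Equivalent, shorter form: {1} is the default, so it can be dropped.
drive_path_short = re.compile(r"^[a-z]:\\")

samples = [
    r"c:\Home\Desktop\passwords.txt",   # starts with a lowercase drive letter
    r"C:\Windows\System32",             # uppercase drive letter
    "Home/Desktop/passwords.txt",       # no drive letter at all
]

for path in samples:
    print(path, bool(drive_path.search(path)))
```

Only the first sample matches. To accept uppercase drive letters as well, the range could be widened to [A-Za-z].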

4. Write a regular expression to find IP addresses in a file.


Answer: /^([0-9]{1,3}[\.]{1}){3}[0-9]{1,3}$/

Let's break this one down piece by piece. First, the caret anchors the match to the beginning of the line, so nothing may come before the number range that follows. We want a digit from 0-9, and we want it repeated 1-3 times "{1,3}" (with no space after the comma, or most regex engines will not read it as a quantifier). Next we need a literal period, so we write a backslash followed by a period, repeated once "{1}". This sequence of characters is enclosed within parentheses ( ) and is told to repeat 3 times "{3}" to match the first three segments of an IP address. Then we have another digit range 0-9, and we tell it to repeat 1-3 times "{1,3}". This represents the fourth segment of the IP address. The dollar sign anchors the match to the end of the line. Note that this pattern checks only the shape of an address, not whether each segment falls in the valid range of 0-255. Running the regular expression could therefore return any of the following strings, including the last one, which is not a valid IPv4 address:


192.168.2.123

127.0.0.1

201.45.619.1
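The third result above shows the limit of the pattern: it accepts any four groups of one to three digits. A minimal Python sketch of one way to handle this is to use the regex for the shape and the standard ipaddress module for the range check. The function names are made up for this example.

```python
import ipaddress
import re

# Example 4's pattern with the spaces removed and no / delimiters.
ipv4_shape = re.compile(r"^([0-9]{1,3}[\.]{1}){3}[0-9]{1,3}$")

def looks_like_ipv4(text):
    """Shape check only: four dot-separated groups of 1 to 3 digits."""
    return ipv4_shape.search(text) is not None

def is_real_ipv4(text):
    """Full check: every segment must also be in the range 0-255."""
    try:
        ipaddress.IPv4Address(text)
        return True
    except ValueError:
        return False

for candidate in ["192.168.2.123", "127.0.0.1", "201.45.619.1"]:
    print(candidate, looks_like_ipv4(candidate), is_real_ipv4(candidate))
```

All three candidates pass the shape check, but only the first two pass the full check.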


Practice Exercises: Don't cheat! Try to write these on your own at first.


  1. Write a regular expression to match all of the following words: "bat", "cat", "rat", and "sat".

  2. Write a regular expression to match all of the following words: "Nice" and "Venice".

  3. Write a regular expression that uses repetition counts to match all of the following strings: "cccaaattt", "ccaatt", and "cat".

  4. Write a regular expression to match the following dates: "Jan 2002" and "Oct 2022".

  5. Write a regular expression to match these phone numbers: "407-678-9972" and "(321)-685-3421".

  6. Write a regular expression to match the following email addresses: "peter.parker@gmail.com", "bruce.wayne@yahoo.com", and "tony.stark@hotmail.com".

  7. Write a regular expression to match these links: "=https://regex.com" and "=http://cyber.com".

  8. Write a regular expression to match these file types: "image.jpg", "page.txt", and "word.doc".



Answers
  1. /[bcrs]at/

  2. /(N|Ven)ice/

  3. /c{1,3}a{1,3}t{1,3}/

  4. /(\w+)\s(\d+)/

  5. /1?[\s-]?\(?(\d{3})\)?[\s-]?\d{3}[\s-]?\d{4}/

  6. /^([\w.]+)@([\w.]+)$/

  7. /=([\w:\/.]*)/

  8. /(\w+)\.(jpg|txt|doc)$/
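If you would like to check your own answers, one approach is to run each pattern against the strings from its exercise. The Python sketch below does that for one possible set of solutions, written without slash delimiters and without spaces; other patterns can be equally correct, and the failures helper is a name made up for this example.

```python
import re

# Candidate pattern -> the strings it is expected to match.
exercises = {
    r"[bcrs]at": ["bat", "cat", "rat", "sat"],
    r"(N|Ven)ice": ["Nice", "Venice"],
    r"c{1,3}a{1,3}t{1,3}": ["cccaaattt", "ccaatt", "cat"],
    r"(\w+)\s(\d+)": ["Jan 2002", "Oct 2022"],
    r"1?[\s-]?\(?(\d{3})\)?[\s-]?\d{3}[\s-]?\d{4}": [
        "407-678-9972",
        "(321)-685-3421",
    ],
    r"([\w.]+)@([\w.]+)": [
        "peter.parker@gmail.com",
        "bruce.wayne@yahoo.com",
        "tony.stark@hotmail.com",
    ],
    r"=([\w:/.]*)": ["=https://regex.com", "=http://cyber.com"],
    r"(\w+)\.(jpg|txt|doc)": ["image.jpg", "page.txt", "word.doc"],
}

def failures(table):
    """Return (pattern, text) pairs where the whole text does not match."""
    return [
        (pattern, text)
        for pattern, texts in table.items()
        for text in texts
        if re.fullmatch(pattern, text) is None
    ]

print(failures(exercises))  # an empty list means every string matched
```

Because re.fullmatch already requires the whole string to match, the "^" and "$" anchors are left out of the patterns here. Swap in your own patterns to see which strings they miss.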

Additional Resources

There are a lot of great and free materials on mastering regex on the Internet. Check out this list of blog posts and websites to expand your regex knowledge. Good luck practicing regex!
