Regular expressions in Google Analytics
You know those online resources that you’ve been using for years until, all of a sudden, they’re nowhere to be found? That was my experience with the more-than-handy PDF Regular expressions for Google Analytics, written by Robbin Steif at LunaMetrics: a regular expression tutorial that walks you through each of the RegExp characters you can use in Google Analytics.
Unfortunately, LunaMetrics has since joined forces with Bounteous, and apparently there was no more server space left for this PDF. However, since the file is published under the Creative Commons Attribution-Noncommercial-Share Alike 3.0 license, it’s OK to share it.
So to make sure this excellent explanation of regular expressions doesn’t get lost, I’ve decided to put it online here. If you prefer reading online, below is a succinct summary.
First things first. What are regular expressions?
A regular expression is a sequence of characters or symbols describing a pattern that you can search or filter on. In this case, you use them in the Google Analytics UI by selecting Matching RegExp in Advanced search (or in Goals or segments). Besides RegExp, RegEx is also used as an abbreviation for regular expressions. Regular expressions in Google Analytics are not case sensitive.
Regular expressions are about power matching. If you need to create a goal that matches multiple thank-you pages – that is power matching. If you need to write a filter that matches multiple URLs, but only know what a piece of each URL looks like – again, that is power matching.
Regular expressions are greedy by nature: unless you tell them otherwise, they match any string that contains what you specify, whatever characters come before or after it.
Regular expressions I use when I’m including or excluding my name:
- Why would they match? Not a single one of those phrases includes the entire name.
- How can those possibly be regular expressions? There are no RegEx characters!
- They match because regular expressions in GA will match and match until they aren’t allowed to any more. That’s why Thijs matches the target string, Thijs van Noort – if it matches any part of the string, it will match the whole string.
- And the characters? You don’t need to have those characters just to have a regular expression, and having the characters doesn’t necessarily make it a RegEx pattern. All you need to do is put the expression into a field that is sensitive to RegExp. For example, when you write a Google Analytics goal, you get to choose “head match,” “exact match” or “Regular Expression.” As soon as you choose “Regular Expression,” the field becomes sensitive to RegExp, and all the rules of RegExp apply.
You often need to use little RegExp characters – but not always.
In that sense, you could write an expression like www.google.(com|nl|be|fr|de|it|es|co.(au|uk|il)) to filter for as many countries as possible where you could find Google. Or you could just write: google
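To see this partial matching at work outside of Google Analytics, here is a minimal sketch in Python. It assumes that Python's re.search with the IGNORECASE flag is a reasonable stand-in for a RegExp-sensitive field in GA (partial, case-insensitive matching); the helper name ga_matches and the list of hostnames are made up for illustration.

```python
import re

def ga_matches(pattern: str, value: str) -> bool:
    """Hypothetical stand-in for a GA 'Matching RegExp' field:
    a partial, case-insensitive match anywhere in the value."""
    return re.search(pattern, value, re.IGNORECASE) is not None

name = "Thijs van Noort"
print(ga_matches("Thijs", name))    # True: part of the string is enough
print(ga_matches("noort", name))    # True: case is ignored
print(ga_matches("^Thijs$", name))  # False: anchors forbid extra characters

# Illustrative hostnames, not taken from real reports
hosts = ["www.google.nl", "www.google.co.uk", "www.google.com"]
long_pattern = r"www.google.(com|nl|be|fr|de|it|es|co.(au|uk|il))"
print(all(ga_matches(long_pattern, h) for h in hosts))  # True
print(all(ga_matches("google", h) for h in hosts))      # True
```

Both the long pattern and the plain word google match all three hostnames, which is the point: without special characters you can still have a perfectly valid expression.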
Why use regular expressions?
As there are plenty of advanced ways to filter data in Google Analytics, why would you use regular expressions? Well, first of all, because segments only allow data to be filtered on a session or user level, while for SEO you regularly want to validate the results of optimizations on a page level. This is even more the case for the Search Console reports in GA, since Search Console data is incompatible with Google Analytics segments.
Regular expression character #1 Backslash
The best Regular expression character to start with is the backslash. A backslash is different from all the other characters, as you will see. It provides a bridge between regular expressions and plain text.
A backslash escapes a character. What does escape mean? It means that it turns a regular expression character into plain text.
You use it when you want to filter, for example, for a URL that contains a parameter, like example.com/?id=123. Without any changes to this string, the question mark would be read as a regular expression character. But you want it to be read as part of the string. To achieve this, you put a backslash in front of it, like so: example.com/\?id=123. This tells Google Analytics to read the question mark as text and not as a regular expression character.
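A small Python sketch makes the difference visible. It uses Python's re module as an approximation of how GA reads the pattern; the second URL (without the question mark) is an assumption added purely for contrast.

```python
import re

urls = [
    "example.com/?id=123",  # the URL from the text
    "example.com/id=123",   # hypothetical URL without the question mark
]

unescaped = r"example.com/?id=123"   # "?" makes the slash optional
escaped = r"example.com/\?id=123"    # "\?" is a literal question mark

unescaped_hits = [u for u in urls if re.search(unescaped, u)]
escaped_hits = [u for u in urls if re.search(escaped, u)]

print(unescaped_hits)  # ['example.com/id=123']
print(escaped_hits)    # ['example.com/?id=123']
```

Without the backslash, the pattern does not even match the URL it was copied from, because the question mark is treated as "the slash is optional" instead of as a character in the URL.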
Regular expression character #2 Pipe
The pipe is the simplest of regular expressions, and it is the one that all Regular People (that’s you and me) should learn. It means or .
Here’s an example: let’s say you have two thank-you pages, and you need to roll them up into one goal. The first one is named thanks, and the second is named confirmation. You could create your goal like this: confirmation|thanks
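As a quick check of the pipe, the sketch below runs that goal pattern over a few page paths in Python. The paths are invented for illustration, and Python's re.search is assumed to behave like a RegExp goal field here.

```python
import re

# The goal pattern from the text: "confirmation" OR "thanks"
goal = re.compile(r"confirmation|thanks", re.IGNORECASE)

# Hypothetical page paths
pages = ["/checkout/thanks", "/signup/confirmation", "/checkout/cart"]

goal_pages = [p for p in pages if goal.search(p)]
print(goal_pages)  # ['/checkout/thanks', '/signup/confirmation']
```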
Regular expression character #3 Question mark
A question mark means that the last item (which, for now, we’ll assume is the last character) is optional.
Let’s say I’d want to find all users who searched for Thijs, but am aware that a common typo is writing it without an h. If so, the RegEx Th?ijs would get me both Thijs and Tijs (the h being optional as a result of the question mark).
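The optional h can be tried out with a few search terms in Python. The terms below are made up, and re.IGNORECASE is an assumption used to mimic GA's case-insensitive matching.

```python
import re

pattern = re.compile(r"Th?ijs", re.IGNORECASE)

# Hypothetical search terms
terms = ["Thijs", "Tijs", "thijs van noort", "Thhijs", "Tys"]

matched = [t for t in terms if pattern.search(t)]
print(matched)  # ['Thijs', 'Tijs', 'thijs van noort']
```

Note that Thhijs is not matched: the question mark allows zero or one h, not two.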
Regular expression character #4 Parentheses
Parentheses in regular expressions work the same way that they do in mathematics.
For example, say you want to set a goal in Google Analytics that captures two URLs that differ only in the folder name, such as /folderone/thanks and /foldertwo/thanks.
A great way to represent this in RegEx would be: /folder(one|two)/thanks
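To show what the parentheses actually buy you, this Python sketch compares the grouped pattern with the same pattern without parentheses. The four paths are assumptions for illustration, derived from the pattern itself.

```python
import re

# Hypothetical page paths
paths = [
    "/folderone/thanks",
    "/foldertwo/thanks",
    "/folderthree/thanks",
    "/folderone/cart",
]

grouped = r"/folder(one|two)/thanks"  # the pipe only applies to one|two
ungrouped = r"/folderone|two/thanks"  # the pipe splits the whole pattern

grouped_hits = [p for p in paths if re.search(grouped, p)]
ungrouped_hits = [p for p in paths if re.search(ungrouped, p)]

print(grouped_hits)    # ['/folderone/thanks', '/foldertwo/thanks']
print(ungrouped_hits)  # ['/folderone/thanks', '/foldertwo/thanks', '/folderone/cart']
```

Without the parentheses the pattern reads as "/folderone" or "two/thanks", so the cart page in folder one sneaks into the goal.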
Regular expression character #5 Square brackets & dashes
With square brackets, you can make a simple list, like this: [aiu] . So p[aiu]n will match pan, pin and pun.
You can also add a dash to create a list of items, like this:
- [a-z] – all lower-case letters in the English alphabet
- [A-Z] – all upper-case letters in the English alphabet
- [a-zA-Z0-9] – all lower-case and upper-case letters, and digits.
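Here is a short Python sketch of both uses of square brackets: a simple list and a range with a dash. The sku- identifiers are hypothetical, and the matching is case sensitive here because no flag is passed, unlike in GA.

```python
import re

# Simple list: exactly one of a, i or u between p and n
words = ["pan", "pin", "pun", "pen", "pn"]
vowel_hits = [w for w in words if re.search(r"p[aiu]n", w)]
print(vowel_hits)  # ['pan', 'pin', 'pun']

# Range with a dash: any single digit after "sku-" (made-up identifiers)
ids = ["sku-7", "sku-x", "sku-42"]
digit_hits = [i for i in ids if re.search(r"sku-[0-9]", i)]
print(digit_hits)  # ['sku-7', 'sku-42']
```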
Regular expression character #6 Dot
A dot matches any one character.
The regular expression .ite will match site, lite, bite and kite.
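A quick Python check of the dot, using re.search as a stand-in for GA's partial matching. The words ite and invite are added as assumptions to show the edges of the rule.

```python
import re

words = ["site", "lite", "bite", "kite", "ite", "invite"]

dot_hits = [w for w in words if re.search(r".ite", w)]
print(dot_hits)  # ['site', 'lite', 'bite', 'kite', 'invite']
```

The bare word ite is left out because the dot needs exactly one character in front of it, while invite is included because the pattern is found inside it (vite).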
Regular expression character #7 Plus sign
A plus sign matches one or more of the previous item, which, as usual, we’ll assume is the previous character.
The regular expression aa+rgh will match aargh and aaargh and aaaaaaaaargh … well, you understand.
Regular expression character #8 Star
A star and a plus sign are quite similar in regular expressions, with one big difference: unlike the plus sign, the star will match zero or more of the previous item. Using the same string and replacing the plus sign with a star matches something different.
The regular expression aa*rgh will match aargh and aaargh and aaaaaaaaargh, just like the plus sign. But it will also match argh, so without any repetition of the previous character.
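The difference between the plus sign and the star is easiest to see side by side. A minimal Python sketch, assuming re.search approximates GA's matching:

```python
import re

words = ["argh", "aargh", "aaargh", "aaaaaaaaargh"]

# "a" followed by one or more "a", then "rgh"
plus_hits = [w for w in words if re.search(r"aa+rgh", w)]
# "a" followed by zero or more "a", then "rgh"
star_hits = [w for w in words if re.search(r"aa*rgh", w)]

print(plus_hits)  # ['aargh', 'aaargh', 'aaaaaaaaargh']
print(star_hits)  # ['argh', 'aargh', 'aaargh', 'aaaaaaaaargh']
```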
Combining #6 (the dot) and #8 (the star)
There are two regular expression characters that, when put together, mean “get everything.” Let’s say I wanted to find all the articles in my analytics folders, both the Dutch /analytics/ and English /en/web-analytics/, with Google in the title (and thereby in the URL). What the two folders have in common is that the folder name ends with analytics/. Next, I’d need a way to include google after this, regardless of whether the title had anything before or after the word google.
The RegEx for this could be analytics/.*google.*
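Below is a sketch of the dot and star combination in Python, assuming the pattern is analytics/.*google.* (dot-star on both sides of google). The article paths are invented for illustration and are not my actual URLs.

```python
import re

pattern = re.compile(r"analytics/.*google.*", re.IGNORECASE)

# Hypothetical article paths
paths = [
    "/analytics/google-tag-manager/",
    "/en/web-analytics/regex-in-google-analytics/",
    "/en/web-analytics/bounce-rate/",
    "/seo/google-search-console/",
]

google_articles = [p for p in paths if pattern.search(p)]
print(google_articles)
# ['/analytics/google-tag-manager/', '/en/web-analytics/regex-in-google-analytics/']
```

Because matching is partial anyway, the trailing .* does not change which paths match; the .* in the middle is what bridges the gap between the folder and the word google.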
Regular expression character #9 Caret
When you use a caret in your regular expressions, you force the expression to match only strings that start exactly the same way your RegEx does.
So with my URL structure being https://www.thijsvannoort.nl/en/web-analytics/, let’s say I want to find all the sessions that started in my web analytics folder. I could try to find these with the following RegEx: ^/web-analytics/.* but I wouldn’t get any results.
The reason is that Google Analytics starts its URLs straight after the root domain, i.e. after https://www.thijsvannoort.nl. So in order for it to find all the sessions that started in the web analytics folder, I’d have to include the /en/ part in the RegEx: ^/en/web-analytics/.*
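The sketch below mimics this in Python. It assumes that the page path GA reports corresponds to the path component returned by urllib.parse.urlparse, which is an approximation rather than a description of how GA works internally.

```python
import re
from urllib.parse import urlparse

url = "https://www.thijsvannoort.nl/en/web-analytics/"
page_path = urlparse(url).path  # what is left after the root domain
print(page_path)  # /en/web-analytics/

wrong = re.search(r"^/web-analytics/.*", page_path) is not None
right = re.search(r"^/en/web-analytics/.*", page_path) is not None
unanchored = re.search(r"/web-analytics/", page_path) is not None

print(wrong)       # False: the path does not start with /web-analytics/
print(right)       # True
print(unanchored)  # True: without the caret, a match anywhere is enough
```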
Regular expression character #10 Dollar sign
A dollar sign means don’t match if the target string has any characters beyond where I have placed the dollar sign in my Regular Expression. In short, it says “only include strings that end here”. Taking the same folder as above, /en/web-analytics/, let’s say I only wanted to find out which sessions started on that category page and not the underlying pages. In that case I’d use the RegEx: ^/en/web-analytics/$
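To round off, a Python sketch of the caret and dollar sign working together. The landing pages other than /en/web-analytics/ are hypothetical.

```python
import re

category_only = re.compile(r"^/en/web-analytics/$")

# Hypothetical landing pages
landing_pages = [
    "/en/web-analytics/",
    "/en/web-analytics/regular-expressions/",
    "/en/web-analytics",
]

exact_hits = [p for p in landing_pages if category_only.search(p)]
print(exact_hits)  # ['/en/web-analytics/']
```

The underlying page is excluded because it continues past the dollar sign, and the path without the trailing slash is excluded because the pattern requires that slash before the end.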
So those are the individual characters. Now obviously the trick becomes combining them to filter your data as narrowly or broadly as suits your purpose.
Got some cool RegEx examples to share with others? Let me know in the comments!