Upgrade regex experience in Python
by using a library called regex instead of the standard re
Introduction
My cycle of working with regular expressions is as follows: I dread them, work with them for a project and become an expert, don’t use them for the next 5 months and forget everything I learnt.
As I’ve done with Vim and Git (no article on this one yet), I want to learn a minimal set of functionalities that I can always come back to.
In this article, I will introduce a library and some techniques I learnt that can make working with regular expressions easier.
Standard library limitations
Say you have a line that starts with seeds: followed by a handful of numbers, sitting inside a bigger string, and you want to parse out all the numbers.
You can build your regular expression iteratively: start by matching just seeds: and, once you have verified that works, extend the pattern to match the digits that follow.
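As a minimal sketch of that iterative approach, using only the standard re module: the sample text below is a made-up stand-in, not the original input.

```python
import re

# Hypothetical sample input, used only for illustration.
text = "intro line\nseeds: 79 14 55 13\nmore text"

# Step 1: confirm the literal prefix matches at all.
step1 = re.search(r"seeds:", text)
print(step1.group(0))

# Step 2: extend the pattern to also match the first run of digits.
step2 = re.search(r"seeds: +\d+", text)
print(step2.group(0))
```

Growing the pattern one piece at a time makes it obvious which addition broke the match.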
In regular expressions, parentheses () define what are known as capture groups. They allow you to capture multiple subcomponents of your match.
For example, say I have an email address of the form firstname.lastname@company.com and I want to extract the first name, last name and company name from my match. I can start with a simple regular expression:
pattern = r"\w+\.\w+@\w+\.\w+"
That is a word, followed by a dot (.), followed by a word, followed by @, and so on. I then put () around the parts I want to extract from the match.
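Here is a small sketch of the same pattern with capture groups added. The address is a hypothetical example, and the variable names are mine.

```python
import re

# Hypothetical input string, used only for illustration.
text = "Contact: jane.doe@example.com"

# Same pattern as above, with () around the three parts to extract.
pattern = r"(\w+)\.(\w+)@(\w+)\.\w+"

match = re.search(pattern, text)
print(match.group(0))   # the full match
print(match.groups())   # one entry per capture group

first, last, company = match.groups()
```

The trailing \w+ (the top-level domain) is left outside any group because it is not one of the three pieces being extracted.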
This works when I can specify all my capture groups separately, but breaks when I want to repeat them.
Going back to our seeds example, if I specify the 4 capture groups separately, I get the result I expect. However, if I try to repeat a single group with ( +\d+)+ (one or more spaces followed by one or more digits, with the whole group repeated), I only get the last matching group.
You can see the same behavior whether I use re.match or re.search.
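The limitation can be reproduced with a short sketch. The sample line is hypothetical; the calls are all from the standard re module.

```python
import re

# Hypothetical sample line with four numbers.
text = "seeds: 79 14 55 13"

# Four separate groups: every number is captured.
separate = re.search(r"seeds:( +\d+)( +\d+)( +\d+)( +\d+)", text)
print(separate.groups())

# One repeated group: the full match is right, but only the last
# repetition survives in group 1.
repeated = re.search(r"seeds:( +\d+)+", text)
print(repeated.group(0))
print(repeated.groups())

# re.match and re.findall show the same last-capture-only behavior.
print(re.match(r"seeds:( +\d+)+", text).groups())
print(re.findall(r"seeds:( +\d+)+", text))
```

Each repetition overwrites what the group captured previously, which is why only the final number is left.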
Using regex
To fix this behavior, we can use another library called regex. From the README:
This regex implementation is backwards-compatible with the standard ‘re’ module, but offers additional functionality.
So if we just swap the standard re import for regex and try the same regular expression again, we get the output we expect.
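A minimal sketch of what this can look like, assuming the third-party regex package is installed (pip install regex) and assuming that the relevant call is the match object's captures method, which returns every capture of a repeated group. The sample line is hypothetical.

```python
import regex  # third-party package, not the standard library

# Hypothetical sample line with four numbers.
text = "seeds: 79 14 55 13"

m = regex.search(r"seeds:( +\d+)+", text)

# group(1) still behaves like re: only the last repetition.
print(m.group(1))

# captures(1) keeps every repetition of group 1.
print(m.captures(1))

# int() tolerates the leading space in each capture.
numbers = [int(c) for c in m.captures(1)]
print(numbers)
```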
Notice that in both re and regex, index 0 is reserved for the full match, and then index 1 onwards for the individual groups.
You can also index into multiple groups at once, and you can use .spans to get the offsets of each capture.
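As a sketch, again assuming the regex package is installed and using a hypothetical sample line: captures accepts several group indices in one call, and spans returns the (start, end) offsets of every capture.

```python
import regex  # third-party package

# Hypothetical sample line with four numbers.
text = "seeds: 79 14 55 13"
m = regex.search(r"seeds:( +\d+)+", text)

# Ask for group 0 (full match) and group 1 in a single call.
both = m.captures(0, 1)
print(both)

# Offsets of every capture of group 1 within the text.
offsets = m.spans(1)
print(offsets)

# Slicing the text with those offsets reproduces the captures.
sliced = [text[start:end] for start, end in offsets]
print(sliced)
```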
Finally, you can also name the groups for readability and better indexing.
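A small sketch of named groups, assuming the regex package is installed. The group name num and the sample line are my own choices for illustration. Here the named group wraps only the digits, so the captures come back without the leading space.

```python
import regex  # third-party package

# Hypothetical sample line with four numbers.
text = "seeds: 79 14 55 13"

# The outer (?: ...) is non-capturing and repeated; the inner named
# group "num" captures just the digits on each repetition.
m = regex.search(r"seeds:(?: +(?P<num>\d+))+", text)

by_name = m.captures("num")
print(by_name)

# capturesdict() maps each group name to the list of its captures.
as_dict = m.capturesdict()
print(as_dict)
```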
There are many more functions available on the match returned by search, but these should be enough for most of your tasks.
What other features does regex provide?
If you’ve worked with PDFs, especially scanned ones, you will have seen a lot of OCR errors. Things like typos or letters turning into numbers make it difficult to run RAG pipelines on them.
In such cases, it’s helpful for the retrieval step to use fuzzy matching. regex gives you a lot of control with fuzzy matching.
In this example we want to match the word “lease” but allow for one or a few deletions. Similarly, you can allow for insertions, substitutions, or errors of any kind.
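A minimal sketch of the deletion case, assuming the regex package is installed. The test strings are made up, and I use fullmatch so that the whole string has to fit the fuzzy pattern.

```python
import regex  # third-party package

# {d<=1} allows at most one deletion, i.e. one character of the
# pattern may be missing from the text.
pattern = r"(?:lease){d<=1}"

exact = regex.fullmatch(pattern, "lease")       # no errors needed
one_missing = regex.fullmatch(pattern, "lese")  # "a" is missing
two_missing = regex.fullmatch(pattern, "lse")   # two characters missing

print(exact is not None, one_missing is not None, two_missing is not None)

# The other error types use the same syntax:
#   {i<=1} insertions, {s<=1} substitutions, {e<=1} errors of any kind.
```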
Here’s another example produced by Claude:
With match.fuzzy_counts you can access the total number of substitutions, insertions and deletions.
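As a sketch of reading fuzzy_counts, assuming the regex package is installed: the pattern below allows a single error of any kind, and the three test strings are made-up OCR-style variants of “lease”. With only one error allowed and the whole string matched, the length of each string determines which kind of error it must be.

```python
import regex  # third-party package

# {e<=1} allows at most one error of any kind.
pattern = r"(?:lease){e<=1}"

# fuzzy_counts is a tuple: (substitutions, insertions, deletions).
substituted = regex.fullmatch(pattern, "1ease").fuzzy_counts
inserted = regex.fullmatch(pattern, "leease").fuzzy_counts
deleted = regex.fullmatch(pattern, "lese").fuzzy_counts

print(substituted)
print(inserted)
print(deleted)
```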
You can also restrict which characters are allowed during fuzzy matches.
In this example, we allow for 1 substitution, but only by a character in [A-Z]. Hence, when a letter is replaced with a digit, the text no longer matches.
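A sketch of a restricted substitution, assuming the regex package is installed. The pattern and test strings are my own; the {s<=1:[A-Z]} form follows the constraint syntax described in the regex README, where the substituted character must belong to the given set.

```python
import regex  # third-party package

# At most one substitution, and the substituted character in the
# text must be an uppercase letter.
pattern = r"(?:lease){s<=1:[A-Z]}"

upper = regex.fullmatch(pattern, "Lease")   # "l" replaced by "L"
digit = regex.fullmatch(pattern, "1ease")   # "l" replaced by "1"

print(upper is not None)
print(digit is not None)
```

The match is case-sensitive here, so “Lease” only matches because the uppercase “L” counts as an allowed substitution.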
Conclusion
regex has many more features, like recursive patterns, named lists, and timeouts, that you can explore by copy-pasting the README into an LLM.
For my use cases, the ones listed above seem to be enough. If I missed anything, let me know in the comments.
References
I’ve been solving Advent of Code problems as part of a new course by Jeremy Howard called `How To Solve It With Code` and I came across the library as part of it. The course content is not public yet.
~ happy learning
