Index

joshstock.in / 76ff7a3

Source for serving and static templating/compiling of https://joshstock.in.

Latest Commit

{#}TimeHashSubjectAuthor#(+)(-)GPG?
18617 Mar 2023 18:0076ff7a3Update Python regex homoglyphs articleJosh Stockin110269G

Blob @ joshstock.in / site / content / blog / python-regex-homoglyphs.md

text/plain12659 bytesdownload raw
1---
2type: article
3identifier: python-regex-homoglyphs
4title: Patching Python's regex AST for confusable homoglyphs
5description: Exploiting the re module's parsing and compiling methods to check for "confusable homoglyphs" and create a better automoderator.
6datestring: 2023-03-17
7banner_image: /static/images/futaba-filter.jpg
8links:
9 futaba: https://github.com/strinking/futaba
10 confusable_homoglyphs: https://pypi.org/project/confusable_homoglyphs/
11 Python re module: https://docs.python.org/3/library/re.html
12 Scunthorpe problem: https://www.wikiwand.com/en/Scunthorpe_problem
13 Abstract Syntax Tree (AST): https://www.wikiwand.com/en/Abstract_syntax_tree
14---
15
16A couple years ago I contributed to an open source Discord
17[bot](https://github.com/strinking/futaba) designed by a public computer
18programming community I’m part of. As with most popular Discord bots, it
19incorporates its own filter to prohibit unwanted language. However, to ensure
20coverage of messages like `as𝕕f` (notice non-ASCII `𝕕`) in case it’s told to
21filter `asdf` (an ASCII-only string), the bot makes use of the
22[confusable_homoglyphs](https://pypi.org/project/confusable_homoglyphs/) Python
23package to automatically expand an inputted filter string to cover these
24non-ASCII edge cases.
25
26Originally, the bot used Python's native string manipulation to convert the
27inputted filter string into a Python [regular
28expression](https://docs.python.org/3/library/re.html) string that matches for
29each original character as well as each character's confusable homoglyphs
30(similar-looking non-ASCII characters). For example, the inputted filter string
31`asdf` was converted by the bot into the following regular expression:
32
33```bash
34[a⍺a𝐚𝑎𝒂𝒶𝓪𝔞𝕒𝖆𝖺𝗮𝘢𝙖𝚊ɑα𝛂𝛼𝜶𝝰𝞪а][ss𝐬𝑠𝒔𝓈𝓼𝔰𝕤𝖘𝗌𝘀𝘴𝙨𝚜ꜱƽѕꮪ𑣁𐑈][dⅾⅆ𝐝𝑑𝒅𝒹𝓭𝔡𝕕𝖉𝖽𝗱𝘥𝙙𝚍ԁᏧᑯꓒ][f𝐟𝑓𝒇𝒻𝓯𝔣𝕗𝖋𝖿𝗳𝘧𝙛𝚏ꬵꞙſẝք]
35```
36
37This regular expression uses character sets to match one instance each of the
38original character or its confusable homoglyphs. In other words, it has the
39ability to catch any recognizable rendition of a word using uncommon non-ASCII
40characters. This was really nice, because it meant the bot could catch most
41edge cases automatically. Unfortunately, however...
42
43## The problem
44
45Using the method of regex generation described above precludes the use of
46arbitrary regex patterns, which would help in looking at context to prevent
47[false positives](https://www.wikiwand.com/en/Scunthorpe_problem). When
48expanding filter strings, the bot could not differentiate between literal
49characters and special regex tokens. So, instead of the valid regular
50expression string `^asdf(.*)$` (which matches "asdf" followed by anything else,
51only if "asdf" is at the beginning) being converted into the following,
52preserving special regex tokens as desired...
53
54```bash
55^[a⍺a𝐚𝑎𝒂𝒶𝓪𝔞𝕒𝖆𝖺𝗮𝘢𝙖𝚊ɑα𝛂𝛼𝜶𝝰𝞪а][ss𝐬𝑠𝒔𝓈𝓼𝔰𝕤𝖘𝗌𝘀𝘴𝙨𝚜ꜱƽѕꮪ𑣁𐑈][dⅾⅆ𝐝𝑑𝒅𝒹𝓭𝔡𝕕𝖉𝖽𝗱𝘥𝙙𝚍ԁᏧᑯꓒ][f𝐟𝑓𝒇𝒻𝓯𝔣𝕗𝖋𝖿𝗳𝘧𝙛𝚏ꬵꞙſẝք](.*)$
56```
57
58...it would convert into this, mangling the original intended regex string,
59making it useless for arbitrary matching:
60
61```bash
62[\^˄ˆ][a⍺a𝐚𝑎𝒂𝒶𝓪𝔞𝕒𝖆𝖺𝗮𝘢𝙖𝚊ɑα𝛂𝛼𝜶𝝰𝞪а][ss𝐬𝑠𝒔𝓈𝓼𝔰𝕤𝖘𝗌𝘀𝘴𝙨𝚜ꜱƽѕꮪ𑣁𐑈][dⅾⅆ𝐝𝑑𝒅𝒹𝓭𝔡𝕕𝖉𝖽𝗱𝘥𝙙𝚍ԁᏧᑯꓒ][f𝐟𝑓𝒇𝒻𝓯𝔣𝕗𝖋𝖿𝗳𝘧𝙛𝚏ꬵꞙſẝք][\([❨❲〔﴾][\.𝅭․‎܁‎‎܂‎꘎‎𐩐‎‎٠‎۰ꓸ][\*⁎‎٭‎∗𐌟][\)]❩❳〕﴿]$
63```
64<figcaption>Interestingly, the confusable_homoglyphs package doesn't list any
65special characters that look similar to <code>$</code>.</figcaption>
66
67## The solution
68
69To support expansion for confusables while preserving arbitrary regex, we need
70to generate an **[abstract syntax
71tree](https://www.wikiwand.com/en/Abstract_syntax_tree) (AST)** for the regular
72expression and manipulate that somehow. For structured languages, an AST
73represents the layout of significant tokens based on a predefined grammar or
74syntax. For example, regex parsers have to correctly interpret `[` and `]` as
75special characters defining a set of characters within—unless they're escaped,
76like `\[` and `\]`, in which case they'll be taken as "literal" bracket
77characters.
78
79After generating an AST for the regular expression, we then must modify it to
80replace predefined character literals (e.g. `a`) with sets of their confusable
81homoglyphs (e.g. `[a⍺a𝐚𝑎𝒂𝒶𝓪𝔞𝕒𝖆𝖺𝗮𝘢𝙖𝚊ɑα𝛂𝛼𝜶𝝰𝞪а]`) before compiling the AST into a
82usable pattern-matching object. While this process appears similar to the
83string manipulation method described previously, parsing first for an AST
84provides a mutable structure that allows us to distinguish character literals
85from special regex tokens/grammar. Fortunately, Python’s `re` module already
86contains (and exposes!) the internal tools for parsing and compiling—we just
87need to modify the process.
88
89**<i>The AST is generated from the inputted filter string, and the usable
90pattern matching object is compiled from the AST. Since these steps are
91separate in Python's `re` pipeline and since the AST is a mutable object, the
92AST object can be separately modified before being compiled.</i>**
93
94# Manipulating Python’s regex AST
95
96This required a bit of reverse engineering on my part as I couldn't find any
97adequate documentation on the internals of Python’s `re` module. After some
98brief digging in the CPython source repository, I found **two submodules of
99`re` that handle regex string parsing and compilation: `re.sre_parse` and
100`re.sre_compile`**, respectively.
101
102## Reverse engineering
103
104The `re.compile()` function uses the `sre_parse` and `sre_compile` submodules
105by (effectively) doing the following to return a `re.Pattern` object:
106
107```python
108ast = re.sre_parse.parse( input_regex_string ) # -> re.sre_parse.SubPattern
109pattern_object = re.sre_compile.compile( ast ) # -> re.Pattern
110return pattern_object
111```
112
113Knowing this, we can experiment with the output of `sre_parse.parse()` function
114to determine `re`'s AST structure and figure out how we need to modify it.
115
116```python
117>>> import re
118
119>>> re.sre_parse.parse("asdf")
120[(LITERAL, 97), (LITERAL, 115), (LITERAL, 100), (LITERAL, 102)]
121
122>>> type(re.sre_parse.parse("asdf"))
123sre_parse.SubPattern
124
125>>> re.sre_parse.parse("[asdf]")
126[(IN, [(LITERAL, 97), (LITERAL, 115), (LITERAL, 100), (LITERAL, 102)])]
127
128>>> re.sre_parse.parse("(asdf)") # To see how `re` handles nested tokens
129[(SUBPATTERN, (1, 0, 0, [(LITERAL, 97), (LITERAL, 115), (LITERAL, 100), (LITERAL, 102)]))]
130```
131
132From this, we know the `sre_parse.parse()` function returns a `SubPattern`
133list-like object containing *tuples of tokens in the format `(TOKEN_NAME,
134TOKEN_VALUE)`.* Also, given the name `SubPattern`, it's likely this is nested
135for other types of regex grammar (maybe for lookarounds or matching groups?).
136
137## Modifying the AST (implementing the solution)
138
139For our case, we’re looking to replace tokens identified as `LITERAL`s:
140
141> `(LITERAL, ord)`, representing literal characters like `a`
142
143with character match sets—`IN` tokens—wrapping more `LITERAL` tokens:
144
145> `(IN, [ (LITERAL, ord) ... ])`, representing sets like `[a⍺a𝐚𝑎𝒂𝒶𝓪𝔞𝕒𝖆𝖺𝗮𝘢𝙖𝚊ɑα𝛂𝛼𝜶𝝰𝞪а]`
146
147Because abstract syntax trees are, well, trees, this needs to be done
148recursively to account for nested tokens, such as those within matching groups.
149What's important to note here is that regex does not allow nested character
150sets. So, if the input string uses sets natively, and we want to expand the
151characters in that set to cover confusable homoglyphs, we need to make sure we
152aren't creating a new set within the original set. You can see how I
153accomplished this below.
154
155This is the solution I came up with:
156
157```python
158from collections.abc import Iterable
159import re
160
161from confusable_homoglyphs import confusables
162
163def patched_regex(regex_string: str) -> re.Pattern:
164 """
165 Generate a regex pattern object after replacing literals with sets of
166 confusable homoglyphs
167 """
168
169 # Generate AST from base input string
170 ast_root = re.sre_parse.parse(regex_string)
171
172 # Generate list of confusable homoglyphs LITERALs based on input
173 # character, including input character
174 def generate_confusables(confusable: chr) -> list:
175 groups = confusables.is_confusable(confusable, greedy=True)
176
177 # Begin the homoglyph set with the original character
178 confusable_literals = [(re.sre_parse.LITERAL, ord(confusable))]
179
180 if not groups:
181 # Nothing to be confused with the original character
182 return confusable_literals
183
184 # Append confusable homoglyph tokens to list
185 for homoglyph in groups[0]["homoglyphs"]:
186 confusable_literals += [
187 (re.sre_parse.LITERAL, ord(char))
188 for char in homoglyph["c"]
189 ]
190 return confusable_literals
191
192 # Iterative function to patch AST
193 def modify(ast_local: Iterable):
194
195 # Step through this level of the AST
196 for index, item in enumerate(ast_local):
197
198 # Token represented by tuple (TOKEN_NAME, TOKEN_VALUE)
199 if isinstance(item, tuple):
200
201 token_name, token_value, *_ = item
202
203 if token_name == re.sre_parse.IN:
204 # IN type found, (IN, [ (LITERAL, ord) ... ])
205 # Because you can't nest sets in regex, these need to be
206 # handled separately, with confusables inserted in place
207
208 confusables = []
209 for literal in token_value:
210 confusables += generate_confusables(chr(literal[1]))
211 literal_list += confusables
212
213 elif token_name == re.sre_parse.LITERAL:
214 # LITERAL type found, (LITERAL, ord)
215 # *NOT* in a set, replace with set
216
217 # Generate list of confusable homoglyphs based on `ord`
218 confusables = generate_confusables(chr(token_value))
219
220 # Overwrite the original LITERAL token in the AST with a
221 # set of confusable homoglyphs
222 ast_local[index] = (re.sre_parse.IN, confusables)
223
224 else:
225 # Not LITERAL or IN; more possible tokens nested in this
226 # one. Convert to list, recurse, output back to tuple, then
227 # overwrite in AST
228 ast_local[index] = tuple(modify(list(item)))
229
230 elif isinstance(item, re.sre_parse.SubPattern):
231 # More possible tokens, recurse and overwrite in AST
232 ast_local[index] = modify(item)
233
234 return ast_local
235
236 # Patch generated native AST
237 ast_root = modify(ast_root)
238
239 # Compile AST to case-insensitive re.Pattern and return
240 return re.sre_compile.compile(ast_root, re.IGNORECASE)
241```
242
243Testing the generated regular expression pattern with the example input string
244from earlier, we can see it now works as expected.
245
246```python
247>>> pattern = patched_regex("^asdf(.*)$")
248
249>>> pattern.match("Not a match")
250None
251
252>>> pattern.match("Not a match even though asdf is in it because it doesn't follow the regex pattern")
253None
254
255>>> pattern.match("asdf")
256<re.Match object; span=(0, 4), match='asdf'>
257
258>>> pattern.match("asdf match")
259<re.Match object; span=(0, 10), match='asdf match'>
260
261>>> pattern.match("𝜶ꮪ𝚍𝖿") # String containing confusable homoglyphs
262<re.Match object; span=(0, 4), match='𝜶ꮪ𝚍𝖿'>
263
264>>> pattern2 = patch_regex("^asdf\\$(.*)$") # Also works when escaping special characters
265
266>>> pattern2.match("asdf match?")
267None
268
269>>> pattern2.match("asdf$ match?")
270<re.Match object; span=(0, 11), match='asdf$ match'>
271```
272<figcaption>This works likewise with re.Pattern.search() and other re.Pattern
273functions.</figcaption>
274
275I submitted a pull request to the bot which would make any filter string
276prefixed by `regex:` use a custom regex compilation process similar to the one
277above. This allows the Discord bot to employ arbitrary regular expressions as
278filter items, making use of supported regex features such as lookarounds, while
279still preserving expansion for non-ASCII confusable homoglyphs.
280