1 | --- |
2 | type: article |
3 | identifier: python-regex-homoglyphs |
4 | title: Patching Python's regex AST for confusable homoglyphs |
5 | description: Exploiting the re module's parsing and compiling methods to check for "confusable homoglyphs" and create a better automoderator. |
6 | datestring: 2023-03-17 |
7 | banner_image: /static/images/futaba-filter.jpg |
8 | links: |
9 | futaba: https://github.com/strinking/futaba |
10 | confusable_homoglyphs: https://pypi.org/project/confusable_homoglyphs/ |
11 | Python re module: https://docs.python.org/3/library/re.html |
12 | Scunthorpe problem: https://www.wikiwand.com/en/Scunthorpe_problem |
13 | Abstract Syntax Tree (AST): https://www.wikiwand.com/en/Abstract_syntax_tree |
14 | --- |
15 |
|
16 | A couple years ago I contributed to an open source Discord |
17 | [bot](https://github.com/strinking/futaba) designed by a public computer |
18 | programming community I’m part of. As with most popular Discord bots, it |
19 | incorporates its own filter to prohibit unwanted language. However, to ensure |
20 | coverage of messages like `as𝕕f` (notice non-ASCII `𝕕`) in case it’s told to |
21 | filter `asdf` (an ASCII-only string), the bot makes use of the |
22 | [confusable_homoglyphs](https://pypi.org/project/confusable_homoglyphs/) Python |
23 | package to automatically expand an inputted filter string to cover these |
24 | non-ASCII edge cases. |
25 |
|
26 | Originally, the bot used Python's native string manipulation to convert the |
27 | inputted filter string into a Python [regular |
28 | expression](https://docs.python.org/3/library/re.html) string that matches for |
29 | each original character as well as each character's confusable homoglyphs |
30 | (similar-looking non-ASCII characters). For example, the inputted filter string |
31 | `asdf` was converted by the bot into the following regular expression: |
32 |
|
33 | ```bash |
34 | [a⍺a𝐚𝑎𝒂𝒶𝓪𝔞𝕒𝖆𝖺𝗮𝘢𝙖𝚊ɑα𝛂𝛼𝜶𝝰𝞪а][ss𝐬𝑠𝒔𝓈𝓼𝔰𝕤𝖘𝗌𝘀𝘴𝙨𝚜ꜱƽѕꮪ𑣁𐑈][dⅾⅆ𝐝𝑑𝒅𝒹𝓭𝔡𝕕𝖉𝖽𝗱𝘥𝙙𝚍ԁᏧᑯꓒ][f𝐟𝑓𝒇𝒻𝓯𝔣𝕗𝖋𝖿𝗳𝘧𝙛𝚏ꬵꞙſẝք] |
35 | ``` |
36 |
|
37 | This regular expression uses character sets to match one instance each of the |
38 | original character or its confusable homoglyphs. In other words, it has the |
39 | ability to catch any recognizable rendition of a word using uncommon non-ASCII |
40 | characters. This was really nice, because it meant the bot could catch most |
41 | edge cases automatically. Unfortunately, however... |
42 |
|
43 | ## The problem |
44 |
|
45 | Using the method of regex generation described above precludes the use of |
46 | arbitrary regex patterns, which would help in looking at context to prevent |
47 | [false positives](https://www.wikiwand.com/en/Scunthorpe_problem). When |
48 | expanding filter strings, the bot could not differentiate between literal |
49 | characters and special regex tokens. So, instead of the valid regular |
50 | expression string `^asdf(.*)$` (which matches "asdf" followed by anything else, |
51 | only if "asdf" is at the beginning) being converted into the following, |
52 | preserving special regex tokens as desired... |
53 |
|
54 | ```bash |
55 | ^[a⍺a𝐚𝑎𝒂𝒶𝓪𝔞𝕒𝖆𝖺𝗮𝘢𝙖𝚊ɑα𝛂𝛼𝜶𝝰𝞪а][ss𝐬𝑠𝒔𝓈𝓼𝔰𝕤𝖘𝗌𝘀𝘴𝙨𝚜ꜱƽѕꮪ𑣁𐑈][dⅾⅆ𝐝𝑑𝒅𝒹𝓭𝔡𝕕𝖉𝖽𝗱𝘥𝙙𝚍ԁᏧᑯꓒ][f𝐟𝑓𝒇𝒻𝓯𝔣𝕗𝖋𝖿𝗳𝘧𝙛𝚏ꬵꞙſẝք](.*)$ |
56 | ``` |
57 |
|
58 | ...it would convert into this, mangling the original intended regex string, |
59 | making it useless for arbitrary matching: |
60 |
|
61 | ```bash |
62 | [\^˄ˆ][a⍺a𝐚𝑎𝒂𝒶𝓪𝔞𝕒𝖆𝖺𝗮𝘢𝙖𝚊ɑα𝛂𝛼𝜶𝝰𝞪а][ss𝐬𝑠𝒔𝓈𝓼𝔰𝕤𝖘𝗌𝘀𝘴𝙨𝚜ꜱƽѕꮪ𑣁𐑈][dⅾⅆ𝐝𝑑𝒅𝒹𝓭𝔡𝕕𝖉𝖽𝗱𝘥𝙙𝚍ԁᏧᑯꓒ][f𝐟𝑓𝒇𝒻𝓯𝔣𝕗𝖋𝖿𝗳𝘧𝙛𝚏ꬵꞙſẝք][\([❨❲〔﴾][\.𝅭․܁܂꘎𐩐٠۰ꓸ][\*⁎٭∗𐌟][\)]❩❳〕﴿]$ |
63 | ``` |
64 | <figcaption>Interestingly, the confusable_homoglyphs package doesn't list any |
65 | special characters that look similar to <code>$</code>.</figcaption> |
66 |
|
67 | ## The solution |
68 |
|
69 | To support expansion for confusables while preserving arbitrary regex, we need |
70 | to generate an **[abstract syntax |
71 | tree](https://www.wikiwand.com/en/Abstract_syntax_tree) (AST)** for the regular |
72 | expression and manipulate that somehow. For structured languages, an AST |
73 | represents the layout of significant tokens based on a predefined grammar or |
74 | syntax. For example, regex parsers have to correctly interpret `[` and `]` as |
75 | special characters defining a set of characters within—unless they're escaped, |
76 | like `\[` and `\]`, in which case they'll be taken as "literal" bracket |
77 | characters. |
78 |
|
79 | After generating an AST for the regular expression, we then must modify it to |
80 | replace predefined character literals (e.g. `a`) with sets of their confusable |
81 | homoglyphs (e.g. `[a⍺a𝐚𝑎𝒂𝒶𝓪𝔞𝕒𝖆𝖺𝗮𝘢𝙖𝚊ɑα𝛂𝛼𝜶𝝰𝞪а]`) before compiling the AST into a |
82 | usable pattern-matching object. While this process appears similar to the |
83 | string manipulation method described previously, parsing first for an AST |
84 | provides a mutable structure that allows us to distinguish character literals |
85 | from special regex tokens/grammar. Fortunately, Python’s `re` module already |
86 | contains (and exposes!) the internal tools for parsing and compiling—we just |
87 | need to modify the process. |
88 |
|
89 | **<i>The AST is generated from the inputted filter string, and the usable |
90 | pattern matching object is compiled from the AST. Since these steps are |
91 | separate in Python's `re` pipeline and since the AST is a mutable object, the |
92 | AST object can be separately modified before being compiled.</i>** |
93 |
|
94 | # Manipulating Python’s regex AST |
95 |
|
96 | This required a bit of reverse engineering on my part as I couldn't find any |
97 | adequate documentation on the internals of Python’s `re` module. After some |
98 | brief digging in the CPython source repository, I found **two submodules of |
99 | `re` that handle regex string parsing and compilation: `re.sre_parse` and |
100 | `re.sre_compile`**, respectively. |
101 |
|
102 | ## Reverse engineering |
103 |
|
104 | The `re.compile()` function uses the `sre_parse` and `sre_compile` submodules |
105 | by (effectively) doing the following to return a `re.Pattern` object: |
106 |
|
107 | ```python |
108 | ast = re.sre_parse.parse( input_regex_string ) # -> re.sre_parse.SubPattern |
109 | pattern_object = re.sre_compile.compile( ast ) # -> re.Pattern |
110 | return pattern_object |
111 | ``` |
112 |
|
113 | Knowing this, we can experiment with the output of `sre_parse.parse()` function |
114 | to determine `re`'s AST structure and figure out how we need to modify it. |
115 |
|
116 | ```python |
117 | >>> import re |
118 | |
119 | >>> re.sre_parse.parse("asdf") |
120 | [(LITERAL, 97), (LITERAL, 115), (LITERAL, 100), (LITERAL, 102)] |
121 | |
122 | >>> type(re.sre_parse.parse("asdf")) |
123 | sre_parse.SubPattern |
124 | |
125 | >>> re.sre_parse.parse("[asdf]") |
126 | [(IN, [(LITERAL, 97), (LITERAL, 115), (LITERAL, 100), (LITERAL, 102)])] |
127 | |
128 | >>> re.sre_parse.parse("(asdf)") # To see how `re` handles nested tokens |
129 | [(SUBPATTERN, (1, 0, 0, [(LITERAL, 97), (LITERAL, 115), (LITERAL, 100), (LITERAL, 102)]))] |
130 | ``` |
131 |
|
132 | From this, we know the `sre_parse.parse()` function returns a `SubPattern` |
133 | list-like object containing *tuples of tokens in the format `(TOKEN_NAME, |
134 | TOKEN_VALUE)`.* Also, given the name `SubPattern`, it's likely this is nested |
135 | for other types of regex grammar (maybe for lookarounds or matching groups?). |
136 |
|
137 | ## Modifying the AST (implementing the solution) |
138 |
|
139 | For our case, we’re looking to replace tokens identified as `LITERAL`s: |
140 |
|
141 | > `(LITERAL, ord)`, representing literal characters like `a` |
142 |
|
143 | with character match sets—`IN` tokens—wrapping more `LITERAL` tokens: |
144 |
|
145 | > `(IN, [ (LITERAL, ord) ... ])`, representing sets like `[a⍺a𝐚𝑎𝒂𝒶𝓪𝔞𝕒𝖆𝖺𝗮𝘢𝙖𝚊ɑα𝛂𝛼𝜶𝝰𝞪а]` |
146 |
|
147 | Because abstract syntax trees are, well, trees, this needs to be done |
148 | recursively to account for nested tokens, such as those within matching groups. |
149 | What's important to note here is that regex does not allow nested character |
150 | sets. So, if the input string uses sets natively, and we want to expand the |
151 | characters in that set to cover confusable homoglyphs, we need to make sure we |
152 | aren't creating a new set within the original set. You can see how I |
153 | accomplished this below. |
154 |
|
155 | This is the solution I came up with: |
156 |
|
157 | ```python |
158 | from collections.abc import Iterable |
159 | import re |
160 | |
161 | from confusable_homoglyphs import confusables |
162 | |
163 | def patched_regex(regex_string: str) -> re.Pattern: |
164 | """ |
165 | Generate a regex pattern object after replacing literals with sets of |
166 | confusable homoglyphs |
167 | """ |
168 | |
169 | # Generate AST from base input string |
170 | ast_root = re.sre_parse.parse(regex_string) |
171 | |
172 | # Generate list of confusable homoglyphs LITERALs based on input |
173 | # character, including input character |
174 | def generate_confusables(confusable: chr) -> list: |
175 | groups = confusables.is_confusable(confusable, greedy=True) |
176 | |
177 | # Begin the homoglyph set with the original character |
178 | confusable_literals = [(re.sre_parse.LITERAL, ord(confusable))] |
179 | |
180 | if not groups: |
181 | # Nothing to be confused with the original character |
182 | return confusable_literals |
183 | |
184 | # Append confusable homoglyph tokens to list |
185 | for homoglyph in groups[0]["homoglyphs"]: |
186 | confusable_literals += [ |
187 | (re.sre_parse.LITERAL, ord(char)) |
188 | for char in homoglyph["c"] |
189 | ] |
190 | return confusable_literals |
191 | |
192 | # Iterative function to patch AST |
193 | def modify(ast_local: Iterable): |
194 | |
195 | # Step through this level of the AST |
196 | for index, item in enumerate(ast_local): |
197 | |
198 | # Token represented by tuple (TOKEN_NAME, TOKEN_VALUE) |
199 | if isinstance(item, tuple): |
200 | |
201 | token_name, token_value, *_ = item |
202 | |
203 | if token_name == re.sre_parse.IN: |
204 | # IN type found, (IN, [ (LITERAL, ord) ... ]) |
205 | # Because you can't nest sets in regex, these need to be |
206 | # handled separately, with confusables inserted in place |
207 | |
208 | confusables = [] |
209 | for literal in token_value: |
210 | confusables += generate_confusables(chr(literal[1])) |
211 | literal_list += confusables |
212 | |
213 | elif token_name == re.sre_parse.LITERAL: |
214 | # LITERAL type found, (LITERAL, ord) |
215 | # *NOT* in a set, replace with set |
216 | |
217 | # Generate list of confusable homoglyphs based on `ord` |
218 | confusables = generate_confusables(chr(token_value)) |
219 | |
220 | # Overwrite the original LITERAL token in the AST with a |
221 | # set of confusable homoglyphs |
222 | ast_local[index] = (re.sre_parse.IN, confusables) |
223 | |
224 | else: |
225 | # Not LITERAL or IN; more possible tokens nested in this |
226 | # one. Convert to list, recurse, output back to tuple, then |
227 | # overwrite in AST |
228 | ast_local[index] = tuple(modify(list(item))) |
229 | |
230 | elif isinstance(item, re.sre_parse.SubPattern): |
231 | # More possible tokens, recurse and overwrite in AST |
232 | ast_local[index] = modify(item) |
233 | |
234 | return ast_local |
235 | |
236 | # Patch generated native AST |
237 | ast_root = modify(ast_root) |
238 | |
239 | # Compile AST to case-insensitive re.Pattern and return |
240 | return re.sre_compile.compile(ast_root, re.IGNORECASE) |
241 | ``` |
242 |
|
243 | Testing the generated regular expression pattern with the example input string |
244 | from earlier, we can see it now works as expected. |
245 |
|
246 | ```python |
247 | >>> pattern = patched_regex("^asdf(.*)$") |
248 | |
249 | >>> pattern.match("Not a match") |
250 | None |
251 | |
252 | >>> pattern.match("Not a match even though asdf is in it because it doesn't follow the regex pattern") |
253 | None |
254 | |
255 | >>> pattern.match("asdf") |
256 | <re.Match object; span=(0, 4), match='asdf'> |
257 | |
258 | >>> pattern.match("asdf match") |
259 | <re.Match object; span=(0, 10), match='asdf match'> |
260 | |
261 | >>> pattern.match("𝜶ꮪ𝚍𝖿") # String containing confusable homoglyphs |
262 | <re.Match object; span=(0, 4), match='𝜶ꮪ𝚍𝖿'> |
263 | |
264 | >>> pattern2 = patch_regex("^asdf\\$(.*)$") # Also works when escaping special characters |
265 | |
266 | >>> pattern2.match("asdf match?") |
267 | None |
268 | |
269 | >>> pattern2.match("asdf$ match?") |
270 | <re.Match object; span=(0, 11), match='asdf$ match'> |
271 | ``` |
272 | <figcaption>This works likewise with re.Pattern.search() and other re.Pattern |
273 | functions.</figcaption> |
274 |
|
275 | I submitted a pull request to the bot which would make any filter string |
276 | prefixed by `regex:` use a custom regex compilation process similar to the one |
277 | above. This allows the Discord bot to employ arbitrary regular expressions as |
278 | filter items, making use of supported regex features such as lookarounds, while |
279 | still preserving expansion for non-ASCII confusable homoglyphs. |
280 |
|