1 | --- |
2 | type: article |
3 | identifier: python-regex-homoglyphs |
4 | title: Patching Python's regex AST for confusable homoglyphs |
5 | description: Exploiting the re module's parsing and compiling methods to check for "confusable homoglyphs" and create a better automoderator. |
6 | datestring: 2023-03-18 |
7 | banner_image: /static/images/futaba-filter.jpg |
8 | links: |
9 | futaba: https://github.com/strinking/futaba |
10 | confusable-homoglyphs: https://pypi.org/project/confusable-homoglyphs/ |
11 | Python re module: https://docs.python.org/3/library/re.html |
12 | Scunthorpe problem: https://en.wikipedia.org/wiki/Scunthorpe_problem |
13 | Abstract Syntax Tree (AST): https://en.wikipedia.org/wiki/Abstract_syntax_tree |
14 | --- |
15 |
|
16 | A couple years ago I contributed to an open source Discord |
17 | [bot](https://github.com/strinking/futaba) designed by a public computer |
18 | programming community I’m part of. As with most popular Discord bots, it |
19 | incorporates its own filter to prohibit unwanted language. However, to ensure |
20 | coverage of messages like `as𝕕f` (notice non-ASCII `𝕕`) in case it’s told to |
21 | filter `asdf` (an ASCII-only string), the bot makes use of the |
22 | [confusable_homoglyphs](https://pypi.org/project/confusable-homoglyphs/) Python |
23 | package to automatically expand an inputted filter string to cover these |
24 | non-ASCII edge cases. |
25 |
|
26 | Originally, the bot used Python's native string manipulation to convert the |
27 | inputted filter string into a Python [regular |
28 | expression](https://docs.python.org/3/library/re.html) string that matches for |
29 | each original character as well as each character's confusable homoglyphs |
30 | (similar-looking non-ASCII characters). For example, the inputted filter string |
31 | `asdf` was converted by the bot into the following regular expression: |
32 |
|
33 | ```bash |
34 | [a⍺a𝐚𝑎𝒂𝒶𝓪𝔞𝕒𝖆𝖺𝗮𝘢𝙖𝚊ɑα𝛂𝛼𝜶𝝰𝞪а][ss𝐬𝑠𝒔𝓈𝓼𝔰𝕤𝖘𝗌𝘀𝘴𝙨𝚜ꜱƽѕꮪ𑣁𐑈][dⅾⅆ𝐝𝑑𝒅𝒹𝓭𝔡𝕕𝖉𝖽𝗱𝘥𝙙𝚍ԁᏧᑯꓒ][f𝐟𝑓𝒇𝒻𝓯𝔣𝕗𝖋𝖿𝗳𝘧𝙛𝚏ꬵꞙſẝք] |
35 | ``` |
36 |
|
37 | This regular expression uses character sets to match one instance each of the |
38 | original character or its confusable homoglyphs. In other words, it has the |
39 | ability to catch any recognizable rendition of a word using uncommon non-ASCII |
40 | characters. This was really nice, because it meant the bot could catch most |
41 | edge cases automatically. Unfortunately, however... |
42 |
|
43 | ## The problem |
44 |
|
45 | Using the method of regex generation described above precludes the use of |
46 | arbitrary regex patterns, which would help in looking at context to prevent |
47 | [false positives](https://en.wikipedia.org/wiki/Scunthorpe_problem). When |
48 | expanding filter strings, the bot could not differentiate between literal |
49 | characters and special regex tokens. So, instead of the valid regular |
50 | expression string `^asdf(.*)$` (which matches anything following "asdf", only |
51 | if "asdf" is at the beginning of the message) being converted into the |
52 | following, preserving special regex tokens as desired... |
53 |
|
54 | ```bash |
55 | ^[a⍺a𝐚𝑎𝒂𝒶𝓪𝔞𝕒𝖆𝖺𝗮𝘢𝙖𝚊ɑα𝛂𝛼𝜶𝝰𝞪а][ss𝐬𝑠𝒔𝓈𝓼𝔰𝕤𝖘𝗌𝘀𝘴𝙨𝚜ꜱƽѕꮪ𑣁𐑈][dⅾⅆ𝐝𝑑𝒅𝒹𝓭𝔡𝕕𝖉𝖽𝗱𝘥𝙙𝚍ԁᏧᑯꓒ][f𝐟𝑓𝒇𝒻𝓯𝔣𝕗𝖋𝖿𝗳𝘧𝙛𝚏ꬵꞙſẝք](.*)$ |
56 | ``` |
57 |
|
58 | ...it would convert into this, mangling the original intended regex string, |
59 | making it useless for arbitrary matching: |
60 |
|
61 | ```bash |
62 | [\^˄ˆ][a⍺a𝐚𝑎𝒂𝒶𝓪𝔞𝕒𝖆𝖺𝗮𝘢𝙖𝚊ɑα𝛂𝛼𝜶𝝰𝞪а][ss𝐬𝑠𝒔𝓈𝓼𝔰𝕤𝖘𝗌𝘀𝘴𝙨𝚜ꜱƽѕꮪ𑣁𐑈][dⅾⅆ𝐝𝑑𝒅𝒹𝓭𝔡𝕕𝖉𝖽𝗱𝘥𝙙𝚍ԁᏧᑯꓒ][f𝐟𝑓𝒇𝒻𝓯𝔣𝕗𝖋𝖿𝗳𝘧𝙛𝚏ꬵꞙſẝք][\([❨❲〔﴾][\.𝅭․܁܂꘎𐩐٠۰ꓸ][\*⁎٭∗𐌟][\)]❩❳〕﴿]$ |
63 | ``` |
64 | <figcaption>Interestingly, the confusable-homoglyphs package doesn't list any |
65 | special characters that look similar to <code>$</code>.</figcaption> |
66 |
|
67 | ## The solution |
68 |
|
69 | To support expansion for confusables while preserving arbitrary regex, we need |
70 | to generate an **[abstract syntax |
71 | tree](https://en.wikipedia.org/wiki/Abstract_syntax_tree) (AST)** for the regular |
72 | expression and manipulate that somehow. For structured languages, an AST |
73 | represents the layout of meaningful tokens based on a predefined grammar, or |
74 | syntax. For example, regex parsers have to correctly interpret `[` and `]` as |
75 | special characters defining a set of characters within—unless they're escaped, |
76 | like `\[` and `\]`, in which case they'll be taken as "literal" bracket |
77 | characters. |
78 |
|
79 | After generating an AST for the regular expression, we then must modify it to |
80 | replace predefined character literals (e.g. `a`) with sets of their confusable |
81 | homoglyphs (e.g. `[a⍺a𝐚𝑎𝒂𝒶𝓪𝔞𝕒𝖆𝖺𝗮𝘢𝙖𝚊ɑα𝛂𝛼𝜶𝝰𝞪а]`) before compiling the AST into a |
82 | usable pattern-matching object. While this process appears similar to the |
83 | string manipulation method described previously, parsing first for an AST |
84 | provides a mutable structure that allows us to distinguish character literals |
85 | from special regex tokens/grammar. Fortunately, Python’s `re` module already |
86 | contains (and exposes!) the internal tools for parsing and compiling—we just |
87 | need to modify the process. |
88 |
|
89 | **<i>The AST is generated from the inputted filter string, and the usable |
90 | pattern matching object is compiled from the AST. Since these steps are |
91 | separate in Python's `re` pipeline and since the AST is a mutable object, the |
92 | AST object can be separately modified before being compiled.</i>** |
93 |
|
94 | # Manipulating Python’s regex AST |
95 |
|
96 | This required a bit of reverse engineering on my part as I couldn't find any |
97 | adequate documentation on the internals of Python’s `re` module. After some |
98 | brief digging in the CPython source repository, I found **two submodules of |
99 | `re` that handle regex string parsing and compilation: `re.sre_parse` and |
100 | `re.sre_compile`**, respectively. |
101 |
|
102 | ## Reverse engineering |
103 |
|
104 | The `re.compile()` function uses the `sre_parse` and `sre_compile` submodules |
105 | by effectively doing the following to return a `re.Pattern` object: |
106 |
|
107 | ```python |
108 | ast = re.sre_parse.parse( input_regex_string ) # -> re.sre_parse.SubPattern |
109 | pattern_object = re.sre_compile.compile( ast ) # -> re.Pattern |
110 | return pattern_object |
111 | ``` |
112 |
|
113 | Knowing this, we can experiment with the output of `sre_parse.parse()` function |
114 | to determine `re`'s AST structure and figure out how we need to modify it. |
115 |
|
116 | ```python |
117 | >>> import re |
118 | |
119 | >>> re.sre_parse.parse("asdf") |
120 | [(LITERAL, 97), (LITERAL, 115), (LITERAL, 100), (LITERAL, 102)] |
121 | |
122 | >>> type(re.sre_parse.parse("asdf")) |
123 | sre_parse.SubPattern |
124 | |
125 | >>> re.sre_parse.parse("[asdf]") |
126 | [(IN, [(LITERAL, 97), (LITERAL, 115), (LITERAL, 100), (LITERAL, 102)])] |
127 | |
128 | >>> re.sre_parse.parse("[asd-f]") # Ranges? |
129 | [(IN, [(LITERAL, 97), (LITERAL, 115), (RANGE, (100, 102))])] |
130 | |
131 | >>> re.sre_parse.parse("(asdf)") # To see how `re` handles nested tokens |
132 | [(SUBPATTERN, (1, 0, 0, [(LITERAL, 97), (LITERAL, 115), (LITERAL, 100), (LITERAL, 102)]))] |
133 | ``` |
134 |
|
135 | From this, we know the `sre_parse.parse()` function returns a `SubPattern` |
136 | list-like object containing *tuples of tokens in the format `(TOKEN_NAME, |
137 | TOKEN_VALUE)`.* Also, given the name `SubPattern`, it's likely this is nested |
138 | for other types of regex grammar (maybe for lookarounds or matching groups?). |
139 |
|
140 | ## Modifying the AST (implementing the solution) |
141 |
|
142 | For our case, we’re looking to replace tokens identified as `LITERAL`s: |
143 |
|
144 | > `(LITERAL, ord)`, representing literal characters like `a` |
145 |
|
146 | with character match sets—`IN` tokens—wrapping more `LITERAL` tokens: |
147 |
|
148 | > `(IN, [ (LITERAL, ord) ... ])`, representing sets like `[a⍺a𝐚𝑎𝒂𝒶𝓪𝔞𝕒𝖆𝖺𝗮𝘢𝙖𝚊ɑα𝛂𝛼𝜶𝝰𝞪а]` |
149 |
|
150 | We also have this `RANGE` token that needs to be handled: |
151 |
|
152 | > `(RANGE, (LOWER_ORD, UPPER_ORD))`, representing set ranges like `a-z` in `[a-z]` |
153 |
|
154 | Because abstract syntax trees are, well, trees, this needs to be done |
155 | recursively to account for nested tokens, such as those within matching groups. |
156 | What's important to note here is that regex does not allow nested character |
157 | sets. So, if the input string uses sets natively, and we want to expand the |
158 | characters in that set to cover confusable homoglyphs, we need to make sure we |
159 | aren't creating a new set within the original set. You can see how I |
160 | accomplished this below, among other things like handling ranges. |
161 |
|
162 | This is the solution I came up with: |
163 |
|
164 | ```python |
165 | from collections.abc import Iterable |
166 | import re |
167 | |
168 | from confusable_homoglyphs import confusables |
169 | |
170 | def patched_regex(regex_string: str) -> re.Pattern: |
171 | """ |
172 | Generate a regex pattern object after replacing literals with sets of |
173 | confusable homoglyphs |
174 | """ |
175 | |
176 | # Generate AST from base input string |
177 | ast_root = re.sre_parse.parse(regex_string) |
178 | |
179 | # Generate list of confusable homoglyph LITERAL tokens based on input |
180 | # character, including token for input character |
181 | def generate_confusables(confusable: chr) -> list: |
182 | groups = confusables.is_confusable(confusable, greedy=True) |
183 | |
184 | # Begin the homoglyph set with the original character |
185 | confusable_literals = [(re.sre_parse.LITERAL, ord(confusable))] |
186 | |
187 | if not groups: |
188 | # Nothing to be confused with the original character |
189 | return confusable_literals |
190 | |
191 | # Append confusable homoglyph tokens to list |
192 | # Check confusable_homoglyphs documentation to verify this usage |
193 | for homoglyph in groups[0]["homoglyphs"]: |
194 | confusable_literals += [ |
195 | (re.sre_parse.LITERAL, ord(char)) |
196 | for char in homoglyph["c"] |
197 | ] |
198 | return confusable_literals |
199 | |
200 | # Iterative function to patch AST |
201 | def modify(ast_local: Iterable): |
202 | |
203 | # Step through this level of the AST |
204 | for index, item in enumerate(ast_local): |
205 | |
206 | # Token represented by tuple |
207 | # (TOKEN_NAME, TOKEN_VALUE) is likely |
208 | if isinstance(item, tuple): |
209 | |
210 | token_name, token_value, *_ = item |
211 | |
212 | if token_name == re.sre_parse.IN: |
213 | # IN type found, (IN, [ (LITERAL, ord) ... ]) |
214 | # Because you can't nest sets in regex, these need to be |
215 | # handled separately, with confusables inserted in place |
216 | |
217 | # Generate confusables for every literal in charset |
218 | confusables = [] |
219 | for set_token in token_value: |
220 | if set_token[0] == re.sre_parse.RANGE: |
221 | # If this is a RANGE, e.g. [a-z] |
222 | # ( RANGE, (LOWER_ORD, UPPER_ORD) ) |
223 | lower_bound = set_token[1][0] |
224 | upper_bound = set_token[1][1] |
225 | |
226 | # [lower_bound, upper_bound] loop, inclusive |
227 | # Appends confusables for all characters in range |
228 | for ord_value in range(lower_bound, upper_bound + 1): |
229 | confusables += generate_confusables(chr(ord_value)) |
230 | |
231 | else: |
232 | # Must be a LITERAL |
233 | # Append confusables for this character to list |
234 | confusables += generate_confusables(chr(set_token[1])) |
235 | |
236 | # Append confusables to character set |
237 | token_value += confusables |
238 | |
239 | elif token_name == re.sre_parse.LITERAL: |
240 | # LITERAL type found, (LITERAL, ord) |
241 | # *NOT* in a set, replace with set |
242 | |
243 | # Generate list of confusable homoglyphs based on `ord` |
244 | confusables = generate_confusables(chr(token_value)) |
245 | |
246 | # Overwrite the original LITERAL token in the AST with a |
247 | # set of confusable homoglyphs |
248 | ast_local[index] = (re.sre_parse.IN, confusables) |
249 | |
250 | else: |
251 | # Not LITERAL or IN; more possible tokens nested in this |
252 | # one. Convert to list, recurse, output back to tuple, then |
253 | # overwrite in AST |
254 | ast_local[index] = tuple(modify(list(item))) |
255 | |
256 | # If not a tuple/token |
257 | elif isinstance(item, re.sre_parse.SubPattern): |
258 | # More possible tokens, recurse and overwrite in AST |
259 | ast_local[index] = modify(item) |
260 | |
261 | return ast_local |
262 | |
263 | # Patch generated native AST |
264 | ast_root = modify(ast_root) |
265 | |
266 | # Compile AST to case-insensitive re.Pattern and return |
267 | return re.sre_compile.compile(ast_root, re.IGNORECASE) |
268 | ``` |
269 |
|
270 | Testing with some sample regular expressions, we can see it works as desired: |
271 |
|
272 | ```python |
273 | >>> patched_regex("asdf").match("asdf") |
274 | <re.Match object; span=(0, 4), match='asdf'> |
275 | |
276 | >>> patched_regex("asdf").match("as𝕕f") |
277 | <re.Match object; span=(0, 4), match='as𝕕f'> |
278 | |
279 | >>> patched_regex("as[d-f]").match("as𝕕") |
280 | <re.Match object; span=(0, 3), match='as𝕕'> |
281 | |
282 | >>> patched_regex("as[d-f]*").match("as𝕗𝕗𝕗𝕗") |
283 | <re.Match object; span=(0, 6), match='as𝕗𝕗𝕗𝕗'> |
284 | |
285 | >>> patched_regex("as[d-f]").match("as𝕗") |
286 | <re.Match object; span=(0, 3), match='as𝕗'> |
287 | |
288 | >>> patched_regex("[asd-f]").findall("asxℯ") |
289 | ['a', 's', 'ℯ'] |
290 | |
291 | >>> patched_regex("[asd-f]").match("qwerty") |
292 | None # Looks weird, but native compile yields this too. `findall` and `search` work for this. |
293 | ``` |
294 | <figcaption>This works likewise with re.Pattern.search() and other re.Pattern |
295 | functions, and includes all native regex features. The only real limitation |
296 | here is generating confusables, since <code>confusable-homoglyphs</code> doesn't seem to |
297 | account for accent characters, e.g. <code>é</code> -> <code>e</code>.</figcaption> |
298 |
|
299 | I submitted a pull request to the bot which would make any filter string |
300 | prefixed by `regex:` use a custom regex compilation process similar to the one |
301 | above. This allows the Discord bot to employ arbitrary regular expressions as |
302 | filter items, making use of supported regex features such as lookarounds, while |
303 | still preserving expansion for non-ASCII confusable homoglyphs. |
304 |
|