1 | --- |
2 | type: none |
3 | identifier: python-regex-homoglyphs |
4 | title: Patching Python's regex AST for confusable homoglyphs |
5 | description: Exploiting the re module's parsing and compiling methods to create a better automoderator. |
6 | datestring: 2022-11-06 |
7 | banner_image: /static/images/futaba-filter.jpg |
8 | links: |
9 | futaba: https://github.com/strinking/futaba |
10 | confusable_homoglyphs: https://pypi.org/project/confusable_homoglyphs/ |
11 | Python re module: https://docs.python.org/3/library/re.html |
12 | Scunthorpe problem: https://www.wikiwand.com/en/Scunthorpe_problem |
13 | Abstract Syntax Trees: https://www.wikiwand.com/en/Abstract_syntax_tree |
14 | --- |
15 |
|
16 | A couple years ago I contributed to an open source Discord |
17 | [bot](https://github.com/strinking/futaba) designed by a public computer |
18 | programming community I’m part of. As with most popular Discord bots, it |
19 | incorporates its own filter to prohibit unwanted language. However, to ensure |
20 | coverage of messages like `as𝕕f` (notice non-ASCII `𝕕`) in case it’s told to |
21 | filter `asdf` (an ASCII-only string), the bot makes use of the |
22 | [confusable_homoglyphs](https://pypi.org/project/confusable_homoglyphs/) Python |
23 | package to automatically expand an inputted filter string to support these |
24 | non-ASCII edge cases. |
25 |
|
26 | Originally, the bot used base Python string manipulation to convert the |
27 | inputted filter string into a Python [regular |
28 | expression](https://docs.python.org/3/library/re.html) string that matches for |
29 | each original character as well as each character's confusable homoglyphs |
30 | (similar-looking non-ASCII characters). For example, the inputted filter string |
31 | `asdf` was converted by the bot into the following regular expression: |
32 |
|
33 | ```bash |
34 | [a⍺a𝐚𝑎𝒂𝒶𝓪𝔞𝕒𝖆𝖺𝗮𝘢𝙖𝚊ɑα𝛂𝛼𝜶𝝰𝞪а][ss𝐬𝑠𝒔𝓈𝓼𝔰𝕤𝖘𝗌𝘀𝘴𝙨𝚜ꜱƽѕꮪ𑣁𐑈][dⅾⅆ𝐝𝑑𝒅𝒹𝓭𝔡𝕕𝖉𝖽𝗱𝘥𝙙𝚍ԁᏧᑯꓒ][f𝐟𝑓𝒇𝒻𝓯𝔣𝕗𝖋𝖿𝗳𝘧𝙛𝚏ꬵꞙſẝք] |
35 | ``` |
36 |
|
37 | This regular expression uses character sets to match one instance each of the |
38 | original character or its confusable homoglyphs. In other words, it has the |
39 | ability to catch any recognizable rendition of a word using uncommon non-ASCII |
40 | characters. This was really nice, because it meant the bot could catch most |
41 | edge cases automatically. Unfortunately, however... |
42 |
|
43 | ## The problem |
44 |
|
45 | Using the method of regex generation described above precludes the use of |
46 | arbitrary regex patterns, which would help in looking at context to prevent |
47 | [false positives](https://www.wikiwand.com/en/Scunthorpe_problem). When |
48 | expanding filter strings, the bot could not differentiate between literal |
49 | characters and special regex tokens. So, instead of the valid regular |
50 | expression string `^asdf(.*)$` converting into this, preserving special regex |
51 | tokens as desired: |
52 |
|
53 | ```bash |
54 | ^[a⍺a𝐚𝑎𝒂𝒶𝓪𝔞𝕒𝖆𝖺𝗮𝘢𝙖𝚊ɑα𝛂𝛼𝜶𝝰𝞪а][ss𝐬𝑠𝒔𝓈𝓼𝔰𝕤𝖘𝗌𝘀𝘴𝙨𝚜ꜱƽѕꮪ𑣁𐑈][dⅾⅆ𝐝𝑑𝒅𝒹𝓭𝔡𝕕𝖉𝖽𝗱𝘥𝙙𝚍ԁᏧᑯꓒ][f𝐟𝑓𝒇𝒻𝓯𝔣𝕗𝖋𝖿𝗳𝘧𝙛𝚏ꬵꞙſẝք](.*)$ |
55 | ``` |
56 |
|
57 | It would convert into this, mangling the original intended regex string, making |
58 | it useless for arbitrary matching: |
59 |
|
60 | ```bash |
61 | [\^˄ˆ][a⍺a𝐚𝑎𝒂𝒶𝓪𝔞𝕒𝖆𝖺𝗮𝘢𝙖𝚊ɑα𝛂𝛼𝜶𝝰𝞪а][ss𝐬𝑠𝒔𝓈𝓼𝔰𝕤𝖘𝗌𝘀𝘴𝙨𝚜ꜱƽѕꮪ𑣁𐑈][dⅾⅆ𝐝𝑑𝒅𝒹𝓭𝔡𝕕𝖉𝖽𝗱𝘥𝙙𝚍ԁᏧᑯꓒ][f𝐟𝑓𝒇𝒻𝓯𝔣𝕗𝖋𝖿𝗳𝘧𝙛𝚏ꬵꞙſẝք][\([❨❲〔﴾][\.𝅭․܁܂꘎𐩐٠۰ꓸ][\*⁎٭∗𐌟][\)]❩❳〕﴿]$ |
62 | ``` |
63 | <figcaption>Interestingly, confusable_homoglyphs doesn't list any special |
64 | characters that look similar to $.</figcaption> |
65 |
|
66 | ## The solution |
67 |
|
68 | To support expansion for confusables while preserving arbitrary regex, an |
69 | **[abstract syntax tree](https://www.wikiwand.com/en/Abstract_syntax_tree) |
70 | (AST)** for the regular expression would need to be generated from the filter |
71 | string, then modified to replace predefined character literals (e.g. `a`) with |
72 | sets of their confusable homoglyphs (e.g. `[a⍺a𝐚𝑎𝒂𝒶𝓪𝔞𝕒𝖆𝖺𝗮𝘢𝙖𝚊ɑα𝛂𝛼𝜶𝝰𝞪а]`) before |
73 | being compiled into a usable pattern object. While this process appears similar |
74 | to the string manipulation method described previously, parsing first for an |
75 | AST provides a mutable structure that allows us to distinguish character |
76 | literals from special regex lexemes/grammar. Fortunately, Python’s `re` module |
77 | already contains (and exposes!!) the internal tools for parsing and |
78 | compiling—we just need to modify the process. |
79 |
|
80 | *The AST is generated from the inputted filter string, and the usable pattern |
81 | matching object is compiled from the AST. Since these steps are separate in |
82 | Python's `re` pipeline and since the AST is a mutable object, the AST object |
83 | can be separately modified before being compiled.* |
84 |
|
85 | # Manipulating Python’s regex AST |
86 |
|
87 | This required a bit of reverse engineering on my part as I couldn't find any |
88 | adequate documentation on the internals of Python’s `re` module. After some |
89 | brief digging in the CPython source repository, I found **two submodules of |
90 | `re` that handle regex string parsing and compilation: `re.sre_parse` and |
91 | `re.sre_compile`**, respectively. |
92 |
|
93 | ## Reverse engineering |
94 |
|
95 | The `re.compile()` function uses the `sre_parse` and `sre_compile` submodules |
96 | by (effectively) doing the following to return a `re.Pattern` object: |
97 |
|
98 | ```python |
99 | ast = re.sre_parse.parse( input_regex_string ) # -> re.sre_parse.SubPattern |
100 | pattern_object = re.sre_compile.compile( ast ) # -> re.Pattern |
101 | return pattern_object |
102 | ``` |
103 |
|
104 | Knowing this, we can experiment with the output of `sre_parse.parse()` function |
105 | to determine `re`'s AST structure and figure out how we need to modify it. |
106 |
|
107 | ```python |
108 | >>> import re |
109 | |
110 | >>> re.sre_parse.parse("asdf") |
111 | [(LITERAL, 97), (LITERAL, 115), (LITERAL, 100), (LITERAL, 102)] |
112 | |
113 | >>> type(re.sre_parse.parse("asdf")) |
114 | sre_parse.SubPattern |
115 | |
116 | >>> re.sre_parse.parse("[asdf]") |
117 | [(IN, [(LITERAL, 97), (LITERAL, 115), (LITERAL, 100), (LITERAL, 102)])] |
118 | |
119 | >>> re.sre_parse.parse("(asdf)") # To see how `re` handles nested lexemes |
120 | [(SUBPATTERN, (1, 0, 0, [(LITERAL, 97), (LITERAL, 115), (LITERAL, 100), (LITERAL, 102)]))] |
121 | ``` |
122 |
|
123 | From this, we know the `sre_parse.parse()` function returns a `SubPattern` |
124 | list-like object containing *tuples of lexemes in the format `(LEXEME_NAME, |
125 | value)`.* |
126 |
|
127 | ## Modifying the AST (implementing the solution) |
128 |
|
129 | For our case, we’re looking to replace lexemes identified as `LITERAL`s: |
130 |
|
131 | > `(LITERAL, ord)`, representing literal characters like `a` |
132 |
|
133 | with character match sets—`IN` lexemes—wrapping more `LITERAL` lexemes: |
134 |
|
135 | > `(IN, [ (LITERAL, ord) ... ])`, representing sets like `[a⍺a𝐚𝑎𝒂𝒶𝓪𝔞𝕒𝖆𝖺𝗮𝘢𝙖𝚊ɑα𝛂𝛼𝜶𝝰𝞪а]` |
136 |
|
137 | Because abstract syntax trees are, well, trees, this needs to be done |
138 | recursively to account for nested lexemes, such as those within matching |
139 | groups. |
140 |
|
141 | This is the solution I came up with: |
142 |
|
143 | ```python |
144 | from collections.abc import Iterable |
145 | import re |
146 | |
147 | from confusable_homoglyphs import confusables |
148 | |
149 | def patched_regex(regex_string: str) -> re.Pattern: |
150 | """ |
151 | Generate a regex pattern object after replacing literals with sets of |
152 | confusable homoglyphs |
153 | """ |
154 | |
155 | # Generate AST from base input string |
156 | ast_root = re.sre_parse.parse(regex_string) |
157 | |
158 | # Iterative function to patch AST |
159 | def modify(ast_local: Iterable): |
160 | |
161 | # Step through this level of the AST |
162 | for index, item in enumerate(ast_local): |
163 | |
164 | # Lexeme represented by item = (LEXEME_TYPE, VALUE) |
165 | if isinstance(item, tuple): |
166 | |
167 | if item[0] == re.sre_parse.LITERAL: |
168 | # LITERAL type found item = (sre_parse.LITERAL, ord) |
169 | |
170 | # Generate list of confusable homoglyphs... |
171 | groups = confusables.is_confusable(chr(item[1]), |
172 | greedy=True) |
173 | if not groups: |
174 | # Homoglyph is not confusable, doesn't need to be |
175 | # patched, move to next item in subpattern |
176 | continue |
177 | |
178 | confusable_literals = [item] # Begin the homoglyph set with |
179 | # the original character lexeme |
180 | for homoglyph in groups[0]["homoglyphs"]: |
181 | confusable_literals += [ # Append confusable homoglyph |
182 | # lexemes to set |
183 | (re.sre_parse.LITERAL, ord(char)) |
184 | for char in homoglyph["c"] |
185 | ] |
186 | |
187 | # Overwrite the original LITERAL lexeme in the AST with the |
188 | # new set of confusable homoglyphs |
189 | ast_local[index] = (re.sre_parse.IN, confusable_literals) |
190 | else: |
191 | # Not LITERAL; more possible lexemes nested in this one |
192 | # Convert to list, recurse, output back to tuple, then |
193 | # overwrite in AST |
194 | ast_local[index] = tuple(modify(list(item))) |
195 | |
196 | elif isinstance(item, re.sre_parse.SubPattern): |
197 | # More possible lexemes, recurse and overwrite in AST |
198 | ast_local[index] = modify(item) |
199 | |
200 | return ast_local |
201 | |
202 | # Patch generated base AST |
203 | ast_root = modify(ast_root) |
204 | |
205 | # Compile AST to case-insensitive re.Pattern and return |
206 | return re.sre_compile.compile(ast_root, re.IGNORECASE) |
207 | ``` |
208 |
|
209 | Testing the generated regular expression pattern with the example input string |
210 | from earlier, we can see it now works as expected. |
211 |
|
212 | ```python |
213 | >>> pattern = patched_regex("^asdf(.*)$") |
214 | |
215 | >>> pattern.match("Not a match") |
216 | None |
217 | |
218 | >>> pattern.match("Not a match even though asdf is in it because it doesn't follow the regex pattern") |
219 | None |
220 | |
221 | >>> pattern.match("asdf") |
222 | <re.Match object; span=(0, 4), match='asdf'> |
223 | |
224 | >>> pattern.match("asdf match") |
225 | <re.Match object; span=(0, 10), match='asdf match'> |
226 | |
227 | >>> pattern.match("𝜶ꮪ𝚍𝖿") # String containing confusable homoglyphs |
228 | <re.Match object; span=(0, 4), match='𝜶ꮪ𝚍𝖿'> |
229 | |
230 | >>> pattern2 = patch_regex("^asdf\\$(.*)$") # Also works when escaping special characters |
231 | |
232 | >>> pattern2.match("asdf match?") |
233 | None |
234 | |
235 | >>> pattern2.match("asdf$ match?") |
236 | <re.Match object; span=(0, 11), match='asdf$ match'> |
237 | ``` |
238 | <figcaption>This works likewise with re.Pattern.search() and other re.Pattern |
239 | functions.</figcaption> |
240 |
|
241 | I submitted a [pull request](https://github.com/strinking/futaba/pull/368) to |
242 | the bot which would make any filter string prefixed by `regex:` use a custom |
243 | regex compilation process similar to the one above. This allows the Discord bot |
244 | to employ arbitrary regular expressions as filter items, making use of |
245 | supported regex features such as lookarounds, while still preserving expansion |
246 | for non-ASCII confusable homoglyphs. |
247 |
|