1 ---
2 type: article
3 identifier: python-regex-homoglyphs
4 title: Patching Python's regex AST for confusable homoglyphs
5 description: Exploiting the re module's parsing and compiling methods to check for "confusable homoglyphs" and create a better automoderator.
6 datestring: 2023-03-18
7 banner_image: /static/images/futaba-filter.jpg
8 links:
9     futaba: https://github.com/strinking/futaba
10     confusable-homoglyphs: https://pypi.org/project/confusable-homoglyphs/
11     Python re module: https://docs.python.org/3/library/re.html
12     Scunthorpe problem: https://en.wikipedia.org/wiki/Scunthorpe_problem
13     Abstract Syntax Tree (AST): https://en.wikipedia.org/wiki/Abstract_syntax_tree
14 ---
15
16 A couple years ago I contributed to an open source Discord
17 [bot](https://github.com/strinking/futaba) designed by a public computer
18 programming community I’m part of. As with most popular Discord bots, it
19 incorporates its own filter to prohibit unwanted language.  However, to ensure
20 coverage of messages like `as𝕕f` (notice non-ASCII `𝕕`) in case it’s told to
21 filter `asdf` (an ASCII-only string), the bot makes use of the
22 [confusable_homoglyphs](https://pypi.org/project/confusable-homoglyphs/) Python
23 package to automatically expand an inputted filter string to cover these
24 non-ASCII edge cases.
25
26 Originally, the bot used Python's native string manipulation to convert the
27 inputted filter string into a Python [regular
28 expression](https://docs.python.org/3/library/re.html) string that matches for
29 each original character as well as each character's confusable homoglyphs
30 (similar-looking non-ASCII characters). For example, the inputted filter string
31 `asdf` was converted by the bot into the following regular expression:
32
33 ```bash
34 [a⍺ａ𝐚𝑎𝒂𝒶𝓪𝔞𝕒𝖆𝖺𝗮𝘢𝙖𝚊ɑα𝛂𝛼𝜶𝝰𝞪а][sｓ𝐬𝑠𝒔𝓈𝓼𝔰𝕤𝖘𝗌𝘀𝘴𝙨𝚜ꜱƽѕꮪ𑣁𐑈][dⅾⅆ𝐝𝑑𝒅𝒹𝓭𝔡𝕕𝖉𝖽𝗱𝘥𝙙𝚍ԁᏧᑯꓒ][f𝐟𝑓𝒇𝒻𝓯𝔣𝕗𝖋𝖿𝗳𝘧𝙛𝚏ꬵꞙſẝք]
35 ```
36
37 This regular expression uses character sets to match one instance each of the
38 original character or its confusable homoglyphs. In other words, it has the
39 ability to catch any recognizable rendition of a word using uncommon non-ASCII
40 characters. This was really nice, because it meant the bot could catch most
41 edge cases automatically. Unfortunately, however...
42
43 ## The problem
44
45 Using the method of regex generation described above precludes the use of
46 arbitrary regex patterns, which would help in looking at context to prevent
47 [false positives](https://en.wikipedia.org/wiki/Scunthorpe_problem). When
48 expanding filter strings, the bot could not differentiate between literal
49 characters and special regex tokens. So, instead of the valid regular
50 expression string `^asdf(.*)$` (which matches anything following "asdf", only
51 if "asdf" is at the beginning of the message) being converted into the
52 following, preserving special regex tokens as desired...
53
54 ```bash
55 ^[a⍺ａ𝐚𝑎𝒂𝒶𝓪𝔞𝕒𝖆𝖺𝗮𝘢𝙖𝚊ɑα𝛂𝛼𝜶𝝰𝞪а][sｓ𝐬𝑠𝒔𝓈𝓼𝔰𝕤𝖘𝗌𝘀𝘴𝙨𝚜ꜱƽѕꮪ𑣁𐑈][dⅾⅆ𝐝𝑑𝒅𝒹𝓭𝔡𝕕𝖉𝖽𝗱𝘥𝙙𝚍ԁᏧᑯꓒ][f𝐟𝑓𝒇𝒻𝓯𝔣𝕗𝖋𝖿𝗳𝘧𝙛𝚏ꬵꞙſẝք](.*)$
56 ```
57
58 ...it would convert into this, mangling the original intended regex string,
59 making it useless for arbitrary matching:
60
61 ```bash
62 [\^˄ˆ][a⍺ａ𝐚𝑎𝒂𝒶𝓪𝔞𝕒𝖆𝖺𝗮𝘢𝙖𝚊ɑα𝛂𝛼𝜶𝝰𝞪а][sｓ𝐬𝑠𝒔𝓈𝓼𝔰𝕤𝖘𝗌𝘀𝘴𝙨𝚜ꜱƽѕꮪ𑣁𐑈][dⅾⅆ𝐝𝑑𝒅𝒹𝓭𝔡𝕕𝖉𝖽𝗱𝘥𝙙𝚍ԁᏧᑯꓒ][f𝐟𝑓𝒇𝒻𝓯𝔣𝕗𝖋𝖿𝗳𝘧𝙛𝚏ꬵꞙſẝք][\(［❨❲〔﴾][\.𝅭․‎܁‎‎܂‎꘎‎𐩐‎‎٠‎۰ꓸ][\*⁎‎٭‎∗𐌟][\)］❩❳〕﴿]$
63 ```
64 <figcaption>Interestingly, the confusable-homoglyphs package doesn't list any
65 special characters that look similar to <code>$</code>.</figcaption>
66
67 ## The solution
68
69 To support expansion for confusables while preserving arbitrary regex, we need
70 to generate an **[abstract syntax
71 tree](https://en.wikipedia.org/wiki/Abstract_syntax_tree) (AST)** for the regular
72 expression and manipulate that somehow. For structured languages, an AST
73 represents the layout of meaningful tokens based on a predefined grammar, or
74 syntax.  For example, regex parsers have to correctly interpret `[` and `]` as
75 special characters defining a set of characters within—unless they're escaped,
76 like `\[` and `\]`, in which case they'll be taken as "literal" bracket
77 characters.
78
79 After generating an AST for the regular expression, we then must modify it to
80 replace predefined character literals (e.g. `a`) with sets of their confusable
81 homoglyphs (e.g. `[a⍺ａ𝐚𝑎𝒂𝒶𝓪𝔞𝕒𝖆𝖺𝗮𝘢𝙖𝚊ɑα𝛂𝛼𝜶𝝰𝞪а]`) before compiling the AST into a
82 usable pattern-matching object. While this process appears similar to the
83 string manipulation method described previously, parsing first for an AST
84 provides a mutable structure that allows us to distinguish character literals
85 from special regex tokens/grammar. Fortunately, Python’s `re` module already
86 contains (and exposes!) the internal tools for parsing and compiling—we just
87 need to modify the process.
88
89 **<i>The AST is generated from the inputted filter string, and the usable
90 pattern matching object is compiled from the AST. Since these steps are
91 separate in Python's `re` pipeline and since the AST is a mutable object, the
92 AST object can be separately modified before being compiled.</i>**
93
94 # Manipulating Python’s regex AST
95
96 This required a bit of reverse engineering on my part as I couldn't find any
97 adequate documentation on the internals of Python’s `re` module. After some
98 brief digging in the CPython source repository, I found **two submodules of
99 `re` that handle regex string parsing and compilation: `re.sre_parse` and
100 `re.sre_compile`**, respectively.
101
102 ## Reverse engineering
103
104 The `re.compile()` function uses the `sre_parse` and `sre_compile` submodules
105 by effectively doing the following to return a `re.Pattern` object:
106
107 ```python
108 ast = re.sre_parse.parse( input_regex_string ) # -> re.sre_parse.SubPattern
109 pattern_object = re.sre_compile.compile( ast ) # -> re.Pattern
110 return pattern_object
111 ```
112
113 Knowing this, we can experiment with the output of `sre_parse.parse()` function
114 to determine `re`'s AST structure and figure out how we need to modify it.
115
116 ```python
117 >>> import re
118
119 >>> re.sre_parse.parse("asdf")
120 [(LITERAL, 97), (LITERAL, 115), (LITERAL, 100), (LITERAL, 102)]
121
122 >>> type(re.sre_parse.parse("asdf"))
123 sre_parse.SubPattern
124
125 >>> re.sre_parse.parse("[asdf]")
126 [(IN, [(LITERAL, 97), (LITERAL, 115), (LITERAL, 100), (LITERAL, 102)])]
127
128 >>> re.sre_parse.parse("[asd-f]") # Ranges?
129 [(IN, [(LITERAL, 97), (LITERAL, 115), (RANGE, (100, 102))])]
130
131 >>> re.sre_parse.parse("(asdf)") # To see how `re` handles nested tokens
132 [(SUBPATTERN, (1, 0, 0, [(LITERAL, 97), (LITERAL, 115), (LITERAL, 100), (LITERAL, 102)]))]
133 ```
134
135 From this, we know the `sre_parse.parse()` function returns a `SubPattern`
136 list-like object containing *tuples of tokens in the format `(TOKEN_NAME,
137 TOKEN_VALUE)`.* Also, given the name `SubPattern`, it's likely this is nested
138 for other types of regex grammar (maybe for lookarounds or matching groups?).
139
140 ## Modifying the AST (implementing the solution)
141
142 For our case, we’re looking to replace tokens identified as `LITERAL`s:
143
144 > `(LITERAL, ord)`, representing literal characters like `a`
145
146 with character match sets—`IN` tokens—wrapping more `LITERAL` tokens:
147
148 > `(IN, [ (LITERAL, ord) ... ])`, representing sets like `[a⍺ａ𝐚𝑎𝒂𝒶𝓪𝔞𝕒𝖆𝖺𝗮𝘢𝙖𝚊ɑα𝛂𝛼𝜶𝝰𝞪а]`
149
150 We also have this `RANGE` token that needs to be handled:
151
152 > `(RANGE, (LOWER_ORD, UPPER_ORD))`, representing set ranges like `a-z` in `[a-z]`
153
154 Because abstract syntax trees are, well, trees, this needs to be done
155 recursively to account for nested tokens, such as those within matching groups.
156 What's important to note here is that regex does not allow nested character
157 sets. So, if the input string uses sets natively, and we want to expand the
158 characters in that set to cover confusable homoglyphs, we need to make sure we
159 aren't creating a new set within the original set. You can see how I
160 accomplished this below, among other things like handling ranges.
161
162 This is the solution I came up with:
163
164 ```python
165 from collections.abc import Iterable
166 import re
167
168 from confusable_homoglyphs import confusables
169
170 def patched_regex(regex_string: str) -> re.Pattern:
171     """
172     Generate a regex pattern object after replacing literals with sets of
173     confusable homoglyphs
174     """
175
176     # Generate AST from base input string
177     ast_root = re.sre_parse.parse(regex_string)
178
179     # Generate list of confusable homoglyph LITERAL tokens based on input
180     # character, including token for input character
181     def generate_confusables(confusable: chr) -> list:
182         groups = confusables.is_confusable(confusable, greedy=True)
183
184         # Begin the homoglyph set with the original character
185         confusable_literals = [(re.sre_parse.LITERAL, ord(confusable))]
186
187         if not groups:
188             # Nothing to be confused with the original character
189             return confusable_literals
190
191         # Append confusable homoglyph tokens to list
192         # Check confusable_homoglyphs documentation to verify this usage
193         for homoglyph in groups[0]["homoglyphs"]:
194             confusable_literals += [
195                 (re.sre_parse.LITERAL, ord(char))
196                 for char in homoglyph["c"]
197             ]
198         return confusable_literals
199
200     # Iterative function to patch AST
201     def modify(ast_local: Iterable):
202
203         # Step through this level of the AST
204         for index, item in enumerate(ast_local):
205
206             # Token represented by tuple
207             # (TOKEN_NAME, TOKEN_VALUE) is likely
208             if isinstance(item, tuple):
209
210                 token_name, token_value, *_ = item
211
212                 if token_name == re.sre_parse.IN:
213                     # IN type found, (IN, [ (LITERAL, ord) ... ])
214                     # Because you can't nest sets in regex, these need to be
215                     # handled separately, with confusables inserted in place
216
217                     # Generate confusables for every literal in charset
218                     confusables = []
219                     for set_token in token_value:
220                         if set_token[0] == re.sre_parse.RANGE:
221                             # If this is a RANGE, e.g. [a-z]
222                             # ( RANGE, (LOWER_ORD, UPPER_ORD) )
223                             lower_bound = set_token[1][0]
224                             upper_bound = set_token[1][1]
225
226                             # [lower_bound, upper_bound] loop, inclusive
227                             # Appends confusables for all characters in range
228                             for ord_value in range(lower_bound, upper_bound + 1):
229                                 confusables += generate_confusables(chr(ord_value))
230
231                         else:
232                             # Must be a LITERAL
233                             # Append confusables for this character to list
234                             confusables += generate_confusables(chr(set_token[1]))
235
236                     # Append confusables to character set
237                     token_value += confusables
238
239                 elif token_name == re.sre_parse.LITERAL:
240                     # LITERAL type found, (LITERAL, ord)
241                     # *NOT* in a set, replace with set
242
243                     # Generate list of confusable homoglyphs based on `ord`
244                     confusables = generate_confusables(chr(token_value))
245
246                     # Overwrite the original LITERAL token in the AST with a
247                     # set of confusable homoglyphs
248                     ast_local[index] = (re.sre_parse.IN, confusables)
249
250                 else:
251                     # Not LITERAL or IN; more possible tokens nested in this
252                     # one. Convert to list, recurse, output back to tuple, then
253                     # overwrite in AST
254                     ast_local[index] = tuple(modify(list(item)))
255
256             # If not a tuple/token
257             elif isinstance(item, re.sre_parse.SubPattern):
258                 # More possible tokens, recurse and overwrite in AST
259                 ast_local[index] = modify(item)
260
261         return ast_local
262
263     # Patch generated native AST
264     ast_root = modify(ast_root)
265
266     # Compile AST to case-insensitive re.Pattern and return
267     return re.sre_compile.compile(ast_root, re.IGNORECASE)
268 ```
269
270 Testing with some sample regular expressions, we can see it works as desired:
271
272 ```python
273 >>> patched_regex("asdf").match("asdf")
274 <re.Match object; span=(0, 4), match='asdf'>
275
276 >>> patched_regex("asdf").match("as𝕕f")
277 <re.Match object; span=(0, 4), match='as𝕕f'>
278
279 >>> patched_regex("as[d-f]").match("as𝕕")
280 <re.Match object; span=(0, 3), match='as𝕕'>
281
282 >>> patched_regex("as[d-f]*").match("as𝕗𝕗𝕗𝕗")
283 <re.Match object; span=(0, 6), match='as𝕗𝕗𝕗𝕗'>
284
285 >>> patched_regex("as[d-f]").match("as𝕗")
286 <re.Match object; span=(0, 3), match='as𝕗'>
287
288 >>> patched_regex("[asd-f]").findall("asxℯ")
289 ['a', 's', 'ℯ']
290
291 >>> patched_regex("[asd-f]").match("qwerty")
292 None # Looks weird, but native compile yields this too. `findall` and `search` work for this.
293 ```
294 <figcaption>This works likewise with re.Pattern.search() and other re.Pattern
295 functions, and includes all native regex features. The only real limitation
296 here is generating confusables, since <code>confusable-homoglyphs</code> doesn't seem to
297 account for accent characters, e.g. <code>é</code> -> <code>e</code>.</figcaption>
298
299 I submitted a pull request to the bot which would make any filter string
300 prefixed by `regex:` use a custom regex compilation process similar to the one
301 above. This allows the Discord bot to employ arbitrary regular expressions as
302 filter items, making use of supported regex features such as lookarounds, while
303 still preserving expansion for non-ASCII confusable homoglyphs.
304
1	---
2	type: article
3	identifier: python-regex-homoglyphs
4	title: Patching Python's regex AST for confusable homoglyphs
5	description: Exploiting the re module's parsing and compiling methods to check for "confusable homoglyphs" and create a better automoderator.
6	datestring: 2023-03-18
7	banner_image: /static/images/futaba-filter.jpg
8	links:
9	futaba: https://github.com/strinking/futaba
10	confusable-homoglyphs: https://pypi.org/project/confusable-homoglyphs/
11	Python re module: https://docs.python.org/3/library/re.html
12	Scunthorpe problem: https://en.wikipedia.org/wiki/Scunthorpe_problem
13	Abstract Syntax Tree (AST): https://en.wikipedia.org/wiki/Abstract_syntax_tree
14	---
15
16	A couple years ago I contributed to an open source Discord
17	[bot](https://github.com/strinking/futaba) designed by a public computer
18	programming community I’m part of. As with most popular Discord bots, it
19	incorporates its own filter to prohibit unwanted language. However, to ensure
20	coverage of messages like `as𝕕f` (notice non-ASCII `𝕕`) in case it’s told to
21	filter `asdf` (an ASCII-only string), the bot makes use of the
22	[confusable_homoglyphs](https://pypi.org/project/confusable-homoglyphs/) Python
23	package to automatically expand an inputted filter string to cover these
24	non-ASCII edge cases.
25
26	Originally, the bot used Python's native string manipulation to convert the
27	inputted filter string into a Python [regular
28	expression](https://docs.python.org/3/library/re.html) string that matches for
29	each original character as well as each character's confusable homoglyphs
30	(similar-looking non-ASCII characters). For example, the inputted filter string
31	`asdf` was converted by the bot into the following regular expression:
32
33	```bash
34	[a⍺ａ𝐚𝑎𝒂𝒶𝓪𝔞𝕒𝖆𝖺𝗮𝘢𝙖𝚊ɑα𝛂𝛼𝜶𝝰𝞪а][sｓ𝐬𝑠𝒔𝓈𝓼𝔰𝕤𝖘𝗌𝘀𝘴𝙨𝚜ꜱƽѕꮪ𑣁𐑈][dⅾⅆ𝐝𝑑𝒅𝒹𝓭𝔡𝕕𝖉𝖽𝗱𝘥𝙙𝚍ԁᏧᑯꓒ][f𝐟𝑓𝒇𝒻𝓯𝔣𝕗𝖋𝖿𝗳𝘧𝙛𝚏ꬵꞙſẝք]
35	```
36
37	This regular expression uses character sets to match one instance each of the
38	original character or its confusable homoglyphs. In other words, it has the
39	ability to catch any recognizable rendition of a word using uncommon non-ASCII
40	characters. This was really nice, because it meant the bot could catch most
41	edge cases automatically. Unfortunately, however...
42
43	## The problem
44
45	Using the method of regex generation described above precludes the use of
46	arbitrary regex patterns, which would help in looking at context to prevent
47	[false positives](https://en.wikipedia.org/wiki/Scunthorpe_problem). When
48	expanding filter strings, the bot could not differentiate between literal
49	characters and special regex tokens. So, instead of the valid regular
50	expression string `^asdf(.*)$` (which matches anything following "asdf", only
51	if "asdf" is at the beginning of the message) being converted into the
52	following, preserving special regex tokens as desired...
53
54	```bash
55	^[a⍺ａ𝐚𝑎𝒂𝒶𝓪𝔞𝕒𝖆𝖺𝗮𝘢𝙖𝚊ɑα𝛂𝛼𝜶𝝰𝞪а][sｓ𝐬𝑠𝒔𝓈𝓼𝔰𝕤𝖘𝗌𝘀𝘴𝙨𝚜ꜱƽѕꮪ𑣁𐑈][dⅾⅆ𝐝𝑑𝒅𝒹𝓭𝔡𝕕𝖉𝖽𝗱𝘥𝙙𝚍ԁᏧᑯꓒ][f𝐟𝑓𝒇𝒻𝓯𝔣𝕗𝖋𝖿𝗳𝘧𝙛𝚏ꬵꞙſẝք](.*)$
56	```
57
58	...it would convert into this, mangling the original intended regex string,
59	making it useless for arbitrary matching:
60
61	```bash
62	[\^˄ˆ][a⍺ａ𝐚𝑎𝒂𝒶𝓪𝔞𝕒𝖆𝖺𝗮𝘢𝙖𝚊ɑα𝛂𝛼𝜶𝝰𝞪а][sｓ𝐬𝑠𝒔𝓈𝓼𝔰𝕤𝖘𝗌𝘀𝘴𝙨𝚜ꜱƽѕꮪ𑣁𐑈][dⅾⅆ𝐝𝑑𝒅𝒹𝓭𝔡𝕕𝖉𝖽𝗱𝘥𝙙𝚍ԁᏧᑯꓒ][f𝐟𝑓𝒇𝒻𝓯𝔣𝕗𝖋𝖿𝗳𝘧𝙛𝚏ꬵꞙſẝք][\(［❨❲〔﴾][\.𝅭․‎܁‎‎܂‎꘎‎𐩐‎‎٠‎۰ꓸ][\*⁎‎٭‎∗𐌟][\)］❩❳〕﴿]$
63	```
64	<figcaption>Interestingly, the confusable-homoglyphs package doesn't list any
65	special characters that look similar to <code>$</code>.</figcaption>
66
67	## The solution
68
69	To support expansion for confusables while preserving arbitrary regex, we need
70	to generate an **[abstract syntax
71	tree](https://en.wikipedia.org/wiki/Abstract_syntax_tree) (AST)** for the regular
72	expression and manipulate that somehow. For structured languages, an AST
73	represents the layout of meaningful tokens based on a predefined grammar, or
74	syntax. For example, regex parsers have to correctly interpret `[` and `]` as
75	special characters defining a set of characters within—unless they're escaped,
76	like `\[` and `\]`, in which case they'll be taken as "literal" bracket
77	characters.
78
79	After generating an AST for the regular expression, we then must modify it to
80	replace predefined character literals (e.g. `a`) with sets of their confusable
81	homoglyphs (e.g. `[a⍺ａ𝐚𝑎𝒂𝒶𝓪𝔞𝕒𝖆𝖺𝗮𝘢𝙖𝚊ɑα𝛂𝛼𝜶𝝰𝞪а]`) before compiling the AST into a
82	usable pattern-matching object. While this process appears similar to the
83	string manipulation method described previously, parsing first for an AST
84	provides a mutable structure that allows us to distinguish character literals
85	from special regex tokens/grammar. Fortunately, Python’s `re` module already
86	contains (and exposes!) the internal tools for parsing and compiling—we just
87	need to modify the process.
88
89	**<i>The AST is generated from the inputted filter string, and the usable
90	pattern matching object is compiled from the AST. Since these steps are
91	separate in Python's `re` pipeline and since the AST is a mutable object, the
92	AST object can be separately modified before being compiled.</i>**
93
94	# Manipulating Python’s regex AST
95
96	This required a bit of reverse engineering on my part as I couldn't find any
97	adequate documentation on the internals of Python’s `re` module. After some
98	brief digging in the CPython source repository, I found **two submodules of
99	`re` that handle regex string parsing and compilation: `re.sre_parse` and
100	`re.sre_compile`**, respectively.
101
102	## Reverse engineering
103
104	The `re.compile()` function uses the `sre_parse` and `sre_compile` submodules
105	by effectively doing the following to return a `re.Pattern` object:
106
107	```python
108	ast = re.sre_parse.parse( input_regex_string ) # -> re.sre_parse.SubPattern
109	pattern_object = re.sre_compile.compile( ast ) # -> re.Pattern
110	return pattern_object
111	```
112
113	Knowing this, we can experiment with the output of `sre_parse.parse()` function
114	to determine `re`'s AST structure and figure out how we need to modify it.
115
116	```python
117	>>> import re
118
119	>>> re.sre_parse.parse("asdf")
120	[(LITERAL, 97), (LITERAL, 115), (LITERAL, 100), (LITERAL, 102)]
121
122	>>> type(re.sre_parse.parse("asdf"))
123	sre_parse.SubPattern
124
125	>>> re.sre_parse.parse("[asdf]")
126	[(IN, [(LITERAL, 97), (LITERAL, 115), (LITERAL, 100), (LITERAL, 102)])]
127
128	>>> re.sre_parse.parse("[asd-f]") # Ranges?
129	[(IN, [(LITERAL, 97), (LITERAL, 115), (RANGE, (100, 102))])]
130
131	>>> re.sre_parse.parse("(asdf)") # To see how `re` handles nested tokens
132	[(SUBPATTERN, (1, 0, 0, [(LITERAL, 97), (LITERAL, 115), (LITERAL, 100), (LITERAL, 102)]))]
133	```
134
135	From this, we know the `sre_parse.parse()` function returns a `SubPattern`
136	list-like object containing *tuples of tokens in the format `(TOKEN_NAME,
137	TOKEN_VALUE)`.* Also, given the name `SubPattern`, it's likely this is nested
138	for other types of regex grammar (maybe for lookarounds or matching groups?).
139
140	## Modifying the AST (implementing the solution)
141
142	For our case, we’re looking to replace tokens identified as `LITERAL`s:
143
144	> `(LITERAL, ord)`, representing literal characters like `a`
145
146	with character match sets—`IN` tokens—wrapping more `LITERAL` tokens:
147
148	> `(IN, [ (LITERAL, ord) ... ])`, representing sets like `[a⍺ａ𝐚𝑎𝒂𝒶𝓪𝔞𝕒𝖆𝖺𝗮𝘢𝙖𝚊ɑα𝛂𝛼𝜶𝝰𝞪а]`
149
150	We also have this `RANGE` token that needs to be handled:
151
152	> `(RANGE, (LOWER_ORD, UPPER_ORD))`, representing set ranges like `a-z` in `[a-z]`
153
154	Because abstract syntax trees are, well, trees, this needs to be done
155	recursively to account for nested tokens, such as those within matching groups.
156	What's important to note here is that regex does not allow nested character
157	sets. So, if the input string uses sets natively, and we want to expand the
158	characters in that set to cover confusable homoglyphs, we need to make sure we
159	aren't creating a new set within the original set. You can see how I
160	accomplished this below, among other things like handling ranges.
161
162	This is the solution I came up with:
163
164	```python
165	from collections.abc import Iterable
166	import re
167
168	from confusable_homoglyphs import confusables
169
170	def patched_regex(regex_string: str) -> re.Pattern:
171	"""
172	Generate a regex pattern object after replacing literals with sets of
173	confusable homoglyphs
174	"""
175
176	# Generate AST from base input string
177	ast_root = re.sre_parse.parse(regex_string)
178
179	# Generate list of confusable homoglyph LITERAL tokens based on input
180	# character, including token for input character
181	def generate_confusables(confusable: chr) -> list:
182	groups = confusables.is_confusable(confusable, greedy=True)
183
184	# Begin the homoglyph set with the original character
185	confusable_literals = [(re.sre_parse.LITERAL, ord(confusable))]
186
187	if not groups:
188	# Nothing to be confused with the original character
189	return confusable_literals
190
191	# Append confusable homoglyph tokens to list
192	# Check confusable_homoglyphs documentation to verify this usage
193	for homoglyph in groups[0]["homoglyphs"]:
194	confusable_literals += [
195	(re.sre_parse.LITERAL, ord(char))
196	for char in homoglyph["c"]
197	]
198	return confusable_literals
199
200	# Iterative function to patch AST
201	def modify(ast_local: Iterable):
202
203	# Step through this level of the AST
204	for index, item in enumerate(ast_local):
205
206	# Token represented by tuple
207	# (TOKEN_NAME, TOKEN_VALUE) is likely
208	if isinstance(item, tuple):
209
210	token_name, token_value, *_ = item
211
212	if token_name == re.sre_parse.IN:
213	# IN type found, (IN, [ (LITERAL, ord) ... ])
214	# Because you can't nest sets in regex, these need to be
215	# handled separately, with confusables inserted in place
216
217	# Generate confusables for every literal in charset
218	confusables = []
219	for set_token in token_value:
220	if set_token[0] == re.sre_parse.RANGE:
221	# If this is a RANGE, e.g. [a-z]
222	# ( RANGE, (LOWER_ORD, UPPER_ORD) )
223	lower_bound = set_token[1][0]
224	upper_bound = set_token[1][1]
225
226	# [lower_bound, upper_bound] loop, inclusive
227	# Appends confusables for all characters in range
228	for ord_value in range(lower_bound, upper_bound + 1):
229	confusables += generate_confusables(chr(ord_value))
230
231	else:
232	# Must be a LITERAL
233	# Append confusables for this character to list
234	confusables += generate_confusables(chr(set_token[1]))
235
236	# Append confusables to character set
237	token_value += confusables
238
239	elif token_name == re.sre_parse.LITERAL:
240	# LITERAL type found, (LITERAL, ord)
241	# NOT in a set, replace with set
242
243	# Generate list of confusable homoglyphs based on `ord`
244	confusables = generate_confusables(chr(token_value))
245
246	# Overwrite the original LITERAL token in the AST with a
247	# set of confusable homoglyphs
248	ast_local[index] = (re.sre_parse.IN, confusables)
249
250	else:
251	# Not LITERAL or IN; more possible tokens nested in this
252	# one. Convert to list, recurse, output back to tuple, then
253	# overwrite in AST
254	ast_local[index] = tuple(modify(list(item)))
255
256	# If not a tuple/token
257	elif isinstance(item, re.sre_parse.SubPattern):
258	# More possible tokens, recurse and overwrite in AST
259	ast_local[index] = modify(item)
260
261	return ast_local
262
263	# Patch generated native AST
264	ast_root = modify(ast_root)
265
266	# Compile AST to case-insensitive re.Pattern and return
267	return re.sre_compile.compile(ast_root, re.IGNORECASE)
268	```
269
270	Testing with some sample regular expressions, we can see it works as desired:
271
272	```python
273	>>> patched_regex("asdf").match("asdf")
274	<re.Match object; span=(0, 4), match='asdf'>
275
276	>>> patched_regex("asdf").match("as𝕕f")
277	<re.Match object; span=(0, 4), match='as𝕕f'>
278
279	>>> patched_regex("as[d-f]").match("as𝕕")
280	<re.Match object; span=(0, 3), match='as𝕕'>
281
282	>>> patched_regex("as[d-f]*").match("as𝕗𝕗𝕗𝕗")
283	<re.Match object; span=(0, 6), match='as𝕗𝕗𝕗𝕗'>
284
285	>>> patched_regex("as[d-f]").match("as𝕗")
286	<re.Match object; span=(0, 3), match='as𝕗'>
287
288	>>> patched_regex("[asd-f]").findall("asxℯ")
289	['a', 's', 'ℯ']
290
291	>>> patched_regex("[asd-f]").match("qwerty")
292	None # Looks weird, but native compile yields this too. `findall` and `search` work for this.
293	```
294	<figcaption>This works likewise with re.Pattern.search() and other re.Pattern
295	functions, and includes all native regex features. The only real limitation
296	here is generating confusables, since <code>confusable-homoglyphs</code> doesn't seem to
297	account for accent characters, e.g. <code>é</code> -> <code>e</code>.</figcaption>
298
299	I submitted a pull request to the bot which would make any filter string
300	prefixed by `regex:` use a custom regex compilation process similar to the one
301	above. This allows the Discord bot to employ arbitrary regular expressions as
302	filter items, making use of supported regex features such as lookarounds, while
303	still preserving expansion for non-ASCII confusable homoglyphs.
304