Index

joshstock.in / 35436cd

Source for serving and static templating/compiling of https://joshstock.in.

Latest Commit

{#}TimeHashSubjectAuthor#(+)(-)GPG?
15814 Jan 2023 23:375aaa26aWebsite rewriteJosh Stockin12460G

Blob @ joshstock.in / site / content / blog / python-regex-homoglyphs.md

text/plain11050 bytesdownload raw
1---
2type: none
3identifier: python-regex-homoglyphs
4title: Patching Python's regex AST for confusable homoglyphs
5description: Exploiting the re module's parsing and compiling methods to create a better automoderator.
6datestring: 2022-11-06
7banner_image: /static/images/futaba-filter.jpg
8links:
9 futaba: https://github.com/strinking/futaba
10 confusable_homoglyphs: https://pypi.org/project/confusable_homoglyphs/
11 Python re module: https://docs.python.org/3/library/re.html
12 Scunthorpe problem: https://www.wikiwand.com/en/Scunthorpe_problem
13 Abstract Syntax Trees: https://www.wikiwand.com/en/Abstract_syntax_tree
14---
15
16A couple years ago I contributed to an open source Discord
17[bot](https://github.com/strinking/futaba) designed by a public computer
18programming community I’m part of. As with most popular Discord bots, it
19incorporates its own filter to prohibit unwanted language. However, to ensure
20coverage of messages like `as𝕕f` (notice non-ASCII `𝕕`) in case it’s told to
21filter `asdf` (an ASCII-only string), the bot makes use of the
22[confusable_homoglyphs](https://pypi.org/project/confusable_homoglyphs/) Python
23package to automatically expand an inputted filter string to support these
24non-ASCII edge cases.
25
26Originally, the bot used base Python string manipulation to convert the
27inputted filter string into a Python [regular
28expression](https://docs.python.org/3/library/re.html) string that matches for
29each original character as well as each character's confusable homoglyphs
30(similar-looking non-ASCII characters). For example, the inputted filter string
31`asdf` was converted by the bot into the following regular expression:
32
33```bash
34[a⍺a𝐚𝑎𝒂𝒶𝓪𝔞𝕒𝖆𝖺𝗮𝘢𝙖𝚊ɑα𝛂𝛼𝜶𝝰𝞪а][ss𝐬𝑠𝒔𝓈𝓼𝔰𝕤𝖘𝗌𝘀𝘴𝙨𝚜ꜱƽѕꮪ𑣁𐑈][dⅾⅆ𝐝𝑑𝒅𝒹𝓭𝔡𝕕𝖉𝖽𝗱𝘥𝙙𝚍ԁᏧᑯꓒ][f𝐟𝑓𝒇𝒻𝓯𝔣𝕗𝖋𝖿𝗳𝘧𝙛𝚏ꬵꞙſẝք]
35```
36
37This regular expression uses character sets to match one instance each of the
38original character or its confusable homoglyphs. In other words, it has the
39ability to catch any recognizable rendition of a word using uncommon non-ASCII
40characters. This was really nice, because it meant the bot could catch most
41edge cases automatically. Unfortunately, however...
42
43## The problem
44
45Using the method of regex generation described above precludes the use of
46arbitrary regex patterns, which would help in looking at context to prevent
47[false positives](https://www.wikiwand.com/en/Scunthorpe_problem). When
48expanding filter strings, the bot could not differentiate between literal
49characters and special regex tokens. So, instead of the valid regular
50expression string `^asdf(.*)$` converting into this, preserving special regex
51tokens as desired:
52
53```bash
54^[a⍺a𝐚𝑎𝒂𝒶𝓪𝔞𝕒𝖆𝖺𝗮𝘢𝙖𝚊ɑα𝛂𝛼𝜶𝝰𝞪а][ss𝐬𝑠𝒔𝓈𝓼𝔰𝕤𝖘𝗌𝘀𝘴𝙨𝚜ꜱƽѕꮪ𑣁𐑈][dⅾⅆ𝐝𝑑𝒅𝒹𝓭𝔡𝕕𝖉𝖽𝗱𝘥𝙙𝚍ԁᏧᑯꓒ][f𝐟𝑓𝒇𝒻𝓯𝔣𝕗𝖋𝖿𝗳𝘧𝙛𝚏ꬵꞙſẝք](.*)$
55```
56
57It would convert into this, mangling the original intended regex string, making
58it useless for arbitrary matching:
59
60```bash
61[\^˄ˆ][a⍺a𝐚𝑎𝒂𝒶𝓪𝔞𝕒𝖆𝖺𝗮𝘢𝙖𝚊ɑα𝛂𝛼𝜶𝝰𝞪а][ss𝐬𝑠𝒔𝓈𝓼𝔰𝕤𝖘𝗌𝘀𝘴𝙨𝚜ꜱƽѕꮪ𑣁𐑈][dⅾⅆ𝐝𝑑𝒅𝒹𝓭𝔡𝕕𝖉𝖽𝗱𝘥𝙙𝚍ԁᏧᑯꓒ][f𝐟𝑓𝒇𝒻𝓯𝔣𝕗𝖋𝖿𝗳𝘧𝙛𝚏ꬵꞙſẝք][\([❨❲〔﴾][\.𝅭․‎܁‎‎܂‎꘎‎𐩐‎‎٠‎۰ꓸ][\*⁎‎٭‎∗𐌟][\)]❩❳〕﴿]$
62```
63<figcaption>Interestingly, confusable_homoglyphs doesn't list any special
64characters that look similar to $.</figcaption>
65
66## The solution
67
68To support expansion for confusables while preserving arbitrary regex, an
69**[abstract syntax tree](https://www.wikiwand.com/en/Abstract_syntax_tree)
70(AST)** for the regular expression would need to be generated from the filter
71string, then modified to replace predefined character literals (e.g. `a`) with
72sets of their confusable homoglyphs (e.g. `[a⍺a𝐚𝑎𝒂𝒶𝓪𝔞𝕒𝖆𝖺𝗮𝘢𝙖𝚊ɑα𝛂𝛼𝜶𝝰𝞪а]`) before
73being compiled into a usable pattern object. While this process appears similar
74to the string manipulation method described previously, parsing first for an
75AST provides a mutable structure that allows us to distinguish character
76literals from special regex lexemes/grammar. Fortunately, Python’s `re` module
77already contains (and exposes!!) the internal tools for parsing and
78compiling—we just need to modify the process.
79
80*The AST is generated from the inputted filter string, and the usable pattern
81matching object is compiled from the AST. Since these steps are separate in
82Python's `re` pipeline and since the AST is a mutable object, the AST object
83can be separately modified before being compiled.*
84
85# Manipulating Python’s regex AST
86
87This required a bit of reverse engineering on my part as I couldn't find any
88adequate documentation on the internals of Python’s `re` module. After some
89brief digging in the CPython source repository, I found **two submodules of
90`re` that handle regex string parsing and compilation: `re.sre_parse` and
91`re.sre_compile`**, respectively.
92
93## Reverse engineering
94
95The `re.compile()` function uses the `sre_parse` and `sre_compile` submodules
96by (effectively) doing the following to return a `re.Pattern` object:
97
98```python
99ast = re.sre_parse.parse( input_regex_string ) # -> re.sre_parse.SubPattern
100pattern_object = re.sre_compile.compile( ast ) # -> re.Pattern
101return pattern_object
102```
103
104Knowing this, we can experiment with the output of `sre_parse.parse()` function
105to determine `re`'s AST structure and figure out how we need to modify it.
106
107```python
108>>> import re
109
110>>> re.sre_parse.parse("asdf")
111[(LITERAL, 97), (LITERAL, 115), (LITERAL, 100), (LITERAL, 102)]
112
113>>> type(re.sre_parse.parse("asdf"))
114sre_parse.SubPattern
115
116>>> re.sre_parse.parse("[asdf]")
117[(IN, [(LITERAL, 97), (LITERAL, 115), (LITERAL, 100), (LITERAL, 102)])]
118
119>>> re.sre_parse.parse("(asdf)") # To see how `re` handles nested lexemes
120[(SUBPATTERN, (1, 0, 0, [(LITERAL, 97), (LITERAL, 115), (LITERAL, 100), (LITERAL, 102)]))]
121```
122
123From this, we know the `sre_parse.parse()` function returns a `SubPattern`
124list-like object containing *tuples of lexemes in the format `(LEXEME_NAME,
125value)`.*
126
127## Modifying the AST (implementing the solution)
128
129For our case, we’re looking to replace lexemes identified as `LITERAL`s:
130
131> `(LITERAL, ord)`, representing literal characters like `a`
132
133with character match sets—`IN` lexemes—wrapping more `LITERAL` lexemes:
134
135> `(IN, [ (LITERAL, ord) ... ])`, representing sets like `[a⍺a𝐚𝑎𝒂𝒶𝓪𝔞𝕒𝖆𝖺𝗮𝘢𝙖𝚊ɑα𝛂𝛼𝜶𝝰𝞪а]`
136
137Because abstract syntax trees are, well, trees, this needs to be done
138recursively to account for nested lexemes, such as those within matching
139groups.
140
141This is the solution I came up with:
142
143```python
144from collections.abc import Iterable
145import re
146
147from confusable_homoglyphs import confusables
148
149def patched_regex(regex_string: str) -> re.Pattern:
150 """
151 Generate a regex pattern object after replacing literals with sets of
152 confusable homoglyphs
153 """
154
155 # Generate AST from base input string
156 ast_root = re.sre_parse.parse(regex_string)
157
158 # Iterative function to patch AST
159 def modify(ast_local: Iterable):
160
161 # Step through this level of the AST
162 for index, item in enumerate(ast_local):
163
164 # Lexeme represented by item = (LEXEME_TYPE, VALUE)
165 if isinstance(item, tuple):
166
167 if item[0] == re.sre_parse.LITERAL:
168 # LITERAL type found item = (sre_parse.LITERAL, ord)
169
170 # Generate list of confusable homoglyphs...
171 groups = confusables.is_confusable(chr(item[1]),
172 greedy=True)
173 if not groups:
174 # Homoglyph is not confusable, doesn't need to be
175 # patched, move to next item in subpattern
176 continue
177
178 confusable_literals = [item] # Begin the homoglyph set with
179 # the original character lexeme
180 for homoglyph in groups[0]["homoglyphs"]:
181 confusable_literals += [ # Append confusable homoglyph
182 # lexemes to set
183 (re.sre_parse.LITERAL, ord(char))
184 for char in homoglyph["c"]
185 ]
186
187 # Overwrite the original LITERAL lexeme in the AST with the
188 # new set of confusable homoglyphs
189 ast_local[index] = (re.sre_parse.IN, confusable_literals)
190 else:
191 # Not LITERAL; more possible lexemes nested in this one
192 # Convert to list, recurse, output back to tuple, then
193 # overwrite in AST
194 ast_local[index] = tuple(modify(list(item)))
195
196 elif isinstance(item, re.sre_parse.SubPattern):
197 # More possible lexemes, recurse and overwrite in AST
198 ast_local[index] = modify(item)
199
200 return ast_local
201
202 # Patch generated base AST
203 ast_root = modify(ast_root)
204
205 # Compile AST to case-insensitive re.Pattern and return
206 return re.sre_compile.compile(ast_root, re.IGNORECASE)
207```
208
209Testing the generated regular expression pattern with the example input string
210from earlier, we can see it now works as expected.
211
212```python
213>>> pattern = patched_regex("^asdf(.*)$")
214
215>>> pattern.match("Not a match")
216None
217
218>>> pattern.match("Not a match even though asdf is in it because it doesn't follow the regex pattern")
219None
220
221>>> pattern.match("asdf")
222<re.Match object; span=(0, 4), match='asdf'>
223
224>>> pattern.match("asdf match")
225<re.Match object; span=(0, 10), match='asdf match'>
226
227>>> pattern.match("𝜶ꮪ𝚍𝖿") # String containing confusable homoglyphs
228<re.Match object; span=(0, 4), match='𝜶ꮪ𝚍𝖿'>
229
230>>> pattern2 = patch_regex("^asdf\\$(.*)$") # Also works when escaping special characters
231
232>>> pattern2.match("asdf match?")
233None
234
235>>> pattern2.match("asdf$ match?")
236<re.Match object; span=(0, 11), match='asdf$ match'>
237```
238<figcaption>This works likewise with re.Pattern.search() and other re.Pattern
239functions.</figcaption>
240
241I submitted a [pull request](https://github.com/strinking/futaba/pull/368) to
242the bot which would make any filter string prefixed by `regex:` use a custom
243regex compilation process similar to the one above. This allows the Discord bot
244to employ arbitrary regular expressions as filter items, making use of
245supported regex features such as lookarounds, while still preserving expansion
246for non-ASCII confusable homoglyphs.
247