Recently, I encountered an interesting problem which involved validating if a string
is a valid Hebrew name. The basic requirement was that it must contain all Hebrew characters.
Immediately, I knew regex was the way to go! I started thinking of what unicode character
range I’d need to check for. However, after some digging, I realised that most popular regex
engines
support unicode scripts. I could use a unicode
grouping (like {InHebrew}) which nicely curates the unicode characters for non-Latin character based languages. Other examples of this
are {InArabic} and {InGreek} which work for Arabic and Greek characters. My first solution to
this problem was to use this pattern. Nice and simple!
^\\p{InHebrew}+\$".toRegex().matches(name)
About a month after deploying this, I encountered a bug. Someone with a valid Hebrew name couldn’t
use this flow. I dug deeper only to realise that I didn’t account for ‘special’ characters. After
some conversations with a number of Hebrew speakers I know, I discovered that Hebrew names could
contain an apostrophe. Not only that, the apostrophe could only be either in the middle or the end of
the string, not both.
I went ahead to modify the regex and ended up with this.