Recently, I encountered an interesting problem: checking whether a string is a valid Hebrew name. The basic requirement was that it must contain only Hebrew characters.
Immediately, I knew regex was the way to go! I started thinking about which Unicode character range I’d need to check for. However, after some digging, I realised that most popular regex engines support Unicode property classes. I could use a named Unicode block (like \p{InHebrew}), which covers the range of code points assigned to Hebrew, so I wouldn’t have to spell out the range by hand. Other examples of this are \p{InArabic} and \p{InGreek}, which work for Arabic and Greek characters. My first solution to this problem was to use this pattern. Nice and simple!
"^\\p{InHebrew}+\$".toRegex().matches(name)
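To make the one-liner above easier to reuse and test, it can be wrapped in a small helper. This is only a sketch: it assumes Kotlin on the JVM (where \p{InHebrew} names the Hebrew Unicode block, U+0590 to U+05FF), and the names hebrewOnly and isHebrewOnly are made up for illustration.

```kotlin
// Assumption: Kotlin on the JVM, where \p{InHebrew} is the Hebrew Unicode block.
// `hebrewOnly` and `isHebrewOnly` are illustrative names, not from the original code.
val hebrewOnly = Regex("^\\p{InHebrew}+\$")

// True when the name is non-empty and every character is in the Hebrew block.
fun isHebrewOnly(name: String): Boolean = hebrewOnly.matches(name)
```

With this helper, isHebrewOnly("\u05D3\u05D5\u05D3") (the name דוד) is accepted, while "David", the empty string, and any name containing an ASCII apostrophe are rejected.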
About a month after deploying this, I encountered a bug. Someone with a valid Hebrew name couldn’t use this flow. I dug deeper, only to realise that I hadn’t accounted for ‘special’ characters. After some conversations with a number of Hebrew speakers I know, I discovered that Hebrew names can contain an apostrophe, which falls outside the Hebrew block and so was rejected by my pattern. Not only that, the apostrophe can appear either in the middle of the name or at the end, but not in both places.
I went ahead to modify the regex and ended up with this.
"^(?:\\p{InHebrew}+'?\\p{InHebrew}+|\\p{InHebrew}+'?)\$".toRegex().matches(name)
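Here is a rough sketch of the apostrophe rule as a helper, along with the cases it is meant to accept and reject. It assumes Kotlin on the JVM and a plain ASCII apostrophe; the names hebrew, hebrewName and isHebrewName are hypothetical. The alternation is wrapped in a non-capturing group so that ^ and $ apply to both branches, which keeps the pattern safe even if it is later used with something other than matches.

```kotlin
// Assumption: Kotlin on the JVM; the apostrophe is the ASCII character '.
// `hebrew`, `hebrewName` and `isHebrewName` are illustrative names.
val hebrew = "\\p{InHebrew}"

// Either: letters, an optional apostrophe, then more letters (apostrophe in the middle),
// or:     letters followed by an optional apostrophe (apostrophe at the end).
val hebrewName = Regex("^(?:$hebrew+'?$hebrew+|$hebrew+'?)\$")

// True for Hebrew-only names with at most one apostrophe, never at the start.
fun isHebrewName(name: String): Boolean = hebrewName.matches(name)
```

Under these assumptions, a name with no apostrophe, one apostrophe between letters, or one trailing apostrophe passes. A leading apostrophe, two apostrophes in a row, or an apostrophe in both the middle and the end all fail.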
…and that, my friends, brings us to the end of my occasional but always fun ride to regex-land.