I often find myself wanting to do some simple language detection. For Chinese, Japanese, and Korean, this can easily be done by looking at the types of characters in the text. The simplest and most robust way to do this is to use Unicode block names: it is easy to write a regular expression that tests whether a character belongs to a certain block.
For all the different possible blocks, see here:
Unicode block names for use in XSD regular expressions
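Here are some very simple tests for detecting katakana, hiragana and kanji. (Strictly speaking, in Perl the short forms \p{Katakana}, \p{Hiragana} and \p{Han} match by Unicode script rather than by block, which is what you want here anyway.)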
robert_felty$ echo "ア" | perl -CIO -nle 'if (/\p{Katakana}/) { print "this contains katakana";}'
this contains katakana
robert_felty$ echo "あ" | perl -CIO -nle 'if (/\p{Hiragana}/) { print "this contains hiragana";}'
this contains hiragana
robert_felty$ echo "安" | perl -CIO -nle 'if (/\p{Han}/) { print "this contains kanji";}'
this contains kanji
This style of character class is supported by the regex engines of many languages, including Java and Perl. Note that it is not supported by Python's default “re” module. There is an alternative module called “regex”, which does support it:
regex 2014.02.19 : Python Package Index
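If installing the “regex” module is not an option, a rough approximation is possible with Python's standard library alone. The sketch below is a stand-in technique, not the \p{...} property test itself: it looks at the prefix of each character's official Unicode name via unicodedata.name. The helper names script_of and scripts_in are made up for this example, and the prefix check is an assumption that will misclassify some edge cases (for instance, halfwidth katakana names start with "HALFWIDTH").

```python
import unicodedata

def script_of(ch):
    """Return a coarse script label for one character, or None.

    Hypothetical helper: it inspects the Unicode character name,
    which only approximates the real Unicode script property.
    """
    name = unicodedata.name(ch, "")  # "" for unnamed characters
    if name.startswith("KATAKANA"):
        return "katakana"
    if name.startswith("HIRAGANA"):
        return "hiragana"
    if name.startswith("CJK UNIFIED IDEOGRAPH"):
        return "han"
    if name.startswith("HANGUL"):
        return "hangul"
    return None

def scripts_in(text):
    """Collect the set of script labels found in a string."""
    return {s for s in map(script_of, text) if s is not None}

if __name__ == "__main__":
    for sample in ["ア", "あ", "安", "한"]:
        print(sample, sorted(scripts_in(sample)))
```

For anything beyond a quick check, a real property test such as \p{Han} is still the better choice, since it follows the Unicode data rather than a naming convention.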
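One final thought: don’t try to use hard-coded code point ranges, like: [\x{4E00}-\x{9FBF}]. This is prone to error. The CJK Unified Ideographs block actually runs up to U+9FFF, and Han characters also live in other blocks, such as CJK Unified Ideographs Extension A (U+3400 to U+4DBF), so a range like this silently misses characters.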
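To make the problem with hard-coded ranges concrete, here is a small Python sketch using only the standard library. It compares the range quoted above against a name-based check; is_ideograph is a hypothetical helper that stands in for a proper Han property test, and the three sample characters were picked for illustration.

```python
import re
import unicodedata

# The hand-written code point range quoted in the text.
HAN_RANGE = re.compile("[\u4E00-\u9FBF]")

def is_ideograph(ch):
    # Rough stand-in for a Han script test, based on the character name.
    return unicodedata.name(ch, "").startswith("CJK UNIFIED IDEOGRAPH")

# One character inside the range, one from Extension A,
# and one from Extension B (outside the Basic Multilingual Plane).
samples = ["\u5B89", "\u3400", "\U00020000"]

# Ideographs that the hard-coded range fails to match.
missed = [ch for ch in samples if is_ideograph(ch) and not HAN_RANGE.search(ch)]

for ch in missed:
    print(f"U+{ord(ch):04X} is an ideograph but falls outside the range")
```

Both extension characters end up in missed, even though each is a perfectly ordinary Han ideograph as far as Unicode is concerned.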