Checking for CJK Unified Ideographs (U+4E00–U+9FBF) in Unicode (UTF-8)

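Another engineering-oriented post. In UTF-8 a Chinese character is encoded in 3 bytes, at least for the CJK Unified Ideographs. But how do you tell whether a string of length 3 is a Chinese character?
Answer 1: check the code point against U+4E00–U+9FBF (in current Unicode the block itself extends to U+9FFF)
Answer 2: visit www.unicode.org
Answer 3: refer to this article
Answer 4: the code below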
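/* Returns 1 if str starts with a 3-byte UTF-8 sequence that encodes
   a code point in U+4E00..U+9FBF, otherwise 0. */
int is_utf8_zh_basic(const char * str)
{
    /* read the bytes as unsigned so the result does not depend on whether char is signed */
    const unsigned char * s = (const unsigned char *)str;

    /* basic check that the bytes are 1110xxxx 10xxxxxx 10xxxxxx;
       a NUL byte fails its check, so a string shorter than 3 bytes is rejected without strlen */
    if ((s[0] & 0xF0) != 0xE0) return 0;
    if ((s[1] & 0xC0) != 0x80) return 0;
    if ((s[2] & 0xC0) != 0x80) return 0;

    /* reassemble the code point from the 4 + 6 + 6 payload bits */
    int code = ((s[0] & 0x0F) << 12) | ((s[1] & 0x3F) << 6) | (s[2] & 0x3F);
    return (code >= 0x4E00) && (code <= 0x9FBF);
}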
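To make the bit arithmetic concrete, here is a minimal sketch that decodes a single 3-byte sequence by hand. The helper names decode_utf8_3byte and print_code_point are hypothetical (they are not part of the original post). For 中, whose UTF-8 bytes are E4 B8 AD, the payload bits 0100, 111000 and 101101 reassemble to 0x4E2D. The sketch does not reject overlong encodings or surrogates, which is acceptable here because neither can fall inside U+4E00–U+9FBF.

```c
#include <stdio.h>

/* Hypothetical helper: decode one 3-byte UTF-8 sequence
   (1110xxxx 10xxxxxx 10xxxxxx) starting at s.
   Returns the code point, or -1 if the bytes do not match that pattern.
   Assumes s is NUL-terminated or has at least 3 readable bytes. */
int decode_utf8_3byte(const unsigned char *s)
{
    if ((s[0] & 0xF0) != 0xE0) return -1;  /* lead byte must be 1110xxxx */
    if ((s[1] & 0xC0) != 0x80) return -1;  /* continuation must be 10xxxxxx */
    if ((s[2] & 0xC0) != 0x80) return -1;  /* continuation must be 10xxxxxx */

    /* 4 bits from the lead byte, then 6 bits from each continuation byte */
    return ((s[0] & 0x0F) << 12) | ((s[1] & 0x3F) << 6) | (s[2] & 0x3F);
}

/* Print the decoded code point in the usual U+XXXX notation. */
void print_code_point(const unsigned char *s)
{
    int code = decode_utf8_3byte(s);
    if (code < 0)
        printf("not a 3-byte UTF-8 sequence\n");
    else
        printf("U+%04X\n", (unsigned)code);
}
```

Calling print_code_point on the bytes E4 B8 AD prints U+4E2D, which lies inside the range the function above accepts.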
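When scanning hundreds of megabytes or more of data for statistics, you often run into baffling characters; running the text through is_utf8_zh_basic is a simple way to filter them out.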
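As a rough illustration of that filtering step, the sketch below walks a NUL-terminated UTF-8 buffer and copies out only the characters in U+4E00–U+9FBF. The names is_cjk_basic_at and keep_cjk_basic are hypothetical, and the sketch assumes the caller provides an output buffer at least as large as the input. Anything that is not a matching 3-byte sequence is skipped one byte at a time, which is safe for valid UTF-8 because lead bytes of the form 1110xxxx never appear as continuation bytes.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: 1 if s points at a 3-byte UTF-8 sequence encoding
   a code point in U+4E00..U+9FBF, else 0. Stops at the first NUL byte. */
int is_cjk_basic_at(const unsigned char *s)
{
    int code;
    if ((s[0] & 0xF0) != 0xE0) return 0;
    if ((s[1] & 0xC0) != 0x80) return 0;
    if ((s[2] & 0xC0) != 0x80) return 0;
    code = ((s[0] & 0x0F) << 12) | ((s[1] & 0x3F) << 6) | (s[2] & 0x3F);
    return code >= 0x4E00 && code <= 0x9FBF;
}

/* Copy only the characters in U+4E00..U+9FBF from in to out.
   out must have room for strlen(in) + 1 bytes.
   Returns the number of characters kept. */
size_t keep_cjk_basic(const char *in, char *out)
{
    const unsigned char *p = (const unsigned char *)in;
    size_t kept = 0;

    while (*p != '\0') {
        if (is_cjk_basic_at(p)) {
            memcpy(out, p, 3);  /* keep the whole 3-byte character */
            out += 3;
            p += 3;
            kept++;
        } else {
            p++;                /* drop one byte of anything else */
        }
    }
    *out = '\0';
    return kept;
}
```

For very large inputs the same loop can be applied chunk by chunk, as long as a chunk boundary does not split a 3-byte sequence.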


