Filtering Garbled Chinese Text in Java

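While cleaning log data recently I ran into garbled Chinese text (mojibake). The simplest filter, discarding any string that contains a non-Chinese character, is not a good option, because legitimate strings such as Xperia™主題 and 天天四川麻將Ⅱ would be discarded too.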

1. Unicode Encoding

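Unicode is a character encoding standard that aims to cover the characters of all the world's languages, punctuation included; put simply, it is a universal code. Its code space is U+0000 .. U+10FFFF. Partitioned by fixed code point ranges, Unicode is divided into blocks (Unicode block): a code point belongs to at most one block, and blocks never overlap. Partitioned by a property of the characters themselves, Unicode is divided into scripts (Unicode script). The Han script, for example, draws its characters from the following blocks:

CJK Radicals Supplement
Kangxi Radicals
CJK Symbols and Punctuation (15 characters only)
CJK Unified Ideographs Extension A
CJK Unified Ideographs
CJK Compatibility Ideographs
CJK Unified Ideographs Extension B
CJK Unified Ideographs Extension C
CJK Unified Ideographs Extension D
CJK Unified Ideographs Extension E
CJK Compatibility Ideographs Supplement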

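Among these, the common Chinese characters sit in the CJK Unified Ideographs block; traditional and rarer characters are covered by the five extensions A, B, C, D, and E. The Basic Latin block contains all of ASCII: control characters, punctuation, and the English letters.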
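The claim that only some characters of CJK Symbols and Punctuation belong to the Han script can be checked directly against the JDK. The sketch below is illustrative only: the class name `HanInBlock` and the helper `hanCodePoints` are hypothetical, and the count it prints depends on the Unicode version bundled with your JDK.

```java
import java.lang.Character.UnicodeBlock;
import java.lang.Character.UnicodeScript;
import java.util.ArrayList;
import java.util.List;

class HanInBlock {
    /** Code points in [from, to] that lie in the given block and have script Han. */
    static List<Integer> hanCodePoints(UnicodeBlock block, int from, int to) {
        List<Integer> result = new ArrayList<>();
        for (int cp = from; cp <= to; cp++) {
            if (UnicodeBlock.of(cp) == block && UnicodeScript.of(cp) == UnicodeScript.HAN) {
                result.add(cp);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // U+3000..U+303F is the code point range of CJK Symbols and Punctuation.
        List<Integer> han =
            hanCodePoints(UnicodeBlock.CJK_SYMBOLS_AND_PUNCTUATION, 0x3000, 0x303F);
        System.out.println(han.size() + " Han code points in the block");
        for (int cp : han) {
            System.out.printf("U+%04X %s%n", cp, new String(Character.toChars(cp)));
        }
    }
}
```

The ideographic iteration mark 々 (U+3005) shows up in the result, while the ideographic comma 、 (U+3001) does not: both live in the same block, but only the former has script Han.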

2. Character Encoding in Java

The JDK implements both Unicode blocks and Unicode scripts:

<code>char c = '☎';
Character.UnicodeBlock ub = Character.UnicodeBlock.of(c);
Character.UnicodeScript uc = Character.UnicodeScript.of(c);</code>
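To see how block and script differ on concrete characters, here is a small runnable sketch. The class name `BlockScriptDemo` and the helper `describe` are made up for illustration; the only JDK calls used are `Character.UnicodeBlock.of(int)` and `Character.UnicodeScript.of(int)`.

```java
import java.lang.Character.UnicodeBlock;
import java.lang.Character.UnicodeScript;

class BlockScriptDemo {
    /** One-line summary of a code point: its value, its block, and its script. */
    static String describe(int codePoint) {
        return String.format("U+%04X block=%s script=%s",
            codePoint, UnicodeBlock.of(codePoint), UnicodeScript.of(codePoint));
    }

    public static void main(String[] args) {
        // ☎, 中, the ideographic comma 、, the iteration mark 々, and the letter A
        int[] samples = {0x260E, 0x4E2D, 0x3001, 0x3005, 0x0041};
        for (int cp : samples) {
            System.out.println(describe(cp));
        }
    }
}
```

Punctuation such as 、 is reported with script COMMON even though it sits in a CJK block, which is why the filter later in this post uses blocks to recognise punctuation and the script to recognise Chinese characters.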

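A Java char is a UTF-16 code unit. Casting a char to int yields the value of that code unit, which equals the Unicode code point for every character in the Basic Multilingual Plane. UTF-8 bytes only appear when you call getBytes; note that getBytes() without arguments uses the platform default charset, so the result is UTF-8 only where that default is UTF-8: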

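<code>import java.nio.charset.StandardCharsets;
import org.apache.commons.codec.binary.Hex;

String s = "\u00a0";
// value of the first UTF-16 code unit, printed as an escape
String.format("\\u%04x", (int) s.charAt(0)); // --> \u00a0
// hex of the UTF-8 encoding, with the charset named explicitly
Hex.encodeHexString(s.getBytes(StandardCharsets.UTF_8)); // --> c2a0</code>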
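The "cast to int gives the code point" rule holds only inside the Basic Multilingual Plane. The following sketch, which uses nothing beyond the JDK, shows what happens for a CJK Extension B character. The class name `CodeUnitDemo` and the helper `utf8Hex` are illustrative assumptions, not part of the original code.

```java
import java.nio.charset.StandardCharsets;

class CodeUnitDemo {
    /** Lower-case hex of the UTF-8 encoding of s. */
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String nbsp = "\u00a0";
        System.out.println(utf8Hex(nbsp));                           // c2a0

        // U+20000 is the first code point of CJK Unified Ideographs Extension B.
        String extB = new String(Character.toChars(0x20000));
        System.out.println(extB.length());                           // 2 (a surrogate pair)
        System.out.println(extB.codePointCount(0, extB.length()));   // 1
        System.out.printf("%04x%n", (int) extB.charAt(0));           // d840, a high surrogate
        System.out.printf("%x%n", extB.codePointAt(0));              // 20000
    }
}
```

For such characters `charAt` returns a surrogate rather than the character itself, so code that classifies text one char at a time will not recognise them as Han.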

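UTF-8 is one variable-length prefix encoding of Unicode code points: each code point maps to a sequence of one to four bytes. Now back to the opening problem of filtering garbled Chinese text. A basic approach is: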
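As a quick illustration of "variable-length prefix encoding", the sketch below prints how many bytes UTF-8 needs for a few code points and the bit pattern of the first byte, whose leading bits announce the sequence length. `Utf8LengthDemo`, `utf8`, and `leadByteBits` are hypothetical names chosen for this example.

```java
import java.nio.charset.StandardCharsets;

class Utf8LengthDemo {
    /** UTF-8 encoding of a single code point. */
    static byte[] utf8(int codePoint) {
        return new String(Character.toChars(codePoint)).getBytes(StandardCharsets.UTF_8);
    }

    /** The first byte of the encoding as an 8-digit binary string. */
    static String leadByteBits(int codePoint) {
        String bits = Integer.toBinaryString(utf8(codePoint)[0] & 0xFF);
        return String.format("%8s", bits).replace(' ', '0');
    }

    public static void main(String[] args) {
        // A (ASCII), NO-BREAK SPACE, 中, and the first CJK Extension B code point
        int[] samples = {0x41, 0xA0, 0x4E2D, 0x20000};
        for (int cp : samples) {
            System.out.printf("U+%04X -> %d byte(s), lead byte %s%n",
                cp, utf8(cp).length, leadByteBits(cp));
        }
    }
}
```

The lead bytes start with 0, 110, 1110, and 11110 for one-, two-, three-, and four-byte sequences respectively, so a decoder knows the length of a sequence from its first byte.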

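Skip punctuation and control characters, compute the proportion of non-Chinese characters among those that remain, and treat the string as garbled if that proportion exceeds a threshold.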

The complete code:

<code>public class ChineseUtill {

    // True if the character belongs to the Han script.
    private static boolean isChinese(char c) {
        return Character.UnicodeScript.of(c) == Character.UnicodeScript.HAN;
    }

    // True if the character is in a block that should be skipped as punctuation.
    public static boolean isPunctuation(char c) {
        Character.UnicodeBlock ub = Character.UnicodeBlock.of(c);
        return
            // punctuation, spacing, and formatting characters
            ub == Character.UnicodeBlock.GENERAL_PUNCTUATION
            // symbols and punctuation in the unified Chinese, Japanese and Korean script
            || ub == Character.UnicodeBlock.CJK_SYMBOLS_AND_PUNCTUATION
            // fullwidth character or a halfwidth character
            || ub == Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS
            // vertical glyph variants for east Asian compatibility
            || ub == Character.UnicodeBlock.CJK_COMPATIBILITY_FORMS
            // vertical punctuation for compatibility characters with the Chinese Standard GB 18030
            || ub == Character.UnicodeBlock.VERTICAL_FORMS
            // ascii (letters and digits included, so they are skipped as well)
            || ub == Character.UnicodeBlock.BASIC_LATIN;
    }

    // Extra acceptable characters; adjust this table to what appears in your logs.
    private static boolean isUserDefined(char c) {
        Character.UnicodeBlock ub = Character.UnicodeBlock.of(c);
        return ub == Character.UnicodeBlock.NUMBER_FORMS
            || ub == Character.UnicodeBlock.ENCLOSED_ALPHANUMERICS
            || ub == Character.UnicodeBlock.LETTERLIKE_SYMBOLS
            || c == '\ufeff'
            || c == '\u00a0';
    }

    // True if more than 30% of the counted characters are not Chinese.
    public static boolean isMessy(String str) {
        int chlength = 0; // characters that were not skipped
        int count = 0;    // non-Chinese characters among them
        for (int i = 0; i < str.length(); i++) {
            char c = str.charAt(i);
            if (isPunctuation(c) || isUserDefined(c)) {
                continue;
            }
            if (!isChinese(c)) {
                count++;
            }
            chlength++;
        }
        if (chlength == 0) {
            // nothing left to judge; the original 0/0 division also ended in false
            return false;
        }
        return (float) count / chlength > 0.3f;
    }

}
</code>

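To obtain a more complete table of acceptable characters, the isUserDefined method is added (what goes into it depends on the characters that appear in your logs). It accepts three more blocks, Number Forms, Enclosed Alphanumerics, and Letterlike Symbols, plus the characters \u00a0 (NO-BREAK SPACE) and \ufeff (ZERO WIDTH NO-BREAK SPACE). This is what lets strings such as Xperia™主題 (™ is in Letterlike Symbols) and 天天四川麻將Ⅱ (Ⅱ is in Number Forms) pass the filter.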
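One limitation of a char-by-char loop is that characters outside the Basic Multilingual Plane (CJK Extension B and later) arrive as two surrogate chars, neither of which is Han on its own. The following is a minimal sketch of a code-point-based variant, not the original implementation. The class `MessyTextSketch`, the methods `nonHanRatio` and `isMessy`, and the explicit `threshold` parameter are all assumptions introduced here; the list of ignored blocks is copied from the code above.

```java
import java.lang.Character.UnicodeBlock;
import java.lang.Character.UnicodeScript;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class MessyTextSketch {
    // Blocks whose characters are skipped when computing the ratio.
    private static final Set<UnicodeBlock> IGNORED_BLOCKS = new HashSet<>(Arrays.asList(
        UnicodeBlock.GENERAL_PUNCTUATION, UnicodeBlock.CJK_SYMBOLS_AND_PUNCTUATION,
        UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS, UnicodeBlock.CJK_COMPATIBILITY_FORMS,
        UnicodeBlock.VERTICAL_FORMS, UnicodeBlock.BASIC_LATIN,
        UnicodeBlock.NUMBER_FORMS, UnicodeBlock.ENCLOSED_ALPHANUMERICS,
        UnicodeBlock.LETTERLIKE_SYMBOLS));

    private static boolean isIgnored(int cp) {
        if (cp == 0xFEFF || cp == 0x00A0) {
            return true;
        }
        UnicodeBlock block = UnicodeBlock.of(cp); // null if cp is in no block
        return block != null && IGNORED_BLOCKS.contains(block);
    }

    /** Share of non-Han code points among those not ignored; 0.0 if none remain. */
    static double nonHanRatio(String s) {
        int counted = 0;
        int nonHan = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            i += Character.charCount(cp); // step over a surrogate pair in one go
            if (isIgnored(cp)) {
                continue;
            }
            counted++;
            if (UnicodeScript.of(cp) != UnicodeScript.HAN) {
                nonHan++;
            }
        }
        return counted == 0 ? 0.0 : (double) nonHan / counted;
    }

    static boolean isMessy(String s, double threshold) {
        return nonHanRatio(s) > threshold;
    }

    public static void main(String[] args) {
        String ok = "Xperia\u2122\u4e3b\u984c"; // Xperia™主題
        // Simulate mojibake: UTF-8 bytes of 天天四川 decoded as ISO-8859-1.
        String garbled = new String(
            "\u5929\u5929\u56db\u5ddd".getBytes(StandardCharsets.UTF_8),
            StandardCharsets.ISO_8859_1);
        System.out.println(isMessy(ok, 0.3));      // false
        System.out.println(isMessy(garbled, 0.3)); // true
    }
}
```

The string literals use Unicode escapes so that the result does not depend on the encoding the source file is compiled with. The 0.3 threshold is the one used above; a suitable value depends on your data.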