Saturday, October 17, 2015

Trim is not removing all whitespaces in Java

Java trim is removing only ASCII whitespace characters, but ignores unicode whitespaces. This is backward compatibility thing, and there is big and detailed explanation of this problem. It can be easily fixed by using regular expression that will remove all official unicode whitespaces:
Pattern TRIM_PATTERN = Pattern.compile("^\\s*(.*?)\\s*$", Pattern.UNICODE_CHARACTER_CLASS);
Matcher matcher = TRIM_PATTERN.matcher(input);
if (matcher.matches() && matcher.groupCount() > 0) {
    return matcher.group(1);
}
return input;

But for more extreme cases you may want to use also this pattern
"^[\\s\\u2060\\u200D\\u200C\\u200B\\u180E\\uFEFF\\u00AD]*(.*?)[\\s\\u2060\\u200D\\u200C\\u200B\\u180E\\uFEFF\\u00AD]*$"

besides whitespaces it will also remove other invisible symbols.

No comments:

Post a Comment