Saturday, July 9, 2016

URL canonicalization and normalization in Java

Recently I had to implement integration with Google Safe Browsing in Java, and one part of the task is URL normalization: think of it as JSoup, but for URLs. You have to remove redundant parts, decode, re-encode, and so on. It seems trivial, and even java.net.URI has a normalize method, but it really was not: nothing I tried worked, and the results were not even remotely compliant.
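To illustrate the gap, here is a minimal sketch of what java.net.URI does on its own. The class and method names (UriNormalizeDemo, normalizePath) are made up for this example. As far as I can tell, normalize() only cleans up "." and ".." path segments; it does not lowercase the host or touch percent-escapes, which the Safe Browsing rules also require.

```java
import java.net.URI;

public class UriNormalizeDemo {

    // Hypothetical helper: applies only the built-in java.net.URI normalization.
    static String normalizePath(String url) {
        // URI.create throws IllegalArgumentException on malformed input.
        return URI.create(url).normalize().toString();
    }

    public static void main(String[] args) {
        // Dot segments in the path are resolved.
        System.out.println(normalizePath("http://www.google.com/a/../b/./c"));
        // prints "http://www.google.com/b/c"

        // Host case and percent-escapes are left exactly as they were.
        System.out.println(normalizePath("http://WWW.Example.com/%7Efoo"));
        // prints "http://WWW.Example.com/%7Efoo"
    }
}
```

So the built-in method covers one small piece of canonicalization and leaves the rest to you.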

After searching and trying everything suggested on Stack Overflow, I finally found a working solution: URL-Detector from LinkedIn. The library itself looks raw and is not even in a public Maven repository as of now, but it successfully passes all of Google's tests after replacing the port and using the URL without its fragment.
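The port and fragment handling can be sketched with plain java.net.URI, independently of URL-Detector. This is only an illustration: the class and method names (UrlPreClean, stripPortAndFragment) are hypothetical, and it assumes that "replacing the port" means dropping the explicit port entirely. It rebuilds the URL from the raw (still encoded) components so that nothing gets decoded or re-encoded along the way.

```java
import java.net.URI;

public class UrlPreClean {

    // Hypothetical helper: drops the explicit port and the fragment,
    // keeping scheme, user info, host, path and query untouched.
    // Assumes a hierarchical URL with a server-based authority (scheme://host...).
    static String stripPortAndFragment(String url) {
        URI uri = URI.create(url);
        StringBuilder sb = new StringBuilder();
        sb.append(uri.getScheme()).append("://");
        if (uri.getRawUserInfo() != null) {
            sb.append(uri.getRawUserInfo()).append('@');
        }
        sb.append(uri.getHost());
        // Raw getters return the components as written, escapes included.
        if (uri.getRawPath() != null) {
            sb.append(uri.getRawPath());
        }
        if (uri.getRawQuery() != null) {
            sb.append('?').append(uri.getRawQuery());
        }
        // The port and fragment are deliberately not appended.
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(stripPortAndFragment("http://www.gotaport.com:1234/path?q=1#frag"));
        // prints "http://www.gotaport.com/path?q=1"
    }
}
```

Inputs that java.net.URI cannot parse will throw here, so real-world URLs would still need a more forgiving parser in front of this step.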