Dmitrijs Artjomenko blog: Rich text component validation

Friday, July 15, 2011

Rich text component validation

Rich text components return HTML, and if you are going to show that HTML on your pages you have a problem: someone can submit malicious HTML to your server, and if you display it "as is", it can run scripts in your visitors' browsers. So you have to check that the submitted code was generated by your component, or at least that it is safe. The latter is called HTML sanitization.
Simple pattern matching is not enough, as malicious code can be hidden behind unusual Unicode characters and similar tricks (you can see how tricky it gets at http://ha.ckers.org/xss.html). In practice it usually requires parsing the HTML and deciding what is allowed and what is not. Fortunately, the JDK already ships with a decent HTML parser. It is intended for Swing, but it is abstract enough to be used for validation too. All you need to do is extend javax.swing.text.html.parser.Parser, like this (the code is Groovy):
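To make the risk concrete, here is a small sketch in plain Java of a naive filter that only strips `<script>` elements with a regular expression. `NaiveFilterDemo` and `naiveStrip` are hypothetical names made up for this illustration; they are not part of any library, and the two inputs are just well-known examples of markup that such a filter does not stop.

```java
import java.util.regex.Pattern;

class NaiveFilterDemo {
    // Hypothetical naive filter: removes anything that looks like a <script> element.
    private static final Pattern SCRIPT = Pattern.compile(
            "<script.*?>.*?</script>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    static String naiveStrip(String html) {
        return SCRIPT.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        // An event handler attribute carries script without any <script> tag.
        String viaAttribute = "<img src=\"x\" onerror=\"alert(1)\">";
        // Removing the inner elements splices the outer fragments into a new <script> element.
        String nested = "<scr<script></script>ipt>alert(1)</scr<script></script>ipt>";

        System.out.println(naiveStrip(viaAttribute));
        System.out.println(naiveStrip(nested));
    }
}
```

The first input passes through unchanged, and the second comes out as `<script>alert(1)</script>`, so the filter itself produces the markup it was supposed to remove.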

import javax.swing.text.html.parser.*
import static javax.swing.text.html.HTML.Tag.*
import static javax.swing.text.html.HTML.Attribute.*

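class RteParser extends Parser {
  // Set to true by validateTag when a disallowed tag or attribute is found
  boolean hasErrors = false

  public RteParser() {
    super(DTD.getDTD('html'))
  }

  // Application-specific checks; set hasErrors = true to reject the input
  void validateTag(TagElement tag) {
    ...
  }

  // The attributes parsed for the current tag are available via getAttributes();
  // flushAttributes() clears them so they are not carried over to the next tag
  void handleStartTag(TagElement tag) {
    validateTag(tag)
    this.flushAttributes()
  }

  void handleEndTag(TagElement tag) {
    validateTag(tag)
    this.flushAttributes()
  }

  void handleEmptyTag(TagElement tag) {
    validateTag(tag)
    this.flushAttributes()
  }

  // Parses the submitted value and reports whether every tag passed validation
  public static boolean validate(String value) {
    RteParser parser = new RteParser()
    StringReader reader = new StringReader("<html>${value}</html>")
    parser.parse(reader)
    return parser.isValid()
  }

  public boolean isValid() {
    return !hasErrors
  }

}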

All validation is done by calling the static validate method.
All the magic happens in validateTag. This method is application-specific: it is the place where you check every tag and attribute against a blacklist or a set of validation patterns.
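As an illustration of what validateTag could look like, here is a minimal sketch in plain Java that uses an allow list rather than a blacklist. Everything specific in it is an assumption made for the example: the class name `AllowListRteParser`, the sets of allowed tags and attributes, and the rule that `href` must start with `http://` or `https://`. It also treats an I/O error during parsing as invalid input. Adapt the lists to what your own rich text component actually produces.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Arrays;
import java.util.Enumeration;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import javax.swing.text.html.parser.DTD;
import javax.swing.text.html.parser.Parser;
import javax.swing.text.html.parser.TagElement;

class AllowListRteParser extends Parser {
    // Hypothetical allow lists: replace with whatever your component emits.
    private static final Set<String> ALLOWED_TAGS = new HashSet<>(Arrays.asList(
            "html", "p", "br", "b", "i", "u", "em", "strong", "ul", "ol", "li", "a"));
    private static final Set<String> ALLOWED_ATTRIBUTES =
            new HashSet<>(Arrays.asList("href", "title"));

    private boolean hasErrors = false;

    AllowListRteParser() throws IOException {
        super(DTD.getDTD("html"));
    }

    // Marks the input as invalid if the tag, or any attribute parsed for it, is not allowed.
    private void validateTag(TagElement tag) {
        String tagName = tag.getElement().getName().toLowerCase(Locale.ROOT);
        if (!ALLOWED_TAGS.contains(tagName)) {
            hasErrors = true;
        }
        Enumeration<?> names = getAttributes().getAttributeNames();
        while (names.hasMoreElements()) {
            Object key = names.nextElement();
            String name = key.toString().toLowerCase(Locale.ROOT);
            String value = String.valueOf(getAttributes().getAttribute(key))
                    .trim().toLowerCase(Locale.ROOT);
            if (!ALLOWED_ATTRIBUTES.contains(name)) {
                hasErrors = true;
            } else if (name.equals("href")
                    && !(value.startsWith("http://") || value.startsWith("https://"))) {
                // Assumed rule: rejects javascript: and any other link scheme.
                hasErrors = true;
            }
        }
    }

    @Override
    protected void handleStartTag(TagElement tag) {
        validateTag(tag);
        flushAttributes();
    }

    @Override
    protected void handleEndTag(TagElement tag) {
        validateTag(tag);
        flushAttributes();
    }

    @Override
    protected void handleEmptyTag(TagElement tag) {
        validateTag(tag);
        flushAttributes();
    }

    static boolean validate(String value) {
        try {
            AllowListRteParser parser = new AllowListRteParser();
            parser.parse(new StringReader("<html>" + value + "</html>"));
            return !parser.hasErrors;
        } catch (IOException e) {
            return false; // assumption: input that cannot be read is treated as invalid
        }
    }
}
```

With these lists, `AllowListRteParser.validate("<p>Hello <b>world</b></p>")` should accept the input, while markup containing a `script` tag, an `onclick` attribute, or a `javascript:` link should be rejected. An allow list is used here because anything not explicitly permitted fails by default, so a tag or attribute you never thought of cannot slip through.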
