Friday, February 08, 2008

String.replaceAll() caveat

Sometimes I wonder how it is possible to write Java programs for as long as I have now and still stumble across little gotchas I did not know about.

When writing a class that is intended as a utility to write well-formed queries for our persistence framework, I was bitten by a nice little subtlety of the String class. Just a little background: In our application we use a persistence framework with a SQL-like (but still different) query language. Statements in that language are easier to read because they are closer to the business object model, but in the end they get translated into "real" SQL.

Unfortunately the query parser is based on simple query strings without support for something like prepared statements which makes it susceptible to injection attacks if you put user-entered values into a query directly. While there is a way to escape the input accordingly it is a little cumbersome and makes the code to assemble the query conditions into the final String hard to read.

I decided to build a little wrapper that would allow support for a sort of prepared statements by having people put placeholders into their queries and escape the values upon insertion.

The code for such a query might look like this:

StmtHelper h = new StmtHelper("SELECT firstname FROM customer WHERE name == #n#");
h.replace("n", txtfield.getText());
String queryString = h.assemble();
Inside the StmtHelper class' "replace" method I first escape the parameter value as needed and store it with its placeholder "n" in a map. Upon "assemble" I create a new String based on the template passed to the constructor where all placeholders are replaced with their escaped values.

This is the "assemble" method:

public String assemble() {
String result = template;
for (Map.Entry<String, String> entry : replacements.entrySet()) {
 result = result.replaceAll(entry.getKey(), entry.getValue());
}
return result;
}
At first glance this does what it is supposed to do. Unfortunately there is a bug in this code that will only show its ugly face with proper values to be filled in. For example this code will lead to a problem at runtime:
StmtHelper h = new StmtHelper("SELECT firstname FROM customer WHERE name == #n#");
h.replace("n", "O'Reilly");
String queryString = h.assemble();
This should lead to a query string as follows:

SELECT firstname FROM customer where name = 'O\'Reilly'

The single quote in the name must be escaped with a backslash. This is taken care of inside the "replace" method and works reliably - inspecting the replacement Map proves that. Nevertheless the final query string will be wrong: The single quote will not be escaped, effectively breaking the syntax for this particular query and name, but practically opening a gaping security hole for injection attacks:

SELECT firstname FROM customer where name = 'O'Reilly'

I was rather surprised when I came across this, because the replacements worked in all the test cases I had built into the JUnit tests. Unfortunately those did not contain a replacement value with a single quote, even though this is of course one of the first things someone would try in an attack... Looking into this more closely I learned something about String.replaceAll() I had never noticed before. This is from the Java 1.5 API documentation:

Replaces each substring of this string that matches the given regular expression with the given replacement. An invocation of this method of the form str.replaceAll(regex, repl) yields exactly the same result as the expression Pattern.compile(regex).matcher(str).replaceAll(repl)

IMHO this does not suggest anything special concerning the replacement, as long as the regular expression for the first parameter is valid. In my case this was just a simple name surrounded by hash characters, so no problems to expect here. However the backslash in front of O'Reilly's single quote disappears after the call to "replaceAll", leaving the quote unprotected in the resulting String.

Only looking at the Java 6 documentation makes it more clear:

An invocation of this method of the form str.replaceAll(regex, repl) yields exactly the same result as the expression Pattern.compile(regex).matcher(str).replaceAll(repl) Note that backslashes (\) and dollar signs ($) in the replacement string may cause the results to be different than if it were being treated as a literal replacement string; see Matcher.replaceAll. Use Matcher.quoteReplacement(java.lang.String) to suppress the special meaning of these characters, if desired.

That added paragraph suggests that more than one person must have fallen into that trap. While in the Java 5 version of the docs there was a mention of the result being exactly the same as if you used a Pattern/Matcher combination, at least to me it was not obviously hinting at a special behavior concerning the replacement string, but only the first parameter called "regex".

Strangely enough only after that experience I noticed that I must have been lucky all the time before, because I practically always used the "replaceAll()" method instead of the simple "replace()" I not use to replace the key/value pairs. As I do not really need the regular expression capabilities for the search term, this also saves some overhead.

2 comments:

Ramon said...

The behaviour that you describe is what I would expect. Why would a replace function do escaping? Escaping/conversion should be done elsewhere.

Pffff you got me worried for a minute :)

Daniel Schneller said...

I'm not sure I get you right. Which behavior would you expect?
Please note: My replacement String is completely done when I put it into replace/replaceAll. I do not expect those methods to do anything but "paste" in the second argument into the position of the placeholder found through the first argument.
Are we talking about the same thing here?