Monday, January 31, 2011

Beware of Oracle JAXP XPath implementation.

For quite some time I have being involved in a project that employs such a "cutting edge" and RSI inducing technology as BPEL. Well, more specifically: Oracle's interpretation of this technology as implemented in Oracle SOA Suite 10g.

Recently we got a problem in one of the project's BPEL processes. Not that it is the only problem we have got with this software but this time for a change the problem does not have anything to do with BPEL.

The BPEL process iterates over some XML elements trying to find the element (<parent>) with the largest value of some child element (<check>), for example:

<parent>
<some_elt>...</some_elt>
<check>1</check>
</parent>
<parent>
<some_elt>...</some_elt>
<check>13</check>
</parent>
<parent>
<some_elt>...</some_elt>
<check>9</check>
</parent>

The expected result is <parent> with <check>13</check>. Instead the result was the element with <check>9</check>.

The "suspect" was some BPEL construct that used XPath to compare "the largest <check> so far" with "the current <check>". Some more experiments proved that. Apparently Oracle software can correctly handle some XPath comparison operations.

I decided to check where exactly the problem lies: in the BPEL implementation or somewhere else. I knew that Oracle has its own implementation of XML parsers, XSLT and XPath support that it tries to force on you if you are running some Oracle software. The implementation lives in xmlparserv2.jar, for example in /lib/xmlparserv2.jar.

I wrote a simple test program:

package p1;

import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;;

public class Main {
public static void main(String[] args) throws XPathExpressionException {
XPathFactory xf = XPathFactory.newInstance();

System.out.println("Factory class: [" + xf.getClass().getName() + "]");
eval(xf, "1", true);
eval(xf, "'a'", true);
eval(xf, "''", false);
eval(xf, "1 > 2", false);
eval(xf, "1 < 2", true);
eval(xf, "'1' > '2'", false);
eval(xf, "'1' < '2'", true);
eval(xf, "12 > 9", true);
eval(xf, "12 < 9", false);
eval(xf, "'12' > '9'", true);
eval(xf, "'12' < '9'", false);
eval(xf, "'' > 2", false);
eval(xf, "'' < 2", false);
eval(xf, "'a' < 'b'", false);
eval(xf, "'b' < 'a'", false);
eval(xf, "number('a') < number('b')", false);
eval(xf, "number('b') < number('a')", false);
}

//
private static void eval(XPathFactory xf, String expr, boolean exp) {
XPath xp = xf.newXPath();

try {
Object res = xp.evaluate(expr, (Object)null, XPathConstants.BOOLEAN);

if (Boolean.valueOf(exp).equals(res)) {
System.out.println("Expression: [" + expr + "], results match");
} else {
System.out.println("Expression: [" + expr + "], NO MATCH!!: got: [" +
res + "], expected: [" +exp + "]");
}
} catch (Exception ex) {
System.out.println("Expression: [" + expr + "], EXC!!:");
ex.printStackTrace(System.out);
}
}
}


The program just goes to the "default" XPath implementation, uses it to evaluate some xpath expressions with string or numeric constants, and compares the obtained result with the expected value.

Executing this program under JDK 1.5 or 1.6 produces no surprises: the obtained result matches the expected value:


>java -cp xptest.jar p1.Main

Factory class: [com.sun.org.apache.xpath.internal.jaxp.XPathFactoryImpl]
Expression: [1], results match
Expression: ['a'], results match
Expression: [''], results match
Expression: [1 > 2], results match
Expression: [1 < 2], results match
Expression: ['1' > '2'], results match
Expression: ['1' < '2'], results match
Expression: [12 > 9], results match
Expression: [12 < 9], results match
Expression: ['12' > '9'], results match
Expression: ['12' < '9'], results match
Expression: ['' > 2], results match
Expression: ['' < 2], results match
Expression: ['a' < 'b'], results match
Expression: ['b' < 'a'], results match
Expression: [number('a') < number('b')], results match
Expression: [number('b') < number('a')], results match


Next I ran it with Oracle XML jar in the classpath and Oracle XPath as the default XPath implementation (command line is wrapped for readability):

> java -Djavax.xml.xpath.XPathFactory:http://java.sun.com/jaxp/xpath/dom=oracle.xml.xpath.JXPathFactory
-cp xmlparserv2.jar;xptest.jar p1.Main

Factory class: [oracle.xml.xpath.JXPathFactory]
Expression: [1], results match
Expression: ['a'], results match
Expression: [''], results match
Expression: [1 > 2], results match
Expression: [1 < 2], results match
Expression: ['1' > '2'], results match
Expression: ['1' < '2'], results match
Expression: [12 > 9], results match
Expression: [12 < 9], results match
Expression: ['12' > '9'], NO MATCH!!: got: [false], expected: [true]
Expression: ['12' < '9'], NO MATCH!!: got: [true], expected: [false]
Expression: ['' > 2], results match
Expression: ['' < 2], results match
Expression: ['a' < 'b'], NO MATCH!!: got: [true], expected: [false]
Expression: ['b' < 'a'], results match
Expression: [number('a') < number('b')], results match
Expression: [number('b') < number('a')], results match



The selected lines show where Oracle's XPath implementation violates XML Path Language (XPath) v 1.0 specification, more specifically, section 3.4 Booleans:
When neither object to be compared is a node-set and the operator is <=, <, >= or <, then the objects are compared by converting both objects to numbers and comparing the numbers according to IEEE 754.

It looks like Oracle's XPath compares strings lexicographically instead of converting them to numbers. This also means that expressions like '1' > '2' or 'b' < 'a' got "match" only incidentally. So for a change the problem is not caused by the BPEL implementation. The fact that Oracle's BPEL implementation can't really keep track of variable datatypes and use that information accordingly is just minor nuisance. You implement a BPEL process, you have to specify datatypes of all variables used in the process, and then you have to watch a lot of useless warnings like
[Error ORABPEL-10041]: Trying to assign incompatible types

Mind you, it does not complain about relational expressions. And then you execute your process and get something unexpected. Nice.

Conclusion:
  1. It is pretty impossible to stay away of Oracle in the Java world right now. But you can try to stay away from its XML implementation even if you have to run your code under some Oracle product. Just use appropriate javax.xml...Factory system properties. Fortunately it looks like this XML implementation is not included in newer versions of Oracle middleware (Weblogic-based). On the other hand this all can change: Oracle might decide to use their XML implementation as the default one for newer JDKs (brrr...).

  2. In some cases you have to stick to Oracle's implementation. For example who knows how the rest of BPEL software depends on it. No matter what XPath implementation you end up using: be as explicit as possible. In this case the solution is to force the string-to-number conversion explicitly. Instead of '12' > '9' use number('12') > number('9').

  3. Will I report the problem to Oracle? Not likely. I have some experience reporting BPEL problems to them, and it is an interesting exercise, to say the least. They always end up asking source code of the test case for the sole purpose of "analyzing the design of the test process". After that they come up with a lot of suggestions on how we can improve the design of our test process. But they tend to ignore our explanations that the test case is actually specifically implemented in a particular way to demonstrate the problem. Otherwise we would have to ship out complete system to them so that they can reproduce the problem. On the other hand ... It might be interesting to here their comments on the design of the test program above.

P.S. I also decided to check how some other java XPath implementations behave.

I have checked jaxen. They do not have JAXP javax.xml.xpath.XPathFactory, but their API is simple. No surprises here, everything works.

I also tested saxon, implementation version 9.3.0.4. This is XPath 2.0 implementation, and I got some errors because of this. I had to change my test program to force backwards compatibility mode with XPath 1.0. After that the program completed successfully as well.

Sunday, January 9, 2011

Adventures in the land of distributed transactions, or how ACID is the 2PC?

What? The problem is back?! No, it can't be true. I spent so much time investigating it. I fixed it. I tested the fix.

And yet... I am looking at the database and see enough information there to say: yes, the problem is back. Or, at least, the symptoms are the same: a JMS message is sent and a DB insert is performed in one transaction, and when the MDB gets the message and goes to the database to fetch the inserted record the record is not there.

There is also a very important difference: it is the acceptance environment where the application is running for weeks and the problem manifested itself only once. The original problem was happening once about every 4-5 transactions.

First things first: setRollbackOnly() and Thread.sleep() in the class that sends a JMS message. No message is received by the MDB proving that JMS operations are invoked in the same transaction as database inserts. So no, the new problem is not the same as the old one. This is good and bad at the same time. The good part is that my original fix is correct. The bad part: I have to reproduce the new problem.

Somehow I felt that using a small test application would not help me here: too much unknown, too many things may cause it. I have to reproduce it using the real application.

Actually that "I have to reproduce it" part turned out to be not very difficult after all: I just have created a JBPM process with some async nodes that performed enough work to keep the server busy for tens of minutes if not hours. And I started it 10 times. Eventually after sending about 200-300 JMS messages all the processes "died" with the same error.

Changing the MDB so that it went to the database more than once trying to find the inserted record "solved" the issue: the record was always found within 3 iterations.

Clearly, I have a race condition on my hands, but what is causing it?

Next thing I did I made sure both JMS operations and the database access went through the same JDBC datasource. Originally the application used 2 JDBC datasources: one for JMS and one for the database access. This change did not solve the problem.

I am not going to describe here what I have tried. It took some time but eventually I have found the source of the problem: some hibernate configuration parameters that should not be there in the first place. Problem solved?

Well, all this got me thinking: first of all, specifying these configuration parameters resulted in an apparently minor difference in what J2EE methods Hibernate invoked. The J2EE javadoc for these methods does not call for such a drastic difference in transaction behavior.

Next, it looks like without these configuration parameters the application is using a single pooled JDBC connection for both JMS and database operations. When these configuration parameters are specified the application is using multiple JDBC connections making each transaction distributed. (Now I am just guessing; as I have mentioned above the javadoc does not say anything about possible difference. Even more, I guess the difference if any is up to a container implementing the interface in question.)

Anyway if I am right about distributed transactions then the question is: is something wrong with the OC4J transaction manager? It can easily be the case given how OC4J handles some other aspects of J2EE. But I guess there is more to it.

I mean: here is a distributed transaction with multiple resources; the transaction is committed with no errors and no recovery on the part of the transaction manager. After the call to commit() returns the resources involved in the transaction contain all the changes. After all, it is the responsibility of the two-phase commit protocol to ensure ACID properties of the distributed transaction, right?

But one thing bothers me: what the 2PC protocol guarantees is that after a call to commit() returns you have your transaction completed. Nowhere could I find information about any guarantees with respect to commits of individual resources, or, more specifically, about visibility of changes in one resource relative to the other resources.

But this is exactly what is going on in the application: as soon as the JMS resource transaction is committed some combination of database/JMS code triggers the execution of the MDB's onMessage(). This code does not wait until the database resource gets its changes committed. It does not know that the changes are part of a distributed transaction. And if the database resource is slow enough the onMessage() code might hit the database before the changes to the database resource involved in the original transaction are committed. So much for ACID!

Am I right? Is this a feature of the 2PCprotocol? I hope that this is caused by some misconfiguration of OC4J J2EE stack, but I just can't see how the 2PC protocol can guarantee that all resources involved in a distributed transaction complete their individual commit()s at exactly the same time.

P.S. By the way if you are using Hibernate in a managed environment with a datasource configured with hibernate.connection.datasource poperty do not use hibernate.connection.username and hibernate.connection.password properties. Even if you understand the consequences and you are absolutely positively sure of what you are doing ... Just don't. Let the container manage this.