Sunday, January 9, 2011

Adventures in the land of distributed transactions, or how ACID is the 2PC?

What? The problem is back?! No, it can't be true. I spent so much time investigating it. I fixed it. I tested the fix.

And yet... I am looking at the database and see enough information there to say: yes, the problem is back. Or, at least, the symptoms are the same: a JMS message is sent and a DB insert is performed in one transaction, and when the MDB gets the message and goes to the database to fetch the inserted record, the record is not there.

There is also a very important difference: this is the acceptance environment, where the application has been running for weeks, and the problem has manifested itself only once. The original problem occurred about once every 4-5 transactions.

First things first: I added setRollbackOnly() and Thread.sleep() calls to the class that sends the JMS message. The MDB received no message, which proves that the JMS operations are invoked in the same transaction as the database inserts. So no, the new problem is not the same as the old one. This is good and bad at the same time. The good part is that my original fix is correct. The bad part: I have to reproduce the new problem.

Somehow I felt that using a small test application would not help me here: too many unknowns, too many things that might cause it. I had to reproduce it using the real application.

Actually, that "I have to reproduce it" part turned out to be not very difficult after all: I just created a jBPM process with some async nodes that performed enough work to keep the server busy for tens of minutes, if not hours. And I started it 10 times. Eventually, after sending about 200-300 JMS messages, all the processes "died" with the same error.

Changing the MDB so that it went to the database more than once trying to find the inserted record "solved" the issue: the record was always found within 3 iterations.
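A minimal sketch of that kind of retry loop, in plain Java so it can run outside a container. The class name RetryingLookup, the method findWithRetry and the pause between attempts are my own assumptions for illustration; the real MDB code looked different and queried the database directly.

```java
import java.util.Optional;
import java.util.function.Supplier;

final class RetryingLookup {

    /**
     * Calls the lookup up to maxAttempts times and returns the first
     * non-empty result. Hypothetical helper: in the real application the
     * lookup would be the database query issued from onMessage().
     */
    static <T> Optional<T> findWithRetry(Supplier<Optional<T>> lookup,
                                         int maxAttempts,
                                         long pauseMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            Optional<T> found = lookup.get();
            if (found.isPresent()) {
                return found;
            }
            if (attempt < maxAttempts) {
                try {
                    // Give the other resource a moment to finish its commit.
                    Thread.sleep(pauseMillis);
                } catch (InterruptedException e) {
                    // Preserve the interrupt flag and give up early.
                    Thread.currentThread().interrupt();
                    return Optional.empty();
                }
            }
        }
        return Optional.empty();
    }
}
```

This only hides the symptom, which is why "solved" is in quotes: the record becomes visible eventually, but nothing here explains why it was not visible on the first attempt.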

Clearly, I have a race condition on my hands, but what is causing it?

The next thing I did was to make sure that both the JMS operations and the database access went through the same JDBC datasource. Originally the application used 2 JDBC datasources: one for JMS and one for the database access. This change did not solve the problem.

I am not going to describe here everything I tried. It took some time, but eventually I found the source of the problem: some Hibernate configuration parameters that should not have been there in the first place. Problem solved?

Well, all this got me thinking. First of all, specifying these configuration parameters resulted in an apparently minor difference in which J2EE methods Hibernate invoked. The J2EE javadoc for these methods does not suggest such a drastic difference in transaction behavior.

Next, it looks like without these configuration parameters the application uses a single pooled JDBC connection for both JMS and database operations. When these configuration parameters are specified, the application uses multiple JDBC connections, making each transaction distributed. (Now I am just guessing; as I mentioned above, the javadoc does not say anything about a possible difference. Even more, I guess the difference, if any, is up to the container implementing the interface in question.)

Anyway, if I am right about distributed transactions, then the question is: is something wrong with the OC4J transaction manager? It can easily be the case, given how OC4J handles some other aspects of J2EE. But I guess there is more to it.

I mean: here is a distributed transaction with multiple resources; the transaction is committed with no errors and no recovery on the part of the transaction manager. After the call to commit() returns, the resources involved in the transaction contain all the changes. After all, it is the responsibility of the two-phase commit protocol to ensure ACID properties of the distributed transaction, right?

But one thing bothers me: what the 2PC protocol guarantees is that after a call to commit() returns you have your transaction completed. Nowhere could I find information about any guarantees with respect to commits of individual resources, or, more specifically, about visibility of changes in one resource relative to the other resources.

But this is exactly what is going on in the application: as soon as the JMS resource transaction is committed, some combination of database/JMS code triggers the execution of the MDB's onMessage(). This code does not wait until the database resource gets its changes committed. It does not know that the changes are part of a distributed transaction. And if the database resource is slow enough, the onMessage() code might hit the database before the changes to the database resource involved in the original transaction are committed. So much for ACID!
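To make the suspected ordering concrete, here is a toy, single-threaded simulation in plain Java. Everything in it (TwoPhaseCommitRace, ToyDatabase, ToyQueue, listenerSeesRecord) is made up for illustration and has nothing to do with the actual OC4J, JMS or JDBC code. It assumes a coordinator that commits its resources one after another in phase two, and a queue that delivers to its listener the moment its own commit runs. In the real system the delivery happens concurrently, so the outcome depends on timing rather than on a flag.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

final class TwoPhaseCommitRace {

    /** Toy database resource: inserts stay invisible to readers until commit(). */
    static final class ToyDatabase {
        private final Map<String, String> committed = new HashMap<>();
        private final Map<String, String> pending = new HashMap<>();

        void insert(String key, String value) { pending.put(key, value); }
        boolean prepare() { return true; }               // phase 1: always votes yes
        void commit() { committed.putAll(pending); pending.clear(); }
        String read(String key) { return committed.get(key); }
    }

    /** Toy queue resource: hands messages to the listener as soon as it commits. */
    static final class ToyQueue {
        private final List<String> pending = new ArrayList<>();
        private final Consumer<String> listener;

        ToyQueue(Consumer<String> listener) { this.listener = listener; }

        void send(String message) { pending.add(message); }
        boolean prepare() { return true; }               // phase 1: always votes yes
        void commit() {
            List<String> toDeliver = new ArrayList<>(pending);
            pending.clear();
            toDeliver.forEach(listener);                 // plays the role of onMessage()
        }
    }

    /**
     * Runs one "distributed transaction" and reports whether the listener
     * found the inserted record when the message arrived.
     */
    static boolean listenerSeesRecord(boolean queueCommitsFirst) {
        ToyDatabase db = new ToyDatabase();
        boolean[] seen = new boolean[1];
        ToyQueue queue = new ToyQueue(key -> seen[0] = db.read(key) != null);

        // The work done inside the transaction.
        db.insert("record-1", "payload");
        queue.send("record-1");

        // Phase 1: both resources vote.
        if (!(db.prepare() && queue.prepare())) {
            throw new IllegalStateException("prepare failed");
        }

        // Phase 2: the coordinator commits the resources one after another.
        if (queueCommitsFirst) {
            queue.commit();
            db.commit();
        } else {
            db.commit();
            queue.commit();
        }
        return seen[0];
    }
}
```

In this model listenerSeesRecord(true) reports that the listener missed the record, while listenerSeesRecord(false) reports that it found it, even though both runs end with the same fully committed state. That is the whole point: the final state is consistent either way, and only the intermediate visibility differs.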

Am I right? Is this a feature of the 2PC protocol? I hope that this is caused by some misconfiguration of the OC4J J2EE stack, but I just can't see how the 2PC protocol can guarantee that all resources involved in a distributed transaction complete their individual commit()s at exactly the same time.

P.S. By the way, if you are using Hibernate in a managed environment with a datasource configured with the hibernate.connection.datasource property, do not use the hibernate.connection.username and hibernate.connection.password properties. Even if you understand the consequences and you are absolutely positively sure of what you are doing ... Just don't. Let the container manage this.
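If you want to guard against this combination creeping back into a configuration, a small check along these lines might help. It is only a sketch: the class HibernateDatasourceCheck and the method conflictingKeys are hypothetical names of mine, not part of Hibernate, and the JNDI name in the usage note is made up. The three property names are the ones mentioned above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

final class HibernateDatasourceCheck {

    /**
     * Returns the credential properties that are set alongside
     * hibernate.connection.datasource. An empty list means the
     * configuration leaves credentials to the container.
     */
    static List<String> conflictingKeys(Properties props) {
        List<String> conflicts = new ArrayList<>();
        if (props.getProperty("hibernate.connection.datasource") == null) {
            // No container-managed datasource configured: nothing to flag.
            return conflicts;
        }
        String[] credentialKeys = {
            "hibernate.connection.username",
            "hibernate.connection.password"
        };
        for (String key : credentialKeys) {
            if (props.getProperty(key) != null) {
                conflicts.add(key);
            }
        }
        return conflicts;
    }
}
```

For example, a Properties object with hibernate.connection.datasource set to a made-up name such as jdbc/AppDS and hibernate.connection.username set to anything at all would come back with that one username key flagged.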
