Wednesday, December 30, 2020

The most f*cked up piece of software: "Leaky abstractions? Never heard of it"

Note: if you came here looking for a solution for the particular problems mentioned in this post you would be disappointed: there is no solution, sorry. I advise you to read the post anyway: you might still find it useful. You can also skip to the section below where I describe some possible actions you can try: who knows, you might be lucky.

Law of Leaky Abstractions describes a very real cause of problems in software products. saaj-ri was never shy on leaking abstractions in quite a spectacular way. The fact that it managed to have two different classes that implemented DOM Document interface is really telling: you could start with a Document instance, and then invoking getDocumentElement().getOwnerDocument() would give you a different Document instance. Bugs related to this "feature" might be very entertaining to track.

But this is nothing compare to what the project achieved just in half a year in 2017...

 

After being hit by the bug described in the previous post I thought I would not have to deal with saaj-ri anymore for quite some time. I thought I just file a bug and that is it. (After I find which one of the myriads repositories for this particular code is the real owner.)

I was wrong...

For some new project we had to make sure that our soft runs in a particular environment. We had never tried that environment before so we expected some problems along the way. No big deal.

After we resolved the most obvious problems we eventually noticed that the warnings we have added to detect the saaj-ri bug described in the previous post are present in the logs in the new environment.

This is strange... I looked at the new environment closer and noticed that it included some saaj-ri jars. Thanks to magic of the service loader we were getting classes from this saaj-ri jar and not from the JDK. The environment's saaj-ri version was 1.4.1 while the JDK 8 has 1.3.<something>. Nice, we have got some bug fixes along the way for free. We can now remove the JDK patch and just patch the saaj-ri version packaged with the new environment. This looks a bit cleaner.

But before making a patch I just switched our codebase to use the same saaj-ri 1.4.1 version, and fired the tests...

Call me naïve but I expected only one failing test: the test for the bug with namespace prefixes.

But I was not disappointed, oh no. Seeing test results made me laugh. About third of the tests for a particular functionality were failing. NPEs in the third-party code, failed assertions because something expected was not found in the test messages, errors like "org.w3c.dom.DOMException: WRONG_DOCUMENT_ERR: A node is used in a different document than the one that created it", you name it.

So much for "some bug fixes along the way"!

Sit comfortably and let me tell you a fairytale:

Once upon a time there was a project, just a little project. It implemented the SAAJ specification, and to minimize amount of code it relied on the internal DOM classes of the JDK. It was probably a reasonable decision back in those days given the fact that the little project was also bound to be included in the JDK.

And so it became part of the JDK core functionality. Big mistake was it. It was added to the JDK and mostly forgotten: for years there were literally no changes.

But then JMPS (как вы яхту назовёте...) came along. The little project suddenly got a big problem: the dependency on the internal DOM classes became a liability.

Instead of saying "screw the JPMS, we just open one internal module of the JDK to the other" somebody decided to do "the right thing". Much, much bigger mistake! Especially given the fact where this stuff ended up in the latest JDK versions.

Well, the fairytale did last long, but it came to an abrupt end. The project maintainers had to do a lot of work and probably did not have a lot of time for it.

So how would you do it properly? You are not allowed to use the classes from other modules directly. You cannot inherit. You decided to wrap.

This is probably the only correct decision that the project maintainers made and I guess it is only by accident: there was no other choice. But the execution of this decision resulted in a mess and an utter nightmare.

If you implement an API by wrapping code that implements the same or similar API it is important not to allow wrapped instances to leak into the code that uses the wrappers. If you allow this it would be just too easy to pass a wrapper instance into a wrapped instance's method. Like in this case, a SOAPElement into DOM Node.appendChild() for example.

So the correct decision is "wrap everything and do not allow wrapped instance to be visible to the external code".

I would estimate that for saaj-ri it is like 2 week of work with about a week or two to implement comprehensive unit tests that can prove that no wrapped instance is ever leaked outside. Tests are really important here because DOM API allows myriads different ways to obtain one instance form another.

There is also that pesky DOMImplementation interface which would probably require some additional work but let's forget about it for the moment. It is forgotten in saaj-ri anyway.

So after those 3-4 weeks one can be reasonably sure the job is done and it is done correctly. There of course would be a thing or two in some corner parts of the API that might have been forgotten, but for the rest the code would be ready.

I do not know what the project maintainers decided to do, but what they did is just the minimal wrapping, probably stopping after they made sure the existing unit tests work.

This created an enormous "abstraction leakage": the new saaj-ri implementation was readily giving you wrapper instances and wrapped instances without any thought. Sometimes you would get a wrapper instance from one method and a wrapped instance from another quite similar method (like NS vs non-NS methods).

The project's history is really telling: the first commit for JDK9 is done on 20 January 2017 and since then whole February and March there were sporadic fixes in different parts of the wrapping code mostly by adding other wrapper classes. And it looks like that crap also survived some reviews from the JDK team and nobody raised any concerns.

There were even more fixes in that area in May again adding more wrapper classes. The fixes clearly indicate that the whole "wrap only if it is really necessary" approach was broken but nobody gave a damn.

Meanwhile this crap was released as part of JDK9, and people started to notice, with issues like WSS-605. So other projects were forced to implement workarounds which looked, honestly, quite terrible. But I cannot blame the maintainers who had to come up with such hacks: they thought (and rightly so) that they must do whatever they could to make sure their projects can be used with JDK9.

Somewhere in June-July 2017 there were more fixes in saaj-ri for the wrapping code. Again nobody said stop to that madness.

So after 5 months of "development" the result was still a mess. This big mess ended up as saaj-ri 1.4.1 release. This is what I was dealing with.

Just for kicks I decided to spend some time trying to make our tests work. Of course I could go the way WSS4J went to fix WSS-605, but as I said the fix they implemented looked terrible. Instead I decided to patch saaj-ri wherever necessary. With just couple of changes I plugged some abstraction leaking holes and got rid of WRONG_DOCUMENT_ERR and NPEs

But I still had failed assertions in our tests. It took me some time to debug them but eventually I found the reason: the wrapped instance escaped through yet another hole and some library we used navigated the wrapped document but used the wrapping element in the condition to decide if it found whatever it needed.

Some googling took me to SANTUARIO-516 where a similar problem is discussed. Following the links Icame to this issue. They "have fixed" it here

The fix? Yet another class was wrapped. I rather enjoyed the comment to the fix:

This change fixes the delegation so that element types, including attributes and text nodes, correctly identify as being part of the same owning document.
Note that whilst this fixes the consistency of some cloned elements, a proper fix is beyond what can be achieved within this change and requires a comprehensive review.

Wow! After three and a half years they finally started to suspect something is wrong. Or maybe the comment is just about the newly added wrapper class.

Of course they did not bother to set up that "comprehensive review". Or maybe they did bother and came with nothing.

Another problem with their fix is that it is done only to the saaj-ri version which is based on the new "jakarta" saaj interfaces (jakarta.xml.soap.*). Tough luck if you need it for the previous version (javax.xml.soap.*)

I reapplied this fix to 1.4.1 version where I did my changes so far. It fixed some of our tests but some others that were OK before this change started to fail with WRONG_DOCUMENT_ERR. Splendid work they did: they fixed one place where it leaked and introduced another leak.

This "fix" yet again demonstrates how f*cked up this project is.

I had to track and change more places where they did not bother with the proper wrapping or unwrapping. Our tests finally passed, except for one I was originally expecting not to pass, for the issue described in the previous post.

Phew...

I even went as far as patched WSS4J and removed their WSS-605 fix, and the tests were still OK. This just proves that whatever is done in other projects to work around saaj-ri issues were not really necessary if the saaj-ri developers did their job properly in the first place.

After all this work to make the tests pass we are still faced with the question: what must we do with our project w.r.t. saaj-ri?

The answer is clear: get rid of it, and now!

We will not use my fixes in our project. It is clear that saaj-ri 1.4+ is so broken that we do not have any guarantee it does not fail in some other use case which we just happen not to test or even in the tested scenarios but with slightly different SOAP message. Given the way some of these errors are manifested it would be nightmare to debug them in production.

But getting rid of saaj-ri will take some time so right now we just switch back to saaj-ri 1.3 packaged with the JDK8: we still have this possibility in this new environment.

 

If you came here looking for a solution for a problem with saaj-ri, well, I can only say you have a big problem. The best solution is to get rid of saaj dependency. I understand it is not always possible but this is definitely a way to go.

If you are still on JDK8 and got saaj-ri 1.4 dependency by accident try to sitch to the JDK saaj-ri version.

You can try to find another saaj implementation and hope it does not have any quirks.

In some cases you can work around the problems by staying "pure DOM" as long as possible and convert you DOM to SOAP at the very end. Or the other way around, get hold of the wrapped DOM as soon as possible and just continue with DOM.

If your code manipulates SOAP documents via saaj try to change it to use DOM API and then apply what I have written above.

But if you have some third-party dependency that uses saaj and not DOM then, I am sorry, you are doomed.

I am not going to create tickets for saaj-ri for the problems I have found. It is clear that the project maintainers are changing the code to hide the symptoms and are not going to fix the whole thing once and for all. And I am not going to create a separate ticket for each problem I have identified. There are just too many of them.

No way I am going to submit my changes back to saaj-ri. Again, these changes have fixed some issues but I do not have any idea what is not fixed. I do not have time to do it properly. Last but not least I do not want to have my name associated with this pile of crap.

I am not going to create a ticket for the problem with namespace prefixes described in the previous post: why bother?

In fact the only ticket I would like to create is for closing saaj-ri project for good but no doubt such a ticket will not be accepted. And it is probably a good thing: this project is a chance to keep some very talented people busy at least part of their time. This way they have less time to apply their expertise in other projects. Every little thing counts...

No comments:

Post a Comment