Wednesday, December 30, 2020

The most f*cked up piece of software: "Leaky abstractions? Never heard of it"

Note: if you came here looking for a solution for the particular problems mentioned in this post you would be disappointed: there is no solution, sorry. I advise you to read the post anyway: you might still find it useful. You can also skip to the section below where I describe some possible actions you can try: who knows, you might be lucky.

Law of Leaky Abstractions describes a very real cause of problems in software products. saaj-ri was never shy on leaking abstractions in quite a spectacular way. The fact that it managed to have two different classes that implemented DOM Document interface is really telling: you could start with a Document instance, and then invoking getDocumentElement().getOwnerDocument() would give you a different Document instance. Bugs related to this "feature" might be very entertaining to track.

But this is nothing compare to what the project achieved just in half a year in 2017...

 

After being hit by the bug described in the previous post I thought I would not have to deal with saaj-ri anymore for quite some time. I thought I just file a bug and that is it. (After I find which one of the myriads repositories for this particular code is the real owner.)

I was wrong...

For some new project we had to make sure that our soft runs in a particular environment. We had never tried that environment before so we expected some problems along the way. No big deal.

After we resolved the most obvious problems we eventually noticed that the warnings we have added to detect the saaj-ri bug described in the previous post are present in the logs in the new environment.

This is strange... I looked at the new environment closer and noticed that it included some saaj-ri jars. Thanks to magic of the service loader we were getting classes from this saaj-ri jar and not from the JDK. The environment's saaj-ri version was 1.4.1 while the JDK 8 has 1.3.<something>. Nice, we have got some bug fixes along the way for free. We can now remove the JDK patch and just patch the saaj-ri version packaged with the new environment. This looks a bit cleaner.

But before making a patch I just switched our codebase to use the same saaj-ri 1.4.1 version, and fired the tests...

Call me naïve but I expected only one failing test: the test for the bug with namespace prefixes.

But I was not disappointed, oh no. Seeing test results made me laugh. About third of the tests for a particular functionality were failing. NPEs in the third-party code, failed assertions because something expected was not found in the test messages, errors like "org.w3c.dom.DOMException: WRONG_DOCUMENT_ERR: A node is used in a different document than the one that created it", you name it.

So much for "some bug fixes along the way"!

Sit comfortably and let me tell you a fairytale:

Once upon a time there was a project, just a little project. It implemented the SAAJ specification, and to minimize amount of code it relied on the internal DOM classes of the JDK. It was probably a reasonable decision back in those days given the fact that the little project was also bound to be included in the JDK.

And so it became part of the JDK core functionality. Big mistake was it. It was added to the JDK and mostly forgotten: for years there were literally no changes.

But then JMPS (как вы яхту назовёте...) came along. The little project suddenly got a big problem: the dependency on the internal DOM classes became a liability.

Instead of saying "screw the JPMS, we just open one internal module of the JDK to the other" somebody decided to do "the right thing". Much, much bigger mistake! Especially given the fact where this stuff ended up in the latest JDK versions.

Well, the fairytale did last long, but it came to an abrupt end. The project maintainers had to do a lot of work and probably did not have a lot of time for it.

So how would you do it properly? You are not allowed to use the classes from other modules directly. You cannot inherit. You decided to wrap.

This is probably the only correct decision that the project maintainers made and I guess it is only by accident: there was no other choice. But the execution of this decision resulted in a mess and an utter nightmare.

If you implement an API by wrapping code that implements the same or similar API it is important not to allow wrapped instances to leak into the code that uses the wrappers. If you allow this it would be just too easy to pass a wrapper instance into a wrapped instance's method. Like in this case, a SOAPElement into DOM Node.appendChild() for example.

So the correct decision is "wrap everything and do not allow wrapped instance to be visible to the external code".

I would estimate that for saaj-ri it is like 2 week of work with about a week or two to implement comprehensive unit tests that can prove that no wrapped instance is ever leaked outside. Tests are really important here because DOM API allows myriads different ways to obtain one instance form another.

There is also that pesky DOMImplementation interface which would probably require some additional work but let's forget about it for the moment. It is forgotten in saaj-ri anyway.

So after those 3-4 weeks one can be reasonably sure the job is done and it is done correctly. There of course would be a thing or two in some corner parts of the API that might have been forgotten, but for the rest the code would be ready.

I do not know what the project maintainers decided to do, but what they did is just the minimal wrapping, probably stopping after they made sure the existing unit tests work.

This created an enormous "abstraction leakage": the new saaj-ri implementation was readily giving you wrapper instances and wrapped instances without any thought. Sometimes you would get a wrapper instance from one method and a wrapped instance from another quite similar method (like NS vs non-NS methods).

The project's history is really telling: the first commit for JDK9 is done on 20 January 2017 and since then whole February and March there were sporadic fixes in different parts of the wrapping code mostly by adding other wrapper classes. And it looks like that crap also survived some reviews from the JDK team and nobody raised any concerns.

There were even more fixes in that area in May again adding more wrapper classes. The fixes clearly indicate that the whole "wrap only if it is really necessary" approach was broken but nobody gave a damn.

Meanwhile this crap was released as part of JDK9, and people started to notice, with issues like WSS-605. So other projects were forced to implement workarounds which looked, honestly, quite terrible. But I cannot blame the maintainers who had to come up with such hacks: they thought (and rightly so) that they must do whatever they could to make sure their projects can be used with JDK9.

Somewhere in June-July 2017 there were more fixes in saaj-ri for the wrapping code. Again nobody said stop to that madness.

So after 5 months of "development" the result was still a mess. This big mess ended up as saaj-ri 1.4.1 release. This is what I was dealing with.

Just for kicks I decided to spend some time trying to make our tests work. Of course I could go the way WSS4J went to fix WSS-605, but as I said the fix they implemented looked terrible. Instead I decided to patch saaj-ri wherever necessary. With just couple of changes I plugged some abstraction leaking holes and got rid of WRONG_DOCUMENT_ERR and NPEs

But I still had failed assertions in our tests. It took me some time to debug them but eventually I found the reason: the wrapped instance escaped through yet another hole and some library we used navigated the wrapped document but used the wrapping element in the condition to decide if it found whatever it needed.

Some googling took me to SANTUARIO-516 where a similar problem is discussed. Following the links Icame to this issue. They "have fixed" it here

The fix? Yet another class was wrapped. I rather enjoyed the comment to the fix:

This change fixes the delegation so that element types, including attributes and text nodes, correctly identify as being part of the same owning document.
Note that whilst this fixes the consistency of some cloned elements, a proper fix is beyond what can be achieved within this change and requires a comprehensive review.

Wow! After three and a half years they finally started to suspect something is wrong. Or maybe the comment is just about the newly added wrapper class.

Of course they did not bother to set up that "comprehensive review". Or maybe they did bother and came with nothing.

Another problem with their fix is that it is done only to the saaj-ri version which is based on the new "jakarta" saaj interfaces (jakarta.xml.soap.*). Tough luck if you need it for the previous version (javax.xml.soap.*)

I reapplied this fix to 1.4.1 version where I did my changes so far. It fixed some of our tests but some others that were OK before this change started to fail with WRONG_DOCUMENT_ERR. Splendid work they did: they fixed one place where it leaked and introduced another leak.

This "fix" yet again demonstrates how f*cked up this project is.

I had to track and change more places where they did not bother with the proper wrapping or unwrapping. Our tests finally passed, except for one I was originally expecting not to pass, for the issue described in the previous post.

Phew...

I even went as far as patched WSS4J and removed their WSS-605 fix, and the tests were still OK. This just proves that whatever is done in other projects to work around saaj-ri issues were not really necessary if the saaj-ri developers did their job properly in the first place.

After all this work to make the tests pass we are still faced with the question: what must we do with our project w.r.t. saaj-ri?

The answer is clear: get rid of it, and now!

We will not use my fixes in our project. It is clear that saaj-ri 1.4+ is so broken that we do not have any guarantee it does not fail in some other use case which we just happen not to test or even in the tested scenarios but with slightly different SOAP message. Given the way some of these errors are manifested it would be nightmare to debug them in production.

But getting rid of saaj-ri will take some time so right now we just switch back to saaj-ri 1.3 packaged with the JDK8: we still have this possibility in this new environment.

 

If you came here looking for a solution for a problem with saaj-ri, well, I can only say you have a big problem. The best solution is to get rid of saaj dependency. I understand it is not always possible but this is definitely a way to go.

If you are still on JDK8 and got saaj-ri 1.4 dependency by accident try to sitch to the JDK saaj-ri version.

You can try to find another saaj implementation and hope it does not have any quirks.

In some cases you can work around the problems by staying "pure DOM" as long as possible and convert you DOM to SOAP at the very end. Or the other way around, get hold of the wrapped DOM as soon as possible and just continue with DOM.

If your code manipulates SOAP documents via saaj try to change it to use DOM API and then apply what I have written above.

But if you have some third-party dependency that uses saaj and not DOM then, I am sorry, you are doomed.

I am not going to create tickets for saaj-ri for the problems I have found. It is clear that the project maintainers are changing the code to hide the symptoms and are not going to fix the whole thing once and for all. And I am not going to create a separate ticket for each problem I have identified. There are just too many of them.

No way I am going to submit my changes back to saaj-ri. Again, these changes have fixed some issues but I do not have any idea what is not fixed. I do not have time to do it properly. Last but not least I do not want to have my name associated with this pile of crap.

I am not going to create a ticket for the problem with namespace prefixes described in the previous post: why bother?

In fact the only ticket I would like to create is for closing saaj-ri project for good but no doubt such a ticket will not be accepted. And it is probably a good thing: this project is a chance to keep some very talented people busy at least part of their time. This way they have less time to apply their expertise in other projects. Every little thing counts...

Sunday, December 27, 2020

The most f*cked up piece of software: "no worries, we know better how you want your SOAP message parsed"

Well, as I wrote at the beginning of the previous post: at the time I had encountered the problem I did not think it would be really interesting to blog about it. So I did not.

But about one month ago we got a new business partner. We use SOAP (and in our case: still use ssaj-ri) to communicate with this partner. Messages received from that partner were failing some complex preprocessing step. Needless to say we did not have any problems with messages sent by other partners.

First we ignored the problem because this was a new partner and their messages were not compliant with some other part of the specification we relied on. We even had to make some changes to adapt our code to detect and accept their peculiar view on the specification. And at first the business partner did not send a lot of messages so we even did not notice the problem.

Eventually we noticed that all messages from that partner are failing a particular preprocessing step. When I looked at the problem I could not understand the reason for the failure. Basically, the message had several parts, each part went through the same preprocessing step, and all but one part were preprocessed successfully. One always failed.

I dumped the message before the preprocessing step and then preprocessed it separately, and it failed exactly the same way as in our product.

I was prepared to blame the business partner. After all, we did have a problem with them not following the specification. So I thought it is just another place where they do something incorrectly.

But I knew that the preprocessing library we use also has its issues. I had found and reported one some time ago, and it was confirmed and fixed.

So I started playing directly with this library trying to make it accept the message, but nothing helped.

Then I decided to look at the message as we receive it. I dumped it before it went to the SOAP parser and tried to preprocess it separately. Again, it failed the same way.

OK, case is closed: the business partner is to blame?

But then I looked at the dumped messages. I mean, really looked, with my eyes. And I could not believe what I have seen. I added a unit test that just parsed the "bytes as dumped before the saaj-ri SOAP parser" and immediately serialized the parsed message. And the parsed-then-serialized message differed from the original one.

This is the original message:

<Envelope xmlns="http://www.w3.org/2003/05/soap-envelope">
<Header ...>
...
<Body ...>
...

And this is the parsed-then-serialized message:

<env:Envelope xmlns:env="http://www.w3.org/2003/05/soap-envelope">
<env:Header ...>
...
<env:Body ...>
...

See those "env:" namespace prefixes? When I saw this ... let's put it this way ... I expressed my deep concerns about the state of the software development in general and in this project in particular. I wished all the best to the developer who is responsible for this particular behavior, and to all his relatives from 10 generations before. I was amazed by the state of cognitive problems the said developer must have suffered throughout her or his life.

Needless to say parsing the same raw data with a DOM parser and then preprocessing it suddenly fixed the problem we were having.

I found the code responsible for this fubar almost immediately. This is it: NameImpl.java

Just look at it! Does it not call for even more best wishes to the developer who is responsible for this ... pile? It certainly forced me to do it.

The code looks like it was created by some junior developer who had just came across this wonderful feature of java: static nested classes. And the developer had gone full throttle into using this feature.

Do not get me wrong: I do not have anything against junior developers. Of course I went through this myself. In many ways I am probably still a junior developer. But this is why we have things like code reviews and people like mentors or team leaders or whatever.

This code should have been cleaned up right in the next commit after it was added.

Actually I digress. Normally I would not even blink at this kind of code in a 3rd party dependency: life is too short for that. But having wasted quite some time on this problem I thought I have all the right to rant about it.

The gem of this problem lies in all those wonderful classes like SOAP1_1Name which explicitly check for an empty namespace prefix and "helpfully" replace it with some hardcoded predefined value.

What on earth whoever had added these checks was thinking?! "Oh, look, empty prefix. It is probably some kind of mistake, so let's fix it."? Had this person expected others would thank him? How about a medal? I think something like a slap on the hands with an industrial-grade keyboard would be more appropriate.

It looks like we have to give more priority to the task of moving from saaj-ri.

It is jut bizarre how much low quality code ends up in the core tools ...

For the moment we have just patched this class. Since we still use JDK 8 in most cases, it means we had to patch the version that is packaged with the JDK and use -Xbootclasspath/p JVM option.

And we have added some startup warning to detect that we are running in an environment without the patch being active.

Saturday, December 26, 2020

The most f*cked up piece of software ...

... that I have seen so far. Surprisingly it is not JBoss.

I know I did not blog for a long time. It is not like I did not have any fun with Oracle, JBoss, JDK, or some other woefully broken code. It is just that most problems I come across look pretty mundane. Or they require quite an in-depth write up and I feel like it is stupid to spend a lot of time on describing something after I already have wasted hours if not days to understand and fix.

Case in point: I first ran into a problem with a particular functionality almost 3 years ago. The problem was reproducible only under some specific circumstances so it took some time to find the root cause. When I found it I did not think it would be really interesting to blog about it. After all, I had already encountered much more interesting problems in software coming from the same group of developers. So I came with a workaround which helped in a lot of cases, and I decided that we need to get rid of this dependency altogether as soon as possible.

Unfortunately other tasks with higher priority prevented me from completing this task, and we still have this dependency. On the other hand had I completed that task I would not have recently a "pleasure" wasting literally days on tracking some other problems that were caused by the same software.

I decided that the sheer variance of problems in one not particularly complicated project is really worth blogging about.

Meet the "the most fucked up piece of software du jour": SAAJ reference implementation. I did not really trace its pedigree as I did not care about what particular shithole "incubator of promising projects" it crept out, but it is present under prefix "metro" in a lot of code repositories. Names of some authors in the source code of saaj-ri match names of authors in other metro-related projects like WSIT. So I am not amused.

What made all the problems worse is that Sun started to include metro subprojects in JDK as core functionality. (As if the JDK code was perfect and it needed some low quality code for balance.) As a result a lot of broken code ended up being used by a lot of developers in a lot of projects. This all made fixing bugs more difficult: even if they were fixed as part of the metro projects the fixes were seldom included in the JDK.

Oracle continued this tradition, at least up to JDK 11, when they finally removed a bunch of stuff, including SAAJ, from the JDK. But before that they made some extremely bad choices which resulted in saaj-ri being broken beyond anything imaginable.

saaj-ri is now part of Eclipse EE4J and looking at its git repository I can only conclude that the project decay is evident. Not that I care.

But saaj-ri is still included in a lot of places and a lot of developers are still using it, directly or indirectly. I can only wish good luck to them.

Well, enough ranting. Time to show the code. Let's start with the problem I encountered 3 years ago.

Part of our application was processing SOAP messages using saaj packaged with the JDK. We noticed that under some circumstances processing of several SOAP messages in parallel was almost grinding to a halt. After taking some thread dumps and analyzing them we found this wonderful piece of code in saaj classes.

This is the code from the saaj-ri. The code in the JDK was under a different package name but for the rest it the same or very close. See line 91:

private static ParserPool parserPool = new ParserPool(5);

You can enjoy the source code of ParserPool here, although it is not interesting.

In the current JDK8 EnvelopeFactory is a bit different: it has a classloader-bound pool. What a relief!

What does it mean basically? You can parse at most 5 SOAP messages in parallel [from the code loaded by the same classloader]. If for some reason the parsing takes a lot of time you have a serious problem.

This code raises many questions like why the hell they need the pool at all? OK, this might have been important back when saaj-ri was initially developed but why this code is still present now? Why only 5 instances?

To add insult to injury this code has also a resource leak. In some circumstances you might end up with an empty parser pool: some exceptions can be thrown after a parser was retrieved from the pool. In that case the parser is not returned in the pool.

But do not despair! Good developers of saaj-ri are ready to combat the problems. They have actually fixed the resource leak somewhere in 1.4.X version.

And they added a way to increase the pool size! Wow - now you have it configurable. (As usual, with a JDK system property, once and for everybody, but who am I to point to a better way?)