
Wednesday, December 30, 2020

The most f*cked up piece of software: "Leaky abstractions? Never heard of it"

Note: if you came here looking for a solution for the particular problems mentioned in this post, you will be disappointed: there is no solution, sorry. I advise you to read the post anyway: you might still find it useful. You can also skip to the section below where I describe some possible actions you can try: who knows, you might be lucky.

The Law of Leaky Abstractions describes a very real cause of problems in software products. saaj-ri was never shy about leaking abstractions in quite a spectacular way. The fact that it managed to have two different classes that implemented the DOM Document interface is really telling: you could start with a Document instance, and then invoking getDocumentElement().getOwnerDocument() would give you a different Document instance. Bugs related to this "feature" can be very entertaining to track.
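Here is a minimal sketch of that sanity check (my illustration, not saaj-ri code); whether the two prints actually show two different Document implementations or instances depends on the saaj-ri version you end up with:

import javax.xml.soap.MessageFactory;
import javax.xml.soap.SOAPMessage;
import javax.xml.soap.SOAPPart;

public class TwoDocuments {
    public static void main(String[] args) throws Exception {
        SOAPMessage msg = MessageFactory.newInstance().createMessage();
        SOAPPart part = msg.getSOAPPart();   // a SOAPPart is an org.w3c.dom.Document
        org.w3c.dom.Document owner = part.getDocumentElement().getOwnerDocument();
        System.out.println(part.getClass().getName());
        System.out.println(owner.getClass().getName());
        System.out.println("same instance? " + (part == owner));
    }
}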

But this is nothing compared to what the project achieved in just half a year in 2017...

 

After being hit by the bug described in the previous post I thought I would not have to deal with saaj-ri anymore for quite some time. I thought I would just file a bug and that would be it. (After I found out which one of the myriad repositories for this particular code is the real owner.)

I was wrong...

For a new project we had to make sure that our software runs in a particular environment. We had never tried that environment before so we expected some problems along the way. No big deal.

After we resolved the most obvious problems we eventually noticed that the warnings we had added to detect the saaj-ri bug described in the previous post were present in the logs in the new environment.

This was strange... I looked at the new environment closer and noticed that it included some saaj-ri jars. Thanks to the magic of the service loader we were getting classes from this saaj-ri jar and not from the JDK. The environment's saaj-ri version was 1.4.1 while JDK 8 has 1.3.<something>. Nice, we got some bug fixes along the way for free. We could now remove the JDK patch and just patch the saaj-ri version packaged with the new environment. That looks a bit cleaner.
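If you want to check which SAAJ implementation the factory lookup actually hands you at runtime, a couple of lines are enough; the class name and the code source tell you whether it is the JDK copy or a saaj-ri jar that sneaked onto the classpath:

import javax.xml.soap.MessageFactory;

public class WhichSaaj {
    public static void main(String[] args) throws Exception {
        Class<?> impl = MessageFactory.newInstance().getClass();
        System.out.println(impl.getName());
        // null for classes loaded from the JDK itself, a jar URL otherwise
        System.out.println(impl.getProtectionDomain().getCodeSource());
    }
}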

But before making a patch I just switched our codebase to use the same saaj-ri 1.4.1 version, and fired the tests...

Call me naïve but I expected only one failing test: the test for the bug with namespace prefixes.

But I was not disappointed, oh no. Seeing the test results made me laugh. About a third of the tests for a particular functionality were failing. NPEs in third-party code, failed assertions because something expected was not found in the test messages, errors like "org.w3c.dom.DOMException: WRONG_DOCUMENT_ERR: A node is used in a different document than the one that created it", you name it.

So much for "some bug fixes along the way"!

Sit comfortably and let me tell you a fairytale:

Once upon a time there was a project, just a little project. It implemented the SAAJ specification, and to minimize the amount of code it relied on the internal DOM classes of the JDK. It was probably a reasonable decision back in those days given the fact that the little project was also bound to be included in the JDK.

And so it became part of the JDK core functionality. A big mistake it was. It was added to the JDK and mostly forgotten: for years there were literally no changes.

But then JPMS ("as you name the boat, so shall it sail...") came along. The little project suddenly got a big problem: the dependency on the internal DOM classes became a liability.

Instead of saying "screw the JPMS, we just open one internal module of the JDK to the other" somebody decided to do "the right thing". A much, much bigger mistake! Especially given where this stuff ended up in the latest JDK versions.

Well, the fairytale lasted long, but it came to an abrupt end. The project maintainers had to do a lot of work and probably did not have a lot of time for it.

So how would you do it properly? You are not allowed to use the classes from other modules directly. You cannot inherit. You decide to wrap.

This is probably the only correct decision that the project maintainers made and I guess it is only by accident: there was no other choice. But the execution of this decision resulted in a mess and an utter nightmare.

If you implement an API by wrapping code that implements the same or a similar API, it is important not to allow wrapped instances to leak into the code that uses the wrappers. If you allow this, it becomes just too easy to pass a wrapper instance into a wrapped instance's method. Like in this case, a SOAPElement into DOM Node.appendChild(), for example.

So the correct decision is "wrap everything and do not allow wrapped instances to be visible to the external code".
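To see why a leak is fatal, here is a plain-DOM sketch (my illustration, nothing saaj-specific) of what happens when a node created by one Document ends up in another, which is exactly the mix-up a leaked wrapped instance produces:

import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class WrongDocumentDemo {
    public static void main(String[] args) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        Document docA = dbf.newDocumentBuilder().newDocument();
        Document docB = dbf.newDocumentBuilder().newDocument();

        Element root = docA.createElement("root");
        docA.appendChild(root);

        // A node owned by docB cannot be inserted into docA:
        // throws org.w3c.dom.DOMException: WRONG_DOCUMENT_ERR
        root.appendChild(docB.createElement("stray"));
    }
}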

I would estimate that for saaj-ri this is about 2 weeks of work, with another week or two to implement comprehensive unit tests that can prove that no wrapped instance is ever leaked outside. Tests are really important here because the DOM API allows myriad different ways to obtain one instance from another.

There is also that pesky DOMImplementation interface which would probably require some additional work but let's forget about it for the moment. It is forgotten in saaj-ri anyway.

So after those 3-4 weeks one can be reasonably sure the job is done and it is done correctly. There of course would be a thing or two in some corner parts of the API that might have been forgotten, but for the rest the code would be ready.

I do not know what the project maintainers decided to do, but what they did is just the minimal wrapping, probably stopping after they made sure the existing unit tests work.

This created an enormous "abstraction leakage": the new saaj-ri implementation was readily giving you wrapper instances and wrapped instances without any thought. Sometimes you would get a wrapper instance from one method and a wrapped instance from another quite similar method (like NS vs non-NS methods).

The project's history is really telling: the first commit for JDK9 was made on 20 January 2017, and then throughout February and March there were sporadic fixes in different parts of the wrapping code, mostly adding more wrapper classes. And it looks like that crap also survived some reviews from the JDK team and nobody raised any concerns.

There were even more fixes in that area in May again adding more wrapper classes. The fixes clearly indicate that the whole "wrap only if it is really necessary" approach was broken but nobody gave a damn.

Meanwhile this crap was released as part of JDK9, and people started to notice, with issues like WSS-605. So other projects were forced to implement workarounds which looked, honestly, quite terrible. But I cannot blame the maintainers who had to come up with such hacks: they thought (and rightly so) that they must do whatever they could to make sure their projects can be used with JDK9.

Somewhere in June-July 2017 there were more fixes in saaj-ri for the wrapping code. Again nobody said stop to that madness.

So after 5 months of "development" the result was still a mess. This big mess ended up as saaj-ri 1.4.1 release. This is what I was dealing with.

Just for kicks I decided to spend some time trying to make our tests work. Of course I could go the way WSS4J went to fix WSS-605, but as I said the fix they implemented looked terrible. Instead I decided to patch saaj-ri wherever necessary. With just a couple of changes I plugged some abstraction-leaking holes and got rid of the WRONG_DOCUMENT_ERR and the NPEs.

But I still had failed assertions in our tests. It took me some time to debug them but eventually I found the reason: the wrapped instance escaped through yet another hole and some library we used navigated the wrapped document but used the wrapping element in the condition to decide if it found whatever it needed.

Some googling took me to SANTUARIO-516 where a similar problem is discussed. Following the links I came to this issue. They "have fixed" it here.

The fix? Yet another class was wrapped. I rather enjoyed the comment to the fix:

This change fixes the delegation so that element types, including attributes and text nodes, correctly identify as being part of the same owning document.
Note that whilst this fixes the consistency of some cloned elements, a proper fix is beyond what can be achieved within this change and requires a comprehensive review.

Wow! After three and a half years they finally started to suspect something is wrong. Or maybe the comment is just about the newly added wrapper class.

Of course they did not bother to set up that "comprehensive review". Or maybe they did bother and came up with nothing.

Another problem with their fix is that it was applied only to the saaj-ri version which is based on the new "jakarta" saaj interfaces (jakarta.xml.soap.*). Tough luck if you need it for the previous version (javax.xml.soap.*).

I reapplied this fix to the 1.4.1 version where I had made my changes so far. It fixed some of our tests, but some others that were OK before this change started to fail with WRONG_DOCUMENT_ERR. Splendid work they did: they fixed one place where it leaked and introduced another leak.

This "fix" yet again demonstrates how f*cked up this project is.

I had to track and change more places where they did not bother with the proper wrapping or unwrapping. Our tests finally passed, except for one I was originally expecting not to pass, for the issue described in the previous post.

Phew...

I even went as far as patching WSS4J and removing their WSS-605 fix, and the tests were still OK. This just proves that whatever was done in other projects to work around saaj-ri issues would not have been necessary had the saaj-ri developers done their job properly in the first place.

After all this work to make the tests pass we are still faced with the question: what must we do with our project w.r.t. saaj-ri?

The answer is clear: get rid of it, and now!

We will not use my fixes in our project. It is clear that saaj-ri 1.4+ is so broken that we do not have any guarantee it does not fail in some other use case which we just happen not to test, or even in the tested scenarios but with a slightly different SOAP message. Given the way some of these errors manifest themselves it would be a nightmare to debug them in production.

But getting rid of saaj-ri will take some time, so right now we just switch back to the saaj-ri 1.3 packaged with JDK 8: we still have this possibility in the new environment.

 

If you came here looking for a solution for a problem with saaj-ri, well, I can only say you have a big problem. The best solution is to get rid of saaj dependency. I understand it is not always possible but this is definitely a way to go.

If you are still on JDK 8 and got the saaj-ri 1.4 dependency by accident, try to switch to the JDK saaj-ri version.

You can try to find another saaj implementation and hope it does not have any quirks.

In some cases you can work around the problems by staying "pure DOM" as long as possible and converting your DOM to SOAP at the very end. Or the other way around: get hold of the wrapped DOM as soon as possible and just continue with DOM.
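For the first variant, a minimal sketch (the helper that builds your envelope DOM is hypothetical, standing in for whatever your code does): do all the work with plain DOM and only hand the finished document to SAAJ right before sending.

import javax.xml.soap.MessageFactory;
import javax.xml.soap.SOAPConstants;
import javax.xml.soap.SOAPMessage;
import javax.xml.transform.dom.DOMSource;
import org.w3c.dom.Document;

public class DomFirst {
    static SOAPMessage toSoap(Document envelopeDom) throws Exception {
        SOAPMessage message = MessageFactory
                .newInstance(SOAPConstants.SOAP_1_2_PROTOCOL).createMessage();
        // SOAPPart accepts a ready-made DOM envelope; SAAJ is touched only here
        message.getSOAPPart().setContent(new DOMSource(envelopeDom));
        message.saveChanges();
        return message;
    }
}

For the second variant, remember that SOAPPart itself implements org.w3c.dom.Document, so message.getSOAPPart() already gives you a DOM document to continue with.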

If your code manipulates SOAP documents via saaj try to change it to use DOM API and then apply what I have written above.

But if you have some third-party dependency that uses saaj and not DOM then, I am sorry, you are doomed.

I am not going to create tickets for saaj-ri for the problems I have found. It is clear that the project maintainers are changing the code to hide the symptoms and are not going to fix the whole thing once and for all. And I am not going to create a separate ticket for each problem I have identified. There are just too many of them.

No way I am going to submit my changes back to saaj-ri. Again, these changes have fixed some issues but I do not have any idea what is not fixed. I do not have time to do it properly. Last but not least I do not want to have my name associated with this pile of crap.

I am not going to create a ticket for the problem with namespace prefixes described in the previous post: why bother?

In fact the only ticket I would like to create is for closing saaj-ri project for good but no doubt such a ticket will not be accepted. And it is probably a good thing: this project is a chance to keep some very talented people busy at least part of their time. This way they have less time to apply their expertise in other projects. Every little thing counts...

Sunday, December 27, 2020

The most f*cked up piece of software: "no worries, we know better how you want your SOAP message parsed"

Well, as I wrote at the beginning of the previous post: at the time I had encountered the problem I did not think it would be really interesting to blog about it. So I did not.

But about one month ago we got a new business partner. We use SOAP (and in our case: still use saaj-ri) to communicate with this partner. Messages received from that partner were failing some complex preprocessing step. Needless to say we did not have any problems with messages sent by other partners.

First we ignored the problem because this was a new partner and their messages were not compliant with some other part of the specification we relied on. We even had to make some changes to adapt our code to detect and accept their peculiar view of the specification. And at first the business partner did not send a lot of messages, so we did not even notice the problem.

Eventually we noticed that all messages from that partner were failing a particular preprocessing step. When I looked at the problem I could not understand the reason for the failure. Basically, the message had several parts, each part went through the same preprocessing step, and all but one part were preprocessed successfully. One always failed.

I dumped the message before the preprocessing step and then preprocessed it separately, and it failed exactly the same way as in our product.

I was prepared to blame the business partner. After all, we did have a problem with them not following the specification. So I thought it was just another place where they did something incorrectly.

But I knew that the preprocessing library we use also has its issues. I had found and reported one some time ago, and it was confirmed and fixed.

So I started playing directly with this library trying to make it accept the message, but nothing helped.

Then I decided to look at the message as we receive it. I dumped it before it went to the SOAP parser and tried to preprocess it separately. Again, it failed the same way.

OK, case is closed: the business partner is to blame?

But then I looked at the dumped messages. I mean, really looked, with my eyes. And I could not believe what I saw. I added a unit test that just parsed the "bytes as dumped before the saaj-ri SOAP parser" and immediately serialized the parsed message. And the parsed-then-serialized message differed from the original one.
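The round-trip check is trivial to reproduce; here is a rough sketch (the file name and content type are placeholders for whatever you dumped):

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import javax.xml.soap.MessageFactory;
import javax.xml.soap.MimeHeaders;
import javax.xml.soap.SOAPConstants;
import javax.xml.soap.SOAPMessage;

public class RoundTrip {
    public static void main(String[] args) throws Exception {
        byte[] original = Files.readAllBytes(Paths.get("dumped-message.xml"));
        MimeHeaders headers = new MimeHeaders();
        headers.addHeader("Content-Type", "application/soap+xml; charset=UTF-8");
        SOAPMessage msg = MessageFactory.newInstance(SOAPConstants.SOAP_1_2_PROTOCOL)
                .createMessage(headers, new ByteArrayInputStream(original));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        msg.writeTo(out);
        // compare by eye: the serialized copy suddenly has env: prefixes everywhere
        System.out.println(out.toString("UTF-8"));
    }
}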

This is the original message:

<Envelope xmlns="http://www.w3.org/2003/05/soap-envelope">
<Header ...>
...
<Body ...>
...

And this is the parsed-then-serialized message:

<env:Envelope xmlns:env="http://www.w3.org/2003/05/soap-envelope">
<env:Header ...>
...
<env:Body ...>
...

See those "env:" namespace prefixes? When I saw this ... let's put it this way ... I expressed my deep concerns about the state of the software development in general and in this project in particular. I wished all the best to the developer who is responsible for this particular behavior, and to all his relatives from 10 generations before. I was amazed by the state of cognitive problems the said developer must have suffered throughout her or his life.

Needless to say parsing the same raw data with a DOM parser and then preprocessing it suddenly fixed the problem we were having.

I found the code responsible for this fubar almost immediately. This is it: NameImpl.java

Just look at it! Does it not call for even more best wishes to the developer who is responsible for this ... pile? It certainly forced me to do it.

The code looks like it was created by some junior developer who had just come across this wonderful feature of Java: static nested classes. And the developer went full throttle into using this feature.

Do not get me wrong: I do not have anything against junior developers. Of course I went through this myself. In many ways I am probably still a junior developer. But this is why we have things like code reviews and people like mentors or team leaders or whatever.

This code should have been cleaned up right in the next commit after it was added.

Actually, I digress. Normally I would not even blink at this kind of code in a 3rd party dependency: life is too short for that. But having wasted quite some time on this problem I thought I had every right to rant about it.

The gem of this problem lies in all those wonderful classes like SOAP1_1Name which explicitly check for an empty namespace prefix and "helpfully" replace it with some hardcoded predefined value.
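Roughly, the check amounts to something like this (a simplified reconstruction of what the post describes, mine, not the verbatim saaj-ri source; the class and method names are made up):

class PrefixGuess {
    // If the caller did not specify a prefix, "helpfully" invent one.
    static String prefixFor(String prefix, String namespaceUri) {
        if ((prefix == null || prefix.isEmpty())
                && "http://www.w3.org/2003/05/soap-envelope".equals(namespaceUri)) {
            return "env";   // hardcoded default, silently rewriting the message
        }
        return prefix;
    }
}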

What on earth was whoever added these checks thinking?! "Oh, look, an empty prefix. It is probably some kind of mistake, so let's fix it."? Did this person expect others to thank him? How about a medal? I think something like a slap on the hands with an industrial-grade keyboard would be more appropriate.

It looks like we have to give more priority to the task of moving away from saaj-ri.

It is just bizarre how much low quality code ends up in the core tools...

For the moment we have just patched this class. Since we still use JDK 8 in most cases, this means we had to patch the version that is packaged with the JDK and use the -Xbootclasspath/p JVM option.

And we have added a startup warning to detect that we are running in an environment where the patch is not active.
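The check itself can be as dumb as a parse-then-serialize probe (a sketch of the idea, not our actual code): if the prefix gets rewritten, the patch is not in effect.

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.soap.MessageFactory;
import javax.xml.soap.MimeHeaders;
import javax.xml.soap.SOAPConstants;
import javax.xml.soap.SOAPMessage;

public final class SaajPatchCheck {
    private static final String PROBE =
            "<Envelope xmlns=\"http://www.w3.org/2003/05/soap-envelope\">"
            + "<Header/><Body/></Envelope>";

    public static void warnIfPatchMissing() throws Exception {
        MimeHeaders headers = new MimeHeaders();
        headers.addHeader("Content-Type", "application/soap+xml; charset=UTF-8");
        SOAPMessage msg = MessageFactory.newInstance(SOAPConstants.SOAP_1_2_PROTOCOL)
                .createMessage(headers,
                        new ByteArrayInputStream(PROBE.getBytes(StandardCharsets.UTF_8)));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        msg.writeTo(out);
        if (out.toString("UTF-8").contains("env:Envelope")) {
            System.err.println("WARNING: saaj prefix patch is not active, "
                    + "outgoing SOAP messages will get namespace prefixes added");
        }
    }
}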

Saturday, December 26, 2020

The most f*cked up piece of software ...

... that I have seen so far. Surprisingly it is not JBoss.

I know I have not blogged for a long time. It is not like I did not have any fun with Oracle, JBoss, the JDK, or some other woefully broken code. It is just that most problems I come across look pretty mundane. Or they require quite an in-depth write-up, and I feel it is stupid to spend a lot of time describing something after I have already wasted hours if not days understanding and fixing it.

Case in point: I first ran into a problem with a particular functionality almost 3 years ago. The problem was reproducible only under some specific circumstances so it took some time to find the root cause. When I found it I did not think it would be really interesting to blog about it. After all, I had already encountered much more interesting problems in software coming from the same group of developers. So I came up with a workaround which helped in a lot of cases, and I decided that we need to get rid of this dependency altogether as soon as possible.

Unfortunately other tasks with higher priority prevented me from completing this task, and we still have this dependency. On the other hand, had I completed that task I would not have recently had the "pleasure" of wasting literally days on tracking down some other problems caused by the same software.

I decided that the sheer variance of problems in one not particularly complicated project is really worth blogging about.

Meet the "the most fucked up piece of software du jour": SAAJ reference implementation. I did not really trace its pedigree as I did not care about what particular shithole "incubator of promising projects" it crept out, but it is present under prefix "metro" in a lot of code repositories. Names of some authors in the source code of saaj-ri match names of authors in other metro-related projects like WSIT. So I am not amused.

What made all the problems worse is that Sun started to include metro subprojects in JDK as core functionality. (As if the JDK code was perfect and it needed some low quality code for balance.) As a result a lot of broken code ended up being used by a lot of developers in a lot of projects. This all made fixing bugs more difficult: even if they were fixed as part of the metro projects the fixes were seldom included in the JDK.

Oracle continued this tradition, at least up to JDK 11, when they finally removed a bunch of stuff, including SAAJ, from the JDK. But before that they made some extremely bad choices which resulted in saaj-ri being broken beyond anything imaginable.

saaj-ri is now part of Eclipse EE4J and looking at its git repository I can only conclude that the project decay is evident. Not that I care.

But saaj-ri is still included in a lot of places and a lot of developers are still using it, directly or indirectly. I can only wish good luck to them.

Well, enough ranting. Time to show the code. Let's start with the problem I encountered 3 years ago.

Part of our application was processing SOAP messages using saaj packaged with the JDK. We noticed that under some circumstances processing of several SOAP messages in parallel was almost grinding to a halt. After taking some thread dumps and analyzing them we found this wonderful piece of code in saaj classes.

This is the code from saaj-ri. The code in the JDK was under a different package name but for the rest it is the same or very close. See line 91:

private static ParserPool parserPool = new ParserPool(5);

You can enjoy the source code of ParserPool here, although it is not interesting.

In the current JDK8 EnvelopeFactory is a bit different: it has a classloader-bound pool. What a relief!

What does it mean basically? You can parse at most 5 SOAP messages in parallel [from the code loaded by the same classloader]. If for some reason the parsing takes a lot of time you have a serious problem.

This code raises many questions, like why the hell do they need the pool at all? OK, this might have been important back when saaj-ri was initially developed, but why is this code still present now? Why only 5 instances?

To add insult to injury this code also has a resource leak. In some circumstances you might end up with an empty parser pool: some exceptions can be thrown after a parser was retrieved from the pool. In that case the parser is never returned to the pool.
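The shape of the leak, and of the obvious fix, looks roughly like this (a self-contained sketch of the pattern; the tiny pool and the parsing method are stand-ins, not the saaj-ri code):

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

class TinyParserPool {
    private final BlockingQueue<SAXParser> queue;

    TinyParserPool(int capacity) throws Exception {
        queue = new ArrayBlockingQueue<>(capacity);
        for (int i = 0; i < capacity; i++) {
            queue.put(SAXParserFactory.newInstance().newSAXParser());
        }
    }

    SAXParser get() throws InterruptedException { return queue.take(); }   // blocks when empty
    void put(SAXParser p) throws InterruptedException { queue.put(p); }
}

class LeakDemo {
    // Leaky: if doParse() throws, put() is never reached and a parser is lost forever.
    static void parseLeaky(TinyParserPool pool, byte[] soap) throws Exception {
        SAXParser parser = pool.get();
        doParse(parser, soap);      // may throw...
        pool.put(parser);           // ...and then this line never runs
    }

    // Fixed: the parser always goes back, exception or not.
    static void parseFixed(TinyParserPool pool, byte[] soap) throws Exception {
        SAXParser parser = pool.get();
        try {
            doParse(parser, soap);
        } finally {
            pool.put(parser);
        }
    }

    static void doParse(SAXParser parser, byte[] soap) throws Exception {
        // stand-in for the real SAAJ envelope parsing
    }
}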

But do not despair! Good developers of saaj-ri are ready to combat the problems. They have actually fixed the resource leak somewhere in 1.4.X version.

And they added a way to increase the pool size! Wow - now you have it configurable. (As usual, with a JDK system property, once and for everybody, but who am I to point to a better way?)

Monday, September 28, 2015

WTF collection: how not to add an API

Some time ago we decided to move one of our projects to JDK 1.8. After all, JDK 1.7 reached EOL. While we did not really want to start with all those JDK 1.8 features, the security updates were important for us. In addition, the newest versions of popular browsers refuse to work over HTTPS with web servers running on JDK 1.7, and our project has a web UI.

The move was not an easy exercise. Yes, most of the problems were traced to our code, although for some of them I would have expected better diagnostics from the tools or frameworks.

But one problem stood out.

The project communicates over HTTPS with some remote web services. And we started getting errors for one of them. The web service is a bit different than others: the web service provider insisted on using the SSL/TLS Server Name Indication (SNI).

We were having some problems when we started communicating with this web service initially while we were still running with JDK 1.7. And now the errors with JDK 1.8 were remarkably similar to the errors we had initially with the web service. It was immediately clear who is the primary suspect.

After all I knew that the JDK 1.8 includes some API to explicitly control the SNI behavior. But I hoped that the JDK does the right thing if SNI settings are not explicitly controlled. Our code did not do it.

Let's look closer at it. This is what -Djavax.net.debug=all is for.

First surprise: after setting the property, I could not log in to our web-based UI! We got errors in browsers saying that the HTTPS connection could not be established. Removing the property helped with the UI. How is it even possible to release a product where enabling some debugging output breaks the functionality?! Yes, a lot of developers, me included, have made similar mistakes, but leaving such a bug in a released version goes too far in my opinion.

And how the hell am I supposed to debug the suspected SNI problem? Let's try another JDK. OpenJDK 1.8.0_51 out, Oracle JDK 1.8.0_60 in. Result: the problem with the debugging property and HTTPS in the web UI is gone.

Hope dies last ... the problem with the web service is still there. Of course it would have been too easy if this problem were also solved.

But at least I could now look at the SSL debugging output. And indeed, exactly what I thought: SNI is not sent.

I also knew about jsse.enableSNIExtension system property. We started the JVM without this property so the default behavior must have been used. Needless to say that explicitly setting the property to true did not change a thing.

The rest was just a tedious work of creating a reproduction scenario and some googling. A simple program with some basic HttpURLConnection manipulations did not reproduce the problem: the JDK was sending the SNI info. Time to look at the JDK source code and do more debugging.

From my point of view the authors of that part of Java reserved themselves a permanent place in developer's hell a long time ago. This code is a mess and a bloody nightmare. Yes, I have seen code that was much worse. But somehow I expected the JDK code to be of better quality...

After many cycles of modifying the test program, debugging, and studying the mess they called "source code" I came to this beauty:

HttpsClient.java, starting from line 430, method public void afterConnect():
[430]    public void afterConnect()  throws IOException, UnknownHostException {
...
[439]                    s = (SSLSocket)serverSocket;
[440]                    if (s instanceof SSLSocketImpl) {
[441]                       ((SSLSocketImpl)s).setHost(host);
[442]                    }
...
[470]            // We have two hostname verification approaches. One is in
[471]            // SSL/TLS socket layer, where the algorithm is configured with
            ...
            The rest of the very long and insightful comment is stripped
[518]             boolean needToCheckSpoofing = true;
[519]             String identification =
[520]                s.getSSLParameters().getEndpointIdentificationAlgorithm();
[521]             if (identification != null && identification.length() != 0) {
[522]                 if (identification.equalsIgnoreCase("HTTPS")) {
[523]                    // Do not check server identity again out of SSLSocket,
[524]                    // the endpoint will be identified during TLS handshaking
[525]                    // in SSLSocket.
[526]                    needToCheckSpoofing = false;
[527]                }   // else, we don't understand the identification algorithm,
[528]                    // need to check URL spoofing here.
[529]            } else {
[530]                 boolean isDefaultHostnameVerifier = false;
...
[535]                 if (hv != null) {
[536]                    String canonicalName = hv.getClass().getCanonicalName();
[537]                     if (canonicalName != null &&
[538]                    canonicalName.equalsIgnoreCase(defaultHVCanonicalName)) {
[539]                        isDefaultHostnameVerifier = true;
[540]                    }
[541]                }  else {
...
[545]                    isDefaultHostnameVerifier = true;
[546]                }

[548]                 if (isDefaultHostnameVerifier) {
[549]                    // If the HNV is the default from HttpsURLConnection, we
[550]                    // will do the spoof checks in SSLSocket.
[551]                    SSLParameters paramaters = s.getSSLParameters();
[552]                    paramaters.setEndpointIdentificationAlgorithm("HTTPS");
[553]                    s.setSSLParameters(paramaters);

[555]                    needToCheckSpoofing = false;
[556]                }
[557]            }

[559]            s.startHandshake();
...
[581]    }
There are so many things that are wrong here. But let's start:
  1. Lines 440 - 442: the hostname is passed to the SSL socket via a non-public API. This basically prevents you from providing your own SSL socket factory with your own SSL socket delegates. Your sockets will not get the hostname info. And the hostname is used by the default trust verification mechanism invoked from the default SSL socket implementation.

  2. The biggest "wrong" in this code starts on line 470. The authors probably wasted all their energy on those 50 lines of comments and had nothing left to properly implement the logic. Basically the SNI information is sent only if method SSLSocketImpl.setSSLParameters() is invoked. And if it is not invoked, no SNI is sent. And the code above shows that setSSLParameters() is invoked in one case only: if no endpoint identification algorithm was specified and the hostname verifier was left at its default. Our code had a custom hostname verifier, and oooops, SNI was not sent.

    The funny thing about it: if one bothers to explicitly specify an endpoint identification algorithm, even the default one, SNI is not sent either.

    There is actually a bug JDK-8072464 about the "non-default hostname verifier", but it does not mention an explicitly specified endpoint identification algorithm. And it looks like they do not plan to fix it in 1.8.

  3. There is another bug lurking in the API and the implementation: there is no easy way to disable SNI, or actually just customize it, for a particular connection. Yes, one can disable sending SNI by setting the jsse.enableSNIExtension system property to false, but it is a JVM-wide setting. Don't you hate it when the only way to get some functionality is to use some system property? I do hate that kind of "all or nothing" approach. And the JDK is full of it. One of the worst offenders is javamail: it gives you a way to specify per-connection settings and still relies on JVM system properties in some cases. Really clever technique!

    Back to SNI: you see, to explicitly specify SNI you have to implement an SSL socket factory, which is already quite a step. Then you can use setSSLParameters() to customize the SNI or provide an empty list if you do not want SNI sent. So far so good, but this is the only place where you are in control of the socket. And it is too early. Because HttpsClient.afterConnect() is invoked much later. Say there is no endpoint identification algorithm specified and the hostname verifier is the default one. Or just imagine bug JDK-8072464 is actually fixed. In this case the default SNI behavior kicks in and overwrites whatever you have specified in the socket factory. Remember that little setHost() on line 441? This is where the host name gets into the SSL parameters. And then the code on lines 551 - 553 overrides your precious SNI with the one that was set on line 441.

    So in reality you have to implement an SSL socket factory and an SSLSocket so that you can do some additional SNI manipulation in the startHandshake() method. But then you will not get the hostname set, because lines 440 - 442 are not executed for your custom SSLSocket.

A small detour: go read what they call a "JDK Enhancement Proposal" about SNI, especially the section about testing.
Need to verify that the implementation doesn't break backward
compatibility in unexpected ways.
A noble goal, is it not?

Just imagine how much time they have spent on that JEP, on the API, on the implementation. Net result? Puff, zilch, and JDK-8072464.

Of course all this applicable only if your code relies on the JDK's HTTP support. This is probably another very good reason to move to libraries like Apache HttpComponents. I do not know if it properly supports SNI and gives you enough rope but it can at least be patched much easier if needed.

Since we still have to rely on the JDK's HTTP support, I had to resort to a custom SSL socket factory and a custom SSL socket, and to things like "instanceof SSLSocketImpl" and typecasts. Too much code for my liking, just to work around some silly bug. But at least we can now send messages to that web service.
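The raw JSSE API for forcing SNI is simple enough; the ugly part is only the delegating factory and socket needed to smuggle it past afterConnect(). A minimal standalone sketch of the API itself (the host is a placeholder):

import java.util.Collections;
import javax.net.ssl.SNIHostName;
import javax.net.ssl.SNIServerName;
import javax.net.ssl.SSLParameters;
import javax.net.ssl.SSLSocket;
import javax.net.ssl.SSLSocketFactory;

public class SniDemo {
    public static void main(String[] args) throws Exception {
        String host = "example.com";   // placeholder host
        SSLSocket socket = (SSLSocket) SSLSocketFactory.getDefault().createSocket(host, 443);
        SSLParameters params = socket.getSSLParameters();
        params.setServerNames(
                Collections.<SNIServerName>singletonList(new SNIHostName(host)));
        socket.setSSLParameters(params);   // explicit SNI, independent of HttpsClient's guessing
        socket.startHandshake();
        System.out.println("Negotiated: " + socket.getSession().getProtocol());
        socket.close();
    }
}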

And, by the way, there is another problem with the JDK's SNI. From my point of view it is also a bug but this time it goes over somewhat grey area of definition ambiguity. The code in question is Utilities.rawToSNIHostName().

The SNI RFC section 3.1 prohibits use of IP addresses in SNI, and the JDK behaves correctly if hostname in the above code is an IP address.

But they also ignore hostnames without a '.' character. This is wrong. I guess they try to follow the RFC where it says:
"HostName" contains the fully qualified DNS hostname of the server,
as understood by the client.

There are two problems with the JDK's behavior. First of all, the spec says "as understood by the client". If a hostname, with or without a '.' character, is correctly resolved to an IP, it is as good as "fully qualified as understood by the client". So the JDK incorrectly excludes hostnames without a '.' character.

On the other hand, if the JDK follows some other specification of a "fully qualified DNS hostname", then a mere presence of a '.' character in a hostname does not make it fully qualified. It is at most "a qualified name". Unless of course the JDK authors have somewhere a specification that says "a hostname is fully qualified if it has at least one '.' character". But I bet they just got tired after writing all those specs, JEPs, APIs, and comments in the code.

Sunday, August 2, 2015

WTF collection: JDK and broken expectations

Imagine: you have a piece of code that has been working fine for a long time. The code connects to some web server, sends some data, gets the result, etc. Nothing fancy here.

One day the very same piece of software was used to send data to yet another web server. This time it was different: we got long delays and a lot of errors like "Unexpected end of stream". Time to investigate.

The packet capture revealed one particular HTTP request header that just screamed "You want some problems? Just use me! " The header was Expect: 100-continue

The code in question uses the java.net.HttpURLConnection class. (Yes, I know about Apache commons HTTP, but that is beside the point.) I just hoped that this header is not set by the JDK code automatically. It is always a great PITA to change HTTP headers that the JDK babysits. Just try to set the HTTP request header "Host", for example! Fortunately the "Expect" header is not one of those restricted by the JDK. It was set by the application itself based on some conditions. A couple of changes and this header is no longer sent. No more delays, no more "Unexpected end of stream" errors, everything works.

By now I was curious enough to find out what was going on. Even the fact that I had to go through some JDK code did not stop me. Almost every time I have to look at the JDK code I get the feeling I could easily guess where the authors of that code grew up...

Yep, no surprises here, the same style, see: sun.net.www.protocol.http.HttpURLConnection class. Here is the relevant piece of "logic":

private void expect100Continue() throws IOException {
            // Expect: 100-Continue was set, so check the return code for
            // Acceptance
            int oldTimeout = http.getReadTimeout();
            boolean enforceTimeOut = false;
            boolean timedOut = false;
            if (oldTimeout <= 0) {
                // 5s read timeout in case the server doesn't understand
                // Expect: 100-Continue
                http.setReadTimeout(5000);
                enforceTimeOut = true;
            }

            try {
                http.parseHTTP(responses, pi, this);
            } catch (SocketTimeoutException se) {
                if (!enforceTimeOut) {
                    throw se;
                }
                timedOut = true;
                http.setIgnoreContinue(true);
            }
            if (!timedOut) {
                // Can't use getResponseCode() yet
                String resp = responses.getValue(0);
                // Parse the response which is of the form:
                // HTTP/1.1 417 Expectation Failed
                // HTTP/1.1 100 Continue
                if (resp != null && resp.startsWith("HTTP/")) {
                    String [] sa = resp.split("\\s+");
                    responseCode = -1;
                    try {
                        // Response code is 2nd token on the line
                        if (sa.length > 1)
                            responseCode = Integer.parseInt(sa[1]);
                    } catch (NumberFormatException numberFormatException) {
                    }
                }
                if (responseCode != 100) {
                    throw new ProtocolException("Server rejected operation");
                }
            }

            http.setReadTimeout(oldTimeout);

            responseCode = -1;
            responses.reset();
            // Proceed
    }

Nice thing: they decided to work around some broken HTTP servers that may send a 100 HTTP response even if a request does not have the "Expect: 100-continue" header. See those http.setIgnoreContinue() here and there?

This is actually the only nice thing I can say about this code.

The authors also decided to work around another possible misbehavior with respect to the "Expect" header. See that comment 5s read timeout in case the server doesn't understand Expect: 100-Continue on line 1185? Except that all this happens only if the connection does not have its own read timeout set, see line 1184.

But if you decided to protect your application against some servers that take too long to respond by setting read timeout on an HttpURLConnection to some reasonable value like 30 or 60 seconds, you are screwed. Because the JDK starts waiting for a 100 HTTP response and if the server does not send it, the JDK times out after waiting as long as your read timeout setting (30 or 60 or whatever seconds!). In that case the JDK does not bother sending the request body to the server, and your application gets a SocketTimeoutException. Nice work, guys!
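To make the trap concrete, this is roughly the innocent-looking client code that runs into it (the URL and payload are placeholders; the comments summarize the behavior described above):

import java.net.HttpURLConnection;
import java.net.URL;

public class ExpectTrap {
    public static void main(String[] args) throws Exception {
        byte[] payload = "<soap.../>".getBytes("UTF-8");
        HttpURLConnection conn =
                (HttpURLConnection) new URL("https://example.com/service").openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setReadTimeout(60_000);                       // meant for slow responses...
        conn.setRequestProperty("Expect", "100-continue"); // ...but now also spent waiting for the 100
        conn.getOutputStream().write(payload);
        // If the server never sends a 100: SocketTimeoutException after the full read
        // timeout, body never sent. If it answers with a final status (e.g. 404): the
        // real code is swallowed and you get ProtocolException("Server rejected operation").
        System.out.println(conn.getResponseCode());
    }
}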

And it is not all. Another very interesting logic starts on line 1200. If some response was read from the server, the code verifies the response code. Completely useless and extremely harmful piece of code: if the server responded with anything other than 100 the JDK reports "Server rejected operation" error.

Now go back to RFC 2616 and read the very first requirement for HTTP/1.1 origin servers (emphasis mine):

- Upon receiving a request which includes an Expect request-header
field with the "100-continue" expectation, an origin server MUST
either respond with 100 (Continue) status and continue to read
from the input stream, or respond with a final status code. The
origin server MUST NOT wait for the request body before sending
the 100 (Continue) response. If it responds with a final status
code, it MAY close the transport connection or it MAY continue
to read and discard the rest of the request.  It MUST NOT
perform the requested method if it returns a final status code.

See that or respond with a final status code part? Say the server implements that part of the specification correctly. It receives a request with "Expect: 100-continue" header, decides that it cannot handle the request because for example it does not recognize the specified context path. The server immediately sends a 404 HTTP response. But the JDK knows better and instead of the real error your application gets a stupid and meaningless "Server rejected operation" error. Good luck finding out what went wrong.

Be warned and do not use the "Expect: 100-continue" HTTP request header if you have to rely on the java.net.HttpURLConnection class. And do not think the JDK code is somehow of better quality than the rest of the code out there. In fact it is normally on par or even worse.

Wednesday, June 19, 2013

Aaaaaah! Metro, you made my day!

As I have mentioned before, WS-security unit tests in metro are using SAAJ to handle SOAP messages. A typical test goes like this:
  1. A SOAP message is created and populated with some elements by the test code.
  2. A WS-Policy-based configuration that defines what to do with the message is created.
  3. A security operation is performed.
  4. The resulting SOAP message is written to a file.
  5. A new instance of SOAP messages is created from the file.
  6. A new WS-Policy-based configuration that defines how to validate the message is created.
  7. Validation is performed.
I decided to add some JAX-WS code to one of the tests. Originally the test was doing both signing and encryption, but I removed the encryption. It is much easier to see what is going on. Then I verified that the test was still green, and that it fails if I modify the file the test generates before the file is read in. Just in case: you can never be too careful.

I have added the following steps at the end of the test:
  1. The file with the message, created after the security operation, is read in the streaming mode as a JAX-WS message.
  2. A new WS-Policy-based validation configuration is created. It is actually exactly the same code as in SAAJ SOAP message case.
  3. Validation is performed. This is done by a different set of classes than in the SAAJ case, although the class names are similar. Here metro shines again: although the operations are similar, the way validation is invoked is different. Worse, all the parameters for the JAX-WS way of validating are type-compatible with SAAJ validation. Trying to use the SAAJ validation code with a JAX-WS message compiles but fails with an NPE.

After some failed attempts I have got a version that not only compiled but went quite deep into metro code, and then failed with signature validation. It was "Reference #idxxx: signature digest values mismatch" or something like that.

This was ... interesting. The same message is OK if validated as a SOAPMessage instance and fails if validated as a JAX-WS message. Something fishy was going on. Metro had not only provided multiple implementations of XML-Security, but they also managed to make them incompatible. Remind me, what does that "WSIT" stand for?

Of course the problem might have been in the way I have set up JAX-WS-based validation, but I was quite sure it is another genuine "feature" of metro.

In order to understand what the error message means it is necessary to understand what the signed message looks like. It is a SOAP message with a lot of additional stuff added to the SOAP Header (a lot of details omitted):
<Envelope ...
   <Header>
       <Header1 wsu:Id="id_h1".../>
...
          <ds:Signature >
          <ds:SignedInfo>
...
              <ds:Reference URI="#id_h1">
                  <ds:DigestValue>some base64</ds:DigestValue>
              </ds:Reference>
              <ds:Reference URI="#id_body">... <ds:Reference>
...
          </ds:SignedInfo>
          <ds:SignatureValue>some base64</ds:SignatureValue>
          </ds:Signature>
       <HeaderN wsu:Id="id_hN".../>
   </Header>
   <Body wsu:Id="id_body">...</Body>
</Envelope>

Each <ds:Reference> element "describes" a particular thing that is signed, typically some element from the same message, as well as how that thing has to be preprocessed before calculating the digest, what algorithm is used to produce the message digest, and the message digest value. URI attribute of <ds:Reference> specifies which element is digested.

<ds:SignedInfo> can contain a lot of such <ds:Reference> elements. And then comes <ds:SignatureValue> that is actually a digitally signed message digest of <ds:SignedInfo> element.

The order of signed header elements and <ds:Signature> is not important. Some say the signature must come after the to-be-signed header elements to facilitate streaming, but this is a moot point. More often than not the SOAP Body also has to be signed, so you can kiss that nice streaming theory goodbye.

Anyway the error I was getting, "Reference #idxxx: signature digest values mismatch", was about the very first <ds:Reference> in the message. It meant that the verifying code looked at URI attribute, URI="#id_h1" in this case, found the corresponding header element by its id, <Header1 wsu:Id="id_h1".../>, and calculated its digest. And the calculated digest did not match <ds:DigestValue> of the <ds:Reference>.

I switched on the logging and repeated the test several times, with the same result. I was not sure what I wanted to see. The logging did not show anything exciting. But then I noticed some pattern. The output contained calculated and expected digest values taken from <ds:DigestValue> of the <ds:Reference>. The values were unreadable because the digest is just byte[], and metro guys did not bother with encoding/decoding or pretty-printing them. While the expected digest was clearly changing from run to run, the calculated digest looked the same. This was clearly wrong because the digest should have been calculated over the header element including all its attributes. While most of the things remained unchanged, wsu:Id attribute differed from run to run. So the calculated digest had to be different as well.

Checking the verification code under the debugger confirmed this: the calculated digest was exactly the same every time the test was executed. So what exactly metro is using as the source for the digest calculation? Turned out: in this particular case nothing.

Yeap, nothing. So the calculated "digest" is probably some fixed initial state of the message digest implementation.

The problem had nothing to do with how I used metro API. The real reason was the signed message itself. Time to show the relevant <ds:Reference> in its full glory:
<ds:Reference URI="#_5002">
    <ds:DigestMethod Algorithm="http://www.w3.org/2000/09/xmldsig#sha1"/>
    <ds:DigestValue>dKof/iss1y+eaCxi5xQGzXZw8RQ=</ds:DigestValue>
</ds:Reference>

The key to the problem is not what is there, but rather what is absent. Some important piece is missing, namely, "instructions" on how the data has to be preprocessed before calculating the digest. Normally a <ds:Reference> looks like this:
<ds:Reference URI="#_5006">
    <ds:Transforms>
        <ds:Transform Algorithm="http://www.w3.org/2001/10/xml-exc-c14n#">
            <exc14n:InclusiveNamespaces PrefixList="S"/>
        </ds:Transform>
    </ds:Transforms>
    <ds:DigestMethod Algorithm="http://www.w3.org/2000/09/xmldsig#sha1"/>
    <ds:DigestValue>JWq3aJtzUP98fkiThJ0WYtcrWCY=</ds:DigestValue>
</ds:Reference>
<ds:Transforms> and its child <ds:Transform> elements are such "instructions".

The rest was easy: knowing that there were no <ds:Transform> elements I looked at what metro does in this case with the referenced element. Well, nothing. No data is digested.

Some questions remained though:
  1. Why did the signing code produce a message without <ds:Transform> elements?
  2. Why was the SAAJ message successfully validated?
  3. What is the expected behavior according to the specification when there are no <ds:Transform> elements?

Let's start with the last question. The answer is: I do not care. It might be valid. It might be not valid, but in any case it is definitely not a "signature digest values mismatch". This is actually an answer to the first question as well. Why did the signing code produce a message without <ds:Transform> elements? It does not matter, because metro might need to process such messages anyway, no matter how they are created.

Why was the SAAJ message successfully validated? Well, the validation was performed by the same code as the signing. For SAAJ messages metro delegates all the work to the JSR 105 "Java XML Digital Signature API Specification" implementation, which is now part of the JDK. Basically it is some older version of Apache Santuario, repackaged by Sun. I checked the Apache Santuario source and found some remarkable similarities with the metro code. Except that Santuario's code does not have this particular bug, because after applying all the transforms it checks whether there is any data left to be processed, and processes it. And metro does not perform this check. The check has existed in Santuario's code for ages, from the first check-in of JSR 105 support in 2005. I guess metro had "borrowed" some even older version of that code. As a result metro fails if there are no <ds:Transform> elements, and might also fail if there are <ds:Transform> elements. I did not check the logic there completely, but it looks like some combination of <ds:Transform> elements might result in the same error. The "I" in WSIT looks more and more like a joke.

By the way, why did the signing code produce a message without <ds:Transform> elements? The transforms to use come from WS-Policy, but not directly. At least when I tested some metro samples, the generated messages had <ds:Transform> elements, but the WS-Policy declarations used by the samples did not have any explicit mention of transforms. The metro runtime is probably adding this during parsing of the WS-Policy. The signing code in the unit test uses a combination of a WS-Policy file and some runtime policy manipulation to create the final policy for the signature process. What exactly has to be signed is defined by Java code, so this is probably the reason why the final policy ended up having no transforms specified. Sure enough, after I found out how to add this info to the Java-created policy and modified the code, signed messages were produced with <ds:Transform> in <ds:Reference>. And the JAX-WS way of verifying the message went OK as well.


What can I add? At least interoperability-wise metro really shines.

Monday, June 17, 2013

Web services: in search of interoperability

I have some experience with web services, SOAP, SAAJ, what not, but until recently I did not have any "pleasure" to work with all the rest of WS-* standards. WS-Addressing, WS-Crap, WS-Security, WS-whatever.

Not that long ago however I ran out of luck.

The product I am working on can operate as a WS service or a WS client. And it works. But there are situations when "it works" is not good enough. Usually it has to do with artificial barriers created to protect certain markets. This all is accompanied by some "specification" to which a product must conform. As a rule such a specification is a badly edited collection of copy-pasted fragments from other standard documents from well-known authorities like W3C, OASIS, etc., often with contradictory statements. Nobody cares.

Recently we needed to prove our product conforms to one of such specs. We were lucky. Not only was there a specification, there was also a "compliancy service" provided. The "compliancy service" could accept SOAP requests and thus validate whether the sender conforms to the specification, or it could send SOAP requests and validate SOAP responses to check whether the receiver conforms to the specification.

Last year I had to deal with another "specification" and "compliancy service" from the same people. Do you know what one needs to have a good compliancy service? No, you do not have to conform to well-known standards, or even to your own specification. Monopoly is good enough. Add some crappy software and you are set.

For example, the software they used then (and still use) could not handle some erroneous HTTP requests. Instead of returning any kind of response the "compliancy service" did nothing. Literally. The connection was kept open, but not a single byte of response was sent back. Eventually the client timed out trying to read a response. It took us more than a month collecting data and e-mailing them before they agreed that the problem is on their side. The problem is still not fixed.

So I knew I had a lot of fun ahead, I just did not know how much.

This time everything revolved around WS-Addressing and optionally WS-Security. How exactly the WS-* stuff had to be applied was specified in an "interoperability standard" document. The document was unclear on a couple of dozen points, but it was a "standard", so our product had to be "standard"-compliant.

The "compliancy service" found no problem in our requests and responses in case no WS-Security had to be applied. Adding XML signature to the picture changed everything.

First, the "compliancy service" did not like what request elements our product signed. It complained we were signing more than needed. Turned out it was the case of "do what I do and not what I say". The "standard" defined some elements as mandatory and allowed some optional elements to be present. In the section that described what has to be signed it said that all mandatory and optional (if present) elements must be signed. But "compliancy service" did not like that our requests had optional elements that were signed. OK, no optional stuff then. And no more complains from the "compliancy service".

But when I started testing our product as a web service provider all hell broke loose. No matter what I did the "compliancy service" said "signature verification failed". Just that.

Since then I have learned what JWSDP, XWSS, WSIT, Metro, you name it, means. I have seen monsters much worse than in JBoss code.

And I found out that by 2008 there were still XML parsers in the wild that would reject valid XML documents as invalid. And that in 2013 somebody would still use such a parser. OK, OK, granted, I am not really sure whether that XML parsing problem is a feature of the parser itself. It might very well be that the parser was "improved" as part of the "compliancy service" development. But still... failing to parse XML if there is a character entity representing a whitespace character between some XML elements? Like this:

<Header>
<header1 …/>
<header2 …/>&#x20;
</Header>&#x20;
<Body …/>
</Envelope>

Remove either of these two entities, and the problem goes away, even if it is added anywhere else. +100 to "compliance level". Grrr.

After a lot of experiments and quite some test code to generate XML signature I found out that the "compliancy service" did not like new lines in our response messages. Only after I produced a signed response that did not contain new line characters, the "compliancy service" gave up and accepted the response.

This was really strange because request messages with new lines did not cause any trouble. Submitting a bug report to them was not really an option. We did not have another month.

I found out that the "compliancy service" uses some WS-* toolkit from Sun, not sure of the exact name and version of the toolkit. Nowadays it goes under name "Metro". Or is it WSIT? Beats me. Anyway, based on some stack traces I have seen it was a version from around 2008. Oh, Sun! I had some pleasures debugging JAX-WS RI some time ago. Fine experience, unforgettable.

So I decided to download the latest version of that toolkit to play with it. The decision opened up a bright world of project code names and their relationships. Googling classes from the stacktrace turned up XWSS, JWSDP, WSIT, with XWSS being the primary suspect. Project migrations, consolidations, broken download links, and Oracle buying Sun added even more fun.

All the roads led to metro and WSIT. The latest version is 2.3, so be it.

Setting it up and running some samples went mostly flawlessly, but when I started experimenting with soapUI, I immediately ran into an issue. The sample I was using was a SOAP 1.2 web service, but I sent it a SOAP 1.1 request. Granted, it was a faulty request, but a SOAPFault with a NullPointerException and nothing more is quite an extreme way to say "I do not support SOAP 1.1 here".

By the way do you know what WSIT stands for? Web Services Interoperability Technologies. Yeap, "Interoperability".

I also tested how character entities are parsed. I could not reproduce the problem. At least this one is solved. Or it was not a problem of the toolkit at all.

The real fun began when I started sending signed requests from soapUI. First I have got bitten by the fact that soapUI "friendly" modified my messages.

Next problem I ran into was much more serious: some of the signed messages my test code produced were happily accepted by metro and some were rejected with an error that sounded like "Signature verification of SOAP Body failed".

Some of the messages accepted by metro had new line characters, so again the problem we had with the "compliancy service", if it was the problem of the toolkit, was solved. Needless to say when I generated response messages with the exact formatting they still were rejected by the "compliancy service".

And what about the test messages that metro rejected? I actually found the cause quite quickly. Under some circumstances metro chops off whitespace characters, and probably also comments, that are direct child nodes of the SOAP Body. They probably do it in order to promote the "I" ("-nteroperability"). What else could the reason be? And of course whitespace is significant for XML digital signatures.

For example, this:
<Envelope>
…
<Body>
    <elementX .../>
</Body>
</Envelope>

is treated by metro as if it were
<Envelope>
…
<Body><elementX .../></Body>
</Envelope>
But not always. Who said life is easy?
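
To see why this breaks signature verification, here is a minimal illustration (plain JDK code, not metro; the element names are made up). XML canonicalization keeps whitespace text nodes inside elements, so the digest of the stripped Body no longer matches the digest computed over the Body as it was signed:

import java.security.MessageDigest;

public class WhitespaceDigest {
    public static void main(String[] args) throws Exception {
        // The Body as the client signed it (with whitespace) and as metro sees it (stripped).
        String original = "<Body>\n    <elementX/>\n</Body>";
        String stripped = "<Body><elementX/></Body>";

        MessageDigest md = MessageDigest.getInstance("SHA-256");
        System.out.println(toHex(md.digest(original.getBytes("UTF-8"))));
        System.out.println(toHex(md.digest(stripped.getBytes("UTF-8"))));
        // The two digests differ, so the reference digest stored in the signature fails to verify.
    }

    private static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }
}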

Looking back I know that I was lucky when I tested our product as a WS client sending data to the "compliancy service". Purely by chance the request messages did not have any whitespace characters in the Body.

I should say the source code of metro is a ... well, I do not know. Saying "mess" would not do it justice; it would be more like a huge compliment. Classes with the same name in multiple packages, maybe doing the same thing? Or maybe not quite the same? Methods longer than a couple of thousand lines? Copy-paste? You name it, and it is there.

I also found that the problem had been reported to them, maybe even multiple times. It is always easy when you know exactly what you are looking for. And of course it was reported fixed. Ha! This is another thing I do not understand: you have found a problem and you have fixed it. Is it so much work to fire "find usages" in your IDE? One of the remaining broken places is in the file next to the one you have just modified! To me it says a lot about the quality of the project and the people working on it.

The problem is in the JAX-WS integration code of WSIT, but given the complexity of metro, JAX-WS is probably the only way metro is used, so the problem affects everybody who uses metro with XML Security. And people are still running into this problem. The answer from the "Interoperability" specialists is of course "it is your problem". Unfortunately it is true.

Another "stamp of quality" is their unit tests. WS-Security subproject has only 11 tests that do something about WS-Security. Compare that with 27 tests around WS-Policy parsing. Even more interesting fact is that their WS-Security tests do not test JAX-WS code paths. Metro web site claims that they use XML streaming to improve performance. And part of their code is using XMLStreamReader. Whether it improves performance I do not know since they like to copy data into XMLStreamBuffer objects to use them in other places as XMLStreamReader. But their unit tests are using SAAJ to read SOAP messages, and not the streaming code. As a result the code that is actually used by a metro-based WS client or server is not tested.

I should probably not even mention the possibility of having some unit tests for interoperability with other toolkits. I doubt they would understand the concept.

Anyway, knowing the problem and the fix, I repeated my failing tests, this time fixing the data as needed under a debugger. Sure enough, no more signature errors.

Net result: our product is compliant with the WS-Security standard. I know what we need to do to make a particular configuration of the latest version of "the great interoperability toolkit of all time" accept our messages. Given the complexity of metro I have no idea whether some other configuration would be OK with our messages.

I still have no idea why the "compliancy service" did not like our responses with newlines in them. Needless to say, I tried again, making sure there was no whitespace in the Body, but the error was still the same.

If you are thinking of using metro for your projects, do not. Even if you do not need WS-Security. If they manage to screw things up during parsing I do not want to think what they can do in more complicated cases.

If you are unfortunate enough to be using metro now, especially with WS-Security... Well, if you have metro on both sides you will not be bitten by this bug, because normally metro generates the SOAP Body without whitespace characters.

If you have some interoperability issues with XML signatures between metro and some other toolkit, check the messages. Maybe you are lucky and all your issues are caused by whitespace in the SOAP Body.

If you are a metro developer... let me say no more.

Wednesday, September 26, 2012

JBoss and CXF: it is not over yet

I came across the Spring story some time ago but did not have time to write about it then. I would probably never have gotten around to writing about it, but recently I had to refresh my knowledge of JBoss and CXF.

Besides the Spring usage, the JBoss CXF integration has more nice things to offer. They are more subtle, but they definitely deserve to be mentioned.

Remember that little jbossws-cxf.xml? It has its own secrets.

First of all, there are several locations where the JBoss WS code looks for it. Do you know where exactly the file should be packaged in your jar/war/whatever? If you think the correct location is META-INF/ for jars and WEB-INF/ for wars, you are close, but that is only part of the story. This is the location the WS deployer checks for jbossws-cxf.xml at application startup. Note that JBoss does not use classloading to locate this file: it goes straight into the jar/war and checks whether the file is present. If it is there, it is processed together with whatever other JBoss CXF specific deployment descriptors the deployer can find at this moment. If no META-INF/jbossws-cxf.xml or WEB-INF/jbossws-cxf.xml is found, the deployer does not look at the other deployment descriptors even if they are present. This is important because whatever JBoss manages to find at this moment is used for the configuration of WS services and the injection of WS client references. Well, maybe.

Of the other places where the JBoss WS deployer looks, the most important is META-INF/cxf/jbossws-cxf.xml. Pay attention to that cxf in the path. JBoss: consistent as ever; flexible as nothing ever before. It gives you the possibility to place config files in at least two different places. There are, of course, notable differences. The first one is already explained above: the deployer loads config files only if it detects META-INF/jbossws-cxf.xml or WEB-INF/jbossws-cxf.xml. Then it loads that file and also META-INF/cxf/jbossws-cxf.xml if present. So if you have only META-INF/cxf/jbossws-cxf.xml, the deployer will not see it at application startup.

Another difference is that the deployer uses classloading to load a single META-INF/cxf/jbossws-cxf.xml. This, of course, is a source of some amusement if you happen to have multiple jars with META-INF/cxf/jbossws-cxf.xml. JBoss already ships some jars with that file, like server/<server_name>/deployers/jbossws.deployer/jbossws-cxf-client.jar. If you also have META-INF/cxf/jbossws-cxf.xml in your application, then which file is loaded depends on the classloading configuration. I do not know how important the things from jbossws-cxf-client.jar are, but you definitely do not want to miss your own configuration.
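
If you want to see which copies of META-INF/cxf/jbossws-cxf.xml are visible, and which one wins a single-resource lookup, a quick probe like the following can help (just a sketch with a made-up class name; run it with the classloader of your deployed application, keeping in mind that the deployer itself may be using a different one):

import java.net.URL;
import java.util.Enumeration;

public class ConfigProbe {
    public static void main(String[] args) throws Exception {
        ClassLoader cl = Thread.currentThread().getContextClassLoader();
        // All copies visible to this classloader, in lookup order.
        Enumeration<URL> all = cl.getResources("META-INF/cxf/jbossws-cxf.xml");
        while (all.hasMoreElements()) {
            System.out.println("visible: " + all.nextElement());
        }
        // The single copy that a plain getResource() lookup picks up.
        System.out.println("winner: " + cl.getResource("META-INF/cxf/jbossws-cxf.xml"));
    }
}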

Location is covered; time to look at when JBoss looks for the file. It does so not only during application startup, as described above. If you use the WS API, for example javax.xml.ws.Service.create(), JBoss will do it again. Basically it is the same story, except that it never loads META-INF/jbossws-cxf.xml or WEB-INF/jbossws-cxf.xml. It goes straight to all those other JBoss CXF specific deployment descriptors that it can find, including META-INF/cxf/jbossws-cxf.xml. This process might use a different classloader than the one used during deployment, so the loaded configuration might differ significantly from the one loaded during application startup.
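
For reference, the API path in question is nothing exotic; something like the following (all names and the WSDL URL are hypothetical) is enough to trigger that second round of descriptor loading:

import java.net.URL;
import javax.xml.namespace.QName;
import javax.xml.soap.SOAPMessage;
import javax.xml.ws.Dispatch;
import javax.xml.ws.Service;

public class ClientFromApi {
    public static void main(String[] args) throws Exception {
        URL wsdl = new URL("http://example.com/your-service?wsdl");
        QName serviceName = new QName("http://your.service.namespace/", "YourService");
        QName portName = new QName("http://your.service.namespace/", "YourPortName");

        // This call, not the application deployment, is the moment when JBoss goes
        // looking for the CXF specific descriptors again, possibly with another classloader.
        Service service = Service.create(wsdl, serviceName);

        // getPort() on a generated service interface goes through the same machinery;
        // a Dispatch client is used here only to keep the example self-contained.
        Dispatch<SOAPMessage> dispatch =
                service.createDispatch(portName, SOAPMessage.class, Service.Mode.MESSAGE);
        // ... dispatch.invoke(request) would then send a message using whatever
        // configuration JBoss/CXF managed to pick up at this point.
    }
}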

A nice finishing touch comes from the Spring XML parsing code. Most likely it is somehow configured from JBoss, but I did not care to find out where and how. The deployer triggers (META-INF/jbossws-cxf.xml or WEB-INF/jbossws-cxf.xml) are parsed with XSD validation on; all other files are parsed with validation off. It seems a minor difference, unless you want the configuration from jbossws-cxf.xml applied to the WS objects you get back from Service.create() and Service.getPort().

Basically you can use <jaxws:client> to provide custom configuration on a per-client basis, the same way as you can use <jaxws:endpoint> to configure your endpoints. But there is great confusion around <jaxws:client>: it looks like it is ignored by JBoss or CXF. For example, interceptors configured on a CXF bus are used properly, but interceptors configured on <jaxws:client> are not.

The reason is that you need quite a crazy identification string for your <jaxws:client>. Things that are perfectly OK for <jaxws:endpoint> will not work for <jaxws:client>. Why? Because it is JBoss, of course!

Some JBoss documentation suggests you have to use <jaxws:client id="{your.service.namespace}YourPortName">. This kind of ID is not valid according to the Spring XML schema. If you put it in your META-INF/jbossws-cxf.xml or WEB-INF/jbossws-cxf.xml, the application startup will fail with an XSD validation error. To be fair to JBoss, it looks like this particular naming convention comes from CXF and not from JBoss.

But that is still not all! <jaxws:client id="{your.service.namespace}YourPortName"> does not work. It does not work if you rely on @Resource or @WebServiceRef, or at least it did not work for me when I tried, though I did not try hard. It does not work for javax.xml.ws.Service.create() either, and I needed that to work, so I wasted most of my time on this use case.

Still other sources say that <jaxws:client id="{your.service.namespace}YourPortName" createdFromAPI="true"> is the way to go. It still does not work! It took me quite some time poking in the CXF and Spring code under a debugger to find the version that worked for me: <jaxws:client id="{your.service.namespace}YourPortName.jaxws-client.proxyFactory">.

Basically it is CXF that ends up looking up a Spring bean named "{your.service.namespace}YourPortName.jaxws-client.proxyFactory" to get its configuration. The funniest thing is that both versions of the configuration register a bean with this name. But for whatever reason, when the bean is created with createdFromAPI="true" it is registered all right but is not found later by CXF, while if it is created with the ".jaxws-client.proxyFactory" suffix it is registered and successfully found by CXF.

I stopped looking further. I found enough to make my application work. But I still wonder: how could I miss it in the beginning? Nice, intuitive, easily discoverable configuration. Unforgivable.

Friday, September 7, 2012

Assorted facts about JBoss. Fact 6: JBoss and CXF: match made in heaven.

Bored? Want to learn something new? Or waste some time? Need a quality headache?

My advice: look at web services and try to make something just a bit more complicated than "hello world" under JBoss. You will never forget it.

JBoss 6 uses CXF under the hood to support the WS-related sections of the EJB specification. Both products have their strong points, but JBoss is definitely the leader in the "most creative WTF" contest.

JBoss supports declarative configuration of both web service implementations and web service clients. All you need to do is create a JBoss specific deployment descriptor, a file called jbossws-cxf.xml, and package it with your application. The nice touch here is that this deployment descriptor is Spring-based. This alone makes JBoss a leading contender.

It is not that I have anything against Spring. Well, I actually have a lot to say about it, but it deserves a separate post.

No, really, just think about it: first you come up with a way to configure the server using XML, supporting hundreds of different schemas, all based on the same XML configuration parser library. And then you say "screw it, I fancy Spring here".

They did not go as far as packaging Spring with JBoss. Why? Who knows. Having said A they might just as well have said B. But no, they use classloading to detect at runtime whether Spring is present. Only if Spring is there does JBoss go ahead and read jbossws-cxf.xml. Wow, say goodbye to all the hard work you put into creating that jbossws-cxf.xml - it is not used anyway. But you do not know that yet.

After some googling and digging you realize that you need Spring. Sigh. But you do need that configuration from jbossws-cxf.xml! Being "standards obeying" you bite the bullet and package Spring with your application. (Being lazy you just drop Spring into server/<server_name>/lib. You save yourself a lot of time but miss a lot of fun.) Redeploy and ... your precious jbossws-cxf.xml is not loaded.

Things are looking up: you are not bored any more. You spend some time double-checking your deployment and redeploying it. Nope, still not loaded. If you study the JBoss logs carefully you might spot this small message: "Spring not available, skipping check for user provided jbossws-cxf.xml / cxf.xml configuration files". If you know your way around the JBoss JMX console you go and check your application's classloader and try to load some Spring classes via it. They load just fine. WTF?

Remember that runtime classloading check I mentioned above? It turns out it runs very early during application deployment, before the classloaders for the application are set up. As a result the check uses the classloader of the WS deployer coming from server/<server_name>/deployers/jbossws.deployer. Surprise!

It is getting more and more interesting: you have to have Spring, and you cannot package it with your application. Right, but you need to make this application work. Forget about standards and keeping things where they belong. Spring goes to server/<server_name>/lib, JBoss restart, wow! Congratulations, you made it.

I leave the question of the proper place for the Spring jars as an exercise. Instead, let's look at more fun that the JBoss-CXF integration might bring.

This is how I first got involved in this matter: I had an EJB application that used JMS. For reasons not important here it was decided that the application should use Apache ActiveMQ. No problem, I tested it locally under JBoss 6.1, it looked OK, and so the application was deployed in an integration environment, together with the ActiveMQ resource adapter. So far so good. Later more applications were added to that environment. One of them failed to start, with some classloading errors involving Spring classes. The application was a war that internally used Spring, so Spring was packaged into WEB-INF/lib and the classloading was properly configured (WEB-INF/lib jars first). I was asked to look into the problem because there was another Spring in the classpath: the one from ActiveMQ, which uses Spring internally.

Of course removing the ActiveMQ RA solved the problem for that application. Changing the classloading rules of the application (to parent first) "solved" the problem as well.

Digging further I realized what had happened. The JBoss classloader architecture is notoriously broken, so the Spring classes from ActiveMQ leak into every application. When this failing application (with the original configuration, WEB-INF/lib jars first) was starting, the JBoss WS deployer noticed some WS annotations and went ahead with whatever it is supposed to do, including looking for Spring and jbossws-cxf.xml. Spring was there, leaked from ActiveMQ. (Remember, the classloader at this moment is the one from jbossws.deployer.) The application did not have jbossws-cxf.xml, but the harm was done: some instances of Spring classes were created along the way and remained referenced from CXF objects.

The application startup continued, JBoss finally created the application specific classloader, and the rest of the startup code was executed using this classloader. Along the way CXF was involved again; it noticed those Spring instances created earlier and went ahead with more Spring dancing. But this time all the Spring classes were coming from WEB-INF/lib, with a predictable result: ClassCastException.

Of course changing the classloading configuration would fix the problem in this case: the Spring classes would always come from ActiveMQ, keeping CXF happy. Well, until some other application, RA, whatever, that has Spring packaged is deployed in the same server. And do not forget that it can be a different version of the Spring jars. Care to resolve this mess?

Priceless! Never attribute to malice that which ... I am not sure. Sometimes I think somebody at RedHat came up with this brilliant business idea of how to make more money out of JBoss support ...

Tuesday, July 17, 2012

How good intentions can grind SSL to a halt

We were running a load test of our application installed at a customer. The setup was quite complicated: several nodes running the application at the customer site and a single test server outside of the customer network struggling to send a lot of messages to the customer. The servers running there were in turn sending separate ack messages back to our test server.

It also involved a firewall at the customer site, HTTPS with client authentication, custom key- and trust stores, XML digital signatures, an Oracle database, Oracle AQ queues, and more. And JBoss as the application server.

The setup was working all right, but the customer had very strict performance requirements which we had to meet. Hence the tests.

After solving a couple of obvious and not so obvious problems we reached quite a good throughput. Not what we wanted, but reasonable. The problem was that when we tried to increase the load, all hell broke loose.

The log files contained a large number of "send failed" messages with HTTP status code 403 and some HTML page clearly coming from the customer's firewall. The HTML was a generic "you are not allowed in" error message. The customer also reported seeing a lot of "read timeout" messages in the firewall's logs.

The firewall was not really a dummy thing in our setup. It also handled SSL at the customer side. Our test server sent HTTPS messages to the firewall, which did its authentication magic and then passed the request to one of the nodes using HTTP. On the way back the application sent the ack messages as HTTP requests to the firewall, which then initiated HTTPS to our test server.

The 403 errors happened both on our test server and on the servers at the customer, but the pattern was different. Under a moderate load the errors were happening at the customer side; not many, compared to the number of messages, and the first errors came quite quickly. There were no errors on the test server.

Increasing the load did not change the situation until we reached some threshold. The situation at the customer side stayed the same, but we immediately started to see 403 messages on our test server. Many, many more than at the customer side.

How can such a problem be analyzed? There are probably several ways, but I will describe what I did.
  1. Switch on SSL logging on our test server: -Djavax.net.debug=ssl and rerun the tests.

  2. Analyze the logs. The logs were already quite large, and the SSL debug setting added a lot more info. And the test server was both sending and receiving HTTPS messages. So analyzing the logs took some time, but eventually I started to see a pattern:
    <timestamp> INFO  [STDOUT] (<thread name>) *** ClientHello, Unknown-3.3
    <timestamp> INFO  [STDOUT] (<thread name>) *** 
    ...
    <timestamp + [10-20-30sec]> INFO  [STDOUT] (<thread name>) INFO %% Created:  [Session-101, TLS_DHE_RSA_WITH_AES_128_CBC_SHA]
    <timestamp + [20-30sec]> INFO  [STDOUT] (<thread name>) *** ServerHello, TLSv1
    ...
    <timestamp + [20-30sec]> INFO  [STDOUT] (<thread name>) *** ServerHelloDone
    <timestamp + [20-30sec]> INFO  [STDOUT] (<thread name>) WRITE: TLSv1 Alert, length = 2
    <timestamp + [20-30sec]> INFO  [STDOUT] (<thread name>) Exception sending alert: java.net.SocketException: Broken pipe
    
    So the server thread receives a ClientHello SSL message and then does something for dozens of seconds before trying to send the ServerHello message. And by then it is too late: the sending side has closed the connection.

  3. When the same thread was again involved, the log was different:
    <timestamp> INFO  [STDOUT] (<thread name>) *** ClientHello, Unknown-3.3
    <timestamp> INFO  [STDOUT] (<thread name>) *** 
    <timestamp> INFO  [STDOUT] (<thread name>) %% Resuming [Session-78, TLS_DHE_RSA_WITH_AES_128_CBC_SHA]
    <timestamp> INFO  [STDOUT] (<thread name>) *** ServerHello, TLSv1
    …
    <timestamp + [200-300ms]> INFO  [STDOUT] (<thread name>) *** 
    <timestamp + [200-300ms]> INFO  [STDOUT] (<thread name>)  *** READ: TLSv1 Application Data, length = 208
    ...
    
    The delay happened only if a new SSL session was created. It did not happen for all created sessions though: in some cases there was some delay between ClientHello and ServerHello, but not that large, well under 10 seconds.

  4. Googling did not help. Most hits describe a "reverse DNS lookup" problem which I believed had nothing to do with our problem. Just in case, I verified this with a simple test application, which did not experience any delays doing reverse DNS lookups, either for known or unknown addresses.
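
    The check was roughly like the following (a reconstruction, not the original test; the addresses are placeholders):

    import java.net.InetAddress;

    public class ReverseDnsCheck {
        public static void main(String[] args) throws Exception {
            // Replace with the real peer addresses from your setup.
            String[] addresses = {"8.8.8.8", "192.0.2.1"};
            for (String ip : addresses) {
                long start = System.nanoTime();
                // getCanonicalHostName() performs the reverse lookup.
                String host = InetAddress.getByName(ip).getCanonicalHostName();
                long ms = (System.nanoTime() - start) / 1000000;
                System.out.println(ip + " -> " + host + " (" + ms + " ms)");
            }
        }
    }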

  5. Everything I had seen so far could explain the 403s in the log at the customer side. The problem happened only if a new SSL session was created, which did not happen that often; this explained why there were not that many 403s. And the reason for the errors starting to appear almost immediately was also clear: more SSL sessions had to be created to cope with more connection attempts from the customer to our test server.

  6. The question remained: what was causing the delays? Since I am lazy I did not want to waste my time running network traffic analysers. A good move, in hindsight.

    Simple HTTPS tests in my development environment were OK. Reproducing the complete setup locally was out of the question. And debugging on the server ... well, I tried it, but it was extremely slow most of the time. SSH tunnel, VM server, whatever.

    Instead I started the tests again and fired jstack -l <jboss_pid> in a loop, redirecting the output to individual files.

    As soon as I noticed 'broken pipe' messages in the log I stopped the test, found the thread name that logged 'broken pipe' and looked at the jstack output to see what the thread was doing between outputting ClientHello and ServerHello. Not much, apparently:
    "<thread-name>" daemon prio=10 tid=0x00002aab10d17000 nid=0x354b waiting for monitor entry [0x0000000053617000]
       java.lang.Thread.State: BLOCKED (on object monitor)
     at java.security.SecureRandom.nextBytes(SecureRandom.java:433)
     - waiting to lock <0x00002aaacfc89bf0> (a java.security.SecureRandom)
     at java.math.BigInteger.randomBits(BigInteger.java:479)
    ...
     at org.bouncycastle.crypto.generators.DHParametersHelper.generateSafePrimes(Unknown Source)
     at org.bouncycastle.crypto.generators.DHParametersGenerator.generateParameters(Unknown Source)
     at org.bouncycastle.jce.provider.JDKKeyPairGenerator$DH.generateKeyPair(Unknown Source)
     at com.sun.net.ssl.internal.ssl.DHCrypt.<init>(DHCrypt.java:76)
    ...
     at com.sun.net.ssl.internal.ssl.ServerHandshaker.clientHello(ServerHandshaker.java:425)
    This "waiting to lock <0x00002aaacfc89bf0>" was in almost every jstack output, and then multiple times. There was of course one thread holding the lock, and the thread was also processing ClientHello and generating DH (Diffie-Hellman) key pair.

  7. OK, I found the bottleneck, but what next? One thing that bothered me was that bouncycastle in the stack trace, but first I concentrated on the BigInteger generation. Unfortunately so, because I lost some time on that: on reading about /dev/random, /dev/urandom, and various posts about the default Java SecureRandom being not that fast. But trying to switch to something else involved quite some work.

  8. I decided to look at bouncycastle. As it happened, the application had bouncycastle packaged because it used some of its classes, but we never needed it to be a security provider. The fact that it was there in the stack trace clearly proved that it was installed as a security provider anyway. Fortunately the tests we were doing did not use the functionality that relied on bouncycastle, so I just deleted the jars and executed the tests again.

    I actually ran the tests several times, with and without bouncycastle. Every run with bouncycastle had some 'broken pipe' messages in the log. Every run without it had none. And there were no delays between ClientHello and ServerHello messages.

  9. Well, is bouncycastle really that slow? I was not sure, but I doubted it. More googling and looking at the sources of bouncycastle and the crypto providers that come with the JDK provided the explanation.

    When the SSL server code needs to generate a Diffie-Hellman key pair for a new session, it knows only the required key strength. It asks Java for a key pair generator and uses it to generate a key pair of the required strength. To generate the key pair some DH parameters (which include large random prime numbers) are needed. A newly created bouncycastle key pair generator generates these parameters the first time it is used, and the SSL code never reuses key pair generators.

    Result: bouncycastle has to generate large random prime numbers for each new key pair. The default JDK DH code generates DH parameters only once and then caches them, so every newly created key pair generator just reuses what is already there!

    Interestingly, the latest version of bouncycastle (1.47) has added some caching, but it is per thread and, if I am not mistaken, the DH parameters have to be cached explicitly.

  10. I verified that my findings were correct with a small test application that generated DH key pairs using the default crypto provider and bouncycastle. Four threads doing it in parallel took less than 100 ms with the default provider and more than a minute with bouncycastle.
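
    The original test application is not reproduced here, but a minimal reconstruction of the comparison looks roughly like this (it assumes the bouncycastle jar is on the classpath; the key size and thread count are illustrative):

    import java.security.KeyPairGenerator;
    import java.security.Security;
    import org.bouncycastle.jce.provider.BouncyCastleProvider;

    public class DhKeygenCompare {
        public static void main(String[] args) throws Exception {
            Security.addProvider(new BouncyCastleProvider());
            for (String provider : new String[] {"SunJCE", "BC"}) {
                final String p = provider;
                Thread[] threads = new Thread[4];
                long start = System.nanoTime();
                for (int i = 0; i < threads.length; i++) {
                    threads[i] = new Thread(new Runnable() {
                        public void run() {
                            try {
                                // A fresh generator per key pair, which is what the SSL code does per handshake.
                                KeyPairGenerator kpg = KeyPairGenerator.getInstance("DH", p);
                                kpg.initialize(1024);
                                kpg.generateKeyPair();
                            } catch (Exception e) {
                                throw new RuntimeException(e);
                            }
                        }
                    });
                    threads[i].start();
                }
                for (Thread t : threads) {
                    t.join();
                }
                System.out.println(p + ": " + (System.nanoTime() - start) / 1000000 + " ms");
            }
        }
    }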

  11. OK, the delays are explained. But why did bouncycastle end up being registered as a security provider? For my test application I had to add it explicitly with Security.addProvider(). And even then it was not used if I just asked for a DH key pair generator; it was only used when I explicitly asked for a DH key pair generator from bouncycastle.

    And it is not mentioned anywhere in the JDK/JRE or JBoss startup or deploy files. It means that somewhere there is some piece of code that not only loads the bouncycastle provider but also registers it before at least some of the JDK's default providers.

    Another round of tests, this time with -Djava.security.debug=provider. This produces enough information to understand when a provider is loaded. The log output clearly showed that bouncycastle was not there during JBoss startup or after the startup completed. But almost immediately after I started to send test data, some output about bouncycastle appeared in the log.

    Unfortunately java.security.debug does not help with the most important question: what component/jar/whatever is loading a provider.

  12. Last step: I modified BouncyCastleProvider so that it prints a stack trace to the log when it is instantiated. Another test run, and we have a winner:
    java.lang.Exception: XXYYZZ
           at org.bouncycastle.jce.provider.BouncyCastleProvider.<init>(BouncyCastleProvider.java:78)
           at sun.reflect.xxx
    ...
           at java.lang.Class.newInstance(Class.java:308)
           at org.apache.ws.security.WSSConfig.loadProvider(WSSConfig.java:569)
    ...
           at org.apache.ws.security.WSSConfig.<init>(WSSConfig.java:304)
    ...
           at org.apache.ws.security.message.WSSecTimestamp.<init>
    This class comes from wss4j. And what a beauty it is! You can find the full source code on the internet, for example here.

    The important parts are:
    /**
         * a static boolean flag that determines whether default JCE providers
         * should be added at the time of construction.
         *
         * These providers, and the order in which they are added, can interfere
         * with some JVMs (such as IBMs).
         */
        private static boolean addJceProviders = true;
    
        /**
         * Set the value of the internal addJceProviders flag.  This flag
         * turns on (or off) automatic registration of known JCE providers
         * that provide necessary cryptographic algorithms for use with WSS4J.
         * By default, this flag is true, for backwards compatibility.  You may
         * wish (or need) to initialize the JCE manually, e.g., in some JVMs.
         */
        public static void setAddJceProviders(boolean value) {
            addJceProviders = value;
        }
        private synchronized void staticInit() {
    ...
            if (addJceProviders) {
                addJceProvider("BC", "org.bouncycastle.jce.provider.BouncyCastleProvider");
            }
    ...
        }
    ...
        private boolean loadProvider(String id, String className) {
    ...
            // Install the provider after the SUN provider (see WSS-99)
            // Otherwise fall back to the old behaviour of inserting
            // the provider in position 2. For AIX, install it after
            // the IBMJCE provider.
            //
            int ret = 0;
            for (int i = 0; i < provs.length; i++) {
                if ("SUN".equals(provs[i].getName())
                    || "IBMJCE".equals(provs[i].getName())) {
                    ret =
                        java.security.Security.insertProviderAt(
                             (java.security.Provider) c.newInstance(), i + 2);
                    break;
                }
            }
            if (ret == 0) {
                ret =
                    java.security.Security.insertProviderAt(
                         (java.security.Provider) c.newInstance(), 2);
            }
    ...
        }

    Wow! Not only do these ... alternatively thinking ... individuals mess with an environment they have no business touching, they do it by default, so if I do not want it I have to opt out. They also knew that what they were doing could cause problems: they were bitten by this code at least once (WSS-99). But the code is still there. Some people just never learn. Unfortunately they lay out the rakes and let other people step on them.

    And the reason? It is there, in WSS-99: so that they can get to strong encryption algorithms on some JDKs. The road to hell is paved with good intentions.

    Newsflash, guys: what you are doing is wrong. You have no business altering the runtime environment like that. Your jar is just a small piece of a big pile of code running on a server. And the problem you so creatively "solved" could have been solved in a much less obtrusive way, and only for those environments where the problem really existed.

    But no, "we know better" kicks in again.

Now the cause of the performance problem is identified, but how can it be fixed? Unfortunately our application needs wss4j. But even if we did not need it, it comes with JBoss. It is there, silently sitting and waiting for its moment.

The best way is of course to patch wss4j. I do not have a problem with patching, but unfortunately it is not a solution in our case. In many cases we cannot control the environment: quite often our application is installed into a "standard" JBoss environment of a client. Normally our application is the only application in this environment, but we are still allowed to make only some configuration changes, like JAVA_OPTS and such.

Another possibility is to make sure bouncycastle is loaded explicitly, for example via jre/lib/security/java.security. But again, it is not always allowed. In addition it just does not feel right: that JDK is not used only by our application, and bouncycastle is definitely not present in the classpath by default.

One can repackage bouncycastle to remove the class org.bouncycastle.jce.provider.BouncyCastleProvider, or try to mess with classloading to prevent this class from being loaded. But these solutions look even less attractive.

A SecurityManager can help as well, because adding a provider is hooked into it, but using a SecurityManager brings its own share of nuances. For starters it requires understanding the SecurityManager. Good luck with that :-)

The WSSConfig class has a way to disable this stupid functionality, so going this route can also solve the problem. The only challenge is to do it early enough, before WSSConfig manages to screw things up. The best way is to build a separate jar with some JBoss deployment descriptor so that the code is triggered during JBoss startup.

To continue testing I added some code to our application that does two things. First it disables WSSConfig's automatic provider loading. Then it checks whether bouncycastle is already loaded and which DH key pair generator is used by default; if I get bouncycastle's key pair generator I just move bouncycastle to the end of the provider list. Yes, I know, I am changing an environment I have no right to touch :-) but we need to continue testing. Later I will probably move this code into a separate JBoss specific jar. The code looks roughly like the sketch below.
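
A rough sketch of that code (the WSSConfig class is the wss4j one quoted above; error handling omitted, and it has to run before anything creates a WSSConfig instance):

import java.security.KeyPairGenerator;
import java.security.Provider;
import java.security.Security;
import org.apache.ws.security.WSSConfig;

public class BcDemotion {
    public static void init() throws Exception {
        // Stop wss4j from registering bouncycastle on its own.
        WSSConfig.setAddJceProviders(false);

        // If somebody has already registered bouncycastle in front of the default
        // providers, move it to the end of the list so the JDK DH generator wins.
        Provider defaultDh = KeyPairGenerator.getInstance("DH").getProvider();
        if ("BC".equals(defaultDh.getName())) {
            Provider bc = Security.getProvider("BC");
            Security.removeProvider("BC");
            Security.addProvider(bc); // addProvider() appends at the end of the list
        }
    }
}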

And for the authors of this wonderful library, wss4j ... Guys, there is still room for improvement. Maybe my hard drive is not partitioned optimally? Or how about patching KDE2 under FreeBSD? There are so many unsolved problems around. Do not let common sense stop you.