Tuesday, July 17, 2012

How good intentions can grind SSL to a halt

We were running a load test of our application installed at a customer's site. The setup was quite complicated: several nodes running the application at the customer site, and a single test server outside of the customer's network doing its best to send a lot of messages to them. The servers at the customer site were in turn sending separate ack messages back to our test server.

It also involved a firewall at the customer site, HTTPS with client authentication, custom key and trust stores, XML digital signatures, an Oracle database, Oracle AQ queues, and more. And JBoss as the application server.

The setup was working all right, but the customer had very strict performance requirements which we had to meet. Hence the tests.

After solving a couple of obvious and not-so-obvious problems we reached quite a good throughput. Not what we wanted, but reasonable. The problem was that when we tried to increase the load, all hell broke loose.

The log files contained a large number of "send failed" messages with HTTP status code 403 and some HTML page clearly coming from the customer's firewall. The HTML was a generic "you are not allowed in" error message. The customer also reported seeing a lot of "read timeout" messages in the firewall's logs.

The firewall was not just a dumb box in our setup: it also handled SSL on the customer side. Our test server sent HTTPS messages to the firewall, which performed its authentication magic and then passed the request on to one of the nodes using plain HTTP. On the way back, the application sent the ack messages as HTTP requests to the firewall, which then initiated HTTPS connections to our test server.

403 errors happened both on our test server and on the servers at the customer site, but the pattern was different. Under a moderate load the errors were happening on the customer side; not many of them, compared to the number of messages, but the first errors appeared quite quickly. There were no errors on the test server.

Increasing the load did not change the situation until we reached a certain threshold. The picture on the customer side stayed the same, but we immediately started to see 403 messages on our test server as well; many, many more than on the customer side.

How can such a problem be analyzed? There are probably several ways; here is what I did.
  1. Switch on SSL debug logging on our test server with -Djavax.net.debug=ssl and rerun the tests.

  2. Analyze the logs. The logs were already quite large, and the SSL debug setting added a lot more information. The test server was also both sending and receiving HTTPS messages, so analyzing the logs took some time. But eventually I started to see a pattern.
    <timestamp> INFO  [STDOUT] (<thread name>) *** ClientHello, Unknown-3.3
    <timestamp> INFO  [STDOUT] (<thread name>) *** 
    ...
    <timestamp + [10-20-30sec]> INFO  [STDOUT] (<thread name>) %% Created:  [Session-101, TLS_DHE_RSA_WITH_AES_128_CBC_SHA]
    <timestamp + [20-30sec]> INFO  [STDOUT] (<thread name>) *** ServerHello, TLSv1
    ...
    <timestamp + [20-30sec]> INFO  [STDOUT] (<thread name>) *** ServerHelloDone
    <timestamp + [20-30sec]> INFO  [STDOUT] (<thread name>) WRITE: TLSv1 Alert, length = 2
    <timestamp + [20-30sec]> INFO  [STDOUT] (<thread name>) Exception sending alert: java.net.SocketException: Broken pipe
    
    So the server thread receives a ClientHello SSL message and then does something for tens of seconds before trying to send the ServerHello message. And by then it is too late: the sending side has already closed the connection.

  3. When the same thread was involved again later, the log looked different:
    <timestamp> INFO  [STDOUT] (<thread name>) *** ClientHello, Unknown-3.3
    <timestamp> INFO  [STDOUT] (<thread name>) *** 
    <timestamp> INFO  [STDOUT] (<thread name>) %% Resuming [Session-78, TLS_DHE_RSA_WITH_AES_128_CBC_SHA]
    <timestamp> INFO  [STDOUT] (<thread name>) *** ServerHello, TLSv1
    …
    <timestamp + [200-300ms]> INFO  [STDOUT] (<thread name>) *** 
    <timestamp + [200-300ms]> INFO  [STDOUT] (<thread name>)  *** READ: TLSv1 Application Data, length = 208
    ...
    
    The delay happened only when a new SSL session was created, though not for every newly created session. In some cases there was a delay between ClientHello and ServerHello, but not that large: well under 10 seconds.

  4. Googling did not help. Most of the hits described the "reverse DNS lookup" problem, which I believed had nothing to do with our case. Just in case, I verified this with a simple test application, which did not experience any delays doing reverse DNS lookups, either for known or for unknown addresses.
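    The check was along these lines (a reconstruction, not the original test code; the addresses are placeholders, not the real ones from our setup):
      import java.net.InetAddress;

      public class ReverseDnsCheck {
          public static void main(String[] args) throws Exception {
              // one address that resolves back to a name and one that (probably) does not
              String[] addresses = { "8.8.8.8", "10.255.254.253" };
              for (String ip : addresses) {
                  long start = System.currentTimeMillis();
                  // getCanonicalHostName() is what triggers the reverse DNS lookup
                  String host = InetAddress.getByName(ip).getCanonicalHostName();
                  long elapsed = System.currentTimeMillis() - start;
                  System.out.println(ip + " -> " + host + " (" + elapsed + " ms)");
              }
          }
      }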

  5. Everything I had seen so far matched the 403s in the logs on the customer side: our test server was slow to complete the handshake for incoming ack connections, the firewall timed out waiting (hence the "read timeout" entries in its logs) and returned 403 to the node that had sent the ack. The problem happened only when a new SSL session was created, which did not happen that often; this explained why there were not that many 403s. And the reason the errors started to appear almost immediately was also clear: more SSL sessions had to be created to cope with the connection attempts coming from the customer to our test server.

  6. The question remained: what was causing the delays? Since I am lazy, I did not want to waste my time running network traffic analysers. A good move, in hindsight.

    Simple HTTPS tests in my development environment were fine, and reproducing the complete setup locally was out of the question. As for debugging on the server ... well, I tried it, but it was extremely slow most of the time: SSH tunnel, VM server, whatever.

    Instead I started the tests again and fired jstack -l <jboss_pid> in a loop, redirecting the output to individual files.

    As soon as I noticed 'broken pipe' messages in the log I stopped the test, found the name of the thread that logged 'broken pipe', and looked at the jstack output to see what that thread was doing between logging ClientHello and ServerHello. Not much, apparently:
    "<thread-name>" daemon prio=10 tid=0x00002aab10d17000 nid=0x354b waiting for monitor entry [0x0000000053617000]
       java.lang.Thread.State: BLOCKED (on object monitor)
     at java.security.SecureRandom.nextBytes(SecureRandom.java:433)
     - waiting to lock <0x00002aaacfc89bf0> (a java.security.SecureRandom)
     at java.math.BigInteger.randomBits(BigInteger.java:479)
    ...
     at org.bouncycastle.crypto.generators.DHParametersHelper.generateSafePrimes(Unknown Source)
     at org.bouncycastle.crypto.generators.DHParametersGenerator.generateParameters(Unknown Source)
     at org.bouncycastle.jce.provider.JDKKeyPairGenerator$DH.generateKeyPair(Unknown Source)
     at com.sun.net.ssl.internal.ssl.DHCrypt.<init>(DHCrypt.java:76)
    ...
     at com.sun.net.ssl.internal.ssl.ServerHandshaker.clientHello(ServerHandshaker.java:425)
    This "waiting to lock <0x00002aaacfc89bf0>" was in almost every jstack output, and then multiple times. There was of course one thread holding the lock, and the thread was also processing ClientHello and generating DH (Diffie-Hellman) key pair.

  7. OK, I had found the bottleneck, but what next? One thing that bothered me was that bouncycastle showed up in the stack trace, but at first I concentrated on the BigInteger generation. Unfortunately so, because I lost some time on that: reading about /dev/random, /dev/urandom, and various posts about Java's default SecureRandom not being that fast. But switching to something else involved quite some work.

  8. I decided to look at bouncycastle. As it happened, the application had bouncycastle packaged because it used some of its classes, but we never needed it to be a security provider. The fact that it was there in the stack trace clearly proved that it had been installed as a security provider anyway. Fortunately the tests we were doing did not use the functionality that relied on bouncycastle, so I just deleted the jars and executed the tests again.

    I actually ran the tests several times, with and without bouncycastle. Every run with bouncycastle had some 'broken pipe' messages in the log. Every run without it had none. And there were no delays between ClientHello and ServerHello messages.

  9. Well, is bouncycastle really that slow? I was not sure, but I doubted it. More googling and a look at the sources of bouncycastle and the crypto providers that come with the JDK provided the explanation.

    When the SSL server code needs to generate a Diffie-Hellman key pair for a new session, it knows only the required key strength. It asks Java for a key pair generator and uses it to generate a key pair of that strength. To generate the key pair, some DH parameters (which include large random prime numbers) are needed. A newly created bouncycastle key pair generator generates these parameters the first time it is used, and the SSL code never reuses key pair generators.

    The result: bouncycastle has to generate large random prime numbers for every new key pair. The default JDK DH code generates DH parameters only once and then caches them, so every newly created key pair generator just reuses what is already there!
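    In JCA terms, the handshake code effectively does the following for every new session (a simplified sketch of what the JSSE DHCrypt/ServerHandshaker classes from the stack trace do; the 1024-bit strength is just an example value):
      import java.security.KeyPairGenerator;

      public class PerSessionDhSketch {
          public static void main(String[] args) throws Exception {
              // a fresh generator per handshake, initialized only with the key strength
              KeyPairGenerator kpg = KeyPairGenerator.getInstance("DiffieHellman");
              kpg.initialize(1024);
              kpg.generateKeyPair();
              System.out.println("DH key pair generated by: " + kpg.getProvider().getName());
              // With the JDK's own provider the DH parameters come from a cache;
              // with bouncycastle answering this request, new safe primes are
              // generated on every call - that is the delay we were seeing.
          }
      }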

    Interestingly, the latest version of bouncycastle (1.47) has added some caching, but it is per thread and, if I am not mistaken, the DH parameters have to be cached explicitly.

  10. I verified that my findings were correct with a small test application that generated a DH key pair using the default crypto provider and then bouncycastle. Four threads doing it in parallel took less than 100 ms with the default provider and more than a minute with bouncycastle.
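    The test application was essentially the following (a reconstruction, not the original code; it needs the bouncycastle provider jar on the classpath):
      import java.security.KeyPairGenerator;
      import java.security.Security;
      import org.bouncycastle.jce.provider.BouncyCastleProvider;

      public class DhProviderBenchmark {
          public static void main(String[] args) throws Exception {
              Security.addProvider(new BouncyCastleProvider()); // appended at the end of the list
              time("default provider", null);
              time("bouncycastle    ", "BC");
          }

          private static void time(String label, final String provider) throws Exception {
              Thread[] threads = new Thread[4];
              long start = System.currentTimeMillis();
              for (int i = 0; i < threads.length; i++) {
                  threads[i] = new Thread(new Runnable() {
                      public void run() {
                          try {
                              KeyPairGenerator kpg = (provider == null)
                                      ? KeyPairGenerator.getInstance("DH")
                                      : KeyPairGenerator.getInstance("DH", provider);
                              kpg.initialize(1024);
                              kpg.generateKeyPair();
                          } catch (Exception e) {
                              throw new RuntimeException(e);
                          }
                      }
                  });
                  threads[i].start();
              }
              for (Thread t : threads) {
                  t.join();
              }
              System.out.println(label + ": " + (System.currentTimeMillis() - start) + " ms");
          }
      }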

  11. OK, the delays were explained. But why did bouncycastle end up being registered as a security provider? For my test application I had to add it explicitly with Security.addProvider(). And even then it was not used when I just asked for a DH keypair generator; it was only used when I explicitly asked for bouncycastle's DH keypair generator.
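    By the way, checking which provider actually serves a plain request is a one-liner, and dumping the provider list shows the preference order (bouncycastle only wins if something registered it near the top):
      import java.security.KeyPairGenerator;
      import java.security.Provider;
      import java.security.Security;

      public class WhoServesDh {
          public static void main(String[] args) throws Exception {
              // which provider answers a provider-less request for DH?
              KeyPairGenerator kpg = KeyPairGenerator.getInstance("DH");
              System.out.println("plain DH request served by: " + kpg.getProvider().getName());

              // the registered providers, in order of preference
              for (Provider p : Security.getProviders()) {
                  System.out.println("  " + p.getName() + " " + p.getVersion());
              }
          }
      }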

    And bouncycastle is not mentioned anywhere in the JDK/JRE or JBoss startup or deploy files. That means there is some piece of code somewhere that not only loads the bouncycastle provider but also registers it before at least some of the JDK's default providers.

    Another round of tests, this time with -Djava.security.debug=provider. This produces enough information to understand when a provider is loaded. The log output clearly showed that bouncycastle was not there during JBoss startup or after the startup had completed. But almost immediately after I started to send test data, some output about bouncycastle appeared in the log.

    Unfortunately java.security.debug does not help with the most important question: which component/jar/whatever is loading the provider.

  12. The last step: I modified BouncyCastleProvider so that it prints a stack trace to the log when it is instantiated. Another test run, and we have the winner:
    java.lang.Exception: XXYYZZ
           at org.bouncycastle.jce.provider.BouncyCastleProvider.<init>(BouncyCastleProvider.java:78)
           at sun.reflect.xxx
    ...
           at java.lang.Class.newInstance(Class.java:308)
           at org.apache.ws.security.WSSConfig.loadProvider(WSSConfig.java:569)
    ...
           at org.apache.ws.security.WSSConfig.<init>(WSSConfig.java:304)
    ...
           at org.apache.ws.security.message.WSSecTimestamp.<init>
    This class comes from wss4j. And what a beauty it is! You can find the full source code on the internet, for example here.

    The important parts are:
    /**
         * a static boolean flag that determines whether default JCE providers
         * should be added at the time of construction.
         *
         * These providers, and the order in which they are added, can interfere
         * with some JVMs (such as IBMs).
         */
        private static boolean addJceProviders = true;
    
        /**
         * Set the value of the internal addJceProviders flag.  This flag
         * turns on (or off) automatic registration of known JCE providers
         * that provide necessary cryptographic algorithms for use with WSS4J.
         * By default, this flag is true, for backwards compatibility.  You may
         * wish (or need) to initialize the JCE manually, e.g., in some JVMs.
         */
        public static void setAddJceProviders(boolean value) {
            addJceProviders = value;
        }
        private synchronized void staticInit() {
    ...
            if (addJceProviders) {
                addJceProvider("BC", "org.bouncycastle.jce.provider.BouncyCastleProvider");
            }
    ...
        }
    ...
        private boolean loadProvider(String id, String className) {
    ...
            // Install the provider after the SUN provider (see WSS-99)
            // Otherwise fall back to the old behaviour of inserting
            // the provider in position 2. For AIX, install it after
            // the IBMJCE provider.
            //
            int ret = 0;
            for (int i = 0; i < provs.length; i++) {
                if ("SUN".equals(provs[i].getName())
                    || "IBMJCE".equals(provs[i].getName())) {
                    ret =
                        java.security.Security.insertProviderAt(
                             (java.security.Provider) c.newInstance(), i + 2);
                    break;
                }
            }
            if (ret == 0) {
                ret =
                    java.security.Security.insertProviderAt(
                         (java.security.Provider) c.newInstance(), 2);
            }
    ...
        }

    Wow! Not only do these ... alternatively thinking ... individuals mess with an environment they have no business touching, they do it by default, so if I do not want it I have to opt out. They also knew that what they were doing could cause problems: they were bitten by this code at least once (WSS-99). But the code is still there. Some people just never learn. Unfortunately they lay out the rakes and let other people step on them.

    And the reason? It is right there in WSS-99: so that they can get to strong encryption algorithms on some JDKs. The road to hell is paved with good intentions.

    Newsflash, guys: whatever you are doing is wrong. You have no business altering the runtime environment like that. Your jar is just a small piece of a big pile of code running on a server. And the problem you so creatively "solved" could have been solved in a much less obtrusive way, and only for those environments where the problem really existed.

    But no, "we know better" kicks in again.

Now the cause of the performance problem is identified, but how can it be fixed? Unfortunately our application needs wss4j. And even if we did not need it, it comes with JBoss: it sits there silently, waiting for its moment.

The best way is of course to patch wss4j. I do not have a problem with patching, but unfortunately it is not a solution in our case, because we often cannot control the environment. Quite often our application is installed into a client's "standard" JBoss environment. Normally our application is the only application there, but we are still allowed to make only certain configuration changes, like JAVA_OPTS and such.

Another possibility is to make sure bouncycastle is registered explicitly, for example via jre/lib/security/java.security. But again, this is not always allowed. Besides, it just does not feel right: that JDK is not necessarily used only by our application, and bouncycastle is definitely not on the classpath by default.
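For reference, that would mean a line roughly like this in jre/lib/security/java.security (with <n> one higher than the last registered provider, and the bouncycastle jar visible to the JVM, e.g. in jre/lib/ext):
    security.provider.<n>=org.bouncycastle.jce.provider.BouncyCastleProvider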

One could repackage bouncycastle to remove the class org.bouncycastle.jce.provider.BouncyCastleProvider, or mess with classloading to prevent this class from being loaded. But these solutions look even less attractive.

A SecurityManager can help as well, because adding a provider goes through a SecurityManager check, but using a SecurityManager brings its own share of nuances. For starters, it requires understanding the SecurityManager. Good luck with that :-)
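If one really wanted to go that way, the hook is that Security.addProvider()/insertProviderAt() asks the SecurityManager via checkSecurityAccess() with a target of the form "insertProvider.<provider name>", at least in the JDKs of that era. A sketch of the idea, and only a sketch: the permissive checkPermission() below switches the rest of the sandbox off, which may or may not be acceptable.
    import java.security.Permission;

    // Rejects registration of new JCE providers and allows everything else.
    public class NoNewProvidersSecurityManager extends SecurityManager {
        @Override
        public void checkPermission(Permission perm) {
            // allow everything else
        }

        @Override
        public void checkPermission(Permission perm, Object context) {
            // allow everything else
        }

        @Override
        public void checkSecurityAccess(String target) {
            // Security.addProvider()/insertProviderAt() end up here
            if (target != null && target.startsWith("insertProvider")) {
                throw new SecurityException("dynamic provider registration is disabled: " + target);
            }
        }
    }
It would also have to be installed very early, for example with System.setSecurityManager() in some startup code or via -Djava.security.manager=<class name> on the command line.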

The WSSConfig class offers a way to disable this stupid functionality, so going that route can also solve the problem. The only challenge is to do it early enough, before WSSConfig manages to screw things up. The best way is probably a separate jar with a JBoss deployment descriptor, so that the code is triggered during JBoss startup.

To continue testing I added some code to our application that does two things. First it disables WSSConfig's automatic provider loading. Then it checks whether bouncycastle is already registered and which DH keypair generator is used by default; if I get bouncycastle's keypair generator, I simply move bouncycastle to the end of the provider list. Yes, I know, I am changing an environment I have no right to touch :-) but we need to continue testing. Later I will probably move this code into a separate JBoss-specific jar.
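The idea, stripped of application specifics, is roughly this (WSSConfig.setAddJceProviders() is the wss4j method shown above; the rest is plain JCA):
    import java.security.KeyPairGenerator;
    import java.security.Provider;
    import java.security.Security;
    import org.apache.ws.security.WSSConfig;

    public class Wss4jProviderWorkaround {
        public static void apply() {
            // 1. Tell wss4j to keep its hands off the provider list.
            //    This has to run before the first WSSConfig instance is created.
            WSSConfig.setAddJceProviders(false);

            // 2. If bouncycastle already won the race and serves plain DH requests,
            //    demote it to the end of the provider list.
            try {
                KeyPairGenerator kpg = KeyPairGenerator.getInstance("DH");
                Provider dhProvider = kpg.getProvider();
                if ("BC".equals(dhProvider.getName())) {
                    Security.removeProvider("BC");
                    Security.addProvider(dhProvider); // addProvider() appends at the end
                }
            } catch (Exception e) {
                // no DH key pair generator available at all - nothing to demote
            }
        }
    }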

And for the authors of this wonderful library, wss4j ... Guys, there is still room for improvement. Maybe my hard drive is not partitioned optimally? Or how about patching KDE2 under FreeBSD? There are so many unsolved problems around. Do not let common sense stop you.