Friday, May 27, 2011

Mule startup quirks

The Mule saga continues. The next problem I wanted to tackle was "we had to restart the mule server and the problems went away … or not ... and we had to restart it again ...".

A small detour: Mule can run as a standalone application where it handles everything itself. It can also be embedded into another application or packaged as a JEE web application and deployed in a web container. Originally the project's Mule was running as a standalone server. This was changed at the client's request some time before going into production, and now Mule is an EAR file with a single web application that runs in a separate OC4J instance. Mule provides the necessary code to make this work: a servlet context listener (MuleXmlBuilderContextListener) to initialize Mule and a servlet (MuleReceiverServlet) to receive HTTP requests. As a side effect of this embedded mode Mule does not really control its own lifecycle: everything is managed through OC4J and OPMN. This was also given as the reason for using embedded mode: "look, this way we get Mule into our management infrastructure!"

Back to the topic: needless to say, stopping and starting the Mule server multiple times locally or in a development or test environment did not cause any problems. Time to look at the production log files. No surprises there: they are full of exception stack traces, which makes it much harder to find something if one does not know what to look for. It took me some time to find this lonely line:

ERROR [OC4J Launcher] org.mule.retry.notifiers.ConnectNotifier: Failed to connect/reconnect: jms://QUEUE_NAME. Root Exception was: Io exception: Connection reset. Type: class java.sql.SQLException


After looking at the log entries before and after the one I had found, I realized what had happened and why it was necessary to restart Mule multiple times. It goes like this:

  1. The Oracle database that provides AQ JMS support experiences some problems. As a result Mule also has problems, connections get dropped, etc.

  2. Administrators restart the Mule server without (fully) solving the database problems.

  3. Mule's JMS support code fails to initialize the JMS connection at startup. This stops the Mule startup sequence but not the OC4J instance or the Mule web application. Mule cannot stop itself because it does not control its lifecycle when it runs embedded.

  4. As a result Mule starts in a half-broken state, with at least some services uninitialized, but is seen by OPMN as successfully started. So much for the management infrastructure.

  5. Subsequent requests to Mule fail because even though the Mule servlet is up and running, most of the Mule services are not.


Okay, the problem is clear. The admins are notified to watch for that "Failed to connect/reconnect: jms://QUEUE_NAME" message in the logs. Except that it does not help much, because OC4J (actually OPMN) can bring OC4J instances down and up any time it finds necessary. Given that the software runs in an OC4J cluster, this can lead to situations where Mule runs on one node of the cluster and fails on another. Oops.

Now the question is: can anything be done to improve the situation?
I looked at the Mule initialization logic (class MuleXmlBuilderContextListener) and found that it does something ... interesting.

The method MuleXmlBuilderContextListener.initialize(ServletContext context) is responsible for initialization. Here it is (from Mule 2.2.1), with some minor details omitted:

public void initialize(ServletContext context)
{
    ...
    try
    {
        muleContext = createMuleContext(config, context);
        muleContext.start();
    }
    catch (MuleException ex)
    {
        context.log(ex.getMessage(), ex);
        // Logging is not configured OOTB for Tomcat, so we'd better make a
        // start-up failure plain to see.
        ex.printStackTrace();
    }
    catch (Error error)
    {
        ...
        throw error;
    }
}

Basically, almost any failure during initialization, be it a configuration file error, a connection problem, or whatever, is ignored. Sure, it is logged, at best, but that is it. RuntimeExceptions are not caught, but most Mule failures during startup result in a MuleException, which is caught. For example, errors in the configuration file lead to an exception thrown somewhere down the chain from createMuleContext(). Some connection failures happen somewhere in muleContext.start() and do not throw any exception but do stop the initialization. What were they thinking?!

Is it possible to do better than that? I ended up with the class below, but a copy-paste version of MuleXmlBuilderContextListener with some modifications would work as well:

import javax.servlet.ServletContext;
import javax.servlet.ServletContextEvent;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.mule.api.MuleContext;
import org.mule.api.MuleException;
import org.mule.api.lifecycle.InitialisationException;
import org.mule.config.ConfigurationException;
import org.mule.config.builders.MuleXmlBuilderContextListener;

public class CustomMuleXmlBuilderContextListener extends MuleXmlBuilderContextListener {

    public void contextInitialized(ServletContextEvent event) {
        initialize(event.getServletContext());

        // createMuleContext() failed: report the failure to the web container.
        if (muleContextEx != null) {
            String errMsg = "Mule context is not initialized";
            logger.error(errMsg, muleContextEx);
            throw new IllegalStateException(errMsg, muleContextEx);
        }
        // The context was created, but start() did not complete.
        if (muleContext != null && !muleContext.isStarted()) {
            String errMsg = "Mule context is initialized but some services failed to start";
            IllegalStateException ex = new IllegalStateException(errMsg);
            logger.error(errMsg, ex);
            throw ex;
        }
    }

    protected MuleContext createMuleContext(String configResource, ServletContext context)
        throws ConfigurationException, InitialisationException
    {
        try {
            // Remember the created context so contextInitialized() can check isStarted().
            return muleContext = super.createMuleContext(configResource, context);
        } catch (ConfigurationException ex) {
            muleContextEx = ex;
            throw ex;
        } catch (InitialisationException ex) {
            muleContextEx = ex;
            throw ex;
        }
    }

    private final Log logger = LogFactory.getLog(getClass());

    private MuleContext muleContext;
    private MuleException muleContextEx;
}

I had to override createMuleContext to get hold of the created context instance, because muleContext.isStarted() is the only way to find out whether Mule has actually started successfully. This new listener makes sure that:

  1. If Mule is not started but reports no exception, the listener makes sure there is one, and it is logged. (Yeah, even more noise in the log.) But this makes a failure like the one that caused this investigation, "Failed to connect/reconnect: jms://QUEUE_NAME", much more visible in the log.

  2. A failure during Mule initialization and startup is reported back to the web container. This hits another grey area of the servlet specification, which does not say what MUST or SHOULD happen in this case. There is a note, for example, in v. 2.4 of the Java Servlet Specification:
    SRV.10.6 Listener Exceptions

    Some exceptions do not occur under the call stack of another component in the application. An example of this is ... a ServletContextListener that throws an unhandled exception during a notification of servlet context initialization ... In this case, the Developer has no opportunity to handle the exception. The container may respond to all subsequent requests to the Web application with an HTTP status code 500 to indicate an application error....

    Fortunately, OC4J behaves sensibly here: the web application is marked internally as not started, and any HTTP request to it triggers another attempt to start it. So the listener gets another chance to start Mule, and if the problem has gone away Mule starts and is ready to serve requests. This of course does not help with configuration file errors, but it does its job nicely for connection failures.

Note: I know about RetryPolicy and its friends. It is the subject of one of my upcoming posts.

Tuesday, May 17, 2011

What should I think of mule?

"Every SOA project must use an ESB". I do not remember who said that to me but some people do follow this advice. The project mentioned in the previous post is not an exception. Being a nice example of the vendor driven architecture the project had used Oracle ESB for some time. Apparently that thing was so ridden with bugs the team decided to use something else instead, namely, mule community edition. By the time I have joined the project the switch was completed so I was not really involved with ESB. Until recently.

The production environment had been behaving quite flakily for quite some time. Some of the problems were traced back to the Oracle BPEL software. But there were also problems related to the Mule server. The problem descriptions were not really helpful. They went like this: "oh, we have messages remaining in the queue", or "the problems begin after one of the other services goes down; we suspect that mule holds HTTP connections open ...", or "we had to restart the mule server and the problems went away … or not ... and we had to restart it again ...".

Time to investigate.

The Mule configuration used by the project is not that complex. There are some proxy services that were added under the "URL transparency" motto. A couple of those proxy services perform some minor request/response transformations. And then there are some services that logically look like one-way WS endpoints but internally use Oracle AQ messaging and simple content-based routing to decouple the sending and receiving components and to add message recovery.

The first thing I wanted to check was that "we suspect that mule holds HTTP connections open ..." claim, and to see if it is possible to force Mule to close an HTTP connection after each request. After all, the task sounded easy.

As a seasoned developer who knows next to nothing about this particular piece of software, I google for "mule http connection keepalive". The first link, HTTP Transport - Mule 2.x User Guide - mulesoft.org, looks promising.

Wait a minute... to read the documentation I have to be registered? Rrrrright. Given that I need to investigate several possible problems it might be better to register, but that can wait. Google cache first.

I should say the documentation looks nice. Except, of course, that you need to know the whole Mule story and architecture before you can make sense of it. And it still does not explain why they need to have both 'keepAlive' and 'keep-alive' properties. Anyway, after some reading I thought my problem was solved. After all, this is what the documentation says:

Attributes of <outbound-endpoint ...>
...

keep-alive (boolean): Controls if the socket connection is kept alive. If set to true, a keep-alive header with the connection timeout specified in the connector will be returned. If set to false, a "Connection: close" header will be returned.

So let's try it out. First I verified that the current configuration does not close the HTTP connection after each request. It does not: it uses HTTP 1.1 and sends no "Connection: close" header. Hitting one of the proxy services multiple times in a row results in a single open HTTP connection.
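
(For what it's worth, the check itself needs nothing fancy. A throwaway client along the lines of the sketch below, pointed at one of the proxy services, is enough to see which headers come back; the URL is a placeholder, not the project's real endpoint.)

import java.net.HttpURLConnection;
import java.net.URL;
import java.util.List;
import java.util.Map;

// Quick-and-dirty header dump for an HTTP endpoint. The default URL is a placeholder.
public class HeaderCheck
{
    public static void main(String[] args) throws Exception
    {
        URL url = new URL(args.length > 0 ? args[0] : "http://localhost:8080/mule/proxy-service");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        System.out.println("HTTP " + conn.getResponseCode());
        // Print every response header; the status line comes back under a null key.
        for (Map.Entry<String, List<String>> header : conn.getHeaderFields().entrySet())
        {
            if (header.getKey() != null)
            {
                System.out.println(header.getKey() + ": " + header.getValue());
            }
        }
        conn.disconnect();
    }
}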

Next I added keep-alive="false", redeployed Mule and executed the same test. There is a nice "keepAlive: false" HTTP header and no "Connection" header. Who said it was going to work the first time?

I hate this. I mean, I do everything as documented and it does not work. Given that I had just started poking around Mule, I would not immediately suspect the documentation is wrong. So I check and double-check, and run more tests. And check again and make sure I am looking at the correct documentation, since the project is using Mule 2.2.1 and the latest version is 3.something. Nope, everything looks correct, it just does not work.

Reading the doc page further I come across the section Including Custom Header Properties. The same way of specifying Connection: close is also discussed, for example, here: Disable HTTP Connection keep-alive in CXF proxy.

Guess what? It does not work either. Just for kicks I added another custom property, "ILikeTheMuleDocumentation: false". This one works. Wow!

Next I tried something else that was mentioned in the same Disable HTTP Connection keep-alive in CXF proxy thread, namely adding "Connection: close" via the deprecated "http.custom.headers" property. This one finally worked. And indeed produced a deprecation warning in the logs.

By this time I had become quite curious. I mean, one must be a quite … hmm ... talented developer to achieve such an interesting mix of working and non-working features. So I looked in the sources and, after some digging, found this:

package org.mule.transport.http.transformers;

...

public class ObjectToHttpClientMethodRequest
...

    protected void setHeaders(HttpMethod httpMethod, MuleMessage msg)
    {
        // Standard requestHeaders
        String headerValue;
        String headerName;

        for (Iterator iterator = msg.getPropertyNames().iterator(); iterator.hasNext();)
        {
            headerName = (String) iterator.next();

            if (headerName.equalsIgnoreCase(HttpConnector.HTTP_CUSTOM_HEADERS_MAP_PROPERTY))
            {
                if (logger.isInfoEnabled())
                {
                    logger.warn("Deprecation warning: There is not need to set custom headers using: " + HttpConnector.HTTP_CUSTOM_HEADERS_MAP_PROPERTY
                        + " you can now add the properties directly to the outbound endpoint or use the OUTBOUND property scope on the message.");
                }

                Map customHeaders = (Map) msg.getProperty(HttpConnector.HTTP_CUSTOM_HEADERS_MAP_PROPERTY);
                if (customHeaders != null)
                {
                    for (Iterator headerItr = customHeaders.entrySet().iterator(); headerItr.hasNext();)
                    {
                        Map.Entry entry = (Map.Entry) headerItr.next();
                        if (entry.getValue() != null)
                        {
                            httpMethod.addRequestHeader(entry.getKey().toString(), entry.getValue().toString());
                        }
                    }
                }
            }
            else if (HttpConstants.REQUEST_HEADER_NAMES.get(headerName) == null
                && !HttpConnector.HTTP_INBOUND_PROPERTIES.contains(headerName))
            {
                headerValue = msg.getStringProperty(headerName, null);
                if (headerName.startsWith(MuleProperties.PROPERTY_PREFIX))
                {
                    headerName = new StringBuffer(30).append("X-").append(headerName).toString();
                }

                httpMethod.addRequestHeader(headerName, headerValue);
            }
        }

        Set attNams = msg.getAttachmentNames();
        if (msg.getPayload() instanceof InputStream
            && attNams != null && attNams.size() > 0)
        {
            // must set this for receiver to properly parse attachments
            httpMethod.addRequestHeader(HttpConstants.HEADER_CONTENT_TYPE, "multipart/related");
        }
    }
...
Brilliant, is it not? The else-if branch copies a message property into an HTTP header only when the property is not one of the standard request headers listed in HttpConstants.REQUEST_HEADER_NAMES. "Connection" is of course in that list, so it gets silently dropped, while a made-up header like "ILikeTheMuleDocumentation" sails right through. Headers supplied via "http.custom.headers", on the other hand, are added without any such check. And, by the way, I did not find any place in the outbound path where the 'keep-alive' property is checked. So much for the nice documentation. So I ended up using "http.custom.headers" to set the "Connection: close" header.
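
For completeness, here is roughly what that workaround can look like from the Java side: a sketch of a message-aware transformer that attaches the "http.custom.headers" map, assuming Mule 2.2's AbstractMessageAwareTransformer API. The class name is illustrative, not the project's actual code.

import java.util.Collections;

import org.mule.api.MuleMessage;
import org.mule.api.transformer.TransformerException;
import org.mule.transformer.AbstractMessageAwareTransformer;
import org.mule.transport.http.HttpConnector;

// Illustrative transformer: puts the deprecated "http.custom.headers" map on the
// outgoing message so that ObjectToHttpClientMethodRequest adds "Connection: close".
public class ForceConnectionCloseTransformer extends AbstractMessageAwareTransformer
{
    public Object transform(MuleMessage message, String outputEncoding) throws TransformerException
    {
        message.setProperty(HttpConnector.HTTP_CUSTOM_HEADERS_MAP_PROPERTY,
            Collections.singletonMap("Connection", "close"));
        return message;
    }
}

Referenced as a transformer on the outbound HTTP endpoint, something like this does the trick, deprecation warning and all.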

Well, by that time I was well prepared to investigate the remaining issues with our mule configuration. I was sure I would come across even more pearls.