Friday, December 19, 2014

SOA composite deployment coherence issue

While deploying a composite to a SOA cluster, the deployment was stuck for more than 20 minutes. To make it worse, cancelling and retrying the deployment corrupted the MDS, causing soa-infra to fail on restart.

The logs clearly showed a STUCK thread:

<[STUCK] ExecuteThread: '56' for queue: 'weblogic.kernel.Default (self-tuning)' has been busy for "602" seconds working on the request "Workmanager: default, Version: 0, Scheduled=true, Started=true, Started time: 602461 ms
[
POST /soa-infra/deployer HTTP/1.1
Connection: TE
TE: trailers, deflate, gzip, compress
User-Agent: Oracle HTTPClient Version 10h
Accept-Encoding: gzip, x-gzip, compress, x-compress
ECID-Context: 
Authorization: Basic amNoZW42Ol8xYW1BZG1pbg==
Content-type: application/octet-stream
Content-Length: 69483

]", which is more than the configured time (StuckThreadMaxTime) of "600" seconds. Stack trace:
Thread-701 "[STUCK] ExecuteThread: '56' for queue: 'weblogic.kernel.Default (self-tuning)'" {
    -- Waiting for notification on: java.util.HashMap@4343a522[fat lock]
    java.lang.Object.wait(Object.java:???)
    oracle.integration.platform.blocks.deploy.CoherenceCompositeDeploymentCoordinatorImpl.submitRequestAndWaitForCompletion(CoherenceCompositeDeploymentCoordinatorImpl.java:352)
    oracle.integration.platform.blocks.deploy.CoherenceCompositeDeploymentCoordinatorImpl.coordinateCompositeRedeploy(CoherenceCompositeDeploymentCoordinatorImpl.java:255)
    oracle.integration.platform.blocks.deploy.servlet.BaseDeployProcessor.overwriteExistingComposite(BaseDeployProcessor.java:487)
    oracle.integration.platform.blocks.deploy.servlet.BaseDeployProcessor.deploySARs(BaseDeployProcessor.java:298)
    ^-- Holding lock: java.lang.Object@73823526[thin lock]
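
When scanning a large server log for this symptom, a small script can pull out the key numbers. The following is a minimal sketch, assuming the message layout of the single sample above; the regular expressions and the function names (`parse_stuck_thread`, `is_coherence_deploy_wait`) are illustrative and not part of any WebLogic tooling.

```python
import re

# Patterns derived from the one sample message above (an assumption,
# not an official WebLogic log grammar).
STUCK_RE = re.compile(
    r"\[STUCK\] ExecuteThread: '(?P<thread>\d+)' for queue: '(?P<queue>[^']+)' "
    r"has been busy for \"(?P<busy>\d+)\" seconds"
)
MAX_TIME_RE = re.compile(r'\(StuckThreadMaxTime\) of "(?P<max>\d+)" seconds')

# Frame seen in the stack trace of the stuck deployment thread.
COHERENCE_WAIT_FRAME = (
    "CoherenceCompositeDeploymentCoordinatorImpl.submitRequestAndWaitForCompletion"
)


def parse_stuck_thread(log_text):
    """Return thread id, queue, busy seconds and configured limit, or None."""
    stuck = STUCK_RE.search(log_text)
    if stuck is None:
        return None
    limit = MAX_TIME_RE.search(log_text)
    return {
        "thread": int(stuck.group("thread")),
        "queue": stuck.group("queue"),
        "busy_seconds": int(stuck.group("busy")),
        "max_seconds": int(limit.group("max")) if limit else None,
    }


def is_coherence_deploy_wait(log_text):
    """True when the trace contains the Coherence deployment coordinator wait."""
    return COHERENCE_WAIT_FRAME in log_text
```

Running both functions over each stuck-thread entry separates deployment-coordination waits from unrelated stuck threads.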


The soa-infra error:

weblogic.application.ModuleException: [HTTP:101216]Servlet: "FabricInit" failed to preload on startup in Web application: "/soa-infra".
oracle.fabric.common.FabricException: Error in getting XML input stream: oramds:/deployed-composites/AccountBS_rev1.0/composite.xml: oracle.mds.exception.MDSException: MDS-00054: The file to be loaded oramds:/deployed-composites/AccountBS_rev1.0/composite.xml does not exist.


In case of the soa-infra error, this blog has steps on how to recover.

The stuck deployment thread points to Coherence-related issues. There are many useful troubleshooting documents on Oracle Support:

General Coherence Network Troubleshooting And Configuration Advice (Doc ID 1389045.1)

Coherence and SOA Suite Integration Recommendations (Doc ID 1557370.1)

Troubleshooting Tips for Coherence - Oracle Service Oriented Architecture (SOA) Suite Integration Issues (Doc ID 1388786.1)

"oracle.integration.platform.blocks.deploy.CoherenceCompositeDeploymentCoordinatorImpl.submitRequestAndWaitForCompletion" Error and Slow Response While Accessing Composites In EM Console (Doc ID 1437883.1)

SOA 11g Composite Deployment Results in Stuck Thread Error: <[STUCK] ExecuteThread - Unable to Deploy the Composites in a Cluster (Doc ID 1086654.1)

SOA 11g Health Check: Verify Consistency of Coherence wka and wka.port Configuration (Doc ID 1578203.1)

SOA 11g: How Many Nodes are Required to be Specified as Coherence WKA Members in a SOA/OSB Cluster? (Doc ID 1511706.1)

Stuck Threads during SOA Cluster Deployment (Doc ID 1564586.1)

IpMonitor Failed To Verify The Reachability Of Senior Member (Doc ID 1530288.1)
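
Several of these notes come down to checking that every managed server in the cluster uses the same Coherence well-known-address (WKA) settings. Below is a minimal sketch of such a check, assuming the servers are configured through `-Dtangosol.coherence.wka...` JVM arguments as is common in SOA 11g clusters; the helper names are hypothetical, and this is not the procedure from the support notes themselves.

```python
import re

# Matches -Dtangosol.coherence.wka, wka1, wka2, ... and their .port variants.
# Per-server settings such as tangosol.coherence.localhost are ignored on purpose.
WKA_RE = re.compile(r"-D(tangosol\.coherence\.wka\d*(?:\.port)?)=(\S+)")


def wka_settings(jvm_args):
    """Extract the WKA-related system properties from one server's JVM arguments."""
    return dict(WKA_RE.findall(jvm_args))


def wka_groups(args_by_server):
    """Group server names by their WKA settings; one group means consistent."""
    groups = {}
    for server, jvm_args in args_by_server.items():
        key = tuple(sorted(wka_settings(jvm_args).items()))
        groups.setdefault(key, []).append(server)
    return groups


def is_consistent(args_by_server):
    """True when all servers share identical WKA host and port settings."""
    return len(wka_groups(args_by_server)) <= 1
```

If `is_consistent` returns False, the groups returned by `wka_groups` show which servers disagree.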



OSB-SOA-OSB zig zag pattern

Recently we ran into performance problems with some of our services that follow an OSB-SOA-OSB pattern. Most of our services follow an OSB-SOA pattern, but where the Tuxedo transport is used we ended up with this pattern, because OSB has the Tuxedo transport and SOA Suite does not.

As we all know, OSB is stateless, while SOA is generally not stateless and needs a DB. Services following an OSB+SOA pattern also need a consistent logging solution. These custom logging solutions are generally JMS-based and asynchronous: a listener picks up the log messages and writes them to the DB.

There are multiple challenges in an OSB-SOA-OSB-SOA pattern:

1) If the OSB and SOA domains are separate, there will be network hops when calling from OSB to SOA and the other way round. We can use SOA direct (t3) based communication, but it has its own challenges.

Some of the challenges are:
a) If you are using t3, you cannot use the load balancer URL, so you have to be careful to give all the managed server node URLs (and test that load balancing is actually happening).
b) You cannot set a timeout on these calls (mostly the JTA timeout takes effect?).
c) There could be additional complications if you use OWSM policies at the endpoints.
d) Transaction behavior could also be a challenge.
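
For point a), WebLogic accepts a comma-separated list of host:port pairs in a t3 URL. A minimal sketch of building that address from a list of managed servers follows; the function name and the host names in the usage note are hypothetical, and the service path that follows the address is left out.

```python
def t3_cluster_url(servers, scheme="t3"):
    """Build a t3 address listing every managed server as host:port.

    `servers` is a sequence of (host, port) pairs. Only the address part is
    built here; any composite or service path is appended by the caller.
    """
    if not servers:
        raise ValueError("at least one managed server is required")
    parts = []
    for host, port in servers:
        port = int(port)
        if not host or not 0 < port < 65536:
            raise ValueError(f"invalid server entry: {host!r}:{port}")
        parts.append(f"{host}:{port}")
    return f"{scheme}://" + ",".join(parts)
```

For example, `t3_cluster_url([("soahost1", 8001), ("soahost2", 8001)])` returns `t3://soahost1:8001,soahost2:8001`. Generating the list from one place makes it harder to miss a node when the cluster grows.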

2) Having a consistent logging solution. Since it is better to log as near to the source as possible, we should deploy a common solution for both OSB and SOA, which means setting up JMS queues in both OSB and SOA and deploying the code in both environments.

3) The complete solution might not scale well, as the threading models for OSB and SOA are very different. When you have OSB to SOA to OSB calls, and several of them, some thread deadlock situations would not be surprising.


What was our problem?

We faced huge latency in our response times. It turned out we were using the publish activity in OSB to do our logging; however, publish is not really asynchronous if you publish to a proxy service.

This blog helped us confirm this.

We were publishing to a proxy service, which in turn published to a JMS-based business service. If the JMS configuration is not correct, the whole thread waits for a few seconds.
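
The difference between a call that only looks asynchronous and a real hand-off can be shown outside OSB. The sketch below is an analogy in plain Python, not OSB code: an in-memory queue stands in for the JMS queue, a background thread stands in for the listener, and the class name `AsyncLogger` is hypothetical.

```python
import queue
import threading


class AsyncLogger:
    """Hand log messages to a background worker so the caller never waits on the sink."""

    def __init__(self, sink):
        self._queue = queue.Queue()
        self._sink = sink  # callable that does the slow write, e.g. to a DB
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def log(self, message):
        # Only enqueues; the request thread returns immediately.
        self._queue.put(message)

    def _drain(self):
        while True:
            message = self._queue.get()
            if message is None:  # sentinel placed by close()
                return
            self._sink(message)

    def close(self):
        """Flush queued messages in order and stop the worker."""
        self._queue.put(None)
        self._worker.join()
```

If `log` called the sink directly, every slow write would be paid for on the request thread, which is the behavior we saw when publishing to a proxy service.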

The other problem we had was the tuning of OSB and SOA. There is a lot of material available on tuning; however, a few critical learnings are:

1) OSB: using work managers is an absolute must to keep a problematic area from growing out of proportion and bringing down the node. A work manager helps contain any spike a service might have, so that it does not consume resources and cause other services to starve.

How many max threads to assign, whether to assign work managers to all services or only to some, and whether to assign them only to proxy services or to both proxy and business services are again areas to tune and test.
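
The containment idea behind a max threads constraint can be illustrated with a bounded pool per service. This is a plain Python analogy under that assumption, not WebLogic's implementation; the class name `WorkManagerSketch` is hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class WorkManagerSketch:
    """Run a service's requests on at most `max_threads` threads; the rest queue up."""

    def __init__(self, max_threads):
        self._pool = ThreadPoolExecutor(max_workers=max_threads)
        self._lock = threading.Lock()
        self._active = 0
        self.peak = 0  # highest number of requests seen running at once

    def _tracked(self, task, *args):
        with self._lock:
            self._active += 1
            self.peak = max(self.peak, self._active)
        try:
            return task(*args)
        finally:
            with self._lock:
                self._active -= 1

    def submit(self, task, *args):
        # Requests beyond the limit wait in the pool's queue instead of taking more threads.
        return self._pool.submit(self._tracked, task, *args)

    def shutdown(self):
        self._pool.shutdown(wait=True)
```

Giving each service its own instance means a spike in one service can use at most its own `max_threads`, which is the effect we wanted from work managers.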


2) SOA: when audit was set to 'off' at the soa-infra level, we noticed a drop of at least 1 second in response time. SOA DB connections and the use of a GridLink data source are also important tuning parameters.

Some interesting DB queries to check SOA tablespace size are here, and time taken by SOA components here.

In summary, we managed to improve response time and throughput, but OSB-SOA-OSB patterns will always have a bit of overhead compared to an OSB-only or SOA-only option.

My recommendation would be to have OSB+SOA in the same domain/node and then leverage the SOA direct transport to co-locate as much processing as possible; that would be a much faster option. I believe that with 12c such a domain topology might become more popular.