Tuesday 19 April 2011

ORACLE BPEL 11g, Composite Deployment Propagation Issue Due To Coherence.

If you are facing BPEL (SOA) composite deployment propagation issues in an Oracle SOA 11g cluster, there is a very good chance that they are caused by the cluster communication layer.

In 11g the cluster communication is maintained by another Oracle product called Coherence.

When you look at the SOA diagnostic logs (and/or the Coherence log, if you have redirected Coherence logging to a separate file), you might see entries such as:


Log from SOA1Cohearance.log

2011-03-30 11:08:27.270 Oracle Coherence GE 3.3.2/391 <Warning> (thread=PacketPublisher, member=1): Experienced a 22869 ms communication delay (probable remote GC) with Member(Id=2, Timestamp=2011-03-30 10:17:51.636, Address=IPAddress:Port, MachineId=<MachineID>, Location=process:<Process@Domain>); 123 packets rescheduled, PauseRate=0.0075, Threshold=680
2011-03-30 11:31:00.643 Oracle Coherence GE 3.3.2/391 <Warning> (thread=Cluster, member=1): This senior Member(Id=1, Timestamp=2011-03-29 16:13:25.607, Address=IPAddress:Port, MachineId=<Machine ID>, Location=process:<Domain@DomanName>) appears to have been disconnected from another senior Member(Id=2, Timestamp=2011-03-30 10:17:51.636, Address=IPAddress:Port, MachineId=<MachineID>, Location=process:<Process@Domain>), which is the only member of its cluster, but did not respond to any of the termination requests; manual intervention may be necessary to stop that process.
2011-03-30 11:31:00.695 Oracle Coherence GE 3.3.2/391 <D5> (thread=Cluster, member=1): TcpRing: disconnected from member 2 due to a disconnect request
2011-03-30 11:31:01.652 Oracle Coherence GE 3.3.2/391 <D5> (thread=Cluster, member=1): TcpRing: connecting to member 2 using TcpSocket{State=STATE_OPEN, Socket=Socket[addr=/<IPAddress>,port=<Port>,localport=<Port>]}
2011-03-30 11:31:01.949 Oracle Coherence GE 3.3.2/391 <D6> (thread=PacketPublisher, member=1): Member(Id=2, Timestamp=2011-03-30 10:17:51.636, Address=IPAddress:Port, MachineId=<MachineID>, Location=process:<Process@Domain>) has failed to respond to 17 packets; declaring this member as paused.
2011-03-30 11:31:03.672 Oracle Coherence GE 3.3.2/391 <D5> (thread=Cluster, member=1): TcpRing: connecting to member 2 using TcpSocket{State=STATE_OPEN, Socket=Socket[addr=/<IPAddress>,port=<Port>,localport=<Port>]}
2011-03-30 11:31:05.694 Oracle Coherence GE 3.3.2/391 <D5> (thread=Cluster, member=1): TcpRing: connecting to member 2 using TcpSocket{State=STATE_OPEN, Socket=Socket[addr=/<IPAddress>,port=<Port>,localport=<Port>]}



Log from SOA2Cohearance.log

2011-03-30 11:30:53.820 Oracle Coherence GE 3.3.2/391 <Warning> (thread=PacketPublisher, member=2): Timeout while delivering a packet; removing Member(Id=1, Timestamp=2011-03-29 16:13:25.607, Address=IPAddress:Port, MachineId=<Machine ID>, Location=process:<Domain@DomanName>)
2011-03-30 11:30:53.820 Oracle Coherence GE 3.3.2/391 <D5> (thread=PacketPublisher, member=2): Member 1 left service soa_domain_SOA_ClusterCacheService with senior member 2
2011-03-30 11:30:53.821 Oracle Coherence GE 3.3.2/391 <D5> (thread=PacketPublisher, member=2): Member 1 left Cluster with senior member 2
2011-03-30 11:30:53.821 Oracle Coherence GE 3.3.2/391 <D5> (thread=ReplicatedCache:soa_domain_SOA_ClusterCacheService, member=2): Service soa_domain_SOA_ClusterCacheService: sending ServiceConfigSync to all
2011-03-30 11:30:54.181 Oracle Coherence GE 3.3.2/391 <D5> (thread=Cluster, member=2): TcpRing: disconnected from member 1 due to the peer departure
2011-03-30 11:31:00.683 Oracle Coherence GE 3.3.2/391 <Warning> (thread=Cluster, member=2): The member formerly known as Member(Id=1, Timestamp=2011-03-30 11:30:53.82, Address=IPAddress:Port, MachineId=<Machine ID>, Location=process:<Process@Domain>) has been forcefully evicted from the cluster, but continues to emit a cluster heartbeat; henceforth, the member will be shunned and its messages will be ignored.
2011-03-30 11:31:01.670 Oracle Coherence GE 3.3.2/391 <D4> (thread=TcpRingListener, member=2): Rejecting connection to member 1 using TcpSocket{State=STATE_OPEN, Socket=Socket[addr=/<IPAddress>,port=<Port>,localport=<Port>]}
2011-03-30 11:31:03.685 Oracle Coherence GE 3.3.2/391 <D4> (thread=TcpRingListener, member=2): Rejecting connection to member 1 using TcpSocket{State=STATE_OPEN, Socket=Socket[addr=/<IPAddress>,port=<Port>,localport=<Port>]}
2011-03-30 11:31:05.704 Oracle Coherence GE 3.3.2/391 <D4> (thread=TcpRingListener, member=2): Rejecting connection to member 1 using TcpSocket{State=STATE_OPEN, Socket=Socket[addr=/<IPAddress>,port=<Port>,localport=<Port>]}

When you see log messages such as the above, all they say is that Coherence is not able to communicate with the other member of the cluster. In other words, the other node, which keeps disconnecting, is not responding to Coherence connection requests.
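
To get a quick overview of how often and how badly a node stalls, the "communication delay" warnings can be pulled out of the Coherence log with a small script. The following is only a sketch in Python: the function name `find_delays`, the `min_ms` cut-off and the regular expression are my own assumptions, based solely on the message format shown in the excerpt above, so adjust them if your Coherence version words the warning differently.

```python
import re

# Pattern for the PacketPublisher warning seen in the excerpt above.
# Assumption: the line starts with a timestamp and names the remote
# peer as "Member(Id=N, ...". Other Coherence versions may differ.
DELAY_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+).*?"
    r"Experienced a (?P<ms>\d+) ms communication delay.*?"
    r"Member\(Id=(?P<peer>\d+)"
)


def find_delays(lines, min_ms=1000):
    """Return (timestamp, delay_ms, peer_member_id) for every
    communication-delay warning of at least min_ms milliseconds."""
    hits = []
    for line in lines:
        match = DELAY_RE.search(line)
        if match and int(match.group("ms")) >= min_ms:
            hits.append(
                (match.group("ts"), int(match.group("ms")), int(match.group("peer")))
            )
    return hits


# Two entries taken from the SOA1 log excerpt above.
sample = [
    "2011-03-30 11:08:27.270 Oracle Coherence GE 3.3.2/391 <Warning> "
    "(thread=PacketPublisher, member=1): Experienced a 22869 ms communication "
    "delay (probable remote GC) with Member(Id=2, Timestamp=2011-03-30 "
    "10:17:51.636, Address=IPAddress:Port, MachineId=<MachineID>, "
    "Location=process:<Process@Domain>); 123 packets rescheduled, "
    "PauseRate=0.0075, Threshold=680",
    "2011-03-30 11:31:00.695 Oracle Coherence GE 3.3.2/391 <D5> "
    "(thread=Cluster, member=1): TcpRing: disconnected from member 2 due to "
    "a disconnect request",
]

for ts, ms, peer in find_delays(sample):
    print(f"{ts}: {ms} ms delay talking to member {peer}")
```

Because `find_delays` accepts any iterable of lines, an open log file can be passed in directly instead of the `sample` list. A long run of multi-second delays that always point at the same peer suggests that this peer, not the network, is the one pausing.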

The following was the solution in my/our case: 

After various tests and some research, we found that the second node's container was spending a long time in garbage collection (GC), which matches the "probable remote GC" hint in the first warning above, and that the second node's Java start-up parameters were not set up to standard.

As a result, the Java start-up parameters were tuned as follows:

-XX:+UseParallelGC -XX:+UseAdaptiveSizePolicy -XX:ParallelGCThreads=4 -XX:+UseGetTimeOfDay -Xms4096m -Xmx4096m -XX:PermSize=512m -XX:MaxPermSize=756m -XX:NewSize=512m -XX:MaxNewSize=512m -XX:MaxTenuringThreshold=0 -XX:SurvivorRatio=10

and added -Dtangosol.coherence.mode=prod to the EXTRA_JAVA_OPTIONS variable.
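
Since the root cause in our case was one node running with different start-up parameters than the other, it can help to compare the option strings of the managed servers side by side. Below is a minimal sketch in Python, not an Oracle tool: the helper names `parse_jvm_options` and `diff_options` are made up for illustration, and the `node2` string is a hypothetical untuned example, not the actual settings we found on our second node.

```python
MISSING = "<not set>"


def parse_jvm_options(option_string):
    """Map each JVM option to its value.
    -XX:+Flag / -XX:-Flag become True / False, -Xms/-Xmx style options
    keep their size suffix, and anything else is stored with value None."""
    opts = {}
    for token in option_string.split():
        if token.startswith("-XX:+"):
            opts["-XX:" + token[5:]] = True
        elif token.startswith("-XX:-"):
            opts["-XX:" + token[5:]] = False
        elif token.startswith(("-XX:", "-D")) and "=" in token:
            name, value = token.split("=", 1)
            opts[name] = value
        elif token[:4] in ("-Xms", "-Xmx", "-Xmn", "-Xss"):
            opts[token[:4]] = token[4:]
        else:
            opts[token] = None
    return opts


def diff_options(node_a, node_b):
    """Return {option: (value_on_a, value_on_b)} for every option that differs."""
    a, b = parse_jvm_options(node_a), parse_jvm_options(node_b)
    return {
        key: (a.get(key, MISSING), b.get(key, MISSING))
        for key in sorted(set(a) | set(b))
        if a.get(key, MISSING) != b.get(key, MISSING)
    }


# node1: the tuned settings from this post.
node1 = (
    "-XX:+UseParallelGC -XX:+UseAdaptiveSizePolicy -XX:ParallelGCThreads=4 "
    "-XX:+UseGetTimeOfDay -Xms4096m -Xmx4096m -XX:PermSize=512m "
    "-XX:MaxPermSize=756m -XX:NewSize=512m -XX:MaxNewSize=512m "
    "-XX:MaxTenuringThreshold=0 -XX:SurvivorRatio=10 "
    "-Dtangosol.coherence.mode=prod"
)
# node2: HYPOTHETICAL untuned node, for illustration only.
node2 = "-Xms1024m -Xmx4096m -XX:PermSize=512m -XX:MaxPermSize=756m"

for option, (on_node1, on_node2) in diff_options(node1, node2).items():
    print(f"{option}: node1={on_node1} node2={on_node2}")
```

Options that are identical on both nodes are left out of the result, so an empty result means the two strings are equivalent regardless of the order of the flags.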

Note: Don't forget to restart your WebLogic managed servers.

With the above settings, the issue with BPEL composite deployment propagation disappeared.
