I encountered this error last week, and it was quite a headache to track down the source of the problem.
Short summary:
I thought the BEA-010213 error and the rollback log entries meant that we had a database problem. That was a wrong assumption: there is another type of storage in a WebLogic domain that uses transactions (or state transitions?), namely the JMS queues, which live in persistent stores on the managed servers.
That was the source of our problem: our persistent stores had become corrupted because of storage problems.
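If you want to see where those stores actually live in your own domain, a quick look through WLST shows every file store, its directory and the servers it is targeted to. This is a minimal sketch, assuming you connect to the admin server; the credentials, URL and names are placeholders.

```python
# WLST (Jython) sketch -- placeholder credentials/URL, adjust to your own domain.
# Lists every persistent file store with its directory and target servers.
connect('weblogic', 'welcome1', 't3://adminhost.mycompany.local:7001')

cd('/')                                     # domain configuration root
for store in cmo.getFileStores():
    targets = ', '.join([t.getName() for t in store.getTargets()])
    # getDirectory() can be empty when the store uses the default location
    print('%s -> %s (targeted to: %s)' % (store.getName(), store.getDirectory(), targets))

disconnect()
```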
The situation:
We keep our persistent stores on an NFS share. In hindsight perhaps a bad decision, but we chose this so we could enlarge or shrink storage when needed without impacting our running systems.
There were latency problems with these mounts; the ops team detected them and relayed this to us (at that point we didn't connect the dots yet).
What we didn't deduce at the time was that the persistent stores are always open while WebLogic is running. So when we were having timeout and latency problems with these mounts, we couldn't be sure whether messages were being written, only partially written, or in the middle of a state transition (e.g. from visible to pending to complete). This probably was the case.
The problem:
I assume our persistent stores became corrupted. This was caused by the timeouts on the NFS shares, which, simply put, temporarily stopped the persistent stores' ability to store messages persistently (that's a bad thing, see what happened there?).
My assumption is that the BEA-010213 entry tells you that a message has hit a corrupted part of the persistent store, and that the transaction will roll back and try again.
Some symptoms to recognize this problem:
- Massive amounts of BEA-010213 errors in your server logs (see the counting sketch right after this list)
- The JMS messages running through the impacted stores varied wildly in processing time:
- Some messages were processed instantly
- Some messages took more than six hours to be processed
- No messages were lost (big relief!!)
- The problem got worse over time, probably because the NFS timeouts kept occurring
- Rebooting doesn’t help (gotta try, right?)
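A quick way to spot the first symptom, and to see whether the problem really is getting worse, is to count the BEA-010213 entries per hour in the server log. A small plain-Python sketch, assuming the usual WebLogic log timestamp format; the log path is hypothetical.

```python
# Plain Python sketch -- the log path is hypothetical, adjust it to your own domain.
# Counts BEA-010213 entries per hour so you can see how the problem trends over time.
import re

LOG = '/path/to/servers/mydomain_prd_server_3/logs/mydomain_prd_server_3.log'
# Timestamps look like: <May 10, 2016 8:29:11 AM CEST>
STAMP = re.compile(r'<(\w+ \d+, \d+ \d+):\d+:\d+ ([AP]M)')

counts = {}
order = []                                   # keep the hours in log (chronological) order
with open(LOG) as log:
    for line in log:
        if 'BEA-010213' not in line:
            continue
        match = STAMP.search(line)
        if match:
            hour = '%s %s' % (match.group(1), match.group(2))
            if hour not in counts:
                counts[hour] = 0
                order.append(hour)
            counts[hour] += 1

for hour in order:
    print('%s -> %d rolled-back transactions' % (hour, counts[hour]))
```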
The solution:
Long story short: You’ll need to re-create the filestores.
Problem with this: it's easier said than done, because the messages were stored persistently for a reason; our business partner would not have been amused (to put it mildly) if we had lost messages. So we needed to drain the last messages out of our corrupt filestores before recreating them.
I followed these steps to make sure we did not lose any messages:
- Pause production for all the JMS servers that are on one managed server (see the first WLST sketch after this list).
- Pausing production makes sure that no more messages are written to the filestore; new messages will be written to the queues on one of the other managed servers (please make sure you're using distributed queues..!)
- Wait until all messages are out of the store, and be wary of “pending” messages. Most important: be very, very patient, young grasshopper. You can check the counts on the queues themselves or on the JMS servers.
- Change the path of the filestore (don't choose an NFS mount as the location if you don't need to); see the second WLST sketch after this list.
- Note: the new location will force a recreation of the store on the next reboot of the WebLogic server
- Reboot the WebLogic server
- Check that the new filestore has been created in the new location
- Make sure production is resumed on each JMS server you paused in step 1.
- Check the WebLogic server logs for new BEA-010213 errors!! (Your storage itself might be corrupt!)
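To take some of the clicking out of the pause-and-drain part of these steps, the WLST sketch below pauses production on every JMS server hosted by one managed server and then polls the current and pending message counts until the stores are empty. Server names, credentials and the URL are placeholders; the pauseProduction/resumeProduction operations and the message count attributes are the standard JMSServerRuntime MBean ones, but do verify them against your own WebLogic version before relying on this.

```python
# WLST (Jython) sketch -- placeholder names and credentials, run against the admin server.
# Pauses production on every JMS server hosted by one managed server, then polls the
# current/pending message counts until the underlying stores are drained.
import time

connect('weblogic', 'welcome1', 't3://adminhost.mycompany.local:7001')
domainRuntime()

managed = 'mydomain_prd_server_3'            # the managed server whose file store is corrupt
cd('/ServerRuntimes/%s/JMSRuntime/%s.jms' % (managed, managed))
jms_servers = cmo.getJMSServers()

for js in jms_servers:
    js.pauseProduction()                     # no new messages are accepted on this JMS server
    print('paused production on %s' % js.getName())

drained = False
while not drained:                           # be very, very patient here
    drained = True
    for js in jms_servers:
        current = js.getMessagesCurrentCount()
        pending = js.getMessagesPendingCount()
        print('%s: current=%d pending=%d' % (js.getName(), current, pending))
        if current + pending > 0:
            drained = False
    if not drained:
        time.sleep(60)

print('all JMS servers on %s are drained' % managed)
disconnect()
```

After the store has been recreated and the server rebooted, the mirror call js.resumeProduction() on the same runtime MBeans takes care of resuming production.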
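And for changing the path of the filestore, the same thing through the WLST edit tree. The store name below is the one from the error entry further down and the new directory is just an example on local disk; as noted above, the new store file is only created once the targeted managed server has been rebooted.

```python
# WLST (Jython) sketch -- store name and new directory are examples, adjust to your domain.
# Points the file store at a new (preferably local, non-NFS) directory.
connect('weblogic', 'welcome1', 't3://adminhost.mycompany.local:7001')

edit()
startEdit()
cd('/FileStores/mydomain-interface_FS_osb_prd_server_2')  # store name as it shows up in the BEA-010213 entry
cmo.setDirectory('/u01/local/wlstores/interface_FS')      # new location on local disk
save()
activate()

disconnect()
# The new store file only appears in this directory after the managed server is rebooted.
```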
Note: if you don't know the techniques in steps 1 to 7, stop right now. Make sure you understand what you're doing and how you can check each step for success or failure. If you're in doubt, read this, this and this for more information about the Java Message Service (JMS) and how it's implemented in Java as well as in WebLogic.
I hope this page helps you out; I couldn't find any resources on the interwebs for this problem (my current copy of the interwebs is d.d. 2016_05_12).
Extra:
Just to be complete, below is the full error as it could be found in the WebLogic server logs (I replaced our server and company info):
<May 10, 2016 8:29:11 AM CEST> <Info> <EJB> <server.mycompany.local> <mydomain_prd_server_3> <[STANDBY] ExecuteThread: '22' for queue: 'weblogic.kernel.Default (self-tuning)'> <<anonymous>> <> <f74bef2a1d3d3074:-41f56401:1547b2c357a:-7ffd-0000000000f0430a> <1462861751328> <BEA-010213> <Message-Driven EJB: MyProcessMDB's transaction was rolled back. The transaction details are: Name=[EJB com.company.someprocess.process.MyProcessMDB.onMessage(javax.jms.Message)],Xid=BEA1-2EB589EEBD18103F3734(2135458311),Status=Rolled back. [Reason=weblogic.transaction.internal.AppSetRollbackOnlyException: setRollbackOnly called on transaction],numRepliesOwedMe=0,numRepliesOwedOthers=0,seconds since begin=70,seconds left=60,XAServerResourceInfo[WLStore_mydomain-interface_FS_osb_prd_server_2]=(ServerResourceInfo[WLStore_mydomain-interface_FS_osb_prd_server_2]=(state=new,assigned=none),xar=null,re-Registered = false),SCInfo[mydomain_prd+mydomain_prd_server_3]=(state=rolledback),SCInfo[osb_prd+osb_prd_server_1]=(state=rolledback),SCInfo[osb_prd+osb_prd_server_2]=(state=rolledback),properties=({weblogic.transaction.name=[EJB com.company.someprocess.process.MyProcessMDB.onMessage(javax.jms.Message)]}),OwnerTransactionManager=ServerTM[ServerCoordinatorDescriptor=(CoordinatorURL=mydomain_prd_server_3+managed_server_3.mycompany.local:7001+mydomain_prd+t3+, XAResources={WSATGatewayRM_mydomain_prd_server_3_mydomain_prd, mydomain_prd, WLStore_mydomain_prd_process-ear_FS_mydomain_prd_server_3, DATASOURCE_mydomain_prd, SCHEMA_mydomain_prd},NonXAResources={})],CoordinatorURL=osb_prd_server_1+MANAGED_SERVER_3.mycompany.local:7001+osb_prd+t3+).>