
Sigar

JVM crashes on AIX 64-bit using 32-bit JVM (not sure if it's a SIGAR bug or a JVM bug)

Details

  • Type: Bug
  • Status: Closed
  • Priority: Critical
  • Resolution: Cannot Reproduce
  • Affects Version/s: 1.6.1
  • Fix Version/s: None
  • Component/s: None
  • Environment:
    OS: AIX 5.4 64-bit
    JDK: IBM J9 VM (build 2.3, J2RE 1.5.0 IBM J9 2.3 AIX ppc-32 j9vmap3223-20070426 (JIT enabled)
    SIGAR 1.6.0.14
  • Case Links:
    none

Description

I don't know how to analyze a javacore file, so I'm not sure if this is a SIGAR bug or a bug in the IBM JVM. However, the line:

1XHEXCPMODULE Compiling method: org/rhq/core/system/ProcessInfo.update(J)V

and the thread http://www.ibm.com/developerworks/forums/thread.jspa?messageID=14024900 seem to indicate it might be a JIT bug.

Doug, would you please take a look at the attached javacore file and let me know what you think? I can also provide the core file via FTP if that would be helpful (it's 145 MB).

Thanks!

  1. javacore.20090508.230248.295018.txt
    12/May/09 9:18 AM
    558 kB
    Ian Springer
  2. javacore.20090512.190834.557264.0003.txt
    13/May/09 6:13 AM
    557 kB
    Ian Springer
  3. javacore.20090514.071500.393424.0003.txt
    14/May/09 9:56 AM
    207 kB
    Ian Springer
  4. javacore.20090515.134047.528522.0003.txt
    15/May/09 3:00 PM
    119 kB
    Ian Springer
  5. Snap.20090512.190834.557264.0002.trc
    13/May/09 6:13 AM
    418 kB
    Ian Springer
  6. Snap.20090514.071500.393424.0002.trc
    14/May/09 9:56 AM
    418 kB
    Ian Springer
  7. Snap.20090515.134047.528522.0002.trc
    15/May/09 3:00 PM
    81 kB
    Ian Springer
  8. Snap0002.20090508.230248.295018.trc
    12/May/09 9:19 AM
    410 kB
    Ian Springer

Activity

Doug MacEachern added a comment -

Are you able to reproduce the crash using 'java -jar sigar.jar test' and/or any of the sigar.jar built-in shell commands?

Doug MacEachern added a comment -

Hi Ian,

Could it be a similar issue as this thread: http://forums.hyperic.com/jiveforums/thread.jspa?threadID=7800&tstart=0 ?
fwiw, we have users running the HQ agent on 64-bit aix w/ the 32-bit jre and haven't run into this before.

You can use the sigar shell to run the tests and/or commands a bunch of times, for example:

java -jar sigar.jar time 10 test

time n [any shell command]

javacore.txt doesn't seem too helpful; I'll try to make use of the Snap.trc files and see if that sheds any light.

Ian Springer added a comment -

I tried disabling JIT in case it was a JIT bug, but the JVM still crashed - crash files are attached.

So far, I've been unable to get the JVM running various SIGAR prompt commands and tests to crash using the same JVM on the same machine.
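
When ruling out the JIT like this, it can help to confirm from inside the JVM that the compiler really is off. The sketch below uses the standard `java.lang.management` API to report whichever JIT compiler the running JVM exposes. The class name `JitCheck` is made up for illustration, the thread does not say how JIT was disabled, and how a particular JVM reports a disabled JIT through this API is implementation-specific, so treat the output as a hint rather than proof.

```java
import java.lang.management.CompilationMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper class, not part of SIGAR or RHQ.
public class JitCheck {

    /** Describes the JIT compiler the running JVM reports, if any. */
    public static String describeJit() {
        // getCompilationMXBean() returns null when the JVM has no compilation system.
        CompilationMXBean jit = ManagementFactory.getCompilationMXBean();
        if (jit == null) {
            return "no JIT compiler reported";
        }
        return "JIT compiler: " + jit.getName();
    }

    public static void main(String[] args) {
        System.out.println(describeJit());
    }
}
```

Running this with the same JVM options as the crashing process shows what that JVM reports about its compilation system.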

Ian Springer added a comment -

I got JVMs running only SIGAR prompt commands to crash 17 separate times. All of the javacore files contain:

2XMFULLTHDDUMP Full thread dump J9 VM (J2RE 5.0 IBM J9 2.3 AIX ppc-32 build 20081126_26240_bHdSMr, native threads):
3XMTHREADINFO "main" (TID:0x30152F00, sys_thread_t:0x30010E08, state:R, native ID:0x001B6047) prio=5
4XESTACKTRACE at org/hyperic/sigar/Sigar.nativeClose(Native Method)
4XESTACKTRACE at org/hyperic/sigar/Sigar.close(Sigar.java:229)
4XESTACKTRACE at org/hyperic/sigar/cmd/Shell.shutdown(Shell.java:196)
4XESTACKTRACE at org/hyperic/sigar/cmd/Shell.main(Shell.java:230)
4XESTACKTRACE at sun/reflect/NativeMethodAccessorImpl.invoke0(Native Method)
4XESTACKTRACE at sun/reflect/NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:79)
4XESTACKTRACE at sun/reflect/DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
4XESTACKTRACE at java/lang/reflect/Method.invoke(Method.java:618)
4XESTACKTRACE at org/hyperic/sigar/cmd/Runner.main(Runner.java:214)

Hopefully that or something else in the crash files, which I've attached, will provide a clue.

By the way, I tried running the JON Agent on the latest 32-bit Java6 VM, and it still crashed after about 24 hours.

Doug MacEachern added a comment -

The shell stacktrace doesn't look related, but if you can give me the details on how to reproduce, I'll look into it.
Back to my earlier question:
"Could it be a similar issue as this thread: http://forums.hyperic.com/jiveforums/thread.jspa?threadID=7800&tstart=0 ?"

I took a quick look at how ProcessInfo is used in RHQ, looks like SystemInfoFactory is a singleton?
--> private static SystemInfo cachedSystemInfo;

Is it the case that multiple threads are concurrently accessing the underlying Sigar object(s)?
If so, that's likely the problem. If SigarProxyCache is being used underneath, there's a change in the master branch that makes calls to the Sigar object synchronized. I can merge that to the 1.6 branch if you want to give it a shot.
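
For illustration, here is a minimal sketch of the general idea: one shared object used by many threads, with every call funneled through a single lock. `FakeNativeHandle` and `SynchronizedHandle` are hypothetical stand-ins, not SIGAR or RHQ classes, and this is not the actual change made on the master branch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; none of these types exist in SIGAR or RHQ.
public class SharedHandleDemo {

    /** Stand-in for a native-backed object that is not thread-safe. */
    static class FakeNativeHandle {
        private long calls; // plain field: unsynchronized increments can be lost

        long update() {
            calls = calls + 1; // read-modify-write, not atomic
            return calls;
        }

        long calls() {
            return calls;
        }
    }

    /** Wrapper that lets only one thread at a time touch the shared handle. */
    static class SynchronizedHandle {
        private final FakeNativeHandle delegate = new FakeNativeHandle();

        synchronized long update() {
            return delegate.update();
        }

        synchronized long calls() {
            return delegate.calls();
        }
    }

    /** Calls update() from several threads and returns the final call count. */
    public static long runConcurrently(int threads, int callsPerThread) {
        final SynchronizedHandle shared = new SynchronizedHandle();
        List<Thread> workers = new ArrayList<>();
        for (int i = 0; i < threads; i++) {
            Thread t = new Thread(() -> {
                for (int j = 0; j < callsPerThread; j++) {
                    shared.update();
                }
            });
            workers.add(t);
            t.start();
        }
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for workers", e);
            }
        }
        return shared.calls();
    }

    public static void main(String[] args) {
        System.out.println("total calls: " + runConcurrently(8, 10_000));
    }
}
```

Because every call goes through the wrapper's lock, no increments are lost and the total equals threads times calls per thread. With a real native library the stakes are higher than a wrong counter: unsynchronized access to shared native state can corrupt memory, which is one way a JVM ends up crashing.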

Ian Springer added a comment -

>The shell stacktrace doesn't look related, but if you can give me the details on how to reproduce, I'll look into it.

Yeah, I didn't think so either. I noticed that the filesystem where the JVM was running from was at 100% usage; is it possible that somehow caused the JVM to crash on the call to Sigar.nativeClose()?

>Back to my earlier question:
>"Could it be a similar issue as this thread: http://forums.hyperic.com/jiveforums/thread.jspa?threadID=7800&tstart=0 ?"
>
>I took a quick look at how ProcessInfo is used in RHQ, looks like SystemInfoFactory is a singleton?
>--> private static SystemInfo cachedSystemInfo;
>
>Is it the case that multiple threads are concurrently accessing the underlying Sigar object(s)?

I think that is the case. But why have we not seen these crashes on any platforms other than AIX?

>If so, that's likely the problem. If SigarProxyCache is being used underneath, there's a change in the master branch that makes calls to the Sigar object synchronized. I can merge that to the 1.6 branch if you want to give it a shot.

Sure, that would be great.

Thanks,
Ian

Doug MacEachern added a comment -

Not sure why you haven't seen it on other platforms; it's possible that this isn't even the issue, but it's worth a shot. I merged the change and triggered the Hudson build; you can grab the updated sigar.jar here:
http://hudson.hyperic.com/job/sigar-1.6-amd64-linux-2.6/

The AIX binaries are also available if you want to grab those too:
http://hudson.hyperic.com/job/sigar-1.6-ppc-aix-5.2/
http://hudson.hyperic.com/job/sigar-1.6-ppc64-aix-5.2/

Let me know how that works out and we'll go from there.

Doug MacEachern added a comment -

Sounds like Sigar object access needs to be synchronized; I created SIGAR-150 to make this easier.

Ian Springer added a comment -

Doug,

I tried using SIGAR 1.6.3 with the 64-bit AIX JDK6, and the JON Agent has been happily running for 3 weeks or so, so I'm not so sure it's the lack of synchronization that is causing the crashes. Whatever is causing the crashes seems to be specific to the 32-bit AIX SIGAR native library (though it's also possible it's some other bug in SIGAR 1.6.0 that has been fixed in SIGAR 1.6.3).

-Ian

Doug MacEachern added a comment -

Hi Ian,

Good to hear things are stable now. Did you also update the sigar.jar? If so, that contains the change to synchronize access to the Sigar object underneath SigarProxyCache.
Haven't seen this issue on 32-bit AIX running the HQ agent.

Ian Springer added a comment -

Yep, I did update both the native libs and the jar to v1.6.3. So maybe it is indeed an issue with lack of synchronization.

Do you plan on releasing 1.6.3 soon?

Thanks,
Ian

