Tuesday, March 21, 2006

Wasting time on debugging memory errors again

I'm again losing days of work debugging an OutOfMemoryError in a production system. The tricky part is that the code implements a very thin wrapper over a database, bulk processes messages, and is totally stateless. The software stack is the JVM, then the JDBC driver, then Hibernate, then Spring, then my code. There's no memory leak, I could confirm this much with a profiler - whatever was causing the trouble was allocating temp objects held by references on the stack, so when the OutOfMemoryError unwound the stack, the smoking gun was gone...

Finally, I turned to JDK 6.0. It's in beta at the moment, but it has a very useful feature: a command line switch "-XX:+HeapDumpOnOutOfMemoryError" that'll cause a full heap dump (in HPROF heap dump format) whenever an OutOfMemoryError is thrown. After having the ops guys install JDK 6.0 on the machine, I restarted the software under it with the abovementioned switch, sat back, and waited for a memory error with a grin. And waited. And waited some more. In the end, I waited for more than two hours while the system was running under full load. Nothing.
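When nothing happens for hours, it helps to rule out the boring explanation that the switch never reached the JVM. The following is only a minimal sketch, assuming a HotSpot-based JVM that exposes com.sun.management.HotSpotDiagnosticMXBean, and Java 7 or later for the getPlatformMXBean lookup. The class name HeapDumpFlagCheck is made up for illustration.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

public class HeapDumpFlagCheck {

    /** Returns the current value ("true" or "false") of the HeapDumpOnOutOfMemoryError option. */
    public static String heapDumpOnOomValue() {
        // HotSpot-specific diagnostic bean; an assumption, other JVMs may not provide it.
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        return bean.getVMOption("HeapDumpOnOutOfMemoryError").getValue();
    }

    public static void main(String[] args) {
        System.out.println("HeapDumpOnOutOfMemoryError = " + heapDumpOnOomValue());
    }
}
```

Run inside the same JVM as the application (or expose it over JMX), this reports whether the heap dump switch is actually in effect for that process.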

To my fullest and utter surprise, the memory error doesn't manifest itself when running under JDK 6.0, even after a few hours of fully stressed operation. Damn. Isn't it typical? Maybe we have again hit a JDK-specific memory bug that got fixed in this later JDK? Unfortunately, I really cannot seriously propose to my colleagues that we run our production systems on a beta JDK...

Anyway, "-XX:+HeapDumpOnOutOfMemoryError" sounds like something that should have been part of Java long, long ago. Big enterprise systems run into memory problems. That's a fact. There are few tasks as frustrating as trying to isolate them, as the problem inherently manifests itself nonlocally. Having the JVM dump a heap snapshot at that point is invaluable. Not having this feature has caused me one sleepless night too many by now. I heard YourKit will have (or already has?) the ability to analyse HPROF snapshots, which would be really dandy for digging through the results. Failing that, I can still use the HAT profiler; hopefully they have incorporated my patches to it over the past year :-)
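For completeness, the same kind of HPROF snapshot can also be requested from code rather than waiting for the error to fire. This is a hedged sketch, not what the production system above does: it assumes a HotSpot JVM exposing com.sun.management.HotSpotDiagnosticMXBean and Java 7 or later, and the class name ManualHeapDump and the temp-file naming are invented for the example.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.File;
import java.io.IOException;
import java.lang.management.ManagementFactory;

public class ManualHeapDump {

    /**
     * Writes an HPROF dump of the live (reachable) objects to the given file.
     * The file must not exist yet; newer JDKs also expect a ".hprof" suffix.
     */
    public static void dump(File target) throws IOException {
        // HotSpot-specific diagnostic bean; an assumption, other JVMs may not provide it.
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        bean.dumpHeap(target.getAbsolutePath(), true);
    }

    public static void main(String[] args) throws IOException {
        // Unique name so an earlier dump never blocks a new one.
        File target = new File(System.getProperty("java.io.tmpdir"),
                "manual-" + System.nanoTime() + ".hprof");
        dump(target);
        System.out.println("Wrote " + target + " (" + target.length() + " bytes)");
    }
}
```

Called from the frame that still holds the suspicious references (for example just before a bulk operation returns), a dump like this captures the temporary objects before the stack unwinds and they become garbage.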

3 comments:

Anonymous said...

1) Heap dump on OOM will be backported to 1.5, 1.4 and, IIRC, 1.3!

2) HAT support will come in YK 6

3) Can't you take a number of snapshots on your prod. env, say every 30 minutes, then compare? Unless the bug is very localized in time, you should get some hints.

4) Will you be posting anything on the old blog, or should I unsubscribe?

Attila Szegedi said...

1) Great stuff when it eventually becomes reality, but I have a problematic OOME just today.

2) Also great stuff - can't wait. I really grew fond of it.

3) I did that, every 30 seconds, and I couldn't catch the bug. I even attached Eclipse to the running process on the remote machine via an SSH tunnel from my desktop, setting an exception breakpoint for OOME. It was quite tedious though to crawl the stacks and look for something unusual once the breakpoint was hit. The problem with having the JVM suspended at a breakpoint is that you can't selectively unpause a few threads (namely the YK agent) and take a profiler memory snapshot while the rest of the threads are standing by, one of them harboring a memory hog. That's why the new JDK 6.0 heap dump switch is so useful.

4) I don't plan on posting on the old blog anymore.

Anonymous said...

Hi,

You might want to check out our performance insight articles here:
http://www.jinspired.com/products/jxinsight/insights.html

The following article might be useful in understanding the difference between a memory leak and a resource capacity issue.
http://www.jinspired.com/products/jxinsight/outofmemoryexceptions.html

Kind regards,

William Louth
JXInsight Product Architect
JInspired

"J*EE tuning, testing, tracing and monitoring with Insight"
http://www.jinspired.com