Performance Engineering Series – Tuning for Performance - Case Study - 1
Introduction
This series of articles introduces the reader to Performance Engineering, specifically how to deal with
real issues of a product in a production deployment. This is a widely discussed area, and most of
the people I have interacted with look forward to getting into this role,
dealing with real issues and making their contribution.
This work requires knowledge and understanding across a
multitude of aspects, such as –
- The product design
- Usage profile of the product
- CPU, memory, and disk configuration of the platform
- Software configuration – Memory settings
- Connected software components and their impact
The reader is expected to look at the problem holistically
to make the best use of the resources being deployed and then tune the
application to obtain the required performance.
As this spans a multitude of areas, extensive knowledge of
all of them is necessary, and that comes mostly with experience. So, it's always advisable for readers to
proactively examine all the available artifacts, draw their
own conclusions, and later compare those conclusions with the actual solution to
better their understanding.
Finally, this is a continuous process, and each case provides
a unique insight into 'what else' can be done to make the application better
and to improve customer satisfaction.
With this introduction, let us get into our first case,
which is the most recent one I have worked on.
Case Study of JVM Tuning - Introduction
This particular problem was presented to me with the
following symptoms –
- Frequent unavailability of the application (crash issues)
- Significantly degraded performance in repeated spells
Background
To give a brief background, this particular application is
responsible for indexing and search across all the data that lives in the
system. The design is based on
Lucene and is quite dated, but it has scaled well over the many years of its
existence.
The application runs on a JVM with JDK 8 on an Oracle Linux
VM in the cloud, with 48 vCPUs and 320GB RAM.
The application heap stood at 63GB, with the following GC
parameters in place at the time the problem was reported -
-XX:ConcGCThreads=18 -XX:G1HeapRegionSize=4194304
-XX:G1ReservePercent=20 -XX:InitialHeapSize=67645734912 -XX:InitiatingHeapOccupancyPercent=30
-XX:MaxGCPauseMillis=500 -XX:MaxHeapSize=67645734912
-XX:+ParallelRefProcEnabled -XX:+UseG1GC
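To put these flags in context, a full launch line would look roughly as follows. This is only a sketch: the -Xms/-Xmx forms are equivalent to the byte values above, and the jar name and GC log path are illustrative rather than the actual deployment values.
java -Xms63g -Xmx63g -XX:+UseG1GC -XX:MaxGCPauseMillis=500 \
     -XX:ConcGCThreads=18 -XX:G1HeapRegionSize=4194304 \
     -XX:G1ReservePercent=20 -XX:InitiatingHeapOccupancyPercent=30 \
     -XX:+ParallelRefProcEnabled -Xloggc:/path/to/gc.log \
     -jar indexing-search-app.jar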
From the above GC parameters, it can be seen that some
tuning had been performed earlier, which worked well until the crash problem and
degraded performance started to show. These
were the major issues and inferences after looking at the GC logs –
- Heap not being fully utilized; it does not grow because of the parameters involved.
- Despite too many GC cycles, not much is reclaimed.
- Mixed GCs
- Long-lived objects
- Humongous objects (allocation request > 50% of the current region size)
- Region size – currently 4MB. The larger the region size, the higher the wastage; the smaller the size, the more GC time is spent clearing regions.
- GC activity comparatively high
- CPU activity around the 50% mark for the entire duration of application activity.
These are some of the GC graphs prior to the attempted
tuning (more below). All these graphs are courtesy of the website https://gceasy.io.
All credits to them!
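As an aside, on JDK 8 the level of GC log detail needed for this kind of analysis (including the G1Ergonomics decisions quoted later in this article) is typically enabled with flags along these lines. The log path and rotation settings here are illustrative, not the values from this deployment:
-Xloggc:/path/to/gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
-XX:+PrintGCApplicationStoppedTime -XX:+PrintAdaptiveSizePolicy \
-XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=50M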
Tuning process and outcomes
As a first step, the JDK version was checked, since the
organization follows a process that requires the JDK version to be
consistent across deployments. However, it was found that
a bundled patch update had also upgraded the JDK to a different minor version in the
1.8.xxx line.
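A quick way to confirm the exact build a running instance has picked up is standard JDK tooling; the process id below is just a placeholder:
java -version
jcmd <pid> VM.version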
Crash problem analysis
The crash investigation led to some previously reported
issues involving calls to a JNI library, specifically network calls made at a time
when the JVM was busy with GC activity and all threads were blocked until the
activity completed. The JVM
eventually crashed with errors like these –
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f83cfd3a9d5, pid=2704, tid=0x00007f83ce50e700
#
# JRE version: Java(TM) SE Runtime Environment (8.0_271-b25) (build 1.8.0_271-b25)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.271-b25 mixed mode linux-amd64 )
# Problematic frame:
# V  [libjvm.so+0x5b39d5]  G1ParScanThreadState::copy_to_survivor_space(InCSetState, oopDesc*, markOopDesc*)+0x35
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /tmp/hs_err_pid2704.log
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
#
A search on the internet showed existing bugs reported for the above errors, which were linked to JNI calls –
- https://jira.atlassian.com/browse/JRACLOUD-64255
- https://bugs.openjdk.org/browse/JDK-8143310
- https://bugs.openjdk.org/browse/JDK-8035329
- https://bugs.openjdk.org/browse/JDK-8163536
The code being called in this particular application was quite old and did not show any
obvious connection with JNI calls as described. But it was definitely something to look out
for.
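If a similar crash recurs, one quick way to check for JNI involvement is to inspect the stack recorded in the hs_err file; this is a generic sketch rather than a command used in this investigation:
less /tmp/hs_err_pid2704.log
# Check the "Problematic frame" line in the header and the "Native frames" / "Java frames"
# sections further down: frames prefixed with "C" are native (potentially JNI) code,
# "j"/"J" are interpreted/compiled Java frames, and "V" are JVM-internal frames.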
Degraded Performance Analysis
As a first round of tuning, the following GC parameters were applied –
-XX:ConcGCThreads=16
-XX:G1ReservePercent=15 -XX:InitiatingHeapOccupancyPercent=40 -XX:ParallelGCThreads=48
-XX:G1HeapRegionSize=4194304
After making the above changes, there was a slight
improvement in performance, but it was not very consistent. These are the GC graphs after applying the
above parameters -
In the meanwhile, additional data from NMON was also investigated to understand how much additional CPU headroom was available. The GC logs were examined in detail; the findings have been categorized below, along with the remedial actions for each of them -
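For reference, NMON data of this kind is usually captured in recording mode with something like the following; the interval and snapshot count here are illustrative, not the values used on this system:
nmon -f -s 30 -c 2880
# -f writes snapshots to a .nmon file, -s is the interval in seconds, -c the number of snapshots
# (30s x 2880 = 24 hours of data)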
Log entry
[G1Ergonomics (Mixed GCs) start mixed GCs, reason: candidate old regions available, candidate old regions: 1549 regions, reclaimable: 4541130480 bytes (6.71 %), threshold: 5.00 %]
Problem – A lot of regions are getting filled up quickly, and the reclaimable threshold is at its default of 5%.
Proposed Solution -
-XX:G1HeapWastePercent=10
This gives more headroom for objects to stay alive, but it
also has an impact on the overall usable size of the heap.
For the current heap of roughly 60GB, 5% would be 3GB; with the value increased to 10%, the
tolerated waste would be 6GB. So, the
proposal is to increase the heap to 70GB to compensate.
Log entry
[G1Ergonomics (Concurrent Cycles) request concurrent cycle initiation, reason: occupancy higher than threshold, occupancy: 21848129536 bytes, allocation request: 4809696 bytes, threshold: 20293720470 bytes (30.00 %), source: concurrent humongous allocation]
[G1Ergonomics (Concurrent Cycles) do not request concurrent cycle initiation, reason: still doing mixed collections, occupancy: 21055406080 bytes, allocation request: 134217752 bytes, threshold: 20293720470 bytes (30.00 %), source: concurrent humongous allocation]
Problem – Frequent humongous allocation requests. The allocation requests (in MB) from the application
looked like this (computed after copying the log data to a spreadsheet and filtering) –
4.586884, 364.9705, 8.974113, 2.962227, 5.730789, 88.24471, 5.70269, 9.895233, 2.108994, 4.586884,
8.974113, 182.4853, 5.55349, 5.70269, 2.705833, 5.70269, 4.586884, 8.974113, 8.974113, 25.06779,
4.586884, 364.9705, 78.94887, 8.974113, 5.172989, 8.974113, 5.70269, 2.108994, 5.70269, 5.70269,
4.586884, 4.586884, 3.312172, 8.974113, 9.895233, 39.72817, 30.17077, 8.974113, 44.12237, 5.70269,
5.70269, 5.70269, 9.895233, 8.493942, 3.538147, 8.974113, 4.586884, 5.70269, 2.331924, 9.895233,
9.895233, 2.62748, 4.586884, 8.974113, 5.70269, 5.70269, 4.586884, 9.895233, 4.586884, 3.121033,
4.586884, 4.586884, 2.484375, 4.586884, 9.895233, 8.974113, 4.586884, 2.000023, 2.108994, 12.58921,
8.974113, 4.586884, 4.586884, 2.887321, 8.641792, 3.450066, 3.176476
Proposed Solution
– Increase the region size to the next step, 8MB. Although this doesn't completely solve the
issue, it would reduce the fragmentation caused by the many
partially empty regions, which in turn has a positive effect on the concurrent GC
cycles.
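To make the effect of the region size concrete, recall that G1 treats any allocation request larger than half a region as humongous, so the thresholds work out as follows:
Region size 4MB  ->  humongous threshold = 2MB
Region size 8MB  ->  humongous threshold = 4MB
With 8MB regions, the many 2-4MB requests in the list above (for example the 2.1MB, 2.9MB, and 3.5MB entries) are no longer humongous, while the very large requests (such as the ~365MB ones) remain humongous but span fewer, larger regions.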
Log entry
[G1Ergonomics (CSet Construction) finish choosing CSet, eden: 705 regions, survivors: 101 regions, old: 0 regions, predicted pause time: 2814.90 ms, target pause time: 200.00 ms]
Problem – Pause time goals
In the first set of tuning values, MaxGCPauseMillis was
set to 500ms. Although the
overall average came in around this number, the 90th percentile was
about 700ms.
In the second set of tuning, where this value was removed, it
defaulted to 200ms. In this
spell, the 90th percentile was about 650ms.
Proposed solution – Leave it at the current default value of 200ms and observe the performance for a few days, or change the value to somewhere around 300-400ms to arrive at a sweet spot and give more headroom to the JVM.
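If the second option is chosen, it is a single-flag change; the value of 350ms below is simply an illustrative midpoint of the suggested range:
-XX:MaxGCPauseMillis=350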
Other parameters, their significance and the path forward
-XX:G1ReservePercent=15 – This parameter worked well, and
there were no to-space exhaustion issues, so it can be left as-is.
-XX:InitiatingHeapOccupancyPercent=40 – This also didn't
show much impact, so it can either be left at this number or reduced to the
previous value of 30 so that the marking cycle starts earlier. This can be taken up later.
Final list of GC parameters proposed
-XX:G1HeapRegionSize=8M -XX:ConcGCThreads=16
-XX:ParallelGCThreads=48 -XX:InitiatingHeapOccupancyPercent=40
-XX:+ParallelRefProcEnabled -XX:G1ReservePercent=15 -XX:G1HeapWastePercent=10
HEAPSIZE=70g
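Note that HEAPSIZE is the application's own configuration setting; assuming it maps directly to the standard JVM heap flags, the proposed 70GB heap would translate to something like:
-Xms70g -Xmx70g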
Results and outcome
Based on the above investigations, the JDK version was first
reverted to the standardized, more stable sub-version.
Applying the final recommended GC parameters
stabilized the instance to a great extent, and the problems of crashes and degraded
performance were not reported again.
Conclusion and takeaways
This article should have provided an overall perspective on
dealing with a production incident and the approach taken to
identify and root-cause the problem, and finally to suggest a remediation.
The tuning options recommended here were very specific to this
particular use case, and such options usually remain unique for each and every
application. It is up to the engineers
working on the problem to use these as a reference and then adapt or modify them
appropriately for their specific requirements, as there is no 'one size
fits all' solution. The same problem could have been dealt with using a different set of tuning parameters; this is one of many approaches and not necessarily the only solution. It is up to the engineers working on such problems to stay curious and invest time in investigation and experimentation to get things right.
On a closing note, I appreciate the time you have spent reading through this article, and I hope it has provided some insight and
a path to dealing with such problems. All
the best, and I will see you next time with another problem. Until then, stay tuned!