In DB2 10.5 with default WLM configuration, you can achieve something very special that you might not have given a thought. WLM is one of my favorite topic on which I can talk non-stop for few hours. Based upon my experiences working with large customers, here is a no-frills approach that will protect your database from failures and continue to give you the performance that you currently have. [Remember health care site – I wish I could have advised them had the contractor chosen DB2 instead of Oracle.] If they had used DB2 with WLM, they would not have encountered those database slowness regardless how much traffic was coming to the system.
Now, this is a bold statement and I will show how you can do this in your environment and not have those embarrassing moments. [But, it has a downsize also. How would your management appreciate you if there were no problems? So be practical – first let problems surface and then solve them using this skill and look better and possibly get some raises too.]
Before, I go into details – let me touch one point. No organization (Except NASA) on earth can build / procure unlimited hardware to meet ultimate business needs. Take same famous health care site where large number of people were trying to sign up. Say, to estimate a system that meets ultimate business needs, you might need a system with 1000 cores but you need it only less than 0.01 percent of time.
A smart CFO would not approve this type of investment to have 1000 cores to meet 0.01 % of business needs. May be, you only need 32 cores and build enough protection in the database so that it continues to function without crash or long delays and not become an embarrassment for the organization.
This article shows how you can build this in less than 10 minutes using no-nonsense default mechanism in DB2.
On a side note, you could implement a much comprehensive Work Load Management in DB2 but I will focus here what you can get out of the box and get yourself a promotion or at least save your paycheck.
Run this db2look command in your DB2 10.5 database to extract the current WLM DDLs.
$ db2look -d <databasename> -wlm
Here is the sample output (details removed).
ALTER WORK CLASS SET "SYSDEFAULTUSERWCS" ALTER WORK CLASS "SYSMANAGEDQUERIES" FOR TIMERONCOST FROM 150000 TO UNBOUNDED ; ALTER WORK ACTION SET "SYSDEFAULTUSERWAS" ALTER WORK ACTION "SYSMAPMANAGEDQUERIES" ENABLE; ALTER WORK ACTION SET "SYSDEFAULTUSERWAS" ENABLE; ALTER THRESHOLD "SYSDEFAULTCONCURRENT" DISABLE WHEN CONCURRENTDBCOORDACTIVITIES > 11 AND QUEUEDACTIVITIES UNBOUNDED STOP EXECUTION;
By looking at the above output, you would notice the default service super classes, sub classes, work loads, work class sets, work action sets, work action and thresholds, which were created automatically.
Please notice that we have SYSMANAGEDQUERIES work class which defines queries timeron costs between 150000 to unbounded. This work class and associated work action sets are all enabled.
What does this mean to you?
It does not do any thing good if there is no action associated with this. Let us investigate this further.
Please notice a threshold SYSDEFAULTCONCURRENT, which has the concurrent DB activities defined as 11 with no limit on queuing and the action is the STOP.
This is the Santry who protects your system. As a DBA, you should really pay attention to the thresholds in WLM. (The heart of WLM is thresholds management contrary to the popular belief of CPU% or CPU share. The CPU % etc play a role but thresholds are the basis of the best WLM practices.)
What does this mean to you?
It says that in the database, if there are more than 11 concurrent DB activities, WLM will start queuing other activities and they will not be allowed to execute. The threshold action defined as STOP will stop the execution.
You may also define CONTINUE action that will allow an activity to continue but it will be held in the queue if there are more than 11 concurrent DB activities at any given time.
Now, how could this have solved the problem of health care site when lot many users were logged in. The experience was that the site was extremely slow.
This would not be the case here.
Let us understand this with an analogy.
There is an auditorium jam packed with 3500 people (users connected to the database) in it. There is a fire in the hall now and there are only 11 exits (cores in CPU). Now, if you allow all people to exit at the same time, you can imagine the disaster you are going to witness. But, if you do an orderly job, and allow people to exit in a line in a calm fashion, you will empty the hall way faster than everyone trying to get out at the same time.
This is what happened in health care site when every user connected was trying to get the database resources and it was a disaster. Whereas, if we allow an orderly fashion, it would be a much different story. In distributed environments (Linux, Unix and Windows), the system will use time slicing to give CPU resources uniformly to ‘n’ number of activities. It is like everyone trying to get out of auditorium in case of a fire.
Now let us get back to the DB2. This thresholds is disabled. But, the above output does not tell to whom this threshold belongs to and what is the scope of this threshold.
Let us run the next command to check on this threshold.
$ db2pd -d <databasename> -thresholds Service Class Thresholds: Threshold Name = SYSDEFAULTCONCURRENT Threshold ID = 2147483647 Domain = 40 Domain ID = 4 Predicate ID = 90 Maximum Value = 11 Enforcement = D Queueing = Y Queue Size = -1 Collect Flags = N Partition Flags = C Execute Flags = S Enabled = N Check Interval (seconds) = 0 Remap Target Serv. Subclass = 0 Log Violation Evmon Record = Y Service Class Threshold Queues: Queue information for threshold: SYSDEFAULTCONCURRENT Max Concurrency = 11 Concurrency = 0 Max Queue Size = -1
The output above tells that this threshold is defined at the service class level (domain ID = 5) and the enforcement is at the DATABASE level, the action is STOP but the threshold is not enabled.
So, IBM DB2 provided you a default configuration with a green channel (Threshold disabled) where every query is allowed to run without any prohibition.
Let us look at the following picture to get a very high level picture of the default WLM configuration.
The above picture explains the following:
We have a service super class SYSDEFAULTUSERCLASS, which has two service subclasses defined SYSDEFAULTSUBCLASS and SYSDEFAULTMANAGEDSUBCLASS. All system activities run under SYSDEFAULTSUBCLASS and we are not showing them here.
The another service subclass SYSDEFAULTMANAGEDSUBCLASS is the real hero here, which has threshold SYSDEFAULTCONCURRENT defined in it with the scope at the database level.
We have a Work Action Sets SYSDEFAULTUSERWAS and underneath this is a work action SYSMAPMANAGEDQUERIES with a work class SYSMANAGEDQUERIES, which is like the red channel through which the queries have to go through. If there are more than 11 concurrent activities activities, the threshold action would come into force which would stop (or kill) the query. You could change this action to continue which will allow queries to queue.
This work class SYSMANAGEDQUERIES is associated with the same service subclass SYSDEFAULTMANAGEDSUBCLASS under which we have this threshold.
So what is next?
Now, we will enable this threshold that will act like a sentry in the system.
ALTER THRESHOLD SYSDEFAULTCONCURRENT ENABLE
As soon as you enable this threshold, it creates a red channel.
The analogy of green and red channel is this:
All queries less than 150000 cost are unbounded. All these queries less that 150000 timeron cost go through this green channel where there is no guard. (Like you clear through customs at the airport and just walk through the green channel where there is no custom officer.)
All queries greater than 150000 are flagged and they have to go through this red channel where they have to pass the scrutiny of a threshold.
We have the following rule defined: In the system at the database level, if there is any query having timeron cost above 150000 and if there are concurrent DB activities more than 11 (like more than using 11 cores) at any point, that query is a candidate to get axed with the use of the trigger through the threshold, which will STOP that query. This is like someone swiped the credit card and the clerk says – do it one more time as it did not go through. Chances are the WLM threshold violations triggered the STOP execution. This type of system can never put a burden on the system as we have happy 11 cores doing the work all the time and system does not have to do expensive time slicing for too many users.
What do you need to do now?
1. Define what is the red channel. Cost of the queries that you want to define the boundary or scope of the red channel.
2. Decide the concurrency limit. Ideally equal to the number of cores minus some cores to do the other database and OS activities to do the work. For example, in a 16 core system, we used 11 cores as concurrency limit. In 32 core system, you could use a value of 25 and in a 64 cores system, you could use a value of 52 cores. Start with a number and then adjust if your system is under utilized or over utilized by looking at the time spent in the queue.
So first things first.
1. Change the scope of the query cost.
2. Enable the threshold.
A. Change heavy query timeron threshold to 200000
ALTER WORK ACTION SET SYSDEFAULTUSERWAS DISABLE ALTER WORK CLASS SET SYSDEFAULTUSERWCS ALTER WORK CLASS SYSMANAGEDQUERIES FOR TIMERONCOST FROM 200000 TO UNBOUNDED ALTER WORK ACTION SET SYSDEFAULTUSERWAS ENABLE
B. Limit concurrent heavy queries to 25. (Assuming that you are using 32 cores for DB2.)
ALTER THRESHOLD SYSDEFAULTCONCURRENT DISABLE ALTER THRESHOLD SYSDEFAULTCONCURRENT WHEN CONCURRENTDBCOORDACTIVITIES > 30 STOP EXECUTION ALTER THRESHOLD SYSDEFAULTCONCURRENT ENABLE
With above, you have enabled rules in your system that will not put a burden on the system.
The following animation explains the following scenarios:
1. Green Channel – Queries < 50000 timeon cost will run unbounded. Red Channel – Queries > 50000 and
ALTER WORK ACTION SET SYSDEFAULTUSERWAS DISABLE ALTER WORK CLASS SET SYSDEFAULTUSERWCS ALTER WORK CLASS SYSMANAGEDQUERIES FOR TIMERONCOST FROM 50000 TO UNBOUNDED ALTER WORK ACTION SET SYSDEFAULTUSERWAS ENABLE
2. Concurrency – Only 4 queries will be allowed to run concurrently in the system having timeone cost > 50000 and they will be queued.
ALTER THRESHOLD SYSDEFAULTCONCURRENT DISABLE ALTER THRESHOLD SYSDEFAULTCONCURRENT WHEN CONCURRENTDBCOORDACTIVITIES > 4 CONTINUE EXECUTION ALTER THRESHOLD SYSDEFAULTCONCURRENT ENABLE