Friday, 18 September 2015

Mainframe Overview



"What is a mainframe?" Today, the term mainframe can best be used to describe a style of operation, applications, and operating system facilities. To start with a working definition, a mainframe is what businesses use to host the commercial databases, transaction servers, and applications that require a greater degree of security and availability than is commonly found on smaller-scale machines

z/OS is a 64-bit operating system for mainframes. z/OS supports stable mainframe systems and standards such as CICS, IMS, DB2, RACF, SNA, WebSphere MQ, record-oriented data access methods, REXX, CLIST, SMP/E, JCL, TSO/E, and ISPF, among others. However, z/OS also supports 64-bit Java, C, C++, and UNIX (Single UNIX Specification) APIs and applications through UNIX System Services (The Open Group certifies z/OS as a compliant UNIX operating system), with UNIX/Linux-style hierarchical HFS and zFS file systems. As a result, z/OS hosts a broad range of commercial and open source software. z/OS can communicate directly via TCP/IP, including IPv6, and includes standard HTTP servers (one from Lotus, the other Apache-derived) along with other common services such as FTP, NFS, and CIFS/SMB. Another central design philosophy is support for extremely high quality of service (QoS), even within a single operating system instance, although z/OS has built-in support for Parallel Sysplex clustering.



Mainframe workloads: Batch and online transaction processing

Most mainframe workloads fall into one of two categories: batch processing or online transaction processing, which includes Web-based applications.
One key advantage of mainframe systems is their ability to process terabytes of data from high-speed storage devices and produce valuable output. For example, mainframe systems make it possible for banks and other financial institutions to perform end-of-quarter processing and produce reports that are necessary to customers (for example, quarterly stock statements or pension statements) or to the government (for example, financial results). With mainframe systems, retail stores can generate and consolidate nightly sales reports for review by regional sales managers. The applications that produce these statements are batch applications, which are illustrated at the top of Figure 1.
In contrast to batch processing, transaction processing occurs interactively with the end user. This interaction is outlined at the bottom of Figure 1. Typically, mainframes serve a vast number of transaction systems. These systems are often mission-critical applications that businesses depend on for their core functions. Transaction systems must be able to support an unpredictable number of concurrent users and transaction types. Most transactions are executed in short time periods, fractions of a second in some cases.
Figure 1. Typical mainframe workloads
SAP on Mainframe Delivers High Availability and More
The first article, “Where SAP and System z Intersect,” outlined the components that make up an SAP system, while the final one looks at the benefits of running SAP on System z.
Why do customers choose to run their SAP systems on System z? The primary reason is that their applications require 24x7x52 availability and operation. Each of the SAP components represents a single point of failure, but System z hardware features and z/OS software features, along with changes to SAP technology layer components, eliminate those single points of failure. Let’s examine several in more detail.

Parallel Sysplex

IBM Parallel Sysplex technology provides the hardware-supported clustering functions exploited by DB2 in data-sharing mode. This makes the SAP database server nearly continuously available. With the I/O subsystem design, all DASD is shared by all System z boxes in the Sysplex, so the data is always available to the application. In concert, the SAP application servers have a function (called Sysplex failover) to quickly reconnect to surviving DB2 members in the DB2 data-sharing group. As noted previously, if the SAP database server is unavailable, the SAP system is down.
In the event of some error, users will experience a momentary interruption rather than an unplanned outage. This is made possible by special functionality built into the SAP application server to move workload automatically from one DB2 member to another. Other special built-in functions permit SAP system programmers to gracefully move SAP workload from one DB2 member to another, thereby decreasing the duration of planned outages of the SAP system. IBM DB2 development’s goal is the elimination of all resource unavailable conditions (i.e., SQLCODE -904), and all applications, including SAP, benefit from this.

SAP Central Services

The SAP central services (SCS) function provides the locking mechanism SAP applications use to ensure transactional integrity. Early adopters of SAP on DB2 for z/OS (zSAP) pointed out to SAP that its central instance was a single point of failure. SAP responded with a redesigned version, enhanced to have a mirrored copy of the lock table on another server maintained by the enqueue replication server (ERS). This enhancement makes SCS continuously available. There’s no outage whatsoever in the SAP locking function when the primary enqueue server fails and is restarted on the backup server. SAP end users will see only a slight delay while automation software relocates the SCS.
The SCS has been enhanced recently to enable customers to store the backup lock table in the IBM System z coupling facility. This enhancement speeds up SAP lock processing a bit but, more importantly, it increases availability. Also, it simplifies the Tivoli system automation policy for monitoring this component. The SCS can be restarted in any LPAR of the Parallel Sysplex, rather than having to be restarted in the z/OS LPAR where the ERS is running.

SAP Application Server

SAP application server availability is typically accomplished by having more than one of them. If an application server fails, its users are logged out. Each user must log back in, and the SAP logon-groups function will determine which surviving application server the end user can use. Note that when users log in, they are assigned to one application server and remain connected to that particular application server for the life of the logon session. In the case of the SAP application server running on Linux on System z under z/VM, a customer can avoid a planned application outage by exploiting z/VM's single system image (SSI) and live guest relocation (LGR). Exploiting this feature allows maintenance of z/VM and Linux on System z guests without an application outage.

Network

All SAP components communicate with each other using TCP sessions. This means the network has the potential to be a single point of failure. The z/OS Communications Server solves this problem with something called a virtual IP address. Basically, the z/OS TCP/IP stack allows creation of an IP address that belongs to an application or host, rather than an I/O port. Under the open shortest path first (OSPF) protocol, the network software learns the network high-availability topology (i.e., multiple network ports per LPAR). So if a network link fails, the OSPF protocol will allow network software to choose the next path between SAP components. The end user and application will see an extended response time while the session’s path is being re-routed.

Virtualization and Workload Management

The latest generations of System z machines are very powerful, and it’s rare that a single SAP system can utilize the entire machine. System z firmware allows the physical box to be divided into LPARs. One can share the physical capacity in almost any combination. This is a perfect match for customers who have multiple SAP landscapes. The basic idea is to create LPARs for each lifecycle system in an SAP landscape. They could create separate LPARs for those systems related to sandboxing, development, testing, quality assurance and production. If the application server is run under Linux on System z, they could create the LPARs in pairs: one for z/OS and one for z/VM. This setup is perfect for IT folks to slowly roll out hardware and software maintenance. By the time the new software level reaches the SAP production systems, it has been thoroughly tested.
In the latest firmware, the intelligent resource director (IRD) and workload manager (WLM) enable the system to move CPU resource to LPARs that need the extra horsepower. IRD with WLM can vary only regular engines on and offline. With the addition of the HiperDispatch function, the system can make better use of limited cache memory by ensuring the same workload is run repeatedly against the same physical CPUs. HiperDispatch works for regular, System z Integrated Information Processor (zIIP) and System z Application Assist Processor (zAAP) engines.
All of this function allows you to do more with less. Resource utilization is maximized, with few idle MIPS.

Vertical and Horizontal Scalability

SAP systems scale smoothly with either vertical or horizontal growth of System z hardware. For vertical scaling, the latest enhancement in DB2 V10 allows for thousands of sessions from the call-level interface (CLI) driver into the DB2 subsystem. For horizontal scaling, SAP provides function to control which application servers communicate with which DB2 data-sharing members. This helps eliminate DB2 inter-systems interest and allows for better use of DB2 memory buffers—resulting in less churn.

zIIP Engine Exploitation

SAP was the first application/solution to exploit zIIP engines. Its application server connects to the SAP database server via the DB2 CLI driver. It, in turn, uses the distributed relational database architecture (DRDA) protocol over TCP to introduce work to the database server. This applies to all of the distributed SAP application servers—AIX, Linux and Windows—as well as Linux on System z.

IFL Engine and Internal HiperSockets Exploitation

When the SAP application server runs on Linux on System z, with or without z/VM, more economical IFL engines can be used. If the application server is running on the same physical System z box as the SAP database server, then HiperSockets (network in a box) can be exploited as well. HiperSockets allows network packets to be sent and received at main-memory and CPU speeds. They’re also very secure.

Database Management

DB2 for z/OS can compress the data on disk thereby reducing costs associated with DASD. DB2 exploits hardware instructions to compress the data in tables, and the indices are compressed as well with software-based algorithms.
The DB2 backup system utility, in concert with hierarchical storage management (HSM) and FlashCopy functions, enables non-disruptive, online, full-system and incremental backups. All 80,000-plus tables and 100,000-plus indices are backed up with one utility invocation. The SAP systems keep on running throughout.
The DB2 restore system utility is available to restore very rapidly to current time or to a prior point in time, determined by the business. HSM and FlashCopy are also used on the restore side. In customer environments that run combinations of SAP applications, all of the applications can be recovered to a common point in time, if necessary. This is enabled by the DB2 log record sequence number (LRSN), which is basically a timestamp on DB2 log records.

Integration with DB2 for z/OS

DB2 V5 was the first version to support SAP. Since those early days, IBM and SAP have worked to optimize DB2 for SAP and for SAP to exploit the HA features in DB2 and System z. From DB2 V6 to V10, hundreds of enhancements have been added. The latest, DB2 V10, eliminates nearly all virtual memory constraints to allow for very high vertical scalability. Even with these enhancements, customers should still consider at least two DB2 members hosted on separate System z central electronics complexes to ensure that no single point of failure, such as a system operator error, can take down the SAP system.





Thursday, 17 September 2015

AWR Report Analysis

1. Database Details:

This is the first, topmost part of an AWR report. Use it to cross-check that the database name, instance and database version match the database that has the performance issue. The report also shows RAC = YES if it is a RAC database.

 https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhnsVhPDqGc3amyAUtbIz_9LRlBEr_UtgNMSsDuOCwgEw0GMJs0esMI7r2eMxkgemuKVIn3jYIPW7I0psuqPGCAr0N-T_BbIk_6O41NKHhtS8KvasU3E5luixvc1hyphenhyphenCc60zGRAKLL_K9O_3/s1600/AWR+Report+Fist+Part.JPG

2. Host Configuration:

This section gives the host name, platform, CPUs, cores, sockets, RAM and so on. The important thing to note is the number of cores in the system. In this example there are 12 cores.

https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjMtR6QEyvwunvQqZodLawwxyVWc_M8CYnFbhAB-HAfO9YJZ3O5r8PslfoJmYDQWL3-LhdKGi8kw4bc8u3bafYxzAjykdVuIt9LcKDSyGkqoN_ytBj9uuCoZpSG8hyphenhyphenNmMLz-YRjtYKS6CA4/s1600/AWR+Report+Host+Configuration.JPG

Next, compare the number of sessions at the begin and end snapshots. If the session count has increased, that is a problem; if it is the same, there is no problem. Check the Cursors/Session figure in the same way.
Remember that an increase in cursors has a severe effect on resource utilization. Because this figure is counted per session, a rising value points to a cursor leak.
 
https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgEw_oN7HoC-KWIRnyjTeV_C8zyMqgvjEO9RF9KA_hWFGKbkFEjX200Z-Uol1morgDX0I4WuhoyBN14nIWKTPBlJVqvPV6H8hNbjhy1ENG-b9Fo_yGvdsw1Z4PbV4qPxDGFH3yhfxoNCTfb/s1600/AWR+Report+Snap+Detail.JPG
DB Time is the session time spent in the database: DB Time = CPU time + non-idle wait time.
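As a rough illustration of this formula, the Python sketch below adds CPU time and non-idle wait time to get DB Time, then divides by the wall-clock length of the snapshot interval to get the average number of active sessions. The function names and the sample figures are made up for illustration; they are not taken from the report in the screenshot.

```python
def db_time_seconds(cpu_seconds: float, non_idle_wait_seconds: float) -> float:
    """DB Time = CPU time + non-idle wait time (all values in seconds)."""
    return cpu_seconds + non_idle_wait_seconds


def average_active_sessions(db_time_s: float, elapsed_s: float) -> float:
    """DB Time divided by the wall-clock length of the snapshot interval."""
    if elapsed_s <= 0:
        raise ValueError("elapsed time must be positive")
    return db_time_s / elapsed_s


# Hypothetical one-hour snapshot interval.
db_time = db_time_seconds(cpu_seconds=5400.0, non_idle_wait_seconds=1800.0)
aas = average_active_sessions(db_time, elapsed_s=3600.0)
print(f"DB Time: {db_time:.0f} s, average active sessions: {aas:.1f}")
```

With these sample numbers, two sessions were active on average during the hour.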
3. Load Profile

Here are a few important stats for a DBA to look into. The first is "DB CPU(s)" per second. Before that, let's understand how DB CPU time works. Suppose you have 12 cores in the system. Then, per wall-clock second, you have 12 seconds of CPU time to work with.

https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi8yF61oLXKT9Z_TF3_VzL9E7aCcqClJp1TE55PKKb3jWVuD9dXN_3AaSJFumJ54t7Ca39ijzUuzy0gwjGxzIQucRD4Wz3NFEEZd6IQOEsFh1QIxPWzyNGgtpDBQQD7OeUvTM8SWjsNiBaU/s1600/AWR+Report+Load+Profile.JPG

So, if "DB CPU(s)" per second in this report is greater than the number of cores in Host Configuration (#2), the environment is CPU bound. You then either need more CPUs or need to check further whether this happens all the time or only for a fraction of the time. In my experience, there are very few cases where a system is CPU bound.

Also note that DB Time(s) per second here is the average number of active sessions, and a high Logons figure means that sessions are logging off and on frequently.

In this case, the machine has 12 cores and DB CPU(s) per second is 6.8, so this is not a CPU-bound case.
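The check can be written out as a small Python sketch. The function names are invented for illustration; the inputs are the 12 cores and 6.8 DB CPU seconds per second quoted above.

```python
def cpu_utilization(db_cpu_per_second: float, cores: int) -> float:
    """Share of the available CPU seconds used by the database."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return db_cpu_per_second / cores


def is_cpu_bound(db_cpu_per_second: float, cores: int) -> bool:
    """CPU bound when DB CPU per second exceeds the number of cores."""
    return db_cpu_per_second > cores


share = cpu_utilization(6.8, 12)
print(f"DB CPU uses {share:.0%} of the cores; CPU bound: {is_cpu_bound(6.8, 12)}")
```

About 57% of the available CPU seconds are in use, which leaves comfortable headroom.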

The next stats to look at are Parses and Hard parses. If the ratio of hard parses to parses is high, the database is doing a lot of hard parsing. In that case, look at parameters such as cursor_sharing and at the application's use of bind variables.
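A minimal Python sketch of the ratio check follows. The 10% cut-off and the sample rates are assumptions chosen for illustration, not Oracle recommendations; pick a threshold that suits your workload.

```python
def hard_parse_ratio(hard_parses_per_s: float, parses_per_s: float) -> float:
    """Fraction of all parses that are hard parses (0.0 when nothing is parsed)."""
    if parses_per_s == 0:
        return 0.0
    return hard_parses_per_s / parses_per_s


# Hypothetical cut-off: flag the report when more than 10% of parses are hard.
HIGH_RATIO = 0.10


def needs_parse_review(hard_parses_per_s: float, parses_per_s: float) -> bool:
    return hard_parse_ratio(hard_parses_per_s, parses_per_s) > HIGH_RATIO


ratio = hard_parse_ratio(45.0, 150.0)
print(f"hard parse ratio: {ratio:.0%}, review: {needs_parse_review(45.0, 150.0)}")
```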

The Transactions figure can be very high; it helps in gauging the volume of the load.

5. Instance Efficiency Percentages:


In these statistics, look at "% Non-Parse CPU". If this value is near 100%, most of the CPU resources are being used for operations other than parsing, which is good for database health.

https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgsy97222D98yAriwOsERWETZE9tkRUYDNCTVpDzLTOv_6C0c0uDUQ_4rpGkXkY9kEMZ6qEhf5nfk-3rLddHffj3VLNLr6CWjz6sewz9jXK27DMqIFjEEb01JP89KAsL1FpX1AQp9gBupET/s1600/AWR+Report+Instance+Efficiency+Percentage.JPG

6. Top 5 Timed Foreground Events:

This is another of the most important stats to consider when looking at an AWR report for any database performance issue. It lists the top 5 foreground wait events.

https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg1-CBD7J9ASJcVh56LCyh0A32uxdlnSikm6J1f2FGzVRWN6cA9ugeps8IIlkNGe25Hmwa8j_EZKCM2qtmmcW5FnvD0RVyYoP8NVZYZ6HR9UxJcx7r0XvLeHgbxRgS1tz6I2ZvpiDvZWCMu/s1600/AWR+Report+Top+5++timed+forgroud+events.JPG

Here, first check the wait class. If the wait class is User I/O, System I/O, Other and so on, this could be fine, but if the wait class is "Concurrency" there could be a serious problem. Next, look at Time (s), which shows how long the DB spent waiting on the event, and then at Avg Wait (ms). If Time (s) is high but Avg Wait (ms) is low, you can ignore the event. If both are high, or Avg Wait (ms) is high, the event has to be investigated further.
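This rule of thumb can be sketched in Python as below. The two cut-off values and the sample figures are hypothetical; only the decision order (wait class first, then Avg Wait, then Time) follows the text.

```python
from dataclasses import dataclass


@dataclass
class WaitEvent:
    name: str
    wait_class: str
    time_s: float       # "Time (s)" column
    avg_wait_ms: float  # "Avg Wait (ms)" column


# Hypothetical cut-offs; tune them for your own system.
HIGH_TIME_S = 1000.0
HIGH_AVG_WAIT_MS = 20.0


def triage(event: WaitEvent) -> str:
    """Apply the rule of thumb to one wait event."""
    if event.wait_class == "Concurrency":
        return "investigate"
    if event.avg_wait_ms >= HIGH_AVG_WAIT_MS:
        return "investigate"
    if event.time_s >= HIGH_TIME_S:
        return "ignore"  # high total time, but each individual wait is short
    return "ok"


events = [
    WaitEvent("db file sequential read", "User I/O", 2500.0, 4.0),
    WaitEvent("log file switch (checkpoint incomplete)", "Configuration", 1800.0, 450.0),
    WaitEvent("library cache lock", "Concurrency", 300.0, 8.0),
]
for e in events:
    print(f"{e.name}: {triage(e)}")
```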

In the above screenshot, most of the resource is taken by DB CPU, at 64% of DB time. DB CPU taking most of the resource is a normal situation.

Let's take an example in which the event "log file switch (checkpoint incomplete)" has high waits, a huge Time (s) and large values in Avg Wait (ms), and the wait class is Configuration. Here you have to investigate and resolve log file switch (checkpoint incomplete).

Host CPU, Instance CPU and Memory Statistics are self-explanatory. Next is RAC Statistics; most of the time I did not find any issue in these stats.

7. Time Model Statistics:

This is a detailed breakdown of system resource consumption. The stats are ordered by Time (s) and % of DB Time.
 https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiceeDjVwQh3ZxrEVLaurcCePskTFnpCRMeh604ZJZF9J2qgdmK8mXanxQsLCXLWQzI1t3cn01JlWUiv08FSacaSQUHIfpw4fxu8F19zonMYBFCI5qMbRj8pNthvHkKmXIGkId5JU-Wpon-/s1600/AWR+Report+Time+Model+Stats.JPG
One noticeable result is that the sum of all % of DB time values is greater than 100%. Why is this?

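Because the times are cumulative: a statistic can include the time of its sub-parts, so the same time is counted in more than one row. In this case SQL execute elapsed time takes 89% of DB time, while rows such as parse time elapsed and hard parse elapsed time overlap with one another (hard parse time is part of parse time). So, if you find that hard parse elapsed time is taking a large percentage, investigate it further, and so on for the other rows.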
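The effect is easy to reproduce with made-up numbers. In the Python sketch below, the 89% figure for SQL execute elapsed time matches the report discussed above; every other value is hypothetical and chosen only to show how overlapping rows push the total past 100%.

```python
# Hypothetical time model rows, in seconds, for a DB time of 1000 s.
db_time_s = 1000.0
time_model = {
    "sql execute elapsed time": 890.0,
    "DB CPU": 640.0,                   # overlaps with the elapsed-time rows
    "parse time elapsed": 120.0,
    "hard parse elapsed time": 90.0,   # already counted in parse time elapsed
}

# Each row is reported as its own share of DB time.
percentages = {name: 100.0 * secs / db_time_s for name, secs in time_model.items()}
total_pct = sum(percentages.values())

for name, pct in percentages.items():
    print(f"{name:<28} {pct:5.1f}%")
print(f"{'sum of all rows':<28} {total_pct:5.1f}%")
```

The rows add up to well over 100% because they are not mutually exclusive slices of DB time.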

The DBA has to look for any stat that is taking an abnormal % of DB time.

8. Operating System Statistics - Detail:

This section holds information related to the OS; it shows the load status of the system.

https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgv291NQ_UIFWjyelp1TJ_ks_t4zM3vWuMisgDRcuOx_LK3g1nf5vKFCIWJtBdm6z2Y4WsSlwk04yGrqRjhwuqdnsaiHtDwhBCQrO3UcLnUx9pK4xF_U-wWB6hol7N-5Bvg0zALWC9nTPIy/s1600/AWR+Report+Operating+system+Statistics.JPG

This report shows that the system was between 62% and 70% idle when the report was taken, so there is no resource crunch at system level. If you find very high busy, user or sys percentages, which in turn mean a low idle percentage, investigate what is causing this.
OS Watcher is a tool that can help in this direction.

Next comes a very crucial part of the AWR report for a DBA: SQL Statistics, which has details of all SQL queries executed during the report interval.

https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiOENVM0T7ElcSz2dZtglYGPVwEYzgWHDzSkWWw00-Geutu2XsG2ZKvAp8ibKYh8nDQf70dLTPSjlS61PKuemtfXJVTSvXk3YAoyZPOKNw_YiPfq1v6oEscdUPSmlklWz1myx3ll1sQGGsR/s1600/SQL+Statistics.JPG







9. SQL Ordered by Elapsed Time:

As the name suggests, this lists SQL queries ordered by elapsed time within the reported time interval.
If a query's I/O time is higher than its CPU time, that is an issue.
https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEimpQsCAgxsE3Z-4lSVcu_IOR89iLdCO3GfhSm9Wb7IGkIavimUN3yA4RrJRgcxt7c0XxLKP52ScTEClGFkmzKnm2cV6g6gJgxkwjarFwaNMrlQvMJvzrao2nTMnaGt2CYLfn6-qJIVtmk_/s1600/varwwwclientsclient1web2tmpphpChWzua.jpg

In this report, look for queries with low executions and a high Elapsed Time per Exec (s); such a query could be a candidate for troubleshooting or optimization. In the above report, you can see that the first query has the maximum elapsed time but no executions, so you have to investigate it.

An important point: if executions is 0, it doesn't mean the query is not executing. The query may still have been executing when you took the AWR report, which is why its completion was not covered in the report.
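A short Python sketch of this calculation is shown below. It treats zero executions as "not completed within the interval" rather than dividing by zero. The SQL IDs and figures are placeholders, not values from the screenshot.

```python
from typing import Optional


def elapsed_per_exec(elapsed_s: float, executions: int) -> Optional[float]:
    """Return elapsed seconds per execution, or None when no execution
    completed inside the snapshot interval (executions == 0)."""
    if executions == 0:
        return None
    return elapsed_s / executions


# Hypothetical rows: (sql_id, elapsed seconds, executions).
rows = [
    ("sql_a", 5400.0, 0),
    ("sql_b", 1200.0, 4),
    ("sql_c", 900.0, 30000),
]

for sql_id, elapsed_s, executions in rows:
    per_exec = elapsed_per_exec(elapsed_s, executions)
    if per_exec is None:
        print(f"{sql_id}: still running or not completed in the interval")
    else:
        print(f"{sql_id}: {per_exec:.2f} s per execution")
```

Here sql_b, with few executions and a long time per execution, is the kind of query worth a closer look, while sql_a needs checking to see whether it was still running.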

10. SQL Ordered by CPU Time:

In this report, SQL queries are listed by the CPU time they consume, i.e. the queries causing high load on the system. The top few queries could be candidates for optimization.


https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgJV4HAyUCqd5cJARFSPjgIdU91KU7lPrfle1J8AhtGFvU0J-6EA6TjYtUbos-XSXx6SpC879UgDOjC8DUBZQH9zP_oO5KHPs6oZ7eNGyc9gqvRKyYP9v4gYoM9pp4lmtN_g_BPC3vzcTIB/s1600/SQL+Orderd+by+CPU+TIME.png







From the above stats, look for the queries using the highest CPU time. If a query shows 0 executions, this doesn't mean the query is not executing. It might be the same case as in SQL ordered by elapsed time: the query was still executing when you took the snapshot.

There are many other stats in an AWR report that a DBA needs to consider. I have listed only ten of them, but these are the most commonly used stats for any performance-related investigation.

Wednesday, 16 September 2015

CITRIX Protocol : Issues and resolution

This article highlights some points to consider while recording a Citrix script. It also covers common issues that we face during recording and the solution for each.
  1. Citrix Vuser scripts simulate the Citrix ICA protocol communication between a Citrix client and the server. VuGen records all activities during the communication and generates a Vuser script.
  2. A Citrix VUser’s memory footprint is two to three times a Web VUser’s memory footprint.
  3. In tree-like structures, while recording, expand each child item to display sub-item in order to emulate the playback effectively.
  4. Citrix transactions, Citrix windows and bitmap functions in LoadRunner take time to stabilize, so think time must not be ignored when replaying the script.
  5. Use the keyboard as often as possible and avoid using mouse clicks.

1. Prefer keyboard recording to mouse clicks:
When recording, use keystrokes as often as possible (in preference to mouse clicks). This makes scripting more reliable and means you have fewer scripting and debugging issues. The LR functions recorded for keystrokes and mouse clicks are:
Key Stroke
ctrx_key("RIGHT_ARROW_KEY", 0);
Mouse Click
ctrx_mouse_click(256, 46, RIGHT_BUTTON, 0, "Submit");
Similarly, while selecting data entry items it is better to use keystrokes, e.g. the ‘up’ and ‘down’ arrow keys, to select the appropriate item. Fewer LR script debugging issues are observed this way.
2. Windows settings with Vugen, Controller and LoadGenerator machines -
The Window Size (resolution), Window Colors settings, System Font and the other Default Options should be the same on all the VuGen, Controller and LoadGenerator machines. In LoadRunner, these settings affect the hash value of bitmaps. Differences may cause issues with the bitmap sync functions, and such inconsistencies may cause script replay to fail. To view the Citrix Client settings, select an item from the Citrix program group and choose ‘Application Set Settings’ or ‘Custom connection settings’ from the right-click menu. Select ‘Default Options’.
3. Using the same login ID for multiple virtual users in a load test -
During a load test only a limited number of login IDs may be available. In such cases, login IDs have to be reused on the system to create the required load. To make sure that this happens smoothly, make the following changes in the Access Management Console (AMC) on the Citrix Presentation Server:
  1. On the left panel of AMC, right click on the application being tested and expand the ‘Citrix resources’
  2. Expand ‘Presentation Server’ and then Server Farm Name.
  3. Expand ‘Applications options’
  4. Select properties. This opens the ‘Application Properties’ dialog box.

At times we might have to open the advanced options and select the ‘limits options’. One must ensure that the following options are unchecked:
  • Limit instances allowed to run in server farm
  • Allow one instance of application per user

4. Number of vusers per LoadGenerator Machine:
A LoadGenerator machine can run only a limited number of Citrix Vusers at the same time. This is due to the graphics limitations of that machine, and it limits the load that can be applied to the application. To solve this problem and increase the number of vusers per machine -
1. Open a terminal server session on that machine.
2. Treat this terminal server session as a new virtual-user injector machine.
3. Connect to this new virtual injector machine from the Controller. To connect, use the respective machine names, e.g. Machine_001, Machine_002, etc., or the respective IP addresses.
4. Use these additional machines as LoadGenerators.
5. IP Spoofing characteristics With Citrix
Some applications restrict users to only one IP address and hence require a unique IP address for every virtual user. This feature is enabled for application security, session pooling or many other reasons. LoadRunner’s IP Spoofing is not supported for Citrix Protocol. However, each Citrix session can be configured to use a different IP address by making configuration-setting changes on the Citrix Presentation Server.
For Citrix version 4.5, this information can be found in the fourth chapter, “Using Virtual IP Addresses with Published Applications”, of the tutorial downloaded with the application. The Citrix administrator’s presence is important here, since these configuration changes require registry changes on the server.
6. Pointing appropriately to the correct ICA file:
The connection should be configured to point to the correct ICA file. If Windows authentication keeps popping up, it is likely because Active Directory is unable to authenticate the user. The ICA files may be located in the current user’s profile but not in the virtual user’s profile. In that case, the vuser will not be able to point to the ICA file while replaying or executing the test, and script replay will fail.
Usage of the web_set_user function for a multi-protocol Citrix + Web script -
In a multi-protocol script, NTLM authentication may be required. For NTLM authentication or a proxy server, the domain name, user ID, password, and host and port must be supplied. In such cases the web_set_user() function should be used:
web_set_user("DomainName\\UserID", "password", "host:port");
Dynamic Synchronization for Bitmap Changes
For transactions that involve data entry, bitmaps must be synchronized while changing menus, tabs and windows to ensure that the exact menu, tab or window has loaded successfully. For scripting to flow smoothly, it is necessary to identify when a bitmap changes; this allows the script to wait until an application process is finished. The LR function that monitors bitmap changes is “ctrx_sync_on_bitmap_change”.
The syntax of this function is:
ctrx_sync_on_bitmap_change(x_start, y_start, width, height, <optional arguments>, [CONTINUE_ON_ERROR,] CTRX_LAST);
During script execution, this function captures the value of the defined bitmap and waits for it to change. At times, the bitmap changes before the call to wait is made. In such cases, the script needs to capture the value of the bitmap before initiating the command that changes it, and inserting that original bitmap value in the ctrx_sync_on_bitmap_change function helps here. If the bitmap has not yet changed, ctrx_sync_on_bitmap_change waits until it does; otherwise the script continues as normal. If a predetermined bitmap value is used, all arguments of the function are required. The syntax of the function is:
ctrx_sync_on_bitmap_change(x_start, y_start, width, height, initial_wait_time, timeout, bitmap_value, CTRX_LAST);
In my next article, I will discuss more points that will help performance test engineers working on Citrix protocol.