How to investigate performance issues in software?

Post date: Sep 29, 2018 8:18:40 PM

Performance is one of the most important quality attributes of a software system. Performance issues are often reported or detected late, and at that point finding the root cause quickly is critical. Without a clear approach, project teams tend to look in random places and apply random fixes that don't work.

Common types of reported performance issues

First of all, it is important to understand the difference between the following categories, all of which tend to get lumped together as "performance issues":

Response time

When the application takes so long to respond that it becomes difficult for users to perform their tasks efficiently. If the requirement specification does not define what a good response time is, arguments follow; expecting an arbitrary response time is not reasonable.

Some companies adopt internal targets for their web applications, e.g. 2 seconds for the login page, 9 seconds for other normal pages, and 2 seconds at most for database queries to execute. This is not a worldwide standard, and it cannot be: one application may be self-contained and independent, while another may depend on systems beyond its own boundary to fulfil its responsibilities.

Before blaming an application for poor performance, it is important to consider factors like network latency and the response time of a dependent system or database.

Reasons behind poor response time

    • Network

    • Hardware

    • Software (wrong architecture, design, or algorithms in code)
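Before blaming any one of these layers, it helps to measure where the time actually goes. A minimal sketch of per-phase timing (the phase names and the sleep calls are placeholders for real network, database, and application work):

```python
import time
from contextlib import contextmanager

# Hypothetical helper that breaks a request's total time into phases, so that
# slowness can be attributed to the network, the database, or application code.
class PhaseTimer:
    def __init__(self):
        self.phases = {}

    @contextmanager
    def phase(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.phases[name] = time.perf_counter() - start

timer = PhaseTimer()
with timer.phase("database"):
    time.sleep(0.05)   # stand-in for a slow query
with timer.phase("application"):
    time.sleep(0.01)   # stand-in for business logic

slowest = max(timer.phases, key=timer.phases.get)
print(f"slowest phase: {slowest}")  # prints: slowest phase: database
```

In a real system the phases would wrap actual DNS/connect/query/render steps, but the idea is the same: measure first, then blame.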

Scalability

This is the case when an application works fine under normal user load, but starts responding slowly, stops responding, or crashes when the load increases.

Teams sometimes try to solve this problem by adding resources like memory or more nodes (scaling horizontally), but this will not fix the problem for good if something is wrong in the architecture, design, or code.

Reasons behind scalability issues

    • Hardware

    • Bad architecture & design

    • Code

Availability

The system crashes at certain points in time or after running for a while. Sometimes this is purely a robustness issue that users report as a performance issue, but sometimes it is genuinely caused by performance problems rather than robustness.

Reasons behind availability issues

    • Memory leakage in code

    • Robustness issues in the application code

    • Hardware

    • Network

How to investigate performance issues in a software application?

Before starting the investigation, a few questions need to be asked and answered clearly, because the performance issue reported by an end user often gets diluted by the time it reaches an architect or developer.

A couple of further questions also need to be answered as part of the investigation itself.

Questions to answer

    • How many users can work on the application comfortably?

    • What are the maximum, minimum, and average user loads, both concurrent and total per day?

    • Which use cases/scenarios have the performance problem?

    • What exactly is the performance problem? For example: long initial loading, slow responses to subsequent requests, or the server failing to handle a high volume of users?

    • What are the actual numbers (e.g. the page takes 5 minutes to load, subsequent requests take 3 minutes, the server crashes when concurrent users reach 25)?

    • What are the server statistics right now: total and used memory during peak load, and processor utilization during peak load?

    • What performance was the application designed for?

Most performance issues lie in the database and in network round trips. The round trips may be between the application layer and the database, or between the application and other systems beyond its boundary. Hence the investigation should start with one of these areas (assuming capacity- and latency-related investigations have already been conducted and found sufficient).

Investigation for response time

Step 1: Database investigation

What to do?

    • Database profiling

    • Static review of the DB code (automated + architect)

How to do?

    • Static Analysis of code (DB)

    • Database architecture review, e.g. DB queries/PLSQL, indexes, configuration.

    • Turn the DB profiler/tracing on; such tools ship with MSSQL and Oracle.

Example checks

    • Time taken by queries to return the results in single user and concurrent user scenarios.

    • Checking whether the right types of indexes are used properly.

    • Queries are not falling back to full table scans, and do not produce Cartesian products where we don't intend them.

    • Indexes are not created on table columns that are updated frequently. Such an index forces the database to re-index on every update, which hurts its performance.

    • SELECT * is not used without need (it should be avoided as much as possible).
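Several of these checks can be automated against the query planner. A minimal sketch using SQLite's EXPLAIN QUERY PLAN (the table and column names are invented for illustration; MSSQL and Oracle expose richer plans through their own profiling tools):

```python
import sqlite3

# Check for full table scans: inspect the query plan before and after indexing.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)

def plan_for(query):
    # Each plan row's last column is a human-readable step, typically
    # "SCAN orders" for a full scan or "SEARCH orders USING INDEX ..." otherwise.
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + query))

query = "SELECT id, total FROM orders WHERE customer_id = 42"
print(plan_for(query))   # without an index: a full table scan

conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
print(plan_for(query))   # now the planner searches via the index
```

The same idea works with any database that exposes its plans; the point is to make "no full table scans on hot queries" a check you can run, not just a review comment.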

Example recommendations

    • Use indexes on the columns that are used in queries.

    • Use joins instead of simply merging two queries that produce a Cartesian product.

    • Avoid creating many indexes that are never used.

    • Avoid SELECT *.
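The Cartesian-product recommendation can be illustrated with a toy SQLite schema (names are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Lin');
    INSERT INTO orders VALUES (10, 1, 5.0), (11, 1, 7.5), (12, 2, 3.0);
""")

# Missing join condition: every customer paired with every order (2 x 3 = 6 rows).
cartesian = conn.execute("SELECT * FROM customers, orders").fetchall()

# Explicit join: only matching rows (3 rows), far less data to produce and transfer.
joined = conn.execute(
    "SELECT c.name, o.total FROM customers c JOIN orders o ON o.customer_id = c.id"
).fetchall()

print(len(cartesian), len(joined))  # prints: 6 3
```

On two-row toy tables the difference is harmless; on production tables with millions of rows, an accidental Cartesian product is often the single query that brings the database down.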

Step 2: Web/App Server side investigation

What to do?

    • Static review of the architecture, design, and code

    • Performance testing per scenario

    • Memory profiling

    • Statistics of the system (processor utilization, memory utilization, etc.)

How to do?

    • Static Analysis Tools

    • Profiler

    • Architect review

Example checks

    • Remote calls in loops

    • Common data is fetched from the database on every request, or is kept in every user session; both are wasteful.

    • Lots of data is kept in session scope.

    • Pagination techniques are used properly.

    • Chatty remote interfaces accessed from client-side code (e.g. an Angular layer calling remote REST services that are just wrappers over fine-grained objects, making several round trips for a single operation).

    • Fine-grained access to the database from server-side code is kept to a minimum.
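The "remote calls in loops" and "chatty interface" checks boil down to counting round trips per operation. A hedged sketch, where remote_fetch() is a stand-in for any real network call:

```python
# Count round trips: the same data fetched item by item versus in one batch.
CALLS = {"count": 0}

def remote_fetch(ids):
    CALLS["count"] += 1                       # one network round trip
    return {i: f"record-{i}" for i in ids}

def chatty(ids):
    # N round trips: the typical symptom found in profiling or code review
    return {i: remote_fetch([i])[i] for i in ids}

def batched(ids):
    # 1 round trip for the same result
    return remote_fetch(ids)

ids = list(range(20))
assert chatty(ids) == batched(ids)   # identical results...
print(CALLS["count"])                # ...but 20 calls for chatty + 1 for batched = 21
```

With a realistic round-trip latency of a few milliseconds, the chatty version turns one operation into a user-visible delay, which is exactly what profiling should surface.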

Example recommendations

    • For coarse-grained access from the client, the server side can provide synthetic coarse-grained objects that wrap the fine-grained ones.

    • Avoid remote method invocations as much as possible through software architecture changes.

    • Make use of batch queries, prepared statements, etc.

    • Use caching for data that is common across the application.

    • Avoid heavy session state.

    • Use pagination techniques (but don't load all the data to paginate in one go, and don't keep it in the session).

    • Co-locate the app and DB servers if no other choice is left, keeping in mind that horizontal scalability (multiple nodes of such co-located servers) is then not possible.
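The caching recommendation can be sketched with Python's functools.lru_cache; lookup_country() here is a hypothetical stand-in for a query against slowly-changing reference data (country lists, configuration, lookup tables):

```python
from functools import lru_cache

# Count how many times the "database" is actually hit.
DB_HITS = {"count": 0}

@lru_cache(maxsize=1024)
def lookup_country(code):
    DB_HITS["count"] += 1          # only a cache miss reaches the database
    return {"IN": "India", "DE": "Germany"}.get(code, "Unknown")

for _ in range(1000):              # a burst of requests for the same data
    lookup_country("IN")

print(DB_HITS["count"])  # prints: 1
```

In a real multi-node deployment an external cache (e.g. a shared cache server) replaces the in-process one, and cache invalidation for data that does change must be designed explicitly.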

Step 3 : Client side investigation

What to do?

    • Static review of the code

    • Network communication

How to do?

    • Static Analysis Tools

    • Browser console/tools

    • Architect review

Example checks

    • Check if client side code is making multiple calls to remote services for a single operation.

    • In Angular: watchers, $watch(), and ng-repeat are expensive.

Example recommendations

    • Use Java Web Start in place of applets to avoid loading the applet on every use, where that is the situation (though even converting applets to a standalone client is a tricky task). These are age-old scenarios and hard to find in today's world.

    • To avoid network round trips, fine-grained access can be converted into coarse-grained access to the server side (where such an opportunity exists; in most cases this is already taken care of).

    • Use a thin client with RIA frameworks.

    • If you are using a UI/UX framework, refer to the best practices for that framework; for example, in Angular prefer $watchCollection() over $watch().

Investigation for scalability

What to do?

    • Load Testing

    • After the load test, an architect review and possibly static analysis may be required to find the root cause of the scalability limitations the test reveals.

How to do?

    • LoadRunner or similar

    • Jmeter (open source)

    • OpenSTA

    • Manual testing with a small user load (when the purpose is initial investigation only)

    • Architect review
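The tools above automate this at scale, but the core of a load test can be sketched in a few lines; handle_request() is a placeholder for a real HTTP call against the system under test:

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

# Minimal load-test loop in the spirit of JMeter/LoadRunner: fire concurrent
# requests and report latency percentiles.
def handle_request(_):
    start = time.perf_counter()
    time.sleep(0.005)                 # simulated server work per request
    return time.perf_counter() - start

with ThreadPoolExecutor(max_workers=10) as pool:   # 10 concurrent "users"
    latencies = sorted(pool.map(handle_request, range(100)))

p50 = statistics.median(latencies)
p95 = latencies[int(0.95 * len(latencies))]
print(f"p50={p50 * 1000:.1f}ms p95={p95 * 1000:.1f}ms over {len(latencies)} requests")
```

A real harness would also ramp the user count up gradually and record server-side statistics (memory, CPU) alongside the latencies, so that the knee in the latency curve can be matched to a saturated resource.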

Example recommendations

    • Architecture changes

    • Adding more nodes (clustering and load balancing etc.)

    • Adding more resources, and moving from a 32-bit to a 64-bit architecture if 32-bit is the limitation preventing us from adding more resources.

Fact collection

Collecting facts is important for an architect to find the cause of a performance problem and provide recommendations. Here are some sample fact-collection templates:

Define performance objectives

Please describe the goals or objectives of performance in measurable units.

Normally these should be part of the non-functional requirements in the specification. They represent the needs of the application owner/users.

Expected Response Time

Throughput

Resource utilization

Workload

Key scenarios

Gather inputs from various teams

Application development/management team

IT operations team

Inputs related to user load

Inputs related to data volume

Inputs related to various systems

Business users

Information about poor performing scenarios

Testing team

Results of performance testing of various scenarios/screens/transactions

Database management team (DBAs)

Interesting resources

    1. https://ieeexplore.ieee.org/document/5752531

    2. https://cdn.oreillystatic.com/en/assets/1/event/134/Forensic%20tools%20for%20in-depth%20performance%20investigation%20Presentation.pdf

    3. https://www.datadoghq.com/blog/monitoring-101-investigation/

    4. https://support.solarwinds.com/Success_Center/Server_Application_Monitor_(SAM)/SAM_Documentation/Server_Application_Monitor_Getting_Started_Guide/040_Monitor/Investigate_application_performance_with_Performance_Analysis

    5. https://techbeacon.com/perfguild-5-insights-your-performance-testing-team

    6. https://www.comparitech.com/net-admin/application-performance-management/