Diagnosing Performance Issues: Using strace with PHP

Problem Scenario

Home page is exhibiting slow load times (40 to 50 seconds).
Server logs log/stderr and log/system do not indicate unusual behavior.

Platform:
LiteSpeed with cPanel EA3 on CloudLinux, Drupal website.
Drupal site running PHP 5.6 without a full page cache solution.
PHP Opcode cache enabled in php.ini.
CloudLinux PHP selector installed but not enabled.
The user account is outside of the cage.
No mod_bw limitation found in the configuration.

Attempted Resolution Actions:
Increasing PHP_SuEXEC_max_connection for the domain did not help.
Moving account user into the cage and changing PHP to 7.0 using php selector did not help – loading just as slow as PHP 5.6, with no PHP out of memory errors or core files generated.
Increasing CloudLinux LVEs for the account did not help. Output from top command showed no significant traffic, server load was well below 1.
Most steps in How to improve my site’s performance were tried but none worked.

Diagnosis

The strace command provides fine-grained information on an executing process. Using strace allows us to examine the behavior of PHP when handling requests. We’ll need the process id of the relevant PHP process.

Since the platform uses LiteSpeed, we’ll be looking for running lsphp processes. There should be one parent lsphp process and several children processes. The lsphp parent id can be found using

ps -ef | grep lsphp | grep user1

Now we can strace the process, redirect output to a file, and reproduce the problem (in this case reloading the homepage for the site) while strace is running:

strace -tt -T -p &> output1

The arguments to strace include printing microsecond data (-tt) and showing the time spent in system calls (-T). Stop the strace process (hit Ctrl-C or your interrupt key) when done.

The strace output is typically massive and includes large amounts of data that is not directly relevant to the problem we are trying to diagnose. We need to look for the symptom that is impacting performance, in our case a long delay between subsequent trace lines for the same process.

To make this information easier to narrow down, we can apply some tools – first we can remove non pid trace lines and sort the output by process id and then timestamp. We create a simple shell script:

$ cat sort_strace
#!/bin/sh
grep '^\[pid' | sort -k2n -k3n

and run it:

$ ./sort_strace < straceparent.log > strace.sorted

(Note this is just a simple tool that does not account for times wrapping around midnight – strace output does not provide date information. We may get false positives when examining the sorted output if the time trace wraps around, and we’ll have to account for these cases).

With the data sorted, we are now looking for traces that show long delays in process execution.
There are many possible ways to do this, we chose to write a simple perl script to locate such lines in the sorted trace file:

$ ./timefilter 20 < strace.sorted > strace.problems

Our timefilter script identifies lines that are spaced apart by more than its argument (20 seconds in this case). The output from the script shows us where we might want to investigate further:

$ less strace.problems
TIME DIFF 29.982 in line 10486:
[pid 867836] 23:37:24.190415 poll([{fd=7, events=POLLIN|POLLPRI}], 1, 29953 <...unfinished>
[pid 867836] 23:37:54.172538 <...poll resumed> ) = 0 (Timeout) <29.982116>
TIME DIFF 29.966 in line 19888:
[pid 867837] 23:37:24.180188 poll([{fd=7, events=POLLIN|POLLPRI}], 1, 29937 <...unfinished>
[pid 867837] 23:37:54.146488 <...poll resumed> ) = 0 (Timeout) <29.966291>
.
.
TIME DIFF 29.914 in line 284062:
[pid 868142] 23:37:55.427591 poll([{fd=7, events=POLLIN|POLLPRI}], 1, 29884 <...unfinished>
[pid 868142] 23:38:25.341395 <...poll resumed> ) = 0 (Timeout) <29.913795>
.
.
.

We note that the delays are very close to 30 seconds, so most likely we are dealing with some kind of timeout. We also see that these seem to center around fd=7.

We select one of the problem areas for further examination – let’s say process id 868142, at line 284062 in our sorted file:

$ vi +284062 strace.sorted

Searching back to determine what is being done with fd 7 in this PHP process, at line 283541 we find the initial usage (address masked for privacy):

[pid 868142] 23:37:55.321059 connect(7, {sa_family=AF_INET, sin_port=htons(443), sin_addr=inet_addr("___.__.___.___")}, 16 <...unfinished>

So we have a tcp connection to port 443 on the masked IP address, which just times out after 30 seconds, slowing down the page load.

To solve the slow loading issue, the web site PHP code has to be modified to avoid such blocking connections that can significantly slow down site performance.

Summary

The strace utility provides invaluable information in diagnosing and debugging performance issues. Simple tools can help navigate through the output from strace to locate problem areas. Focused examination of the trouble spots can help identify the culprit – usually improper coding practices.

Unabashed Commercial Plug

LiteSpeed Cache Solutions can avoid PHP process execution and accelerate your Drupal site to make it breathtakingly faster. Visit LSCache + Drupal 7 Configuration for more information.

One Response to “Diagnosing Performance Issues: Using strace with PHP”

  1. Very good information here. We have been doing it this way for years for all sorts of applications. This will work for more than just PHP. It’s also useful for debugging applications that are not working properly.

Leave a Reply