Linux - System Performance Analysis - CPU Profiling

Subscribe Send me a message home page tags


#linux  #system performance  #cpu  #cpu profiling  #bash  #tool 

Table of Content

1. Introduction

In this pose, we will present a couple linux tools that are useful for CPU profiling. There are too many tools available in Linux and it takes time to master most of them. However, what really matters is the ability to find answers quickly when we debug a system performance related issue.

2. Related Readings

3. A Template for CPU Profiling

In this section we presents a template for basic CPU profling. In general, a CPU profiling task has two parts

There are different types of metrics

Before using any tools, the first thing we need to know is the number of CPUs available on the machine. Keep in mind that some system tool outputs are not normalized by the number of CPUs. For exmaple, we may see 200% CPU utilization on a 2 CPU machine.

1. Get the number of CPUs on a machine

Command: lscpu

This command prints out the details of CPUs on the machine. The line CPU(s) gives the number of CPUs.

Sample Output (truncated):

Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   40 bits physical, 48 bits virtual
CPU(s):                          1
On-line CPU(s) list:             0
Thread(s) per core:              1
Core(s) per socket:              1
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           63
Model name:                      DO-Regular
Stepping:                        2
CPU MHz:                         1797.917
BogoMIPS:                        3595.83

2. Check the CPU utilization

Command: mpstat -P ALL [delay] [count]

This command samples utilization across all CPUs. The delay is the time interval between two consecutive samplings and the count is the number of sampling we want.

For example, to sample the CPU utilization every 1 seconds for 3 times, we use mpstat -P ALL 1 3.

Output:

Linux 5.4.0-72-generic (ubuntu-s-01) 	05/06/21 	_x86_64_	(1 CPU)

03:10:56     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
03:10:57     all   43.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   57.00
03:10:57       0   43.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   57.00

03:10:57     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
03:10:58     all    9.00    0.00    1.00    0.00    0.00    0.00    0.00    0.00    0.00   90.00
03:10:58       0    9.00    0.00    1.00    0.00    0.00    0.00    0.00    0.00    0.00   90.00

03:10:58     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
03:10:59     all    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
03:10:59       0    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00

Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
Average:     all   17.39    0.00    0.33    0.00    0.00    0.00    0.00    0.00    0.00   82.27
Average:       0   17.39    0.00    0.33    0.00    0.00    0.00    0.00    0.00    0.00   82.27

Explanation of field (copied from man page)

field description
CPU "Processor number. The keyword all indicates that statistics are calculated as averages among all processors."
%usr "Show the percentage of CPU utilization that occurred while executing at the user level (application)."
%nice "Show the percentage of CPU utilization that occurred while executing at the user level with nice priority."
%sys "Show the percentage of CPU utilization that occurred while executing at the system level (kernel). Note that this does not include time spent servicing hardware and software interrupts."
%iowait "Show the percentage of time that the CPU or CPUs were idle during which the system had an outstanding disk I/O request."
%irq "Show the percentage of time spent by the CPU or CPUs to service hardware interrupts."
%soft "Show the percentage of time spent by the CPU or CPUs to service software interrupts."
%steal "Show the percentage of time spent in involuntary wait by the virtual CPU or CPUs while the hypervisor was servicing another virtual processor."
%guest "Show the percentage of time spent by the CPU or CPUs to run a virtual processor."
%gnice "Show the percentage of time spent by the CPU or CPUs to run a niced guest."
%idle "Show the percentage of time that the CPU or CPUs were idle and the system did not have an outstanding disk I/O request."

3. Check hot processes

Command: top

This command monitors the CPU utilization by each process and the results are sorted by the utilization. This command allows us to find the hot processes that are consuming lots of CPU resource.

Output:

top - 03:29:18 up 16 days,  1:23,  1 user,  load average: 0.28, 0.18, 0.17
Tasks: 114 total,   1 running, 112 sleeping,   0 stopped,   1 zombie
%Cpu(s): 17.3 us,  0.7 sy,  0.0 ni, 82.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :    981.3 total,     67.2 free,    390.2 used,    524.0 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.    427.9 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 293109 root      20   0  182380  27108   7496 S  16.3   2.7 273:31.36 python3.8
    329 root      19  -1  296892  87692  86416 S   0.3   8.7  11:05.28 systemd-journal
 341038 www-data  20   0 1070808 187212  84212 S   0.3  18.6   0:17.28 apache2
 342763 root      20   0   13780   8868   7436 S   0.3   0.9   0:00.11 sshd

4. Check cumulative CPU utilization of processes.

Command:

ps aux or ps -eo user,pid,ppid,%cpu,%mem,cputime,cmd

The above two commands are almost the same, only the format is slightly different.

Output:

USER         PID    PPID %CPU %MEM     TIME CMD
root           1       0  0.0  1.0 00:02:07 /sbin/init
root           2       0  0.0  0.0 00:00:00 [kthreadd]
root           3       2  0.0  0.0 00:00:00 [rcu_gp]
root           4       2  0.0  0.0 00:00:00 [rcu_par_gp]
root           6       2  0.0  0.0 00:00:00 [kworker/0:0H-kblockd]
root           9       2  0.0  0.0 00:00:00 [mm_percpu_wq]

Here the TIME column is the cumulative cpu time.

5. Check Threads of a Process

We can command ps -eL -q <pid> to check the threads spawn from the given process. If we are lucky, we may see the thread name of the thread. Other available commands are

(Note that -f option will replace thread name with full command line which is the same for all threads.)

There are many parameters that have impact on the output format. Check the man page for more accurate information.

Here is the sample Tread Display section:

THREAD DISPLAY
   H      Show threads as if they were processes.
   -L     Show threads, possibly with LWP and NLWP columns.
   m      Show threads after processes.
   -m     Show threads after processes.
   -T     Show threads, possibly with SPID column.

6. Check process and threads performance

Multi-threading programs are very command and when we perform CPU profiling, we need to check the performance of both processes and threads.

Command:

Note:

Output (truncated):

Linux 5.4.0-72-generic (ubuntu-s-01) 	05/06/21 	_x86_64_	(1 CPU)

14:07:40      UID      TGID       TID    %usr %system  %guest   %wait    %CPU   CPU  Command
14:07:40        0    293109         -    1.39    0.01    0.00    0.00    1.40     0  python3.8 Multithreading_Program.py
14:07:40        0         -    293109    0.00    0.00    0.00    0.00    0.00     0  |__python3.8
14:07:40        0         -    293111    0.69    0.00    0.00    0.47    0.70     0  |__python3.8
14:07:40        0         -    293112    0.69    0.00    0.00    0.47    0.70     0  |__python3.8
14:07:40        0    335692         -    0.00    0.00    0.00    0.00    0.00     0  loop12
14:07:40        0         -    335692    0.00    0.00    0.00    0.00    0.00     0  |__loop12
14:07:40       33    341036         -    0.00    0.00    0.00    0.00    0.00     0  /usr/sbin/apache2 -k start
14:07:40       33         -    341036    0.00    0.00    0.00    0.00    0.00     0  |__apache2

4. Linux perf tool

perf is a very powerful tool. The usage of perf usually has two parts

Process Profiling

Collect process performance stats 20 times/second in the 10 seconds period.

Command: perf record -F 20 -a -g -- sleep 10

Generate process performance reports

Command: perf script

We can also specify the fields: perf script -F pid,tid,cpu,time

Output (truncated):

perf 356806 [000] 1455993.703208:          1 cycles:
        ffffffffa0a73596 native_write_msr+0x6 ([kernel.kallsyms])
        ffffffffa0a0d1a0 intel_pmu_enable_all+0x10 ([kernel.kallsyms])
        ffffffffa0a07588 x86_pmu_enable+0x118 ([kernel.kallsyms])
        ffffffffa0bf94dd perf_pmu_enable.part.0+0xd ([kernel.kallsyms])
        ffffffffa0bf9b3e ctx_resched+0x8e ([kernel.kallsyms])
        ffffffffa0bf9cfe __perf_event_enable+0x18e ([kernel.kallsyms])

perf 356806 [000] 1455993.703262:          1 cycles:
        ffffffffa0a73596 native_write_msr+0x6 ([kernel.kallsyms])
        ffffffffa0a0d1a0 intel_pmu_enable_all+0x10 ([kernel.kallsyms])
        ffffffffa0a07588 x86_pmu_enable+0x118 ([kernel.kallsyms])
        ffffffffa0bf94dd perf_pmu_enable.part.0+0xd ([kernel.kallsyms])
        ffffffffa0bf9b3e ctx_resched+0x8e ([kernel.kallsyms])
        ffffffffa0bf9cfe __perf_event_enable+0x18e ([kernel.kallsyms])
        ffffffffa0bfd12c event_function+0x9c ([kernel.kallsyms])
        ffffffffa0bf7ae3 remote_function+0x43 ([kernel.kallsyms])
        ffffffffa0b3e68d generic_exec_single+0x9d ([kernel.kallsyms])
        ffffffffa0b3e791 smp_call_function_single+0xd1 ([kernel.kallsyms])

Get the time line of events of a process

Command: perf trace -p [pid]

root@ubuntu-s-01:~# perf trace -p 293109
         ? (         ): python3.8/293111  ... [continued]: select())                                           = 0 (Timeout)
     0.017 ( 2.191 ms): python3.8/293111 futex(uaddr: 0x93cb4c, op: WAIT_BITSET|PRIVATE_FLAG, utime: 0x7f65e52bb690, val3: MATCH_ANY) = 0
     2.225 ( 1.114 ms): python3.8/293111 futex(uaddr: 0x93cb50, op: WAIT|PRIVATE_FLAG, val: 2)                 = 0
     3.354 ( 0.010 ms): python3.8/293111 futex(uaddr: 0x93cb50, op: WAKE|PRIVATE_FLAG, val: 1)                 = 0
     3.796 ( 0.027 ms): python3.8/293111 stat(filename: 0xe426e6b0, statbuf: 0x7f65e52ba740)                   = -1 ENOENT (No such file or directory)
     3.886 ( 0.017 ms): python3.8/293111 stat(filename: 0xe426e6b0, statbuf: 0x7f65e52ba740)                   = -1 ENOENT (No such file or directory)
     6.711 ( 0.035 ms): python3.8/293111 stat(filename: 0xe5f83e30, statbuf: 0x7f65e52ba110)                   = 0
     6.782 ( 0.017 ms): python3.8/293111 stat(filename: 0xe5f83e30, statbuf: 0x7f65e52ba2a0)                   = 0
     6.822 ( 0.015 ms): python3.8/293111 stat(filename: 0xe5f83e30, statbuf: 0x7f65e52ba2a0)                   = 0
     7.295 ( 0.035 ms): python3.8/293111 openat(dfd: CWD, filename: 0xe42b1077, flags: RDONLY|CLOEXEC)         = 3
     7.342 ( 0.007 ms): python3.8/293111 lseek(fd: 3, whence: CUR)                           = 0
     7.357 ( 0.008 ms): python3.8/293111 fstat(fd: 3, statbuf: 0x7f65e52b8930)               = 0
     7.375 ( 0.013 ms): python3.8/293111 read(fd: 3, buf: 0x7f65e00d8a00, count: 4096)       = 635
     7.404 ( 0.006 ms): python3.8/293111 lseek(fd: 3, whence: CUR)                           = 635
     7.418 ( 0.007 ms): python3.8/293111 read(fd: 3, buf: 0x7f65e00d8a00, count: 4096)       = 0
     7.433 ( 0.007 ms): python3.8/293111 close(fd: 3)                                        = 0
     7.461 ( 0.032 ms): python3.8/293111 socket(family: INET, type: DGRAM|CLOEXEC|NONBLOCK)                    = 3
     7.511 ( 0.014 ms): python3.8/293111 setsockopt(fd: 3, optname: 11, optval: 0x7f65e52b7464, optlen: 4) = 0

CPU Scheduler Profiling

Record a scheduler profile for 10 seconds

Command: perf sched record -- sleep 10

Show per-process scheduler latency from a scheduler profile:

Command: perf sched latency

Output:

 -----------------------------------------------------------------------------------------------------------------
  Task                  |   Runtime ms  | Switches | Average delay ms | Maximum delay ms | Maximum delay at       |
 -----------------------------------------------------------------------------------------------------------------
  python3.8:(2)         |   1025.733 ms |      646 | avg:    0.982 ms | max:    8.049 ms | max at: 1459418.199920 s
  apache2:(20)          |     32.639 ms |      140 | avg:    0.341 ms | max:    4.169 ms | max at: 1459418.192300 s
  perf:(2)              |      4.186 ms |       12 | avg:    0.138 ms | max:    1.333 ms | max at: 1459410.706060 s
  redis-server:(2)      |     47.554 ms |      204 | avg:    0.084 ms | max:    0.381 ms | max at: 1459410.976269 s
  migration/0:12        |      0.000 ms |        3 | avg:    0.040 ms | max:    0.044 ms | max at: 1459415.065854 s
  ksoftirqd/0:10        |      0.222 ms |        9 | avg:    0.034 ms | max:    0.220 ms | max at: 1459412.581232 s
  khugepaged:24         |      0.025 ms |        1 | avg:    0.031 ms | max:    0.031 ms | max at: 1459416.913769 s
  gmain:(2)             |      0.700 ms |        5 | avg:    0.031 ms | max:    0.035 ms | max at: 1459419.868737 s
  multipathd:(3)        |      0.000 ms |       22 | avg:    0.025 ms | max:    0.071 ms | max at: 1459419.274724 s
  GUsbEventThread:356934 |      0.465 ms |        5 | avg:    0.023 ms | max:    0.027 ms | max at: 1459420.548544 s
  rcu_sched:11          |      6.503 ms |      376 | avg:    0.023 ms | max:    2.346 ms | max at: 1459418.176095 s
  sleep:357569          |      1.624 ms |        3 | avg:    0.021 ms | max:    0.029 ms | max at: 1459410.705767 s
  kworker/0:1-eve:348267 |      1.817 ms |       63 | avg:    0.020 ms | max:    0.065 ms | max at: 1459412.561843 s
  kworker/u2:1-ev:357563 |      1.967 ms |       85 | avg:    0.020 ms | max:    0.056 ms | max at: 1459418.801900 s
  jbd2/vda1-8:258       |      0.517 ms |        6 | avg:    0.018 ms | max:    0.034 ms | max at: 1459417.938340 s
  khungtaskd:19         |      0.078 ms |        1 | avg:    0.016 ms | max:    0.016 ms | max at: 1459412.817816 s
  kworker/0:1H-kb:155   |      0.273 ms |        7 | avg:    0.015 ms | max:    0.038 ms | max at: 1459415.889882 s
  kworker/u2:2-ev:357409 |      0.559 ms |       31 | avg:    0.011 ms | max:    0.045 ms | max at: 1459412.337855 s
  systemd-resolve:608   |      4.247 ms |       12 | avg:    0.011 ms | max:    0.017 ms | max at: 1459412.581018 s
 -----------------------------------------------------------------------------------------------------------------
  TOTAL:                |   1129.108 ms |     1631 |
 ---------------------------------------------------

List per-event scheduler latency from a scheduler profile

Command: perf sched timehist

Output

           time    cpu  task name                       wait time  sch delay   run time
                        [tid/pid]                          (msec)     (msec)     (msec)
--------------- ------  ------------------------------  ---------  ---------  ---------
 1459410.704727 [0000]  perf[357566]                        0.000      0.000      0.000
 1459410.705738 [0000]  sleep[357569]                       0.000      0.004      1.011
 1459410.705746 [0000]  rcu_sched[11]                       0.000      0.010      0.008
 1459410.705762 [0000]  ksoftirqd/0[10]                     0.000      0.013      0.015
 1459410.705767 [0000]  rcu_sched[11]                       0.015      0.011      0.005
 1459410.706060 [0000]  sleep[357569]                       0.028      0.000      0.293
 1459410.706128 [0000]  perf[357566]                        1.333      0.000      0.067
 1459410.709783 [0000]                                0.000      0.000      3.655

5. Other Recipes

How to check the CPU run-queue length?

Command: vmstat [delay] [count]

For example, if we execute command mvstat 1 3, it will sample the virtual statistics every 1 seconds for 3 times.

Output:

procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0      0 124272  76404 407076    0    0   717    24    3    1  3  1 96  0  0
 0  0      0 124304  76404 407092    0    0     0     0  106  246  7  0 93  0  0
 0  0      0 124304  76404 407092    0    0     0     0   65  144  0  0 100  0  0

The first colum r in the output is the run-queue length. On Linux, the r colum is the total number of tasks waiting plus those running.

----- END -----