Linux - Performance Tuning 1

Useful top and pstree commands to show process/threads

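#show the process tree of processes owned by username, with PIDs -p, command line arguments -a and uid transitions -u (-G draws the tree with VT100 line characters)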
$pstree -aG -p -u username

#same as above but use ASCII chars to draw the tree. -A
$pstree -aA -p -u username


#top in batch mode -b , show 3 iterations -n with a 2.5 sec delay -d for user -u
$top -b -n 3 -d 2.5 -u username

#same as above but display threads instead of just process -H
$top -H -b -n 3 -d 2.5 -u username

#monitor only the given process id -p
$top -p pid

#Linux: check whether the processor (hardware) is 64-bit capable by looking for the lm (long mode) flag; -w matches lm only as a whole word
$grep -w lm /proc/cpuinfo
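The same check can be wrapped in a small function for use in scripts. This is only a sketch: the name has_long_mode and the optional file argument are made up for illustration, and it assumes an x86-style cpuinfo with a "flags" line.

```shell
# has_long_mode [cpuinfo_file]
# Exit status 0 if a "flags" line lists lm as a separate word, 1 otherwise.
# The file argument is an illustrative addition; it defaults to /proc/cpuinfo.
has_long_mode() {
    awk '
        /^flags[[:space:]]*:/ {
            for (i = 1; i <= NF; i++) if ($i == "lm") found = 1
        }
        END { exit (found ? 0 : 1) }
    ' "${1:-/proc/cpuinfo}"
}

# Example usage on a Linux host:
#   if has_long_mode; then echo "64-bit capable"; else echo "32-bit only"; fi
```

Comparing whole fields avoids false hits on flags that merely contain the letters "lm", such as lahf_lm.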

(inside top, typing "A" switches to multi-window mode)

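#report disk statistics -d
vmstat -d

#report statistics for a single partition -p
vmstat -p partition

#report slab allocator information -m
vmstat -m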


sysctl is probably the best tool for tuning Linux systems. It can change many system parameters while the Linux kernel
is running, and it does so by reading and writing kernel variables through procfs files.

sysctl - configure kernel parameters at runtime. The parameters available are those listed under /proc/sys/.
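To make the link between a sysctl variable and its procfs file concrete, here is a rough sketch of the name mapping. sysctl_to_path is a hypothetical helper, not part of sysctl, and it ignores corner cases such as interface names that themselves contain dots.

```shell
# sysctl_to_path NAME
# Prints the /proc/sys path behind a sysctl variable by turning dots into slashes.
# (Hypothetical helper name, for illustration only.)
sysctl_to_path() {
    printf '/proc/sys/%s\n' "$(printf '%s' "$1" | tr . /)"
}

sysctl_to_path net.ipv4.ip_forward    # prints /proc/sys/net/ipv4/ip_forward

# On a Linux host the two commands below read the same kernel variable:
#   sysctl -n net.ipv4.ip_forward
#   cat "$(sysctl_to_path net.ipv4.ip_forward)"
```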

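#list all parameters and their current values -a
sysctl -a

#set a parameter at runtime with -w variable=value. For example:
sysctl -w net.ipv6.conf.all.forwarding=1

sysctl -w kernel.shmmax=63554432

#load settings from the configuration file (/etc/sysctl.conf by default) -p
sysctl -p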

root@ubuntu:~# sysctl -a
kernel.sched_min_granularity_ns = 8000000
kernel.sched_latency_ns = 40000000
kernel.sched_wakeup_granularity_ns = 10000000
kernel.sched_shares_ratelimit = 500000
kernel.sched_shares_thresh = 4
kernel.sched_child_runs_first = 1
kernel.sched_features = 113916
kernel.sched_migration_cost = 500000
kernel.sched_nr_migrate = 32
kernel.timer_migration = 1
kernel.sched_rt_period_us = 1000000
kernel.sched_rt_runtime_us = 950000
kernel.sched_compat_yield = 0
kernel.panic = 0
...

sysctl displays kernel settings in the following categories:

kernel
vm
fs
debug
dev
net
crypto




A very good guide to the "top" utility is available here.


[root@localhost ~]# top -b -n 1
top - 12:50:59 up 1:11, 5 users, load average: 0.22, 0.24, 0.17
Tasks: 129 total, 2 running, 126 sleeping, 0 stopped, 1 zombie
Cpu(s): 1.9% us, 2.8% sy, 0.1% ni, 89.0% id, 3.7% wa, 0.1% hi, 2.4% si
Mem: 514516k total, 321012k used, 193504k free, 36688k buffers
Swap: 1048568k total, 0k used, 1048568k free, 173132k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
7438 root 16 0 2552 968 740 S 3.9 0.2 0:04.36 top
8662 root 15 0 3784 856 648 R 3.9 0.2 0:00.03 top
1 root 16 0 2372 548 468 S 0.0 0.1 0:04.49 init
2 root RT 0 0 0 0 S 0.0 0.0 0:00.71 migration/0
3 root 34 19 0 0 0 S 0.0 0.0 0:00.00 ksoftirqd/0
4 root RT 0 0 0 0 S 0.0 0.0 0:00.62 migration/1
5 root 34 19 0 0 0 S 0.0 0.0 0:00.02 ksoftirqd/1
6 root 5 -10 0 0 0 S 0.0 0.0 0:00.78 events/0
7 root 5 -10 0 0 0 S 0.0 0.0 0:00.52 events/1
8 root 5 -10 0 0 0 S 0.0 0.0 0:00.11 khelper
9 root 15 -10 0 0 0 S 0.0 0.0 0:00.00 kacpid
90 root 5 -10 0 0 0 S 0.0 0.0 0:00.55 kblockd/0
91 root 5 -10 0 0 0 S 0.0 0.0 0:00.51 kblockd/1
92 root 15 0 0 0 0 S 0.0 0.0 0:00.00 khubd
101 root 20 0 0 0 0 S 0.0 0.0 0:00.00 pdflush
102 root 15 0 0 0 0 S 0.0 0.0 0:01.38 pdflush
104 root 7 -10 0 0 0 S 0.0 0.0 0:00.00 aio/0
105 root 5 -10 0 0 0 S 0.0 0.0 0:00.00 aio/1
103 root 25 0 0 0 0 S 0.0 0.0 0:00.00 kswapd0
179 root 25 0 0 0 0 S 0.0 0.0 0:00.00 kseriod
250 root 19 0 0 0 0 S 0.0 0.0 0:00.00 scsi_eh_0
267 root 6 -10 0 0 0 S 0.0 0.0 0:00.00 kmirrord/0


Field Description
PID : Process ID
USER : Effective User ID
PR : Dynamic priority
NI : Nice value, also known as base priority
VIRT : Virtual size of the task. This includes the size of the process's executable binary, the data area and all loaded shared libraries.
RES : The amount of RAM currently consumed by the task. Swapped-out portions of the task are not included.
SHR : Memory areas that may be shared between two or more tasks. Examples of shared areas are shared libraries and SysV shared memory.
S : Task status
%CPU : The percentage of CPU time the task has used since top's last screen update.
%MEM : The percentage of RAM currently consumed by the task.
TIME+ : The total CPU time the task has used since it started. The "+" sign means it is displayed with hundredth-of-a-second granularity. By default, TIME/TIME+ does not include the CPU time used by the task's dead children.
COMMAND : The program name.
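Because batch mode prints these fields as plain columns, the output can be filtered with awk. The following is a minimal sketch that assumes the header row starts with PID and that sizes are either plain kilobyte numbers or carry an m/g suffix, as in the samples on this page. The function name top_res_over is made up for illustration.

```shell
# top_res_over LIMIT_KB
# Reads "top -b" output on stdin and prints PID, RES and COMMAND for every
# task whose resident size exceeds LIMIT_KB. (Illustrative helper, not part of top.)
top_res_over() {
    awk -v limit="$1" '
        # Convert a size field (plain kB, or with an m/g suffix) to kB.
        function to_kb(v) {
            if (v ~ /[mM]$/) return (v + 0) * 1024
            if (v ~ /[gG]$/) return (v + 0) * 1024 * 1024
            return v + 0
        }
        $1 == "PID" {                      # header row: remember column positions
            for (i = 1; i <= NF; i++) col[$i] = i
            in_table = 1
            next
        }
        in_table && NF && to_kb($(col["RES"])) > limit {
            print $(col["PID"]), $(col["RES"]), $(col["COMMAND"])
        }
    '
}

# Example usage: list tasks using more than about 100 MB of RAM
#   top -b -n 1 | top_res_over 102400
```

Looking the columns up by header name, rather than by fixed position, keeps the filter working when the field order has been customized.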


$ top -p 4360,4358


In top, press "A" to see the multiple windows mode:

1:Def - 13:16:02 up 1:37, 4 users, load average: 0.06, 0.04, 0.03
Tasks: 122 total, 1 running, 120 sleeping, 0 stopped, 1 zombie
Cpu(s): 0.7% us, 1.4% sy, 0.0% ni, 96.5% id, 0.0% wa, 0.0% hi, 1.4% si
Mem: 514516k total, 320228k used, 194288k free, 37336k buffers
Swap: 1048568k total, 0k used, 1048568k free, 174564k cached

1 PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
9234 root 16 0 3020 964 740 R 2.6 0.2 0:00.04 top
3408 root 16 0 9768 7188 1628 S 1.3 1.4 0:46.53 hald
8601 shan 15 0 7344 2228 1812 S 1.3 0.4 0:00.51 sshd
1 root 16 0 2372 548 468 S 0.0 0.1 0:04.50 init
2 root RT 0 0 0 0 S 0.0 0.0 0:00.76 migration/0
3 root 34 19 0 0 0 S 0.0 0.0 0:00.00 ksoftirqd/0
4 root RT 0 0 0 0 S 0.0 0.0 0:00.66 migration/1
5 root 34 19 0 0 0 S 0.0 0.0 0:00.02 ksoftirqd/1
6 root 5 -10 0 0 0 S 0.0 0.0 0:00.80 events/0
7 root 5 -10 0 0 0 S 0.0 0.0 0:00.53 events/1
8 root 5 -10 0 0 0 S 0.0 0.0 0:00.11 khelper
9 root 15 -10 0 0 0 S 0.0 0.0 0:00.00 kacpid
2 PID PPID TIME+ %CPU %MEM PR NI S VIRT SWAP RES UID COMMAND
9234 8630 0:00.04 2.6 0.2 16 0 R 3020 2056 964 0 top
8630 8629 0:00.17 0.0 0.3 16 0 S 5820 4396 1424 0 bash
8629 8628 0:00.01 0.0 0.2 19 0 S 5036 3900 1136 0 su
8628 8602 0:00.02 0.0 0.0 17 0 S 2680 2428 252 0 sesh
8602 8601 0:00.05 0.0 0.3 16 0 S 5836 4460 1376 500 bash
8601 8599 0:00.51 1.3 0.4 15 0 S 7344 5116 2228 500 sshd
8599 3241 0:00.07 0.0 0.4 17 0 S 7164 5028 2136 0 sshd
8537 1 0:00.00 0.0 0.2 16 0 S 3328 2336 992 0 dhclient
7232 1 0:00.01 0.0 0.2 16 0 S 3004 2004 1000 0 dhclient
6932 6931 0:00.21 0.0 0.3 16 0 S 4828 3400 1428 0 bash
6931 6930 0:00.01 0.0 0.2 19 0 S 5228 4092 1136 0 su
6930 6893 0:00.02 0.0 0.0 16 0 S 2392 2136 256 0 sesh
3 PID %MEM VIRT SWAP RES CODE DATA SHR nFLT nDRT S PR NI %CPU COMMAND
3937 4.2 155m 133m 21m 1520 16m 5232 0 0 S 16 0 0.0 X
4838 3.5 40640 21m 17m 588 19m 9920 23 0 S 15 0 0.0 nautilus
6891 2.4 36388 23m 12m 252 15m 8228 0 0 S 15 0 0.0 gnome-terminal
4834 2.2 22848 11m 11m 436 4332 8220 8 0 S 15 0 0.0 gnome-panel
5294 2.0 21696 11m 9.8m 28 2872 7544 9 0 S 15 0 0.0 mixer_applet2
5292 1.9 21208 11m 9576 64 3036 7472 3 0 S 16 0 0.0 wnck-applet
4555 1.8 20696 10m 9472 120 2776 6980 11 0 S 15 0 0.0 gnome-session
5296 1.5 20496 12m 7932 92 2132 6680 6 0 S 16 0 0.0 clock-applet
4620 1.5 12280 4700 7580 44 7272 1668 1 0 S 16 0 0.0 gconfd-2
4868 1.4 40444 32m 7336 104 22m 6268 4 0 S 16 0 0.0 eggcups
3408 1.4 9768 2580 7188 196 6712 1628 2 0 S 16 0 1.3 hald
4696 1.4 14616 7436 7180 428 2304 6024 6 0 S 15 0 0.0 metacity
4627 1.4 18552 11m 7008 140 1272 5924 7 0 S 15 0 0.0 gnome-settings-
4 PID PPID UID USER RUSER TTY TIME+ %CPU %MEM S COMMAND
3338 1 43 xfs xfs ? 0:00.17 0.0 0.3 S xfs
3290 1 51 smmsp smmsp ? 0:00.00 0.0 0.5 S sendmail
4838 1 500 shan shan ? 0:18.78 0.0 3.5 S nautilus
6891 1 500 shan shan ? 0:06.95 0.0 2.4 S gnome-terminal
4834 1 500 shan shan ? 0:11.20 0.0 2.2 S gnome-panel
5294 1 500 shan shan ? 0:02.36 0.0 2.0 S mixer_applet2
5292 1 500 shan shan ? 0:10.48 0.0 1.9 S wnck-applet
4555 3893 500 shan shan ? 0:03.31 0.0 1.8 S gnome-session
5296 1 500 shan shan ? 0:00.87 0.0 1.5 S clock-applet
4620 1 500 shan shan ? 0:10.79 0.0 1.5 S gconfd-2
4868 1 500 shan shan ? 0:01.52 0.0 1.4 S eggcups
4696 1 500 shan shan ? 0:04.49 0.0 1.4 S metacity
4627 1 500 shan shan ? 0:00.48 0.0 1.4 S gnome-settings-


vmstat -d reports disk statistics. The fields below describe the default (virtual memory) report:

Procs
r: The number of processes waiting for run time.
b: The number of processes in uninterruptible sleep.

Memory
swpd: the amount of virtual memory used.
free: the amount of idle memory.
buff: the amount of memory used as buffers.
cache: the amount of memory used as cache.
inact: the amount of inactive memory. (-a option)
active: the amount of active memory. (-a option)

Swap
si: Amount of memory swapped in from disk (/s).
so: Amount of memory swapped to disk (/s).

IO
bi: Blocks received from a block device (blocks/s).
bo: Blocks sent to a block device (blocks/s).

System
in: The number of interrupts per second, including the clock.
cs: The number of context switches per second.

CPU
These are percentages of total CPU time.
us: Time spent running non-kernel code. (user time, including nice time)
sy: Time spent running kernel code. (system time)
id: Time spent idle. Prior to Linux 2.5.41, this includes IO-wait time.
wa: Time spent waiting for IO. Prior to Linux 2.5.41, included in idle.

st: Time stolen from a virtual machine. Prior to Linux 2.6.11, unknown.
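As a rough illustration of how these columns are read in practice, the sketch below scans vmstat output and flags two common warning signs: a run queue (r) longer than the number of CPUs, and any swap traffic (si/so). The thresholds are rules of thumb, not values given by vmstat, and the function name vmstat_check is made up.

```shell
# vmstat_check NCPU
# Reads vmstat output on stdin and prints a note for each sample where
# r exceeds NCPU or where si/so are non-zero. (Illustrative helper; the
# thresholds are heuristic assumptions.)
vmstat_check() {
    awk -v ncpu="$1" '
        $1 == "r" {                        # column header row
            for (i = 1; i <= NF; i++) col[$i] = i
            have_header = 1
            next
        }
        have_header && $1 ~ /^[0-9]+$/ {   # data row
            if ($(col["r"]) > ncpu)
                print "run queue exceeds CPUs: r=" $(col["r"])
            if ($(col["si"]) > 0 || $(col["so"]) > 0)
                print "swapping: si=" $(col["si"]) " so=" $(col["so"])
        }
    '
}

# Example usage: 5 samples, 1 second apart, on a 2-CPU machine
#   vmstat 1 5 | vmstat_check 2
```

A single flagged sample means little; it is sustained values over many samples that point to CPU saturation or memory pressure.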



Page Fault:


Technically, a page fault happens when a task accesses a page in its address space that is not currently present in RAM.
A page fault is called "major" if the kernel needs to access the disk to make the page available.
In contrast, a "minor" (soft) page fault means the kernel only needs to allocate pages in RAM without
reading anything from disk.


For illustration, suppose program ABC is 8 kB in size and the page size is 4 kB.
Fully loading the program into RAM causes 2 major page faults (2 * 4 kB).

If the program then allocates another 8 kB in RAM for temporary data, there will be 2 minor page faults.
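The arithmetic in this illustration is a rounded-up division of the size by the page size. A small sketch, assuming 8 kB means 8192 bytes (pages_needed is a made-up helper name):

```shell
# pages_needed BYTES PAGE_SIZE
# Prints how many pages are needed to hold BYTES, rounding up.
# (Hypothetical helper, for illustration only.)
pages_needed() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

pages_needed 8192 4096    # prints 2
pages_needed 8193 4096    # prints 3 (one extra byte needs a whole extra page)

# On a real system, the page size and a task's fault counters can be read with:
#   getconf PAGESIZE
#   ps -o pid,min_flt,maj_flt,comm -p PID
```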


A high nFLT value (the major page fault count shown by top) could mean:

1. The task is aggressively loading portions of its executable or libraries from disk.

2. The task is accessing a page that has been swapped out.

3. A high number of major page faults is normal when a program is run for the first time.
On later invocations the cached pages are reused, so you will likely see "0" or a low nFLT.
But if a program continuously triggers major page faults, chances are it needs more RAM than is currently installed.



Dirty Pages:

nDRT is the number of pages that have been modified since they were last written back to disk.

What is a dirty page? First, a little background. Linux employs a caching mechanism, so everything that is read
from disk is also cached in RAM. The advantage is that subsequent reads of the same disk block can be served from RAM, so they
complete faster.

But it also has a cost. If a buffer's content is modified, it needs to be synchronized. Thus, sooner or later this modified buffer (dirty page)
must be written back. A failure during synchronization might leave the data on the related disk inconsistent.

On a mostly idle to fairly loaded system, nDRT is usually below 10 (this is only a rough estimate) or mostly zero. If it is constantly higher than that:

1. The task is aggressively writing to file(s), so often that disk I/O can't keep up with it.

2. The disk suffers I/O congestion, so even if the task only modifies a small portion of the file(s), it must wait a bit longer for them to be synchronized.
Congestion happens when many processes access the disk at the same time but the cache hit rate is low.

These days, (1) is unlikely because I/O is getting faster and less CPU demanding (thanks to DMA). So (2) is more probable.

Note: On 2.6.x, this field is always zero for unknown reasons.
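When the per-task counter is not usable, the system-wide amount of dirty data can still be read from /proc/meminfo. Note that this is a different measurement from top's nDRT: it covers the whole system rather than one task. The sketch below uses a made-up function name, and its optional file argument exists only so it can be tried on a saved copy.

```shell
# dirty_summary [meminfo_file]
# Prints the Dirty and Writeback lines (memory waiting to be written back,
# and memory currently being written back). Defaults to /proc/meminfo.
# (Illustrative helper, not a standard tool.)
dirty_summary() {
    awk '$1 == "Dirty:" || $1 == "Writeback:" { print $1, $2, $3 }' \
        "${1:-/proc/meminfo}"
}

# Example usage on a Linux host:
#   dirty_summary
#   watch -n 1 'grep -E "^(Dirty|Writeback):" /proc/meminfo'
```

A Dirty value that keeps growing while Writeback stays busy suggests the disk cannot keep up with the write rate.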


The sysctl configuration file (/etc/sysctl.conf by default) consists of records that look identical to the output of sysctl -a.


Here's an example configuration file:

# Controls IP packet forwarding
net.ipv4.ip_forward = 0

# Controls source route verification
net.ipv4.conf.default.rp_filter = 1

# Controls the System Request debugging functionality of the kernel
kernel.sysrq = 0

# Controls whether core dumps will append the PID to the core filename.
# Useful for debugging multi-threaded applications.
kernel.core_uses_pid = 1
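To review what such a file would actually set, the comments and blank lines can be stripped out. This is a minimal sketch with a made-up function name. It treats lines starting with # or ; as comments and normalizes the spacing around the first = sign.

```shell
# sysctl_conf_entries FILE
# Prints the effective "variable=value" records of a sysctl-style config file,
# skipping comments and blank lines. (Illustrative helper, not part of sysctl.)
sysctl_conf_entries() {
    sed -e 's/^[[:space:]]*//' \
        -e '/^[#;]/d' \
        -e '/^$/d' \
        -e 's/[[:space:]]*=[[:space:]]*/=/' \
        -e 's/[[:space:]]*$//' "$1"
}

# Example usage:
#   sysctl_conf_entries /etc/sysctl.conf
#   sysctl -p /etc/sysctl.conf    # apply the file (needs root)
```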


Network Performance Tuning:

# Lower tcp_fin_timeout below its default value
net.ipv4.tcp_fin_timeout = 30

# Lower tcp_keepalive_time below its default value
net.ipv4.tcp_keepalive_time = 1800

# Turn off tcp_window_scaling
net.ipv4.tcp_window_scaling = 0

# Turn off the tcp_sack
net.ipv4.tcp_sack = 0

# Turn off tcp_timestamps
net.ipv4.tcp_timestamps = 0


NFS Performance Tuning:

# Increase transport socket buffers to improve performance of nfs (and
# networking in general)

# 'rmem' is 'read memory', 'wmem' is 'write memory'.

net.core.rmem_max = 262143

net.core.rmem_default = 262143

net.core.wmem_max = 262143

net.core.wmem_default = 262143

net.ipv4.tcp_rmem = 4096 87380 8388608

net.ipv4.tcp_wmem = 4096 87380 8388608



# These are for both security and performance

net.ipv4.icmp_echo_ignore_broadcasts = 1
net.ipv4.icmp_ignore_bogus_error_responses = 1