最近在做redis的相关工作,无意中发现Brendan Gregg大神写的度量redis处理请求延时的DTrace脚本:redisslat.d(https://github.com/brendangregg/dtrace-cloud-tools/blob/master/redis/redislat.d),觉得有点意思,就在这里和大家分享一下(省略了一些无关痛痒的注释):
#!/usr/sbin/dtrace -s /* * redislat.d - summarize redis request latency * * This measures redis request latency as the time from a read() return to a * write() entry on the same file descriptor. The intent is to measure the * time between readQueryFromClient() to sendReplyToClient(), but by using * the syscall provider to minimize overhead. If this strategy turns out to * be invalid, switch this to the pid provider. * * ...... */ syscall::read:entry /execname == "redis-server"/ { self->fd = arg0; self->ok = 1; } syscall::read:return /self->ok/ { self->start[self->fd] = timestamp; self->fd = 0; self->ok = 0; } syscall::write:entry /(this->start = self->start[arg0]) > 0/ { @d["read->write dist (ns), for pid:", pid] = quantize(timestamp - this->start); @s[pid, tid] = sum(timestamp - this->start); self->start[arg0] = 0; } syscall::close:entry /self->start[arg0]/ { self->start[arg0] = 0; } profile:::tick-1s { normalize(@s, 1000000); printf("\nread->write sum by pid, tid (ms):\n"); printa(@s); trunc(@s); } dtrace:::END { printa(@d); trunc(@d); trunc(@s); }
下面简单分析一下这个脚本:
(1)这个脚本计算的是针对同一个文件描述符,redis-server程序从退出read()到进入write()的时间,可以近似地认为这是redis-server响应一次request请求的时间:从read()退出,认为收到请求,然后处理(访问内存数据库),直到进入write()(发response);
(2)在进入write()系统调用后,脚本会用aggregation记录响应时间的分布(@d)和总和(@s);
(3)脚本每秒(profile:::tick-1s)会打印一次响应时间的总和并清空(调用normalize函数,显示的是毫秒),最后脚本退出时会打印响应时间的分布。
以下的输出是我在Solaris上执行脚本的情况:
root@Soaris# ./redislat.d dtrace: script './redislat.d' matched 6 probes CPU ID FUNCTION:NAME 0 58107 :tick-1s read->write sum by pid, tid (ms): 1094 1 1 0 58107 :tick-1s read->write sum by pid, tid (ms): 1094 1 0 0 58107 :tick-1s read->write sum by pid, tid (ms): 1094 1 0 0 58107 :tick-1s read->write sum by pid, tid (ms): 1094 1 1 0 2 :END read->write dist (ns), for pid: 1094 value ------------- Distribution ------------- count 8192 | 0 16384 |@@@@@@@@@@@@@@@@@@@@@ 19 32768 |@@@@@ 5 65536 |@@ 2 131072 | 0 262144 |@@@@@@@@@@@@ 11 524288 | 0
可以看到大部分的响应都介于0.01毫秒和0.03毫秒之间完成。
另外,这个脚本可以推而广之,只要做很少的修改,完全可以作为度量其它任何client-server程序performance的工具。