1 Star2 Stars3 Stars4 Stars5 Stars (还没有评分)
Loading...

使用DTrace度量redis处理请求的延时

最近在做redis的相关工作,无意中发现Brendan Gregg大神写的度量redis处理请求延时的DTrace脚本:redisslat.d(https://github.com/brendangregg/dtrace-cloud-tools/blob/master/redis/redislat.d),觉得有点意思,就在这里和大家分享一下(省略了一些无关痛痒的注释):

#!/usr/sbin/dtrace -s
/*
 * redislat.d - summarize redis request latency
 *
 * This measures redis request latency as the time from a read() return to a
 * write() entry on the same file descriptor. The intent is to measure the
 * time between readQueryFromClient() to sendReplyToClient(), but by using
 * the syscall provider to minimize overhead. If this strategy turns out to
 * be invalid, switch this to the pid provider.
 *
 * ......
 */

syscall::read:entry
/execname == "redis-server"/
{
    self->fd = arg0;
    self->ok = 1;
}

syscall::read:return
/self->ok/
{
    self->start[self->fd] = timestamp;
    self->fd = 0;
    self->ok = 0;
}

syscall::write:entry
/(this->start = self->start[arg0]) > 0/
{
    @d["read->write dist (ns), for pid:", pid] =
        quantize(timestamp - this->start);
    @s[pid, tid] = sum(timestamp - this->start);
    self->start[arg0] = 0;
}

syscall::close:entry
/self->start[arg0]/
{
    self->start[arg0] = 0;
}

profile:::tick-1s
{
    normalize(@s, 1000000);
    printf("\nread->write sum by pid, tid (ms):\n");
    printa(@s);
    trunc(@s);
}

dtrace:::END
{
    printa(@d);
    trunc(@d);
    trunc(@s);
}

下面简单分析一下这个脚本:
(1)这个脚本计算的是针对同一个文件描述符,redis-server程序从退出read()到进入write()的时间,可以近似地认为这是redis-server响应一次request请求的时间:从read()退出,认为收到请求,然后处理(访问内存数据库),直到进入write()(发response);
(2)在进入write()系统调用后,脚本会用aggregation记录响应时间的分布(@d)和总和(@s);
(3)脚本每秒(profile:::tick-1s)会打印一次响应时间的总和并清空(调用normalize函数,显示的是毫秒),最后脚本退出时会打印响应时间的分布。

以下的输出是我在Solaris上执行脚本的情况:

root@Soaris# ./redislat.d
dtrace: script './redislat.d' matched 6 probes
CPU ID FUNCTION:NAME
  0 58107 :tick-1s
read->write sum by pid, tid (ms):
     1094 1 1
  0 58107 :tick-1s
read->write sum by pid, tid (ms):
     1094 1 0
  0 58107 :tick-1s
read->write sum by pid, tid (ms):
     1094 1 0
  0 58107 :tick-1s
read->write sum by pid, tid (ms):
     1094 1 1
  0 2 :END
read->write dist (ns), for pid: 1094

           value ------------- Distribution ------------- count
            8192 | 0
           16384 |@@@@@@@@@@@@@@@@@@@@@ 19
           32768 |@@@@@ 5
           65536 |@@ 2
          131072 | 0
          262144 |@@@@@@@@@@@@ 11
          524288 | 0

可以看到大部分的响应都介于0.01毫秒和0.03毫秒之间完成。

另外,这个脚本可以推而广之,只要做很少的修改,完全可以作为度量其它任何client-server程序performance的工具。

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.