Cephfs测试中IO hang问题分析

问题

为了测试cephfs的可用性，我们对cephfs进行了几组stripe的测试，在跑自动化测试时，发现fio总是会跑一部分后hang住，而此时ceph -s输出显示read值太大，但实际环境中查看并没这么大的流量。

ceph输出

# ceph -w
    cluster 16f59233-16da-4864-8c5a-1fec71d119ad
     health HEALTH_OK
     monmap e1: 3 mons at {cephcluster-server1=10.10.1.6:6789/0,cephcluster-server2=10.10.1.7:6789/0,cephcluster-server3=10.10.1.8:6789/0}
            election epoch 12, quorum 0,1,2 cephcluster-server1,cephcluster-server2,cephcluster-server3
      fsmap e17: 1/1/1 up {0=mds-daemon-38=up:active}, 1 up:standby
     osdmap e180: 30 osds: 30 up, 30 in
            flags sortbitwise,require_jewel_osds
      pgmap v55504: 1088 pgs, 3 pools, 81924 MB data, 10910 objects
            241 GB used, 108 TB / 109 TB avail
                1088 active+clean
  client io 68228 MB/s rd, 1066 op/s rd, 0 op/s wr

分析：cephfs client io显示的信息是正确的。从后面的分析得知cephfs client一直loop着尝试read object，单次read就有64MB，object数据缓存在osd的内存里，所以ceph统计的read速率很大。

查看网络

用iftop看网络的流量只有10MB/s

# iftop -B -i eth4
                           25.5MB                    50.9MB                    76.4MB                    102MB                    127MB
└──────────────────────────┴─────────────────────────┴─────────────────────────┴─────────────────────────┴──────────────────────────
cephcluster-server2.****.com                     => 10.10.1.18                                              4.46MB  4.51MB  4.52MB
                                                 <=                                                         220KB   221KB   222KB

对比几个stripe

发现对64MB的无条带化的object读时，会触发这个bug：

1
2
3

# getfattr -n ceph.file.layout dir4/tstfile
# file: dir4/tstfile
ceph.file.layout="stripe_unit=67108864 stripe_count=1 object_size=67108864 pool=cephfs_data”

bs=64M的fio hang住

# fio -filename=/home/yangguanjun/cephfs/dir4/tstfile -size=20G -thread -group_reporting -direct=1 -ioengine=libaio -bs=64M -rw=read -iodepth=64 -iodepth_batch_submit=8 -iodepth_batch_complete=8 -name=write_64k_64q
write_64k_64q: (g=0): rw=read, bs=64M-64M/64M-64M/64M-64M, ioengine=libaio, iodepth=64
fio-2.2.8
Starting 1 thread
^Cbs: 1 (f=1): [R(1)] [inf% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta 1158050441d:06h:59m:55s]
fio: terminating on signal 2
...

bs=32M的fio hang住

# fio -filename=/home/yangguanjun/cephfs/dir4/tstfile -size=20G -thread -group_reporting -direct=1 -ioengine=libaio -bs=32M -rw=read -iodepth=64 -iodepth_batch_submit=8 -iodepth_batch_complete=8 -name=write_64k_64q
write_64k_64q: (g=0): rw=read, bs=32M-32M/32M-32M/32M-32M, ioengine=libaio, iodepth=64
fio-2.2.8
Starting 1 thread
^Cbs: 1 (f=1): [R(1)] [inf% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta 1158050441d:07h:00m:09s]

bs=16M的dd hang住

1
2
3

# dd if=tstfile of=/dev/null bs=16M count=200

root     12763  0.0  0.0 124344   664 pts/0    D+   14:29   0:00 dd if=tstfile of=/dev/null bs=16M count=200

# ceph -w
    cluster 16f59233-16da-4864-8c5a-1fec71d119ad
     health HEALTH_OK
     monmap e1: 3 mons at {cephcluster-server1=10.10.1.6:6789/0,cephcluster-server2=10.10.1.7:6789/0,cephcluster-server3=10.10.1.8:6789/0}
            election epoch 12, quorum 0,1,2 cephcluster-server1,cephcluster-server2,cephcluster-server3
      fsmap e23: 1/1/1 up {0=mds-daemon-37=up:active}
     osdmap e183: 30 osds: 30 up, 30 in
            flags sortbitwise,require_jewel_osds
      pgmap v63806: 1088 pgs, 3 pools, 93200 MB data, 13729 objects
            274 GB used, 108 TB / 109 TB avail
                1088 active+clean
  client io 1330 MB/s rd, 41 op/s rd, 0 op/s wr

2017-07-05 14:30:11.483974 mon.0 [INF] pgmap v63806: 1088 pgs: 1088 active+clean; 93200 MB data, 274 GB used, 108 TB / 109 TB avail; 1330 MB/s rd, 41 op/s
2017-07-05 14:30:16.484856 mon.0 [INF] pgmap v63807: 1088 pgs: 1088 active+clean; 93200 MB data, 274 GB used, 108 TB / 109 TB avail; 1212 MB/s rd, 37 op/s
2017-07-05 14:30:21.485295 mon.0 [INF] pgmap v63808: 1088 pgs: 1088 active+clean; 93200 MB data, 274 GB used, 108 TB / 109 TB avail; 1206 MB/s rd, 37 op/s
2017-07-05 14:30:26.485953 mon.0 [INF] pgmap v63809: 1088 pgs: 1088 active+clean; 93200 MB data, 274 GB used, 108 TB / 109 TB avail; 1212 MB/s rd, 37 op/s

ceph cluster节点信息

在ceph cluster的一个节点上看到测试节点有读数据请求

# iftop -n -i eth4
10.10.1.7               => 10.10.1.18                       12.3Mb  11.6Mb  11.6Mb
                        <=                                  285Kb   284Kb   284Kb

打开source port后
10.10.1.7:6810          => 10.10.1.18                       12.5Mb  11.5Mb  11.5Mb
                        <=                                  278Kb   255Kb   255Kb

# netstat -nap | grep 6810
tcp        0      0 10.10.1.7:6810      0.0.0.0:*           LISTEN      79360/ceph-osd
tcp        0      0 10.10.1.7:6810      0.0.0.0:*           LISTEN      79360/ceph-osd
tcp        0      0 10.10.1.7:6810      10.10.1.7:57266     ESTABLISHED 79360/ceph-osd

# ps aux | grep -w 79360
ceph       79360 85.9  0.5 4338588 709312 ?      Ssl  Jul04 1556:54 /usr/bin/ceph-osd -f --cluster ceph --id 15 --setuser ceph --setgroup ceph

# df
/dev/sdg1      3905070088  9079428 3895990660   1% /var/lib/ceph/osd/ceph-15

查看sdg没有io

# iostat -kx 2 sdg
Linux 3.10.0-514.10.2.el7.x86_64 (cephcluster-server2.****.com)     07/05/2017     _x86_64_    (32 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          12.47    0.00    1.64    0.03    0.00   85.86

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdg               0.01     0.07    2.48    2.01    95.78   140.98   105.44     0.08   17.79    9.69   27.79   3.44   1.54

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.80    0.00    2.96    0.02    0.00   96.22

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdg               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.86    0.00    2.99    0.00    0.00   96.15

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdg               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00

查看文件location

测试文件的location信息，正好第一块数据就在osd 15上

# cephfs tstfile show_location
WARNING: This tool is deprecated.  Use the layout.* xattrs to query and modify layouts.
location.file_offset:  0
location.object_offset:0
location.object_no:    0
location.object_size:  67108864
location.object_name:  10000000805.00000000
location.block_offset: 0
location.block_size:   67108864
location.osd:          15

# ceph osd map cephfs_data 10000000805.00000000
osdmap e188 pool 'cephfs_data' (1) object '10000000805.00000000' -> pg 1.12d00fd5 (1.1d5) -> up ([15,0,26], p15) acting ([15,0,26], p15)

停止ceph osd 15

1	# systemctl stop ceph-osd@15.service

过一会查看osd 15所在节点没有与测试节点的数据流量了
这时在ceph osd 0节点，出现了与测试节点的数据流量，查看发现正好是osd 0与测试节点的数据通信
查看此时测试文件第一个object的map信息如下：

1 2	# ceph osd map cephfs_data 10000000805.00000000 osdmap e185 pool 'cephfs_data' (1) object '10000000805.00000000' -> pg 1.12d00fd5 (1.1d5) -> up ([0,26], p0) acting ([0,26], p0)

问题明确

现在这个问题很好重现了

创建一个 stripe_unit=67108864 stripe_count=1 object_size=67108864 的文件
dd if=/dev/zero of=foxfile bs=64M count=1 写成功
dd if=foxfile of=/dev/null bs=64M count=1 读就会hang住
查看测试节点与对应的object所在的osd有数据流量

通过strace命令查看dd读数据时确实hang在read函数：

# strace dd if=foxfile of=/dev/null bs=64M count=1
...
open("foxfile", O_RDONLY) = 3
dup2(3, 0) = 0
close(3) = 0
lseek(0, 0, SEEK_CUR) = 0
open("/dev/null", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3
dup2(3, 1) = 1
close(3) = 0
mmap(NULL, 67121152, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f01712d8000
read(0,

收集client log

参考文章 cephfs kernel client debug

收集osd log

1	# ceph daemon /var/run/ceph/ceph-osd.<osdid>.asok config set debug_osd "20/20"

分析log

在cephfs client的log中能循环发现如下：

Jul  6 11:30:29 cephfs-client kernel: [ 2010.231008] ceph_sock_data_ready on ffff88202c987030 state = 5, queueing work
Jul  6 11:30:29 cephfs-client kernel: [ 2010.231012] get_osd ffff88202c987000 3 -> 4
Jul  6 11:30:29 cephfs-client kernel: [ 2010.231015] queue_con_delay ffff88202c987030 0
Jul  6 11:30:29 cephfs-client kernel: [ 2010.231034] try_read start on ffff88202c987030 state 5
Jul  6 11:30:29 cephfs-client kernel: [ 2010.231035] try_read tag 1 in_base_pos 0
Jul  6 11:30:29 cephfs-client kernel: [ 2010.231036] try_read got tag 8
Jul  6 11:30:29 cephfs-client kernel: [ 2010.231036] prepare_read_ack ffff88202c987030
Jul  6 11:30:29 cephfs-client kernel: [ 2010.231041] ceph_sock_data_ready on ffff88202c987030 state = 5, queueing work
Jul  6 11:30:29 cephfs-client kernel: [ 2010.231041] get_osd ffff88202c987000 4 -> 5
Jul  6 11:30:29 cephfs-client kernel: [ 2010.231053] queue_con_delay ffff88202c987030 0
Jul  6 11:30:29 cephfs-client kernel: [ 2010.231062] got ack for seq 1 type 42 at ffff881008668100
Jul  6 11:30:29 cephfs-client kernel: [ 2010.231066] ceph_msg_put ffff881008668100 (was 2)
Jul  6 11:30:29 cephfs-client kernel: [ 2010.231081] prepare_read_tag ffff88202c987030
Jul  6 11:30:29 cephfs-client kernel: [ 2010.231083] try_read start on ffff88202c987030 state 5
Jul  6 11:30:29 cephfs-client kernel: [ 2010.231086] try_read tag 1 in_base_pos 0
Jul  6 11:30:29 cephfs-client kernel: [ 2010.231088] try_read got tag 7
Jul  6 11:30:29 cephfs-client kernel: [ 2010.231088] prepare_read_message ffff88202c987030
Jul  6 11:30:29 cephfs-client kernel: [ 2010.231089] read_partial_message con ffff88202c987030 msg           (null)
Jul  6 11:30:29 cephfs-client kernel: [ 2010.231090] try_read done on ffff88202c987030 ret -5

证明cephfs client一直在尝试读取osd数据，但收到数据后却认为msg为null

在ceph osd的log中能循环发现如下：

2017-07-06 11:30:29.185056 7f1788eb2700  1 osd.20 188 ms_handle_reset con 0x7f1748c39d80 session 0x7f170f4f31c0
2017-07-06 11:30:29.185683 7f17719fc700 10 osd.20 188  new session 0x7f174d435380 con=0x7f1748c39f00 addr=10.10.1.18:0/144324133
2017-07-06 11:30:29.185839 7f17719fc700 20 osd.20 188 should_share_map client.24357 10.10.1.18:0/144324133 188
2017-07-06 11:30:29.185854 7f17719fc700 15 osd.20 188 enqueue_op 0x7f174d42f100 prio 127 cost 0 latency 0.000047 osd_op(client.24357.1:3 1.1bdf9243 1000000095b.00000000 [read 0~67108864 [1@-1]] snapc 0=[] RETRY=6 ondisk+retry+read+known_if_redirected e188) v4
2017-07-06 11:30:29.185891 7f1777bff700 10 osd.20 188 dequeue_op 0x7f174d42f100 prio 127 cost 0 latency 0.000085 osd_op(client.24357.1:3 1.1bdf9243 1000000095b.00000000 [read 0~67108864 [1@-1]] snapc 0=[] RETRY=6 ondisk+retry+read+known_if_redirected e188) v4 pg pg[1.43( v 188'6597 (180'3500,188'6597] local-les=157 n=10 ec=136 les/c/f 157/157/0 151/156/136) [20,10,7] r=0 lpr=156 crt=188'6595 lcod 188'6596 mlcod 188'6596 active+clean]
2017-07-06 11:30:29.185937 7f1777bff700 10 osd.20 pg_epoch: 188 pg[1.43( v 188'6597 (180'3500,188'6597] local-les=157 n=10 ec=136 les/c/f 157/157/0 151/156/136) [20,10,7] r=0 lpr=156 crt=188'6595 lcod 188'6596 mlcod 188'6596 active+clean] handle_message: 0x7f174d42f100
2017-07-06 11:30:29.185951 7f1777bff700 20 osd.20 pg_epoch: 188 pg[1.43( v 188'6597 (180'3500,188'6597] local-les=157 n=10 ec=136 les/c/f 157/157/0 151/156/136) [20,10,7] r=0 lpr=156 crt=188'6595 lcod 188'6596 mlcod 188'6596 active+clean] op_has_sufficient_caps pool=1 (cephfs_data ) owner=0 need_read_cap=1 need_write_cap=0 need_class_read_cap=0 need_class_write_cap=0 -> yes
2017-07-06 11:30:29.185970 7f1777bff700 10 osd.20 pg_epoch: 188 pg[1.43( v 188'6597 (180'3500,188'6597] local-les=157 n=10 ec=136 les/c/f 157/157/0 151/156/136) [20,10,7] r=0 lpr=156 crt=188'6595 lcod 188'6596 mlcod 188'6596 active+clean] do_op osd_op(client.24357.1:3 1.1bdf9243 1000000095b.00000000 [read 0~67108864 [1@-1]] snapc 0=[] RETRY=6 ondisk+retry+read+known_if_redirected e188) v4 may_read -> read-ordered flags ondisk+retry+read+known_if_redirected
2017-07-06 11:30:29.185990 7f1777bff700 10 osd.20 pg_epoch: 188 pg[1.43( v 188'6597 (180'3500,188'6597] local-les=157 n=10 ec=136 les/c/f 157/157/0 151/156/136) [20,10,7] r=0 lpr=156 crt=188'6595 lcod 188'6596 mlcod 188'6596 active+clean] get_object_context: found obc in cache: 0x7f171e423180
2017-07-06 11:30:29.185997 7f1777bff700 10 osd.20 pg_epoch: 188 pg[1.43( v 188'6597 (180'3500,188'6597] local-les=157 n=10 ec=136 les/c/f 157/157/0 151/156/136) [20,10,7] r=0 lpr=156 crt=188'6595 lcod 188'6596 mlcod 188'6596 active+clean] get_object_context: 0x7f171e423180 1:c249fbd8:::1000000095b.00000000:head rwstate(none n=0 w=0) oi: 1:c249fbd8:::1000000095b.00000000:head(188'4499 mds.0.20:66259 dirty|data_digest|omap_digest s 67108864 uv 4499 dd cdba94a2 od ffffffff) ssc: 0x7f172dcfce40 snapset: 1=[]:[]+head
2017-07-06 11:30:29.186010 7f1777bff700 10 osd.20 pg_epoch: 188 pg[1.43( v 188'6597 (180'3500,188'6597] local-les=157 n=10 ec=136 les/c/f 157/157/0 151/156/136) [20,10,7] r=0 lpr=156 crt=188'6595 lcod 188'6596 mlcod 188'6596 active+clean] find_object_context 1:c249fbd8:::1000000095b.00000000:head @head oi=1:c249fbd8:::1000000095b.00000000:head(188'4499 mds.0.20:66259 dirty|data_digest|omap_digest s 67108864 uv 4499 dd cdba94a2 od ffffffff)
2017-07-06 11:30:29.186026 7f1777bff700 10 osd.20 pg_epoch: 188 pg[1.43( v 188'6597 (180'3500,188'6597] local-les=157 n=10 ec=136 les/c/f 157/157/0 151/156/136) [20,10,7] r=0 lpr=156 crt=188'6595 lcod 188'6596 mlcod 188'6596 active+clean] execute_ctx 0x7f1776beed00
2017-07-06 11:30:29.186035 7f1777bff700 10 osd.20 pg_epoch: 188 pg[1.43( v 188'6597 (180'3500,188'6597] local-les=157 n=10 ec=136 les/c/f 157/157/0 151/156/136) [20,10,7] r=0 lpr=156 crt=188'6595 lcod 188'6596 mlcod 188'6596 active+clean] do_op 1:c249fbd8:::1000000095b.00000000:head [read 0~67108864 [1@-1]] ov 188'4499
2017-07-06 11:30:29.186041 7f1777bff700 10 osd.20 pg_epoch: 188 pg[1.43( v 188'6597 (180'3500,188'6597] local-les=157 n=10 ec=136 les/c/f 157/157/0 151/156/136) [20,10,7] r=0 lpr=156 crt=188'6595 lcod 188'6596 mlcod 188'6596 active+clean]  taking ondisk_read_lock
2017-07-06 11:30:29.186047 7f1777bff700 10 osd.20 pg_epoch: 188 pg[1.43( v 188'6597 (180'3500,188'6597] local-les=157 n=10 ec=136 les/c/f 157/157/0 151/156/136) [20,10,7] r=0 lpr=156 crt=188'6595 lcod 188'6596 mlcod 188'6596 active+clean] do_osd_op 1:c249fbd8:::1000000095b.00000000:head [read 0~67108864 [1@-1]]
2017-07-06 11:30:29.186056 7f1777bff700 10 osd.20 pg_epoch: 188 pg[1.43( v 188'6597 (180'3500,188'6597] local-les=157 n=10 ec=136 les/c/f 157/157/0 151/156/136) [20,10,7] r=0 lpr=156 crt=188'6595 lcod 188'6596 mlcod 188'6596 active+clean] do_osd_op  read 0~67108864 [1@-1]

分析看出cephfs client发送read 64MB的request，ceph osd读64M数据return，但是cephfs client认为msg为null，然后重试
一直重复上面的请求，所以cephfs client的IO就hang住了

提交bug

提交该bug到ceph社区：
http://tracker.ceph.com/issues/20528

cephfs client端的配置限制了read message的最大size为16M。

in net/ceph/messenger.c

static int read_partial_message(struct ceph_connection *con)
{
...
        front_len = le32_to_cpu(con->in_hdr.front_len);
        if (front_len > CEPH_MSG_MAX_FRONT_LEN)
                return -EIO;
        middle_len = le32_to_cpu(con->in_hdr.middle_len);
        if (middle_len > CEPH_MSG_MAX_MIDDLE_LEN)
                return -EIO;
        data_len = le32_to_cpu(con->in_hdr.data_len);
        if (data_len > CEPH_MSG_MAX_DATA_LEN)
                return -EIO;

所以我们使用cephfs中，不能配置文件的stripe_unit大于16M。