Problem

The Ceph cluster status is HEALTH_ERR. ceph -s shows PGs in an inconsistent state, and ceph health detail reports the details:

# ceph health detail
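Before attempting repairs, it helps to see exactly which PGs and objects the scrub flagged. A minimal sketch, assuming a Jewel-or-later cluster where rados list-inconsistent-obj is available:

```
# List the PGs that scrub has marked inconsistent.
ceph health detail | grep -i inconsistent

# For a given PG, show which objects and which shards reported errors.
rados list-inconsistent-obj 14.16a --format=json-pretty
```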
Analysis & Solution

Manually trigger a PG repair:

ceph pg repair 14.16a
ceph pg deep-scrub 14.16a

Result: the cluster status is still HEALTH_ERR.

Restart the corresponding OSD daemon:

systemctl restart ceph-osd@<osdid>.service

Result: the cluster status is still HEALTH_ERR.
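To know which ceph-osd daemons are worth restarting, and whose logs to read next, map the PG to its acting set of OSDs. A sketch using standard commands; the OSD ids seen later (1, 16, 27) are presumably this PG's replicas:

```
# Print the "up" and "acting" OSD sets for the inconsistent PG.
ceph pg map 14.16a

# Detailed per-OSD state of the same PG.
ceph pg 14.16a query | less
```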
Check the ceph-osd log:

# vim ceph-osd.1.log-20170531.gz
...
2017-05-31 02:30:14.166348 7f7aeefff700 -1 log_channel(cluster) log [ERR] : 14.16a shard 27: soid 14:56a14fcc:::10000002380.0000061d:head candidate had a read error
2017-05-31 02:30:14.166358 7f7aeefff700 -1 log_channel(cluster) log [ERR] : 14.16a shard 27: soid 14:56a7ab13:::10000002380.00000446:head candidate had a read error
2017-05-31 02:30:14.166361 7f7aeefff700 -1 log_channel(cluster) log [ERR] : 14.16a shard 27: soid 14:56b34a7a:::10000002356.00000218:head candidate had a read error
2017-05-31 02:30:14.166363 7f7aeefff700 -1 log_channel(cluster) log [ERR] : 14.16a shard 27: soid 14:56c557e8:::10000002380.0000007a:head candidate had a read error
2017-05-31 02:30:14.166366 7f7aeefff700 -1 log_channel(cluster) log [ERR] : 14.16a shard 27: soid 14:56c8ceeb:::10000002380.00000854:head candidate had a read error
2017-05-31 02:30:14.166372 7f7aeefff700 -1 log_channel(cluster) log [ERR] : 14.16a shard 27: soid 14:56f50799:::1000000237c.0000019a:head candidate had a read error
2017-05-31 02:30:14.168485 7f7acebff700 -1 log_channel(cluster) log [ERR] : 14.16a deep-scrub 0 missing, 6 inconsistent objects
2017-05-31 02:30:14.168499 7f7acebff700 -1 log_channel(cluster) log [ERR] : 14.16a deep-scrub 6 errors

The log shows that several objects in pg 14.16a report "candidate had a read error", all on shard 27.
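To pull the affected object names out of the rotated OSD log without opening it in an editor, a one-liner along these lines works; it is a sketch keyed to the log format shown above:

```
# Extract and de-duplicate the object names behind the read errors.
zgrep 'candidate had a read error' ceph-osd.1.log-20170531.gz \
  | sed -n 's/.*soid 14:[^:]*:::\([^:]*\):head.*/\1/p' \
  | sort -u
```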
Check the md5 checksum of one of the failing objects on each replica:

# md5sum /var/lib/ceph/osd/ceph-1/current/14.16a_head/10000002380.0000061d__head_33F2856A__e
dc191d3144c49077952ed059425d68b1 /var/lib/ceph/osd/ceph-1/current/14.16a_head/10000002380.0000061d__head_33F2856A__e
# md5sum /var/lib/ceph/osd/ceph-16/current/14.16a_head/10000002380.0000061d__head_33F2856A__e
dc191d3144c49077952ed059425d68b1 /var/lib/ceph/osd/ceph-16/current/14.16a_head/10000002380.0000061d__head_33F2856A__e
# md5sum /var/lib/ceph/osd/ceph-27/current/14.16a_head/10000002380.0000061d__head_33F2856A__e
md5sum: /var/lib/ceph/osd/ceph-27/current/14.16a_head/10000002380.0000061d__head_33F2856A__e: Input/output error

Computing the md5 of this object on ceph-27 fails with an Input/output error, which suggests a problem with the underlying disk.

The md5 checksums of other objects in pg 14.16a on ceph-27 come out fine, so the disk as a whole still works; either this particular file is corrupted or the disk has a few bad sectors. Delete the file whose md5sum returns the I/O error, run ceph pg repair 14.16a again, and the PG returns to a healthy state.
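Deleting replica files out from under a running FileStore OSD is risky; the usual manual-repair sequence (see the first link under References) stops the OSD first. A sketch, assuming osd.27 and the object shown above:

```
systemctl stop ceph-osd@27.service
ceph-osd -i 27 --flush-journal       # flush the FileStore journal before touching files on disk
# Remove (or move aside, if it is still readable) the replica that returns the I/O error.
rm -f /var/lib/ceph/osd/ceph-27/current/14.16a_head/10000002380.0000061d__head_33F2856A__e
systemctl start ceph-osd@27.service
ceph pg repair 14.16a                # the primary copies the object back to osd.27
```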
Disk Check and Recovery

Is the disk itself actually faulty? The following steps check the disk and, where possible, recover it:

# systemctl stop ceph-osd@27.service
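With the OSD stopped, the data partition can be unmounted and scanned read-only for bad sectors; the scan writes its findings to the badblocks.log referenced below. A sketch: the device name /dev/sdi1 is taken from the badblocks invocation further down, and the mount point is an assumption.

```
umount /var/lib/ceph/osd/ceph-27            # assumed mount point of osd.27's data partition
badblocks -sv -o badblocks.log /dev/sdi1    # read-only scan; bad block numbers are written to badblocks.log
```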
A full read scan takes quite a long time: a SATA disk typically reads at 100-120 MB/s, so a 4 TB SATA disk needs roughly 4*1024*1024/120/3600 ≈ 9.7 hours.

While it runs, you can watch the output file badblocks.log from another window; if entries show up there, the disk really does have bad blocks and you need to try to repair it:
# head -n 10 badblocks.log
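The drive's SMART counters are another quick signal of failing sectors, independent of the surface scan. A sketch, assuming smartmontools is installed and /dev/sdi is the disk behind the scanned partition:

```
smartctl -H /dev/sdi        # overall health self-assessment
smartctl -A /dev/sdi | grep -Ei 'Reallocated_Sector|Current_Pending_Sector|Offline_Uncorrectable'
```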
Repair the bad blocks with badblocks in write mode; note that any data stored in those blocks will be lost. badblocks takes the block range as "last_block first_block", so the command below rewrites blocks 10624 through 10633:

# badblocks -ws /dev/sdi1 10633 10624
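After the destructive write test the drive should have remapped the affected sectors; a read-only re-check of the same range (again given as last block, then first block) confirms it. A sketch:

```
badblocks -sv /dev/sdi1 10633 10624    # re-scan blocks 10624-10633 read-only; it should now find 0 bad blocks
```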
After the bad blocks have been rewritten as above, there are two possible situations:

The bad blocks held XFS metadata
Running xfs_repair on the filesystem will most likely fail. In that case there is nothing left to do but reformat the disk and let Ceph's recovery repopulate it.

The bad blocks held no XFS metadata
xfs_repair fixes the filesystem and finishes with "done":
# xfs_repair -f /dev/sdi1
Phase 1 - find and verify superblock...
Cannot get host filesystem geometry.
Repair may fail if there is a sector size mismatch between
the image and the host filesystem.
- reporting progress in intervals of 15 minutes
Phase 2 - using internal log
- zero log...
- scan filesystem freespace and inode maps...
- 15:23:37: scanning filesystem freespace - 32 of 32 allocation groups done
- found root inode chunk
Phase 3 - for each AG...
- scan and clear agi unlinked lists...
- 15:23:37: scanning agi unlinked lists - 32 of 32 allocation groups done
- process known inodes and perform inode discovery...
- agno = 15
- agno = 0
- agno = 30
- agno = 31
- agno = 16
- agno = 1
- agno = 2
- agno = 17
- agno = 18
- agno = 19
- agno = 3
- agno = 4
- agno = 5
- agno = 6
- agno = 7
- agno = 20
- agno = 21
- agno = 22
- agno = 23
- agno = 24
- agno = 25
- agno = 26
- agno = 8
- agno = 9
- agno = 10
- agno = 11
- agno = 12
- agno = 27
- agno = 28
- agno = 29
- agno = 13
- agno = 14
- 15:23:42: process known inodes and inode discovery - 3584 of 3584 inodes done
- process newly discovered inodes...
- 15:23:42: process newly discovered inodes - 32 of 32 allocation groups done
Phase 4 - check for duplicate blocks...
- setting up duplicate extent list...
- 15:23:42: setting up duplicate extent list - 32 of 32 allocation groups done
- check for inodes claiming duplicate blocks...
- agno = 0
- agno = 1
- agno = 2
- agno = 3
- agno = 4
- agno = 5
- agno = 7
- agno = 6
- agno = 8
- agno = 11
- agno = 15
- agno = 19
- agno = 22
- agno = 24
- agno = 14
- agno = 28
- agno = 16
- agno = 18
- agno = 31
- agno = 20
- agno = 9
- agno = 10
- agno = 21
- agno = 23
- agno = 12
- agno = 13
- agno = 25
- agno = 26
- agno = 27
- agno = 17
- agno = 30
- agno = 29
- 15:23:42: check for inodes claiming duplicate blocks - 3584 of 3584 inodes done
Phase 5 - rebuild AG headers and trees...
- 15:23:42: rebuild AG headers and trees - 32 of 32 allocation groups done
- reset superblock...
Phase 6 - check inode connectivity...
- resetting contents of realtime bitmap and summary inodes
- traversing filesystem ...
- traversal finished ...
- moving disconnected inodes to lost+found ...
Phase 7 - verify and correct link counts...
done

After this, mount the partition and start the ceph-osd daemon again. Ceph reports HEALTH_OK, but some of the data on this OSD has in fact been lost; it will be repaired automatically the next time Ceph's deep scrub runs on the affected PGs.
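Bringing the OSD back after a successful xfs_repair reverses the earlier shutdown steps; a sketch, with the device and mount point assumed as before:

```
mount /dev/sdi1 /var/lib/ceph/osd/ceph-27    # assumed device and mount point
systemctl start ceph-osd@27.service
ceph -s                                      # the cluster should report HEALTH_OK again
ceph pg deep-scrub 14.16a                    # optionally trigger the deep scrub by hand instead of waiting
```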
References
http://ceph.com/planet/ceph-manually-repair-object/
https://linux.cn/article-7961-1.html