Design of a Dynamic Rate-Limiting Framework for Cloud Disks Based on Cinder

Current Status

Currently we only limit the IO of different types of cloud disks by creating Cinder volume types and attaching per-type IO limits.

For the detailed approach and background research, see: Cloud Disk Dynamic Rate-Limiting Research

Goals

  • Make full use of system resources: when the host load and the Ceph load are low, dynamically raise the IO limits of cloud disks to give users a better experience.
  • Design a framework that meets the requirements of dynamic cloud disk rate limiting and that can later be extended to dynamic network rate limiting.

Framework Design

Requirements

  • Unified configuration options
  • Modular design, so that other dynamic rate-limiting features can be added easily
  • Unified scheduling, with each physical node running independently

Option 1

A decentralized architecture, implemented standalone: each physical node is independently responsible for monitoring load and applying dynamic rate limits.

[Figure: arch1]

Pros:

  • Standalone implementation, relatively simple
  • Each physical node computes its own load, reducing communication traffic
  • The Master only needs to maintain the configuration, which can be kept in memory

Cons:

  • Cannot be integrated with the existing monitoring system

Option 2

Follows the OpenStack design: the Master maintains system load centrally and stores it in the DB; the Master and Agents communicate via RabbitMQ.

[Figure: arch2]

Pros:

  • Reuses OpenStack modules such as the DB, RabbitMQ RPC communication, and WSGI
  • System load information is maintained centrally in the DB
  • Easy to integrate with the existing monitoring system

Cons:

  • Pulling in several general-purpose OpenStack modules means a larger codebase and more work
  • Storing system load information centrally in the DB does not seem necessary at the moment

Choosing an Option

I lean toward Option 1: it does not require pulling in the various OpenStack modules, the implementation is relatively simple, and it meets the needs of our project.

Cloud Disk Rate-Limiting Design

Overview

As pointed out in the earlier research, we can adjust the IO limits of a specific cloud disk of a VM by running virsh commands on the physical host where the nova instance lives.
The commands that actually apply dynamic rate limits therefore have to run on the nova instance's host, which means every physical host that runs nova instances needs a rate-limiting agent process.

We expect dynamic cloud disk rate limiting to be influenced by the following two factors:

  • The load of the physical host
  • The IO load of the Ceph pool backing the cloud disk

A fairly simple design is therefore:

Each Agent runs independently and is responsible for dynamically adjusting the cloud disk IO of the nova instances on its own physical host. The Agent can obtain the real-time load of the local host and the IO load of the corresponding Ceph pool.
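
A minimal sketch of such an agent loop, in Python, might look like the following (the tune_instance_volumes helper and the hard-coded interval are illustrative placeholders, not existing code):

import os
import time

INTERVAL = 60  # seconds between two tuning passes, see the "interval" config option below


def tune_instance_volumes(load_avg):
    # Placeholder: the real agent would walk the local nova instances via virsh
    # and adjust each attached volume's iotune based on the measured load.
    print("load averages:", load_avg)


def agent_loop():
    while True:
        load_avg = os.getloadavg()   # 1/5/15 minute load averages of this host
        tune_instance_volumes(load_avg)
        time.sleep(INTERVAL)


if __name__ == "__main__":
    agent_loop()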

Factors to Consider

  • Dynamic rate limiting can be switched on/off via configuration
  • Whitelist support
  • Blacklist support
  • Supported cloud disk types are specified via configuration
  • The dynamic rate-limiting mode is specified via configuration:
    • host load + Ceph pool load
    • host load only
    • Ceph pool load only
  • Configurable interval between dynamic adjustments
  • Configurable dynamic rate-limiting range per cloud disk type: [begin, end]

Example Configuration

[DEFAULT]
# Enable or disable dynamic resource limiting (True/False)
dynamic_tune = True/False
# Interval between dynamic tuning checks, in seconds
interval = 60
# Tuning modules that perform dynamic tuning
# Keep in sync with the corresponding sections below
tune_modules = iotune

### io dynamic tune related
[iotune]
# Which volume types are supported; "default" means tune all volume types
# Keep in sync with the cinder volume type configuration
# Example values: default / sata / ssd / "sata, ssd"
volume_types = default
# Tuning mode; possible values: default/local/ceph
# default: check both local and ceph load
# local: only check local load
# ceph: only check ceph load
tune_model = default

# Set the iotune range for a specified volume type to [begin, end]
iotune_bw_range = {'sata' : {'read' : [80, 200], 'write' : [50, 100]}, 'ssd' : {'read' : [100, 300], 'write' : [80, 200]}}
iotune_iops_range = {'sata' : {'read' : [100, 1000], 'write' : [100, 800]}, 'ssd' : {'read' : [1000, 5000], 'write' : [600, 3000]}}

# Per-disk tuning info for specific volumes
iotune_range = {'d-5pc4hq7y' : {'bw' : {'read' : [80, 200], 'write' : [50, 100]}, 'iops' : {'read' : [100, 1000], 'write' : [100, 800]}},
    'd-ymqagtez' : {'bw' : {'read' : [100, 300], 'write' : [80, 200]}, 'iops' : {'read' : [1000, 5000], 'write' : [600, 3000]}} }

# Blacklist: nova instances that should NOT be dynamically tuned
#blacklist = nova-ins1, nova-ins2, ...
# Whitelist: dynamic iotune is applied ONLY to these nova instances
#whitelist = nova-ins3, nova-ins4, ...

### Per-user tuning info
[usr-kndz857b]
iotune_bw_range = {'sata' : {'read' : [80, 200], 'write' : [50, 100]}, 'ssd' : {'read' : [100, 300], 'write' : [80, 200]}}
iotune_iops_range = {'sata' : {'read' : [100, 1000], 'write' : [100, 800]}, 'ssd' : {'read' : [1000, 5000], 'write' : [600, 3000]}}

The code reads the configuration above and builds a conf dictionary inside the script.
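
A minimal sketch of that step, assuming the INI file above is saved as agent.conf and that multi-line values are indented as configparser requires (load_conf and parse_value are hypothetical helper names):

import ast
import configparser


def parse_value(raw):
    """Return Python literals (dicts, lists, numbers) when possible, else the raw string."""
    try:
        return ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return raw


def load_conf(path="agent.conf"):
    parser = configparser.ConfigParser()
    parser.read(path)
    conf = {"DEFAULT": {k: parse_value(v) for k, v in parser.defaults().items()}}
    for section in parser.sections():
        conf[section] = {k: parse_value(v) for k, v in parser.items(section)}
    return conf


# Example: read the iotune bandwidth range configured for sata volumes.
# conf = load_conf()
# sata_bw = conf["iotune"]["iotune_bw_range"]["sata"]   # {'read': [80, 200], 'write': [50, 100]}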

Commands Involved and Their Explanations

1) Get information about the Cinder volumes in the system

cinder list --all | grep "in-use" | awk '{print $2,$4,$8,$12}'

(.venv)openstack@Server-01-01:~$ cinder list --all | grep "in-use" | awk '{print $2,$4,$8,$12}'
...
| ID | Tenant ID | Name | Volume Type |
04960b37-7a07-439d-9fd4-dc0664870245 a72514ab47524d00b0b43551466c7d67 d-6a3x2u8j sata
06996e93-47ad-4003-acb4-b46aa185d479 f634c4182b634a7ca4a4c2ac72837cb7 d-ibvar57t sata
0f45c78c-743b-4d42-b098-b2ee5f95df94 462dc39145324bb1bda7263368d176aa d-f8wkacjy ssd
103a0e6d-382c-4caa-b313-63a95eeb351d 616eae54884147d786a3beb39ea6bcbb d-va5fe6bw sata
...

2) Find the Tenant ID corresponding to the user in the configuration file

openstack project list | grep "user-id"

openstack@Server-01-01:~$ openstack user list | grep usr-kndz857b
| 63719e972d7d4afe87b7127c9daa120f | usr-kndz857b |

3) List the virsh instances on this physical host

virsh list: list domains

openstack@Server-01-01:~$ virsh list
Id Name State
----------------------------------------------------
40 instance-0000028d running
52 instance-0000033d running
122 instance-00000721 running
140 instance-000007a5 running
141 instance-000007a3 running
...

4) List all block devices of a given virsh instance

virsh domblklist <domain>: list all domain blocks

openstack@Server-01-01:~$ virsh domblklist instance-00000921
Target Source
------------------------------------------------
vda /var/lib/nova/instances/7ef2900b-0653-443b-a2c0-3da7a5d7a10f/disk
vdb volumes/volume-7b94ccc5-6b14-4352-b4f3-74a71893f246
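
Building on the output format above, a sketch of mapping each local domain to its attached Cinder volume IDs could look like this (list_domains and domain_volumes are illustrative helper names; rbd-backed disks are recognised by the volumes/volume-<uuid> source shown above):

import subprocess


def list_domains():
    out = subprocess.check_output(["virsh", "list", "--name"], universal_newlines=True)
    return [name for name in out.splitlines() if name.strip()]


def domain_volumes(domain):
    """Return {target: volume_id} for rbd-backed disks of one domain, e.g. {'vdb': '7b94...'}."""
    out = subprocess.check_output(["virsh", "domblklist", domain], universal_newlines=True)
    volumes = {}
    for line in out.splitlines()[2:]:          # skip the "Target Source" header lines
        parts = line.split()
        if len(parts) == 2 and "/volume-" in parts[1]:
            target, source = parts
            volumes[target] = source.split("/volume-", 1)[1]
        # local image-backed disks such as vda are ignored here
    return volumes


# for dom in list_domains():
#     print(dom, domain_volumes(dom))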

5) Query/modify a block device's iotune

virsh blkdeviotune <domain> <device>: set or query block device I/O tuning parameters

openstack@Server-01-01:~$ virsh blkdeviotune instance-00000921 vdb
total_bytes_sec: 0
read_bytes_sec : 104857600
write_bytes_sec: 62914560
total_iops_sec : 0
read_iops_sec : 1500
write_iops_sec : 1000
total_bytes_sec_max: 0
read_bytes_sec_max: 10485760
write_bytes_sec_max: 6291456
total_iops_sec_max: 0
read_iops_sec_max: 150
write_iops_sec_max: 100
size_iops_sec : 0
openstack@Server-01-01:~$ virsh blkdeviotune instance-00000921 vdb --read_iops_sec 5000 --write_iops_sec 3000
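
The current settings can be read programmatically by parsing that key/value output; a rough sketch (get_iotune is an illustrative helper name):

import subprocess


def get_iotune(domain, device):
    out = subprocess.check_output(["virsh", "blkdeviotune", domain, device],
                                  universal_newlines=True)
    iotune = {}
    for line in out.splitlines():
        if ":" in line:
            key, value = line.split(":", 1)
            iotune[key.strip()] = int(value)
    return iotune


# iotune = get_iotune("instance-00000921", "vdb")
# iotune["read_iops_sec"]   -> 1500 in the example output above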

Agent Framework Flow

[Figure: agent]

Obtaining Load Information

1) Local host load

$ uptime
14:35:25 up 32 days, 14:54, 5 users, load average: 2.53, 2.82, 2.80

The load averages over the last 1, 5, and 15 minutes are 2.53, 2.82, and 2.80 respectively.
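
In the agent these three values can also be read directly in Python instead of shelling out to uptime, for example:

import os

load_1, load_5, load_15 = os.getloadavg()   # e.g. (2.53, 2.82, 2.80)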

2) Ceph load

$ ceph osd tree
ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
-10 3.41998 root hp-default
-7 1.70999 host hp-Server-01-02
33 0.64999 osd.33 up 1.00000 1.00000
32 0.67000 osd.32 up 1.00000 1.00000
31 0.39000 osd.31 up 1.00000 1.00000
-9 1.70999 host hp-Server-02-02
38 0.67000 osd.38 up 1.00000 1.00000
39 0.64999 osd.39 up 1.00000 1.00000
37 0.39000 osd.37 up 1.00000 1.00000
-1 101.91998 root default
-2 25.48000 host Server-01-01
6 3.64000 osd.6 up 1.00000 1.00000
5 3.64000 osd.5 up 1.00000 1.00000
4 3.64000 osd.4 up 1.00000 1.00000
3 3.64000 osd.3 up 1.00000 1.00000
2 3.64000 osd.2 up 1.00000 1.00000
1 3.64000 osd.1 up 1.00000 1.00000
0 3.64000 osd.0 up 1.00000 1.00000
-3 25.48000 host Server-01-02
13 3.64000 osd.13 up 1.00000 1.00000
12 3.64000 osd.12 up 1.00000 1.00000
11 3.64000 osd.11 up 1.00000 1.00000
10 3.64000 osd.10 up 1.00000 1.00000
9 3.64000 osd.9 up 1.00000 1.00000
8 3.64000 osd.8 up 1.00000 1.00000
7 3.64000 osd.7 up 1.00000 1.00000
-4 25.48000 host Server-02-01
20 3.64000 osd.20 up 1.00000 1.00000
19 3.64000 osd.19 up 1.00000 1.00000
18 3.64000 osd.18 up 1.00000 1.00000
17 3.64000 osd.17 up 1.00000 1.00000
16 3.64000 osd.16 up 1.00000 1.00000
15 3.64000 osd.15 up 1.00000 1.00000
14 3.64000 osd.14 up 1.00000 1.00000
-5 25.48000 host Server-02-02
27 3.64000 osd.27 up 1.00000 1.00000
26 3.64000 osd.26 up 1.00000 1.00000
25 3.64000 osd.25 up 1.00000 1.00000
24 3.64000 osd.24 up 1.00000 1.00000
23 3.64000 osd.23 up 1.00000 1.00000
22 3.64000 osd.22 up 1.00000 1.00000
21 3.64000 osd.21 up 1.00000 1.00000

From the output above, the OSDs corresponding to each class of disk are:

{"sata" : (0,1,2,3,4,5,6,7,...,26,27)}
{"ssd": (31,32,33,37,38,39)}

[Note: this mapping is static and only needs to be fetched once.]
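
A sketch of building that one-off mapping by parsing the plain-text ceph osd tree output (the root-to-disk-type mapping below is a site-specific assumption taken from the output above, not something Ceph reports itself):

import subprocess

ROOT_TO_TYPE = {"default": "sata", "hp-default": "ssd"}   # site-specific assumption


def osd_groups():
    out = subprocess.check_output(["ceph", "osd", "tree"], universal_newlines=True)
    groups = {t: set() for t in ROOT_TO_TYPE.values()}
    current_type = None
    for line in out.splitlines():
        parts = line.split()
        if "root" in parts:
            root_name = parts[parts.index("root") + 1]
            current_type = ROOT_TO_TYPE.get(root_name)
        elif current_type and any(p.startswith("osd.") for p in parts):
            osd_id = next(p for p in parts if p.startswith("osd."))
            groups[current_type].add(int(osd_id.split(".")[1]))
    return groups


# osd_groups() -> {'sata': {0, 1, ..., 27}, 'ssd': {31, 32, 33, 37, 38, 39}}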

$ ceph osd perf
osd fs_commit_latency(ms) fs_apply_latency(ms)
0 0 1
1 0 1
2 0 1
3 0 1
4 0 1
5 0 1
6 0 1
7 2 4
8 0 1
9 0 0
10 0 1
11 0 1
12 0 1
13 0 0
14 0 1
15 0 1
16 0 1
17 0 1
18 0 1
19 0 1
20 0 1
21 0 1
22 0 0
23 0 0
24 0 0
25 0 1
26 0 0
27 0 1
31 1 2
32 0 7
33 0 6
37 0 6
38 0 8
39 0 3

Then obtain the average load within the SATA and SSD disk groups respectively:

{"sata" : [1, 2], "ssd" : [1, 6]}

[Note: this information needs to be fetched periodically.]
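
A sketch of the periodic collection, parsing the plain-text ceph osd perf table and averaging the latencies per disk group (ceph_group_load is an illustrative helper name; it expects the OSD grouping built earlier):

import subprocess


def ceph_group_load(groups):
    """groups: {'sata': {0, 1, ...}, 'ssd': {31, ...}} -> {'sata': [commit, apply], ...}."""
    out = subprocess.check_output(["ceph", "osd", "perf"], universal_newlines=True)
    per_osd = {}
    for line in out.splitlines()[1:]:           # skip the header line
        parts = line.split()
        if len(parts) == 3 and parts[0].isdigit():
            per_osd[int(parts[0])] = (int(parts[1]), int(parts[2]))
    load = {}
    for disk_type, osds in groups.items():
        samples = [per_osd[o] for o in osds if o in per_osd]
        if samples:
            load[disk_type] = [round(sum(s[i] for s in samples) / float(len(samples)))
                               for i in (0, 1)]
    return load


# ceph_group_load(osd_groups()) -> something like {'sata': [1, 2], 'ssd': [1, 6]}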

Dynamically Adjusting IO Limits

Formula:

sys_load_throttle = 20
ceph_load_throttle1 = 10
ceph_load_throttle2 = 20

# Host load, taken from the uptime output, e.g. sys_load: [2.53, 2.82, 2.80]
# Ceph load, taken from the ceph osd perf output, e.g. ceph_load: {"sata" : [1, 2], "ssd" : [1, 6]}
# cload: the load of the Ceph disk group matching the volume_type
# Decide from sys_load and cload whether to increase or decrease bw and iops
if (cload[0] < ceph_load_throttle1) and (cload[1] < ceph_load_throttle1):
    if (sys_load[0] < sys_load_throttle) and (sys_load[1] < sys_load_throttle):
        # increase
    else:
        # keep unchanged
elif (ceph_load_throttle1 <= cload[0] <= ceph_load_throttle2) and (ceph_load_throttle1 <= cload[1] <= ceph_load_throttle2):
    if (sys_load[0] > sys_load_throttle) or (sys_load[1] > sys_load_throttle):
        # decrease
    else:
        # keep unchanged
else:
    # decrease

From the Ceph and host load information, we can determine whether the iotune of SATA/SSD disks should be increased, decreased, or kept unchanged.
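
A small runnable version of this decision rule might look like the sketch below; the +1/0/-1 return value matches the iotune trend argument described in the next step (the function name iotune_trend is illustrative):

SYS_LOAD_THROTTLE = 20
CEPH_LOAD_THROTTLE1 = 10
CEPH_LOAD_THROTTLE2 = 20


def iotune_trend(sys_load, cload):
    """sys_load: [1min, 5min, 15min] load averages; cload: [commit_ms, apply_ms]."""
    if cload[0] < CEPH_LOAD_THROTTLE1 and cload[1] < CEPH_LOAD_THROTTLE1:
        if sys_load[0] < SYS_LOAD_THROTTLE and sys_load[1] < SYS_LOAD_THROTTLE:
            return 1        # both ceph and the host are idle enough: raise the limits
        return 0            # ceph is idle but the host is busy: keep
    if (CEPH_LOAD_THROTTLE1 <= cload[0] <= CEPH_LOAD_THROTTLE2
            and CEPH_LOAD_THROTTLE1 <= cload[1] <= CEPH_LOAD_THROTTLE2):
        if sys_load[0] > SYS_LOAD_THROTTLE or sys_load[1] > SYS_LOAD_THROTTLE:
            return -1       # medium ceph load and a busy host: lower the limits
        return 0            # medium ceph load, idle host: keep
    return -1               # ceph is overloaded: lower the limits


# iotune_trend([2.53, 2.82, 2.80], [1, 6]) -> 1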

Loop over every instance; for each volume whose type needs to be increased or decreased, set the volume's new iotune values:

Get the bw_range and iops_range for the volume from the conf, then call the function below to perform the iotune adjustment.

set_volume_dynamic_iotune()
Input: volume target - the device name of the volume, e.g. vdb
       bw_range - bandwidth range, from the config file, e.g. {'read' : [80, 200], 'write' : [50, 100]} # unit: MB
       iops_range - iops range, from the config file, e.g. {'read' : [100, 300], 'write' : [80, 200]}
       iotune - iotune trend; 0: keep unchanged; >0: increase; <0: decrease

1. Get the volume's current bw and iops values
2. Compute the volume's new bw and iops values from the inputs
3. Check whether the new bw and iops values fall within bw_range and iops_range; if so, apply the new values, otherwise keep the old values
4. Set read_bytes_sec, write_bytes_sec, read_iops_sec and write_iops_sec via the virsh blkdeviotune command

Notes: read_bytes_sec - bytes read per second
       write_bytes_sec - bytes written per second
       read_iops_sec - read I/O operations per second
       write_iops_sec - write I/O operations per second

bw: adjustment step 10 MB (10485760 bytes)
iops: adjustment step 100
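
A sketch of set_volume_dynamic_iotune() along those lines, using subprocess calls to virsh blkdeviotune (the helper names and argument shapes are illustrative assumptions; only the step sizes and flags come from the text above):

import subprocess

BW_STEP = 10 * 1024 * 1024     # 10 MB per adjustment
IOPS_STEP = 100                # 100 iops per adjustment


def _current_iotune(domain, target):
    out = subprocess.check_output(["virsh", "blkdeviotune", domain, target],
                                  universal_newlines=True)
    return {k.strip(): int(v) for k, v in
            (line.split(":", 1) for line in out.splitlines() if ":" in line)}


def _clamp_step(current, step, lo, hi):
    new = current + step
    return new if lo <= new <= hi else current   # keep the old value if out of range


def set_volume_dynamic_iotune(domain, target, bw_range, iops_range, trend):
    if trend == 0:
        return
    sign = 1 if trend > 0 else -1
    cur = _current_iotune(domain, target)
    new = {
        "--read_bytes_sec": _clamp_step(cur["read_bytes_sec"], sign * BW_STEP,
                                        bw_range["read"][0] * 1024 * 1024,
                                        bw_range["read"][1] * 1024 * 1024),
        "--write_bytes_sec": _clamp_step(cur["write_bytes_sec"], sign * BW_STEP,
                                         bw_range["write"][0] * 1024 * 1024,
                                         bw_range["write"][1] * 1024 * 1024),
        "--read_iops_sec": _clamp_step(cur["read_iops_sec"], sign * IOPS_STEP,
                                       iops_range["read"][0], iops_range["read"][1]),
        "--write_iops_sec": _clamp_step(cur["write_iops_sec"], sign * IOPS_STEP,
                                        iops_range["write"][0], iops_range["write"][1]),
    }
    cmd = ["virsh", "blkdeviotune", domain, target]
    for flag, value in new.items():
        cmd += [flag, str(value)]
    subprocess.check_call(cmd)


# Example, using the ssd ranges from the configuration section:
# set_volume_dynamic_iotune("instance-00000921", "vdb",
#                           {'read': [100, 300], 'write': [80, 200]},
#                           {'read': [1000, 5000], 'write': [600, 3000]}, trend=1)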

Default Cloud Disk Limits

Before the dynamic rate-limiting program exits, it must reset the affected cloud disks to their default limits. The default values are as follows:

(.venv)openstack@Server-01-01:~$ cinder qos-list
+--------------------------------------+--------------+-----------+----------------------------------------------------------------------------------------------------------------------------+
| ID | Name | Consumer | specs |
+--------------------------------------+--------------+-----------+----------------------------------------------------------------------------------------------------------------------------+
| 06661a25-27d4-4f64-93d4-8b74e3fda218 | sata-qos | front-end | {u'read_bytes_sec': u'104857600', u'write_iops_sec': u'1000', u'write_bytes_sec': u'62914560', u'read_iops_sec': u'1500'} |
...
| d560db69-e8f3-4a5f-a80c-fff41140f48c | ssd-qos | front-end | {u'read_bytes_sec': u'157286400', u'write_iops_sec': u'3000', u'write_bytes_sec': u'104857600', u'read_iops_sec': u'5000'} |
+--------------------------------------+--------------+-----------+----------------------------------------------------------------------------------------------------------------------------+
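
A sketch of that cleanup step: register an exit hook that pushes the default values above back to every disk the agent has touched (DEFAULT_QOS is copied from the qos-list output; TUNED_DISKS and reset_to_default are illustrative assumptions):

import atexit
import subprocess

DEFAULT_QOS = {
    "sata": {"read_bytes_sec": 104857600, "write_bytes_sec": 62914560,
             "read_iops_sec": 1500, "write_iops_sec": 1000},
    "ssd": {"read_bytes_sec": 157286400, "write_bytes_sec": 104857600,
            "read_iops_sec": 5000, "write_iops_sec": 3000},
}

TUNED_DISKS = []   # filled by the agent with (domain, target, volume_type) tuples


def reset_to_default():
    for domain, target, volume_type in TUNED_DISKS:
        qos = DEFAULT_QOS[volume_type]
        cmd = ["virsh", "blkdeviotune", domain, target]
        for key, value in qos.items():
            cmd += ["--" + key, str(value)]
        subprocess.call(cmd)


atexit.register(reset_to_default)   # also worth wiring to SIGTERM in the real agent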