Ceph osd write op tracker analysis

简介

Ceph OSD op tracker是osd的一个调试手段,通过op tracker可以分析osd的op在各个操作端的时间开销,进而分析出系统的性能瓶颈和需要调优的参数。

我们基于Hammer版本的op tracker添加了几个events,能更细粒度的切分op的各个阶段,下面是我们通过op tracker的log来分析OSD端性能的实例。

tracker events

Primary write

这里把op的处理过程分为5个阶段,后面会对每一阶段进行分析;

op: osd_op

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
 tracker -- seq: 329476, time: 2015-08-06 13:42:32.301230, event: header_read                         - OpRequest::OpRequest
tracker -- seq: 329476, time: 2015-08-06 13:42:32.301232, event: throttled - OpRequest::OpRequest
tracker -- seq: 329476, time: 2015-08-06 13:42:32.301726, event: all_read - OpRequest::OpRequest
tracker -- seq: 329476, time: 0.000000, event: dispatched - OpRequest::OpRequest
tracker -- seq: 329476, time: 2015-08-06 13:42:33.429736, event: reached_pg - OSD::dequeue_op - OSD::ShardedOpWQ::_process
1. 上面这部分延时是osd op queue的处理时延

tracker -- seq: 329476, time: 2015-08-06 13:42:33.430201, event: started - ReplicatedPG::do_pg_op
tracker -- seq: 329476, time: 2015-08-06 13:42:33.430800, event: waiting for subops from 6,47 - ReplicatedBackend::issue_op
-> ReplicatedPG::send_message_osd_cluster
tracker -- seq: 329476, time: 2015-08-06 13:42:33.510099, event: commit_queued_for_journal_write - FileJournal::submit_entry
2. 上面这部分延时主要是reserve filestore throttle等待的时间,往replica osd发送请求应该比较快。

tracker -- seq: 329476, time: 2015-08-06 13:42:33.510596, event: sub_op_commit_rec - ReplicatedBackend::sub_op_modify_reply
tracker -- seq: 329476, time: 2015-08-06 13:42:33.510983, event: write_thread_in_journal_buffer - FileJournal::prepare_single_write [-> journalq.push_back()]
<- FileJournal::prepare_multi_write
<- FileJournal::write_thread_entry
tracker -- seq: 329476, time: 2015-08-06 13:42:35.071590, event: journaled_completion_queued - FileJournal::queue_completions_thru
<- FileJournal::committed_thru
<- JournalingObjectStore::ApplyManager::commit_finish
3. 上面这部分延时是写journal的花费时间

tracker -- seq: 329476, time: 2015-08-06 13:42:35.074449, event: sub_op_commit_rec - ReplicatedBackend::sub_op_modify_reply
tracker -- seq: 329476, time: 2015-08-06 13:42:35.906469, event: op_applied - ReplicatedBackend::op_applied
<- C_OSD_OnOpApplied.finish()
4. 上面这部分延时是filestore数据罗盘的花费时间

tracker -- seq: 329476, time: 2015-08-06 13:42:35.947938, event: op_commit - ReplicatedBackend::op_commit
tracker -- seq: 329476, time: 2015-08-06 13:42:35.947989, event: commit_sent - ReplicatedPG::eval_repop
tracker -- seq: 329476, time: 2015-08-06 13:42:35.948057, event: done
5. 上面这部分是发送reply给client的时延

Secondary write

op: osd_repop

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
tracker -- seq: 329473, time: 2015-08-06 13:42:32.292964, event: header_read                         - OpRequest::OpRequest
tracker -- seq: 329473, time: 2015-08-06 13:42:32.292965, event: throttled - OpRequest::OpRequest
tracker -- seq: 329473, time: 2015-08-06 13:42:32.293070, event: all_read - OpRequest::OpRequest
tracker -- seq: 329473, time: 0.000000, event: dispatched - OpRequest::OpRequest
tracker -- seq: 329473, time: 2015-08-06 13:42:33.427436, event: reached_pg - OSD::dequeue_op - OSD::ShardedOpWQ::_process
tracker -- seq: 329473, time: 2015-08-06 13:42:33.427459, event: started - ReplicatedPG::do_pg_op
tracker -- seq: 329473, time: 2015-08-06 13:42:33.428036, event: started
tracker -- seq: 329473, time: 2015-08-06 13:42:33.428083, event: commit_queued_for_journal_write - FileJournal::submit_entry
tracker -- seq: 329473, time: 2015-08-06 13:42:33.507348, event: write_thread_in_journal_buffer - FileJournal::prepare_single_write
<- FileJournal::prepare_multi_write
<- FileJournal::write_thread_entry
tracker -- seq: 329473, time: 2015-08-06 13:42:35.071399, event: journaled_completion_queued - FileJournal::queue_completions_thru
<- FileJournal::committed_thru
<- JournalingObjectStore::ApplyManager::commit_finish
tracker -- seq: 329473, time: 2015-08-06 13:42:35.724058, event: sub_op_applied - ReplicatedBackend::sub_op_modify_applied
tracker -- seq: 329473, time: 2015-08-06 13:42:35.878877, event: commit_sent - ReplicatedBackend::sub_op_modify_commit
tracker -- seq: 329473, time: 2015-08-06 13:42:35.878919, event: done

分析

针对上述primary osd上的op tracker events,我们对前4个阶段的分析如下:

  1. 第1阶段:osd op queue的处理时延

    处理osd op的workqueue和threadpool为op_shardedwq, osd_op_tp;

    1
    2
    3
    4
    op_shardedwq(cct->_conf->osd_op_num_shards, this,
    cct->_conf->osd_op_thread_timeout, &osd_op_tp),
    osd_op_tp(cct, "OSD::osd_op_tp",
    cct->_conf->osd_op_num_threads_per_shard * cct->_conf->osd_op_num_shards),

    可以调大如下值来提高osd的处理能力,减少在这个阶段的时延

    1
    2
    OPTION(osd_op_num_threads_per_shard, OPT_INT, 2)
    OPTION(osd_op_num_shards, OPT_INT, 5)
  2. 第2阶段:reserve filestore throttle等待的时间

    所耗时间包括往replica osd发送message的时间 + reserve filestore queue throttle wait时间 + filejournal throttle wait时间

    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
     void ReplicatedBackend::submit_transaction()
    |-- void ReplicatedBackend::issue_op()
    | |-- void ReplicatedPG::send_message_osd_cluster() // send write data to replica osds
    | |-- void OSDService::send_message_osd_cluster(int peer, Message *m, epoch_t from_epoch)
    |-- parent->queue_transaction(op_t, op.op);
    | |-- int FileStore::queue_transactions() // 没有journal或需要同时写journal和disk时,会调用 queue_op(osr, o);
    | |-- void JournalingObjectStore::_op_journal_transactions()
    | |-- void FileJournal::submit_entry() //添加到 deque<write_item> writeq;

    FileStore::queue_transactions()里会调用:(获取filestore和filejournal的throttle)
    op_queue_reserve_throttle(o, handle);
    journal->throttle();

    调整这些参数应该可以减少这部分的处理时间

    1
    2
    3
    4
    5
    OPTION(filestore_queue_max_ops, OPT_INT, 50)
    OPTION(filestore_queue_max_bytes, OPT_INT, 100 << 20)

    OPTION(journal_queue_max_ops, OPT_INT, 300)
    OPTION(journal_queue_max_bytes, OPT_INT, 32 << 20)
  3. 第3阶段:写journal的时间

    journal写数据:

    1
    FileJournal.Writer.entry() -> FileJournal::write_thread_entry() -> FileJournal::do_write() -> FileJournal::write_bl()

    使用高性能ssd作为journal应该能减少这部分时间。

  4. 第4阶段:写filestore的时间

    所耗时间为所有osd写filestore的时间,这里测试发现数据写filestore比写journal更快,可能因为journal带的盘数较多的缘故。

    filestore的处理线程配置参数:

    1
    OPTION(filestore_op_threads, OPT_INT, 2)
支持原创