1. Ceph RBD and RGW addressing (downloading RBD block / object-storage files)
1.1. Index storage
Ceph stores all of its indexes in omap:
- rbd – each RBD pool has one rbd_directory object
- rgw – each bucket has one or more index objects
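Both kinds of index can be inspected directly with rados; a quick sketch (the pool names here are examples, matching the ones used later in this note):
# RBD: one rbd_directory object per pool
rados -p test001 listomapkeys rbd_directory
# RGW: bucket index objects live in the index pool, named .dir.{bucket_id}
rados -p default.rgw.buckets.index ls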
1.2. RBD addressing
- Receive the RBD client request, which carries:
  - RBD auth info
  - RBD pool name
  - RBD image name
- How to go from the request information to the corresponding 4 MB object files on the OSDs:
- Each RBD pool has an rbd_directory object. The object's contents (format 1) or its omap metadata (format 2) hold a bidirectional mapping between every image ID and image name in the pool, for example:
rados -p test001 listomapvals rbd_directory
id_603d6b8b4567
value (8 bytes) :
00000000  04 00 00 00 74 65 73 74                           |....test|
00000008
id_60486b8b4567
value (8 bytes) :
00000000  04 00 00 00 61 61 61 61                           |....aaaa|
00000008
name_aaaa
value (16 bytes) :
00000000  0c 00 00 00 36 30 34 38 36 62 38 62 34 35 36 37  |....60486b8b4567|
00000010
name_test
value (16 bytes) :
00000000  0c 00 00 00 36 30 33 64 36 62 38 62 34 35 36 37  |....603d6b8b4567|
00000010
This allows a fast lookup of the RBD ID from the name, and vice versa.
- RBD data objects are named rbd_data.{id}.{object number}, so once the ID is known, all the object files belonging to the volume can be found (see the sketch below).
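For example, listing all data objects of image aaaa (id 60486b8b4567 from the rbd_directory dump above):
rados -p test001 ls | grep '^rbd_data.60486b8b4567'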
- Other information about the image, such as snapshots, can be read from the omap of rbd_header.{id}:
  - create_timestamp – creation timestamp
  - features – RBD feature bits
  - object_prefix – object name prefix (rbd_data.{id})
  - order – object size; each data object is 2**order bytes
  - parent – parent image ID and snap ID (parent -> snap -> clone (child))
  - size – image size in bytes
  - snap_seq – snapshot information
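The dump below is what listomapvals returns for the header object; presumably it was produced with something like the following (the id 60486b8b4567 matches the object_prefix visible in the output):
rados -p test001 listomapvals rbd_header.60486b8b4567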
create_timestamp
value (8 bytes) :
00000000  bb 39 68 60 72 aa 49 1b                           |.9h`r.I.|
00000008
features
value (8 bytes) :
00000000  3d 00 00 00 00 00 00 00                           |=.......|
00000008
object_prefix
value (25 bytes) :
00000000  15 00 00 00 72 62 64 5f 64 61 74 61 2e 36 30 34  |....rbd_data.604|
00000010  38 36 62 38 62 34 35 36 37                        |86b8b4567|
00000019
order
value (1 bytes) :
00000000  16                                                |.|
00000001
parent
value (48 bytes) :
00000000  01 01 2a 00 00 00 08 00 00 00 00 00 00 00 0e 00  |..*.............|
00000010  00 00 66 62 64 66 61 63 35 63 65 66 61 31 66 38  |..fbdfac5cefa1f8|
00000020  de 00 00 00 00 00 00 00 00 00 00 00 19 00 00 00  |................|
00000030
size
value (8 bytes) :
00000000  00 00 00 80 02 00 00 00                           |........|
00000008
snap_seq
snapshot_0000000000000173
value (113 bytes) :
00000000  06 01 6b 00 00 00 73 01 00 00 00 00 00 00 04 00  |..k...s.........|
00000010  00 00 74 65 73 74 00 00 00 00 19 00 00 00 3d 00  |..test........=.|
00000020  00 00 00 00 00 00 01 01 2a 00 00 00 08 00 00 00  |........*.......|
00000030  00 00 00 00 0e 00 00 00 66 62 64 66 61 63 35 63  |........fbdfac5c|
00000040  65 66 61 31 66 38 de 00 00 00 00 00 00 00 00 00  |efa1f8..........|
00000050  00 00 19 00 00 00 00 00 00 00 00 00 00 00 00 01  |................|
00000060  01 04 00 00 00 00 00 00 00 c4 7e 69 60 bc 83 2c  |..........~i`..,|
00000070  0b                                                |.|
00000071
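Two of these fields decoded as a sanity check: order is 0x16 = 22, so each data object is 2**22 bytes, and size is the little-endian value 00 00 00 80 02 00 00 00 = 0x280000000:
echo $(( 1 << 0x16 ))       # 4194304      (4 MiB object size)
echo $(( 0x280000000 ))     # 10737418240  (10 GiB image size)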
- Once the objects are located, they are returned to the client and assembled into a complete RBD image.
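A small sketch of the offset-to-object calculation behind this, using the image id and order from the header dump above (the byte offset is just an example):
id=60486b8b4567
order=22                            # from the header: 2**22 = 4 MiB objects
offset=$(( 100 * 1024 * 1024 ))     # byte offset 100 MiB into the image
printf 'rbd_data.%s.%016x\n' "$id" $(( offset >> order ))
# -> rbd_data.60486b8b4567.0000000000000019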
1.3. Side notes
The listomapvals command
While we are at it, the flow of rados listomapvals and the other omap commands:
- the client computes, with the CRUSH algorithm and the osdmap obtained from the mon, which OSD holds the object
- it then queries the omap on that OSD, and the result is returned to the client
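The same CRUSH computation can be checked from the command line with ceph osd map (pool/object names taken from the examples above):
ceph osd map test001 rbd_directory
# prints the PG the object hashes to and the up/acting OSD set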
Configuring the RBD client asok and log file
Prerequisite: the asok file is only created if ceph-common is installed.
# /etc/ceph/ceph.conf
[client]
debug_rbd=1
debug_rados=1
log_file=/tmp/ceph-client-$cluster-$type-${id}-${pid}.$cctid.log
admin_socket = /tmp/ceph-client-$cluster-$type.$id.$pid.$cctid.asok
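After restarting the client process, the socket appears under /tmp; the asok path below is the one from the example further down, so adjust it to your own:
ls /tmp/ceph-client-*.asok
# list every command the client admin socket supports
ceph daemon /tmp/ceph-client-ceph-client.cinder.135232.93839795954376.asok help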
A few commonly used debug options:
- debug_objecter
- debug_ms
- debug_rbd
- debug_auth
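These can also be raised on a running client through the asok, without editing ceph.conf or restarting (same example asok path as above):
ceph daemon /tmp/ceph-client-ceph-client.cinder.135232.93839795954376.asok config set debug_rbd 20
ceph daemon /tmp/ceph-client-ceph-client.cinder.135232.93839795954376.asok config get debug_rbd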
The most common client symptom is a hang; the command below shows which I/O is currently in flight. When several disks are mapped, first work out which asok belongs to which disk.
ceph daemon ./ceph-client-ceph-client.cinder.135232.93839795954376.asok objecter_requests
# output
{
"ops": [],
"linger_ops": [
{
"linger_id": 1,
"pg": "1.7b816156",
"osd": 87,
"object_id": "rbd_header.11f321e6d16ef1",
"object_locator": "@1",
"target_object_id": "rbd_header.11f321e6d16ef1",
"target_object_locator": "@1",
"paused": 0,
"used_replica": 0,
"precalc_pgid": 0,
"snapid": "head",
"registered": "1"
}
],
"pool_ops": [],
"pool_stat_ops": [],
"statfs_ops": [],
"command_ops": []
}
These linger_ops are long-lived watch operations (the resident heartbeat on the image header); the rbd_header object name gives the image ID, which can then be resolved back to a name:
rados getomapval rbd_directory id_11f321e6d16ef1 -p volumes
# output
00000000 2b 00 00 00 76 6f 6c 75 6d 65 2d 38 61 62 62 35 |+...volume-8abb5|
00000010 33 31 38 2d 31 63 32 35 2d 34 36 33 35 2d 62 32 |318-1c25-4635-b2|
00000020 34 36 2d 30 63 39 66 36 34 32 36 37 65 63 33 |46-0c9f64267ec3|
0000002f
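With the name resolved, the image can be inspected with the normal rbd tooling, e.g.:
rbd info volumes/volume-8abb5318-1c25-4635-b246-0c9f64267ec3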
1.4. RGW addressing
- Receive the client request, which carries:
  - auth info
  - bucket name
  - object name
- How to go from the request information to all of the 4 MB chunk objects that make up the RGW object:
  - the bucket resolves to its index object ID
  - CRUSH computes which OSD holds the index
  - the index omap is queried for the object's entries
  - the entries are returned to the RGW
  - the RGW assembles the chunks and returns the object
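A hedged end-to-end sketch of the index lookup (the bucket name test-bucket is an example; the pool name is the RGW default):
# the bucket instance id comes out of bucket stats
radosgw-admin bucket stats --bucket=test-bucket | grep '"id"'
# index objects are named .dir.{bucket_id}; their omap keys are the objects in the bucket
rados -p default.rgw.buckets.index listomapkeys .dir.{bucket_id}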
1.5. A data-recovery idea
1.5.1. Scenario
After a power failure in the machine room, data and omap are corrupted and the services will not start: the cluster can neither read nor write, and the RBD metadata is unreadable.
1.5.2. Approach
Build a new cluster, import the metadata, then import the data:
- Build a new cluster with the same scale (number of OSDs) as the old one, so that CRUSH produces the same placement results
- Create the pools; the pool definitions can be recovered from the osdmap in the OSD's meta directory, and the crushmap can in turn be extracted from that osdmap:
# extract the crushmap from the osdmap
osdmaptool osdmap.43__0_641716DC__none --export-crush /tmp/crushmap.bin
crushtool -d /tmp/crushmap.bin -o /tmp/crushmap.txt
# print the osdmap
osdmaptool --print osdmap.43__0_641716DC__none
# output
epoch 43
fsid 8685ec71-96a6-413a-9e4d-ff47071dc4f5
created 2020-12-22 16:57:37.845173
modified 2021-04-04 15:51:46.729929
flags sortbitwise,recovery_deletes,purged_snapdirs
crush_version 6
full_ratio 0.95
backfillfull_ratio 0.9
nearfull_ratio 0.85
require_min_compat_client jewel
min_compat_client jewel
require_osd_release luminous
pool 1 '.rgw.root' replicated size 1 min_size 1 crush_rule 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 7 owner 18446744073709551615 flags hashpspool stripe_width 0 application rgw
pool 2 'test001' replicated size 1 min_size 1 crush_rule 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 40 flags hashpspool stripe_width 0
removed_snaps [1~3]
pool 3 'default.rgw.control' replicated size 1 min_size 1 crush_rule 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 21 owner 18446744073709551615 flags hashpspool stripe_width 0 application rgw
pool 4 'default.rgw.meta' replicated size 1 min_size 1 crush_rule 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 23 owner 18446744073709551615 flags hashpspool stripe_width 0 application rgw
pool 5 'default.rgw.log' replicated size 1 min_size 1 crush_rule 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 25 owner 18446744073709551615 flags hashpspool stripe_width 0 application rgw
pool 6 'default.rgw.buckets.index' replicated size 1 min_size 1 crush_rule 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 28 owner 18446744073709551615 flags hashpspool stripe_width 0 application rgw
pool 7 'default.rgw.buckets.data' replicated size 1 min_size 1 crush_rule 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 31 owner 18446744073709551615 flags hashpspool stripe_width 0 application rgw
pool 8 'default.rgw.buckets.non-ec' replicated size 1 min_size 1 crush_rule 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 34 owner 18446744073709551615 flags hashpspool stripe_width 0 application rgw
max_osd 1
osd.0 up in weight 1 up_from 42 up_thru 42 down_at 41 last_clean_interval [37,40) 172.16.10.10:6801/12622 172.16.10.10:6802/12622 172.16.10.10:6803/12622 172.16.10.10:6804/12622 exists,up 59ebf8d5-e7f7-4c46-8e05-bac5140eee89
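From the pool lines printed above, the pools can then be recreated in the new cluster with matching pg_num/pgp_num; a sketch (pool names from the dump, the application tag only where the old pool had one):
ceph osd pool create test001 8 8 replicated
ceph osd pool application enable default.rgw.buckets.data rgw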
- Recover each pool's rbd_directory metadata
- Recover the rbd_header metadata
- Use the maps to find the OSDs that hold rbd_directory and rbd_header, and put the metadata back into their omap (see the sketch below)
- Once the metadata is rebuilt, either swap the omap into the old cluster, or copy the old cluster's data into the new one and start the OSDs
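A minimal sketch of the omap extraction and injection, assuming the damaged OSD's store is still readable offline (paths, object and key names are examples; the OSDs must be stopped while ceph-objectstore-tool runs, and depending on the version the object may need to be qualified with --pgid):
# list the omap keys of rbd_directory on the damaged OSD
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 rbd_directory list-omap
# dump one key's (length-prefixed, binary) value to a file
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 rbd_directory get-omap name_test > name_test.val
# write it into the corresponding object on the new cluster's OSD, again offline
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 rbd_directory set-omap name_test name_test.val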
This looks feasible, but the coupling between PGs and the osdmap, and the omap versioning, do not look easy to solve; worth revisiting when there is a real scenario or demand.