Ceph RBD and RGW addressing (downloading RBD blocks / object-store files)

Posted by elrond on June 26, 2021

1. Ceph RBD and RGW addressing (downloading RBD blocks / object-store files)

1.1. Where the indexes are stored

Ceph keeps these indexes in omap (the key/value data attached to a RADOS object); both kinds can be inspected directly with rados, as shown after the list:

  • rbd – each RBD pool has a single rbd_directory object
  • rgw – each bucket has one or more index objects
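
A minimal way to peek at both kinds of index from the command line. The pool names are the ones used later in this post, and the .dir.{bucket_id} naming of the bucket index object is the usual RGW convention (the bucket id comes from radosgw-admin bucket stats, see section 1.4):

# the per-pool RBD directory: id <-> name mapping for every image in the pool
rados -p test001 listomapvals rbd_directory

# an RGW bucket index object: its omap keys are the object names in the bucket
rados -p default.rgw.buckets.index listomapkeys .dir.<bucket_id>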

1.2. RBD addressing

  • Receive the RBD client request

    The request carries:

    • RBD auth credentials
    • pool name
    • image name
  • How this information is used to find the corresponding 4 MB object files on the OSDs
    • Every RBD pool has an rbd_directory object; it (its data for format 1 images, its omap metadata for format 2 images) stores a two-way mapping between image id and image name for every image in the pool.

      For example:

      rados -p test001  listomapvals rbd_directory
      id_603d6b8b4567
      value (8 bytes) :
      00000000  04 00 00 00 74 65 73 74                           |....test|
      00000008
      
      id_60486b8b4567
      value (8 bytes) :
      00000000  04 00 00 00 61 61 61 61                           |....aaaa|
      00000008
      
      name_aaaa
      value (16 bytes) :
      00000000  0c 00 00 00 36 30 34 38  36 62 38 62 34 35 36 37  |....60486b8b4567|
      00000010
      
      name_test
      value (16 bytes) :
      00000000  0c 00 00 00 36 30 33 64  36 62 38 62 34 35 36 37  |....603d6b8b4567|
      00000010
      

      This lets you quickly look up an image's id from its name, or its name from its id.

    • The data objects of an image are all named rbd_data.{id}.{object number}, so once the id is known every object file belonging to the volume can be located.

    • Other image information, such as snapshots, can be obtained from the omap of rbd_header.{id}:

      • create_timestamp – creation timestamp
      • features – rbd feature bits
      • object_prefix – data-object name prefix (rbd_data.{id})
      • order – object size is 2**order bytes (0x16 = 22 here, i.e. 4 MiB objects)
      • parent – parent image id and snap id (parent -> snap -> clone, i.e. child volume)
      • size – image size in bytes (0x280000000 = 10 GiB here)
      • snap_seq – snapshot sequence number; the snapshot_{snap_id} keys hold the per-snapshot details
      create_timestamp
      value (8 bytes) :
      00000000  bb 39 68 60 72 aa 49 1b                           |.9h`r.I.|
      00000008
      
      features
      value (8 bytes) :
      00000000  3d 00 00 00 00 00 00 00                           |=.......|
      00000008
      
      object_prefix
      value (25 bytes) :
      00000000  15 00 00 00 72 62 64 5f  64 61 74 61 2e 36 30 34  |....rbd_data.604|
      00000010  38 36 62 38 62 34 35 36  37                       |86b8b4567|
      00000019
      
      order
      value (1 bytes) :
      00000000  16                                                |.|
      00000001
      
      parent
      value (48 bytes) :
      00000000  01 01 2a 00 00 00 08 00  00 00 00 00 00 00 0e 00  |..*.............|
      00000010  00 00 66 62 64 66 61 63  35 63 65 66 61 31 66 38  |..fbdfac5cefa1f8|
      00000020  de 00 00 00 00 00 00 00  00 00 00 00 19 00 00 00  |................|
      00000030
      
      size
      value (8 bytes) :
      00000000  00 00 00 80 02 00 00 00                           |........|
      00000008
      
      snap_seq
      snapshot_0000000000000173
      value (113 bytes) :
      00000000  06 01 6b 00 00 00 73 01  00 00 00 00 00 00 04 00  |..k...s.........|
      00000010  00 00 74 65 73 74 00 00  00 00 19 00 00 00 3d 00  |..test........=.|
      00000020  00 00 00 00 00 00 01 01  2a 00 00 00 08 00 00 00  |........*.......|
      00000030  00 00 00 00 0e 00 00 00  66 62 64 66 61 63 35 63  |........fbdfac5c|
      00000040  65 66 61 31 66 38 de 00  00 00 00 00 00 00 00 00  |efa1f8..........|
      00000050  00 00 19 00 00 00 00 00  00 00 00 00 00 00 00 01  |................|
      00000060  01 04 00 00 00 00 00 00  00 c4 7e 69 60 bc 83 2c  |..........~i`..,|
      00000070  0b                                                |.|
      00000071
      
  • Once all the objects have been located they are returned to the client and assembled, which yields the complete RBD image (a manual walk of this chain with rados is sketched below).
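
To make the chain concrete, here is a minimal manual walk with rados, using the test001 pool and the aaaa image (id 60486b8b4567) from the dumps above; the data-object name in the last step is illustrative:

# 1. image name -> image id, from the rbd_directory omap
rados -p test001 getomapval rbd_directory name_aaaa

# 2. image metadata (size, order, features, snapshots, ...) from rbd_header.{id}
rados -p test001 listomapvals rbd_header.60486b8b4567

# 3. every data object of the image shares the prefix rbd_data.{id}
rados -p test001 ls | grep '^rbd_data.60486b8b4567' | sort

# 4. object N covers byte range [N * 2^order, (N+1) * 2^order) of the image
#    (order 0x16 = 22 here, i.e. 4 MiB objects), so fetching each object and
#    writing it at that offset into a sparse file reassembles the image
rados -p test001 get rbd_data.60486b8b4567.0000000000000000 /tmp/obj_0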

1.3. Side notes

The listomapvals command

As an aside, the flow of rados listomapvals and the other omap commands:

  • the client uses the osdmap obtained from the mon plus the CRUSH algorithm to compute which OSD holds the object (this placement can be reproduced with ceph osd map, as shown below)
  • it then reads the omap on that OSD, and the result is returned to the client
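
The placement computation can be reproduced without reading the object at all; ceph osd map runs the same CRUSH mapping for a given pool and object name (the output line below is illustrative):

ceph osd map test001 rbd_directory
# osdmap e43 pool 'test001' (2) object 'rbd_directory' -> pg 2.30a98c1c (2.4) -> up ([0], p0) acting ([0], p0)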

Configuring the RBD client admin socket (asok) and log file

Prerequisite: ceph-common must be installed, otherwise no asok file is generated.

# /etc/ceph/ceph.conf
[client]
debug_rbd=1
debug_rados=1
log_file=/tmp/ceph-client-$cluster-$type-${id}-${pid}.$cctid.log
admin_socket = /tmp/ceph-client-$cluster-$type.$id.$pid.$cctid.asok

A few commonly used debug options: debug_objecter, debug_ms, debug_rbd, debug_auth.

The most common client-side symptom is a hang. The command below shows which I/Os are currently in flight; when several disks are mapped you first have to work out which asok belongs to which disk.

ceph daemon ./ceph-client-ceph-client.cinder.135232.93839795954376.asok objecter_requests
# output
{
    "ops": [],
    "linger_ops": [
        {
            "linger_id": 1,
            "pg": "1.7b816156",
            "osd": 87,
            "object_id": "rbd_header.11f321e6d16ef1",
            "object_locator": "@1",
            "target_object_id": "rbd_header.11f321e6d16ef1",
            "target_object_locator": "@1",
            "paused": 0,
            "used_replica": 0,
            "precalc_pgid": 0,
            "snapid": "head",
            "registered": "1"
        }
    ],
    "pool_ops": [],
    "pool_stat_ops": [],
    "statfs_ops": [],
    "command_ops": []
}

The linger_ops entry is the long-lived watch (heartbeat) on the image header; the rbd_header object name gives you the image id directly.

rados getomapval rbd_directory id_11f321e6d16ef1 -p volumes
# output
00000000  2b 00 00 00 76 6f 6c 75  6d 65 2d 38 61 62 62 35  |+...volume-8abb5|
00000010  33 31 38 2d 31 63 32 35  2d 34 36 33 35 2d 62 32  |318-1c25-4635-b2|
00000020  34 36 2d 30 63 39 66 36  34 32 36 37 65 63 33     |46-0c9f64267ec3|
0000002f

1.4. RGW addressing

  • Receive the client request

    The request carries:

    • auth credentials
    • bucket name
    • object name
  • How this information is used to find all the 4 MB object files that back the RGW object

    • resolve the bucket to its index object id(s)
    • use CRUSH to compute which OSD holds the index
    • read the index omap to get the list of objects
    • return them to RGW
    • RGW assembles them and returns the result to the client (a rough manual walk with radosgw-admin/rados is sketched below)
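
A rough manual version of the same chain. The bucket name mybucket is a placeholder, the pool names are the RGW defaults, and the .dir.{bucket_id} index object and marker-prefixed data-object names are the usual RGW conventions, so treat this as a sketch rather than an exact recipe:

# 1. bucket name -> bucket id / marker
radosgw-admin bucket stats --bucket=mybucket | grep -E '"id"|"marker"'

# 2. the bucket index object lives in the index pool;
#    its omap keys are the names of the objects stored in the bucket
rados -p default.rgw.buckets.index listomapkeys .dir.<bucket_id>

# 3. the RADOS objects backing the bucket data carry the bucket marker as a prefix
rados -p default.rgw.buckets.data ls | grep '^<marker>'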

1.5. A data-recovery idea

1.5.1. Scenario

After a data-center power loss the data and omap are corrupted, the services will not start, the cluster can neither read nor write, and the RBD metadata cannot be read.

1.5.2. Approach

Build a new cluster, import the metadata, then import the data.

  • Build a new cluster of the same scale (same number of OSDs) as the current one, so that CRUSH produces the same placement
  • Recreate the pools. The pool definitions can be read from the osdmap found in the OSD meta directory, and the crushmap can in turn be extracted from the osdmap:
# extract and decompile the crushmap
osdmaptool osdmap.43__0_641716DC__none --export-crush /tmp/crushmap.bin
crushtool -d /tmp/crushmap.bin -o /tmp/crushmap.txt
# print the osdmap (pool definitions, flags, ...)
osdmaptool --print osdmap.43__0_641716DC__none
#----
epoch 43
fsid 8685ec71-96a6-413a-9e4d-ff47071dc4f5
created 2020-12-22 16:57:37.845173
modified 2021-04-04 15:51:46.729929
flags sortbitwise,recovery_deletes,purged_snapdirs
crush_version 6
full_ratio 0.95
backfillfull_ratio 0.9
nearfull_ratio 0.85
require_min_compat_client jewel
min_compat_client jewel
require_osd_release luminous

pool 1 '.rgw.root' replicated size 1 min_size 1 crush_rule 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 7 owner 18446744073709551615 flags hashpspool stripe_width 0 application rgw
pool 2 'test001' replicated size 1 min_size 1 crush_rule 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 40 flags hashpspool stripe_width 0
	removed_snaps [1~3]
pool 3 'default.rgw.control' replicated size 1 min_size 1 crush_rule 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 21 owner 18446744073709551615 flags hashpspool stripe_width 0 application rgw
pool 4 'default.rgw.meta' replicated size 1 min_size 1 crush_rule 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 23 owner 18446744073709551615 flags hashpspool stripe_width 0 application rgw
pool 5 'default.rgw.log' replicated size 1 min_size 1 crush_rule 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 25 owner 18446744073709551615 flags hashpspool stripe_width 0 application rgw
pool 6 'default.rgw.buckets.index' replicated size 1 min_size 1 crush_rule 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 28 owner 18446744073709551615 flags hashpspool stripe_width 0 application rgw
pool 7 'default.rgw.buckets.data' replicated size 1 min_size 1 crush_rule 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 31 owner 18446744073709551615 flags hashpspool stripe_width 0 application rgw
pool 8 'default.rgw.buckets.non-ec' replicated size 1 min_size 1 crush_rule 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 34 owner 18446744073709551615 flags hashpspool stripe_width 0 application rgw

max_osd 1
osd.0 up   in  weight 1 up_from 42 up_thru 42 down_at 41 last_clean_interval [37,40) 172.16.10.10:6801/12622 172.16.10.10:6802/12622 172.16.10.10:6803/12622 172.16.10.10:6804/12622 exists,up 59ebf8d5-e7f7-4c46-8e05-bac5140eee89
  • Recover the rbd_directory metadata of each pool
  • Recover the rbd_header metadata
  • Use the map to find the OSDs that hold rbd_directory and rbd_header, and put the metadata back into their omap (a sketch with ceph-objectstore-tool follows below)
  • Once the metadata has been rebuilt, either swap the rebuilt omap back into the old cluster, or copy the old cluster's data into the new one, and then start the OSDs
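
One possible way to move the omap entries between clusters while the OSDs are offline is ceph-objectstore-tool's per-object omap operations. This is only a sketch: the data path, pgid and key are illustrative, and the OSD must be stopped while the tool runs:

# dump one omap key of rbd_directory from an OSD of the damaged cluster
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 --pgid 2.4 \
    rbd_directory get-omap name_test > name_test.val

# write it back into the matching object on the new cluster's OSD
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 --pgid 2.4 \
    rbd_directory set-omap name_test < name_test.val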

This looks feasible, but the relationship between PGs and the osdmap, and omap versioning, do not look easy to get right; I will revisit it when there is a real scenario or need.