The customer's OceanBase cluster backup policy is one full backup per day, with backup files retained for the most recent 4 days. Observation showed, however, that as many as 13 backup sets were actually being kept, driving disk space usage up.
Under normal circumstances a backup file cleanup task runs periodically. Since the number of retained backups exceeded the configured limit, we suspected that the cleanup task had gone wrong.

First, look at how the automatic cleanup interval is calculated:

automatic cleanup interval = min(recovery_window / 2, 1d)

So under normal conditions a backup cleanup task is launched once per day.
Note: for versions earlier than oceanbase-3.1.2-20211230114204 (exclusive), the automatic cleanup interval is recovery_window / 2.
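In the formula, recovery_window corresponds to the cluster's backup retention window. A quick way to confirm the current settings is to query the cluster parameters from the sys tenant; this is a minimal sketch assuming the OceanBase 3.x parameter names backup_recovery_window and auto_delete_expired_backup (check your version's documentation if they differ):

-- Run as root@sys; parameter names assumed per OceanBase 3.x
SHOW PARAMETERS LIKE 'backup_recovery_window';      -- retention window (e.g. 4d)
SHOW PARAMETERS LIKE 'auto_delete_expired_backup';  -- whether automatic cleanup is enabled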
2.1 The backup file cleanup task is stuck!
Information was collected on the 22nd.
A backup cleanup task launched on the 19th had still not finished by the 22nd, so we judged that the cleanup task was stuck.
Query the tenant backup cleanup task information:

MySQL [oceanbase]> select * from __all_tenant_backup_clean_info;
+----------------------------+----------------------------+-----------+--------+----------------------------+----------------------------+-------------+------------------------+--------+------------------+-----------+---------+------------------+--------+---------+
| gmt_create                 | gmt_modified               | tenant_id | job_id | start_time                 | end_time                   | incarnation | type                   | status | parameter        | error_msg | comment | clog_gc_snapshot | result | copy_id |
+----------------------------+----------------------------+-----------+--------+----------------------------+----------------------------+-------------+------------------------+--------+------------------+-----------+---------+------------------+--------+---------+
| 2021-03-19 14:18:50.841475 | 2024-10-19 00:28:50.021733 |         1 |   1187 | 2024-10-19 00:28:49.998736 | 2024-10-19 00:27:09.663132 |           1 | DELETE OBSOLETE BACKUP | DOING  | 1728923329993025 |           |         | 1728845993342237 |      0 |       0 |
| 2024-10-19 00:28:50.030195 | 2024-10-19 00:28:50.030195 |      1001 |   1187 | 2024-10-19 00:28:49.998736 | 2024-10-19 00:27:09.663132 |           1 | DELETE OBSOLETE BACKUP | DOING  | 1728923329993025 |           |         | 1728845993342237 |      0 |       0 |
+----------------------------+----------------------------+-----------+--------+----------------------------+----------------------------+-------------+------------------------+--------+------------------+-----------+---------+------------------+--------+---------+
2 rows in set (0.01 sec)

Next, query the history event information related to cluster backup cleanup:
MySQL [oceanbase]> select * from __all_rootservice_event_history where module = 'backup_clean';
+----------------------------+--------------+------------+-----------+--------+---------------+--------+---------+--------+----------+-------------------------------+-------+--------+-------+--------+------------+----------------+-------------+
| gmt_create                 | module       | event      | name1     | value1 | name2         | value2 | name3   | value3 | name4    | value4                        | name5 | value5 | name6 | value6 | extra_info | rs_svr_ip      | rs_svr_port |
+----------------------------+--------------+------------+-----------+--------+---------------+--------+---------+--------+----------+-------------------------------+-------+--------+-------+--------+------------+----------------+-------------+
| 2024-04-16 18:55:49.799602 | backup_clean | backup_set | tenant_id | 1      | backup_set_id | 1099   | copy_id | 0      | trace_id | YB42B6B49660-0005E8F1C2B2E2CD |       |        |       |        |            | 182.180.150.96 |        2882 |
----------- too much output; some rows omitted -----------
| 2024-10-19 00:28:50.010504 | backup_clean | backup_set | tenant_id | 1      | backup_set_id | 1288   | copy_id | 0      | trace_id | YB42B6B49660-0005E8F1C2B2E845 |       |        |       |        |            | 182.180.150.96 |        2882 |
| 2024-10-19 00:28:50.010521 | backup_clean | backup_set | tenant_id | 1      | backup_set_id | 1289   | copy_id | 0      | trace_id | YB42B6B49660-0005E8F1C2B2E845 |       |        |       |        |            | 182.180.150.96 |        2882 |
+----------------------------+--------------+------------+-----------+--------+---------------+--------+---------+--------+----------+-------------------------------+-------+--------+-------+--------+------------+----------------+-------------+
191 rows in set (29.60 sec)

Diagnostic keyword (seen in the rootservice log): sys tenant clean info status is not stop, can do scheduler
From the logs we can infer that the sys tenant's backup cleanup task status is not "stop", so the backup cleanup scheduler refuses to start a new task.
This matches the query result from __all_tenant_backup_clean_info above, confirming that the backup cleanup task is abnormal.
Reason for the cleanup failure: the previous backup cleanup task has been stuck in the DOING status.
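As a side note, if the goal is simply to clear the DOING state so that the scheduler can run again, the stuck task can be cancelled. This does not fix the root cause (investigated below); the statement is a hedged sketch using the OceanBase 3.x syntax for cancelling backup cleanup, so confirm it against the documentation for your exact version:

-- Cancel the in-progress (stuck) backup cleanup task, then re-check its status
ALTER SYSTEM CANCEL DELETE BACKUP;
SELECT tenant_id, job_id, status, error_msg
FROM __all_tenant_backup_clean_info;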
2.2 Paths cannot be recognized
Next, search the rootservice log using the backup cleanup task's trace_id (a query for pulling the trace_id is sketched below, followed by the grep output).
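The trace_id can be read straight from the event-history view queried earlier; a minimal sketch, relying on the name4/value4 = trace_id pairing shown in that output:

-- Latest backup_clean events, with the trace_id column pulled out
SELECT gmt_create,
       value1 AS tenant_id,
       value2 AS backup_set_id,
       value4 AS trace_id
  FROM __all_rootservice_event_history
 WHERE module = 'backup_clean'
 ORDER BY gmt_create DESC
 LIMIT 10;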
[root@hwc-ocp RS_cl_bigdata]# grep "YB42B6B49660-0005E8F1C2B2E845" rootservice.log.20241022001100 | grep WARN
[2024-10-21 23:37:26.194250] WARN [ARCHIVE] get_file_id_range_ (ob_archive_log_file_store.cpp:501) [101266][774][YB42B6B49660-0005E8F1C2B2E845] [lt=9] [dc=0] failed get_file_range(uri=file:///obbak_bigdata/cl_bigdata_new/cl_bigdata/2/incarnation_1/1003/clog/13/index/1102810163026406/13, ret=-4018, pg_key={tid:1102810163026406, partition_id:13, part_cnt:0})
[2024-10-21 23:37:26.194257] WARN [ARCHIVE] get_index_file_id_range (ob_archive_log_file_store.cpp:281) [101266][774][YB42B6B49660-0005E8F1C2B2E845] [lt=5] [dc=0] failed to get_index_file_id_range(ret=-4018, pg_key={tid:1102810163026406, partition_id:13, part_cnt:0})
[2024-10-21 23:37:26.194264] WARN [ARCHIVE] locate_file_by_log_ts_for_clear (ob_archive_log_file_store.cpp:252) [101266][774][YB42B6B49660-0005E8F1C2B2E845] [lt=5] [dc=0] get_index_file_id_range fail(ret=-4018, pg_key={tid:1102810163026406, partition_id:13, part_cnt:0})
[2024-10-21 23:37:26.217948] WARN [ARCHIVE] get_file_id_range_ (ob_archive_log_file_store.cpp:501) [101266][774][YB42B6B49660-0005E8F1C2B2E845] [lt=8] [dc=0] failed get_file_range(uri=file:///obbak_bigdata/cl_bigdata_new/cl_bigdata/2/incarnation_1/1003/clog/13/index/1102810163026406/216, ret=-4018, pg_key={tid:1102810163026406, partition_id:216, part_cnt:0})

Diagnostic keyword: failed get_file_range
2.3 Abnormal I/O calls
Continuing to dig with the same trace_id, we found the following storage access records:
[root@hwc-ocp RS_cl_bigdata]# grep "access storage" rootservice.log.20241022040332 | grep "YB42B6B49660-0005E8F1C2B2E845"
rootservice.log.20241022040332:[2024-10-22 04:02:23.121397] WARN [STORAGE] print_access_storage_log (ob_storage.cpp:39) [101266][774][YB42B6B49660-0005E8F1C2B2E845] [lt=10] [dc=0] access storage op=list_files uri=file:///obbak_bigdata/cl_bigdata_new/cl_bigdata/2/incarnation_1/1003/clog/13/index/1102810162740407/88 size=0 Byte cost_ts=2129326 us speed=0.00 MB/s
rootservice.log.20241022040332:[2024-10-22 04:02:24.660028] WARN [STORAGE] print_access_storage_log (ob_storage.cpp:39) [101266][774][YB42B6B49660-0005E8F1C2B2E845] [lt=9] [dc=0] access storage op=list_files uri=file:///obbak_bigdata/cl_bigdata_new/cl_bigdata/2/incarnation_1/1003/clog/13/index/1102810162744871/4 size=0 Byte cost_ts=1537551 us speed=0.00 MB/s
rootservice.log.20241022040332:[2024-10-22 04:02:33.558367] WARN [STORAGE] print_access_storage_log (ob_storage.cpp:39) [101266][774][YB42B6B49660-0005E8F1C2B2E845] [lt=13] [dc=0] access storage op=storage reader uri=file:///obbak_bigdata/cl_bigdata_new/cl_bigdata/2/incarnation_1/1003/clog/13/index/1102810162738281/141/2 size=144 Byte cost_ts=1243114 us speed=0.00 MB/s
rootservice.log.20241022040332:[2024-10-22 04:02:38.908843] WARN [STORAGE] print_access_storage_log (ob_storage.cpp:39) [101263][768][YB42B6B49660-0005E8CB48109532] [lt=9] [dc=0] access storage op=get_file_length uri=file:///obbak_bigdata/cl_bigdata_new/cl_bigdata/2/incarnation_1/cluster_clog_backup_info size=0 Byte cost_ts=1176124 us speed=0.00 MB/s

Diagnostic keywords: access storage, speed=0.00 MB/s (each call takes over a second; cost_ts is in microseconds)
The NFS version in the production environment turned out to be 3.x; NFS v3 is a non-standard deployment on the OceanBase side [1].
Conclusion: NFS in production was not configured strictly according to the official documentation, which can cause a range of problems, including this backup cleanup issue.
After reconfiguring NFS strictly per the official documentation and moving to NFS version 4.x, the stuck backup cleanup problem was resolved.
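After the fix, cleanup can be triggered manually instead of waiting for the daily schedule, and the task state watched until it leaves DOING (per the scheduler log message above, a finished task returns to a stop state). The statement below is a hedged sketch using the OceanBase 3.x syntax for deleting expired backups; verify it against your version's documentation first:

-- Manually trigger cleanup of expired (obsolete) backups, then watch the task state
ALTER SYSTEM DELETE OBSOLETE BACKUP;
SELECT tenant_id, job_id, status, start_time, end_time
FROM __all_tenant_backup_clean_info;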
[1] OceanBase NFS deployment recommendations: https://www.oceanbase.com/docs/enterprise-oceanbase-database-cn-10000000000360550
Author: He Wenchao, member of the DBA team in the southern delivery services department of 爱可生 (ActionSky). Mainly responsible for MySQL troubleshooting, MySQL high-availability architecture transformation, and OceanBase technical support. Enjoys football and badminton.