Problem: Adding an un-wiped disk to a disk group triggers corrupt blocks

Original link: https://www.modb.pro/topic/5927 (copy it into your browser to view)

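Introduction: When an ASM disk group in a production system is about to run out of space, adding disks to expand it is one of the most common remedies, and almost every professional DBA has done it. But what happens if the disk being added to the ASM disk group has not been wiped beforehand? This article shares a recent customer case in which adding an un-wiped disk to a disk group triggered corrupt blocks (Read datafile mirror), as a reminder to take care.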


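Problem Description

The system maintenance staff notified us that the Oracle software directory had suddenly filled up and needed to be cleaned up promptly. After logging in to the environment, we found that the alert log was being written to continuously, and the new entries all reported that corrupt blocks had been detected.

Part of the alert log is shown below: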

<code>Reading datafile '+xx01/xxx85' for corruption at rdba: 0x1c4b3afc (file x3,  block 474692348)
Read datafile mirror 'xxx02' (file x3, block 47xx48) found same corrupt data (no logical check)
Read datafile mirror ' xxx 53' (file x3, block 47xx48) found valid data
Hex dump of (file x3, block 47xx48) in trace file /xxx130931.trc
Repaired corruption at (file x3, block 47 xxx 48)
Hex dump of (file x3, block 47xxx24) in trace file /xx931.trc
Corrupt block relative dba: 0x1c308c08 (file x3, block 47xx4)
Bad header found during buffer read
Data in bad block:
type: 0 format: 6 rdba: 0x34363835
last change scn: 0x3833.35313431 seq: 0x30 flg: 0x37
spare1: 0x31 spare2: 0x36 spare3: 0xf00
consistency value in tail: 0x30520300
check value in block header: 0x36

computed block checksum: 0x6060
Reading datafile '+xxxx8685' for corruption at rdba: 0x1c308c08 (file x3, block 47xx24)
Read datafile mirror 'xxxx2' (file x3, block 47xx24) found same corrupt data (no logical check)
Read datafile mirror 'xxx3' (file x3, block 47xx24) found valid data
Hex dump of (file x3, block 47xxx24) in trace file /xxx0931.trc
Sat Nov 09 12:48:17 2019
Hex dump of (file x3, block 14xxx7) in trace file /xxx22.trc
Corrupt block relative dba: 0x1ed647db (file x3, block 14xxx7)
Bad header found during buffer read
Data in bad block:
type: 73 format: 6 rdba: 0x5454415f
last change scn: 0x0e00.00440052 seq: 0x0 flg: 0x00
spare1: 0x53 spare2: 0x54 spare3: 0x0
consistency value in tail: 0x01006541
check value in block header: 0xa00
block checksum disabled
Reading datafile '+xxxx17527' for corruption at rdba: 0x1ed647db (file x3, block 14xxx7)
Read datafile mirror 'xx002' (file x3, block 14xxx7) found same corrupt data (no logical check)
Read datafile mirror 'xxx0' (file x3, block 14xxx7) found valid data
Hex dump of (file x3, block 14xx7) in trace file /xxx2.trc
Repaired corruption at (file x3, block 14xxx7)</code>


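Problem Analysis

Using the information in the alert log, we looked up the affected data blocks and found that the segments involved included both tables and indexes.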

<code>select  relative_fno,owner,segment_name ,segment_type   from dba_extents where file_id = x3 and 35xxxx9 between block_id and  block_id + blocks -1;
RELATIVE_FNO OWNER SEGMENT_NAME SEGMENT_TYPE
------------ ----- ------------ ------------
        1024 IxxxL PxxxT        INDEX

RELATIVE_FNO OWNER SEGMENT_NAME SEGMENT_TYPE
------------ ----- ------------ ------------
         124 IxxxM OxxxT        TABLE</code>
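The `rdba` values printed in the alert log can be mapped back to the file and block numbers used in a `dba_extents` lookup like the one above. The following is a minimal Python sketch, not part of the original investigation. It assumes the usual smallfile layout, in which the upper 10 bits of the 32-bit rdba hold the relative file number and the lower 22 bits hold the block number; for a bigfile tablespace it assumes the whole value is the block number. The function name `decode_rdba` is purely illustrative.

```python
from typing import Optional, Tuple


def decode_rdba(rdba: int, bigfile: bool = False) -> Tuple[Optional[int], int]:
    """Split a 32-bit relative data block address into (relative file#, block#)."""
    if not 0 <= rdba <= 0xFFFFFFFF:
        raise ValueError("rdba must fit in 32 bits")
    if bigfile:
        # Assumption: a bigfile tablespace uses all 32 bits for the block number,
        # so no file number can be derived from the rdba itself.
        return None, rdba
    # Assumption: smallfile layout, 10 bits of file number + 22 bits of block number.
    return rdba >> 22, rdba & 0x3FFFFF


if __name__ == "__main__":
    # rdba copied from the alert log excerpt in this article
    file_no, block_no = decode_rdba(0x1ED647DB)
    print(f"relative file# = {file_no}, block# = {block_no}")
```

Inside the database, `DBMS_UTILITY.DATA_BLOCK_ADDRESS_FILE` and `DBMS_UTILITY.DATA_BLOCK_ADDRESS_BLOCK` perform the same split for smallfile tablespaces.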


Verification with DBV (DBVERIFY):

<code>
……
Page 278199 is marked corrupt
Corrupt block relative dba: 0x21843eb7 (file x4, block 2xx9)
Bad header found during dbv:
Data in bad block:
type: 0 format: 4 rdba: 0x0000ffff
last change scn: 0x0000.00000000 seq: 0x0 flg: 0x1d
spare1: 0x0 spare2: 0xa spare3: 0x0
consistency value in tail: 0x31040000
check value in block header: 0x1500
computed block checksum: 0xe403

Page 278200 is marked corrupt
Corrupt block relative dba: 0x21843eb8 (file x4, block 2xx0)
Bad header found during dbv:
Data in bad block:
type: 48 format: 0 rdba: 0x000a0018
last change scn: 0x3031.31060000 seq: 0x30 flg: 0x30
spare1: 0x30 spare2: 0x0 spare3: 0x19
consistency value in tail: 0x000b0000
check value in block header: 0x31
block checksum disabled
…… (n lines omitted here)</code>

The related trace file records:

<code>
Corrupt block relative dba: 0x2180ba80 (file x4, block 4xx4)
Bad header found during user buffer read
Data in bad block:
type: 82 format: 0 rdba: 0x534e4901
last change scn: 0x4546.464f2e54 seq: 0x52 flg: 0x5f
spare1: 0x0 spare2: 0x0 spare3: 0x5453
consistency value in tail: 0x0908bdf2
check value in block header: 0x4e49
computed block checksum: 0x66c6
Reading datafile '+xxx05' for corruption at rdba: 0x2180ba80 (file x4, block 4xx4)
ksfdrfms:Mirror Read file=+xxx905 fob=0x246076cb80 bufp=0x7f9a07619c00 blkno=47744 nbytes=8192
ksfdrfms: Read success from mirror side=1 logical extent number=0 disk=xxx2 path=/dev/axxx1
Mirror I/O done from ASM disk /dev/axxx1
Read datafile mirror 'xxx02' (file x4, block 4xx4) found same corrupt data (no logical check)
ksfdrnms:Mirror Read file=+xxx7905 fob=0x246076cb80 bufp=0x7f9a07619c00 nbytes=8192
ksfdrnms: Read success from mirror side=2 logical extent number=1 disk=xxx3 path=/dev/axxx4
Mirror I/O done from ASM disk /dev/axxx4
Read datafile mirror 'xxx3' (file x4, block 4xx4) found valid data
Hex dump of (file x4, block 4xx4)</code>

On closer inspection, every corrupt-block report looks very similar, for example:

Read datafile mirror 'xxx2'(file x3, block 47xxx48) found same corrupt data (no logical check)

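Reading further through the logs, we found a common pattern: the same data blocks were found both on the disk named xxx2 and on other disks. The valid copies were always on the other disks, while the invalid, corrupt copies were all on disk /dev/axxx1 (that is, the disk named xxx2). We therefore suspected that something done to this disk was responsible. Further investigation showed that this disk had originally been a member of disk group xxx1, had dropped out of the disk group for some reason, and had then been added back the day before the errors began. At that point the root cause was clear: the old data on /dev/axxx1 had not been wiped, so after the disk was added the old blocks conflicted with the new blocks, the database kept reporting errors, and the resulting output filled up the software directory.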
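The pattern described above (one disk always returning the corrupt copy, the others returning the valid copy) can be checked mechanically instead of by eye. The following is a minimal Python sketch that tallies the `Read datafile mirror` lines of an alert log per disk name. The regular expression is written against the message format shown in this article only, and `tally_mirror_reads` is an illustrative name. Note that the disk names in the excerpts here are masked inconsistently, so the grouping is only meaningful on an unmasked log.

```python
import re
from collections import Counter

# Matches lines such as:
#   Read datafile mirror 'NAME' (file N, block M) found same corrupt data (no logical check)
#   Read datafile mirror 'NAME' (file N, block M) found valid data
MIRROR_RE = re.compile(
    r"Read datafile mirror '\s*(?P<disk>[^']*?)\s*'.*?found "
    r"(?P<result>same corrupt data|valid data)"
)


def tally_mirror_reads(lines):
    """Count, per mirror disk name, how often it returned corrupt vs. valid data."""
    corrupt, valid = Counter(), Counter()
    for line in lines:
        match = MIRROR_RE.search(line)
        if not match:
            continue  # not a mirror-read message
        bucket = corrupt if match.group("result") == "same corrupt data" else valid
        bucket[match.group("disk")] += 1
    return corrupt, valid


if __name__ == "__main__":
    # Two lines copied from the alert log excerpt in this article
    sample = [
        "Read datafile mirror 'xx002' (file x3, block 14xxx7) found same corrupt data (no logical check)",
        "Read datafile mirror 'xxx0' (file x3, block 14xxx7) found valid data",
    ]
    corrupt, valid = tally_mirror_reads(sample)
    print("corrupt reads per disk:", dict(corrupt))
    print("valid reads per disk:", dict(valid))
```

If a single disk accounts for nearly all of the corrupt reads and none of the valid ones, as in this case, that disk is the natural first suspect.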

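Disk group xxx1 uses NORMAL redundancy. As a brief aside, Oracle divides disk group redundancy into the following three levels according to the number of mirror copies:

1) External redundancy: the data is not mirrored. This suits systems where the underlying storage layer already mirrors the data.

2) Normal redundancy: one mirror copy (two-way mirroring). This level suits most systems.

3) High redundancy: two mirror copies (three-way mirroring). This level suits important system data, at the cost of using more space.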
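The space cost of the three levels above can be illustrated with a rough calculation. The following is a simplified Python sketch that divides raw capacity by the number of copies kept at each level. It deliberately ignores ASM metadata and the free space a disk group needs in order to restore redundancy after a disk failure, so the result is an upper bound rather than a sizing formula. The names `COPIES` and `approx_usable_mb` are illustrative.

```python
# Number of copies of each extent kept at each redundancy level
# (the original plus its mirror copies).
COPIES = {"EXTERNAL": 1, "NORMAL": 2, "HIGH": 3}


def approx_usable_mb(raw_mb: float, redundancy: str) -> float:
    """Rough upper bound on usable space for a given raw capacity and redundancy."""
    copies = COPIES[redundancy.upper()]
    return raw_mb / copies


if __name__ == "__main__":
    for level in ("external", "normal", "high"):
        print(level, approx_usable_mb(3000, level))
```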

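Oracle implements mirroring through failure groups. Because disk group xxx1 uses normal redundancy, Oracle keeps one mirror copy and guarantees that each extent and its mirror are never stored in the same failure group, so no data is lost even if one or more disks in a failure group, or the entire failure group, is lost. When disk /dev/axxx1 was added back to the disk group, ASM rebalancing redistributed the file extents evenly across all disks in the disk group; this process is handled by the RBAL background process. Where the redistributed mirror copies conflicted with the old data on /dev/axxx1, errors were reported.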

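Problem Resolution

We dropped the problem disk from the disk group, wiped it with dd to clear the old data, and then added it back. This resolved the problem and the system recovered.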

<code>
alter diskgroup xxx1 drop disk 'Oxxxx2';

dd if=/dev/zero of=/dev/asxxx1 bs=1M count=256</code>
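The dd command above overwrites the first 256 MiB of the device with zeros. Before adding the disk back, it can be reassuring to confirm that this region really reads back as zeros. The following is a minimal Python sketch of such a check, not something the original team reported running. The function name `first_nonzero_offset` is illustrative, the default length simply mirrors `bs=1M count=256`, and reading a raw device normally requires root privileges. ASM's own view of a disk is the `HEADER_STATUS` column of `V$ASM_DISK`; this sketch is only a byte-level sanity check and does not replace that.

```python
def first_nonzero_offset(path, length=256 * 1024 * 1024, chunk=1024 * 1024):
    """Return the offset of the first non-zero byte within the first `length`
    bytes of `path`, or None if that region contains only zero bytes."""
    offset = 0
    with open(path, "rb") as dev:
        while offset < length:
            data = dev.read(min(chunk, length - offset))
            if not data:
                break  # reached the end of the file or device
            stripped = data.lstrip(b"\x00")
            if stripped:
                # The leading zeros that were stripped give the position in this chunk.
                return offset + (len(data) - len(stripped))
            offset += len(data)
    return None


if __name__ == "__main__":
    import sys

    if len(sys.argv) == 2:
        hit = first_nonzero_offset(sys.argv[1])
        if hit is None:
            print("checked region contains only zeros")
        else:
            print(f"non-zero byte found at offset {hit}")
    else:
        print("usage: python check_wiped.py <device-or-file>")
```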

