ASM Fast Mirror Resync Explained Experimentally
ASM fast mirror resync is a new 11g ASM feature to restore redundancy quickly after a transient disk path failure. With 10g ASM, if redundancy is used and a disk path fails, the disk is taken offline and, shortly after, dropped. In order to preserve the redundancy, all the existing redundant mirror extent copies are resynchronized to newly created extents in the remaining disks of the disk group. This is a very costly operation. With 11g ASM fast mirror resync, when a transient disk path failure occurs, disks are also taken offline, but they are not dropped during a period of time determined by the value of a new disk group attribute, DISK_REPAIR_TIME. ASM keeps track of all extents modified during the failure. After the failure is fixed, only the modified extents are resynchronized, so restoring the redundancy is very fast. It is important to note that the failure is a transient one: a failure that occurs while ASM tries to write to or read from an extent on a disk. There is no damaged disk or data corruption involved; your data stays intact on the disk, but it cannot be accessed due to a failure of a cable, controller, etc.
The DISK_REPAIR_TIME attribute of the disk group determines the tolerated outage duration. In other words, this is the acceptable window during which you need to fix the failure. This setting is also ASM's countdown timer for dropping the disks that have been taken offline, so keep an eye on it while repairing the failure. If the remaining time is not enough, extend it, as long as you believe you can fix the failure and can tolerate the reduced redundancy for longer.
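For example, the remaining seconds before the offline disks are dropped can be watched through the REPAIR_TIMER column of V$ASM_DISK, and the window can be extended by re-issuing the OFFLINE statement with a DROP AFTER clause. The statements below are only a sketch; the disk group and failure group names match the ones created later in this post, and 10h is an arbitrary illustration.
-- watch the remaining seconds before offline disks are dropped
select name, failgroup, mode_status, repair_timer from v$asm_disk where mode_status = 'OFFLINE';
-- extend the window for the already offline disks if the repair needs more time
alter diskgroup dg1 offline disks in failgroup dg1_fg02 drop after 10h;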
Note that it is always possible to take the disks offline in advance using the ALTER DISKGROUP ... OFFLINE SQL statement, for preventive planned maintenance purposes.
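For instance, before planned maintenance on one failure group, something like the statement below could be used (the names match the scenario that follows; without a DROP AFTER clause, the disk group's DISK_REPAIR_TIME applies).
alter diskgroup dg1 offline disks in failgroup dg1_fg02;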
After this brief introduction, I would like to make this note more useful by doing something practical instead of re-explaining the concepts, which is already done properly in the Oracle documentation. That is why I have prepared the following scenario.
Scenario
I will create a transient disk path failure, fix the failure and finally put the failed disks back online. The objective is to set up “ASM Fast Mirror Resync” and monitor it before and after a transient disk path failure. In this scenario, I have used an 11gR2 Oracle Database and an 11gR2 ASM instance on an OEL 5.4 box. If you would like to replay this scenario yourself, create or find a sandbox, in other words, a box which does not have any significant importance to others. Otherwise some tasks performed here may cause terrible consequences that you would never want to happen to you.
The following tasks will be performed in the order given.
1. Perform prerequisite actions to use ASM fast mirror resync
2. Simulate a transient failure
3. Fix the transient failure
4. Monitor the Fast Mirror Resync in action
All the scripts used in the scenario are listed at the end of this post.
1. Perform prerequisite actions to use ASM fast mirror resync
As mentioned above, we need to use ASM redundancy (not external redundancy). Create a disk group with normal redundancy and a tablespace having a datafile in this disk group:
create diskgroup dg1 normal redundancy
failgroup dg1_fg01 disk
'/dev/raw/raw1' name raw_disk1,
'/dev/raw/raw2' name raw_disk2,
'/dev/raw/raw5' name raw_disk5
failgroup dg1_fg02 disk
'/dev/raw/raw3' name raw_disk3,
'/dev/raw/raw4' name raw_disk4,
'/dev/raw/raw6' name raw_disk6;
create tablespace asmfmr_ts datafile '+DG1/asmfmr_ts_01.dbf' size 128M;
The ASM disk group attributes COMPATIBLE.RDBMS and COMPATIBLE.ASM must be set to a minimum software release of 11.1; therefore a 10g database cannot use this feature. Set up fast mirror resync as below. Setting DISK_REPAIR_TIME is optional; it has a default value of 3.6h.
alter diskgroup dg1 set attribute 'compatible.asm' = '11.2.0.0.0';
alter diskgroup dg1 set attribute 'compatible.rdbms' = '11.2.0.0.0';
alter diskgroup dg1 set attribute 'disk_repair_time'='5h';
List the new attributes we have just updated using the script dg_attribs.sql:
SQL> @dg_attribs.sql
DG NAME ATTRIB NAME VALUE READ_ON
-------------------- ------------------------------ -------------------- -------
DG1 disk_repair_time 5h N
DG1 au_size 1048576 Y
DG1 sector_size 512 Y
DG1 access_control.umask 066 N
DG1 compatible.rdbms 11.2.0.0.0 N
DG1 cell.smart_scan_capable FALSE N
DG1 access_control.enabled FALSE N
DG1 compatible.asm 11.2.0.0.0 N
2. Simulate a transient failure
Below, as can be seen from the paths of the ASM disks, I have used raw devices as ASM storage. Each raw device corresponds to a disk in a disk group. Since we have the ability to change the file access privileges of raw devices, I can create a transient disk path failure by changing the read-write permissions of the raw devices.
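For reference, the permissions of the raw devices can be checked before and after injecting the failure; on my box they have mode 660, but your layout may differ.
# ls -l /dev/raw/raw*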
Below is the status of the ASM disks just before the failure.
SQL> @chkasmdiskstat.sql
NAME PATH MOUNT_S HEADER_STATU MODE_ST STATE FAILGROUP REPAIR_TIMER
--------------- --------------- ------- ------------ ------- -------- ---------- ------------
RAW_DISK6 /dev/raw/raw6 CACHED MEMBER ONLINE NORMAL DG1_FG02 0
RAW_DISK5 /dev/raw/raw5 CACHED MEMBER ONLINE NORMAL DG1_FG01 0
RAW_DISK4 /dev/raw/raw4 CACHED MEMBER ONLINE NORMAL DG1_FG02 0
RAW_DISK3 /dev/raw/raw3 CACHED MEMBER ONLINE NORMAL DG1_FG02 0
RAW_DISK2 /dev/raw/raw2 CACHED MEMBER ONLINE NORMAL DG1_FG01 0
RAW_DISK1 /dev/raw/raw1 CACHED MEMBER ONLINE NORMAL DG1_FG01 0
Below, I am simulating a transient disk path failure by changing the read-write permissions of the disks (raw devices) to 000. From this point on, ASM cannot read from or write to these disks.
# chmod 000 /dev/raw/raw6 /dev/raw/raw4 /dev/raw/raw3
At this stage it is better to execute a workload that makes some modifications to the existing redundant extent copies on the surviving disks in failure group DG1_FG01. It will also make ASM realize that some disks are missing while it tries to mirror the redundant copies to the extents of the failed disks.
SQL> @workload.sql
As can be seen below, ASM cannot access any disk in failure group DG1_FG02 and has taken these disks offline. Also, the disk repair timer has started. Besides this information, you can see further details about the consequences of this failure in the alert.log of the ASM instance if you want.
SQL> @chkasmdiskstat.sql
NAME PATH MOUNT_S HEADER_STATU MODE_ST STATE FAILGROUP REPAIR_TIMER
--------------- --------------- ------- ------------ ------- -------- ---------- ------------
RAW_DISK3 MISSING UNKNOWN OFFLINE NORMAL DG1_FG02 18000
RAW_DISK4 MISSING UNKNOWN OFFLINE NORMAL DG1_FG02 18000
RAW_DISK6 MISSING UNKNOWN OFFLINE NORMAL DG1_FG02 18000
RAW_DISK5 /dev/raw/raw5 CACHED MEMBER ONLINE NORMAL DG1_FG01 0
RAW_DISK2 /dev/raw/raw2 CACHED MEMBER ONLINE NORMAL DG1_FG01 0
RAW_DISK1 /dev/raw/raw1 CACHED MEMBER ONLINE NORMAL DG1_FG01 0
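If you would like to follow these events as they happen, you can tail the ASM alert log; the path below assumes a default 11gR2 ADR layout, so adjust it to your environment.
# tail -f $ORACLE_BASE/diag/asm/+asm/+ASM/trace/alert_+ASM.log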
3. Fix the transient failure
The fix is simply to revert the raw device read-write permissions to 660. This makes the raw devices accessible by ASM again.
# chmod 660 /dev/raw/raw6 /dev/raw/raw4 /dev/raw/raw3
Now it is time to take the disks online.
SQL> alter diskgroup dg1 online disks in failgroup dg1_fg02;
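As a side note, if offline disks are spread over several failure groups, all of them can also be brought back with a single statement; in this scenario the variant below would be equivalent.
alter diskgroup dg1 online all;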
4. Monitor the Fast Mirror Resync in action
Below it can be seen that the stale extents of the disks taken online are being resynchronized. Look at the devices below having the value SYNCING in the column MODE_STATUS.
SQL> @chkasmdiskstat.sql
NAME PATH MOUNT_S HEADER_STATU MODE_ST STATE FAILGROUP REPAIR_TIMER
--------------- --------------- ------- ------------ ------- ------- ----------- ------------
RAW_DISK6 /dev/raw/raw6 CACHED MEMBER SYNCING NORMAL DG1_FG02 17297
RAW_DISK5 /dev/raw/raw5 CACHED MEMBER ONLINE NORMAL DG1_FG01 0
RAW_DISK4 /dev/raw/raw4 CACHED MEMBER SYNCING NORMAL DG1_FG02 17297
RAW_DISK3 /dev/raw/raw3 CACHED MEMBER SYNCING NORMAL DG1_FG02 17297
RAW_DISK2 /dev/raw/raw2 CACHED MEMBER ONLINE NORMAL DG1_FG01 0
RAW_DISK1 /dev/raw/raw1 CACHED MEMBER ONLINE NORMAL DG1_FG01 0
The syncing operation is also represented by the status ONLINE in the OPERATION column of the V$ASM_OPERATION view. See below.
SQL> @chkasmoper.sql
GROUP_NUMBER OPERA STAT
------------ ----- ----
1 ONLIN RUN
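V$ASM_OPERATION also carries progress columns, so a rough remaining-time estimate can be pulled with a query like the one below (EST_MINUTES can be empty at the very beginning of the operation).
select group_number, operation, state, sofar, est_work, est_minutes from v$asm_operation;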
After the syncing operation completed, redundancy has been restored and the disks are online again. See below.
SQL> @chkasmdiskstat.sql
NAME PATH MOUNT_S HEADER_STATU MODE_ST STATE FAILGROUP REPAIR_TIMER
--------------- --------------- ------- ------------ ------- ------- ----------- ------------
RAW_DISK6 /dev/raw/raw6 CACHED MEMBER ONLINE NORMAL DG1_FG02 0
RAW_DISK5 /dev/raw/raw5 CACHED MEMBER ONLINE NORMAL DG1_FG01 0
RAW_DISK4 /dev/raw/raw4 CACHED MEMBER ONLINE NORMAL DG1_FG02 0
RAW_DISK3 /dev/raw/raw3 CACHED MEMBER ONLINE NORMAL DG1_FG02 0
RAW_DISK2 /dev/raw/raw2 CACHED MEMBER ONLINE NORMAL DG1_FG01 0
RAW_DISK1 /dev/raw/raw1 CACHED MEMBER ONLINE NORMAL DG1_FG01 0
List of scripts used in this post
-- chkasmdiskstat.sql
-- Show ASM disks status
set pagesize 100
set linesize 200
col path for a15
col name for a15
col failgroup for a15
select disk_number,name,path,mount_status,header_status,mode_status,state,failgroup,repair_timer
from v$asm_disk;
-- dg_attribs.sql
-- show disk group attributes
col "dg name" for a20
col "attrib name" for a30
col value for a20
set pagesize 20
select dg.name "dg name"
,a.name "attrib name"
,a.value
,read_only
from v$asm_diskgroup dg,
v$asm_attribute a
where dg.name = 'DG1'
and dg.group_number = a.group_number;
-- chkasmoper.sql
-- show ongoing ASM operations
select * from v$asm_operation;
-- workload.sql
-- create a schema having a table using table space asmfmr_ts which has data files in the disk group DG1
-- This script is going to modify the contents of the extents on the disk group DG1
connect / as sysdba
create user asmfmr identified by asmfmr default tablespace asmfmr_ts;
grant create session, create table to asmfmr;
alter user asmfmr quota unlimited on asmfmr_ts;
connect asmfmr/asmfmr
-- the drop fails harmlessly on the first run, when the table does not exist yet
drop table table_workload;
create table table_workload (col1 number, col2 varchar2(200));
begin
insert into table_workload values ( 1,'workload data');
for i in 1..20 loop
insert into table_workload select * from table_workload;
end loop;
end;
/
commit;
select count(1) from table_workload;
exit
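Finally, if you replayed the scenario on a sandbox and want to tear it down afterwards, a cleanup sketch could look like this; the statements assume the names used above, and the DROP DISKGROUP must be run while connected to the ASM instance.
-- cleanup.sql (not part of the original scenario)
drop user asmfmr cascade;
drop tablespace asmfmr_ts including contents and datafiles;
-- run this one on the ASM instance:
drop diskgroup dg1 including contents;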