
Linux kernel md source code explained (10): the raid5 sync data flow


The previous section showed that in raid5's sync function sync_request, the actual frying of our potato chips is done by handle_stripe. Everything so far, from creating the array, to allocating the various resources, to establishing each array's personality, has been preparation for the data flow, just as years of hard study are preparation for university. The data-flow code, like campus life, is rich and challenging, but once past this hurdle the kernel code holds no more mystery; what remains is only a matter of time.

First, let's see where handle_stripe takes our potato chips:

3379 static void handle_stripe(struct stripe_head *sh)  
3380 {  
3381         struct stripe_head_state s;  
3382         struct r5conf *conf = sh->raid_conf;  
3383         int i;  
3384         int prexor;  
3385         int disks = sh->disks;  
3386         struct r5dev *pdev, *qdev;  
3387  
3388         clear_bit(STRIPE_HANDLE, &sh->state);  
3389         if (test_and_set_bit_lock(STRIPE_ACTIVE, &sh->state)) {  
3390                 /* already being handled, ensure it gets handled 
3391                  * again when current action finishes */
3392                 set_bit(STRIPE_HANDLE, &sh->state);  
3393                 return;  
3394         }  
3395  
3396         if (test_and_clear_bit(STRIPE_SYNC_REQUESTED, &sh->state)) {  
3397                 set_bit(STRIPE_SYNCING, &sh->state);  
3398                 clear_bit(STRIPE_INSYNC, &sh->state);  
3399         }  
3400         clear_bit(STRIPE_DELAYED, &sh->state);  
3401  
3402         pr_debug("handling stripe %llu, state=%#lx cnt=%d, "
3403                 "pd_idx=%d, qd_idx=%d\n, check:%d, reconstruct:%d\n",  
3404                (unsigned long long)sh->sector, sh->state,  
3405                atomic_read(&sh->count), sh->pd_idx, sh->qd_idx,  
3406                sh->check_state, sh->reconstruct_state);  
3407  
3408         analyse_stripe(sh, &s);

This function is long, so here is the first part: analysing the stripe. The analysis does some preprocessing based on the stripe's state, and from that state decides which concrete operation comes next. Take sync as an example: first the data disks are read; once the reads return, parity is checked; then the parity value is written out. These steps do not all happen in a single handle_stripe call, because all disk IO is asynchronous: handle_stripe must be entered again after each disk request's completion callback. A typical data flow passes through handle_stripe several times, and each entry takes a noticeably different path through the code.
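To make this multi-pass behaviour concrete, here is a toy model of how one sync run is spread over repeated handle_stripe calls. Everything in it is illustrative; the phase names are mine, not the kernel's:

enum sync_phase { NEED_READ, NEED_CHECK, NEED_WRITE_PARITY, DONE };

/* Toy model of the re-entrant sync flow: each disk-IO completion
 * re-queues the stripe, so the handler runs once per phase. */
static enum sync_phase handle_stripe_once(enum sync_phase phase)
{
        switch (phase) {
        case NEED_READ:                    /* pass 1: issue reads on all disks  */
                return NEED_CHECK;         /* read completions re-queue stripe  */
        case NEED_CHECK:                   /* pass 2: xor-check the parity      */
                return NEED_WRITE_PARITY;  /* only if parity mismatched         */
        case NEED_WRITE_PARITY:            /* pass 3: write recomputed parity   */
                return DONE;               /* completion leads to md_done_sync  */
        default:
                return DONE;
        }
}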

struct stripe_head has many state flags, and they determine how the stripe is handled, so they must be managed with great care. There are a lot of them; for now, a quick tour:

enum {  
     STRIPE_ACTIVE,          // currently being handled  
     STRIPE_HANDLE,          // needs handling  
     STRIPE_SYNC_REQUESTED,  // sync requested  
     STRIPE_SYNCING,         // sync in progress  
     STRIPE_INSYNC,          // stripe is in sync  
     STRIPE_PREREAD_ACTIVE,  // preread active  
     STRIPE_DELAYED,         // handling delayed  
     STRIPE_DEGRADED,        // degraded  
     STRIPE_BIT_DELAY,       // waiting on bitmap processing  
     STRIPE_EXPANDING,  
     STRIPE_EXPAND_SOURCE,  
     STRIPE_EXPAND_READY,  
     STRIPE_IO_STARTED,     /* do not count towards 'bypass_count' */   // IO has been issued  
     STRIPE_FULL_WRITE,     /* all blocks are set to be overwritten */  // full-stripe write  
     STRIPE_BIOFILL_RUN,    // biofill running: copying page contents into bios  
     STRIPE_COMPUTE_RUN,    // compute running  
     STRIPE_OPS_REQ_PENDING, // used for queueing handle_stripe work  
     STRIPE_ON_UNPLUG_LIST,  // marks membership of the unplug list during batched release_stripe  
};
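These flags live in sh->state and are driven by the atomic bit operations, exactly as in lines 3388-3394 above. As a userspace analogue of that entry guard (approximating the kernel's test_and_set_bit_lock/set_bit from <linux/bitops.h> with C11 atomics; a sketch, not kernel code):

#include <stdatomic.h>

enum { STRIPE_ACTIVE, STRIPE_HANDLE };                  /* bit numbers */

static int test_and_set_flag(atomic_ulong *state, int nr)
{
        unsigned long mask = 1UL << nr;
        return (atomic_fetch_or(state, mask) & mask) != 0;   /* old bit */
}

static void set_flag(atomic_ulong *state, int nr)
{
        atomic_fetch_or(state, 1UL << nr);
}

/* mirrors handle_stripe lines 3389-3394 */
static void handle_entry_guard(atomic_ulong *state)
{
        if (test_and_set_flag(state, STRIPE_ACTIVE)) {
                /* already being handled: ask for another round and bail */
                set_flag(state, STRIPE_HANDLE);
                return;
        }
        /* ... handle the stripe, clear STRIPE_ACTIVE when done ... */
}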

Line 3388: clear the needs-handling flag.

Line 3389: set the being-handled flag.

Line 3392: if the stripe is already being handled, set the handle-again flag and return.

Line 3396: if this is a sync request...

Line 3397: set the sync-in-progress flag.

Line 3398: clear the in-sync flag.

Line 3400: clear the delayed-handling flag.

Line 3408: analyse the stripe. This function is long, so it is covered in several chunks:

3198 static void analyse_stripe(struct stripe_head *sh, struct stripe_head_state *s)  
3199 {  
3200         struct r5conf *conf = sh->raid_conf;  
3201         int disks = sh->disks;  
3202         struct r5dev *dev;  
3203         int i;  
3204         int do_recovery = 0;  
3205  
3206         memset(s, 0, sizeof(*s));  
3207  
3208         s->expanding = test_bit(STRIPE_EXPAND_SOURCE, &sh->state);  
3209         s->expanded = test_bit(STRIPE_EXPAND_READY, &sh->state);  
3210         s->failed_num[0] = -1;  
3211         s->failed_num[1] = -1;  
3212  
3213         /* Now to look around and see what can be done */
3214         rcu_read_lock();

Initialize the state and take the RCU read lock; read on:

3215         for (i=disks; i--; ) {  
3216                 struct md_rdev *rdev;  
3217                 sector_t first_bad;  
3218                 int bad_sectors;  
3219                 int is_bad = 0;  
3220  
3221                 dev = &sh->dev[i];  
3222  
3223                 pr_debug("check %d: state 0x%lx read %p write %p written %p\n",  
3224                          i, dev->flags,  
3225                          dev->toread, dev->towrite, dev->written);

Next comes a big loop that iterates once per member disk. The object of each iteration is dev at line 3221, of type struct r5dev, so let's look at that structure first; it is embedded inside struct stripe_head:

struct r5dev {  
     /* rreq and rvec are used for the replacement device when 
     * writing data to both devices. 
     */
     struct bio     req, rreq;  
     struct bio_vec     vec, rvec;  
     struct page     *page;  
     struct bio     *toread, *read, *towrite, *written;  
     sector_t     sector;               /* sector of this page */
     unsigned long     flags;  
} dev[1]; /* allocated with extra space depending of RAID geometry */

Start with the comment: rreq and rvec are used by the replacement device when data is written to both devices, the r prefix standing for replacement. And what is a replacement? A stand-in for the original data disk. The replacement feature was introduced only a few kernel versions ago, and it matters a great deal in real products; its implementation is covered later. page is the cache page, normally used as the buffer for computations; the following bio fields are the head pointers of the read and write bio lists; sector is the physical sector position this stripe unit corresponds to; flags holds the struct r5dev flags.
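Note the dev[1] at the end of the structure: it is the classic trailing variable-length-array idiom, so one allocation holds the stripe_head plus one r5dev per member disk. A self-contained sketch of the sizing (raid5.c sizes its stripe slab cache this way; the stand-in types below are illustrative):

#include <stdlib.h>

struct r5dev_like { char pad[128]; };      /* stand-in for struct r5dev */
struct stripe_head_like {
        char pad[256];                     /* the fixed fields          */
        struct r5dev_like dev[1];          /* grows with the disk count */
};

static struct stripe_head_like *alloc_stripe(int disks)
{
        /* dev[1] already accounts for one slot, hence disks - 1 extra */
        return calloc(1, sizeof(struct stripe_head_like) +
                         (disks - 1) * sizeof(struct r5dev_like));
}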

3226                 /* maybe we can reply to a read 
3227                  * 
3228                  * new wantfill requests are only permitted while 
3229                  * ops_complete_biofill is guaranteed to be inactive 
3230                  */
3231                 if (test_bit(R5_UPTODATE, &dev->flags) && dev->toread &&  
3232                     !test_bit(STRIPE_BIOFILL_RUN, &sh->state))  
3233                         set_bit(R5_Wantfill, &dev->flags);  
3234  
3235                 /* now count some things */
3236                 if (test_bit(R5_LOCKED, &dev->flags))  
3237                         s->locked++;  
3238                 if (test_bit(R5_UPTODATE, &dev->flags))  
3239                         s->uptodate++;  
3240                 if (test_bit(R5_Wantcompute, &dev->flags)) {  
3241                         s->compute++;  
3242                         BUG_ON(s->compute > 2);  
3243                 }  
3244  
3245                 if (test_bit(R5_Wantfill, &dev->flags))  
3246                         s->to_fill++;  
3247                 else if (dev->toread)  
3248                         s->to_read++;  
3249                 if (dev->towrite) {  
3250                         s->to_write++;  
3251                         if (!test_bit(R5_OVERWRITE, &dev->flags))  
3252                                 s->non_overwrite++;  
3253                 }  
3254                 if (dev->written)  
3255                         s->written++;

Line 3231: what kind of r5dev gets the R5_Wantfill flag? One that is up to date, has a pending read, and is not already in the middle of a copy. What does that mean? The data it needs is already current, so all that remains is to copy it from the page into the bio.
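So the "fill" in R5_Wantfill is literally a memory copy from the stripe-cache page into the reader's bio (the kernel does this with async_copy_data inside ops_run_biofill). A minimal userspace sketch of the step, with illustrative stand-in types:

#include <string.h>

#define STRIPE_SIZE 4096

struct fake_bio {        /* stands in for the relevant bits of struct bio */
        char   *buf;
        size_t  len;
};

static void biofill_copy(const char page[STRIPE_SIZE],
                         struct fake_bio *bio, size_t offset)
{
        /* copy what the read asked for, never past the cached page */
        size_t n = bio->len;
        if (n > STRIPE_SIZE - offset)
                n = STRIPE_SIZE - offset;
        memcpy(bio->buf, page + offset, n);
}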

Line 3236: count locked devices.

Line 3238: count up-to-date devices.

Line 3240: count devices that need a compute.

Line 3245: count devices that need a copy (biofill).

Line 3247: count devices with pending reads.

Line 3249: count devices with pending writes.

Line 3251: count devices whose write is not a full overwrite.

Line 3254: count devices whose writes have already been carried out.

3256                 /* Prefer to use the replacement for reads, but only 
3257                  * if it is recovered enough and has no bad blocks. 
3258                  */
3259                 rdev = rcu_dereference(conf->disks[i].replacement);  
3260                 if (rdev && !test_bit(Faulty, &rdev->flags) &&  
3261                     rdev->recovery_offset >= sh->sector + STRIPE_SECTORS &&  
3262                     !is_badblock(rdev, sh->sector, STRIPE_SECTORS,  
3263                                  &first_bad, &bad_sectors))  
3264                         set_bit(R5_ReadRepl, &dev->flags);  
3265                 else {  
3266                         if (rdev)  
3267                                 set_bit(R5_NeedReplace, &dev->flags);  
3268                         rdev = rcu_dereference(conf->disks[i].rdev);  
3269                         clear_bit(R5_ReadRepl, &dev->flags);  
3270                 }  
3271                 if (rdev && test_bit(Faulty, &rdev->flags))  
3272                         rdev = NULL;  
3273                 if (rdev) {  
3274                         is_bad = is_badblock(rdev, sh->sector, STRIPE_SECTORS,  
3275                                              &first_bad, &bad_sectors);  
3276                         if (s->blocked_rdev == NULL  
3277                             && (test_bit(Blocked, &rdev->flags)  
3278                                 || is_bad < 0)) {  
3279                                 if (is_bad < 0)  
3280                                         set_bit(BlockedBadBlocks,  
3281                                                 &rdev->flags);  
3282                                 s->blocked_rdev = rdev;  
3283                                 atomic_inc(&rdev->nr_pending);  
3284                         }  
3285                 }

Line 3256: prefer reading from the replacement disk, but only if it is recovered far enough and has no bad blocks.

Line 3264: read from the replacement disk.

Line 3267: mark that the replacement disk still needs to be written (rebuilt).

Line 3271: the disk is faulty, so treat rdev as NULL.

Line 3273: check for bad blocks.

Line 3286: initialize the dev state.

Line 3300: no bad blocks, set the in-sync flag.

Line 3312: write-error handling.

Line 3325: repair handling for the data disk.

Line 3336: repair handling for the replacement disk.

Line 3352: record out-of-sync disks.

Line 3360: decide between syncing and rebuilding the replacement disk.

And that is the end of analyse_stripe. So for a sync, what did this function accomplish? It merely set s.syncing = 1. However long the function looks, each individual pass through it does very little.
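Where exactly does s->syncing get set? In the elided tail of analyse_stripe (the code summarized as lines 3352 and 3360 above), which separates a true sync from a replacement rebuild. From the same kernel generation, approximately (exact line numbers may differ):

        if (test_bit(STRIPE_SYNCING, &sh->state)) {
                /* sync and recovery both need to read all devices */
                if (do_recovery ||
                    sh->sector >= conf->mddev->recovery_cp ||
                    test_bit(MD_RECOVERY_REQUESTED, &(conf->mddev->recovery)))
                        s->syncing = 1;
                else
                        s->replacing = 1;
        }
        rcu_read_unlock();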

Back in handle_stripe, skipping over the code that is not executed on this pass, we arrive here:

3468         /* Now we might consider reading some blocks, either to check/generate 
3469          * parity, or to satisfy requests 
3470          * or to load a block that is being partially written. 
3471          */
3472         if (s.to_read || s.non_overwrite  
3473             || (conf->level == 6 && s.to_write && s.failed)  
3474             || (s.syncing && (s.uptodate + s.compute < disks))  
3475             || s.replacing  
3476             || s.expanding)  
3477                 handle_stripe_fill(sh, &s, disks);

Line 3468: this is where disk reads may be issued; generating parity and servicing read or write requests can all require reading the disks first.

Line 3474: analyse_stripe set the syncing flag, and nothing is up to date yet, so s.uptodate + s.compute < disks holds; the condition is satisfied and we enter handle_stripe_fill.

2707 /** 
2708  * handle_stripe_fill - read or compute data to satisfy pending requests. 
2709  */
2710 static void handle_stripe_fill(struct stripe_head *sh,  
2711                                struct stripe_head_state *s,  
2712                                int disks)  
2713 {  
2714         int i;  
2715   
2716         /* look for blocks to read/compute, skip this if a compute 
2717          * is already in flight, or if the stripe contents are in the 
2718          * midst of changing due to a write 
2719          */
2720         if (!test_bit(STRIPE_COMPUTE_RUN, &sh->state) && !sh->check_state &&  
2721             !sh->reconstruct_state)  
2722                 for (i = disks; i--; )  
2723                         if (fetch_block(sh, s, i, disks))  
2724                                 break;  
2725         set_bit(STRIPE_HANDLE, &sh->state);  
2726 }

Line 2720: if a compute is already in flight, or the stripe is in check or reconstruct state, there is no need to read the disks again.

Line 2722: loop over each r5dev to see whether it needs to be read.

Step into fetch_block:

2618 /* fetch_block - checks the given member device to see if its data needs
2619 * to be read or computed to satisfy a request.
2620 *
2621 * Returns 1 when no more member devices need to be checked, otherwise returns
2622 * 0 to tell the loop in handle_stripe_fill to continue
2623 */

Check whether the given member device needs its data read in (or computed). Returning 1 means no further member devices need checking; returning 0 tells the loop to continue with the remaining devices.

2624 static int fetch_block(struct stripe_head *sh, struct stripe_head_state *s,  
2625                        int disk_idx, int disks)  
2626 {  
2627         struct r5dev *dev = &sh->dev[disk_idx];  
2628         struct r5dev *fdev[2] = { &sh->dev[s->failed_num[0]],  
2629                                   &sh->dev[s->failed_num[1]] };  
2630   
2631         /* is the data in this block needed, and can we get it? */
2632         if (!test_bit(R5_LOCKED, &dev->flags) &&  
2633             !test_bit(R5_UPTODATE, &dev->flags) &&  
2634             (dev->toread ||  
2635              (dev->towrite && !test_bit(R5_OVERWRITE, &dev->flags)) ||  
2636              s->syncing || s->expanding ||  
2637              (s->replacing && want_replace(sh, disk_idx)) ||  
2638              (s->failed >= 1 && fdev[0]->toread) ||  
2639              (s->failed >= 2 && fdev[1]->toread) ||  
2640              (sh->raid_conf->level <= 5 && s->failed && fdev[0]->towrite &&  
2641               !test_bit(R5_OVERWRITE, &fdev[0]->flags)) ||  
2642              (sh->raid_conf->level == 6 && s->failed && s->to_write))) {  
2643                 /* we would like to get this block, possibly by computing it, 
2644                  * otherwise read it if the backing disk is insync 
2645                  */
2646                 BUG_ON(test_bit(R5_Wantcompute, &dev->flags));  
2647                 BUG_ON(test_bit(R5_Wantread, &dev->flags));  
2648                 if ((s->uptodate == disks - 1) &&  
2649                     (s->failed && (disk_idx == s->failed_num[0] ||  
2650                                    disk_idx == s->failed_num[1]))) {  
2651                         /* have disk failed, and we're requested to fetch it; 
2652                          * do compute it 
2653                          */
2654                         pr_debug("Computing stripe %llu block %d\n",  
2655                                (unsigned long long)sh->sector, disk_idx);  
2656                         set_bit(STRIPE_COMPUTE_RUN, &sh->state);  
2657                         set_bit(STRIPE_OP_COMPUTE_BLK, &s->ops_request);  
2658                         set_bit(R5_Wantcompute, &dev->flags);  
2659                         sh->ops.target = disk_idx;  
2660                         sh->ops.target2 = -1; /* no 2nd target */
2661                         s->req_compute = 1;  
2662                         /* Careful: from this point on 'uptodate' is in the eye 
2663                          * of raid_run_ops which services 'compute' operations 
2664                          * before writes. R5_Wantcompute flags a block that will 
2665                          * be R5_UPTODATE by the time it is needed for a 
2666                          * subsequent operation. 
2667                          */
2668                         s->uptodate++;  
2669                         return 1;  
2670                 } else if (s->uptodate == disks-2 && s->failed >= 2) {  
2671                         /* Computing 2-failure is *very* expensive; only 
2672                          * do it if failed >= 2 
2673                          */
2674                         int other;  
2675                         for (other = disks; other--; ) {  
2676                                 if (other == disk_idx)  
2677                                         continue;  
2678                                 if (!test_bit(R5_UPTODATE,  
2679                                       &sh->dev[other].flags))  
2680                                         break;  
2681                         }  
2682                         BUG_ON(other < 0);  
2683                         pr_debug("Computing stripe %llu blocks %d,%d\n",  
2684                                (unsigned long long)sh->sector,  
2685                                disk_idx, other);  
2686                         set_bit(STRIPE_COMPUTE_RUN, &sh->state);  
2687                         set_bit(STRIPE_OP_COMPUTE_BLK, &s->ops_request);  
2688                         set_bit(R5_Wantcompute, &sh->dev[disk_idx].flags);  
2689                         set_bit(R5_Wantcompute, &sh->dev[other].flags);  
2690                         sh->ops.target = disk_idx;  
2691                         sh->ops.target2 = other;  
2692                         s->uptodate += 2;  
2693                         s->req_compute = 1;  
2694                         return 1;  
2695                 } else if (test_bit(R5_Insync, &dev->flags)) {  
2696                         set_bit(R5_LOCKED, &dev->flags);  
2697                         set_bit(R5_Wantread, &dev->flags);  
2698                         s->locked++;  
2699                         pr_debug("Reading block %d (sync=%d)\n",  
2700                                 disk_idx, s->syncing);  
2701                 }  
2702         }  
2703   
2704         return 0;  
2705 }

Coming into this function, the only card in our hand is s.syncing; can that card win anything here?

Line 2632: decide whether this device needs to be read.

Line 2636: this is plainly true, because s.syncing == 1; ignore the other conditions for now.

Line 2648: nothing has been read in yet, so s->uptodate == 0 and this branch is not taken.

Line 2670: likewise not taken.

Line 2695: this is the branch that actually runs.

Line 2696: set the device locked flag.

Line 2697: set the device want-read flag.

Line 2698: increment the stripe's locked-device count.

When handle_stripe_fill returns, every struct r5dev of the stripe carries the R5_Wantread flag. Further down, handle_stripe calls ops_run_io to actually issue the reads:

3673 ops_run_io(sh, &s);

Let's follow this function too. To keep the focus, only the sync-related code is shown:

537 static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)  
538 {  
539         struct r5conf *conf = sh->raid_conf;  
540         int i, disks = sh->disks;  
541   
542         might_sleep();  
543   
544         for (i = disks; i--; ) {  
545                 int rw;  
546                 int replace_only = 0;  
547                 struct bio *bi, *rbi;  
548                 struct md_rdev *rdev, *rrdev = NULL;  
...  
554                 } else if (test_and_clear_bit(R5_Wantread, &sh->dev[i].flags))  
555                         rw = READ;  
...  
560                 } else
561                         continue;  
564   
565                 bi = &sh->dev[i].req;  
566                 rbi = &sh->dev[i].rreq; /* For writing to replacement */
567   
568                 bi->bi_rw = rw;  
569                 rbi->bi_rw = rw;  
570                 if (rw & WRITE) {  
573                 } else
574                         bi->bi_end_io = raid5_end_read_request;  
575   
576                 rcu_read_lock();  
577                 rrdev = rcu_dereference(conf->disks[i].replacement);  
578                 smp_mb(); /* Ensure that if rrdev is NULL, rdev won't be */
579                 rdev = rcu_dereference(conf->disks[i].rdev);  
580                 if (!rdev) {  
581                         rdev = rrdev;  
582                         rrdev = NULL;  
583                 }  
...  
598                 if (rdev)  
599                         atomic_inc(&rdev->nr_pending);  
...  
604                 rcu_read_unlock();  
...  
643                 if (rdev) {  
644                         if (s->syncing || s->expanding || s->expanded  
645                             || s->replacing)  
646                                 md_sync_acct(rdev->bdev, STRIPE_SECTORS);  
647   
648                         set_bit(STRIPE_IO_STARTED, &sh->state);  
649   
650                         bi->bi_bdev = rdev->bdev;  
651                         pr_debug("%s: for %llu schedule op %ld on disc %d\n",  
652                                 __func__, (unsigned long long)sh->sector,  
653                                 bi->bi_rw, i);  
654                         atomic_inc(&sh->count);  
655                         if (use_new_offset(conf, sh))  
656                                 bi->bi_sector = (sh->sector  
657                                                  + rdev->new_data_offset);  
658                         else
659                                 bi->bi_sector = (sh->sector  
660                                                  + rdev->data_offset);  
661                         if (test_bit(R5_ReadNoMerge, &sh->dev[i].flags))  
662                                 bi->bi_rw |= REQ_FLUSH;  
663   
664                         bi->bi_flags = 1 << BIO_UPTODATE;  
665                         bi->bi_idx = 0;  
666                         bi->bi_io_vec[0].bv_len = STRIPE_SIZE;  
667                         bi->bi_io_vec[0].bv_offset = 0;  
668                         bi->bi_size = STRIPE_SIZE;  
669                         bi->bi_next = NULL;  
670                         if (rrdev)  
671                                 set_bit(R5_DOUBLE_LOCKED, &sh->dev[i].flags);  
672                         generic_make_request(bi);  
673                 }  
...  
709         }  
710 }

Line 542: the function may sleep.

Line 544: iterate over every r5dev.

Line 554: consume the want-read flag; rw becomes READ.

Line 568: mark the bio as a read.

Line 574: set the bio completion callback to raid5_end_read_request; this is where execution will resume once the read request completes.

Line 598: increment the device's nr_pending.

Line 646: sync accounting.

Line 648: set the IO-started flag.

Line 650: point the bio at the member disk's block device.

Line 654: increment the stripe_head reference count.

Lines 655-660: compute the on-disk sector by adding the disk's data offset (see the worked example after this list).

Line 661: for a no-merge read, set the bio's REQ_FLUSH flag.

Line 664: fill in the remaining bio fields.

Line 672: submit the bio to the disk.
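The address mapping at lines 655-660 deserves a worked example: sh->sector is relative to the start of the array's data area, while the bio must address the member disk, whose data area begins after the md superblock and bitmap. With made-up numbers:

#include <stdio.h>

typedef unsigned long long sector_t;

int main(void)
{
        sector_t sh_sector   = 2048;  /* stripe sector within the array        */
        sector_t data_offset = 2048;  /* rdev->data_offset of this member disk */
        printf("bi_sector = %llu\n", sh_sector + data_offset);  /* 4096 */
        return 0;
}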

When the disk finishes the read request, raid5_end_read_request is invoked:

1710 static void raid5_end_read_request(struct bio * bi, int error)  
1711 {  
...  
1824         rdev_dec_pending(rdev, conf->mddev);  
1825         clear_bit(R5_LOCKED, &sh->dev[i].flags);  
1826         set_bit(STRIPE_HANDLE, &sh->state);  
1827         release_stripe(sh);  
1828 }

In this function the R5_LOCKED flag is cleared and the stripe_head is queued for handling again. Relayed through raid5d, handle_stripe is entered once more; on this pass analyse_stripe increments s->uptodate once for every disk that was read, so s->uptodate now equals disks. handle_stripe then reaches:

3528     if (sh->check_state ||  
3529         (s.syncing && s.locked == 0 &&  
3530          !test_bit(STRIPE_COMPUTE_RUN, &sh->state) &&  
3531          !test_bit(STRIPE_INSYNC, &sh->state))) {  
3532          if (conf->level == 6)  
3533               handle_parity_checks6(conf, sh, &s, disks);  
3534          else
3535               handle_parity_checks5(conf, sh, &s, disks);  
3536     }

Line 3535 is taken to perform the parity check; enter handle_parity_checks5:

2881     switch (sh->check_state) {  
2882     case check_state_idle:  
2883          /* start a new check operation if there are no failures */
2884          if (s->failed == 0) {  
2885               BUG_ON(s->uptodate != disks);  
2886               sh->check_state = check_state_run;  
2887               set_bit(STRIPE_OP_CHECK, &s->ops_request);  
2888               clear_bit(R5_UPTODATE, &sh->dev[sh->pd_idx].flags);  
2889               s->uptodate--;  
2890               break;  
2891          }

Line 2881: check_state is 0, so we take the check_state_idle branch at line 2882.

Line 2886: set the check_state_run state.

Line 2887: request the STRIPE_OP_CHECK operation.

Line 2889: clear the parity dev's up-to-date flag and decrement s->uptodate accordingly.

Since the STRIPE_OP_CHECK operation was requested here, handle_stripe will call raid_run_ops, which in turn reaches:

1412     if (test_bit(STRIPE_OP_CHECK, &ops_request)) {  
1413          if (sh->check_state == check_state_run)  
1414               ops_run_check_p(sh, percpu);

ops_run_check_p verifies whether the stripe's parity is consistent; its completion callback is:

1301static void ops_complete_check(void *stripe_head_ref)  
1302{  
1303     struct stripe_head *sh = stripe_head_ref;  
1304  
1305     pr_debug("%s: stripe %llu\n", __func__,  
1306          (unsigned long long)sh->sector);  
1307  
1308     sh->check_state = check_state_check_result;  
1309     set_bit(STRIPE_HANDLE, &sh->state);  
1310     release_stripe(sh);  
1311}
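Stripped of the async_tx machinery, what ops_run_check_p computes is an XOR sum over every block of the stripe, parity included: an in-sync RAID5 stripe XORs to zero, and a nonzero result shows up as SUM_CHECK_P_RESULT in sh->ops.zero_sum_result. A minimal sketch of the idea (illustrative, not the kernel API):

#define STRIPE_SIZE 4096

/* returns 1 if data blocks and parity XOR to zero everywhere */
static int stripe_parity_ok(unsigned char *const blocks[], int ndisks)
{
        for (int off = 0; off < STRIPE_SIZE; off++) {
                unsigned char x = 0;
                for (int d = 0; d < ndisks; d++)
                        x ^= blocks[d][off];
                if (x)
                        return 0;    /* mismatch: stripe is out of sync */
        }
        return 1;
}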

Line 1308 sets the state to check_state_check_result, and the stripe is once again put back on handle_list. handle_stripe then calls handle_parity_checks5 one more time, but now with check_state == check_state_check_result:

2916     case check_state_check_result:  
2917          sh->check_state = check_state_idle;  
2918  
2919          /* if a failure occurred during the check operation, leave 
2920          * STRIPE_INSYNC not set and let the stripe be handled again 
2921          */
2922          if (s->failed)  
2923               break;  
2924  
2925          /* handle a successful check operation, if parity is correct 
2926          * we are done.  Otherwise update the mismatch count and repair 
2927          * parity if !MD_RECOVERY_CHECK 
2928          */
2929          if ((sh->ops.zero_sum_result & SUM_CHECK_P_RESULT) == 0)  
2930               /* parity is correct (on disc, 
2931               * not in buffer any more) 
2932               */
2933               set_bit(STRIPE_INSYNC, &sh->state);  
2934          else {  
2935               conf->mddev->resync_mismatches += STRIPE_SECTORS;  
2936               if (test_bit(MD_RECOVERY_CHECK, &conf->mddev->recovery))  
2937                    /* don't try to repair!! */
2938                    set_bit(STRIPE_INSYNC, &sh->state);  
2939               else {  
2940                    sh->check_state = check_state_compute_run;  
2941                    set_bit(STRIPE_COMPUTE_RUN, &sh->state);  
2942                    set_bit(STRIPE_OP_COMPUTE_BLK, &s->ops_request);  
2943                    set_bit(R5_Wantcompute,  
2944                         &sh->dev[sh->pd_idx].flags);  
2945                    sh->ops.target = sh->pd_idx;  
2946                    sh->ops.target2 = -1;  
2947                    s->uptodate++;  
2948               }  
2949          }  
2950          break;

Line 2929: the check found the stripe in sync.

Line 2933: simply mark the stripe in sync; nothing more needs doing.

Line 2934: the stripe is out of sync.

Line 2940: set check_state to check_state_compute_run.

Line 2942: set STRIPE_OP_COMPUTE_BLK in ops_request, i.e. get ready to recompute the parity.

Line 2943: the compute target is the stripe's parity disk.

Line 2947: uptodate was decremented when the check was started; restore it here.

If the stripe turned out to be in sync, then carrying the STRIPE_INSYNC flag we arrive at this point in handle_stripe:

3550         if ((s.syncing || s.replacing) && s.locked == 0 &&  
3551             test_bit(STRIPE_INSYNC, &sh->state)) {  
3552                 md_done_sync(conf->mddev, STRIPE_SECTORS, 1);  
3553                 clear_bit(STRIPE_SYNCING, &sh->state);  
3554         }

If the stripe is out of sync, then carrying the STRIPE_OP_COMPUTE_BLK flag we come to raid_run_ops, which calls __raid_run_ops:

1383     if (test_bit(STRIPE_OP_COMPUTE_BLK, &ops_request)) {  
1384          if (level < 6)  
1385               tx = ops_run_compute5(sh, percpu);

This finally calls ops_run_compute5 to compute the value of the stripe's parity block. Its completion callback is ops_complete_compute:

856static void ops_complete_compute(void *stripe_head_ref)  
857{  
858     struct stripe_head *sh = stripe_head_ref;  
859  
860     pr_debug("%s: stripe %llu\n", __func__,  
861          (unsigned long long)sh->sector);  
862  
863     /* mark the computed target(s) as uptodate */
864     mark_target_uptodate(sh, sh->ops.target);  
865     mark_target_uptodate(sh, sh->ops.target2);  
866  
867     clear_bit(STRIPE_COMPUTE_RUN, &sh->state);  
868     if (sh->check_state == check_state_compute_run)  
869          sh->check_state = check_state_compute_result;  
870     set_bit(STRIPE_HANDLE, &sh->state);  
871     release_stripe(sh);  
872}

Line 864: mark the parity dev R5_UPTODATE.

Line 869: handle_parity_checks5 had set check_state_compute_run, so advance here to check_state_compute_result.

Line 870: set the handle flag, so that after the release at line 871 the stripe enters handle_stripe once again.
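Before moving on: for RAID5 (the level < 6 branch quoted above), the computation ops_run_compute5 performs is plain XOR, offloaded by the kernel to the async_tx xor engine: the target block is the XOR of all the other blocks of the stripe. Conceptually (a sketch, not the kernel code):

#define STRIPE_SIZE 4096

/* regenerate the target block as the XOR of the other blocks */
static void compute_block(unsigned char *target,
                          unsigned char *const others[], int nothers)
{
        for (int off = 0; off < STRIPE_SIZE; off++) {
                unsigned char x = 0;
                for (int d = 0; d < nothers; d++)
                        x ^= others[d][off];
                target[off] = x;
        }
}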

When handle_stripe runs yet again, we return to handle_parity_checks5 once more, this time in the check_state_compute_result state:

2894     case check_state_compute_result:  
2895          sh->check_state = check_state_idle;  
2896          if (!dev)  
2897               dev = &sh->dev[sh->pd_idx];  
2898  
2899          /* check that a write has not made the stripe insync */
2900          if (test_bit(STRIPE_INSYNC, &sh->state))  
2901               break;  
2902  
2903          /* either failed parity check, or recovery is happening */
2904          BUG_ON(!test_bit(R5_UPTODATE, &dev->flags));  
2905          BUG_ON(s->uptodate != disks);  
2906  
2907          set_bit(R5_LOCKED, &dev->flags);  
2908          s->locked++;  
2909          set_bit(R5_Wantwrite, &dev->flags);  
2910  
2911          clear_bit(STRIPE_DEGRADED, &sh->state);  
2912          set_bit(STRIPE_INSYNC, &sh->state);  
2913          break;

At a glance, line 2912 sets STRIPE_INSYNC, which would seem to mean the stripe sync is over. But don't celebrate too early: look back at line 2908, s->locked++, and recall that one of the conditions for finishing a sync is s.locked == 0. So one thing remains before the sync can end: line 2909 sets R5_Wantwrite, telling us that ops_run_io must be called once more to write the freshly computed parity into the stripe's parity disk. When that write succeeds and control comes back, the sync-completion condition is finally met. And with that, one simple sync pass is done.

Source: http://blog.csdn.net/liumangxiong
