Linux教程網 >> Linux基礎 >> 關於Linux >> linux進程管理之調度與切換

linux進程管理之調度與切換

日期：2017/3/3 16:41:01 编辑：關於Linux

進程的調度與切換直接影響著進程子系統的執行效率.Linux摒棄了i386 硬件提供的進程切換方法.手動保存進程上下文.在調度策略上,近幾個版本對其都有很大的改動.特別是在2.6.23版本與以前發布的2.6.0更是相差甚遠.在調度方面.我們以2.6.9在代碼作為基准作為分析.

一:進程切換

進程的切換過程是在context_switch()中實現的.從它的代碼說起:

static inline void
context_switch(struct rq *rq, struct task_struct *prev,
　　　　　 struct task_struct *next)
{
　　 struct mm_struct *mm, *oldmm;
　　 prepare_task_switch(rq, prev, next);
　　 mm = next->mm;
　　 oldmm = prev->active_mm;
　　 /*
　　 * For paravirt, this is coupled with an exit in switch_to to
　　 * combine the page table reload and the switch backend into
　　 * one hypercall.
　　 */
　　 arch_enter_lazy_cpu_mode();
　　 //task->mm 為空.則是一個內核線程
　　 if (unlikely(!mm)) {
　　　　 //內核線程共享上一個運行進程的mm
　　　　 next->active_mm = oldmm;
　　　　 //增加引用計數
　　　　 atomic_inc(&oldmm->mm_count);
　　　　 enter_lazy_tlb(oldmm, next);
　　 } else
　　　　 //如果是用戶進程,則切換運行空間
　　　　 switch_mm(oldmm, mm, next);
　　 //如果上一個運行進程是內核線程
　　 if (unlikely(!prev->mm)) {
　　　　 //賦active_mm為空.
　　　　 prev->active_mm = NULL;
　　　　 //更新運行隊列的prev_mm成員
　　　　 rq->prev_mm = oldmm;
　　 }
　　 /*
　　 * Since the runqueue lock will be released by the next
　　 * task (which is an invalid locking op but in the case
　　 * of the scheduler it's an obvious special-case), so we
　　 * do an early lockdep release here:
　　 */
#ifndef __ARCH_WANT_UNLOCKED_CTXSW
　　 spin_release(&rq->lock.dep_map, 1, _THIS_IP_);
#endif
　　 /* Here we just switch the register state and the stack. */
　　 //切換進程的執行環境
　　 switch_to(prev, next, prev);
　　 barrier();
　　 /*
　　 * this_rq must be evaluated again because prev may have moved
　　 * CPUs since it called schedule(), thus the 'rq' on its stack
　　 * frame will be invalid.
　　 */
　　 //進程切換之後的處理工作
　　 finish_task_switch(this_rq(), prev);
}

實際上，進程切換包括進程的執行環境切換和運行空間的切換。運行空間的切換是由switch_mm()完成的。代碼如下：

//切換進程的執行空間
static inline void switch_mm(struct mm_struct *prev,
　　　　　　　　　 struct mm_struct *next,
　　　　　　　　　 struct task_struct *tsk)
{
　　 //得到當前進程運行的cpu
　　 int cpu = smp_processor_id();
　　 //如果要要切換的prev!=next 執行切換過程
　　 if (likely(prev != next)) {
　　　　 /* stop flush ipis for the previous mm */
　　　　 //清除prev的cpu_vm_mask標志.表示prev已經棄用了當前CPU
　　　　 cpu_clear(cpu, prev->cpu_vm_mask);
#ifdef CONFIG_SMP
　　　　 //在SMP系統中.更新cpu_tlbstate
　　　　 per_cpu(cpu_tlbstate, cpu).state = TLBSTATE_OK;
　　　　 per_cpu(cpu_tlbstate, cpu).active_mm = next;
#endif
　　　　 //設置cpu_vm_mask.表示next占用了當前CPU
　　　　 cpu_set(cpu, next->cpu_vm_mask);
　　　　 /* Re-load page tables */
　　　　 //加載CR3
　　　　 load_cr3(next->pgd);
　　　　 /*
　　　　 * load the LDT, if the LDT is different:
　　　　 */
　　　　 //如果LDT不相同,還要加載LDT
　　　　 if (unlikely(prev->context.ldt != next->context.ldt))
　　　　　　　load_LDT_nolock(&next->context);
　　 }
#ifdef CONFIG_SMP
　　 else {
　　　　 per_cpu(cpu_tlbstate, cpu).state = TLBSTATE_OK;
　　　　 //prev == next .那當前cpu中的active_mm就是prev. 也即next
　　　　 BUG_ON(per_cpu(cpu_tlbstate, cpu).active_mm != next);
　　　　 //在SMP系統中.雖然MM是一樣的,但需要加載CR3
　　　　 //執行cpu_test_and_set()來判斷next是否正運行在此CPU上,這裡是判斷在切換之前next
　　　　 //是否運行在當前的cpu中
　　　　 //假設當前cpu為1..一個進程在1上執行的時候,被調度出來.再次調度的時候
　　　　 //又發生在cpu 1
　　　　 if (!cpu_test_and_set(cpu, next->cpu_vm_mask)) {
　　　　　　　/* We were in lazy tlb mode and leave_mm disabled
　　　　　　　* tlb flush IPI delivery. We must reload %cr3.
　　　　　　　*/
　　　　　　　load_cr3(next->pgd);
　　　　　　　load_LDT_nolock(&next->context);
　　　　 }
　　 }
#endif
}

我們在上面的代碼分析中可以看到。它主要是切換了進程的CR3。

執行環境的切換是在switch_to()中完成的。它的代碼如下：

#define switch_to(prev,next,last) do {                       \
     unsigned long esi,edi;                         \
     asm volatile("pushfl\n\t"        /* Save flags */   \
              "pushl %%ebp\n\t"                    \
              "movl %%esp,%0\n\t"    /* save ESP */         \
              "movl %5,%%esp\n\t"    /* restore ESP */  \
              "movl $1f,%1\n\t"      /* save EIP */         \
              "pushl %6\n\t"         /* restore EIP */  \
              "jmp __switch_to\n"                  \
              "1:\t"                          \
              "popl %%ebp\n\t"                     \
              "popfl"                         \
              :"=m" (prev->thread.esp),"=m" (prev->thread.eip),  \
               "=a" (last),"=S" (esi),"=D" (edi)            \
              :"m" (next->thread.esp),"m" (next->thread.eip),    \
               "2" (prev), "d" (next));                 \
} while (0)

這段代碼是用AT&T嵌入式匯編寫的。如果對相關語法不熟悉，請查閱相關資料。它的運行流程如下：

它總共有9個參數。程序開始時，先把這9個參數壓棧。

後面一個冒號跟的是輸入部。它表示程序開始運時的一系列初始化。初始化如下：

Prev à eax

Next à edx

它的執行流程如下：

Flag寄存器入棧(pushfl)

Ebp入棧(pushl %%ebp.在C中。Ebp寄存器用來存放當前的運行函數的幀指針)

Esp à prev->tread.esp(movel &&esp, %0 保存當前寄存器指針到prev->tread.esp)

Next->tread.esp à esp(將next->tread.esp加載esp寄存器.這一步實際上是加載next進程的內核棧。運行到這裡。理論上prev已經被切換到next了。因為因此如果用curret獲得當前進程應該是next)

將標號為1的地址 à prev->thread.eip　(movl $1f,%1.雖然理論是進程是切換完成了，但是需要保存prev的執行環境。這裡是指明prev進程的下一條運行指令。這樣如果切換進prev進程的話，就會從標號1的地址開始運行)

Next->thread.eip壓棧(pushl %6)

跳轉到__switch_to()函數(jmp __switch_to)　

這裡要注意的是__switch_to函數是經過regparm(3)來修飾的。這個是GCC的一個擴展語法。即從eax.ebx.ecx寄存器取函數的參數。這樣：__switch_to()函數的參數指是從寄存器裡取的。並不是像普通函數那樣從當前堆棧中取得.在調用函數__switch_to()之前。將next->thread.eip壓棧了。這樣從函數返回之後，它的下一條運行指令就是next->thread.eip了

我們可以想象一下：

對於新創建的進程。我們設置了 p->thread.eip = (unsigned long) ret_from_fork。這樣子進程被切換進來之後，就會通過ret_from_fork返回到用戶空間了。(詳情請參照本站的相關文檔)

對於已經運行的進程。我們在這裡可以看到，在進程被切換出去的時候，prev->thread.eip會被設置成標號1的地址。即是從標號1的地址開始運行的。

標號1的操作：

恢復ebp　(popl %%ebp)

恢復flags (popfl)

這樣就恢復了進程的執行環境

這段代碼有很多讓人疑惑的地方：

在進程切換的時候，只是顯示保存了flags esp和 ebp寄存。顯示的用到了eax.edx.那其它寄存器怎麼保存的呢？

實際上，過程切換只是發生在內核態中。對於內核態中的寄存器來說。它們的段寄存器都是一樣的。所以不需要保存。

另外：對於通用寄存器。Esi edi ecx怎麼保存的呢？我查閱了相關資料。沒有發現能完全解釋清楚的。現將我各人意見所列如下：

Esi.edi寄存器：有人認為在輸出部中有這兩句："=S" (esi),"=D" (edi) 認為會將esi edi壓棧。這際上這是錯誤的。壓棧的只是esi edi變量(在嵌入代碼之前定義的).而不是esi edi寄存器。在這裡。它只是在嵌入代碼運行之後。將esi edi寄存器的內容分別放到esi edi變量中而已。在我用gcc –S 測試結果看來也確實如此。

那即然這樣。Esi edi ecx寄存器怎麼被保存。以便將來在prev運行的時候被恢復呢？

我認為：esi edi ecx沒有被保存。也不用保存。

我們來看一下切換的整個過程：

在switch_to()中執行環境的切換過程。在prev被切換回來之後。是從switch_to中標號1的地址繼續執行的。這時候已經被切換回了進程prev的內核棧。假設此時沒有保存其它的寄存器(flags ,ebp是必須要保存的哦 ^_^ ).它的後續操作是barrier()(GCC的一段嵌入代碼。聲明內存已經變動.它實際上沒有任何操作，這跟寄存器沒關系).調用函數finish_task_switch().再從context_switch()函數中返回。再從schedule()中返回。再繼續執行prev進程流。

我們知道，C函數的調用機制中。在調用之前先會執行進程環境的保存，再推入參數。最後推入下一條執令地址。然後跳轉到函數地址。從函數中返回之後。會將環境恢復。繼續執行。

這樣我們知道。在switch_to()後的finish_task_switch()的函數調用是不受寄存器影響的(因為這是個函數。在進入函數後相當是一個全新的“世界”).然後從context_switch()返回之後就會恢復schedule()中的執行環境了。到了schedule()之後，這時候的環境跟之前切換出去的環境是完全一樣的了.

當然，這樣做的前提時：在switch_to()之後的代碼對寄存器沒有依賴。也就是全為函數。假設在switch_to()之後，有對ecx .edi esi寄存器的讀取操作就會產生錯誤了。

在switch_to中會調用__switch_to()。它的代碼如下所示：

struct task_struct fastcall * __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
{
　　 struct thread_struct *prev = &prev_p->thread,
　　　　　　　　　 *next = &next_p->thread;
　　 int cpu = smp_processor_id();
　　 struct tss_struct *tss = &per_cpu(init_tss, cpu);
　　 /* never put a printk in __switch_to... printk() calls wake_up*() indirectly */
　　 //MMX,FPU,XMM寄存器的相關操作
//我們在前面已經詳細分析過這個函數。詳細請參閱本站相關文檔
　　 __unlazy_fpu(prev_p);
　　 /* we're going to use this soon, after a few expensive things */
　　 if (next_p->fpu_counter > 5)
　　　　 prefetch(&next->i387.fxsave);
　　 /*
　　 * Reload esp0.
　　 */
　　 //將next->thread.esp0　--> tss
　　 //從用戶態切換到內核態的時候,將此值加載esp寄存器.即為進程的內核棧頂位置
　　 load_esp0(tss, next);
　　 /*
　　 * Save away %gs. No need to save %fs, as it was saved on the
　　 * stack on entry.　No need to save %es and %ds, as those are
　　 * always kernel segments while inside the kernel.　Doing this
　　 * before setting the new TLS descriptors avoids the situation
　　 * where we temporarily have non-reloadable segments in %fs
　　 * and %gs.　This could be an issue if the NMI handler ever
　　 * used %fs or %gs (it does not today), or if the kernel is
　　 * running inside of a hypervisor layer.
　　 */
　　 　
　　 //為 prev保存gs
　　 savesegment(gs, prev->gs);
　　 /*
　　 * Load the per-thread Thread-Local Storage descriptor.
　　 */
　　 //從next的tls_array 緩存中加載線程的Thread-Local Storage描述符
　　 load_TLS(next, cpu);
　　 /*
　　 * Restore IOPL if needed.　In normal use, the flags restore
　　 * in the switch assembly will handle this.　But if the kernel
　　 * is running virtualized at a non-zero CPL, the popf will
　　 * not restore flags, so it must be done in a separate step.
　　 */
　　 //如果當前特權級別是0並且prev->iopl != next->iopl則恢復IOPL設置set_iopl_mask(next->iopl)。
　　 if (get_kernel_rpl() && unlikely(prev->iopl != next->iopl))
　　　　 set_iopl_mask(next->iopl);
　　 /*
　　 * Now maybe handle debug registers and/or IO bitmaps
　　 */
　　 //　根據thread_info的TIF標志_TIF_WORK_CTXSW和TIF_IO_BITMAP判斷是否需要處理debug寄存器和IO位圖
　　 if (unlikely(task_thread_info(prev_p)->flags & _TIF_WORK_CTXSW_PREV ||
　　　　 　　 task_thread_info(next_p)->flags & _TIF_WORK_CTXSW_NEXT))
　　　　 __switch_to_xtra(prev_p, next_p, tss);
　　 /*
　　 * Leave lazy mode, flushing any hypercalls made here.
　　 * This must be done before restoring TLS segments so
　　 * the GDT and LDT are properly updated, and must be
　　 * done before math_state_restore, so the TS bit is up
　　 * to date.
　　 */
　　 //arch_leave_lazy_cpu_mode()設置CPU的lazy模式
　　 arch_leave_lazy_cpu_mode();
　　 /* If the task has used fpu the last 5 timeslices, just do a full
　　 * restore of the math state immediately to avoid the trap; the
　　 * chances of needing FPU soon are obviously high now
　　 */
　　 // 如果next_p->fpu_counter > 5則恢復next_p的FPU寄存器內容
　　 if (next_p->fpu_counter > 5)
　　　　 math_state_restore();
　　 /*
　　 * Restore %gs if needed (which is common)
　　 */
　　 //如果需要，恢復gs寄存器
　　 if (prev->gs | next->gs)
　　　　 loadsegment(gs, next->gs);
　　 //把當前的task寫入current_task(一個per-cpu變量)
　　 x86_write_percpu(current_task, next_p);
　　 return prev_p;
}

這個函數涉及到很多硬件相關的東西。需要參閱x86的架構體系方面的資料才能完全理解。

進程切換完了之後，還得要進行一些清理工作，這是由finish_task_switch()完成的。它的代碼如下：

static inline void finish_task_switch(struct rq *rq, struct task_struct *prev)
　　 __releases(rq->lock)
{
　　 struct mm_struct *mm = rq->prev_mm;
　　 long prev_state;
　　 //清空運行隊列的prev_mm .
　　 rq->prev_mm = NULL;
　　 /*
　　 * A task struct has one reference for the use as "current".
　　 * If a task dies,then it sets TASK_DEAD in tsk->state and calls
　　 * schedule one last time. The schedule call will never return, and
　　 * the scheduled task must drop that reference.
　　 * The test for TASK_DEAD must occur while the runqueue locks are
　　 * still held, otherwise prev could be scheduled on another cpu, die
　　 * there before we look at prev->state, and then the reference would
　　 * be dropped twice.
　　 *　　　 Manfred Spraul <[email protected]>
　　 */
　　 prev_state = prev->state;
　　 finish_arch_switch(prev);
　　 finish_lock_switch(rq, prev);
　　 fire_sched_in_preempt_notifiers(current);
　　 //mmdrop:減少mm的引用計數,如果引用計數為零,釋放之
　　 if (mm)
　　　　 mmdrop(mm);
　　 //如果過時程已經終止
　　 if (unlikely(prev_state == TASK_DEAD)) {
　　　　 /*
　　　　 * Remove function-return probe instances associated with this
　　　　 * task and put them back on the free list.
　　　　 */
　　　　 //釋放該進程的task
　　　　 kprobe_flush_task(prev);
　　　　 put_task_struct(prev);
　　 }
}

這個函數比較簡單。不需要做詳細分析了。

到這裡，進程切換已經分析完了。那進程什麼時候切換呢？切換的依據又是什麼呢？這就是我們接下來要分析的進程調度方面的內容了。

二：進程調度

進程調度在近幾個版本中都進行了重要的修改。我們以2.6.9版為例過行分析。在進行具體的代碼分析之前。我們先學習一下關於進程調度的原理。

1：進程類型：

在linux調度算法中，將進程分為兩種類型。即：I/O消耗型和CPU消耗型。例如文本處理程序與正在執行的Make的程序。文本處理程序大部份時間都在等待I/O設備的輸入,而make程序大部份時間都在CPU的處理上。因此為了提高響應速度，I/O消耗程序應該有較高的優先級，才能提高它的交互性。相反的，Make程序相比之下就不那麼重要了。只要它能處理完就行了。因此，基於這樣的原理，linux有一套交互程序的判斷機制。

在task_struct結構中新增了一個成員：sleep_avg.此值初始值為100。進程在CPU上執行時，此值減少。當進程在等待時，此值增加。最後，在調度的時候。根據sleep_avg的值重新計算優先級。

2：進程優先級

正如我們在上面所說的：交互性強的需要高優先級，交互性弱的需要低優先級。在linux系統中，有兩種優先級：普通優先級和實時優先級。我們在這裡主要分析的是普通優先級，實時優先級部份可自行了解。

3：運行時間片

進程的時間片是指進程在搶占前可以持續運行的時間。在linux中,時間片長短可根據優先級來調度。進程不一定要一次運行完所有的時間片。可以在運時的中途被切換出去。

4：進程搶占

當一個進程被設為TASK_RUNING狀態時。它會判斷它的優先級是否高於正在運行的進程。如果是，則設置調度標志位，調用schedule()執行進程的調度。當一個進程的時間片為0時，也會執行進程搶占.

重要的數據結構：

在2。6。9中，每個CPU都有一個運行隊列，這樣就避免了競爭所帶來的額外開銷。運行隊列的結構如下所示：

struct runqueue {
　　 spinlock_t lock;
　　 /*
　　 * nr_running and cpu_load should be in the same cacheline because
　　 * remote CPUs use both these fields when doing load calculation.
　　 */
　　 //運行隊列中進程數目
　　 unsigned long nr_running;
#ifdef CONFIG_SMP
　　 unsigned long cpu_load;
#endif
　　 //進程切換的總數
　　 unsigned long long nr_switches;
　　 //當新一輪的時間片遞減開始後，
　　 //expired_timestamp變量記錄著最早發生的進程耗完時間片事件的時間
　　 //nr_uninterruptible: 記錄本 CPU 尚處於 TASK_UNINTERRUPTIBLE 狀態的進程數，和負載信息有關
　　 unsigned long expired_timestamp, nr_uninterruptible;
　　 //本就緒隊列最近一次發生調度事件的時間
　　 unsigned long long timestamp_last_tick;
　　 //當前CPU上運行的進程和指定的空閒進程
　　 task_t *curr, *idle;
　　 //上一次調度時的MM結構
　　 struct mm_struct *prev_mm;
　　 //兩個隊列:active與expired
　　 //active記錄的是活動的隊列
　　 //expired:記錄的是時間片運行完了的隊列
　　 prio_array_t *active, *expired, arrays[2];
　　 //記錄 expired 就緒進程組中的最高優先級（數值最小)
　　 // 該變量在進程進入 expired 隊列的時候保存
　　 int best_expired_prio;
　　 //記錄本 CPU 因等待 IO 而處於休眠狀態的進程數
　　 atomic_t nr_iowait;
　　 ……
　　 ……
}

prio_array_t的定義如下：

struct prio_array {
　　 //該隊列的進程數
　　 unsigned int nr_active;
　　 //位圖.每一位表示一個隊列,如果該隊列有可以運行的進程,置該位為1
　　 unsigned long bitmap[BITMAP_SIZE];
　　 //運行隊列
　　 struct list_head queue[MAX_PRIO];
};

進程調度的實現

對調度有了一般的了解之後，我們來看下進程調度到底是怎麼樣實現的：

進程調度由schedule()完成,它的代碼如下：

asmlinkage void __sched schedule(void)
{
　　 long *switch_count;
　　 task_t *prev, *next;
　　 runqueue_t *rq;
　　 prio_array_t *array;
　　 struct list_head *queue;
　　 unsigned long long now;
　　 unsigned long run_time;
　　 int cpu, idx;
　　 //WARN_ON(system_state == SYSTEM_BOOTING);
　　 /*
　　 * Test if we are atomic.　Since do_exit() needs to call into
　　 * schedule() atomically, we ignore that path for now.
　　 * Otherwise, whine if we are scheduling when we should not be.
　　 */
　　 //in_atomic:判斷是否在一個中斷上下文或者是原子上下文
　　 if (likely(!(current->state & (TASK_DEAD | TASK_ZOMBIE)))) {
　　　　 if (unlikely(in_atomic())) {
　　　　　　　printk(KERN_ERR "bad: scheduling while atomic!\n");
　　　　　　　dump_stack();
　　　　 }
　　 }
need_resched:
　　 //禁止搶占
　　 preempt_disable();
　　 prev = current;
　　 //得到當前CPU上的運行隊列
　　 rq = this_rq();
　　 /*
　　 * The idle thread is not allowed to schedule!
　　 * Remove this check after it has been exercised a bit.
　　 */
　　 //如果當前進程是空閒進程而且不處於運行狀態.
　　 //BUG:
　　 if (unlikely(current == rq->idle) && current->state != TASK_RUNNING) {
　　　　 printk(KERN_ERR "bad: scheduling from the idle thread!\n");
　　　　 dump_stack();
　　 }
　　 release_kernel_lock(prev);
　　 schedstat_inc(rq, sched_cnt);
　　 now = sched_clock();
　　 //timestamp: 1:切換上去的時間戳
　　 //　　　　　　2:切換下來的時間戳
　　 //　　　　　　3:被喚醒的時間恰戳
　　 //prev是當前正在運行的進程.這個timestamp應該是表示被切換上去的時間戳
　　 //當前運行的時間.如果超過了NS_MAX_SLEEP_AVG. 取運行時間為NS_MAX_SLEEP_AVG
　　 if (likely(now - prev->timestamp < NS_MAX_SLEEP_AVG))
　　　　 run_time = now - prev->timestamp;
　　 else
　　　　 run_time = NS_MAX_SLEEP_AVG;
　　 /*
　　 * Tasks with interactive credits get charged less run_time
　　 * at high sleep_avg to delay them losing their interactive
　　 * status
　　 */
　　 //HIGH_CREDIT()判斷是否是一個交互式進程
　　 //interactive_credit 超過了閥值
　　 //如果是交互式進程，則它參與優先級計算的運行時間會比實際運行時間小
　　 //以此獲得較高的優先級
　　 if (HIGH_CREDIT(prev))
　　　　 run_time /= (CURRENT_BONUS(prev) ? : 1);
　　 spin_lock_irq(&rq->lock);
　　 /*
　　 * if entering off of a kernel preemption go straight
　　 * to picking the next task.
　　 */
　　 switch_count = &prev->nivcsw;
　　 //state: -1:不可運行 0:可運行的 >0 暫停的
　　 if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {
　　　　 switch_count = &prev->nvcsw;
//可中斷且有末處理的信號.激活之.
　　　　 if (unlikely((prev->state & TASK_INTERRUPTIBLE) &&
　　　　　　　　　 unlikely(signal_pending(prev))))
　　　　　　　prev->state = TASK_RUNNING;
　　　　 else
　　　　　　　//從運行隊列中移除
　　　　　　　deactivate_task(prev, rq);
　　 }
　　 //當前CPU
　　 cpu = smp_processor_id();
　　 //運行隊列中沒有可供運行的進程了
　　 if (unlikely(!rq->nr_running)) {
go_idle:
　　　　 //如果沒有可運行的進程了.從其它CPU上"搬"進程過來
　　　　 idle_balance(cpu, rq);
　　　　 //如果還是沒有可運行進程.則將Next賦值為空閒進程.
　　　　 //再跳轉到switch_tasks.將空閒進程切換進去
　　　　 if (!rq->nr_running) {
　　　　　　　next = rq->idle;
　　　　　　　rq->expired_timestamp = 0;
　　　　　　　wake_sleeping_dependent(cpu, rq);
　　　　　　　/*
　　　　　　　* wake_sleeping_dependent() might have released
　　　　　　　* the runqueue, so break out if we got new
　　　　　　　* tasks meanwhile:
　　　　　　　*/
　　　　　　　if (!rq->nr_running)
　　　　　　　　　 goto switch_tasks;
　　　　 }
　　 } else {
　　　　 if (dependent_sleeper(cpu, rq)) {
　　 　　　　 schedstat_inc(rq, sched_goidle);
　　　　　　　next = rq->idle;
　　　　　　　goto switch_tasks;
　　　　 }
　　　　 /*
　　　　 * dependent_sleeper() releases and reacquires the runqueue
　　　　 * lock, hence go into the idle loop if the rq went
　　　　 * empty meanwhile:
　　　　 */
　　　　 if (unlikely(!rq->nr_running))
　　　　　　　goto go_idle;
　　 }
　　 //活動隊列
　　 array = rq->active;
　　 //如果活動隊列中進程數為0.
　　 if (unlikely(!array->nr_active)) {
　　　　 /*
　　　　 * Switch the active and expired arrays.
　　　　 */
　　　　 //對換expired　與active
　　　　 schedstat_inc(rq, sched_switch);
　　　　 rq->active = rq->expired;
　　　　 rq->expired = array;
　　　　 array = rq->active;
　　　　 rq->expired_timestamp = 0;
　　　　 rq->best_expired_prio = MAX_PRIO;
　　 } else
　　　　 schedstat_inc(rq, sched_noswitch);
　　 //在活動隊列組中取第一個不為空的隊列
　　 idx = sched_find_first_bit(array->bitmap);
　　 queue = array->queue + idx;
　　 next = list_entry(queue->next, task_t, run_list);
　　 //next進程即是將要被切換進的進程
　　 //rt_task():判斷是否是一個實時進程
　　 if (!rt_task(next) && next->activated > 0) {
　　　　 unsigned long long delta = now - next->timestamp;
　　　　 //task->activated:表示進程因什麼原因進入就緒態，這一原因會影響到調度優先級的計算
　　　　 //-1，進程從 TASK_UNINTERRUPTIBLE 狀態被喚醒
　　　　 //0，缺省值，進程原本就處於就緒態
　　　　 // 1，進程從 TASK_INTERRUPTIBLE 狀態被喚醒，且不在中斷上下文中
　　　　 // 2，進程從 TASK_INTERRUPTIBLE 狀態被喚醒，且在中斷上下文中
　　　　 //如果是中斷服務程序調用的 activate_task()，也就是說進程由中斷激活，則該進程最有可能是交互式的，因此，置 activated=2；否則置activated=1。
　　　　 //如果進程是從 TASK_UNINTERRUPTIBLE 狀態中被喚醒的，則 activated=-1（在try_to_wake_up()函數中）。
　　　　 if (next->activated == 1)
　　　　　　　delta = delta * (ON_RUNQUEUE_WEIGHT * 128 / 100) / 128;
　　　　 array = next->array;
　　　　 //將進程移出隊列
　　　　 dequeue_task(next, array);
　　　　 //重新計算優先級
　　　　 recalc_task_prio(next, next->timestamp + delta);
　　　　 //重新插入運行隊列
　　　　 enqueue_task(next, array);
　　 }
　　 next->activated = 0;
switch_tasks:
　　 prefetch(next);
　　 //清除調度標志
　　 clear_tsk_need_resched(prev);
　　 rcu_qsctr_inc(task_cpu(prev));
　　 //更新prev->sleep_avg
　　 prev->sleep_avg -= run_time;
　　 //如果sleep_avg小於零
　　 if ((long)prev->sleep_avg <= 0) {
　　　　 prev->sleep_avg = 0;
　　　　 if (!(HIGH_CREDIT(prev) || LOW_CREDIT(prev)))
　　　　　　　prev->interactive_credit--;
　　 }
　　 prev->timestamp = prev->last_ran = now;
　　 sched_info_switch(prev, next);
　　 //如果prev與next不相等,則執行切換過程
　　 if (likely(prev != next)) {
　　　　 next->timestamp = now;
　　　　 rq->nr_switches++;
　　　　 rq->curr = next;
　　　　 ++*switch_count;
　　　　 //執行進程的切換過程
　　　　 prepare_arch_switch(rq, next);
　　　　 prev = context_switch(rq, prev, next);
　　　　 barrier();
　　　　 finish_task_switch(prev);
　　 } else
　　　　 spin_unlock_irq(&rq->lock);
　　 reacquire_kernel_lock(current);
　　 preempt_enable_no_resched();
　　 //如果還有調度標志.那就重新執行調度過程.
　　 //這個調度標志可能是在別處被設置的
　　 if (unlikely(test_thread_flag(TIF_NEED_RESCHED)))
　　　　 goto need_resched;
}

在上面的代碼中涉及到了SMP的負載平衡，動態優先級的知識。

所謂SMP負載平衡是指，當一個CPU的運行進程過多，但是另一個CPU的進程過少。需要在兩者之間平衡運行的進程數。顯而易見，這樣可以提高系統的運行效率。

動態優先級中涉及到很多的公式，在這裡不加詳細分析，請自行了解。

我們接著看，如果一個進程的時間片完了，會進行怎麼的處理。這個處理過程是在時間中斷中完成的。詳細請參閱本站的相關分析。

在時間中斷處理程序中，會調用sheduler_tick()對當前進程運行的時間片做特定的處理。它的代碼如下：

void scheduler_tick(int user_ticks, int sys_ticks)
{
　　 //當前CPU
　　 int cpu = smp_processor_id();
　　 struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
　　 //取得當前CPU的運行隊列
　　 runqueue_t *rq = this_rq();
　　 task_t *p = current;
　　 //更新timestamp_last_tick為當前時間
　　 rq->timestamp_last_tick = sched_clock();
　　 if (rcu_pending(cpu))
　　　　 rcu_check_callbacks(cpu, user_ticks);
　　 /* note: this timer irq context must be accounted for as well */
　　 if (hardirq_count() - HARDIRQ_OFFSET) {
　　　　 cpustat->irq += sys_ticks;
　　　　 sys_ticks = 0;
　　 } else if (softirq_count()) {
　　　　 cpustat->softirq += sys_ticks;
　　 　　 sys_ticks = 0;
　　 }
　　 //如果當前進程為空閒進程
　　 if (p == rq->idle) {
　　　　 if (atomic_read(&rq->nr_iowait) > 0)
　　　　　　　cpustat->iowait += sys_ticks;
　　　　 else
　　　　　　　cpustat->idle += sys_ticks;
　　　　 if (wake_priority_sleeper(rq))
　　　　　　　goto out;
　　　　 rebalance_tick(cpu, rq, IDLE);
　　　　 return;
　　 }
　　 if (TASK_NICE(p) > 0)
　　　　 cpustat->nice += user_ticks;
　　 else
　　　　 cpustat->user += user_ticks;
　　 cpustat->system += sys_ticks;
　　 /* Task might have expired already, but not scheduled off yet */
　　 //如果當前進程不是在活動隊列中
　　 //說明P是過期但末移除的進程
　　 if (p->array != rq->active) {
　　　　 //設置調度標志位
　　　　 set_tsk_need_resched(p);
　　　　 goto out;
　　 }
　　 spin_lock(&rq->lock);
　　 /*
　　 * The task was running during this tick - update the
　　 * time slice counter. Note: we do not update a thread's
　　 * priority until it either goes to sleep or uses up its
　　 * timeslice. This makes it possible for interactive tasks
　　 * to use up their timeslices at their highest priority levels.
　　 */
　　 //實時進程的處理
　　 if (rt_task(p)) {
　　　　 /*
　　　　 * RR tasks need a special form of timeslice management.
　　　　 * FIFO tasks have no timeslices.
　　　　 */
　　　　 if ((p->policy == SCHED_RR) && !--p->time_slice) {
　　　　　　　p->time_slice = task_timeslice(p);
　　　　　　　p->first_time_slice = 0;
　　　　　　　set_tsk_need_resched(p);
　　　　　　　/* put it at the end of the queue: */
　　　　　　　dequeue_task(p, rq->active);
　　　　　　　enqueue_task(p, rq->active);
　　　　 }
　　　　 goto out_unlock;
　　 }
　　 //如果進程的時間片用完
　　 if (!--p->time_slice) {
　　　　 //移出活動隊列
　　　　 dequeue_task(p, rq->active);
　　　　 //設置調度標志
　　　　 set_tsk_need_resched(p);
　　　　 //重新計算優先級與時間片
　　　　 p->prio = effective_prio(p);
　　　　 p->time_slice = task_timeslice(p);
　　　　 p->first_time_slice = 0;
　　　　 //如果rq->expired_timestamp 為空,則設為當前時間
　　　　 //expired_timestamp:表示運行隊列中最先時間片耗完的時間
　　　　 if (!rq->expired_timestamp)
　　　　　　　rq->expired_timestamp = jiffies;
　　　　 //在這裡要注意了:
　　　　 //如果是一個交互性很強的進程.如果它時間版運行完了.將它移至expired隊列的
　　　　 //話,那就必須要等active隊列的進程運行完之後才能被調度.這樣會影響用戶的交互性
　　　　 //判斷是否是一個交互性很強的進程.如果不是,將其加至expired隊列
　　　　 //如果是,將其加至active隊列
　　　　 if (!TASK_INTERACTIVE(p) || EXPIRED_STARVING(rq)) {
　　　　　　　//加入expired隊列
　　　　　　　enqueue_task(p, rq->expired);
　　　　　　　if (p->static_prio < rq->best_expired_prio)
　　　　　　　　　 rq->best_expired_prio = p->static_prio;
　　　　 } else
　　　　　　　//移至active隊列
　　　　　　　enqueue_task(p, rq->active);
　　 } else {
　　　　 /*
　　　　 * Prevent a too long timeslice allowing a task to monopolize
　　　　 * the CPU. We do this by splitting up the timeslice into
　　　　 * smaller pieces.
　　　　 *
　　　　 * Note: this does not mean the task's timeslices expire or
　　　　 * get lost in any way, they just might be preempted by
　　　　 * another task of equal priority. (one with higher
　　　　 * priority would have preempted this task already.) We
　　　　 * requeue this task to the end of the list on this priority
　　　　 * level, which is in essence a round-robin of tasks with
　　　　 * equal priority.
　　　　 *
　　　　 * This only applies to tasks in the interactive
　　　　 * delta range with at least TIMESLICE_GRANULARITY to requeue.
　　　　 */
　　　　 //當前進程的時間片沒有運行完...
　　　　 //如果進程的剩余時間超長.
　　　　 if (TASK_INTERACTIVE(p) && !((task_timeslice(p) -
　　　　　　　p->time_slice) % TIMESLICE_GRANULARITY(p)) &&
　　　　　　　(p->time_slice >= TIMESLICE_GRANULARITY(p)) &&
　　　　　　　(p->array == rq->active)) {
　　　　　　　//從活動隊列中移除
　　　　　　　dequeue_task(p, rq->active);
　　　　　　　//設置調度標志
　　　　　　　set_tsk_need_resched(p);
//重新計算優先級
　　　　　　　p->prio = effective_prio(p);
　　　　　　　//重新插入活動隊列
　　　　　　　enqueue_task(p, rq->active);
　　　　 }
　　 }
out_unlock:
　　 spin_unlock(&rq->lock);
out:
　　 rebalance_tick(cpu, rq, NOT_IDLE);
}

總結：

操作系統用的進程調度算法是衡量一個操作系統的重要標志。在2.4之前，調度部份算法變動很少。但是在2.6版中的近幾個版本中發生了很大的變化。其中重要的變化是支持內核搶占。內核搶占導致了內核的復雜性。很多代碼都必須要保持重入。