Kernel-Scheduled Entities for FreeBSD
Jason Evans

[email protected]

January 1, 2003

Translated by: DarkBlueSea

Abstract:
FreeBSD has historically had less than ideal support for multi-threaded application programming. At present, there are two threading libraries available. libc_r is entirely invisible to the kernel, and multiplexes threads within a single process. The linuxthreads port, which creates a separate process for each thread via rfork(), plus one thread to handle thread synchronization, relies on the kernel scheduler to multiplex "threads" onto the available processors.

Both approaches have scaling problems. libc_r does not take advantage of multiple processors and cannot avoid blocking when doing I/O on "fast" devices. The linuxthreads port requires one process per thread, and thus puts strain on the kernel scheduler, as well as requiring significant kernel resources. In addition, thread switching can only be as fast as the kernel's process switching.

This paper summarizes various methods of implementing threads for application programming, then goes on to describe a new threading architecture for FreeBSD, based on what are called kernel-scheduled entities, or scheduler activations.

1 Background
FreeBSD has been slow to implement application threading facilities, perhaps because threading is difficult to implement correctly and because of the FreeBSD Project's general reluctance to build substandard solutions to problems. Instead, two technologies originally developed by others have been integrated.

libc_r is based on Chris Provenzano's userland pthreads implementation, and was significantly re-worked by John Birrell and integrated into the FreeBSD source tree. Since then, numerous other people have improved and expanded libc_r's capabilities. At this time, libc_r is a high-quality userland implementation of POSIX threads.

The linuxthreads port is based on Xavier Leroy's LinuxThreads, which is now an integral part of GNU libc. The linuxthreads port is completely compatible with LinuxThreads as it runs on Linux.

Both of these libraries have scalability and performance problems for certain types of threaded programs.

2 Threading Architectures
Following is a short description of several threading architectures, along with a representative implementation for each of the more common ones. In all cases, threading is preemptive.

2.1 Userland (ala FreeBSD's libc_r)
Userland threads are implemented entirely in an application program, with no explicit support from the kernel. In most cases, a threading library is linked in to the application, though it is also possible to hand code user threading within an application.

In order for userland threading to work, two main issues have to be resolved:

Preemptive scheduling:

Threads must be periodically preempted so that all runnable threads of sufficient priority get to run. This is done by a combination of a timer signal (SIGALRM for libc_r), which allows the userland threads scheduler (UTS) to run, and setjmp()/longjmp() calls to switch between threads.
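To make this concrete, here is a minimal sketch of timer-driven preemption (illustrative only: libc_r's real UTS does far more bookkeeping, masks signals during the switch, and the contexts in ctx[] must first have been established by a setjmp() in each thread):

#include <setjmp.h>
#include <signal.h>
#include <sys/time.h>

#define NTHREADS      2
#define QUANTUM_USEC  10000             /* 10 ms scheduling quantum */

static jmp_buf ctx[NTHREADS];           /* per-thread saved context */
static int cur;                         /* index of the running thread */

/* SIGALRM handler: save the current thread's context, resume the next. */
static void
preempt(int sig)
{
        if (setjmp(ctx[cur]) == 0) {
                cur = (cur + 1) % NTHREADS;
                longjmp(ctx[cur], 1);
        }
        /* setjmp() returned non-zero: this thread was just resumed. */
}

static void
start_preemption(void)
{
        struct itimerval it;

        signal(SIGALRM, preempt);
        it.it_interval.tv_sec = 0;
        it.it_interval.tv_usec = QUANTUM_USEC;
        it.it_value = it.it_interval;
        setitimer(ITIMER_REAL, &it, NULL);
}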

Process blocking:

Normally, when a process makes a system call that cannot be completed immediately, the process blocks and another process is scheduled in order to make full use of the processor. However, a threaded program may have multiple runnable threads, so blocking in a system call should be avoided. This is accomplished by converting potentially blocking system calls to non-blocking. This works well in all cases except for operations on so-called fast devices such as local filesystems, where it is not possible to convert to a non-blocking system call. libc_r handles non-blocking and incomplete system calls by converting file descriptors to non-blocking, issuing I/O requests, then adding file descriptors to a central poll()-based event loop. (The poll() system call examines a set of file descriptors to see if some of them are ready for I/O.)
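As a sketch of this call conversion (simplified: a real UTS parks the calling thread and runs another instead of waiting in poll() itself):

#include <sys/types.h>
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

/*
 * Wrapper for a potentially blocking read(): mark the descriptor
 * non-blocking, and if the operation would block, wait until the
 * descriptor is readable.  libc_r would instead suspend the thread,
 * add fd to the central poll() set, and schedule another thread.
 */
ssize_t
threaded_read(int fd, void *buf, size_t nbytes)
{
        struct pollfd pfd;
        ssize_t n;

        fcntl(fd, F_SETFL, fcntl(fd, F_GETFL) | O_NONBLOCK);
        for (;;) {
                n = read(fd, buf, nbytes);
                if (n >= 0 || errno != EAGAIN)
                        return (n);
                pfd.fd = fd;
                pfd.events = POLLIN;
                pfd.revents = 0;
                poll(&pfd, 1, -1);      /* block until fd is ready */
        }
}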

Userland threads have the advantage of being very fast in the simple case, though the complexities of call conversion eats into this performance advantage for applications that make many system calls.

As libc_r currently exists, there are several problems:

1. The central poll() loop does not scale well to large numbers of threads. This could probably be solved by converting to the kqueue() interface [Lemon].

2. The entire process blocks on I/O to fast devices. This problem could probably be solved by integrating asynchronous I/O into libc_r, which would first require stabilizing the AIO implementation, then require switching from a poll()-based event loop to a kqueue()-based event loop in order to be able to integrate I/O completion notifications into the same loop.
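For comparison, the core of a kqueue()-based loop might look like this (a sketch of the interface from [Lemon]; thread_resume() is a hypothetical UTS hook, not an existing libc_r function):

#include <sys/types.h>
#include <sys/event.h>
#include <sys/time.h>

extern void thread_resume(int fd);      /* hypothetical UTS hook */

/*
 * Register descriptors for read events and dispatch completions.
 * Unlike poll(), the kernel retains the interest set, so the cost of
 * each wakeup does not grow with the total number of descriptors.
 */
void
event_loop(const int *fds, int nfds)
{
        struct kevent change, events[64];
        int kq, i, n;

        kq = kqueue();
        for (i = 0; i < nfds; i++) {
                EV_SET(&change, fds[i], EVFILT_READ, EV_ADD, 0, 0, NULL);
                kevent(kq, &change, 1, NULL, 0, NULL);
        }
        for (;;) {
                n = kevent(kq, NULL, 0, events, 64, NULL);
                for (i = 0; i < n; i++)
                        thread_resume((int)events[i].ident);
        }
}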

3. Since all threads are multiplexed onto a single process, libc_r cannot take advantage of multiple CPUs in an SMP system. There is no reasonable retrofit to libc_r which solves this scalability problem.

Figure 1: Userland threading model


2.2 Process-based (ala Linux's LinuxThreads)
Process-based threading is confusingly referred to as kernel threading much of the time. True kernel threads are threads that run within the kernel in order to run various portions of the kernel in parallel. Process-based threads are threads that are based on some number of processes that share their address space, and are scheduled as normal processes by the kernel. LinuxThreads implements process-based threading.

New processes are created such that they share the address space of the existing process(es), as shown in Figure 2. For Linux, this is done via the clone() system call, whereas for FreeBSD it is done via a special form of the rfork() system call.
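On FreeBSD, creating such a process looks roughly like this (a sketch; it assumes libc's rfork_thread(3) wrapper, which supplies the child with its own stack, since RFMEM shares the entire address space, stack included):

#include <sys/types.h>
#include <unistd.h>

/* Entry point for the new "thread" (really a process). */
static int
thread_main(void *arg)
{
        /* Runs in a process sharing the creator's address space. */
        return (0);
}

/*
 * Create a thread in the style of the linuxthreads port: a new
 * process (RFPROC) sharing the parent's VM space (RFMEM).  stack_top
 * must point to the high end of a stack allocated for the child.
 */
pid_t
create_thread(void *stack_top, void *arg)
{
        return (rfork_thread(RFPROC | RFMEM, stack_top, thread_main, arg));
}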

As soon as a second thread is created via pthread_create(), LinuxThreads creates an additional process that is used exclusively for synchronization among processes. This means that for a program with n threads, there are n + 1 processes (if at some point in time n > 1).

LinuxThreads is elegant in that it is simple. However, it has at least the following POSIX compliance issues that cannot be easily addressed:

Each thread runs in a separate process. Each process has a unique process ID (pid), but POSIX requires all threads to appear to have the same pid. Fixing this would require significant modifications to the kernel's data structures.

The thread priority semantics specified by POSIX cannot be implemented, because each thread is actually a process, which makes thread contention at the application level impossible. All thread contention is by definition at the system level. This has the additional disadvantage that multi-threaded applications compete unfairly with single-threaded applications, which makes running a mixture of applications difficult on a single machine.

Process-based threading also has some inherent performance and scalability issues that cannot be overcome:

Switching between threads is a very expensive operation. It requires switching to kernel mode, switching the old thread (process) out, and then running the new thread (process). There is no solution to this problem except to optimize process switching, which defies optimization beyond a certain point. This problem is aggravated by cache locality considerations.

Each thread (process) requires all the kernel resources typically associated with a process. This includes a kernel stack, which means that applications with many threads require large amounts of kernel resources.





Figure 2: Process-based threading model



2.3 Multi-level (ala Solaris's LWPs)
Multi-level threading is a hybrid of user-level and process-based threading. Threads are multiplexed onto a pool of processes. The size of the process pool is normally determined automatically by heuristics internal to the threading library that take into account issues such as the number of processors, number of threads, and processor binding of threads.

The idea of multi-level threading is to achieve the performance of userland threading and the SMP scalability of process-based threading. Ideally, most thread scheduling is done by a UTS to avoid the context switch overhead of kernel calls, but multiple threads can run concurrently by running on more than one process at the same time.

In practice, multi-level threading's main shortcoming is its complexity. Ideally, the advantages of userland and process-based threading are combined, without any of the disadvantages, but in practice, some of the disadvantages tend to slip in. The overhead of the multi-level scheduling compounds this.

Multi-level threading does not require light-weight processes (LWPs) in order to work, but Solaris uses LWPs to address the POSIX compliance issues mentioned above for purely process-based threading. Also, in theory, LWPs are light-weight, though Solaris's LWPs no longer generally meet this criterion. That is, by the time Sun got the kinks worked out, LWPs were no longer light-weight.

Figure 3: Light-weight process-based threading model


2.4 Scheduler Activations

This is a very brief overview of scheduler activations (SAs) as presented in [Anderson], and is meant only as a basis for the more complete treatment of kernel-scheduled entities (KSEs) for FreeBSD later in this paper. There are many details left out in this section, which are treated in detail in the original paper.

The SAs approach, like multi-level threading, strives to merge the advantages of userland and process-based threading while avoiding the disadvantages of both approaches. SAs differ from multi-level scheduling in that additional kernel facilities are added in order to provide the UTS with exactly the information and support it needs in order to control scheduling. Simply put, SAs allow the kernel and the UTS to do their jobs without any guess work as to what the other is doing.

A process that takes advantage of SAs has a significantly different flow of control than a normal process, from the perspective of the kernel. A normal process has a number of data structures associated with it in the kernel, including a stack and process control block (PCB). When a process is switched out, its machine state is saved in the PCB. When the process is run again, the machine state is restored from the PCB and the process continues running, whether in kernel or user mode.

A process that is using SAs does not have a kernel stack or PCB. Instead, every time a process is run, a SA is created that contains a kernel stack and thread control block (TCB), and the process runs in the context of the SA. When the SA is preempted or blocked, machine state is stored in the SA's TCB, and the kernel stack is optionally used for completion of a pending system call. See Figure 4.

It is possible to run more than one SA for the same process concurrently on multiple processors. However, the UTS needs to know at all times exactly what processors it is running on, so that it can make informed scheduling decisions. As part of the solution to this problem, every time a SA is started, it initially starts executing in the UTS. Additionally, the kernel makes upcalls to the process to notify it of important events that may affect thread scheduling decisions. The following upcalls are necessary:

sa_new(cpu_id):
Execute a thread on the processor with ID cpu_id.

sa_preempt(sa_id):
A thread was preempted, with SA ID sa_id.

sa_block(sa_id, pcb):
A thread blocked in the kernel, with SA ID sa_id and machine state pcb.

sa_unblock(sa_id, pcb):
A thread that was blocked in the kernel has completed, with SA ID sa_id and machine state pcb.

Also, the following system calls are necessary:

sa_alloc_cpus(ncpus):
Allocate ncpus additional CPUs, if possible.

sa_dealloc_cpu():
Remove a CPU (reduce concurrency by one).
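To make the control flow concrete, a UTS entry point for these upcalls might be organized as follows (a sketch: the event encoding and the runq_*/thread_* helpers are invented for illustration, not part of the SA interface):

struct pcb;
struct thread;

extern struct thread *thread_for_sa(int sa_id);
extern void thread_save(struct thread *, struct pcb *);
extern void thread_restore(struct thread *, struct pcb *);
extern void runq_insert(struct thread *);
extern void uts_schedule(void);         /* picks a thread; does not return */

enum sa_event { SA_NEW, SA_PREEMPT, SA_BLOCK, SA_UNBLOCK };

/* Every upcall starts the SA in the UTS, which reacts and reschedules. */
void
uts_upcall(enum sa_event event, int id, struct pcb *pcb)
{
        switch (event) {
        case SA_NEW:
                break;                  /* new processor: just schedule */
        case SA_PREEMPT:
                runq_insert(thread_for_sa(id));         /* still runnable */
                break;
        case SA_BLOCK:
                thread_save(thread_for_sa(id), pcb);    /* now blocked */
                break;
        case SA_UNBLOCK:
                thread_restore(thread_for_sa(id), pcb); /* runnable again */
                runq_insert(thread_for_sa(id));
                break;
        }
        uts_schedule();                 /* run the best runnable thread */
}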

Figure 4: Scheduler activations



3 Kernel-scheduled entities

Kernel-scheduled entities (KSEs) are similar in concept to scheduler activations, but support for threads with system-level scheduling contention is added. Figure 5 shows how KSEs and userland interact. Scheduling contention is controlled via KSE groups (KSEGs). A process has one or more KSEGs associated with it. Each KSEG has a concurrency level associated with it, which controls the maximum number of concurrent KSEs that can be run for that KSEG. Each KSEG is a separate entity from a timesharing perspective.

Figure 5: Kernel-scheduled entities


A process starts out with one KSEG that has a concurrency level of one. As new threads are created, the concurrency level of that KSEG can be adjusted, up to the maximum number of CPUs available for execution of the process. If a thread with system-level scheduling contention is desired, a new KSEG with a concurrency level of one can be created for that thread.

Since each KSEG is a separate entity from a timesharing perspective, allowing a process to create an arbitrary number of KSEGs would give the process an unfair scheduling advantage. Therefore, instead of enforcing process limits, KSEG limits are enforced. In the common case of single-threaded applications, there is a one to one correspondence between the two methods of enforcing resource limits, but in the case of multi-threaded applications, this prevents a single user from being able to gain a larger portion of the CPU time than would be possible with multiple single-threaded applications.

The KSE architecture makes a distinction between a KSE and a thread. KSEs are used in the kernel in the scheduler queues, and as a general handle, much the same way as a process is used in current BSD kernels. Given a pointer to a KSE, it is possible to access its associated KSEG, proc, and its current thread. A Thread contains the state of a suspended thread of execution. When a running KSE is blocked, the execution state is saved in the Thread so that when it is possible to continue execution, the thread can be re-attached to a KSE and continued. When a KSE is preempted, the execution state is saved in a Thread so that any userland state can be handed to the UTS.
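The linkage can be sketched as follows (field names are invented for illustration; the real kernel structures carry far more state):

struct proc;
struct kseg;
struct thread;

struct kse {
        struct proc     *k_proc;        /* owning process */
        struct kseg     *k_kseg;        /* owning KSE group */
        struct thread   *k_thread;      /* attached Thread, if running */
};

struct kseg {
        struct proc     *kg_proc;       /* owning process */
        int              kg_concurrency; /* max concurrent KSEs */
};

struct thread {
        struct kse      *td_kse;        /* KSE we last ran on, if any */
        /* saved execution state lives here while suspended */
};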



KSEs are only evident within the kernel. The interface with userland only deals with processes, KSEGs, and Threads. KSEs themselves are irrelevant to userland because they serve essentially as an anonymous handle that binds the various kernel structures together.

3.1 Operation without upcalls

Unless upcalls are explicitly activated by a program, execution is almost identical to the traditional BSD process model. The proc structure is still broken into four components, and the KSE is used as the handle to the process most places in the kernel, but otherwise, little is changed. Figure 6 shows the linkage between the four data structures that comprise the single-threaded component. The dashed lines denote linkage that only exists if upcalls are not activated.

Figure 6: KSE data structure linkage for a single-threaded process



3.2 Operation with upcalls

At the time upcalls are activated via a system call, program flow changes radically. Non-blocking system calls will behave normally, but blocking system calls, preemption, and running will cause the program to receive upcalls from the kernel. Figure 7 shows the data structure linkage for a process running on a 4 CPU machine that has real-time, system scope, and timeshared threads. The KSEs that have an associated Thread are currently running.

Figure 7: KSE data structure linkage for a multi-threaded process on a 4 CPU machine


3.3 APIs

KSEs require the ability to make upcalls to userland. This is a very different flow of control than normal process execution uses. Normal processes don't know when they are preempted or resumed; execution appears seamless. With KSEs, the process can be notified of every preemption and resumption, which requires upcalls.

Basically, there is only one upcall, with information as to what events caused it available from a pre-agreed mailbox.
The following system calls (or versions thereof) are necessary:

void kse_new(mailbox_addr, flags):
Start using a new KSE. mailbox_addr points to a structure that contains the necessary data for the kernel to make upcalls. Initially, there is only one KSEG (ID 0), which has a concurrency level of 1. If the flags so specify, a new KSEG is created to hold the new KSE.

int kse_yield():
The KSE returns control to the kernel; the call does not return.

int kse_wakeup(kse_id):
Trigger an upcall from this KSE if it is currently inactive.

int thread_cancel(thread_id):
If the thread in question is stopped somewhere, an attempt is made to abort the operation (similar to signals).

int kse_bind(int kse_id, int cpu_id):
Bind the KSE with ID kse_id to the CPU with ID cpu_id. This system call returns the CPU ID that the KSE is bound to, or -1 if there is an error (e.g. invalid CPU ID)
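Taken together, program startup might use these calls roughly as follows (a hypothetical sketch against the proposed interface: the KSE_GROUP flag name and sequentially assigned KSE IDs are assumptions, not a settled API):

struct kse_mailbox mb[2], mb_sys;

/*
 * Hypothetical startup for a program wanting a timeshared KSEG with
 * a concurrency level of 2, plus one system scope thread in its own
 * KSEG bound to a CPU.
 */
void
threads_init(void)
{
        kse_new(&mb[0], 0);             /* first KSE, in the initial KSEG */
        kse_new(&mb[1], 0);             /* raise KSEG 0's concurrency to 2 */
        kse_new(&mb_sys, KSE_GROUP);    /* new KSEG: system scope thread */
        kse_bind(2, 0);                 /* bind that KSE (assumed ID 2) to CPU 0 */
}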

Figure 7 shows the basic linkage between processes (proc), KSEGs (kseg), KSEs (kse), and Threads (thread). The diagram corresponds to a four processor machine. Two of the processors are running KSEs on behalf of bound KSEGs, and the other two processors are running KSEs on behalf of a softly-affined KSEG.

In addition to the new system calls, the getpriority() and setpriority() system calls need to be extended to handle KSEGs. It may be necessary to do so in a way that is not backward compatible, since at a system level, a KSEG can only be uniquely specified using the process ID and the KSEG ID.
3.4 Kernel Scheduler

A number of changes to the kernel scheduler are mandatory in order to support KSEs. Chief among these changes are:

* KSEs, rather than processes, are placed in the scheduler queues. This enables concurrent execution of KSEs that are associated with the same process.

* Timesharing calculations are done with KSEGs, rather than processes. This allows multiple KSEGs in one process to compete for CPU time at a system-wide level.

* The state for blocked (incomplete) system calls is stored in Threads. This means that this queue consists of Threads rather than KSEs. In other words, the scheduler deals with KSEs in some places, and Threads in other places.

Additionally, soft processor affinity for KSEs is important to performance. KSEs are not generally bound to CPUs, so KSEs that belong to the same KSEG can potentially compete with each other for the same processor; soft processor affinity tends to reduce such competition, in addition to well-known benefits of processor affinity.

3.5 Userland threads scheduler

The KSE-based UTS is actually simpler than is possible for a userland-only threads implementation, mainly because there is no need to perform call conversion. The following is a simplified representation of the core UTS logic.
1. Find the highest priority thread that is mapped to the kseg that this Thread is part of. Optionally, heuristically try to improve cache locality by running a thread that may still be partially warm in the processor cache.
2. Set a timer that will indicate the end of the scheduling quantum.
3. Run the thread.
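In code, the three steps above might look like this (a sketch; all helper names are invented):

#define QUANTUM_USEC 10000              /* assumed scheduling quantum */

struct kseg_runq;                       /* per-KSEG run queue */
struct thread;

extern struct thread *runq_highest(struct kseg_runq *);
extern void quantum_timer_arm(int usec);
extern void thread_run(struct thread *);

void
uts_core(struct kseg_runq *rq)
{
        struct thread *td;

        for (;;) {
                /* 1. Highest priority thread, preferring cache-warm ones. */
                td = runq_highest(rq);
                /* 2. Mark the end of the scheduling quantum. */
                quantum_timer_arm(QUANTUM_USEC);
                /*
                 * 3. Run it; in a real UTS, control re-enters uts_core()
                 * via an upcall rather than a function return.
                 */
                thread_run(td);
        }
}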
Of course, there are a number of ways to enter this code, such as getting a new Thread, running a thread to the end of a quantum, or rescheduling due to another thread blocking. However, the fact remains that the UTS logic is quite simple.

3.5.1 Temporary priority inversion
The UTS always has the information it needs to make fully informed scheduling decisions. However, in the case of priority-based thread scheduling, there are some circumstances that can cause temporary scheduling inversions, where a thread may continue to run to the end of its quantum despite there being a higher priority runnable thread. This can happen when:

1. A new thread is created that has a lower priority than its creator, but a higher priority than a thread that is concurrently running on another processor.

2. A KSE (running thread A) is preempted by the kernel, and the upcall notification causes preemption of thread B, which is higher priority than thread A, though thread C is also running on another processor and has a lower priority than both A and B. In this case, A will be scheduled and C will continue to run, even though thread B is higher priority than C. Note that there are much more convoluted examples of this same basic idea.

Such temporary inversions are technically violations of the policy that lower priority threads never run in the presence of higher priority threads, but there are two reasons not to do anything about it:

1. Solutions to this problem require additional system calls in which the UTS explicitly asks the kernel to preempt KSEs. This is expensive.
2. Temporary inversions are logically equivalent to the race condition where the UTS determines that a thread should be preempted in favor of scheduling another thread, while the thread races to complete its quantum. It is not important whether the UTS or the thread wins the race; allowing the lower priority thread to complete its quantum without competition from the UTS simply allows the thread to always win.

3.6 Initialization and upcalls
The code that is necessary to start up KSEs in a userland program looks something like:

void
foo(void)
{
        struct kse_mailbox km;

        kse_init(&km);
        /* Handle upcall and longjmp() to UTS. */
}

Each upcall appears to be a return from kse_init(). The stack on which kse_init() is called must never be unwound, since the kernel fakes up a context at the point where kse_init() is called for each upcall. The argument to kse_init() points to a data structure that contains enough information for the kernel to determine where to write events. Due to the asynchronous completion of blocked Threads, it is possible for a single upcall to report a large number of thread_unblock() events, the number of which is only (potentially) bounded by the resource limit settings for the process.

Since thread_unblock() events include the userland Thread execution state, the amount of space needed to report all events during an upcall can vary wildly; static event buffers are inadequate for such a dynamic problem. Therefore, km contains a pointer to a chain of objects with embedded TCBs, and the kernel stores the TCBs from thread_preempt() and thread_unblock() events in the chain elements. Figure 8 shows the basics of how struct kse_mailbox is laid out.
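A sketch of such a mailbox layout (the field names and the TCB placeholder are guesses based on Figure 8, not the final interface):

/* Placeholder for the machine-dependent saved thread state. */
struct tcb {
        unsigned long tcb_regs[32];
};

/* A chain element with an embedded TCB, as in Figure 8. */
struct thread_mailbox {
        struct thread_mailbox *tm_next; /* chain linkage */
        struct tcb             tm_tcb;  /* saved thread state */
};

/* Per-KSE mailbox handed to the kernel by kse_new()/kse_init(). */
struct kse_mailbox {
        void (*km_upcall)(void);        /* upcall entry point */
        void *km_stack;                 /* stack on which upcalls run */
        struct thread_mailbox *km_chain; /* TCB chain for event storage */
};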

Figure 8: TCB chaining for upcall event storage




3.6.1 Per-upcall event ordering

The techniques described in the previous section for storing events separate the event types that pass TCBs from those that don't. This means that event ordering is not implicit, and would require extra work to preserve. However, it turns out that in most cases, order doesn't matter, and in the cases that it does matter, the UTS can always determine order.

As an example, suppose thread_block() and thread_unblock() events occur for the same Thread ID in the same upcall. In one interpretation, where the thread_block() event occurred first, the UTS last knew the corresponding thread to be running, so the two events simply mean that the thread can be continued. In the other interpretation, where the thread_unblock() event occurred first, the Thread ID has been recycled. However, this isn't possible, since there must have been an intervening thread_new() event, which couldn't have happened until after the Thread unblocked.

3.7 Upcall parallelism
This section is not yet adequately fleshed out. Issues to consider:
* How do upcalls from multiple processors keep from stomping on each other? For example, if we have only one chain of TCBs, how does it get safely used?
* What if the UTS doesn't process the TCBs before another upcall happens? Do we run the risk of overflowing the TCB chain?
* How do we handle races between the UTS and the kernel? For example, signals cannot be cleared until delivered to a thread, and the kernel must look at the userland bitmap to see if a signal is still pending.
3.8 Kernel scheduler
The kernel scheduler needs a major overhaul in order to support KSEs. Support for the following features is necessary:
* Soft affinity. The scheduler should try to keep processes running on the same processors if there is the possibility of reaping benefits from increased cache locality.
* Hard affinity (binding). Some programs may wish to bind a KSE (pedantically speaking, a KSEG is bound) to a particular processor in order to improve cache locality, or to tighten the bounds on a real-time task. Bound KSEs are run only on one CPU, unless priority inheritance forces it to be run on another CPU in order to avoid deadlock due to priority inversion.

* Increased parallelism. The current scheduler can only be executed by one processor at a time, which makes it a severe bottleneck, depending on the type of system load.
Part of the solution is to implement per-CPU scheduling queues. Each CPU has an associated queue for softly affined KSEs, as well as a queue for bound KSEs. In addition, there is a global queue for fixed-priority KSEs, such as interrupt threads and real-time KSEs. Fixed-priority KSEs prefer to run on the same CPU as the last time they were run, but scheduling issues keep more complex affinity algorithms from being beneficial for this class of KSEs. See Figure 9 for a simplistic representation of the various scheduling queues. The diagram ignores some details such as split queues for various priorities.

Each CPU schedules from its own queues, and resorts to stealing runnable softly affined KSEs from other CPUs if there are no runnable KSEs. Every so often (exact number to be determined by benchmarking, but a likely number is 5 seconds), CPU loads are calculated, and if the loads are too disparate, the under-loaded CPU(s) steal additional KSEs from the overloaded CPU(s) in an attempt to balance the load. Instantaneous per-CPU load is defined as the number of runnable timesharing KSEs in a CPU's queues. Instantaneous system load is the sum of the instantaneous per-CPU loads. Note that normal (non-balancing) KSE stealing is adequate to keep the system busy if there is enough work to do, but that without periodic load balancing, time sharing becomes unfair if there aren't approximately equal CPU loads.
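The queue arrangement and stealing policy can be sketched as follows (illustrative only: locking, per-priority queues, the global fixed-priority queue, and the periodic load balancer are omitted):

#include <sys/queue.h>
#include <stddef.h>

struct kse {
        TAILQ_ENTRY(kse) k_link;        /* scheduler queue linkage */
};

TAILQ_HEAD(kse_list, kse);

struct kse_queue {
        struct kse_list kq_head;
        int             kq_load;        /* runnable timesharing KSEs */
};

/* Per-CPU queues, as in Figure 9. */
struct cpu_sched {
        struct kse_queue cs_bound;      /* KSEs bound to this CPU */
        struct kse_queue cs_affine;     /* softly affined KSEs */
};

static struct kse *
kse_dequeue(struct kse_queue *kq)
{
        struct kse *ke;

        if ((ke = TAILQ_FIRST(&kq->kq_head)) != NULL) {
                TAILQ_REMOVE(&kq->kq_head, ke, k_link);
                kq->kq_load--;
        }
        return (ke);
}

/* Prefer local queues; steal a softly affined KSE rather than idle. */
struct kse *
sched_choose(struct cpu_sched *self, struct cpu_sched *cpu, int ncpus)
{
        struct kse *ke;
        int i;

        if ((ke = kse_dequeue(&self->cs_bound)) != NULL)
                return (ke);
        if ((ke = kse_dequeue(&self->cs_affine)) != NULL)
                return (ke);
        for (i = 0; i < ncpus; i++)
                if ((ke = kse_dequeue(&cpu[i].cs_affine)) != NULL)
                        return (ke);
        return (NULL);                  /* nothing runnable: idle */
}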

This is essentially the approach that was taken in DEC OSF/1 3.0 [Denham], and it seems to have worked well there. One notable difference between our kernel and OSF/1 is that our interrupts are backed by threads, whereas OSF/1 keeps spl()s (IPLs in their terminology). However, this doesn't affect the design, since OSF/1 already has to handle real-time scheduling.
Figure 9: Kernel scheduler queues


4 Summary
Discussion about improved threads support in FreeBSD has been ongoing for several years. The KSE project aims to implement a kernel mechanism that allows the kernel and userland to support threaded processes and communicate with each other effectively so that the necessary information is available for both to do their jobs efficiently and correctly.
Glossary

KSE:
Kernel-scheduled entity. This gets scheduled.

Thread:
A runnable context. This gets suspended.

KSEG:
Kernel-scheduled entity group.

PCB:
Process control block.

SA:
Scheduler activation.

TCB:
Thread control block.

UTS:
Userland threads scheduler. Userland, multi-level, and scheduler activation-based threads libraries all have a UTS.

Bibliography

Anderson
Thomas E. Anderson, Brian N. Bershad, Edward D. Lazowska, and Henry M. Levy, Scheduler Activations: Effective Kernel Support for the User-Level Management of Parallelism, ACM Transactions on Computer Systems, Vol. 10, No. 1, February 1992, Pages 53-79.

Boykin
Joseph Boykin, David Kirschen, Alan Langerman, and Susan LoVerso, Programming under Mach, Addison-Wesley Publishing Company, Inc. (1993).

Butenhof
David R. Butenhof, Programming with POSIX threads, Addison Wesley Longman, Inc. (1997).

Denham
Jeffrey M. Denham, Paula Long, and James A. Woodward, DEC OSF/1 Symmetric Multiprocessing, Digital Technical Journal, Vol. 6, No. 3.

Kleiman
Steve Kleiman, Devang Shah, and Bart Smaalders, Programming with Threads, SunSoft Press (1996).

Lemon
Jonathan Lemon, Kqueue: A generic and scalable event notification facility, BSDcon Conference Proceedings (2000), http://people.freebsd.org/~jlemon/kqueue.pdf.

Mauro
Jim Mauro and Richard McDougall, Solaris Internals, Sun Microsystems Press (2001).

McKusick
Marshall Kirk McKusick, Keith Bostic, Michael J. Karels, and John S. Quarterman, The Design and Implementation of the 4.4BSD Operating System, Addison-Wesley Publishing Company, Inc. (1996).

Vahalia
Uresh Vahalia, UNIX Internals: The New Frontiers, Prentice-Hall, Inc. (1996).


DarkBlueSea replied on 2006-04-02 10:05:45:

Since this material is long and highly theoretical, I am still not sure about many of the details; pointers and corrections are welcome.

