Monitoring Linux Performance with profile and OProfile

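Using profile:
The profile facility is architecture-independent and can monitor four kinds of Linux kernel activity:
#define CPU_PROFILING   1
#define SCHED_PROFILING 2
#define SLEEP_PROFILING 3
#define KVM_PROFILING   4
To turn profiling on, enable the corresponding option in menuconfig and also add profile=**,## to the kernel command line.
Here ** selects one of the four modes above and ## is a number that sets the sampling granularity; the smaller the number, the finer the granularity.
Once this is done, use readprofile from the util-linux package to read the results. It reads the data from /proc/profile and formats it for display. The following notes are reproduced from another source: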
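1. How to use profile:
First confirm that the kernel supports profile, then add a parameter such as profile=1 (or another value) to the kernel boot command line. Newer kernels also support profile=schedule,1.
2. After booting, the kernel creates the file /proc/profile, which can be read with readprofile,
for example readprofile -m /proc/kallsyms | sort -nr > ~/cur_profile.log,
or readprofile -r -m /proc/kallsyms | sort -nr,
or readprofile -r && sleep 1 && readprofile -m /proc/kallsyms | sort -nr > ~/cur_profile.log
3. What can be obtained by reading /proc/profile?
The contents depend on the profile= setting given at boot:
With profile=schedule you get the number of times each function called schedule(), which is useful when debugging scheduling.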
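How profile is implemented:
The kernel creates a /proc/profile interface. At boot, profile_init() allocates the memory that holds the profiling data, with one counter per instruction slot.
With profile=2, the kernel counts how often each instruction is executed. The timer interrupt calls profile_tick(CPU_PROFILING, regs), which adds 1 to the counter for the current instruction, regs->eip. This count is imprecise: many functions may run within one jiffy, yet only the function that happens to be executing when the timer interrupt fires is counted. With enough samples, however, the data is still informative.
With profile=schedule, the kernel counts how often each instruction calls schedule(); schedule() itself calls profile_hit(SCHED_PROFILING, __builtin_return_address(0));
Only a limited number of instructions actually call schedule(), but this data pinpoints the scheduling points precisely.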

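profile_hit() adds 1 to the counter for the current instruction.
profile_tick() adds 1 to the counter for the corresponding instruction on every timer tick.
timer_hook is generally used by other profiling tools, such as oprofile, to install their own handler that runs on every timer interrupt.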
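
The counting described above can be sketched in user space as follows. This is only an illustration of the idea, not the kernel's code (which lives in kernel/profile.c): the names profile_buf, text_start, prof_shift, PROF_LEN, and record_hit are hypothetical, and the start address is made up for the example.

```c
#include <stddef.h>
#include <stdint.h>

#define PROF_LEN 1024          /* number of counters (hypothetical size) */

static unsigned int profile_buf[PROF_LEN];   /* one counter per bucket */
static const uintptr_t text_start = 0x1000;  /* assumed start of the profiled code */
static const unsigned int prof_shift = 2;    /* granularity: 2^2 = 4 bytes per bucket */

/* Map a program-counter value to its bucket and add one hit.
 * Returns the index of the bucket that was incremented. */
static size_t record_hit(uintptr_t pc)
{
    size_t idx = (size_t)((pc - text_start) >> prof_shift);

    if (idx >= PROF_LEN)
        idx = PROF_LEN - 1;    /* clamp out-of-range addresses to the last bucket */
    profile_buf[idx]++;
    return idx;
}
```

A larger shift value makes each bucket cover more bytes of code, which is why a smaller number on the profile= command line gives finer granularity.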

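Profile data can in fact cover all task statistics, so you can use profile_event_register() to hook your own callback when a task exits or when user-space memory is released, and collect that information.

Profile accounting differs between SMP and UP kernels, that is, profile_hit() is implemented differently. The SMP implementation uses a per-CPU cache, which avoids the inefficiency of several CPUs updating the profile counters at once. See kernel/profile.c in the kernel source for details.
Using oprofile:
oprofile is a platform-dependent tool, so check which events your platform supports.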

http://oprofile.sourceforge.net/doc/index.html
http://oprofile.sourceforge.net/doc/internals/index.html

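Introduction

As a developer trying to make code more efficient, you may find that tracking down performance bottlenecks is one of the hardest tasks you face. Code profiling is one way to make this task easier. It consists of analyzing data samples that represent certain processor activity on a running system. OProfile provides this capability for Linux on POWER. OProfile is included in the latest IBM-supported Linux on POWER distributions: Red Hat Enterprise Linux 4 (RHEL4) and SUSE LINUX Enterprise Server 9 (SLES9). This article introduces OProfile for Linux on POWER and gives two examples that show how to use it to find performance bottlenecks.

Overview of code profiling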

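OProfile for Linux on POWER uses a kernel module and a user-space daemon. The kernel module accesses the performance counter registers, and the daemon runs in the background and collects data from those registers. Before the daemon is started, OProfile configures the event types and the sample count for each event. If no event is configured, OProfile uses the default event for Linux on POWER, CYCLES, which counts processor cycles. The sample count for an event determines how many times the event must occur before one sample is recorded. OProfile is designed to run with low overhead so that the background daemon does not disturb system performance.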
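
As a rough worked example of what the sample count means: if one sample is taken every N events, then S samples stand for approximately S × N events. The helper name estimated_events below is hypothetical and is not part of OProfile.

```c
#include <stdint.h>

/* Approximate number of hardware events represented by `samples` samples,
 * given that one sample is recorded every `sample_count` events. */
static uint64_t estimated_events(uint64_t samples, uint64_t sample_count)
{
    return samples * sample_count;
}
```

For instance, 6244 samples collected with CYCLES:1000 would correspond to roughly 6.2 million processor cycles.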

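OProfile has kernel support for the POWER4, POWER5, and PowerPC 970 processors. The PowerPC 970 and POWER4 processors have 8 counter registers, while the POWER5 processor has 6. On architectures without OProfile kernel support, timer mode is used instead. In this mode OProfile relies on a timer interrupt, so it cannot profile code that runs with interrupts disabled.

OProfile tools

Alongside the kernel support, OProfile ships user-space tools that interact with the kernel and tools that analyze the collected data. As noted earlier, the OProfile daemon collects the sample data. The tool that controls the daemon is called opcontrol. Table 1 lists some common opcontrol command-line options. Later sections describe opreport and opannotate, the two tools used to analyze the collected data. Section 2.2 of the OProfile manual gives an overview of all the OProfile tools. (See the documentation links above.)

The supported processor event types differ between RHEL4 and SLES9, just as they vary between POWER processors. Use opcontrol with the --list-events option to list the events supported on your platform.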

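Table 1. opcontrol command-line options
opcontrol option   Description
--list-events      List processor events and unit masks
--vmlinux=         Kernel image file to profile
--no-vmlinux       Do not profile the kernel
--reset            Clear the data from the current session
--setup            Configure the daemon before running it
--event=           Monitor the given processor event
--start            Start sampling
--dump             Flush the collected data to the daemon
--stop             Stop data sampling
-h                 Shut down the daemon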

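OProfile examples
You can use OProfile to analyze processor cycles, TLB misses, memory references, branch mispredictions, cache misses, interrupt handlers, and more. Again, opcontrol --list-events gives the complete list of events that can be monitored on a particular processor.

The following examples show how to use OProfile for Linux on POWER. The first monitors processor cycles to find a poorly written algorithm that creates a potential performance bottleneck. Although the example is small, the same approach applies when you profile an application to find out where most of its processor cycles go. You can then examine that part of the code more closely to see whether it can be optimized.

The second example is more involved. It shows how to find level 2 (L2) data cache misses and offers two ways to reduce them.

Example 1: Profiling poorly written code

The goal of this example is to compile and profile a poorly written code sample to find out which function performs badly. The example is small and contains just two functions, slow_multiply() and fast_multiply(), both of which compute the product of two numbers, as shown in Listing 1.

Listing 1. Two functions that perform multiplication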

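/* Multiply directly with the * operator. */
int fast_multiply(int x, int y)
{
        return x * y;
}
/* Multiply by repeated addition: deliberately slow, x additions per call. */
int slow_multiply(int x, int y)
{
        int i, j, z;
        for (i = 0, z = 0; i < x; i++)
                z = z + y;
        return z;
}
int main(void)
{
        int i,j;
        int x,y;
        /* call each function 200 * 30 = 6000 times */
        for (i = 0; i < 200; i ++) {
                for (j = 0; j < 30 ; j++) {
                        x = fast_multiply(i, j);
                        y = slow_multiply(i, j);
                }
        }
        return 0;
}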

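Profile this code and examine it with opannotate, a tool that lets you view the source code annotated with OProfile data. The source must first be compiled with debugging information, which opannotate needs to add its annotations. Compile the example in Listing 1 with gcc, the GNU Compiler Collection C compiler, by running the following command. The -g flag tells gcc to include debugging information.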

gcc  -g multiply.c -o multiply

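Next, profile the code with the commands in Listing 2, using the CYCLES event to count processor cycles.

Listing 2. Commands used to profile the multiplication example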

# opcontrol --vmlinux=/boot/vmlinux-2.6.5-7.139-pseries64
# opcontrol --reset
# opcontrol --setup --event=CYCLES:1000
# opcontrol --start
Using 2.6+ OProfile kernel interface.
Reading module info.
Using log file /var/lib/oprofile/oprofiled.log
Daemon started.
Profiler running.
# ./multiply
# opcontrol --dump
# opcontrol --stop
Stopping profiling.
# opcontrol -h
Stopping profiling.
Killing daemon.

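Finally, use opannotate with the --source option to produce annotated source code, or with the --assembly option to produce annotated assembly. Which of the two you use, or whether you use both, depends on the level of detail you want. For this example, --source alone is enough to determine where most of the processor cycles are spent.

Listing 3. opannotate results for the multiplication example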

# opannotate --source ./multiply
/*
 * Command line: opannotate --source ./multiply
 *
 * Interpretation of command line:
 * Output annotated source file with samples
 * Output all files
 *
 * CPU: ppc64 POWER5, speed 1656.38 MHz (estimated)
 * Counted CYCLES events (Processor cycles) with a unit mask of
0x00 (No unit mask) count 1000
 */
/*
 * Total samples for file : "/usr/local/src/badcode/multiply.c"
 *
 *   6244 100.000
 */
               :int fast_multiply(x, y)
    36  0.5766 :{ /* fast_multiply total:     79  1.2652 */
    26  0.4164 :        return x * y;
    17  0.2723 :}
               :
               :int slow_multiply(x, y)
    50  0.8008 :{ /* slow_multiply total:   6065 97.1332 */
               :        int i, j, z;
  2305 36.9154 :        for (i = 0, z = 0; i < x; i++)
  3684 59.0006 :                z = z + y;
    11  0.1762 :        return z;
    15  0.2402 :}
               :
               :int main()
               :{ /* main total:    100  1.6015 */
               :        int i,j;
               :        int x,y;
               :
     1  0.0160 :        for (i = 0; i < 200; i ++) {
     6  0.0961 :                for (j = 0; j < 30 ; j++) {
    75  1.2012 :                        x = fast_multiply(i, j);
    18  0.2883 :                        y = slow_multiply(i, j);
               :                }
               :        }
               :        return 0;
               :}

The following lines from Listing 3 show the number of CYCLES samples attributed to each of the two multiplication functions:

36  0.5766 :{ /* fast_multiply total:     79  1.2652 */

50  0.8008 :{ /* slow_multiply total:   6065 97.1332 */

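As you can see, fast_multiply() accounts for only 79 samples, while slow_multiply() accounts for 6065. Although this is a small example that is unlikely to occur in practice, it is enough to show how to profile code and analyze it for performance bottlenecks.

Example 2: Finding L2 data cache misses

This example is more involved than the first: it looks for level 2 (L2) data cache misses. POWER processors contain an on-chip L2 cache, a fast memory located close to the processor. The processor accesses frequently modified data from the L2 cache. Problems can arise when two processors share a data structure and modify it at the same time. CPU1 holds a copy of the data in its L2 cache, and CPU2 modifies the shared data structure. The copy in CPU1's L2 cache is now invalid and must be updated. CPU1 then has to go through the costly steps of retrieving the data from main memory, which takes extra processor cycles.

In this example you will look at the data structure shown in Listing 4 and at what happens when two processors modify it at the same time. You will then observe the data cache misses and examine two ways to fix the problem.

Listing 4. The shared data structure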

struct shared_data_struct {
   unsigned int data1;
   unsigned int data2;
};

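The program in Listing 5 uses the clone() system call with the CLONE_VM flag to create a child process. The CLONE_VM flag makes the child run in the same memory space as the parent. The parent thread modifies the first member of the data structure, and the child thread modifies the second.

Listing 5. Code sample that demonstrates L2 data cache misses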

#define _GNU_SOURCE     /* needed for clone() and CLONE_VM */
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#define STACK_SIZE 4096
struct shared_data_struct {
        unsigned int data1;
        unsigned int data2;
};
struct shared_data_struct shared_data;
static int inc_second(void *);
int main(void){
        int i, j, pid;
        char *child_stack;
        /* allocate memory for other process to execute in */
        if((child_stack = malloc(STACK_SIZE)) == NULL) {
                perror("Cannot allocate stack for child");
                exit(1);
        }
        /* clone process and run in the same memory space;
           the stack grows down, so pass the top of the allocation */
        if ((pid = clone(inc_second, child_stack + STACK_SIZE,
           CLONE_VM, &shared_data)) < 0) {
                perror("clone call failed");
                exit(1);
        }
        /* increment first member of shared struct */
        for (j = 0; j < 2000; j++) {
                for (i = 0; i < 100000; i++) {
                        shared_data.data1++;
                }
        }
        return 0;
}
static int inc_second(void *arg)
{
        struct shared_data_struct *sd = arg;
        int i,j;
        /* increment second member of shared struct */
        for (j = 1; j < 2000; j++) {
                for (i = 1; i < 100000; i++) {
                        sd->data2++;
                }
        }
        return 0;
}

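Using gcc, compile this sample program without optimization by running the command in Listing 6.

Listing 6. Command used to compile the sample code in Listing 5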

gcc -o cache-miss cache-miss.c

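You can now use OProfile to analyze the L2 data cache misses that occur in this program.

For this example, the author ran and profiled the program on an IBM eServer OpenPower 710 with two POWER5 processors, running SLES9 Service Pack 1 (SLES9SP1). Pass the --list-events flag to opcontrol to determine which event monitors L2 data cache misses. On a POWER5-based system running SLES9SP1, the PM_LSU_LMQ_LHR_MERGE_GP9 event monitors L2 data cache misses. If you set the sample count to 1000, as in this example, OProfile takes one sample for every 1000 hardware events. On a different platform, such as a POWER4-based server, the event will be different.

Profile the sample code with the commands in Listing 7:

Listing 7. Commands used to profile L2 data cache misses in the Listing 5 example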

# opcontrol --vmlinux=/boot/vmlinux-2.6.5-7.139-pseries64
# opcontrol --reset
# opcontrol --setup --event=PM_LSU_LMQ_LHR_MERGE_GP9:1000
# opcontrol --start
Using 2.6+ OProfile kernel interface.
Reading module info.
Using log file /var/lib/oprofile/oprofiled.log
Daemon started.
Profiler running.
# ./cache-miss
# opcontrol --dump
# opcontrol -h
Stopping profiling.
Killing daemon.
# opreport -l ./cache-miss
CPU: ppc64 POWER5, speed 1656.38 MHz (estimated)
Counted PM_LSU_LMQ_LHR_MERGE_GP9 events (Dcache miss occurred for
  the same real cache line as earlier req, merged into LMQ) with a
    unit mask of 0x00 (No unit mask) count 1000
samples  %        symbol name
47897    58.7470  main
33634    41.2530  inc_second

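The opreport results show many cache misses in the functions main() and inc_second(). The -l option makes opreport print per-symbol information; without it, the output would list only binary image names. Again, the cause of the cache misses is that two processors modify a shared data structure that is 8 bytes in size and sits in a single 128-byte cache line.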
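
To see why both members contend for the same line, the check below compares their offsets against the 128-byte line size given in the text. It is a minimal sketch that assumes the structure starts on a cache-line boundary; the helper names same_cache_line and members_share_line are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

#define CACHE_LINE 128   /* cache line size in bytes, as stated in the text */

struct shared_data_struct {
    unsigned int data1;
    unsigned int data2;
};

/* True if two byte offsets inside a line-aligned object fall in the same cache line. */
static bool same_cache_line(size_t off_a, size_t off_b)
{
    return off_a / CACHE_LINE == off_b / CACHE_LINE;
}

/* Checks whether data1 and data2 of the unpadded struct share one cache line,
 * assuming the struct itself begins on a cache-line boundary. */
static bool members_share_line(void)
{
    return same_cache_line(offsetof(struct shared_data_struct, data1),
                           offsetof(struct shared_data_struct, data2));
}
```

For the unpadded structure the two offsets are only a few bytes apart, so a write by either processor invalidates the line that the other processor is using.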

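One way to eliminate the data cache misses is to pad the data structure so that each member is stored in its own cache line. Listing 8 contains a modified structure with 124 bytes of padding.

Listing 8. Padded data structure with each member in a different cache line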

struct shared_data_struct {
   unsigned int data1;
   char pad[124];
   unsigned int data2;
};
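
An alternative to hand-counted padding, sketched here under the assumption of a C11 compiler, is to let the compiler place each member on its own 128-byte boundary with _Alignas. The structure name aligned_shared_data is hypothetical and does not appear in the original article.

```c
#include <stddef.h>

#define CACHE_LINE 128   /* cache line size in bytes, as stated in the text */

/* Each member starts on its own cache-line boundary (C11 _Alignas). */
struct aligned_shared_data {
    _Alignas(CACHE_LINE) unsigned int data1;
    _Alignas(CACHE_LINE) unsigned int data2;
};

/* Compile-time checks that the layout is the intended one. */
_Static_assert(offsetof(struct aligned_shared_data, data2) == CACHE_LINE,
               "data2 should start on the second cache line");
_Static_assert(sizeof(struct aligned_shared_data) == 2 * CACHE_LINE,
               "struct should span exactly two cache lines");
```

With this approach a global or static instance is also aligned to the cache-line size, which the manual pad[] array alone does not guarantee.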

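Recompile the program as before, this time with the modified data structure. Then profile it again with the commands in Listing 9.

Listing 9. Commands used to profile L2 data cache misses after padding the data structure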

# opcontrol --vmlinux=/boot/vmlinux-2.6.5-7.139-pseries64
# opcontrol --reset
# opcontrol --setup --event=PM_LSU_LMQ_LHR_MERGE_GP9:1000
# opcontrol --start
Using 2.6+ OProfile kernel interface.
Reading module info.
Using log file /var/lib/oprofile/oprofiled.log
Daemon started.
Profiler running.
# ./cache-miss
# opcontrol --dump
# opcontrol -h
Stopping profiling.
Killing daemon.
# opreport -l ./cache-miss
error: no sample files found: profile specification too strict ?

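opreport reports a possible error because it found no sample data. With the change to the shared data structure, however, this is expected: each data member is now in its own cache line, so there are no L2 cache misses.

You can now examine the cost of the L2 cache misses in processor cycles. First, profile the code that uses the original, unpadded shared data structure (Listing 4). The event to sample is CYCLES. Profile the example for the CYCLES event with the commands in Listing 10.

Listing 10. Commands used to profile processor cycles in the Listing 5 example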

# opcontrol --vmlinux=/boot/vmlinux-2.6.5-7.139-pseries64
# opcontrol --reset
# opcontrol --setup --event=CYCLES:1000
# opcontrol --start
Using 2.6+ OProfile kernel interface.
Reading module info.
Using log file /var/lib/oprofile/oprofiled.log
Daemon started.
Profiler running.
# ./cache-miss
# opcontrol --dump
# opcontrol -h
Stopping profiling.
Killing daemon.
# opreport -l ./cache-miss
CPU: ppc64 POWER5, speed 1656.38 MHz (estimated)
Counted CYCLES events (Processor cycles) with a unit mask of 0x00
(No unit mask) count 1000
samples  %        symbol name
121166   53.3853  inc_second
105799   46.6147  main

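Now profile the sample code that uses the padded data structure (Listing 8) with the commands in Listing 11.

Listing 11. Commands used to profile processor cycles in the example with the padded data structure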

# opcontrol --vmlinux=/boot/vmlinux-2.6.5-7.139-pseries64
# opcontrol --reset
# opcontrol --setup --event=CYCLES:1000
# opcontrol --start
Using 2.6+ OProfile kernel interface.
Reading module info.
Using log file /var/lib/oprofile/oprofiled.log
Daemon started.
Profiler running.
# ./cache-miss
# opcontrol --dump
# opcontrol -h
Stopping profiling.
Killing daemon.
# opreport -l ./cache-miss
CPU: ppc64 POWER5, speed 1656.38 MHz (estimated)
Counted CYCLES events (Processor cycles) with a unit mask of 0x00
 (No unit mask) count 1000
samples  %        symbol name
104916   58.3872  inc_second
74774    41.6128  main

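As expected, the number of processor cycles rises with the number of L2 cache misses. The main reason is that fetching data from main memory is expensive compared with fetching it from the L2 cache.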
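
To put a number on the difference, the sample totals from the two CYCLES runs can be compared directly. The helper below is a small illustrative sketch; the name percent_reduction is hypothetical.

```c
/* Percentage by which `after` is lower than `before`
 * (for example, total CYCLES samples before and after padding). */
static double percent_reduction(unsigned long before, unsigned long after)
{
    return 100.0 * ((double)before - (double)after) / (double)before;
}
```

The unpadded run recorded 121166 + 105799 = 226965 samples and the padded run 104916 + 74774 = 179690 samples, so percent_reduction(226965, 179690) gives a reduction of about 20.8 percent.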

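Another way to avoid cache misses between two processors is to run both threads on the same processor. CPU affinity binds a process to a particular processor, as the following example shows. On Linux, the sched_setaffinity() system call can be used to run both threads on one processor. Listing 12 is another variant of the original sample program that uses sched_setaffinity() to do this.

Listing 12. Sample code that uses CPU affinity to avoid L2 cache misses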

#define _GNU_SOURCE     /* needed for clone(), cpu_set_t and the CPU_* macros */
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#define STACK_SIZE 4096
struct shared_data_struct {
        unsigned int data1;
        unsigned int data2;
};
struct shared_data_struct shared_data;
static int inc_second(void *);
int main(void){
        int i, j, pid;
        cpu_set_t cmask;
        size_t len = sizeof(cmask);
        char *child_stack;
        /* build a CPU mask that contains only CPU 0 */
        CPU_ZERO(&cmask);
        CPU_SET(0, &cmask);
        /* allocate memory for other process to execute in */
        if((child_stack = malloc(STACK_SIZE)) == NULL) {
                perror("Cannot allocate stack for child");
                exit(1);
        }
        /* clone process and run in the same memory space;
           the stack grows down, so pass the top of the allocation */
        if ((pid = clone(inc_second, child_stack + STACK_SIZE,
            CLONE_VM, &shared_data)) < 0) {
                perror("clone call failed");
                exit(1);
        }
        /* sched_setaffinity() returns 0 on success and -1 on error */
        if (sched_setaffinity(0, len, &cmask) < 0) {
                printf("Could not set cpu affinity for current process.\n");
                exit(1);
        }
        if (sched_setaffinity(pid, len, &cmask) < 0) {
                printf("Could not set cpu affinity for cloned process.\n");
                exit(1);
        }
        /* increment first member of shared struct */
        for (j = 0; j < 2000; j++) {
                for (i = 0; i < 100000; i++) {
                        shared_data.data1++;
                }
        }
        return 0;
}
static int inc_second(void *arg)
{
        struct shared_data_struct *sd = arg;
        int i,j;
        /* increment second member of shared struct */
        for (j = 1; j < 2000; j++) {
                for (i = 1; i < 100000; i++) {
                        sd->data2++;
                }
        }
        return 0;
}

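This example runs both threads on the same processor, so the shared data structure stays in a single L2 cache line on one processor. This should result in zero cache misses. Profile the cache misses using the steps described earlier to verify that no L2 cache misses occur when both processes run on one processor. A third approach to the data cache miss problem is compiler optimization, which can reduce the number of cache misses. In some environments, however, this is not a suitable option, and you still have to profile the code and correct the poor performance.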
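
Before profiling, it can help to confirm that the affinity setting actually took effect. The sketch below pins the calling process to the first CPU it is currently allowed to use and then reads the mask back with sched_getaffinity(). The helper names pin_to_first_allowed_cpu and allowed_cpu_count are hypothetical, and choosing the first allowed CPU rather than CPU 0 is a choice made for this example, not something the article does.

```c
#define _GNU_SOURCE     /* needed for cpu_set_t and the CPU_* macros */
#include <assert.h>
#include <sched.h>

/* Pin the calling process to the first CPU it may currently run on.
 * Returns that CPU number on success, or -1 on failure. */
static int pin_to_first_allowed_cpu(void)
{
    cpu_set_t allowed, target;

    if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0)
        return -1;

    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++) {
        if (CPU_ISSET(cpu, &allowed)) {
            CPU_ZERO(&target);
            CPU_SET(cpu, &target);
            assert(CPU_COUNT(&target) == 1);   /* mask holds exactly one CPU */
            /* sched_setaffinity() returns 0 on success, -1 on error */
            if (sched_setaffinity(0, sizeof(target), &target) != 0)
                return -1;
            return cpu;
        }
    }
    return -1;
}

/* Number of CPUs the calling process may currently run on, or -1 on error. */
static int allowed_cpu_count(void)
{
    cpu_set_t set;

    if (sched_getaffinity(0, sizeof(set), &set) != 0)
        return -1;
    return CPU_COUNT(&set);
}
```

After a successful call to pin_to_first_allowed_cpu(), allowed_cpu_count() should report 1; passing the cloned child's pid instead of 0 to sched_setaffinity() would pin the child in the same way.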

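Conclusion
Profiling is one of the most difficult tasks in the development process. Good tools are essential for getting the best performance out of your code. OProfile is one such tool, and it now provides profiling for Linux on POWER. Many other performance and debugging tools exist for Linux on other platforms and can be ported quickly to Linux on POWER. Apart from differences in the types of processor events, running OProfile on POWER-based Linux platforms is similar to running it on other architectures. So if you have used OProfile on another platform, you should be able to get it running on Linux on POWER in a short time.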
