
Advanced Linux Text Processing: gawk Associative Arrays

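All awk arrays are associative arrays: each array holds a set of index/value pairs. The indexes need not be a run of consecutive numbers; an index can be a string or a number, and the array length never has to be declared.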

Syntax:

arrayname[string]=value

arrayname is the name of the array
string is the array index
value is the value assigned to the array element

Accessing awk array elements

To access a specific element of an array, use arrayname[index], which returns the value stored at that index.

Example 1:

[root@localhost ~]# awk '
>BEGIN{ item[101]="HD Camcorder";
>item["102"]="Refrigerator";
>item[103]="MP3 Player";
>item["na"]="Young"
>print item[101];  
>print item["102"];    # quoted or not, awk treats the index as a string
>print item[103];
>print item["na"];}'   # a string index must be wrapped in double quotes
HD Camcorder
Refrigerator
MP3 Player
Young

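Notes:

Array indexes have no inherent order, and they need not start at 0 or 1.
An index can be a string: the last element above uses the string index "na".
An awk array needs no initialization or declaration before use, and no length has to be specified.
Array names follow the same naming rules as awk variable names.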

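From the point of view of awk, an array index is always a string: even if you use a number as the index, awk converts it to a string. The following two statements are equivalent:

item[101]="HD Camcorder"
item["101"]="HD Camcorder"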
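
The equivalence can be checked directly. The following is a minimal sketch, not part of the original examples: it assigns through both spellings of the index and then counts the elements, which should come out as one.

```shell
out=$(awk 'BEGIN {
    item[101] = "HD Camcorder"      # numeric subscript, stored under the string "101"
    item["101"] = "Refrigerator"    # same element, so this overwrites it
    n = 0
    for (k in item) n++             # count the elements
    print n, item[101]
}')
printf '%s\n' "$out"
```

If the two indexes were distinct, the count would be 2 and item[101] would still hold the first value.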

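1. Referencing array elements

If you reference an array element that does not exist, awk automatically creates it under the index you used and gives it a null (empty) value. To avoid this, it is best to check whether the element exists before using it.

An if statement with the in operator tests for existence; the condition is true when the element is present in the array.

if ( index in array-name )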

Example 2: a simple example of referencing array elements

[root@localhost ~]# cat arr.awk 
BEGIN {
    x = item[55];  # never assigned, so this reference makes awk create the element with a null value
    if ( 55 in item )
        print "Array index 55 contains",item[55];
    item[101]="HD Camcorder";
    if ( 101 in item )
        print "Array index 101 contains",item[101];
    if ( 1010 in item )  # not present, so the test is false and nothing is printed
        print "Array index 1010 contains",item[1010];
}
[root@localhost ~]# awk -f arr.awk 
Array index 55 contains 
Array index 101 contains HD Camcorder
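
The side effect of a bare reference can also be seen by counting elements. This is a small sketch under the assumption that a for-in loop is used as the counter; the array name seen is made up for illustration.

```shell
out=$(awk 'BEGIN {
    if ("x" in seen) print "unexpected"   # the in test never creates an element
    n = 0; for (k in seen) n++
    print "after in test:", n
    tmp = seen["x"]                       # a bare reference creates seen["x"]
    n = 0; for (k in seen) n++
    print "after reference:", n
}')
printf '%s\n' "$out"
```

The in test leaves the array empty, while merely reading seen["x"] adds an element with an empty value.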

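2. Looping over an awk array

To visit every element of an array, use a special form of for that iterates over all of the indexes of the array:

Syntax:

for ( var in arrayname )
actions

Where:

var is a variable name
in is a keyword
arrayname is the array name
actions is one or more awk statements to execute; multiple statements must be enclosed in { }. On each pass the loop assigns one index to var, so the body is applied to every element of the array.

Example 1: print every element of the array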

[root@localhost ~]# cat arr-for.awk 
BEGIN {
    item[101]="HD Camcorder";
    item[102]="Refrigerator";
    item[103]="MP3 Player";
    item[104]="Tennis Racket";
    item[105]="Laser Printer";
    item[1001]="Tennis Ball";
    item[55]="Laptop";
    item["no"]="Not Available";

    for(x in item)  # x receives each index in turn; no loop condition is needed, awk walks the indexes itself
        print item[x];
}
[root@localhost ~]# awk -f arr-for.awk 
Not Available
Laptop
HD Camcorder
Refrigerator
MP3 Player
Tennis Racket
Laser Printer
Tennis Ball
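
A common use of this loop is tallying. The sketch below is an illustration only (the input words and the array name count are invented): it counts how often each word occurs and then walks the array with for-in. Because the traversal order is unspecified, the result is piped through sort to make it stable.

```shell
out=$(printf 'apple\nbanana\napple\ncherry\nbanana\napple\n' |
    awk '{ count[$1]++ }                     # one element per distinct word
         END { for (w in count) print w, count[w] }' |
    sort)
printf '%s\n' "$out"
```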

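3. Deleting array elements

To delete a specific array element, use the delete statement. Once an element is deleted, its value can no longer be retrieved.

Syntax:

delete arrayname[index];

To delete every element of an array:

for (var in array)
delete array[var]

In gawk, a single delete statement removes all elements of an array:

delete array

Example 1: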

[root@localhost ~]# awk '
>BEGIN{item[101]="HD Camcorder";
>item[102]="Refrigerator";
>item[103]="MP3 Player";
>delete item[101];
>print item[101];print item[102];
>for(x in item) delete item[x]; # delete every element with a for loop
>print item[102];print item[103];}'

Refrigerator


[root@localhost ~]#

Example 2:

[root@localhost ~]# awk '
>BEGIN{item[1]="a"; 
>item[2]="b";item[3]="c";
>delete item;   # delete followed by the bare array name removes every element
>for(x in item) print item[x];}'
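
For awk implementations where delete with a bare array name might not be available, a widely used alternative is to split an empty string into the array, since split clears its target array before filling it. A minimal sketch of that idiom:

```shell
out=$(awk 'BEGIN {
    item[1] = "a"; item[2] = "b"; item[3] = "c"
    split("", item)                 # splitting an empty string leaves item empty
    n = 0
    for (k in item) n++
    print "elements left:", n
}')
printf '%s\n' "$out"
```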

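4. Multidimensional arrays

Although standard awk supports only one-dimensional arrays, a one-dimensional array can be used to simulate a multidimensional one.

Example 1: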

[root@localhost ~]# cat array-multi.awk
BEGIN {
item["1,1"]=10;
item["1,2"]=20;
item["2,1"]=30;
item["2,2"]=40
for (x in item)
print item[x]
}
[root@localhost ~]# awk -f array-multi.awk
30
20
40
10

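Note: even though the index is written "1,1", it is not two indexes; it is a single string index whose value is "1,1". So item["1,1"]=10 simply assigns 10 to the element of the one-dimensional array whose index is "1,1".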

Example 2: remove the double quotes

[root@localhost ~]# cat array-multi2.awk
BEGIN {
item[1,1]=10;
item[1,2]=20;
item[2,1]=30;
item[2,2]=40
for (x in item)
print item[x]
}
[root@localhost ~]# awk -f array-multi2.awk
10
30
20
40

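Note: this example still runs and prints the same values, although in a different order (for-in order is unspecified). When the subscripts of a multidimensional index are not quoted, awk joins them with "\034" as the subscript separator.

When you write item[1,2], it is converted to item["1\0342"]: awk joins the two subscripts with "\034" and turns the result into a single string.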
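
The conversion can be verified with the in operator. The sketch below is an illustration rather than one of the original examples; it builds the joined index by hand with the built-in SUBSEP variable (whose default value is "\034") and compares it with the comma form and with a quoted string.

```shell
out=$(awk 'BEGIN {
    item[1,2] = 20
    key = 1 SUBSEP 2                          # build the same index by hand
    if (key in item)       print "same element"
    if ((1,2) in item)     print "tuple found"
    if (!("1,2" in item))  print "quoted string missing"   # "1,2" is a different index
}')
printf '%s\n' "$out"
```

All three lines are printed: the comma form and the hand-built key name the same element, while the quoted string "1,2" does not.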

Example 3:

[root@localhost ~]# cat 034.awk 
BEGIN {
    item["1,1"]=10;
    item["1,2"]=20;
    item[2,1]=30;
    item[2,2]=40;
    for(x in item)
        print "Index",x,"contains",item[x];
}
[root@localhost ~]# awk -f 034.awk 
Index 1,2 contains 20
Index 21 contains 30
Index 22 contains 40
Index 1,1 contains 10

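Notes:

The indexes "1,1" and "1,2" are quoted, so they are treated as ordinary one-dimensional indexes; awk does not apply the subscript separator, and they are printed unchanged.

The indexes 2,1 and 2,2 are not quoted, so they are treated as multidimensional indexes and awk joins them with the subscript separator. The indexes become "2\0341" and "2\0342", and the non-printing character "\034" is output between the two subscripts, which is why they appear as 21 and 22.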

5. SUBSEP, the subscript separator

The SUBSEP variable lets you change the default subscript separator to any character or string.

Example 1:

[root@localhost ~]# cat subsep.awk 
BEGIN {
    SUBSEP=":";
    item["1,1"]=10;
    item["1,2"]=20;
    item[2,1]=30;
    item[2,2]=40;
    for(x in item)
        print "Index",x,"contains",item[x];
}
[root@localhost ~]# awk -f subsep.awk 
Index 1,2 contains 20
Index 2:1 contains 30
Index 2:2 contains 40
Index 1,1 contains 10

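Note: the indexes "1,1" and "1,2" are quoted, so SUBSEP is not applied to them.

Tip: with multidimensional arrays it is best not to quote the index, and to set the index separator through the SUBSEP variable instead.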
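
One benefit of leaving the subscripts unquoted is that they can be recovered later by splitting the index on SUBSEP, whatever its value. The following sketch assumes a two-dimensional layout; the names key and idx are chosen for illustration, and the output is piped through sort because for-in order is unspecified.

```shell
out=$(awk 'BEGIN {
    item[1,1] = 10; item[1,2] = 20
    item[2,1] = 30; item[2,2] = 40
    for (key in item) {
        split(key, idx, SUBSEP)      # idx[1] is the row, idx[2] is the column
        print "row", idx[1], "col", idx[2], "=", item[key]
    }
}' | sort)
printf '%s\n' "$out"
```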

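6. Sorting an array with asort

The asort function (a gawk extension) sorts the element values and resets the indexes to 1 through n, where n is the number of elements in the array.

Example 1: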

[root@localhost ~]# cat asort.awk 
BEGIN {
    item[101]="HD Camcorder";
    item[102]="Refrigerator";item[103]="MP3 Player";
    item[104]="Tennis Racket";
    item[105]="Laser Printer";
    item[1001]="Tennis Ball";
    item[55]="Laptop";
    item["na"]="Not Available";
    print "---------- Before asort -------------"
    for(x in item)
        print "Index",x,"contains",item[x]
    total = asort(item);
    print "---------- After asort -------------"
    for(x in item)
        print "Index",x,"contains",item[x]
    print "Return value from asort:",total;
}
[root@localhost ~]# awk -f asort.awk 
---------- Before asort -------------
Index 55 contains Laptop
Index 101 contains HD Camcorder
Index 102 contains Refrigerator
Index 103 contains MP3 Player
Index 104 contains Tennis Racket
Index 105 contains Laser Printer
Index na contains Not Available
Index 1001 contains Tennis Ball
---------- After asort -------------  # asort numbers the indexes from 1, not 0
Index 4 contains MP3 Player
Index 5 contains Not Available
Index 6 contains Refrigerator
Index 7 contains Tennis Ball
Index 8 contains Tennis Racket
Index 1 contains HD Camcorder
Index 2 contains Laptop
Index 3 contains Laser Printer
Return value from asort: 8

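Note: once asort is called, the original indexes of the array are gone. Also, for (x in item) does not visit the new indexes in 1 to 8 order; the traversal order is unspecified.

Example 2: print in index order

[root@localhost ~]# cat asort1.awk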
BEGIN {
item[101]="HD Camcorder";
item[102]="Refrigerator";item[103]="MP3 Player";
item[104]="Tennis Racket";
item[105]="Laser Printer";
item[1001]="Tennis Ball";
item[55]="Laptop";
item["na"]="Not Available";
total = asort(item);
for(i=1;i<=total;i++)  # a counting for loop controls the order in which the indexes are printed
print "Index",i,"contains",item[i]
}
[root@localhost ~]# awk -f asort1.awk
Index 1 contains HD Camcorder
Index 2 contains Laptop
Index 3 contains Laser Printer
Index 4 contains MP3 Player
Index 5 contains Not Available
Index 6 contains Refrigerator
Index 7 contains Tennis Ball
Index 8 contains Tennis Racket
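
If the original indexes still matter, asort can be given a second array to receive the sorted values, which leaves the source array untouched. This is a short sketch assuming gawk is installed under the name gawk; the destination name sorted is made up for illustration.

```shell
out=$(gawk 'BEGIN {
    item[101] = "HD Camcorder"
    item[55] = "Laptop"
    item["na"] = "Not Available"
    n = asort(item, sorted)          # sorted values go to sorted[1..n]; item is untouched
    for (i = 1; i <= n; i++)
        print i, sorted[i]
    print "na ->", item["na"]        # the original index still works
}')
printf '%s\n' "$out"
```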

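7. Sorting indexes with asorti

Just as element values can be sorted, the index values can be extracted, sorted, and stored in a new array.

Notes:

The asorti function sorts the index values (not the element values) and stores the sorted index values as the element values.
Calling asorti(state) discards the original element values, because the indexes become the element values. To be safe, asorti is usually given two arguments, as in asorti(state,stateabbr), so that the original array state is not overwritten.

Example 1: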

[root@localhost ~]# cat asorti.awk
BEGIN {
state["TX"]="Texas";
state["PA"]="Pennsylvania";
state["NV"]="Nevada";
state["CA"]="California";
state["AL"]="Alabama";
print "-------------- Function: asort -----------------"
total = asort(state,statedesc);
for(i=1;i<=total;i++)
print "Index",i,"contains",statedesc[i];
print "-------------- Function: asorti -----------------"
total = asorti(state,stateabbr);
for(i=1;i<=total;i++)   # a counting loop is likewise needed to print the indexes in order
print "Index",i,"contains",stateabbr[i];
}
[root@localhost ~]# awk -f asorti.awk
-------------- Function: asort -----------------
Index 1 contains Alabama
Index 2 contains California
Index 3 contains Nevada
Index 4 contains Pennsylvania
Index 5 contains Texas
-------------- Function: asorti -----------------
Index 1 contains AL
Index 2 contains CA
Index 3 contains NV
Index 4 contains PA
Index 5 contains TX
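
Because the two-argument form keeps the original array intact, the sorted indexes can be used to walk the original array in index order. A minimal sketch, assuming gawk is available as gawk and using the hypothetical array name abbr:

```shell
out=$(gawk 'BEGIN {
    state["TX"] = "Texas"; state["CA"] = "California"; state["AL"] = "Alabama"
    n = asorti(state, abbr)          # abbr[1..n] holds the sorted indexes of state
    for (i = 1; i <= n; i++)
        print abbr[i], state[abbr[i]]
}')
printf '%s\n' "$out"
```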

Supplementary example: removing duplicate lines with an array

[root@localhost ~]# cat alpha
a
a
a
b
c
b
d
d
e
e
f
f
f
f
g
[root@localhost ~]# awk '!a[$0]++' alpha
a
b
c
d
e
f
g

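Explanation:

Why does this command remove the duplicate lines? When the first line, a, is read, the array a has no element with that index yet, so a["a"] evaluates to 0 (empty). The post-increment a[$0]++ yields this old value 0 and then raises the element to 1; negating 0 gives 1, which is true, so the line is printed. When b is read for the first time the same thing happens, and a["b"] becomes 1. When a is read a second time, a["a"] is already 1, so a[$0]++ yields 1 (and raises the element to 2); negating 1 gives 0, which is false, so the line is not printed.

Note two points. First, the postfix ++ binds more tightly than !, so the pattern is parsed as !(a[$0]++). Second, ++ here is a post-increment: the expression yields the value from before the increment, and that old value is what ! tests.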
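
The one-liner can be spelled out as an explicit test. The sketch below is an equivalent long form written for illustration (the array name seen and the sample input are invented); it prints a line only the first time it appears and produces the same result as the short form.

```shell
input=$(printf 'a\na\nb\nc\nb\n')
out=$(printf '%s\n' "$input" | awk '{
    if (!($0 in seen))      # first time this line appears
        print
    seen[$0] = 1            # remember the line for later records
}')
short=$(printf '%s\n' "$input" | awk '!a[$0]++')
printf '%s\n' "$out"
```

The long form uses the in operator, so it never relies on the numeric value of a missing element, at the cost of a few more lines.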
