Common Hadoop Command-Line Operations and What They Do

The hadoop command

[root@master ~]# hadoop --help
Usage: hadoop [--config confdir] COMMAND
       where COMMAND is one of:
  fs                   run a generic filesystem user client
  version              print the version
  jar <jar>            run a jar file
  checknative [-a|-h]  check native hadoop and compression libraries availability
  distcp <srcurl> <desturl> copy file or directories recursively
  archive -archiveName NAME -p <parent path> <src>* <dest> create a hadoop archive
  classpath            prints the class path needed to get the
                       Hadoop jar and the required libraries
  daemonlog            get/set the log level for each daemon
 or
  CLASSNAME            run the class named CLASSNAME

Most commands print help when invoked w/o parameters.

Checking the version

[root@master ~]# hadoop version
Hadoop 2.2.0.2.0.6.0-101
Subversion git@github.com:hortonworks/hadoop.git -r b07b2906c36defd389c8b5bd22bebc1bead8115b
Compiled by jenkins on 2014-01-09T05:18Z
Compiled with protoc 2.5.0
From source with checksum 704f1e463ebc4fb89353011407e965
This command was run using /usr/lib/hadoop/hadoop-common-2.2.0.2.0.6.0-101.jar

Running a jar file

[root@master liguodong]# hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples-2.2.0.2.0.6.0-101.jar pi 10 100
Number of Maps  = 10
Samples per Map = 100
Wrote input for Map #0
Wrote input for Map #1
Wrote input for Map #2
...
Job Finished in 19.715 seconds
Estimated value of Pi is 3.14800000000000000000

Checking the availability of the Hadoop native and compression libraries

[root@master liguodong]# hadoop checknative -a
15/06/03 10:28:07 INFO bzip2.Bzip2Factory: Successfully loaded & initialized native-bzip2 library system-native
15/06/03 10:28:07 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
Native library checking:
hadoop: true /usr/lib/hadoop/lib/native/libhadoop.so.1.0.0
zlib:   true /lib64/libz.so.1
snappy: true /usr/lib64/libsnappy.so.1
lz4:    true revision:43
bzip2:  true /lib64/libbz2.so.1

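File archiving: Archive

Hadoop is poorly suited to storing small files: every small file takes up its own metadata, so the namenode keeps growing.
Hadoop Archives (HAR files) were introduced in version 0.18.0 to relieve the pressure that large numbers of small files put on namenode memory.
A HAR file works by building a layered file system on top of HDFS. It is created with hadoop's archive command, which in fact runs a MapReduce job to pack the small files into the HAR. For clients, using a HAR file changes nothing: all of the original files are addressed with har:// URLs. On the HDFS side, however, the number of files is reduced.
Reading a file through a HAR is no more efficient than reading it directly from HDFS, and may in fact be slightly slower, because every access to a file in a HAR requires two reads: one of the index file and one of the file data itself. And although HAR files can be used as input to a MapReduce job, there is no special mechanism that lets maps treat the files packed inside a HAR as a single HDFS file.
Create an archive: hadoop archive -archiveName xxx.har -p /src /dest
View its contents: hadoop fs -lsr har:///dest/xxx.har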
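
To put the small-files problem in rough numbers, here is a minimal sketch of a namenode memory estimate. It assumes the commonly quoted rule of thumb of about 150 bytes of namenode heap per namespace object (file, directory or block); that figure and the function name are assumptions rather than something stated in this article, and the real cost varies with version and path length.

```python
# Back-of-the-envelope estimate of namenode heap used by file metadata.
# Assumption: ~150 bytes per namespace object (rule of thumb, not exact).
BYTES_PER_OBJECT = 150


def estimate_namenode_bytes(num_files, blocks_per_file=1):
    """Each file costs one file object plus one object per block."""
    objects = num_files * (1 + blocks_per_file)
    return objects * BYTES_PER_OBJECT


if __name__ == "__main__":
    # Ten million single-block small files.
    print(estimate_namenode_bytes(10_000_000))
    # The four files that make up a small HAR in the listings of this
    # article (_SUCCESS, _index, _masterindex, part-0), one block each.
    print(estimate_namenode_bytes(4))
```

Under these assumptions, packing many small files into a handful of HAR files shrinks the namenode's object count by several orders of magnitude, at the price of the extra index read described above.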

[root@master liguodong]# hadoop archive
archive -archiveName NAME -p <parent path> <src>* <dest>
[root@master liguodong]# hadoop fs -lsr /liguodong
drwxrwxrwx   - hdfs      hdfs          0 2015-05-04 19:40 /liguodong/output
-rwxrwxrwx   3 hdfs      hdfs          0 2015-05-04 19:40 /liguodong/output/_SUCCESS
-rwxrwxrwx   3 hdfs      hdfs         23 2015-05-04 19:40 /liguodong/output/part-r-00000

[root@master liguodong]# hadoop archive -archiveName liguodong.har -p /liguodong output /liguodong/har

[root@master liguodong]# hadoop fs -lsr /liguodong
drwxr-xr-x   - root      hdfs          0 2015-06-03 11:15 /liguodong/har
drwxr-xr-x   - root      hdfs          0 2015-06-03 11:15 /liguodong/har/liguodong.har
-rw-r--r--   3 root      hdfs          0 2015-06-03 11:15 /liguodong/har/liguodong.har/_SUCCESS
-rw-r--r--   5 root      hdfs        254 2015-06-03 11:15 /liguodong/har/liguodong.har/_index
-rw-r--r--   5 root      hdfs         23 2015-06-03 11:15 /liguodong/har/liguodong.har/_masterindex
-rw-r--r--   3 root      hdfs         23 2015-06-03 11:15 /liguodong/har/liguodong.har/part-0
drwxrwxrwx   - hdfs      hdfs          0 2015-05-04 19:40 /liguodong/output
-rwxrwxrwx   3 hdfs      hdfs          0 2015-05-04 19:40 /liguodong/output/_SUCCESS
-rwxrwxrwx   3 hdfs      hdfs         23 2015-05-04 19:40 /liguodong/output/part-r-00000

Viewing the contents
[root@master liguodong]# hadoop fs -lsr har:///liguodong/har/liguodong.har
lsr: DEPRECATED: Please use 'ls -R' instead.
drwxr-xr-x   - root hdfs          0 2015-05-04 19:40 har:///liguodong/har/liguodong.har/output
-rw-r--r--   3 root hdfs          0 2015-05-04 19:40 har:///liguodong/har/liguodong.har/output/_SUCCESS
-rw-r--r--   3 root hdfs         23 2015-05-04 19:40 har:///liguodong/har/liguodong.har/output/part-r-00000

---------------------------------------------------------------
[root@master liguodong]# hadoop archive -archiveName liguodong2.har -p /liguodong/output /liguodong/har

[root@master liguodong]# hadoop fs -lsr har:///liguodong/har/liguodong2.har
-rw-r--r--   3 root hdfs          0 2015-05-04 19:40 har:///liguodong/har/liguodong2.har/_SUCCESS
-rw-r--r--   3 root hdfs         23 2015-05-04 19:40 har:///liguodong/har/liguodong2.har/part-r-00000
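
The two archives above differ only in how the parent path and the sources were given, and that decides the paths inside the archive. The sketch below models what the listings show: entries are stored relative to the -p parent path. It is an illustration of the observed behaviour, not the archive tool's implementation, and har_entries is a hypothetical helper name.

```python
import posixpath


def har_entries(parent, sources, files):
    """Model the paths that files get inside the archive.

    parent  -- the -p <parent path> argument
    sources -- the <src>* arguments, relative to parent (empty = whole parent)
    files   -- absolute HDFS paths of the existing files
    """
    roots = [posixpath.join(parent, s) for s in sources] or [parent]
    entries = []
    for path in files:
        for root in roots:
            if path == root or path.startswith(root.rstrip("/") + "/"):
                # Paths in the archive are relative to the parent path.
                entries.append(posixpath.relpath(path, parent))
                break
    return entries


FILES = ["/liguodong/output/_SUCCESS", "/liguodong/output/part-r-00000"]

if __name__ == "__main__":
    # liguodong.har: -p /liguodong output
    print(har_entries("/liguodong", ["output"], FILES))
    # liguodong2.har: -p /liguodong/output (no explicit source)
    print(har_entries("/liguodong/output", [], FILES))
```

With the parent set to /liguodong the entries keep their output/ prefix, whereas with the parent set to /liguodong/output they sit at the root of the archive, matching the two har:// listings.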

The hdfs command

[root@master /]# hdfs  --help
Usage: hdfs [--config confdir] COMMAND
       where COMMAND is one of:
  dfs                  run a filesystem command on the file systems supported in Hadoop.
  namenode -format     format the DFS filesystem
  secondarynamenode    run the DFS secondary namenode
  namenode             run the DFS namenode
  journalnode          run the DFS journalnode
  zkfc                 run the ZK Failover Controller daemon
  datanode             run a DFS datanode
  dfsadmin             run a DFS admin client
  haadmin              run a DFS HA admin client
  fsck                 run a DFS filesystem checking utility
  balancer             run a cluster balancing utility
  jmxget               get JMX exported values from NameNode or DataNode.
  oiv                  apply the offline fsimage viewer to an fsimage
  oev                  apply the offline edits viewer to an edits file
  fetchdt              fetch a delegation token from the NameNode
  getconf              get config values from configuration
  groups               get the groups which users belong to
  snapshotDiff         diff two snapshots of a directory or diff the
                       current directory contents with a snapshot
  lsSnapshottableDir   list all snapshottable dirs owned by the current user
                       Use -help to see options
  portmap              run a portmap service
  nfs3                 run an NFS version 3 gateway

Checking whether a directory is healthy

[root@master liguodong]# hdfs fsck /liguodong
Connecting to namenode via http://master:50070
FSCK started by root (auth:SIMPLE) from /172.23.253.20 for path /liguodong at Wed Jun 03 10:43:41 CST 2015
...........Status: HEALTHY
 Total size:    1559 B
 Total dirs:    7
 Total files:   11
 Total symlinks:                0
 Total blocks (validated):      7 (avg. block size 222 B)
...
The filesystem under path '/liguodong' is HEALTHY

A more detailed check

[root@master liguodong]# hdfs fsck /liguodong -files -blocks

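What fsck is for:
It checks the health of the file system.
It can show which blocks a file is stored in.
It can delete a bad block.
It can find a missing block.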
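
When fsck runs from a script, its summary usually has to be turned into values that can be checked. A minimal sketch, assuming the summary layout shown in the transcript above; parse_fsck_summary is a hypothetical helper, and the output format may differ between versions.

```python
import re


def parse_fsck_summary(text):
    """Extract the status and the 'Total ...' counters from fsck output."""
    summary = {}
    status = re.search(r"Status:\s*(\w+)", text)
    if status:
        summary["status"] = status.group(1)
    # Lines look like " Total blocks (validated):      7 (avg. ...)".
    pattern = r"^\s*Total (\w+)[^:]*:\s*(\d+)"
    for name, value in re.findall(pattern, text, re.MULTILINE):
        summary[name] = int(value)
    return summary


# Sample copied from the fsck transcript in this article.
SAMPLE = """...........Status: HEALTHY
 Total size:    1559 B
 Total dirs:    7
 Total files:   11
 Total symlinks:                0
 Total blocks (validated):      7 (avg. block size 222 B)
"""

if __name__ == "__main__":
    summary = parse_fsck_summary(SAMPLE)
    print(summary)
    if summary.get("status") != "HEALTHY":
        raise SystemExit("file system is not healthy")
```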

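balancer: the disk balancer

Command: hdfs balancer. The balancer can also be started from a script.
An HDFS cluster very easily ends up with disk utilization that is uneven from machine to machine, for example after new data nodes are added to the cluster. An unbalanced HDFS causes many problems: MR programs cannot take good advantage of data-local computation, network bandwidth between machines is used less effectively, some machines' disks go unused, and so on. Keeping the data in HDFS balanced is therefore very important.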

[root@master liguodong]# hdfs balancer
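
What "unbalanced" means can be sketched in a few lines: compare each node's utilization with the cluster average and flag the nodes that deviate by more than a threshold. The 10-percentage-point default used here and the classify_nodes name are assumptions for illustration; the sketch only classifies nodes and does not model how the balancer moves blocks.

```python
def classify_nodes(nodes, threshold=10.0):
    """Classify datanodes by utilization relative to the cluster average.

    nodes     -- mapping of node name -> (used_bytes, capacity_bytes)
    threshold -- allowed deviation from the average, in percentage points
    """
    total_used = sum(used for used, _ in nodes.values())
    total_capacity = sum(capacity for _, capacity in nodes.values())
    average = 100.0 * total_used / total_capacity

    result = {}
    for name, (used, capacity) in nodes.items():
        utilization = 100.0 * used / capacity
        if utilization > average + threshold:
            result[name] = "over-utilized"
        elif utilization < average - threshold:
            result[name] = "under-utilized"
        else:
            result[name] = "balanced"
    return result


if __name__ == "__main__":
    # Hypothetical cluster: dn3 has just been added and is still empty.
    cluster = {"dn1": (80, 100), "dn2": (70, 100), "dn3": (0, 100)}
    print(classify_nodes(cluster))
```

In the hypothetical cluster the average utilization is 50%, so the two older nodes are flagged as over-utilized and the newly added node as under-utilized, which is the situation described above.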

hdfs dfsadmin

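dfsadmin can put HDFS into safe mode; if something abnormal happens, the file system can be switched to this read-only mode.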

[root@master liguodong]# hdfs dfsadmin
Usage: java DFSAdmin
Note: Administrative commands can only be run as the HDFS superuser.
           [-report]
           [-safemode enter | leave | get | wait]
           [-allowSnapshot <snapshotDir>]
           [-disallowSnapshot <snapshotDir>]
           [-saveNamespace]
           [-rollEdits]
           [-restoreFailedStorage true|false|check]
           [-refreshNodes]
           [-finalizeUpgrade]
           [-metasave filename]
           [-refreshServiceAcl]
           [-refreshUserToGroupsMappings]
           [-refreshSuperUserGroupsConfiguration]
           [-printTopology]
           [-refreshNamenodes datanodehost:port]
           [-deleteBlockPool datanode-host:port blockpoolId [force]]
           [-setQuota <quota> <dirname>...<dirname>]
           [-clrQuota <dirname>...<dirname>]
           [-setSpaceQuota <quota> <dirname>...<dirname>]
           [-clrSpaceQuota <dirname>...<dirname>]
           [-setBalancerBandwidth <bandwidth in bytes per second>]
           [-fetchImage <local directory>]
           [-help [cmd]]

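Viewers for the edits and fsimage files

edits and fsimage are two critically important files. edits records the changes made to the namespace since the latest checkpoint and so acts as a log, while fsimage holds the latest checkpoint itself. The contents of these two files cannot be read directly with an ordinary text editor. Fortunately, hadoop ships dedicated tools for viewing them, oev and oiv, both of which are invoked through hdfs.

oiv (short for offline image viewer) dumps the contents of an fsimage file to a specified file so that it can be read easily. The tool also provides a read-only WebHDFS API for offline analysis and inspection of a hadoop cluster's namespace. oiv is quite fast even on very large fsimage files; if it cannot process an fsimage, it simply exits. The tool is not backward compatible: for example, the oiv from hadoop-2.4 cannot process an fsimage from hadoop-2.3, which only the hadoop-2.3 oiv can read. As its name (offline) suggests, oiv does not require the hadoop cluster to be running. Its exact syntax can be seen by typing hdfs oiv on the command line.

oiv supports three output processors, Ls, XML and FileDistribution, selected with the -p option.
Ls is the default processor. Its output closely resembles that of the lsr command: it prints the same fields in the same order, namely the directory-or-file flag, permissions, replication count, owner, group, file size, modification date and full path. Unlike lsr, the output includes the root path /. Another important difference is that the output is not sorted by directory name and contents but follows the order of entries in the fsimage. Unless the namespace is small, it is therefore unlikely that the output of this processor can be compared directly with that of lsr. Ls computes file sizes from the information in the INode blocks and ignores the -skipBlocks option. For example: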

[root@master current]# pwd
/hadoop/hdfs/namenode/current
[root@master current]# hdfs oiv  -i fsimage_0000000000000053234 -o fsimage.ls
[root@master current]# cat fsimage.ls
-rwxrwxrwx  3    oozie       hdfs     890168 2015-04-28 17:41 /user/oozie/share/lib/pig/jaxb-impl-2.2.3-1.jar
-rwxrwxrwx  3    oozie       hdfs     201124 2015-04-28 17:41 /user/oozie/share/lib/pig/jdo-api-3.0.1.jar
-rwxrwxrwx  3    oozie       hdfs     130458 2015-04-28 17:41 /user/oozie/share/lib/pig/jersey-client-1.9.jar

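The XML processor outputs an XML document of the fsimage that contains all of the information in it, such as the inodeid. Its output lends itself to automated processing and analysis with XML tools. Because XML syntax is verbose, this processor also produces the largest output.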

[root@master current]# hdfs oiv -i fsimage_0000000000000053234 -p XML -o fsimage.xml
[root@master current]# more fsimage.xml

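FileDistribution is a tool for analyzing file sizes in the namespace. To run it, you define an integer range [0, maxSize] by specifying a maximum file size and a step; the range is divided into segments [0, s[1], ..., s[n-1], maxSize], and the processor counts how many files fall into each segment [s[i-1], s[i]). Files larger than maxSize always fall into the last segment, (s[n-1], maxSize). The output file is a tab-separated table with a Size column and a NumFiles column, where Size is the start of a segment and NumFiles is the number of files whose size falls into that segment. The processor takes the parameters maxSize and step; the original text states that both default to 0 when not given, although the sample output below is bucketed in steps of 2097152 bytes (2 MB).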

[root@master current]# hdfs oiv -i fsimage_0000000000000053234 -o fsimage.fd -p FileDistribution 1000 step 5
Files processed: 1  Current: /app-logs/ambari-qa/logs/application_1430219478244_0003/slave2_45454
totalFiles = 534
totalDirectories = 199
totalBlocks = 537
totalSpace = 1151394477
maxFileSize = 119107289

[root@master current]# more fsimage.fd
Size    NumFiles
0       22
2097152 491
4194304 13
6291456 2
8388608 1
10485760        3
12582912        0
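
The bucketing behind such a table can be sketched as follows. This version assumes that a file goes into bucket ceil(size / step) and that anything larger than maxSize goes into the last bucket; that rule is an assumption chosen because it appears consistent with the sample output (only empty files would land in the row labelled 0), and it is not taken from the article or from Hadoop's source. file_distribution is a hypothetical name.

```python
import math


def file_distribution(sizes, max_size, step):
    """Count files per size bucket; returns (size_label, num_files) rows."""
    buckets = [0] * (1 + math.ceil(max_size / step))
    for size in sizes:
        if size > max_size:
            index = len(buckets) - 1        # oversize files: last bucket
        else:
            index = math.ceil(size / step)  # assumed bucketing rule
        buckets[index] += 1
    return [(i * step, count) for i, count in enumerate(buckets)]


if __name__ == "__main__":
    # Hypothetical file sizes in bytes, with maxSize=10 and step=5.
    rows = file_distribution([0, 1, 5, 6, 10, 99], max_size=10, step=5)
    print("Size\tNumFiles")
    for size, count in rows:
        print(f"{size}\t{count}")
```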

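oev (short for offline edits viewer) operates only on files and therefore does not need the hadoop cluster to be running. It provides several output processors that convert the input file into an output file of the corresponding format, selected with the -p option.
The output formats currently supported are binary (the binary format that hadoop itself uses), xml (the default when -p is not given) and stats (statistics about the edits file).
The supported input formats are binary and xml, where an xml input is a file previously produced by the tool's xml processor.
Because no input format corresponds to stats, output in stats format cannot be converted back to the original format. By contrast, if the input was binary and the output xml, you can convert between binary and xml in both directions by using the earlier output file as the input and the earlier input file as the output; this is not possible with stats.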

[root@master current]# hdfs oev -i edits_0000000000000042778-0000000000000042779 -o edits.xml
[root@master current]# cat edits.xml
<?xml version="1.0" encoding="UTF-8"?>
<EDITS>
  <EDITS_VERSION>-47</EDITS_VERSION>
  <RECORD>
    <OPCODE>OP_START_LOG_SEGMENT</OPCODE>
    <DATA>
      <TXID>42778</TXID>
    </DATA>
  </RECORD>
  <RECORD>
    <OPCODE>OP_END_LOG_SEGMENT</OPCODE>
    <DATA>
      <TXID>42779</TXID>
    </DATA>
  </RECORD>
</EDITS>

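In the output file, each RECORD describes one operation. When a damaged edits file causes problems in the hadoop cluster, it is possible to save the correct part of the file: convert the original binary file to xml, edit the xml file by hand, and then convert it back to binary. The most common form of damage is a missing closing record (the one whose OPCODE is -1), which looks like the record shown below. If the xml file has no closing record, you can add one after the last correct record; any records after the closing record are ignored.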

<RECORD>
    <OPCODE>-1</OPCODE>
    <DATA>
    </DATA>
</RECORD>
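
Before editing an edits XML file by hand, it helps to list its records and check whether the closing record is there. A minimal sketch using Python's standard XML parser; the element names come from the edits.xml shown above, while list_records and has_closing_record are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

CLOSING_OPCODE = "-1"  # the closing record described in the text


def list_records(xml_text):
    """Return (opcode, txid) for every RECORD; txid is None when absent."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.findall("RECORD"):
        opcode = (record.findtext("OPCODE") or "").strip()
        txid = record.findtext("DATA/TXID")
        records.append((opcode, int(txid) if txid is not None else None))
    return records


def has_closing_record(xml_text):
    """True if any record carries the closing opcode."""
    return any(opcode == CLOSING_OPCODE for opcode, _ in list_records(xml_text))


# The edits.xml from this article, with the XML declaration left out.
SAMPLE = """<EDITS>
  <EDITS_VERSION>-47</EDITS_VERSION>
  <RECORD>
    <OPCODE>OP_START_LOG_SEGMENT</OPCODE>
    <DATA>
      <TXID>42778</TXID>
    </DATA>
  </RECORD>
  <RECORD>
    <OPCODE>OP_END_LOG_SEGMENT</OPCODE>
    <DATA>
      <TXID>42779</TXID>
    </DATA>
  </RECORD>
</EDITS>"""

if __name__ == "__main__":
    for opcode, txid in list_records(SAMPLE):
        print(opcode, txid)
    print("closing record present:", has_closing_record(SAMPLE))
```

The sketch only reads the file. Adding the closing record and converting back to binary remain manual steps as described above.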

The yarn command

[root@master liguodong]# yarn --help
Usage: yarn [--config confdir] COMMAND
where COMMAND is one of:
  resourcemanager      run the ResourceManager
  nodemanager          run a nodemanager on each slave
  rmadmin              admin tools
  version              print the version
  jar <jar>            run a jar file
  application          prints application(s) report/kill application
  node                 prints node report(s)
  logs                 dump container logs
  classpath            prints the class path needed to get the
                       Hadoop jar and the required libraries
  daemonlog            get/set the log level for each daemon
 or
  CLASSNAME            run the class named CLASSNAME
Most commands print help when invoked w/o parameters.
