Parsing HTML Documents in Python with the HTMLParser Module

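Python 2's standard library has three modules that can parse HTML text: HTMLParser, sgmllib, and htmllib. Their implementations differ, but they offer roughly the same functionality. The HTML-parsing classes in all three are base classes that do no concrete work themselves. When the parser encounters an element (a tag, a comment, a declaration, and so on), it calls the corresponding handler method. You must override these methods, because the base-class versions do nothing.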
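The callback mechanism can be seen with a very small subclass. The sketch below is illustrative only: the class name `EventLogger`, its `events` list, and the one-line HTML fragment are assumptions made for the example. The import fallback is there because the module is named `HTMLParser` on Python 2 and `html.parser` on Python 3.

```python
try:
    from html.parser import HTMLParser  # Python 3 location
except ImportError:
    from HTMLParser import HTMLParser   # Python 2 location


class EventLogger(HTMLParser):
    """Hypothetical subclass that records every callback the base class fires."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Skip whitespace-only text between tags.
        if data.strip():
            self.events.append(("data", data.strip()))

    def handle_comment(self, data):
        self.events.append(("comment", data.strip()))


logger = EventLogger()
logger.feed("<td class='a9'><!-- risk --><font color=#FF00FF>high</font></td>")
logger.close()
for event in logger.events:
    print(event)
```

Any handler that is not overridden simply does nothing, which is why a subclass only needs to define the callbacks it cares about.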

Use Python's built-in HTMLParser module to parse the HTML file below.

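Requirements: 1. Extract the name, CVE number, and risk level of every vulnerability.

2. Show each vulnerability as its own record; do not run them together.

3. Report only the high-risk vulnerabilities (rows whose risk level is 高風險).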
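Taken together, the three requirements amount to building one record per vulnerability and filtering on its last field. A minimal sketch of that target shape, using hand-written records whose values are copied from the sample report; the names `records`, `HIGH_RISK`, and `high_risk_only` are illustrative and not part of the program in this article:

```python
# -*- coding: utf-8 -*-
# Hand-written records in [name, CVE, risk] order (assumption: a row with an
# empty CVE cell simply has two fields instead of three).
records = [
    ["FTP緩沖區溢出", "CVE-1999-0789", "高風險"],
    ["AFS客戶版本", "信息"],
    ["ACC 路由器無需認證顯示配置信息", "CVE-1999-0383", "中風險"],
    ["Knox Arkeia 緩沖區溢出", "CAN-1999-1534", "高風險"],
]

HIGH_RISK = "高風險"  # the label used in the report's risk-level column


def high_risk_only(rows):
    """Return only the rows whose last field is the high-risk label."""
    return [row for row in rows if row and row[-1] == HIGH_RISK]


# One vulnerability per line, fields separated by " | ".
for row in high_risk_only(records):
    print(" | ".join(row))
```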

<html>
<head>
<title>search</title>
<meta http-equiv="Content-Type" content="text/html; charset=gb2312">
<LINK href="include/bbs.css" rel=stylesheet>
</head>
<body bgcolor="#ffffff" text="#000000" leftmargin="0" topmargin="0"><br>

<table width="100%" border="0" height="29" align="center" cellspacing="1" cellpadding="1" bordercolordark="#FFFFFF" bordercolorlight="#000000" class="a2">

<tr class="a1" height="22">
<td width="9%" class="a8">ID</td>
<td class="a8">檢測名稱</td>
<td width="14%" class="a8">CVE號</td>
<td width="20%" class="a8">檢測類別</td>
<td width="15%" class="a8">風險級別</td>
</tr>

<tr class="a1" height="22">
<td class="a9">1</td>
<td class="a9">
<a href="javascript:openwindow(0);">
FTP緩沖區溢出</a>
</td>
<td class="a9">
<a href='http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-1999-0789' target='_blank'> CVE-1999-0789</a>

</td>
<td class="a9">
FTP測試
</td>
<td class="a9">
<font color=#FF00FF>高風險</font>
</td>
</tr>

<tr class="a1" height="22">
<td class="a9">2</td>
<td class="a9">
<a href="javascript:openwindow(2);">
AFS客戶版本</a>
</td>
<td class="a9">
</td>
<td class="a9">
信息獲取測試
</td>
<td class="a9">
<font color=#00CC00>信息</font>
</td>
</tr>

<tr class="a1" height="22">
<td class="a9">1</td>
<td class="a9">
<a href="javascript:openwindow(1);">
ACC 路由器無需認證顯示配置信息</a>
</td>
<td class="a9">
<a href='http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-1999-0383' target='_blank'> CVE-1999-0383</a>

</td>
<td class="a9">
網絡設備測試
</td>
<td class="a9">
<font color=#FFCC00>中風險</font>
</td>
</tr>

<tr class="a1" height="22">
<td class="a9">3</td>
<td class="a9">
<a href="javascript:openwindow(17);">
Knox Arkeia 緩沖區溢出</a>
</td>
<td class="a9">
<a href='http://cve.mitre.org/cgi-bin/cvename.cgi?name=CAN-1999-1534' target='_blank'> CAN-1999-1534</a>

</td>
<td class="a9">
雜項測試
</td>
<td class="a9">
<font color=#FF00FF>高風險</font>
</td>
</tr>

</table>

</div>
</body>
</html>

Python program

html_get.py

# -*- coding: utf-8 -*-
import sys
import HTMLParser  # Python 2 module name


class CustomParser(HTMLParser.HTMLParser):
    '''
    HTMLParser subclass that overrides only the handlers we need.
    '''
    cve_list = []    # finished records: [[name, CVE, risk], ...]
    sigle_cve = []   # record currently being built
    selected = ('table', 'div', 'tr', 'td', 'a', 'font')  # tags to track
    # Tag paths whose text we want. The original listed only the
    # 'table/div/...' paths; the sample above has no opening <div>,
    # so the div-less paths are listed as well.
    selected_a = ['table/div/tr/td/a', 'table/tr/td/a']
    selected_font = ['table/div/tr/td/font', 'table/tr/td/font']

    def reset(self):
        HTMLParser.HTMLParser.reset(self)
        self._level_stack = []  # currently open tracked tags

    def handle_starttag(self, tag, attrs):
        if tag in CustomParser.selected:
            self._level_stack.append(tag)

    def handle_endtag(self, tag):
        if self._level_stack and tag in CustomParser.selected and tag == self._level_stack[-1]:
            self._level_stack.pop()

    def handle_data(self, data):
        # Collect the wanted text into a list that holds one small list per
        # vulnerability, e.g. [[name, CVE, risk], [name, CVE, risk]].
        # Every row in the HTML is collected here; filtering happens later.
        path = "/".join(self._level_stack)
        if path in CustomParser.selected_a:
            # Text of an <a>: the vulnerability name or its CVE number.
            text = data.decode('gbk').encode('utf-8').strip()
            print self._level_stack, text
            CustomParser.sigle_cve.append(text)
        elif path in CustomParser.selected_font and CustomParser.sigle_cve:
            # Text of a <font>: the risk level, which closes the record.
            text = data.decode('gbk').encode('utf-8').strip()
            print self._level_stack, text
            CustomParser.sigle_cve.append(text)
            CustomParser.cve_list.append(CustomParser.sigle_cve)
            CustomParser.sigle_cve = []


if __name__ == '__main__':
    # Read the report, then print only the high-risk records.
    try:
        fd = open('test.html', 'r')
    except IOError, error:
        print error
        sys.exit(1)  # nothing to parse if the file cannot be opened
    try:
        html_string = fd.read()
    finally:
        fd.close()
    ht = CustomParser()
    ht.feed(html_string)
    ht.close()
    get_list = ht.cve_list
    for item in get_list:
        if item[-1] == '高風險':
            # One vulnerability per line, fields separated by commas.
            print ', '.join(item)
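On Python 3 the module lives at `html.parser`, and `feed()` expects text rather than bytes, so the report has to be decoded first. The following is a minimal sketch of the same tag-stack idea under those assumptions, not a drop-in replacement for the program above. The class name `ReportParser`, the helper `parse_report`, keeping state on the instance, closing each record at `</tr>`, and the two-row `sample` are all illustrative choices.

```python
from html.parser import HTMLParser


class ReportParser(HTMLParser):
    """Collect the text inside <a> and <font> for each table row."""

    TRACKED = ("table", "tr", "td", "a", "font")
    WANTED = ("table/tr/td/a", "table/tr/td/font")

    def reset(self):
        super().reset()
        self._stack = []  # currently open tracked tags
        self._row = []    # fields of the row being read
        self.rows = []    # finished rows

    def handle_starttag(self, tag, attrs):
        if tag in self.TRACKED:
            self._stack.append(tag)

    def handle_endtag(self, tag):
        if self._stack and tag == self._stack[-1]:
            self._stack.pop()
        if tag == "tr":
            # End of a table row: keep it only if it produced any fields.
            if self._row:
                self.rows.append(self._row)
            self._row = []

    def handle_data(self, data):
        text = data.strip()
        if text and "/".join(self._stack) in self.WANTED:
            self._row.append(text)


def parse_report(raw_bytes, encoding="gbk"):
    """Decode the report bytes and return one list of fields per row."""
    parser = ReportParser()
    parser.feed(raw_bytes.decode(encoding))
    parser.close()
    return parser.rows


# Two rows modeled on the sample report, encoded the way the file would be.
sample = (
    "<table><tr><td>1</td>"
    "<td><a href='javascript:openwindow(0);'>\nFTP緩沖區溢出</a></td>"
    "<td><a href='#'> CVE-1999-0789</a></td>"
    "<td>FTP測試</td>"
    "<td><font color=#FF00FF>高風險</font></td></tr>"
    "<tr><td>2</td>"
    "<td><a href='javascript:openwindow(2);'>AFS客戶版本</a></td>"
    "<td></td><td>信息獲取測試</td>"
    "<td><font color=#00CC00>信息</font></td></tr></table>"
).encode("gbk")

high = [row for row in parse_report(sample) if row[-1] == "高風險"]
for row in high:
    print(row)
```

Keeping the lists on the instance rather than on the class means two parser objects never share records, and closing a record at `</tr>` avoids relying on the risk cell being the last one collected.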
