python beautifulsoup html

BeautifulSoup Notes

2018-03-02 (Fri)
BeautifulSoup is a library that helps us parse HTML. This post mainly records the commands I use most often when working with BeautifulSoup.

Installation

pip install beautifulsoup4

Downloading a web page and scraping specific content

Suppose we want to scrape the "country and region codes" table (ISO 3166-1) from Wikipedia and turn it into a pandas DataFrame:

Fetch the HTML string of a page

import urllib.request  # the request submodule must be imported explicitly
from bs4 import BeautifulSoup
import pandas as pd

html = urllib.request.urlopen("https://zh.wikipedia.org/zh-tw/ISO_3166-1").read()
soup = BeautifulSoup(html, 'html.parser')
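A slightly more defensive version of the fetch step is sketched below. It assumes that some servers may reject requests without a descriptive User-Agent, so it sets one explicitly and adds a timeout. The helper names `build_request` and `fetch_html` and the user agent string are hypothetical, not part of the original notes.

```python
import urllib.request


def build_request(url, user_agent="bs4-notes-example/0.1"):
    """Build a Request carrying an explicit User-Agent (hypothetical value)."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, timeout=10):
    """Download a page and return its raw bytes; raises on network errors."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        return resp.read()


# Usage (needs network access, so it is left commented out):
# html = fetch_html("https://zh.wikipedia.org/zh-tw/ISO_3166-1")
```

The `with` block closes the connection as soon as the body has been read.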

Use the class to pick a specific table out of the HTML

table = soup.find('table', {'class': 'wikitable sortable'})
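One thing to watch: passing a string with several class names, as above, matches the exact value of the class attribute, so the order of the names matters. The sketch below uses a small made-up HTML snippet (not the Wikipedia page) to compare that lookup with a CSS selector and with a single class name.

```python
from bs4 import BeautifulSoup

# Made-up sample: the same two classes in different orders.
html = """
<table class="sortable wikitable" id="a"></table>
<table class="wikitable sortable" id="b"></table>
"""
soup = BeautifulSoup(html, "html.parser")

# Exact string match on the class attribute: only the second table matches.
exact = soup.find("table", {"class": "wikitable sortable"})

# CSS selector: any table carrying both classes, in any order.
by_css = soup.select_one("table.wikitable.sortable")

# Single class name: any table that has "wikitable" among its classes.
by_one_class = soup.find("table", class_="wikitable")

print(exact["id"], by_css["id"], by_one_class["id"])  # b a a
```

If the page might reorder or extend its class list, the selector form is probably the safer choice.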

Build the column names

columns = [th.text.replace('\n', '') for th in table.find('tr').find_all('th')]
columns

['英文短名稱', '二位代碼', '三位代碼', '數字代碼', 'ISO 3166-2', '中文名稱', '獨立主權']

Build the row data for each country

trs = table.find_all('tr')[1:]
rows = list()
for tr in trs:
    rows.append([td.text.replace('\n', '').replace('\xa0', '') for td in tr.find_all('td')])
rows[:5]

[['Afghanistan', 'AF', 'AFG', '004', 'ISO 3166-2:AF', '阿富汗', '是'],
 ['Åland Islands', 'AX', 'ALA', '248', 'ISO 3166-2:AX', '奧蘭', '否'],
 ['Albania', 'AL', 'ALB', '008', 'ISO 3166-2:AL', '阿爾巴尼亞', '是'],
 ['Algeria', 'DZ', 'DZA', '012', 'ISO 3166-2:DZ', '阿爾及利亞', '是'],
 ['American Samoa', 'AS', 'ASM', '016', 'ISO 3166-2:AS', '美屬薩摩亞', '否']]
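The header and row steps above can be folded into one small helper. This is a minimal sketch, assuming a simple table whose first tr holds the th header cells. The name `table_to_rows` and the sample HTML are made up for illustration. It uses `get_text(strip=True)` instead of chained `replace` calls, which trims whitespace (including `\xa0`) only at the edges of each text node rather than removing it everywhere.

```python
from bs4 import BeautifulSoup


def table_to_rows(table):
    """Return (columns, rows) for a table whose first tr is the header row."""
    trs = table.find_all("tr")
    columns = [th.get_text(strip=True) for th in trs[0].find_all("th")]
    rows = []
    for tr in trs[1:]:
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:  # skip rows that contain no td cells
            rows.append(cells)
    return columns, rows


# Made-up sample mimicking the trailing newlines and non-breaking spaces.
sample = (
    '<table class="wikitable sortable">'
    "<tr><th>Name\n</th><th>Alpha-2\n</th></tr>"
    "<tr><td>Afghanistan\n</td><td>AF\xa0</td></tr>"
    "<tr><td>Albania\n</td><td>AL\n</td></tr>"
    "</table>"
)
columns, rows = table_to_rows(BeautifulSoup(sample, "html.parser").find("table"))
print(columns)  # ['Name', 'Alpha-2']
print(rows)     # [['Afghanistan', 'AF'], ['Albania', 'AL']]
```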

Build the DataFrame

df = pd.DataFrame(data=rows, columns=columns)
df.head()

	英文短名稱	二位代碼	三位代碼	數字代碼	ISO 3166-2	中文名稱	獨立主權
0	Afghanistan	AF	AFG	004	ISO 3166-2:AF	阿富汗	是
1	Åland Islands	AX	ALA	248	ISO 3166-2:AX	奧蘭	否
2	Albania	AL	ALB	008	ISO 3166-2:AL	阿爾巴尼亞	是
3	Algeria	DZ	DZA	012	ISO 3166-2:DZ	阿爾及利亞	是
4	American Samoa	AS	ASM	016	ISO 3166-2:AS	美屬薩摩亞	否

Finding specific HTML elements

Suppose we have a string that represents a table:

html = """<div><table border="1" class="dataframe"><thead><tr style="text-align:right;"><th></th><th>x</th><th>y</th></tr></thead><tbody><tr><th>0</th><td>-2.863752</td><td>-1.066424</td></tr><tr><th>1</th><td>-0.779238</td><td>0.862169</td></tr></tbody></table></div>"""

Rendered as HTML:

	x	y
0	-2.863752	-1.066424
1	-0.779238	0.862169

The actual HTML structure:

<div>
   <table border="1" class="dataframe">
      <thead>
         <tr style="text-align: right;">
            <th></th>
            <th>x</th>
            <th>y</th>
         </tr>
      </thead>
      <tbody>
         <tr>
            <th>0</th>
            <td>-2.863752</td>
            <td>-1.066424</td>
         </tr>
         <tr>
            <th>1</th>
            <td>-0.779238</td>
            <td>0.862169</td>
         </tr>
      </tbody>
   </table>
</div>

Parse the HTML with a BeautifulSoup object:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
soup

<div><table border="1" class="dataframe"><thead><tr style="text-align:right;"><th></th><th>x</th><th>y</th></tr></thead><tbody><tr><th>0</th><td>-2.863752</td><td>-1.066424</td></tr><tr><th>1</th><td>-0.779238</td><td>0.862169</td></tr></tbody></table></div>

Find the first table tag that matches the condition

table = soup.find('table', {'class': 'dataframe'})
table

<table border="1" class="dataframe"><thead><tr style="text-align:right;"><th></th><th>x</th><th>y</th></tr></thead><tbody><tr><th>0</th><td>-2.863752</td><td>-1.066424</td></tr><tr><th>1</th><td>-0.779238</td><td>0.862169</td></tr></tbody></table>
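As a small sketch of how the lookup methods differ: `find` returns only the first match, `find_all` returns every match, and `find` returns `None` when nothing matches, so the result is worth checking before use. The HTML here is a cut-down stand-in for the table above.

```python
from bs4 import BeautifulSoup

html = '<div><table class="dataframe"><tr><td>1</td><td>2</td></tr></table></div>'
soup = BeautifulSoup(html, "html.parser")

first_td = soup.find("td")       # first matching tag only
all_tds = soup.find_all("td")    # every matching tag, as a list-like ResultSet
missing = soup.find("table", {"class": "no-such-class"})  # no match gives None

print(first_td.text, len(all_tds), missing)  # 1 2 None

# Guard against a missing element before touching its attributes.
if missing is None:
    print("no matching table")
```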

Setting new attributes / classes

The table object we just retrieved is a reference to the corresponding object inside soup, so changing its attributes is reflected directly in the soup object:

table['class'] = table['class'] + ['table', 'table-striped', 'table-responsive']

soup

<div><table border="1" class="dataframe table table-striped table-responsive"><thead><tr style="text-align:right;"><th></th><th>x</th><th>y</th></tr></thead><tbody><tr><th>0</th><td>-2.863752</td><td>-1.066424</td></tr><tr><th>1</th><td>-0.779238</td><td>0.862169</td></tr></tbody></table></div>
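Other attributes work the same way as class. The sketch below, on a cut-down table, adds an attribute, appends to the class list in place, and deletes an attribute. The `id` value is made up.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<table border="1" class="dataframe"></table>', "html.parser")
table = soup.find("table")

table["id"] = "results"         # add a new attribute (hypothetical value)
table["class"].append("table")  # class is stored as a list, so mutate it in place
del table["border"]             # remove an attribute entirely

print(table.get("border"))  # None; .get() avoids a KeyError for missing attributes
print(table["class"])
print(soup)
```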

Iterating over a tag's child tags

for c in table.children:
    print(f'{c.name} in {table.name}')

thead in table
tbody in table
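The output above contains only tags because the sample HTML has no whitespace between elements. With indented HTML, `.children` also yields the whitespace text nodes. A minimal sketch of two ways to keep only the tags, using a made-up indented table:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = """<table>
  <thead><tr><th>x</th></tr></thead>
  <tbody><tr><td>1</td></tr></tbody>
</table>"""
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")

# .children includes the whitespace strings between the tags.
kinds = [type(c).__name__ for c in table.children]
print(kinds)

# Option 1: filter by type.
tag_names = [c.name for c in table.children if isinstance(c, Tag)]

# Option 2: find_all(True) matches every tag; recursive=False limits it to direct children.
direct = [t.name for t in table.find_all(True, recursive=False)]

print(tag_names, direct)  # ['thead', 'tbody'] ['thead', 'tbody']
```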

Removing tags

Suppose we want to remove the first data row of the table (the second tr tag). We can call extract() on the tag object we want to remove.

	x	y
0	-2.863752	-1.066424
1	-0.779238	0.862169

for i, tr in enumerate(soup.find_all('tr')):
    if i == 1:
        tr.extract()

	x	y
1	-0.779238	0.862169
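`extract()` also returns the removed tag, so it can be reused elsewhere. When the removed tag is not needed, `decompose()` removes and destroys it instead. A short sketch on a made-up three-row table:

```python
from bs4 import BeautifulSoup

html = "<table><tr><th>x</th></tr><tr><td>1</td></tr><tr><td>2</td></tr></table>"
soup = BeautifulSoup(html, "html.parser")

rows = soup.find_all("tr")
removed = rows[1].extract()  # detach from the tree and keep the tag
rows[2].decompose()          # detach from the tree and destroy the tag

print(removed)  # <tr><td>1</td></tr>
print(soup)     # <table><tr><th>x</th></tr></table>
```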

Creating new tags

Suppose we want to create a new blockquote tag and add some text to it:

text = 'I love BeautifulSoup!'

blockquote = soup.new_tag('blockquote')
blockquote.append(text)
blockquote

<blockquote>I love BeautifulSoup!</blockquote>
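A tag made with `new_tag` is not part of the document until it is attached somewhere. The sketch below shows two ways to place it, before an existing tag or at the end of a parent. The surrounding HTML and the extra `p` tag are made up for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><table></table></div>", "html.parser")

blockquote = soup.new_tag("blockquote")
blockquote.string = "I love BeautifulSoup!"

# Place the new tag right before the table.
soup.find("table").insert_before(blockquote)

# Or append a tag as the last child of a parent.
footer = soup.new_tag("p")
footer.string = "end"
soup.div.append(footer)

print(soup)
# <div><blockquote>I love BeautifulSoup!</blockquote><table></table><p>end</p></div>
```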
