介紹:
Pandas是我們經常使用的數據處理工具包,他有強大的ETL功能及數據分析的功能,但是也有一些其他類型的ETL工具包, 使用起來也很方便,今天我們介紹的是petl。
petl是ETL的Python軟件包(因此名稱為“ petl”)。與panda相似,petl允許用戶通過從許多可能的數據源(csv,xls,html,txt,json等)中提取數據並輸出到您選擇的數據庫或存儲格式,從而在Python中構建表。 petl具有與pandas相同的功能,但專為ETL工作而設計,並且不包含內置的分析功能,因此,如果您對ETL完全感興趣,可以選擇使用它。
安裝:
<code>pip install petl/<code>
抽取和加載各種格式的文件(Extract/Load):
petl支持各種常見的文件類型,如csv,xml,html,json,excel,database,hdf5以及Pandas的Dataframe等,功能非常強大,操作也非常簡單,舉例如下:
csv:
<code>>>> import petl as etl>>> import csv>>> # set up a CSV file to demonstrate with... table1 = [['foo', 'bar'],... ['a', 1],... ['b', 2],... ['c', 2]]>>> with open('example.csv', 'w') as f:... writer = csv.writer(f)... writer.writerows(table1)...>>> # now demonstrate the use of fromcsv()... table2 = etl.fromcsv('example.csv')>>> table2+-----+-----+| foo | bar |+=====+=====+| 'a' | '1' |+-----+-----+| 'b' | '2' |+-----+-----+| 'c' | '2' |+-----+-----+/<code>
json:
<code>>>> import petl as etl>>> data = '''... [{"foo": "a", "bar": 1},... {"foo": "b", "bar": 2},... {"foo": "c", "bar": 2}]... '''>>> with open('example.json', 'w') as f:... f.write(data)...74>>> table1 = etl.fromjson('example.json', header=['foo', 'bar'])>>> table1+-----+-----+| foo | bar |+=====+=====+| 'a' | 1 |+-----+-----+| 'b' | 2 |+-----+-----+| 'c' | 2 |+-----+-----+/<code>
xml:
<code>>>> import petl as etl>>> # setup a file to demonstrate with... d = '''<table>...... ...foo bar ...... ...a 1 ...... ...b 2 ...... ... /<table>'''>>> with open('example1.xml', 'w') as f:... f.write(d)...212>>> table1 = etl.fromxml('example1.xml', 'tr', 'td')>>> table1+-----+-----+| foo | bar |+=====+=====+| 'a' | '1' |+-----+-----+| 'b' | '2' |+-----+-----+| 'c' | '2' |+-----+-----+/<code>c 2 ...
Pandas的Dataframe:
<code>>>> import petl as etl>>> import pandas as pd>>> records = [('apples', 1, 2.5), ('oranges', 3, 4.4), ('pears', 7, 0.1)]>>> df = pd.DataFrame.from_records(records, columns=('foo', 'bar', 'baz'))>>> table = etl.fromdataframe(df)>>> table+-----------+-----+-----+| foo | bar | baz |+===========+=====+=====+| 'apples' | 1 | 2.5 |+-----------+-----+-----+| 'oranges' | 3 | 4.4 |+-----------+-----+-----+| 'pears' | 7 | 0.1 |+-----------+-----+-----+/<code>
轉換(Transform):
<code># 增加一列>>> import petl as etl>>> table1 = [['foo', 'bar'],... ['M', 12],... ['F', 34],... ['-', 56]]>>> # using a fixed value... table2 = etl.addfield(table1, 'baz', 42)>>> table2+-----+-----+-----+| foo | bar | baz |+=====+=====+=====+| 'M' | 12 | 42 |+-----+-----+-----+| 'F' | 34 | 42 |+-----+-----+-----+| '-' | 56 | 42 |+-----+-----+-----+ # 使用正則表達式>>> etl.search(table1, 'foo', 'M')+-----+-----+| foo | bar |+=====+=====+| 'M' | 12 |+-----+-----+ # join>>> table1 = [['id', 'colour'],... [1, 'blue'],... [2, 'red'],... [3, 'purple']]>>> table2 = [['id', 'shape'],... [1, 'circle'],... [3, 'square'],... [4, 'ellipse']]>>> table3 = etl.join(table1, table2, key='id')>>> table3+----+----------+----------+| id | colour | shape |+====+==========+==========+| 1 | 'blue' | 'circle' |+----+----------+----------+| 3 | 'purple' | 'square' |+----+----------+----------+# sort>>> table1 = [['foo', 'bar'],... ['C', 2],... ['A', 9],... ['A', 6],... ['F', 1],... ['D', 10]]>>> table2 = etl.sort(table1, 'foo')>>> table2+-----+-----+| foo | bar |+=====+=====+| 'A' | 9 |+-----+-----+| 'A' | 6 |+-----+-----+| 'C' | 2 |+-----+-----+| 'D' | 10 |+-----+-----+| 'F' | 1 |+-----+-----+/<code>
總結:
以上是petl的簡單用法,對於單純的ETL操作,petl還是很方便的,有興趣的可以查看官方文檔
地址: https://github.com/petl-developers/petl
閱讀更多 數據大視界 的文章