petl:專為ETL工作而設計的Python工具包


petl:專為ETL工作而設計的Python工具包


介紹:

Pandas是我們經常使用的數據處理工具包,他有強大的ETL功能及數據分析的功能,但是也有一些其他類型的ETL工具包, 使用起來也很方便,今天我們介紹的是petl。

petl是ETL的Python軟件包(因此名稱為“ petl”)。與panda相似,petl允許用戶通過從許多可能的數據源(csv,xls,html,txt,json等)中提取數據並輸出到您選擇的數據庫或存儲格式,從而在Python中構建表。 petl具有與pandas相同的功能,但專為ETL工作而設計,並且不包含內置的分析功能,因此,如果您對ETL完全感興趣,可以選擇使用它。

安裝:

<code>pip install petl/<code>

抽取和加載各種格式的文件(Extract/Load):

petl支持各種常見的文件類型,如csv,xml,html,json,excel,database,hdf5以及Pandas的Dataframe等,功能非常強大,操作也非常簡單,舉例如下:

csv:

<code>>>> import petl as etl>>> import csv>>> # set up a CSV file to demonstrate with... table1 = [['foo', 'bar'],...           ['a', 1],...           ['b', 2],...           ['c', 2]]>>> with open('example.csv', 'w') as f:...     writer = csv.writer(f)...     writer.writerows(table1)...>>> # now demonstrate the use of fromcsv()... table2 = etl.fromcsv('example.csv')>>> table2+-----+-----+| foo | bar |+=====+=====+| 'a' | '1' |+-----+-----+| 'b' | '2' |+-----+-----+| 'c' | '2' |+-----+-----+/<code>

json:

<code>>>> import petl as etl>>> data = '''... [{"foo": "a", "bar": 1},... {"foo": "b", "bar": 2},... {"foo": "c", "bar": 2}]... '''>>> with open('example.json', 'w') as f:...     f.write(data)...74>>> table1 = etl.fromjson('example.json', header=['foo', 'bar'])>>> table1+-----+-----+| foo | bar |+=====+=====+| 'a' |   1 |+-----+-----+| 'b' |   2 |+-----+-----+| 'c' |   2 |+-----+-----+/<code>

xml:

<code>>>> import petl as etl>>> # setup a file to demonstrate with... d = '''<table>...     ...         foobar...     ...     ...         a1...     ...     ...         b2...     ...     ...         c2...     ... /<table>'''>>> with open('example1.xml', 'w') as f:...     f.write(d)...212>>> table1 = etl.fromxml('example1.xml', 'tr', 'td')>>> table1+-----+-----+| foo | bar |+=====+=====+| 'a' | '1' |+-----+-----+| 'b' | '2' |+-----+-----+| 'c' | '2' |+-----+-----+/<code>

Pandas的Dataframe:

<code>>>> import petl as etl>>> import pandas as pd>>> records = [('apples', 1, 2.5), ('oranges', 3, 4.4), ('pears', 7, 0.1)]>>> df = pd.DataFrame.from_records(records, columns=('foo', 'bar', 'baz'))>>> table = etl.fromdataframe(df)>>> table+-----------+-----+-----+| foo       | bar | baz |+===========+=====+=====+| 'apples'  |   1 | 2.5 |+-----------+-----+-----+| 'oranges' |   3 | 4.4 |+-----------+-----+-----+| 'pears'   |   7 | 0.1 |+-----------+-----+-----+/<code>

轉換(Transform):

<code># 增加一列>>> import petl as etl>>> table1 = [['foo', 'bar'],...           ['M', 12],...           ['F', 34],...           ['-', 56]]>>> # using a fixed value... table2 = etl.addfield(table1, 'baz', 42)>>> table2+-----+-----+-----+| foo | bar | baz |+=====+=====+=====+| 'M' |  12 |  42 |+-----+-----+-----+| 'F' |  34 |  42 |+-----+-----+-----+| '-' |  56 |  42 |+-----+-----+-----+  # 使用正則表達式>>> etl.search(table1, 'foo', 'M')+-----+-----+| foo | bar |+=====+=====+| 'M' |  12 |+-----+-----+ # join>>> table1 = [['id', 'colour'],...           [1, 'blue'],...           [2, 'red'],...           [3, 'purple']]>>> table2 = [['id', 'shape'],...           [1, 'circle'],...           [3, 'square'],...           [4, 'ellipse']]>>> table3 = etl.join(table1, table2, key='id')>>> table3+----+----------+----------+| id | colour   | shape    |+====+==========+==========+|  1 | 'blue'   | 'circle' |+----+----------+----------+|  3 | 'purple' | 'square' |+----+----------+----------+# sort>>> table1 = [['foo', 'bar'],...           ['C', 2],...           ['A', 9],...           ['A', 6],...           ['F', 1],...           ['D', 10]]>>> table2 = etl.sort(table1, 'foo')>>> table2+-----+-----+| foo | bar |+=====+=====+| 'A' |   9 |+-----+-----+| 'A' |   6 |+-----+-----+| 'C' |   2 |+-----+-----+| 'D' |  10 |+-----+-----+| 'F' |   1 |+-----+-----+/<code>

總結:

以上是petl的簡單用法,對於單純的ETL操作,petl還是很方便的,有興趣的可以查看官方文檔

地址: https://github.com/petl-developers/petl


分享到:


相關文章: