Using Pandas in Python, Part 11: Reading Data from Various File Types

Introduction:

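In the previous article we looked at the two pandas data objects, Series and DataFrame, and learned how to build either one from scratch. In this article we will learn how to read tabular data from various file formats (such as csv, tsv, xls, html, json, and SQL databases) and create a pandas DataFrame from it.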

pandas provides a dedicated reader method for each of these file formats; we will go through a few of them one by one.
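
As a quick orientation, the readers for the formats named in the introduction can be listed in code. This is a small sketch rather than an exhaustive list: the `readers` dictionary is only an illustrative name, while the functions themselves are part of the public pandas API.

```python
import pandas as pd

# Map each file format mentioned in the introduction to its pandas reader.
readers = {
    "csv": pd.read_csv,
    "tsv": pd.read_table,
    "xls": pd.read_excel,
    "html": pd.read_html,
    "json": pd.read_json,
    "sql": pd.read_sql,
}

for file_format, reader in readers.items():
    print(file_format, "->", reader.__name__)
```

Note that some of these readers (for example `read_excel` and `read_html`) rely on optional third-party packages that must be installed before they can actually be called.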

Constructing a Pandas DataFrame from Tab-Separated Values (TSV)

We can construct a pandas DataFrame from a tsv file. By default, pandas uses the first line as the header and assigns a default integer index. Run the following code:

<code>import pandas as pd

# read_table expects tab-separated values by default
df = pd.read_table('http://bit.ly/chiporders')
</code>

The URL above points to well-formed tab-separated values whose first line is the header.
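
To see the same defaults without a network connection, a small tab-separated string can stand in for the file. This is a sketch under that assumption: the sample rows and the names `tsv_text` and `df_sample` are made up for illustration and are not the contents of the URL above.

```python
import io
import pandas as pd

# Hypothetical sample data: tab-separated, first line is the header.
tsv_text = "order_id\tquantity\titem_name\n1\t1\tIzze\n2\t2\tChicken Bowl\n"

# StringIO lets read_table treat the string like an open file.
df_sample = pd.read_table(io.StringIO(tsv_text))

print(df_sample.columns.tolist())  # header taken from the first line
print(df_sample.index.tolist())    # default integer index
```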

Inspecting a Pandas DataFrame

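You can inspect a DataFrame with df.head() or df.tail(). By default these show 5 rows from the top or the bottom, together with the header and the index. To see more rows, pass a number in the parentheses, for example df.head(10) or df.tail(15). Let's look at the DataFrame created in the step above.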

<code>print(df.head())

# Output

   order_id  quantity                              item_name                                 choice_description item_price
0         1         1           Chips and Fresh Tomato Salsa                                                NaN     $2.39 
1         1         1                                   Izze                                       [Clementine]     $3.39 
2         1         1                       Nantucket Nectar                                            [Apple]     $3.39 
3         1         1  Chips and Tomatillo-Green Chili Salsa                                                NaN     $2.39 
4         2         2                           Chicken Bowl  [Tomatillo-Red Chili Salsa (Hot), [Black Beans...    $16.98 

</code>

As you can see, pandas automatically used the first line of the file as the header, that is, as the column names of the DataFrame.
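
The effect of the row-count argument can be checked on a tiny DataFrame. The following is a minimal sketch; the 20-row frame `numbers` is a hypothetical stand-in used only to make the counts visible.

```python
import pandas as pd

# Hypothetical 20-row frame used only to demonstrate head() and tail().
numbers = pd.DataFrame({"n": range(20)})

first_five = numbers.head()      # default: first 5 rows
first_ten = numbers.head(10)     # first 10 rows
last_fifteen = numbers.tail(15)  # last 15 rows

print(len(first_five), len(first_ten), len(last_fifteen))
```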

However, if the file is not in the expected format, we have to pass parameters such as sep and names to pd.read_table(). Consider the following example:

<code>df = pd.read_table('http://bit.ly/movieusers')
print(df.head())

# Output
   1|24|M|technician|85711
0       2|53|F|other|94043
1      3|23|M|writer|32067
2  4|24|M|technician|43537
3       5|33|F|other|15213
4   6|42|M|executive|98101
</code>

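Looking at the first rows, we can see that the file is not tab-separated but separated by '|' (the pipe character), and that it has no header line. pandas therefore read everything into a single column and used the first record as the header. So let's supply the necessary arguments.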

<code>df = pd.read_table('http://bit.ly/movieusers', sep='|', names=['id', 'age', 'sex', 'occupation', 'zip code'])
print(df.head())

# Output

   id  age sex  occupation zip code
0   1   24   M  technician    85711
1   2   53   F       other    94043
2   3   23   M      writer    32067
3   4   24   M  technician    43537
4   5   33   F       other    15213
</code>

Now our DataFrame looks right.
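
The difference the two arguments make can be reproduced offline. This sketch assumes a two-line pipe-delimited sample (`pipe_text`, hypothetical data in the same layout as the file above) and compares the resulting shapes.

```python
import io
import pandas as pd

# Hypothetical pipe-delimited sample with no header line.
pipe_text = "1|24|M|technician|85711\n2|53|F|other|94043\n"
column_names = ["id", "age", "sex", "occupation", "zip code"]

# Without sep/names: one column, and the first record becomes the header.
raw = pd.read_table(io.StringIO(pipe_text))

# With sep/names: five columns, and every record is kept as data.
users = pd.read_table(io.StringIO(pipe_text), sep="|", names=column_names)

print(raw.shape)
print(users.shape)
```

Passing `names` tells pandas that the file itself carries no header, so the first record is no longer consumed as column labels.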

Constructing a Pandas DataFrame from a CSV File

Just as above, we can easily build a pandas DataFrame from a csv file with pandas.read_csv().

<code>import pandas as pd
df = pd.read_csv('http://bit.ly/uforeports')
print(df.head())

# Output
                   City Colors Reported Shape Reported State             Time
0                Ithaca             NaN       TRIANGLE    NY   6/1/1930 22:00
1           Willingboro             NaN          OTHER    NJ  6/30/1930 20:00
2               Holyoke             NaN           OVAL    CO  2/15/1931 14:00
3               Abilene             NaN           DISK    KS   6/1/1931 13:00
4  New York Worlds Fair             NaN          LIGHT    NY  4/18/1933 19:00
</code>
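
The two readers differ mainly in their default separator: `read_csv` defaults to a comma and `read_table` to a tab. A minimal sketch of that relationship, using made-up sample strings (`csv_text`, `tsv_text`) rather than the files above:

```python
import io
import pandas as pd

# Hypothetical samples holding the same records in two layouts.
csv_text = "City,State\nIthaca,NY\nHolyoke,CO\n"
tsv_text = "City\tState\nIthaca\tNY\nHolyoke\tCO\n"

from_csv = pd.read_csv(io.StringIO(csv_text))
from_tsv = pd.read_table(io.StringIO(tsv_text))
# read_csv can also read tab-separated data when sep is given explicitly.
from_tsv_via_csv = pd.read_csv(io.StringIO(tsv_text), sep="\t")

print(from_csv.equals(from_tsv))
print(from_csv.equals(from_tsv_via_csv))
```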

Setting the Index Column of a Pandas DataFrame

Here pandas assigned a default integer index to the DataFrame. You can make any column the index with either of the following approaches.

<code># Option 1: set the index on a DataFrame that is already loaded
df.set_index('City', inplace=True)

# Option 2: choose the index column while reading the file
df = pd.read_csv('http://bit.ly/uforeports', index_col=0)
print(df.head())

# Output

                     Colors Reported Shape Reported State             Time
City                                                                      
Ithaca                           NaN       TRIANGLE    NY   6/1/1930 22:00
Willingboro                      NaN          OTHER    NJ  6/30/1930 20:00
Holyoke                          NaN           OVAL    CO  2/15/1931 14:00
Abilene                          NaN           DISK    KS   6/1/1931 13:00
New York Worlds Fair             NaN          LIGHT    NY  4/18/1933 19:00
</code>
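
That the two approaches give the same result can be verified on a small in-memory sample. This is a sketch with hypothetical data (`csv_text`); `index_col` also accepts a column name instead of a position.

```python
import io
import pandas as pd

# Hypothetical two-row sample in the same layout as the UFO reports file.
csv_text = "City,State\nIthaca,NY\nHolyoke,CO\n"

by_set_index = pd.read_csv(io.StringIO(csv_text)).set_index("City")
by_position = pd.read_csv(io.StringIO(csv_text), index_col=0)
by_name = pd.read_csv(io.StringIO(csv_text), index_col="City")

print(by_set_index.equals(by_position))
print(by_set_index.equals(by_name))
```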

Resetting the Index Column of a Pandas DataFrame

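Likewise, you can use df.reset_index() to restore the default integer index. The old index is moved back into the DataFrame as a regular column.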

<code>df.reset_index(inplace=True)
print(df.head())

# Output

                   City Colors Reported Shape Reported State             Time
0                Ithaca             NaN       TRIANGLE    NY   6/1/1930 22:00
1           Willingboro             NaN          OTHER    NJ  6/30/1930 20:00
2               Holyoke             NaN           OVAL    CO  2/15/1931 14:00
3               Abilene             NaN           DISK    KS   6/1/1931 13:00
4  New York Worlds Fair             NaN          LIGHT    NY  4/18/1933 19:00
</code>
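
Without `inplace=True`, `reset_index()` returns a new DataFrame and leaves the original untouched, and the `drop=True` argument discards the old index instead of turning it into a column. A minimal sketch on a hypothetical two-row frame (`cities`):

```python
import pandas as pd

# Hypothetical frame indexed by City.
cities = pd.DataFrame(
    {"State": ["NY", "CO"]},
    index=pd.Index(["Ithaca", "Holyoke"], name="City"),
)

restored = cities.reset_index()          # City becomes a regular column again
dropped = cities.reset_index(drop=True)  # City is discarded

print(restored.columns.tolist())
print(dropped.columns.tolist())
print(cities.index.name)  # the original frame is unchanged
```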

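In the same way, you can read other kinds of data files with the corresponding pandas reader methods mentioned at the start of this article.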
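
As one example of another reader, `pd.read_json` follows the same pattern. This is a sketch under the assumption that the JSON is a list of records; the sample string `json_text` is made up for illustration.

```python
import io
import pandas as pd

# Hypothetical JSON sample: a list of records, one per row.
json_text = '[{"City": "Ithaca", "State": "NY"}, {"City": "Holyoke", "State": "CO"}]'

# Wrapping the string in StringIO lets read_json treat it like a file.
from_json = pd.read_json(io.StringIO(json_text))

print(from_json.shape)
print(from_json["City"].tolist())
```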

Getting Information About a Pandas DataFrame

You can use the .info() method to get information about a pandas DataFrame. Let's check the information for the DataFrame created above.

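<code># Use City as the index again so the summary matches the output below
df.set_index('City', inplace=True)
df.info()  # info() prints the summary itself and returns None

# Output

<class 'pandas.core.frame.DataFrame'>
Index: 18241 entries, Ithaca to Ybor
Data columns (total 4 columns):
Colors Reported    2882 non-null object
Shape Reported     15597 non-null object
State              18241 non-null object
Time               18241 non-null object
dtypes: object(4)
memory usage: 712.5+ KB
</code>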

This tells us about the DataFrame's columns, non-null value counts, index, data types, memory usage, and so on.
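
One detail worth knowing is that `info()` prints its summary directly and returns `None`. If the text is needed as a string, the `buf` parameter can redirect it. A minimal sketch on a hypothetical two-row frame (`frame`):

```python
import io
import pandas as pd

# Hypothetical frame with one missing value.
frame = pd.DataFrame({"State": ["NY", "CO"], "Shape": ["OVAL", None]})

result = frame.info()  # prints the summary and returns None

buffer = io.StringIO()
frame.info(buf=buffer)  # write the summary into the buffer instead
summary_text = buffer.getvalue()

print(result is None)
print("RangeIndex" in summary_text)
```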

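You can also call the .describe() method on a DataFrame to compute summary statistics for its columns. For columns that hold strings or other objects rather than int or float values, it cannot compute numeric statistics and reports only count, unique, top, and freq, so it may be less useful there.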

<code>df.describe()

# Output

       Colors Reported Shape Reported  State              Time
count             2882          15597  18241             18241
unique              27             27     52             16145
top                RED          LIGHT     CA  11/16/1999 19:00
freq               780           2803   2529                27
</code>
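
The contrast between numeric and string columns is easier to see on a frame that has both. This is a sketch with hypothetical data; the `sightings` column is invented for illustration and does not exist in the UFO reports file.

```python
import pandas as pd

# Hypothetical frame with one string column and one numeric column.
frame = pd.DataFrame({"State": ["NY", "NY", "CO"], "sightings": [3, 5, 10]})

# On a mixed frame, describe() summarizes only the numeric columns by default.
numeric_summary = frame.describe()

# Describing a string column on its own gives count, unique, top and freq.
text_summary = frame[["State"]].describe()

print(numeric_summary.loc["mean", "sightings"])
print(text_summary.index.tolist())
```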