Python – pandasでSQLのようにselect/where/groupby

業務を行なっているとdata lakeから直接データ取得して前処理しなければならない場合によく出くわします。そのため、大量データをpandas/dataframeで処理する機会が非常に多いように思います。

そこで今回は、個人的に少し利用頻度の高いPython – pandasでSQLのようにselect/where/groupbyした時のことを記録しようと思います。

1. 前処理

select/where/groupbyで使用するdataframeを作成します。まず、pandasをimportしてから、商品データを作成します。

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({
   ...:     'Product_ID':[101,102,103,104,105,106,107],
   ...:     'Product_name':['Watch','Bag','Shoes','Smartphone','Books','Oil','Laptop'],
   ...:     'Category':['Fashion','Fashion','Fashion','Electronics','Study','Grocery','Electronics'],
   ...:     'Price':[299.0,1350.50,2999.0,14999.0,145.0,110.0,79999.0],
   ...:     'Seller_City':['Tokyo','Chiba','Kanagawa','Saitama','Tokyo','Kanagawa','Gunnma']
   ...: })
   ...: df
Out[2]: 
   Product_ID Product_name     Category    Price Seller_City
0         101        Watch      Fashion    299.0       Tokyo
1         102          Bag      Fashion   1350.5       Chiba
2         103        Shoes      Fashion   2999.0    Kanagawa
3         104   Smartphone  Electronics  14999.0     Saitama
4         105        Books        Study    145.0       Tokyo
5         106          Oil      Grocery    110.0    Kanagawa
6         107       Laptop  Electronics  79999.0      Gunnma

2. SQLのselect句のように射影

先ほど作成したdataframeを使用して、SQLのselect句のように、カラム指定しデータを射影してみます。

In [3]: df_1 = df.copy()
   ...: df_1 = df_1[[
   ...:     'Product_ID',
   ...:     'Product_name',
   ...:     'Category',
   ...:     'Price',
   ...: ]]
   ...: df_1
Out[3]: 
   Product_ID Product_name     Category    Price
0         101        Watch      Fashion    299.0
1         102          Bag      Fashion   1350.5
2         103        Shoes      Fashion   2999.0
3         104   Smartphone  Electronics  14999.0
4         105        Books        Study    145.0
5         106          Oil      Grocery    110.0
6         107       Laptop  Electronics  79999.0

df_1 = df.copy() は必要なの？と思うかもしれませんが、これを行わないと以下warningが表示されると思います。Returning a view versus a copyのセクション参照に、

When setting values in a pandas object, care must be taken to avoid what is called chained indexing.
https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

と記載されています。以下の対応策としては、dataframeのcopyを作成してからカラムの絞り込みを行うことでWarningを回避することができると思われます。

<ipython-input-44-364df0866bc3>:4: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_1['Price'] = df_1['Price'].astype(int)

ちなみに、業務ではカラム数が非常に多い場合がよくあるので私は以下をUtility関数に含めて、カラムを取得しています。

def print_col(df_base):
    for col in df_base.columns: 
        print("    '{}',".format(col))

また、カラムが多い場合、絞り込みの確認にはinfo()を使用した方がカラムの絞り込みが簡単に確認できると思います。個人的には、ほぼこちらで確認しています。

In [4]: df_1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Product_ID    7 non-null      int64  
 1   Product_name  7 non-null      object 
 2   Category      7 non-null      object 
 3   Price         7 non-null      float64
dtypes: float64(1), int64(1), object(2)
memory usage: 352.0+ bytes

dataframeをintでcastする場合、以下のようなエラーに出くわすことが時々あります。intはNULL許容しないことが原因のようです。

ValueError: Cannot convert non-finite values (NA or inf) to integer

そこで、以下のように’Int64’型でcastするとエラーは解消することができる場合があります。

df_1['Price'] = df_1['Price'].astype('Int64')

3. SQLのselect句のようにカラム名変更

SQLのselect句では、簡単にカラム名変更できると思います。dataframeでもリネームが以下のようにできます。ここでは、”Price”カラムを”price_edit”にカラム名を変更しています。

In [5]: # Rename Column
   ...: df_1 = df_1.rename(columns={
   ...:     'Price': 'price_edit'
   ...: })
   ...: df_1
Out[5]: 
   Product_ID Product_name     Category  price_edit
0         101        Watch      Fashion       299.0
1         102          Bag      Fashion      1350.5
2         103        Shoes      Fashion      2999.0
3         104   Smartphone  Electronics     14999.0
4         105        Books        Study       145.0
5         106          Oil      Grocery       110.0
6         107       Laptop  Electronics     79999.0

4. SQLのwhere句のように絞り込み

1.前処理で作成したdataframeを使用して、SQLのwhere句のように、絞り込み条件を指定して絞り込みしてみます。

In [6]: df_2 = df_1[(df_1['price_edit'] > 1000) & (df_1['Category'] == 'Fashion')]
   ...: df_2
Out[6]: 
   Product_ID Product_name Category  price_edit
1         102          Bag  Fashion      1350.5
2         103        Shoes  Fashion      2999.0

ここでは、先ほどカラム名を変更した”price_edit”カラムと”Category”カラムを対象に絞り込みを行なっています。”price_edit”が1000より大きいかつ、”Category”が”Fashion”のレコードに絞り込みを行ない、新たにdataframeを作成して、df_1の結果をdf_2に代入しています。

5. SQLのgroupby句のように集計

最後にSQLのgroupby句のように集計を行なってみます。まずは、datasetを確認してみます。

In [7]: df
Out[7]: 
   Product_ID Product_name     Category    Price Seller_City
0         101        Watch      Fashion    299.0       Tokyo
1         102          Bag      Fashion   1350.5       Chiba
2         103        Shoes      Fashion   2999.0    Kanagawa
3         104   Smartphone  Electronics  14999.0     Saitama
4         105        Books        Study    145.0       Tokyo
5         106          Oil      Grocery    110.0    Kanagawa
6         107       Laptop  Electronics  79999.0      Gunnma

それでは、”Category”にはいくつのProduct_IDが存在しているのかを集計してみます。

In [8]: df.groupby(['Category']).size()
Out[8]: 
Category
Electronics    2
Fashion        3
Grocery        1
Study          1
dtype: int64

CategoryにはFashionが3つ、Electronicsが2つのProduct_IDが存在しているようでした。

次に、複数カラムでgroupbyする場合は、以下のようになります。この場合、”Seller_City”と”Category”カラム毎に”Product_ID”の数を集計しています。

In [9]: df.groupby(['Seller_City', 'Category']).size()
Out[9]: 
Seller_City  Category   
Chiba        Fashion        1
Gunnma       Electronics    1
Kanagawa     Fashion        1
             Grocery        1
Saitama      Electronics    1
Tokyo        Fashion        1
             Study          1
dtype: int64

少しだけ複雑なパターンの場合は、以下の記事が考になるかもしれません。

python – pandas groupbyで文字列をリスト化

6. SQLのorderby句のように並べ替え

5. GROUP BYの実行結果だと結果の件数が例えば1K以上などの件数が多い場合、非常に見づらいと思います。そこで、ここではSQLのorderbyのようにソートをかけてみます。

In [10]: df.groupby(['Category']).size().sort_values(ascending=False)
Out[10]: 
Category
Fashion        3
Electronics    2
Grocery        1
Study          1
dtype: int64

.sort_valuesを追加することで簡単にソートすることができました。ascending=Falseで降順にソートすることができます。

降順の場合は、引数なしで実行するだけです。

In [11]: df.groupby(['Category']).size().sort_values()
Out[11]: 
Category
Grocery        1
Study          1
Electronics    2
Fashion        3
dtype: int64

まとめ

ということで今回は、Python – pandasでSQLのようにselect/where/groupbyについての覚書を記載してみました。集計方法は他にもありますが、今回は使用頻度の高いものについて記載しております。

(Visited 94 times, 1 visits today)