python – pandasでSQLのようにINNER/LEFT/RIGHT JOIN
業務を行なっているとdata lakeから直接データ取得して前処理しなければならない場合によく出くわします。そのため、大量データをpandasで処理する機会が非常に多いように思います。
そこで今回は、個人的に少し利用頻度の高いpython – pandasでSQLのようにINNER/LEFT/RIGHT JOINした時のことを記録しようと思います。
1. 前処理
JOINで使用する商品データ(product)と顧客データ(customer)のdata frameを作成します。まず、pandasをimportしてから、商品データを作成します。
In [1]: import pandas as pd
In [2]: df_product = pd.DataFrame({
...: 'Product_ID':[101,102,103,104,105,106,107],
...: 'Product_name':['Watch','Bag','Shoes','Smartphone','Books','Oil','Laptop'],
...: 'Category':['Fashion','Fashion','Fashion','Electronics','Study','Grocery','Electronics'],
...: 'Price':[299.0,1350.50,2999.0,14999.0,145.0,110.0,79999.0],
...: 'Seller_City':['Delhi','Mumbai','Chennai','Kolkata','Delhi','Chennai','Bengalore']
...: })
...: df_product
Out[2]:
Product_ID Product_name Category Price Seller_City
0 101 Watch Fashion 299.0 Delhi
1 102 Bag Fashion 1350.5 Mumbai
2 103 Shoes Fashion 2999.0 Chennai
3 104 Smartphone Electronics 14999.0 Kolkata
4 105 Books Study 145.0 Delhi
5 106 Oil Grocery 110.0 Chennai
6 107 Laptop Electronics 79999.0 Bengalore
同様に顧客データを作成します。
In [3]: df_customer = pd.DataFrame({
...: 'id':[1,2,3,4,5,6,7,8,9],
...: 'name':['Olivia','Aditya','Cory','Isabell','Dominic','Tyler','Samuel','Daniel','Jeremy'],
...: 'age':[20,25,15,10,30,65,35,18,23],
...: 'Product_ID':[101,0,106,0,103,104,0,0,107],
...: 'Purchased_Product':['Watch','NA','Oil','NA','Shoes','Smartphone','NA','NA','Laptop'],
...: 'City':['Mumbai','Delhi','Bangalore','Chennai','Chennai','Delhi','Kolkata','Delhi','Mumbai']
...: })
...: df_customer
Out[3]:
id name age Product_ID Purchased_Product City
0 1 Olivia 20 101 Watch Mumbai
1 2 Aditya 25 0 NA Delhi
2 3 Cory 15 106 Oil Bangalore
3 4 Isabell 10 0 NA Chennai
4 5 Dominic 30 103 Shoes Chennai
5 6 Tyler 65 104 Smartphone Delhi
6 7 Samuel 35 0 NA Kolkata
7 8 Daniel 18 0 NA Delhi
8 9 Jeremy 23 107 Laptop Mumbai
2. pandasでSQLのINNER JOINのように実装
まずは、INNER JOINを行なってみます。
SQLでのINNER JOIN
それぞれのData frameがproduct、customerというテーブルで存在する場合、SQLではこのようになるはずです。
# SQL
SELECT
*
FROM
product AS p
INNER JOIN
customer AS c
ON
p.Product_ID = c.Product_ID
;
# Out:
Product_ID Product_name Category Price Seller_City id name age Purchased_Product City
0 101 Watch Fashion 299.0 Delhi 1 Olivia 20 Watch Mumbai
1 103 Shoes Fashion 2999.0 Chennai 5 Dominic 30 Shoes Chennai
2 104 Smartphone Electronics 14999.0 Kolkata 6 Tyler 65 Smartphone Delhi
3 106 Oil Grocery 110.0 Chennai 3 Cory 15 Oil Bangalore
4 107 Laptop Electronics 79999.0 Bengalore 9 Jeremy 23 Laptop Mumbai
python – pandasでINNER JOIN (Join key名が同じ)
それでは、本題のpythonコードでpandasを使用してSQLのINNER JOINのように実装していきます。
In [4]: pd.merge(df_product, df_customer, on='Product_ID')
Out[4]:
Product_ID Product_name Category Price Seller_City id name age Purchased_Product City
0 101 Watch Fashion 299.0 Delhi 1 Olivia 20 Watch Mumbai
1 103 Shoes Fashion 2999.0 Chennai 5 Dominic 30 Shoes Chennai
2 104 Smartphone Electronics 14999.0 Kolkata 6 Tyler 65 Smartphone Delhi
3 106 Oil Grocery 110.0 Chennai 3 Cory 15 Oil Bangalore
4 107 Laptop Electronics 79999.0 Bengalore 9 Jeremy 23 Laptop Mumbai
ここでは、製品と顧客のData frameをProduct_IDカラムをキーにして、内部結合(INNER JOIN)してみました。しかし、キーが同一カラムでない場合が多いと思います。その場合はどうなるのかを次のSQLのLEFT JOINのようにpandasで実装してみます。
In [5]: pd.merge(df_product, df_customer, left_on='Product_name', right_on='Purchased_Product')
Out[5]:
Product_ID_x Product_name Category Price Seller_City id name age Product_ID_y Purchased_Product City
0 101 Watch Fashion 299.0 Delhi 1 Olivia 20 101 Watch Mumbai
1 103 Shoes Fashion 2999.0 Chennai 5 Dominic 30 103 Shoes Chennai
2 104 Smartphone Electronics 14999.0 Kolkata 6 Tyler 65 104 Smartphone Delhi
3 106 Oil Grocery 110.0 Chennai 3 Cory 15 106 Oil Bangalore
4 107 Laptop Electronics 79999.0 Bengalore 9 Jeremy 23 107 Laptop Mumbai
python – pandasでINNER JOIN (Join key名が複数)
次に、キーが複数の場合について実装していきます。
In [6]: pd.merge(df_product, df_customer, how='inner', left_on=['Product_ID','Seller_City'], right_on=['Product_ID','City'])
Out[6]:
Product_ID Product_name Category Price Seller_City id name age Purchased_Product City
0 103 Shoes Fashion 2999.0 Chennai 5 Dominic 30 Shoes Chennai
ここでは、製品と顧客のData frameを”Product_ID”カラムと”Seller_City”を複合キーにして、内部結合(INNER JOIN)してみました。SQLは以下のイメージとなるはずです。
# SQL
SELECT
*
FROM
product AS p
INNER JOIN
customer AS c
ON
p.Product_ID = c.Product_ID AND
p.Seller_City = c.City
;
# Out:
Product_ID Product_name Category Price Seller_City id name age Purchased_Product City
0 103 Shoes Fashion 2999.0 Chennai 5 Dominic 30 Shoes Chennai
3. pandasでSQLのLEFT JOINのように実装
では、pandasを使用して、SQLのLEFT JOINのように実装する方法を見ていきます。
SQLでのLEFT JOIN
まずは、SQLで実装すると以下のような記載となるはずです。前回のINNER JOINの時と同様に、product, customerがテーブルとして存在することを仮定して実装します。
# SQL
SELECT
*
FROM
product AS p
LEFT JOIN
customer AS c
ON
p.Product_ID = c.Product_ID
;
# Out:
Product_ID Product_name Category Price Seller_City id name age Purchased_Product City
0 101 Watch Fashion 299.0 Delhi 1.0 Olivia 20.0 Watch Mumbai
1 102 Bag Fashion 1350.5 Mumbai NaN NaN NaN NaN NaN
2 103 Shoes Fashion 2999.0 Chennai 5.0 Dominic 30.0 Shoes Chennai
3 104 Smartphone Electronics 14999.0 Kolkata 6.0 Tyler 65.0 Smartphone Delhi
4 105 Books Study 145.0 Delhi NaN NaN NaN NaN NaN
5 106 Oil Grocery 110.0 Chennai 3.0 Cory 15.0 Oil Bangalore
6 107 Laptop Electronics 79999.0 Bengalore 9.0 Jeremy 23.0 Laptop Mumbai
python – pandasでLEFT JOIN (Join key名が同じ)
それでは、pythonのコードに移ります。pandasを使用して、先ほどと同様のデータである製品、顧客のData frameを使用します。
In [7]: pd.merge(df_product, df_customer, how='left', on='Product_ID')
Out[7]:
Product_ID Product_name Category Price Seller_City id name age Purchased_Product City
0 101 Watch Fashion 299.0 Delhi 1.0 Olivia 20.0 Watch Mumbai
1 102 Bag Fashion 1350.5 Mumbai NaN NaN NaN NaN NaN
2 103 Shoes Fashion 2999.0 Chennai 5.0 Dominic 30.0 Shoes Chennai
3 104 Smartphone Electronics 14999.0 Kolkata 6.0 Tyler 65.0 Smartphone Delhi
4 105 Books Study 145.0 Delhi NaN NaN NaN NaN NaN
5 106 Oil Grocery 110.0 Chennai 3.0 Cory 15.0 Oil Bangalore
6 107 Laptop Electronics 79999.0 Bengalore 9.0 Jeremy 23.0 Laptop Mumbai
python – pandasでLEFT JOIN (Join key名が複数)
キーとなるカラム名が異なる場合は以下のようになります。これまでと同様のデータセットを使用し、pandasのmerge関数を使用して、実装していきます。
In [8]: pd.merge(df_product, df_customer, how='left', left_on='Product_name', right_on='Purchased_Product')
Out[8]:
Product_ID_x Product_name Category Price Seller_City id name age Product_ID_y Purchased_Product City
0 101 Watch Fashion 299.0 Delhi 1.0 Olivia 20.0 101.0 Watch Mumbai
1 102 Bag Fashion 1350.5 Mumbai NaN NaN NaN NaN NaN NaN
2 103 Shoes Fashion 2999.0 Chennai 5.0 Dominic 30.0 103.0 Shoes Chennai
3 104 Smartphone Electronics 14999.0 Kolkata 6.0 Tyler 65.0 104.0 Smartphone Delhi
4 105 Books Study 145.0 Delhi NaN NaN NaN NaN NaN NaN
5 106 Oil Grocery 110.0 Chennai 3.0 Cory 15.0 106.0 Oil Bangalore
6 107 Laptop Electronics 79999.0 Bengalore 9.0 Jeremy 23.0 107.0 Laptop Mumbai
キーが複数の場合は、以下のようにリスト形式で記載します。
In [9]: pd.merge(df_product, df_customer, how='left', left_on=['Product_ID','Seller_City'], right_on=['Product_ID','City'])
Out[9]:
Product_ID Product_name Category Price Seller_City id name age Purchased_Product City
0 101 Watch Fashion 299.0 Delhi NaN NaN NaN NaN NaN
1 102 Bag Fashion 1350.5 Mumbai NaN NaN NaN NaN NaN
2 103 Shoes Fashion 2999.0 Chennai 5.0 Dominic 30.0 Shoes Chennai
3 104 Smartphone Electronics 14999.0 Kolkata NaN NaN NaN NaN NaN
4 105 Books Study 145.0 Delhi NaN NaN NaN NaN NaN
5 106 Oil Grocery 110.0 Chennai NaN NaN NaN NaN NaN
6 107 Laptop Electronics 79999.0 Bengalore NaN NaN NaN NaN NaN
キーが一つの場合も私はリスト形式で記載していたり、Joinの形式(how)も明示的にこの記載方法を使用しているように思います。
4. pandasでSQLのRIGHT JOIN
実は、私は好みの問題でほぼRIGHT JOINは使用しませんが、最後にpandasを使用して、SQLのRIGHT JOINのように実装してみます。
python – pandasでRIGHT JOIN (Join key名が同じ)
In [10]: pd.merge(df_product, df_customer, how='right', on='Product_ID')
Out[10]:
Product_ID Product_name Category Price Seller_City id name age Purchased_Product City
0 101 Watch Fashion 299.0 Delhi 1 Olivia 20 Watch Mumbai
1 0 NaN NaN NaN NaN 2 Aditya 25 NA Delhi
2 106 Oil Grocery 110.0 Chennai 3 Cory 15 Oil Bangalore
3 0 NaN NaN NaN NaN 4 Isabell 10 NA Chennai
4 103 Shoes Fashion 2999.0 Chennai 5 Dominic 30 Shoes Chennai
5 104 Smartphone Electronics 14999.0 Kolkata 6 Tyler 65 Smartphone Delhi
6 0 NaN NaN NaN NaN 7 Samuel 35 NA Kolkata
7 0 NaN NaN NaN NaN 8 Daniel 18 NA Delhi
pandasでSQLのLEFT JOINのような実装とほぼ同様です。
In [11]: pd.merge(df_product, df_customer, how='right', left_on='Product_name', right_on='Purchased_Product')
Out[11]:
Product_ID_x Product_name Category Price Seller_City id name age Product_ID_y Purchased_Product City
0 101.0 Watch Fashion 299.0 Delhi 1 Olivia 20 101 Watch Mumbai
1 NaN NaN NaN NaN NaN 2 Aditya 25 0 NA Delhi
2 106.0 Oil Grocery 110.0 Chennai 3 Cory 15 106 Oil Bangalore
3 NaN NaN NaN NaN NaN 4 Isabell 10 0 NA Chennai
4 103.0 Shoes Fashion 2999.0 Chennai 5 Dominic 30 103 Shoes Chennai
5 104.0 Smartphone Electronics 14999.0 Kolkata 6 Tyler 65 104 Smartphone Delhi
6 NaN NaN NaN NaN NaN 7 Samuel 35 0 NA Kolkata
7 NaN NaN NaN NaN NaN 8 Daniel 18 0 NA Delhi
8 107.0 Laptop Electronics 79999.0 Bengalore 9 Jeremy 23 107 Laptop Mumbai
python – pandasでLEFT JOIN (Join key名が複数)
こちらも、pandasでSQLのLEFT JOINのような実装とほぼ同様で、howのみ変更しているだけです。
In [12]: pd.merge(df_product, df_customer, how='right', left_on=['Product_ID','Seller_City'], right_on=['Product_ID','City'])
Out[12]:
Product_ID Product_name Category Price Seller_City id name age Purchased_Product City
0 101 NaN NaN NaN NaN 1 Olivia 20 Watch Mumbai
1 0 NaN NaN NaN NaN 2 Aditya 25 NA Delhi
2 106 NaN NaN NaN NaN 3 Cory 15 Oil Bangalore
3 0 NaN NaN NaN NaN 4 Isabell 10 NA Chennai
4 103 Shoes Fashion 2999.0 Chennai 5 Dominic 30 Shoes Chennai
5 104 NaN NaN NaN NaN 6 Tyler 65 Smartphone Delhi
6 0 NaN NaN NaN NaN 7 Samuel 35 NA Kolkata
7 0 NaN NaN NaN NaN 8 Daniel 18 NA Delhi
8 107 NaN NaN NaN NaN 9 Jeremy 23 Laptop Mumbai
5. まとめ
ということで今回は、python codeの中でpandasでSQLのようにINNER/LEFT/RIGHT JOINした時のことを記録してみました。