python – pandasでSQLのようにINNER/LEFT/RIGHT JOIN

python – pandasでSQLのようにINNER/LEFT/RIGHT JOIN

業務を行なっているとdata lakeから直接データ取得して前処理しなければならない場合によく出くわします。そのため、大量データをpandasで処理する機会が非常に多いように思います。

そこで今回は、個人的に少し利用頻度の高いpython – pandasでSQLのようにINNER/LEFT/RIGHT JOINした時のことを記録しようと思います。

1. 前処理

JOINで使用する商品データ(product)と顧客データ(customer)のdata frameを作成します。まず、pandasをimportしてから、商品データを作成します。

In [1]: import pandas as pd

In [2]: df_product = pd.DataFrame({
   ...:     'Product_ID':[101,102,103,104,105,106,107],
   ...:     'Product_name':['Watch','Bag','Shoes','Smartphone','Books','Oil','Laptop'],
   ...:     'Category':['Fashion','Fashion','Fashion','Electronics','Study','Grocery','Electronics'],
   ...:     'Price':[299.0,1350.50,2999.0,14999.0,145.0,110.0,79999.0],
   ...:     'Seller_City':['Delhi','Mumbai','Chennai','Kolkata','Delhi','Chennai','Bengalore']
   ...: })
   ...: df_product

Out[2]:
   Product_ID Product_name     Category    Price Seller_City
0         101        Watch      Fashion    299.0       Delhi
1         102          Bag      Fashion   1350.5      Mumbai
2         103        Shoes      Fashion   2999.0     Chennai
3         104   Smartphone  Electronics  14999.0     Kolkata
4         105        Books        Study    145.0       Delhi
5         106          Oil      Grocery    110.0     Chennai
6         107       Laptop  Electronics  79999.0   Bengalore

同様に顧客データを作成します。

In [3]: df_customer = pd.DataFrame({
   ...:     'id':[1,2,3,4,5,6,7,8,9],
   ...:     'name':['Olivia','Aditya','Cory','Isabell','Dominic','Tyler','Samuel','Daniel','Jeremy'],
   ...:     'age':[20,25,15,10,30,65,35,18,23],
   ...:     'Product_ID':[101,0,106,0,103,104,0,0,107],
   ...:     'Purchased_Product':['Watch','NA','Oil','NA','Shoes','Smartphone','NA','NA','Laptop'],
   ...:     'City':['Mumbai','Delhi','Bangalore','Chennai','Chennai','Delhi','Kolkata','Delhi','Mumbai']
   ...: })
   ...: df_customer

Out[3]:
   id     name  age  Product_ID Purchased_Product       City
0   1   Olivia   20         101             Watch     Mumbai
1   2   Aditya   25           0                NA      Delhi
2   3     Cory   15         106               Oil  Bangalore
3   4  Isabell   10           0                NA    Chennai
4   5  Dominic   30         103             Shoes    Chennai
5   6    Tyler   65         104        Smartphone      Delhi
6   7   Samuel   35           0                NA    Kolkata
7   8   Daniel   18           0                NA      Delhi
8   9   Jeremy   23         107            Laptop     Mumbai

2. pandasでSQLのINNER JOINのように実装

まずは、INNER JOINを行なってみます。

SQLでのINNER JOIN

それぞれのData frameがproduct、customerというテーブルで存在する場合、SQLではこのようになるはずです。

# SQL
SELECT
    *
FROM
    product AS p
INNER JOIN
    customer AS c
ON
    p.Product_ID = c.Product_ID
;

# Out:
   Product_ID Product_name     Category    Price Seller_City  id     name  age Purchased_Product       City
0         101        Watch      Fashion    299.0       Delhi   1   Olivia   20             Watch     Mumbai
1         103        Shoes      Fashion   2999.0     Chennai   5  Dominic   30             Shoes    Chennai
2         104   Smartphone  Electronics  14999.0     Kolkata   6    Tyler   65        Smartphone      Delhi
3         106          Oil      Grocery    110.0     Chennai   3     Cory   15               Oil  Bangalore
4         107       Laptop  Electronics  79999.0   Bengalore   9   Jeremy   23            Laptop     Mumbai

python – pandasでINNER JOIN (Join key名が同じ)

それでは、本題のpythonコードでpandasを使用してSQLのINNER JOINのように実装していきます。

In [4]: pd.merge(df_product, df_customer, on='Product_ID')

Out[4]:
   Product_ID Product_name     Category    Price Seller_City  id     name  age Purchased_Product       City
0         101        Watch      Fashion    299.0       Delhi   1   Olivia   20             Watch     Mumbai
1         103        Shoes      Fashion   2999.0     Chennai   5  Dominic   30             Shoes    Chennai
2         104   Smartphone  Electronics  14999.0     Kolkata   6    Tyler   65        Smartphone      Delhi
3         106          Oil      Grocery    110.0     Chennai   3     Cory   15               Oil  Bangalore
4         107       Laptop  Electronics  79999.0   Bengalore   9   Jeremy   23            Laptop     Mumbai

ここでは、製品と顧客のData frameをProduct_IDカラムをキーにして、内部結合(INNER JOIN)してみました。しかし、キーが同一カラムでない場合が多いと思います。その場合はどうなるのかを次のSQLのLEFT JOINのようにpandasで実装してみます。

In [5]: pd.merge(df_product, df_customer, left_on='Product_name', right_on='Purchased_Product')

Out[5]:
   Product_ID_x Product_name     Category    Price Seller_City  id     name  age  Product_ID_y Purchased_Product       City
0           101        Watch      Fashion    299.0       Delhi   1   Olivia   20           101             Watch     Mumbai
1           103        Shoes      Fashion   2999.0     Chennai   5  Dominic   30           103             Shoes    Chennai
2           104   Smartphone  Electronics  14999.0     Kolkata   6    Tyler   65           104        Smartphone      Delhi
3           106          Oil      Grocery    110.0     Chennai   3     Cory   15           106               Oil  Bangalore
4           107       Laptop  Electronics  79999.0   Bengalore   9   Jeremy   23           107            Laptop     Mumbai

python – pandasでINNER JOIN (Join key名が複数)

次に、キーが複数の場合について実装していきます。

In [6]: pd.merge(df_product, df_customer, how='inner', left_on=['Product_ID','Seller_City'], right_on=['Product_ID','City'])

Out[6]:
   Product_ID Product_name Category   Price Seller_City  id     name  age Purchased_Product     City
0         103        Shoes  Fashion  2999.0     Chennai   5  Dominic   30             Shoes  Chennai

ここでは、製品と顧客のData frameを”Product_ID”カラムと”Seller_City”を複合キーにして、内部結合(INNER JOIN)してみました。SQLは以下のイメージとなるはずです。

# SQL
SELECT
    *
FROM
    product AS p
INNER JOIN
    customer AS c
ON
    p.Product_ID = c.Product_ID AND
    p.Seller_City = c.City
;

# Out:
   Product_ID Product_name Category   Price Seller_City  id     name  age Purchased_Product     City
0         103        Shoes  Fashion  2999.0     Chennai   5  Dominic   30             Shoes  Chennai

3. pandasでSQLのLEFT JOINのように実装

では、pandasを使用して、SQLのLEFT JOINのように実装する方法を見ていきます。

SQLでのLEFT JOIN

まずは、SQLで実装すると以下のような記載となるはずです。前回のINNER JOINの時と同様に、product, customerがテーブルとして存在することを仮定して実装します。

# SQL
SELECT
    *
FROM
    product AS p
LEFT JOIN
    customer AS c
ON
    p.Product_ID = c.Product_ID
;

# Out:
   Product_ID Product_name     Category    Price Seller_City   id     name   age Purchased_Product       City
0         101        Watch      Fashion    299.0       Delhi  1.0   Olivia  20.0             Watch     Mumbai
1         102          Bag      Fashion   1350.5      Mumbai  NaN      NaN   NaN               NaN        NaN
2         103        Shoes      Fashion   2999.0     Chennai  5.0  Dominic  30.0             Shoes    Chennai
3         104   Smartphone  Electronics  14999.0     Kolkata  6.0    Tyler  65.0        Smartphone      Delhi
4         105        Books        Study    145.0       Delhi  NaN      NaN   NaN               NaN        NaN
5         106          Oil      Grocery    110.0     Chennai  3.0     Cory  15.0               Oil  Bangalore
6         107       Laptop  Electronics  79999.0   Bengalore  9.0   Jeremy  23.0            Laptop     Mumbai

python – pandasでLEFT JOIN (Join key名が同じ)

それでは、pythonのコードに移ります。pandasを使用して、先ほどと同様のデータである製品、顧客のData frameを使用します。

In [7]: pd.merge(df_product, df_customer, how='left', on='Product_ID')

Out[7]:
   Product_ID Product_name     Category    Price Seller_City   id     name   age Purchased_Product       City
0         101        Watch      Fashion    299.0       Delhi  1.0   Olivia  20.0             Watch     Mumbai
1         102          Bag      Fashion   1350.5      Mumbai  NaN      NaN   NaN               NaN        NaN
2         103        Shoes      Fashion   2999.0     Chennai  5.0  Dominic  30.0             Shoes    Chennai
3         104   Smartphone  Electronics  14999.0     Kolkata  6.0    Tyler  65.0        Smartphone      Delhi
4         105        Books        Study    145.0       Delhi  NaN      NaN   NaN               NaN        NaN
5         106          Oil      Grocery    110.0     Chennai  3.0     Cory  15.0               Oil  Bangalore
6         107       Laptop  Electronics  79999.0   Bengalore  9.0   Jeremy  23.0            Laptop     Mumbai

python – pandasでLEFT JOIN (Join key名が複数)

キーとなるカラム名が異なる場合は以下のようになります。これまでと同様のデータセットを使用し、pandasのmerge関数を使用して、実装していきます。

In [8]: pd.merge(df_product, df_customer, how='left', left_on='Product_name', right_on='Purchased_Product')

Out[8]:
   Product_ID_x Product_name     Category    Price Seller_City   id     name   age  Product_ID_y Purchased_Product       City
0           101        Watch      Fashion    299.0       Delhi  1.0   Olivia  20.0         101.0             Watch     Mumbai
1           102          Bag      Fashion   1350.5      Mumbai  NaN      NaN   NaN           NaN               NaN        NaN
2           103        Shoes      Fashion   2999.0     Chennai  5.0  Dominic  30.0         103.0             Shoes    Chennai
3           104   Smartphone  Electronics  14999.0     Kolkata  6.0    Tyler  65.0         104.0        Smartphone      Delhi
4           105        Books        Study    145.0       Delhi  NaN      NaN   NaN           NaN               NaN        NaN
5           106          Oil      Grocery    110.0     Chennai  3.0     Cory  15.0         106.0               Oil  Bangalore
6           107       Laptop  Electronics  79999.0   Bengalore  9.0   Jeremy  23.0         107.0            Laptop     Mumbai

キーが複数の場合は、以下のようにリスト形式で記載します。

In [9]: pd.merge(df_product, df_customer, how='left', left_on=['Product_ID','Seller_City'], right_on=['Product_ID','City'])
Out[9]:
   Product_ID Product_name     Category    Price Seller_City   id     name   age Purchased_Product     City
0         101        Watch      Fashion    299.0       Delhi  NaN      NaN   NaN               NaN      NaN
1         102          Bag      Fashion   1350.5      Mumbai  NaN      NaN   NaN               NaN      NaN
2         103        Shoes      Fashion   2999.0     Chennai  5.0  Dominic  30.0             Shoes  Chennai
3         104   Smartphone  Electronics  14999.0     Kolkata  NaN      NaN   NaN               NaN      NaN
4         105        Books        Study    145.0       Delhi  NaN      NaN   NaN               NaN      NaN
5         106          Oil      Grocery    110.0     Chennai  NaN      NaN   NaN               NaN      NaN
6         107       Laptop  Electronics  79999.0   Bengalore  NaN      NaN   NaN               NaN      NaN

キーが一つの場合も私はリスト形式で記載していたり、Joinの形式(how)も明示的にこの記載方法を使用しているように思います。

4. pandasでSQLのRIGHT JOIN

実は、私は好みの問題でほぼRIGHT JOINは使用しませんが、最後にpandasを使用して、SQLのRIGHT JOINのように実装してみます。

python – pandasでRIGHT JOIN (Join key名が同じ)

In [10]: pd.merge(df_product, df_customer, how='right', on='Product_ID')

Out[10]:
   Product_ID Product_name     Category    Price Seller_City  id     name  age Purchased_Product       City
0         101        Watch      Fashion    299.0       Delhi   1   Olivia   20             Watch     Mumbai
1           0          NaN          NaN      NaN         NaN   2   Aditya   25                NA      Delhi
2         106          Oil      Grocery    110.0     Chennai   3     Cory   15               Oil  Bangalore
3           0          NaN          NaN      NaN         NaN   4  Isabell   10                NA    Chennai
4         103        Shoes      Fashion   2999.0     Chennai   5  Dominic   30             Shoes    Chennai
5         104   Smartphone  Electronics  14999.0     Kolkata   6    Tyler   65        Smartphone      Delhi
6           0          NaN          NaN      NaN         NaN   7   Samuel   35                NA    Kolkata
7           0          NaN          NaN      NaN         NaN   8   Daniel   18                NA      Delhi

pandasでSQLのLEFT JOINのような実装とほぼ同様です。

In [11]: pd.merge(df_product, df_customer, how='right', left_on='Product_name', right_on='Purchased_Product')

Out[11]:
   Product_ID_x Product_name     Category    Price Seller_City  id     name  age  Product_ID_y Purchased_Product       City
0         101.0        Watch      Fashion    299.0       Delhi   1   Olivia   20           101             Watch     Mumbai
1           NaN          NaN          NaN      NaN         NaN   2   Aditya   25             0                NA      Delhi
2         106.0          Oil      Grocery    110.0     Chennai   3     Cory   15           106               Oil  Bangalore
3           NaN          NaN          NaN      NaN         NaN   4  Isabell   10             0                NA    Chennai
4         103.0        Shoes      Fashion   2999.0     Chennai   5  Dominic   30           103             Shoes    Chennai
5         104.0   Smartphone  Electronics  14999.0     Kolkata   6    Tyler   65           104        Smartphone      Delhi
6           NaN          NaN          NaN      NaN         NaN   7   Samuel   35             0                NA    Kolkata
7           NaN          NaN          NaN      NaN         NaN   8   Daniel   18             0                NA      Delhi
8         107.0       Laptop  Electronics  79999.0   Bengalore   9   Jeremy   23           107            Laptop     Mumbai

python – pandasでLEFT JOIN (Join key名が複数)

こちらも、pandasでSQLのLEFT JOINのような実装とほぼ同様で、howのみ変更しているだけです。

In [12]: pd.merge(df_product, df_customer, how='right', left_on=['Product_ID','Seller_City'], right_on=['Product_ID','City'])

Out[12]:
   Product_ID Product_name Category   Price Seller_City  id     name  age Purchased_Product       City
0         101          NaN      NaN     NaN         NaN   1   Olivia   20             Watch     Mumbai
1           0          NaN      NaN     NaN         NaN   2   Aditya   25                NA      Delhi
2         106          NaN      NaN     NaN         NaN   3     Cory   15               Oil  Bangalore
3           0          NaN      NaN     NaN         NaN   4  Isabell   10                NA    Chennai
4         103        Shoes  Fashion  2999.0     Chennai   5  Dominic   30             Shoes    Chennai
5         104          NaN      NaN     NaN         NaN   6    Tyler   65        Smartphone      Delhi
6           0          NaN      NaN     NaN         NaN   7   Samuel   35                NA    Kolkata
7           0          NaN      NaN     NaN         NaN   8   Daniel   18                NA      Delhi
8         107          NaN      NaN     NaN         NaN   9   Jeremy   23            Laptop     Mumbai

5. まとめ

ということで今回は、python codeの中でpandasでSQLのようにINNER/LEFT/RIGHT JOINした時のことを記録してみました。

(Visited 126 times, 2 visits today)