Python – INNER/LEFT/RIGHT JOIN like SQL in pandas

Python – INNER/LEFT/RIGHT JOIN like SQL in pandas

When in the business, I often come across cases where data must be obtained directly from the data lake and preprocessed. Therefore, I think there are many opportunities to process large amounts of data with pandas.

So, this time, I would like to record an INNER/LEFT/RIGHT JOIN like SQL with python-pandas, which I personally use a little more frequently.

1. Preprocessing

Create a data frame for product data (product) and customer data (customer) used in JOIN. First, import pandas and then create product data.

In [1]: import pandas as pd

In [2]: df_product = pd.DataFrame({
   ...:     'Product_ID':[101,102,103,104,105,106,107],
   ...:     'Product_name':['Watch','Bag','Shoes','Smartphone','Books','Oil','Laptop'],
   ...:     'Category':['Fashion','Fashion','Fashion','Electronics','Study','Grocery','Electronics'],
   ...:     'Price':[299.0,1350.50,2999.0,14999.0,145.0,110.0,79999.0],
   ...:     'Seller_City':['Delhi','Mumbai','Chennai','Kolkata','Delhi','Chennai','Bengalore']
   ...: })
   ...: df_product

Out[2]:
   Product_ID Product_name     Category    Price Seller_City
0         101        Watch      Fashion    299.0       Delhi
1         102          Bag      Fashion   1350.5      Mumbai
2         103        Shoes      Fashion   2999.0     Chennai
3         104   Smartphone  Electronics  14999.0     Kolkata
4         105        Books        Study    145.0       Delhi
5         106          Oil      Grocery    110.0     Chennai
6         107       Laptop  Electronics  79999.0   Bengalore

Create customer data in the same way.

In [3]: df_customer = pd.DataFrame({
   ...:     'id':[1,2,3,4,5,6,7,8,9],
   ...:     'name':['Olivia','Aditya','Cory','Isabell','Dominic','Tyler','Samuel','Daniel','Jeremy'],
   ...:     'age':[20,25,15,10,30,65,35,18,23],
   ...:     'Product_ID':[101,0,106,0,103,104,0,0,107],
   ...:     'Purchased_Product':['Watch','NA','Oil','NA','Shoes','Smartphone','NA','NA','Laptop'],
   ...:     'City':['Mumbai','Delhi','Bangalore','Chennai','Chennai','Delhi','Kolkata','Delhi','Mumbai']
   ...: })
   ...: df_customer

Out[3]:
   id     name  age  Product_ID Purchased_Product       City
0   1   Olivia   20         101             Watch     Mumbai
1   2   Aditya   25           0                NA      Delhi
2   3     Cory   15         106               Oil  Bangalore
3   4  Isabell   10           0                NA    Chennai
4   5  Dominic   30         103             Shoes    Chennai
5   6    Tyler   65         104        Smartphone      Delhi
6   7   Samuel   35           0                NA    Kolkata
7   8   Daniel   18           0                NA      Delhi
8   9   Jeremy   23         107            Laptop     Mumbai

2. Implement like SQL INNER JOIN in pandas

First, let’s do an INNER JOIN.

INNER JOIN in SQL

If each data frame exists in a table called product and customer, the SQL should look like this.

# SQL
SELECT
    *
FROM
    product AS p
INNER JOIN
    customer AS c
ON
    p.Product_ID = c.Product_ID
;

# Out:
   Product_ID Product_name     Category    Price Seller_City  id     name  age Purchased_Product       City
0         101        Watch      Fashion    299.0       Delhi   1   Olivia   20             Watch     Mumbai
1         103        Shoes      Fashion   2999.0     Chennai   5  Dominic   30             Shoes    Chennai
2         104   Smartphone  Electronics  14999.0     Kolkata   6    Tyler   65        Smartphone      Delhi
3         106          Oil      Grocery    110.0     Chennai   3     Cory   15               Oil  Bangalore
4         107       Laptop  Electronics  79999.0   Bengalore   9   Jeremy   23            Laptop     Mumbai

Python – INNER JOIN with pandas (join key names are the same)

Then, I will use pandas in the python code of the main topic to implement it like SQL INNER JOIN.

In [4]: pd.merge(df_product, df_customer, on='Product_ID')

Out[4]:
   Product_ID Product_name     Category    Price Seller_City  id     name  age Purchased_Product       City
0         101        Watch      Fashion    299.0       Delhi   1   Olivia   20             Watch     Mumbai
1         103        Shoes      Fashion   2999.0     Chennai   5  Dominic   30             Shoes    Chennai
2         104   Smartphone  Electronics  14999.0     Kolkata   6    Tyler   65        Smartphone      Delhi
3         106          Oil      Grocery    110.0     Chennai   3     Cory   15               Oil  Bangalore
4         107       Laptop  Electronics  79999.0   Bengalore   9   Jeremy   23            Laptop     Mumbai

In this example, I used the Product_ID column as a key to perform an INNER JOIN on the product and customer data frames. However, I think that there are many cases where the keys are not in the same column. In that case, try implementing it with pandas like the LEFT JOIN in the following SQL.

In [5]: pd.merge(df_product, df_customer, left_on='Product_name', right_on='Purchased_Product')

Out[5]:
   Product_ID_x Product_name     Category    Price Seller_City  id     name  age  Product_ID_y Purchased_Product       City
0           101        Watch      Fashion    299.0       Delhi   1   Olivia   20           101             Watch     Mumbai
1           103        Shoes      Fashion   2999.0     Chennai   5  Dominic   30           103             Shoes    Chennai
2           104   Smartphone  Electronics  14999.0     Kolkata   6    Tyler   65           104        Smartphone      Delhi
3           106          Oil      Grocery    110.0     Chennai   3     Cory   15           106               Oil  Bangalore
4           107       Laptop  Electronics  79999.0   Bengalore   9   Jeremy   23           107            Laptop     Mumbai

Python – INNER JOIN with pandas (Multiple Join key names)

Next, I will implement the case where there are multiple keys.

In [6]: pd.merge(df_product, df_customer, how='inner', left_on=['Product_ID','Seller_City'], right_on=['Product_ID','City'])

Out[6]:
   Product_ID Product_name Category   Price Seller_City  id     name  age Purchased_Product     City
0         103        Shoes  Fashion  2999.0     Chennai   5  Dominic   30             Shoes  Chennai

In this example, I used the “Product_ID” column and “Seller_City” as a composite key for the product and customer data frames, and performed an inner join (INNER JOIN). The SQL should look like the image below.

# SQL
SELECT
    *
FROM
    product AS p
INNER JOIN
    customer AS c
ON
    p.Product_ID = c.Product_ID AND
    p.Seller_City = c.City
;

# Out:
   Product_ID Product_name Category   Price Seller_City  id     name  age Purchased_Product     City
0         103        Shoes  Fashion  2999.0     Chennai   5  Dominic   30             Shoes  Chennai

3. Implement like SQL LEFT JOIN in pandas

Now let’s see how to implement it like SQL’s LEFT JOIN using pandas.

LEFT JOIN in SQL

First of all, when implemented in SQL, it should be described as follows. As with the previous INNER JOIN, we will assume that product and customer exist as tables.

# SQL
SELECT
    *
FROM
    product AS p
LEFT JOIN
    customer AS c
ON
    p.Product_ID = c.Product_ID
;

# Out:
   Product_ID Product_name     Category    Price Seller_City   id     name   age Purchased_Product       City
0         101        Watch      Fashion    299.0       Delhi  1.0   Olivia  20.0             Watch     Mumbai
1         102          Bag      Fashion   1350.5      Mumbai  NaN      NaN   NaN               NaN        NaN
2         103        Shoes      Fashion   2999.0     Chennai  5.0  Dominic  30.0             Shoes    Chennai
3         104   Smartphone  Electronics  14999.0     Kolkata  6.0    Tyler  65.0        Smartphone      Delhi
4         105        Books        Study    145.0       Delhi  NaN      NaN   NaN               NaN        NaN
5         106          Oil      Grocery    110.0     Chennai  3.0     Cory  15.0               Oil  Bangalore
6         107       Laptop  Electronics  79999.0   Bengalore  9.0   Jeremy  23.0            Laptop     Mumbai

Python – LEFT JOIN with pandas (join key name is the same)

Now let’s move on to the python code. Using pandas, I will use the same data as before, the Product, Customer Data frame.

In [7]: pd.merge(df_product, df_customer, how='left', on='Product_ID')

Out[7]:
   Product_ID Product_name     Category    Price Seller_City   id     name   age Purchased_Product       City
0         101        Watch      Fashion    299.0       Delhi  1.0   Olivia  20.0             Watch     Mumbai
1         102          Bag      Fashion   1350.5      Mumbai  NaN      NaN   NaN               NaN        NaN
2         103        Shoes      Fashion   2999.0     Chennai  5.0  Dominic  30.0             Shoes    Chennai
3         104   Smartphone  Electronics  14999.0     Kolkata  6.0    Tyler  65.0        Smartphone      Delhi
4         105        Books        Study    145.0       Delhi  NaN      NaN   NaN               NaN        NaN
5         106          Oil      Grocery    110.0     Chennai  3.0     Cory  15.0               Oil  Bangalore
6         107       Laptop  Electronics  79999.0   Bengalore  9.0   Jeremy  23.0            Laptop     Mumbai

Python – LEFT JOIN with pandas (Multiple Join key names)

If the key column name is different, it will be as follows. We will use the same dataset as before and use the pandas merge function to implement it.

In [8]: pd.merge(df_product, df_customer, how='left', left_on='Product_name', right_on='Purchased_Product')

Out[8]:
   Product_ID_x Product_name     Category    Price Seller_City   id     name   age  Product_ID_y Purchased_Product       City
0           101        Watch      Fashion    299.0       Delhi  1.0   Olivia  20.0         101.0             Watch     Mumbai
1           102          Bag      Fashion   1350.5      Mumbai  NaN      NaN   NaN           NaN               NaN        NaN
2           103        Shoes      Fashion   2999.0     Chennai  5.0  Dominic  30.0         103.0             Shoes    Chennai
3           104   Smartphone  Electronics  14999.0     Kolkata  6.0    Tyler  65.0         104.0        Smartphone      Delhi
4           105        Books        Study    145.0       Delhi  NaN      NaN   NaN           NaN               NaN        NaN
5           106          Oil      Grocery    110.0     Chennai  3.0     Cory  15.0         106.0               Oil  Bangalore
6           107       Laptop  Electronics  79999.0   Bengalore  9.0   Jeremy  23.0         107.0            Laptop     Mumbai

If there are multiple keys, list them as follows.

In [9]: pd.merge(df_product, df_customer, how='left', left_on=['Product_ID','Seller_City'], right_on=['Product_ID','City'])
Out[9]:
   Product_ID Product_name     Category    Price Seller_City   id     name   age Purchased_Product     City
0         101        Watch      Fashion    299.0       Delhi  NaN      NaN   NaN               NaN      NaN
1         102          Bag      Fashion   1350.5      Mumbai  NaN      NaN   NaN               NaN      NaN
2         103        Shoes      Fashion   2999.0     Chennai  5.0  Dominic  30.0             Shoes  Chennai
3         104   Smartphone  Electronics  14999.0     Kolkata  NaN      NaN   NaN               NaN      NaN
4         105        Books        Study    145.0       Delhi  NaN      NaN   NaN               NaN      NaN
5         106          Oil      Grocery    110.0     Chennai  NaN      NaN   NaN               NaN      NaN
6         107       Laptop  Electronics  79999.0   Bengalore  NaN      NaN   NaN               NaN      NaN

Even when there is only one key, I use the list format, and I think that the Join format (how) also explicitly uses this description method.

4. SQL RIGHT JOIN in pandas

Actually, I almost never use RIGHT JOIN because of my preference, but I will use pandas to implement it like SQL RIGHT JOIN.

Python – RIGHT JOIN with pandas (join key name is the same)

In [10]: pd.merge(df_product, df_customer, how='right', on='Product_ID')

Out[10]:
   Product_ID Product_name     Category    Price Seller_City  id     name  age Purchased_Product       City
0         101        Watch      Fashion    299.0       Delhi   1   Olivia   20             Watch     Mumbai
1           0          NaN          NaN      NaN         NaN   2   Aditya   25                NA      Delhi
2         106          Oil      Grocery    110.0     Chennai   3     Cory   15               Oil  Bangalore
3           0          NaN          NaN      NaN         NaN   4  Isabell   10                NA    Chennai
4         103        Shoes      Fashion   2999.0     Chennai   5  Dominic   30             Shoes    Chennai
5         104   Smartphone  Electronics  14999.0     Kolkata   6    Tyler   65        Smartphone      Delhi
6           0          NaN          NaN      NaN         NaN   7   Samuel   35                NA    Kolkata
7           0          NaN          NaN      NaN         NaN   8   Daniel   18                NA      Delhi

It is almost similar to the LEFT JOIN like implementation of SQL in pandas.

In [11]: pd.merge(df_product, df_customer, how='right', left_on='Product_name', right_on='Purchased_Product')

Out[11]:
   Product_ID_x Product_name     Category    Price Seller_City  id     name  age  Product_ID_y Purchased_Product       City
0         101.0        Watch      Fashion    299.0       Delhi   1   Olivia   20           101             Watch     Mumbai
1           NaN          NaN          NaN      NaN         NaN   2   Aditya   25             0                NA      Delhi
2         106.0          Oil      Grocery    110.0     Chennai   3     Cory   15           106               Oil  Bangalore
3           NaN          NaN          NaN      NaN         NaN   4  Isabell   10             0                NA    Chennai
4         103.0        Shoes      Fashion   2999.0     Chennai   5  Dominic   30           103             Shoes    Chennai
5         104.0   Smartphone  Electronics  14999.0     Kolkata   6    Tyler   65           104        Smartphone      Delhi
6           NaN          NaN          NaN      NaN         NaN   7   Samuel   35             0                NA    Kolkata
7           NaN          NaN          NaN      NaN         NaN   8   Daniel   18             0                NA      Delhi
8         107.0       Laptop  Electronics  79999.0   Bengalore   9   Jeremy   23           107            Laptop     Mumbai

Python – LEFT JOIN with pandas (Multiple Join key names)

This is also almost the same as LEFT JOIN implementation of SQL in pandas, only how is changed.

In [12]: pd.merge(df_product, df_customer, how='right', left_on=['Product_ID','Seller_City'], right_on=['Product_ID','City'])

Out[12]:
   Product_ID Product_name Category   Price Seller_City  id     name  age Purchased_Product       City
0         101          NaN      NaN     NaN         NaN   1   Olivia   20             Watch     Mumbai
1           0          NaN      NaN     NaN         NaN   2   Aditya   25                NA      Delhi
2         106          NaN      NaN     NaN         NaN   3     Cory   15               Oil  Bangalore
3           0          NaN      NaN     NaN         NaN   4  Isabell   10                NA    Chennai
4         103        Shoes  Fashion  2999.0     Chennai   5  Dominic   30             Shoes    Chennai
5         104          NaN      NaN     NaN         NaN   6    Tyler   65        Smartphone      Delhi
6           0          NaN      NaN     NaN         NaN   7   Samuel   35                NA    Kolkata
7           0          NaN      NaN     NaN         NaN   8   Daniel   18                NA      Delhi
8         107          NaN      NaN     NaN         NaN   9   Jeremy   23            Laptop     Mumbai

5. Summary

So this time, I recorded the INNER/LEFT/RIGHT JOIN like SQL with pandas in the python code.

(Visited 9 times, 1 visits today)