Python – INNER/LEFT/RIGHT JOIN like SQL in pandas
When in the business, I often come across cases where data must be obtained directly from the data lake and preprocessed. Therefore, I think there are many opportunities to process large amounts of data with pandas.
So, this time, I would like to record an INNER/LEFT/RIGHT JOIN like SQL with python-pandas, which I personally use a little more frequently.
1. Preprocessing
Create a data frame for product data (product) and customer data (customer) used in JOIN. First, import pandas and then create product data.
In [1]: import pandas as pd
In [2]: df_product = pd.DataFrame({
...: 'Product_ID':[101,102,103,104,105,106,107],
...: 'Product_name':['Watch','Bag','Shoes','Smartphone','Books','Oil','Laptop'],
...: 'Category':['Fashion','Fashion','Fashion','Electronics','Study','Grocery','Electronics'],
...: 'Price':[299.0,1350.50,2999.0,14999.0,145.0,110.0,79999.0],
...: 'Seller_City':['Delhi','Mumbai','Chennai','Kolkata','Delhi','Chennai','Bengalore']
...: })
...: df_product
Out[2]:
Product_ID Product_name Category Price Seller_City
0 101 Watch Fashion 299.0 Delhi
1 102 Bag Fashion 1350.5 Mumbai
2 103 Shoes Fashion 2999.0 Chennai
3 104 Smartphone Electronics 14999.0 Kolkata
4 105 Books Study 145.0 Delhi
5 106 Oil Grocery 110.0 Chennai
6 107 Laptop Electronics 79999.0 Bengalore
Create customer data in the same way.
In [3]: df_customer = pd.DataFrame({
...: 'id':[1,2,3,4,5,6,7,8,9],
...: 'name':['Olivia','Aditya','Cory','Isabell','Dominic','Tyler','Samuel','Daniel','Jeremy'],
...: 'age':[20,25,15,10,30,65,35,18,23],
...: 'Product_ID':[101,0,106,0,103,104,0,0,107],
...: 'Purchased_Product':['Watch','NA','Oil','NA','Shoes','Smartphone','NA','NA','Laptop'],
...: 'City':['Mumbai','Delhi','Bangalore','Chennai','Chennai','Delhi','Kolkata','Delhi','Mumbai']
...: })
...: df_customer
Out[3]:
id name age Product_ID Purchased_Product City
0 1 Olivia 20 101 Watch Mumbai
1 2 Aditya 25 0 NA Delhi
2 3 Cory 15 106 Oil Bangalore
3 4 Isabell 10 0 NA Chennai
4 5 Dominic 30 103 Shoes Chennai
5 6 Tyler 65 104 Smartphone Delhi
6 7 Samuel 35 0 NA Kolkata
7 8 Daniel 18 0 NA Delhi
8 9 Jeremy 23 107 Laptop Mumbai
2. Implement like SQL INNER JOIN in pandas
First, let’s do an INNER JOIN.
INNER JOIN in SQL
If each data frame exists in a table called product and customer, the SQL should look like this.
# SQL
SELECT
*
FROM
product AS p
INNER JOIN
customer AS c
ON
p.Product_ID = c.Product_ID
;
# Out:
Product_ID Product_name Category Price Seller_City id name age Purchased_Product City
0 101 Watch Fashion 299.0 Delhi 1 Olivia 20 Watch Mumbai
1 103 Shoes Fashion 2999.0 Chennai 5 Dominic 30 Shoes Chennai
2 104 Smartphone Electronics 14999.0 Kolkata 6 Tyler 65 Smartphone Delhi
3 106 Oil Grocery 110.0 Chennai 3 Cory 15 Oil Bangalore
4 107 Laptop Electronics 79999.0 Bengalore 9 Jeremy 23 Laptop Mumbai
Python – INNER JOIN with pandas (join key names are the same)
Then, I will use pandas in the python code of the main topic to implement it like SQL INNER JOIN.
In [4]: pd.merge(df_product, df_customer, on='Product_ID')
Out[4]:
Product_ID Product_name Category Price Seller_City id name age Purchased_Product City
0 101 Watch Fashion 299.0 Delhi 1 Olivia 20 Watch Mumbai
1 103 Shoes Fashion 2999.0 Chennai 5 Dominic 30 Shoes Chennai
2 104 Smartphone Electronics 14999.0 Kolkata 6 Tyler 65 Smartphone Delhi
3 106 Oil Grocery 110.0 Chennai 3 Cory 15 Oil Bangalore
4 107 Laptop Electronics 79999.0 Bengalore 9 Jeremy 23 Laptop Mumbai
In this example, I used the Product_ID column as a key to perform an INNER JOIN on the product and customer data frames. However, I think that there are many cases where the keys are not in the same column. In that case, try implementing it with pandas like the LEFT JOIN in the following SQL.
In [5]: pd.merge(df_product, df_customer, left_on='Product_name', right_on='Purchased_Product')
Out[5]:
Product_ID_x Product_name Category Price Seller_City id name age Product_ID_y Purchased_Product City
0 101 Watch Fashion 299.0 Delhi 1 Olivia 20 101 Watch Mumbai
1 103 Shoes Fashion 2999.0 Chennai 5 Dominic 30 103 Shoes Chennai
2 104 Smartphone Electronics 14999.0 Kolkata 6 Tyler 65 104 Smartphone Delhi
3 106 Oil Grocery 110.0 Chennai 3 Cory 15 106 Oil Bangalore
4 107 Laptop Electronics 79999.0 Bengalore 9 Jeremy 23 107 Laptop Mumbai
Python – INNER JOIN with pandas (Multiple Join key names)
Next, I will implement the case where there are multiple keys.
In [6]: pd.merge(df_product, df_customer, how='inner', left_on=['Product_ID','Seller_City'], right_on=['Product_ID','City'])
Out[6]:
Product_ID Product_name Category Price Seller_City id name age Purchased_Product City
0 103 Shoes Fashion 2999.0 Chennai 5 Dominic 30 Shoes Chennai
In this example, I used the “Product_ID” column and “Seller_City” as a composite key for the product and customer data frames, and performed an inner join (INNER JOIN). The SQL should look like the image below.
# SQL
SELECT
*
FROM
product AS p
INNER JOIN
customer AS c
ON
p.Product_ID = c.Product_ID AND
p.Seller_City = c.City
;
# Out:
Product_ID Product_name Category Price Seller_City id name age Purchased_Product City
0 103 Shoes Fashion 2999.0 Chennai 5 Dominic 30 Shoes Chennai
3. Implement like SQL LEFT JOIN in pandas
Now let’s see how to implement it like SQL’s LEFT JOIN using pandas.
LEFT JOIN in SQL
First of all, when implemented in SQL, it should be described as follows. As with the previous INNER JOIN, we will assume that product and customer exist as tables.
# SQL
SELECT
*
FROM
product AS p
LEFT JOIN
customer AS c
ON
p.Product_ID = c.Product_ID
;
# Out:
Product_ID Product_name Category Price Seller_City id name age Purchased_Product City
0 101 Watch Fashion 299.0 Delhi 1.0 Olivia 20.0 Watch Mumbai
1 102 Bag Fashion 1350.5 Mumbai NaN NaN NaN NaN NaN
2 103 Shoes Fashion 2999.0 Chennai 5.0 Dominic 30.0 Shoes Chennai
3 104 Smartphone Electronics 14999.0 Kolkata 6.0 Tyler 65.0 Smartphone Delhi
4 105 Books Study 145.0 Delhi NaN NaN NaN NaN NaN
5 106 Oil Grocery 110.0 Chennai 3.0 Cory 15.0 Oil Bangalore
6 107 Laptop Electronics 79999.0 Bengalore 9.0 Jeremy 23.0 Laptop Mumbai
Python – LEFT JOIN with pandas (join key name is the same)
Now let’s move on to the python code. Using pandas, I will use the same data as before, the Product, Customer Data frame.
In [7]: pd.merge(df_product, df_customer, how='left', on='Product_ID')
Out[7]:
Product_ID Product_name Category Price Seller_City id name age Purchased_Product City
0 101 Watch Fashion 299.0 Delhi 1.0 Olivia 20.0 Watch Mumbai
1 102 Bag Fashion 1350.5 Mumbai NaN NaN NaN NaN NaN
2 103 Shoes Fashion 2999.0 Chennai 5.0 Dominic 30.0 Shoes Chennai
3 104 Smartphone Electronics 14999.0 Kolkata 6.0 Tyler 65.0 Smartphone Delhi
4 105 Books Study 145.0 Delhi NaN NaN NaN NaN NaN
5 106 Oil Grocery 110.0 Chennai 3.0 Cory 15.0 Oil Bangalore
6 107 Laptop Electronics 79999.0 Bengalore 9.0 Jeremy 23.0 Laptop Mumbai
Python – LEFT JOIN with pandas (Multiple Join key names)
If the key column name is different, it will be as follows. We will use the same dataset as before and use the pandas merge function to implement it.
In [8]: pd.merge(df_product, df_customer, how='left', left_on='Product_name', right_on='Purchased_Product')
Out[8]:
Product_ID_x Product_name Category Price Seller_City id name age Product_ID_y Purchased_Product City
0 101 Watch Fashion 299.0 Delhi 1.0 Olivia 20.0 101.0 Watch Mumbai
1 102 Bag Fashion 1350.5 Mumbai NaN NaN NaN NaN NaN NaN
2 103 Shoes Fashion 2999.0 Chennai 5.0 Dominic 30.0 103.0 Shoes Chennai
3 104 Smartphone Electronics 14999.0 Kolkata 6.0 Tyler 65.0 104.0 Smartphone Delhi
4 105 Books Study 145.0 Delhi NaN NaN NaN NaN NaN NaN
5 106 Oil Grocery 110.0 Chennai 3.0 Cory 15.0 106.0 Oil Bangalore
6 107 Laptop Electronics 79999.0 Bengalore 9.0 Jeremy 23.0 107.0 Laptop Mumbai
If there are multiple keys, list them as follows.
In [9]: pd.merge(df_product, df_customer, how='left', left_on=['Product_ID','Seller_City'], right_on=['Product_ID','City'])
Out[9]:
Product_ID Product_name Category Price Seller_City id name age Purchased_Product City
0 101 Watch Fashion 299.0 Delhi NaN NaN NaN NaN NaN
1 102 Bag Fashion 1350.5 Mumbai NaN NaN NaN NaN NaN
2 103 Shoes Fashion 2999.0 Chennai 5.0 Dominic 30.0 Shoes Chennai
3 104 Smartphone Electronics 14999.0 Kolkata NaN NaN NaN NaN NaN
4 105 Books Study 145.0 Delhi NaN NaN NaN NaN NaN
5 106 Oil Grocery 110.0 Chennai NaN NaN NaN NaN NaN
6 107 Laptop Electronics 79999.0 Bengalore NaN NaN NaN NaN NaN
Even when there is only one key, I use the list format, and I think that the Join format (how) also explicitly uses this description method.
4. SQL RIGHT JOIN in pandas
Actually, I almost never use RIGHT JOIN because of my preference, but I will use pandas to implement it like SQL RIGHT JOIN.
Python – RIGHT JOIN with pandas (join key name is the same)
In [10]: pd.merge(df_product, df_customer, how='right', on='Product_ID')
Out[10]:
Product_ID Product_name Category Price Seller_City id name age Purchased_Product City
0 101 Watch Fashion 299.0 Delhi 1 Olivia 20 Watch Mumbai
1 0 NaN NaN NaN NaN 2 Aditya 25 NA Delhi
2 106 Oil Grocery 110.0 Chennai 3 Cory 15 Oil Bangalore
3 0 NaN NaN NaN NaN 4 Isabell 10 NA Chennai
4 103 Shoes Fashion 2999.0 Chennai 5 Dominic 30 Shoes Chennai
5 104 Smartphone Electronics 14999.0 Kolkata 6 Tyler 65 Smartphone Delhi
6 0 NaN NaN NaN NaN 7 Samuel 35 NA Kolkata
7 0 NaN NaN NaN NaN 8 Daniel 18 NA Delhi
It is almost similar to the LEFT JOIN like implementation of SQL in pandas.
In [11]: pd.merge(df_product, df_customer, how='right', left_on='Product_name', right_on='Purchased_Product')
Out[11]:
Product_ID_x Product_name Category Price Seller_City id name age Product_ID_y Purchased_Product City
0 101.0 Watch Fashion 299.0 Delhi 1 Olivia 20 101 Watch Mumbai
1 NaN NaN NaN NaN NaN 2 Aditya 25 0 NA Delhi
2 106.0 Oil Grocery 110.0 Chennai 3 Cory 15 106 Oil Bangalore
3 NaN NaN NaN NaN NaN 4 Isabell 10 0 NA Chennai
4 103.0 Shoes Fashion 2999.0 Chennai 5 Dominic 30 103 Shoes Chennai
5 104.0 Smartphone Electronics 14999.0 Kolkata 6 Tyler 65 104 Smartphone Delhi
6 NaN NaN NaN NaN NaN 7 Samuel 35 0 NA Kolkata
7 NaN NaN NaN NaN NaN 8 Daniel 18 0 NA Delhi
8 107.0 Laptop Electronics 79999.0 Bengalore 9 Jeremy 23 107 Laptop Mumbai
Python – LEFT JOIN with pandas (Multiple Join key names)
This is also almost the same as LEFT JOIN implementation of SQL in pandas, only how is changed.
In [12]: pd.merge(df_product, df_customer, how='right', left_on=['Product_ID','Seller_City'], right_on=['Product_ID','City'])
Out[12]:
Product_ID Product_name Category Price Seller_City id name age Purchased_Product City
0 101 NaN NaN NaN NaN 1 Olivia 20 Watch Mumbai
1 0 NaN NaN NaN NaN 2 Aditya 25 NA Delhi
2 106 NaN NaN NaN NaN 3 Cory 15 Oil Bangalore
3 0 NaN NaN NaN NaN 4 Isabell 10 NA Chennai
4 103 Shoes Fashion 2999.0 Chennai 5 Dominic 30 Shoes Chennai
5 104 NaN NaN NaN NaN 6 Tyler 65 Smartphone Delhi
6 0 NaN NaN NaN NaN 7 Samuel 35 NA Kolkata
7 0 NaN NaN NaN NaN 8 Daniel 18 NA Delhi
8 107 NaN NaN NaN NaN 9 Jeremy 23 Laptop Mumbai
5. Summary
So this time, I recorded the INNER/LEFT/RIGHT JOIN like SQL with pandas in the python code.