Python – List string with pandas groupby

In business settings, you often have to pull data directly from a data lake, so there are many opportunities to preprocess large amounts of data with pandas.

This time, although it is a less common use case, I would like to record some recent Python code that uses pandas’ groupby syntax to collect the values of each group into a list or a string.

Please refer to the following article for basic groupby in pandas.

1. Preprocessing

Let’s create a dataset.

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({'a':['A','A','B','B','B','C'], 'b':[1,2,5,5,4,6],'c':[3,3,3,4,4,4]})
   ...: df
Out[2]:
   a  b  c
0  A  1  3
1  A  2  3
2  B  5  3
3  B  5  4
4  B  4  4
5  C  6  4

2. Listing a single column with pandas groupby

Using pandas, group by column “a” and collect the values of column “b” in each group into a list.

In [3]: df.groupby('a')['b'].apply(list)
Out[3]:
a
A       [1, 2]
B    [5, 5, 4]
C          [6]
Name: b, dtype: object

In [4]: df.groupby('a')['b'].apply(list).reset_index()
Out[4]:
   a          b
0  A     [1, 2]
1  B  [5, 5, 4]
2  C        [6]

Adding apply and list to the ordinary groupby syntax is all it takes. As an alternative, I will also note how to do the same thing with agg and a lambda function.

In [3]: df.groupby('a')['b'].agg(lambda x: list(x))
Out[3]:
a
A       [1, 2]
B    [5, 5, 4]
C          [6]
Name: b, dtype: object

In [4]: df.groupby('a')['b'].agg(lambda x: list(x)).reset_index()
Out[4]:
   a          b
0  A     [1, 2]
1  B  [5, 5, 4]
2  C        [6]

The agg function with a lambda gives the same result. I use this method a lot.
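Since list is itself a callable, the lambda wrapper is not strictly required. The following is a minimal sketch, assuming a reasonably recent pandas version, that passes list directly to agg and checks that it matches the apply result. The variable names by_apply and by_agg are only illustrative.

```python
import pandas as pd

# Same dataset as in the preprocessing step
df = pd.DataFrame({
    'a': ['A', 'A', 'B', 'B', 'B', 'C'],
    'b': [1, 2, 5, 5, 4, 6],
    'c': [3, 3, 3, 4, 4, 4],
})

# apply(list) and agg(list) both collect each group's values into a list
by_apply = df.groupby('a')['b'].apply(list)
by_agg = df.groupby('a')['b'].agg(list)

print(by_agg)
# Compare the collected lists group by group
print(by_apply.tolist() == by_agg.tolist())
```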

Next, I had an occasion where I needed a concatenated string rather than a list, so I will note that pattern as well.

Cast column “b” to string and join the values in each group with commas.

In [5]: df['b'] = df['b'].astype(str)

In [6]: df.groupby('a')['b'].apply(', '.join).reset_index()
Out[6]:
   a        b
0  A     1, 2
1  B  5, 5, 4
2  C        6
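If you would rather not overwrite column “b” in the original DataFrame, the cast can be done inside the aggregation instead. This is a minimal sketch of that variation, not the method used above; the name joined is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'a': ['A', 'A', 'B', 'B', 'B', 'C'],
    'b': [1, 2, 5, 5, 4, 6],
    'c': [3, 3, 3, 4, 4, 4],
})

# Cast each group's values to str only while joining,
# so df['b'] itself stays numeric
joined = (
    df.groupby('a')['b']
      .agg(lambda s: ', '.join(s.astype(str)))
      .reset_index()
)

print(joined)
print(df['b'].dtype)  # still an integer dtype
```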

3. Listing multiple columns with pandas groupby

Now, let’s collect multiple columns into lists, with one record per group.

In [7]: df.groupby('a').agg(lambda x: list(x))
Out[7]:
           b          c
a
A     [1, 2]     [3, 3]
B  [5, 5, 4]  [3, 4, 4]
C        [6]        [4]

In [8]: df.groupby('a').agg(lambda x: list(x)).reset_index()
Out[8]:
   a          b          c
0  A     [1, 2]     [3, 3]
1  B  [5, 5, 4]  [3, 4, 4]
2  C        [6]        [4]
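When each column needs a different treatment, pandas’ named aggregation (keyword arguments of the form new_name=(column, function)) may be convenient. The sketch below assumes pandas 0.25 or later; the output column names b_list and c_joined are hypothetical names chosen for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'a': ['A', 'A', 'B', 'B', 'B', 'C'],
    'b': [1, 2, 5, 5, 4, 6],
    'c': [3, 3, 3, 4, 4, 4],
})

# Named aggregation: collect "b" into a list and
# join "c" into a comma-separated string in one call
result = df.groupby('a').agg(
    b_list=('b', list),
    c_joined=('c', lambda s: ', '.join(s.astype(str))),
).reset_index()

print(result)
```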

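Finally, let’s also do string concatenation with commas. Cast column “c” to string in advance, then join each group’s values into a comma-separated string with groupby. Here the values are also sorted before joining; because they are strings at this point, the sort order is lexicographic rather than numeric.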

In [9]: df['c'] = df['c'].astype(str)

In [10]: df.groupby('a').agg(lambda x: ', '.join(sorted(x))).reset_index()
Out[10]:
   a        b        c
0  A     1, 2     3, 3
1  B  4, 5, 5  3, 4, 4
2  C        6        4
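One thing to watch when sorting after the cast: strings sort character by character, so '10' comes before '2'. The dataset above only has single-digit values, so the difference does not show up there. The sketch below uses hypothetical data with a two-digit value to compare sorting as text with sorting as numbers; the names as_text and as_numbers are only illustrative.

```python
import pandas as pd

# Hypothetical data, chosen so that a two-digit value appears
df = pd.DataFrame({'a': ['A', 'A', 'A', 'B'], 'b': [10, 9, 2, 7]})

# Sort after casting to str: lexicographic order
as_text = df.groupby('a')['b'].agg(
    lambda s: ', '.join(sorted(s.astype(str)))
)

# Sort the numbers first, then convert each value to str
as_numbers = df.groupby('a')['b'].agg(
    lambda s: ', '.join(str(v) for v in sorted(s))
)

print(as_text['A'])     # 10, 2, 9
print(as_numbers['A'])  # 2, 9, 10
```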

4. Summary

This time, I made a note of how to collect strings into lists, or join them into comma-separated strings, with the pandas groupby syntax in Python. Combining groupby with apply or agg makes this very easy.
