The pivot table is a powerful tool to summarize and present data in Excel. However, in Python, Pandas has a function which allows you to quickly convert a DataFrame to a pivot table.
This function is very useful but sometimes it can be tricky to remember how to use it to get the data formatted in a way you need.
Read in the data
import pandas as pd
import numpy as np
Read in the sales funnel data into DataFrame
.
df = pd.read_excel("../sales-funnel.xlsx")
df.head()
Account | Name | Rep | Manager | Product | Quantity | Price | Status | |
---|---|---|---|---|---|---|---|---|
0 | 714466 | Trantow-Barrows | Craig Booker | Debra Henley | CPU | 1 | 30000 | presented |
1 | 714466 | Trantow-Barrows | Craig Booker | Debra Henley | Software | 1 | 10000 | presented |
2 | 714466 | Trantow-Barrows | Craig Booker | Debra Henley | Maintenance | 2 | 5000 | pending |
3 | 737550 | Fritsch, Russel and Anderson | Craig Booker | Debra Henley | CPU | 1 | 35000 | declined |
4 | 146832 | Kiehn-Spinka | Daniel Hilton | Debra Henley | CPU | 2 | 65000 | won |
For convenience sake, I define the status column as a category
and set the order we’d like to view. This isn’t strictly required but helps us keep the order we want as we work through analyzing the data.
df["Status"] = df["Status"].astype("category")
df["Status"].cat.set_categories(["won","pending","presented","declined"],inplace=True)
Pivot the data
As we build up the pivot table, I think it’s easiest to take it one step at a time. Add items one at a time and check each step to verify you are getting the results you expect.
The simplest pivot table must have a dataframe and an index.
pd.pivot_table(df,index=["Name"])
Account | Price | Quantity | |
---|---|---|---|
Name | |||
Barton LLC | 740150.0 | 35000.0 | 1.000000 |
Fritsch, Russel and Anderson | 737550.0 | 35000.0 | 1.000000 |
Herman LLC | 141962.0 | 65000.0 | 2.000000 |
Jerde-Hilpert | 412290.0 | 5000.0 | 2.000000 |
Kassulke, Ondricka and Metz | 307599.0 | 7000.0 | 3.000000 |
Keeling LLC | 688981.0 | 100000.0 | 5.000000 |
Kiehn-Spinka | 146832.0 | 65000.0 | 2.000000 |
Koepp Ltd | 729833.0 | 35000.0 | 2.000000 |
Kulas Inc | 218895.0 | 25000.0 | 1.500000 |
Purdy-Kunde | 163416.0 | 30000.0 | 1.000000 |
Stokes LLC | 239344.0 | 7500.0 | 1.000000 |
Trantow-Barrows | 714466.0 | 15000.0 | 1.333333 |
You can have multiple indexes as well. In fact, most of the pivot_table args can take multiple values via a list.
pd.pivot_table(df,index=["Name","Rep","Manager"])
Account | Price | Quantity | |||
---|---|---|---|---|---|
Name | Rep | Manager | |||
Barton LLC | John Smith | Debra Henley | 740150.0 | 35000.0 | 1.000000 |
Fritsch, Russel and Anderson | Craig Booker | Debra Henley | 737550.0 | 35000.0 | 1.000000 |
Herman LLC | Cedric Moss | Fred Anderson | 141962.0 | 65000.0 | 2.000000 |
Jerde-Hilpert | John Smith | Debra Henley | 412290.0 | 5000.0 | 2.000000 |
Kassulke, Ondricka and Metz | Wendy Yule | Fred Anderson | 307599.0 | 7000.0 | 3.000000 |
Keeling LLC | Wendy Yule | Fred Anderson | 688981.0 | 100000.0 | 5.000000 |
Kiehn-Spinka | Daniel Hilton | Debra Henley | 146832.0 | 65000.0 | 2.000000 |
Koepp Ltd | Wendy Yule | Fred Anderson | 729833.0 | 35000.0 | 2.000000 |
Kulas Inc | Daniel Hilton | Debra Henley | 218895.0 | 25000.0 | 1.500000 |
Purdy-Kunde | Cedric Moss | Fred Anderson | 163416.0 | 30000.0 | 1.000000 |
Stokes LLC | Cedric Moss | Fred Anderson | 239344.0 | 7500.0 | 1.000000 |
Trantow-Barrows | Craig Booker | Debra Henley | 714466.0 | 15000.0 | 1.333333 |
This is interesting but not particularly useful. What we probably want to do is look at this by Manager and Director. It’s easy enough to do by changing the index.
pd.pivot_table(df,index=["Manager","Rep"])
Account | Price | Quantity | ||
---|---|---|---|---|
Manager | Rep | |||
Debra Henley | Craig Booker | 720237.0 | 20000.000000 | 1.250000 |
Daniel Hilton | 194874.0 | 38333.333333 | 1.666667 | |
John Smith | 576220.0 | 20000.000000 | 1.500000 | |
Fred Anderson | Cedric Moss | 196016.5 | 27500.000000 | 1.250000 |
Wendy Yule | 614061.5 | 44250.000000 | 3.000000 |
Now we start to get a glimpse of what a pivot table can do for us.
For this purpose, the Account and Quantity
columns aren’t really useful. Let’s remove it by explicitly defining the columns we care about using the values field.
pd.pivot_table(df,index=["Manager","Rep"],values=["Price"])
Price | ||
---|---|---|
Manager | Rep | |
Debra Henley | Craig Booker | 20000 |
Daniel Hilton | 38333 | |
John Smith | 20000 | |
Fred Anderson | Cedric Moss | 27500 |
Wendy Yule | 44250 |
The Price
column automatically averages the data but we can do a count or a sum. Adding them is simple using aggfunc
.
pd.pivot_table(df,index=["Manager","Rep"],values=["Price"],aggfunc=np.sum)
Price | ||
---|---|---|
Manager | Rep | |
Debra Henley | Craig Booker | 80000 |
Daniel Hilton | 115000 | |
John Smith | 40000 | |
Fred Anderson | Cedric Moss | 110000 |
Wendy Yule | 177000 |
aggfunc
can take a list of functions. Let’s try a mean
using the numpy functions and len
to get a count.
pd.pivot_table(df,index=["Manager","Rep"],values=["Price"],aggfunc=[np.mean,len])
mean | len | ||
---|---|---|---|
Price | Price | ||
Manager | Rep | ||
Debra Henley | Craig Booker | 20000 | 4 |
Daniel Hilton | 38333 | 3 | |
John Smith | 20000 | 2 | |
Fred Anderson | Cedric Moss | 27500 | 4 |
Wendy Yule | 44250 | 4 |
If we want to see sales broken down by the Products
, the columns variable
allows us to define one or more columns.
pd.pivot_table(df,index=["Manager","Rep"],values=["Price"],
columns=["Product"],aggfunc=[np.sum])
sum | |||||
---|---|---|---|---|---|
Price | |||||
Product | CPU | Maintenance | Monitor | Software | |
Manager | Rep | ||||
Debra Henley | Craig Booker | 65000.0 | 5000.0 | NaN | 10000.0 |
Daniel Hilton | 105000.0 | NaN | NaN | 10000.0 | |
John Smith | 35000.0 | 5000.0 | NaN | NaN | |
Fred Anderson | Cedric Moss | 95000.0 | 5000.0 | NaN | 10000.0 |
Wendy Yule | 165000.0 | 7000.0 | 5000.0 | NaN |
The NaN’s are a bit distracting. If we want to remove them, we could use fill_value
to set them to 0
.
pd.pivot_table(df,index=["Manager","Rep"],values=["Price"],
columns=["Product"],aggfunc=[np.sum],fill_value=0)
sum | |||||
---|---|---|---|---|---|
Price | |||||
Product | CPU | Maintenance | Monitor | Software | |
Manager | Rep | ||||
Debra Henley | Craig Booker | 65000 | 5000 | 0 | 10000 |
Daniel Hilton | 105000 | 0 | 0 | 10000 | |
John Smith | 35000 | 5000 | 0 | 0 | |
Fred Anderson | Cedric Moss | 95000 | 5000 | 0 | 10000 |
Wendy Yule | 165000 | 7000 | 5000 | 0 |
It would be useful to add the quantity as well. Add Quantity
to the values list.
pd.pivot_table(df,index=["Manager","Rep"],values=["Price","Quantity"],
columns=["Product"],aggfunc=[np.sum],fill_value=0)
sum | |||||||||
---|---|---|---|---|---|---|---|---|---|
Price | Quantity | ||||||||
Product | CPU | Maintenance | Monitor | Software | CPU | Maintenance | Monitor | Software | |
Manager | Rep | ||||||||
Debra Henley | Craig Booker | 65000 | 5000 | 0 | 10000 | 2 | 2 | 0 | 1 |
Daniel Hilton | 105000 | 0 | 0 | 10000 | 4 | 0 | 0 | 1 | |
John Smith | 35000 | 5000 | 0 | 0 | 1 | 2 | 0 | 0 | |
Fred Anderson | Cedric Moss | 95000 | 5000 | 0 | 10000 | 3 | 1 | 0 | 1 |
Wendy Yule | 165000 | 7000 | 5000 | 0 | 7 | 3 | 2 | 0 |
What’s interesting is that you can move items to the index to get a different visual representation. We can add the Products
to the index.
pd.pivot_table(df,index=["Manager","Rep","Product"],
values=["Price","Quantity"],aggfunc=[np.sum],fill_value=0)
sum | ||||
---|---|---|---|---|
Price | Quantity | |||
Manager | Rep | Product | ||
Debra Henley | Craig Booker | CPU | 65000 | 2 |
Maintenance | 5000 | 2 | ||
Software | 10000 | 1 | ||
Daniel Hilton | CPU | 105000 | 4 | |
Software | 10000 | 1 | ||
John Smith | CPU | 35000 | 1 | |
Maintenance | 5000 | 2 | ||
Fred Anderson | Cedric Moss | CPU | 95000 | 3 |
Maintenance | 5000 | 1 | ||
Software | 10000 | 1 | ||
Wendy Yule | CPU | 165000 | 7 | |
Maintenance | 7000 | 3 | ||
Monitor | 5000 | 2 |
For this data set, this representation makes more sense. Now, what if I want to see some totals? margins=True
does that for us.
pd.pivot_table(df,index=["Manager","Rep","Product"],
values=["Price","Quantity"],
aggfunc=[np.sum,np.mean],fill_value=0,margins=True)
sum | mean | |||||
---|---|---|---|---|---|---|
Price | Quantity | Price | Quantity | |||
Manager | Rep | Product | ||||
Debra Henley | Craig Booker | CPU | 65000.0 | 2.0 | 32500.000000 | 1.000000 |
Maintenance | 5000.0 | 2.0 | 5000.000000 | 2.000000 | ||
Software | 10000.0 | 1.0 | 10000.000000 | 1.000000 | ||
Daniel Hilton | CPU | 105000.0 | 4.0 | 52500.000000 | 2.000000 | |
Software | 10000.0 | 1.0 | 10000.000000 | 1.000000 | ||
John Smith | CPU | 35000.0 | 1.0 | 35000.000000 | 1.000000 | |
Maintenance | 5000.0 | 2.0 | 5000.000000 | 2.000000 | ||
Fred Anderson | Cedric Moss | CPU | 95000.0 | 3.0 | 47500.000000 | 1.500000 |
Maintenance | 5000.0 | 1.0 | 5000.000000 | 1.000000 | ||
Software | 10000.0 | 1.0 | 10000.000000 | 1.000000 | ||
Wendy Yule | CPU | 165000.0 | 7.0 | 82500.000000 | 3.500000 | |
Maintenance | 7000.0 | 3.0 | 7000.000000 | 3.000000 | ||
Monitor | 5000.0 | 2.0 | 5000.000000 | 2.000000 | ||
All | 522000.0 | 30.0 | 30705.882353 | 1.764706 |
Let’s move the analysis up a level and look at our pipeline at the manager level. Notice how the Status
is ordered based on our earlier category definition.
pd.pivot_table(df,index=["Manager","Status"],values=["Price"],
aggfunc=[np.sum],fill_value=0,margins=True)
sum | ||
---|---|---|
Price | ||
Manager | Status | |
Debra Henley | won | 65000.0 |
pending | 50000.0 | |
presented | 50000.0 | |
declined | 70000.0 | |
Fred Anderson | won | 172000.0 |
pending | 5000.0 | |
presented | 45000.0 | |
declined | 65000.0 | |
All | 522000.0 |
A really handy feature is the ability to pass a dictionary
to the aggfunc
so you can perform different functions on each of the values you select.
pd.pivot_table(df,index=["Manager","Status"],columns=["Product"],values=["Quantity","Price"],
aggfunc={"Quantity":len,"Price":np.sum},fill_value=0)
Price | Quantity | ||||||||
---|---|---|---|---|---|---|---|---|---|
Product | CPU | Maintenance | Monitor | Software | CPU | Maintenance | Monitor | Software | |
Manager | Status | ||||||||
Debra Henley | won | 65000 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
pending | 40000 | 10000 | 0 | 0 | 1 | 2 | 0 | 0 | |
presented | 30000 | 0 | 0 | 20000 | 1 | 0 | 0 | 2 | |
declined | 70000 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | |
Fred Anderson | won | 165000 | 7000 | 0 | 0 | 2 | 1 | 0 | 0 |
pending | 0 | 5000 | 0 | 0 | 0 | 1 | 0 | 0 | |
presented | 30000 | 0 | 5000 | 10000 | 1 | 0 | 1 | 1 | |
declined | 65000 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
You can provide a list of aggfunctions
to apply to each value too:
table = pd.pivot_table(df,index=["Manager","Status"],columns=["Product"],values=["Quantity","Price"],
aggfunc={"Quantity":len,"Price":[np.sum,np.mean]},fill_value=0)
table
Price | Quantity | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
mean | sum | len | |||||||||||
Product | CPU | Maintenance | Monitor | Software | CPU | Maintenance | Monitor | Software | CPU | Maintenance | Monitor | Software | |
Manager | Status | ||||||||||||
Debra Henley | won | 65000 | 0 | 0 | 0 | 65000 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
pending | 40000 | 5000 | 0 | 0 | 40000 | 10000 | 0 | 0 | 1 | 2 | 0 | 0 | |
presented | 30000 | 0 | 0 | 10000 | 30000 | 0 | 0 | 20000 | 1 | 0 | 0 | 2 | |
declined | 35000 | 0 | 0 | 0 | 70000 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | |
Fred Anderson | won | 82500 | 7000 | 0 | 0 | 165000 | 7000 | 0 | 0 | 2 | 1 | 0 | 0 |
pending | 0 | 5000 | 0 | 0 | 0 | 5000 | 0 | 0 | 0 | 1 | 0 | 0 | |
presented | 30000 | 0 | 5000 | 10000 | 30000 | 0 | 5000 | 10000 | 1 | 0 | 1 | 1 | |
declined | 65000 | 0 | 0 | 0 | 65000 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
It can look daunting to try to pull this all together at once but as soon as you start playing with the data and slowly add the items, you can get a clear understanding for how it works.
Pivot Table Filtering
Once you have generated your data, it is in a DataFrame
so you can filter on it using your normal DataFrame functions.
table.query('Manager == ["Debra Henley"]')
Price | Quantity | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
mean | sum | len | |||||||||||
Product | CPU | Maintenance | Monitor | Software | CPU | Maintenance | Monitor | Software | CPU | Maintenance | Monitor | Software | |
Manager | Status | ||||||||||||
Debra Henley | won | 65000 | 0 | 0 | 0 | 65000 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
pending | 40000 | 5000 | 0 | 0 | 40000 | 10000 | 0 | 0 | 1 | 2 | 0 | 0 | |
presented | 30000 | 0 | 0 | 10000 | 30000 | 0 | 0 | 20000 | 1 | 0 | 0 | 2 | |
declined | 35000 | 0 | 0 | 0 | 70000 | 0 | 0 | 0 | 2 | 0 | 0 | 0 |
table.query('Status == ["pending","won"]')
Price | Quantity | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
mean | sum | len | |||||||||||
Product | CPU | Maintenance | Monitor | Software | CPU | Maintenance | Monitor | Software | CPU | Maintenance | Monitor | Software | |
Manager | Status | ||||||||||||
Debra Henley | won | 65000 | 0 | 0 | 0 | 65000 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
pending | 40000 | 5000 | 0 | 0 | 40000 | 10000 | 0 | 0 | 1 | 2 | 0 | 0 | |
Fred Anderson | won | 82500 | 7000 | 0 | 0 | 165000 | 7000 | 0 | 0 | 2 | 1 | 0 | 0 |
pending | 0 | 5000 | 0 | 0 | 0 | 5000 | 0 | 0 | 0 | 1 | 0 | 0 |
This is a brief walk-through on how to use pivot tables on your data sets.
comments powered by Disqus