Aggregate dataframe

Home Page

In this article, we will use the aggregate function to answer questions on a dataframe. First, let’s create the dataframe we will use in this example. It will represent products sold by sales reps.

Transaction.Nb <- c(1, 2, 3, 4, 5, 6)
Sales <- c("John", "John", "Jane", "Jane", "John", "Jane")
Product <- c("Prod A", "Prod B", "Prod A", "Prod C", "Prod A", "Prod A")
Quantity <- c(10, 3, 5, 7, 10, 10)
Revenue <- c(100, 3000, 50, 3500, 120, 110)

df <- data.frame(Transaction.Nb, Sales, Product, Quantity, Revenue)
df

##   Transaction.Nb Sales Product Quantity Revenue
## 1              1  John  Prod A       10     100
## 2              2  John  Prod B        3    3000
## 3              3  Jane  Prod A        5      50
## 4              4  Jane  Prod C        7    3500
## 5              5  John  Prod A       10     120
## 6              6  Jane  Prod A       10     110

Question 1: Total Revenue per Sales

We need to sum the Revenue per Sales. We use the aggregate function with the following argument:

x: what we aggregate. In this case, the Revenue.
by: by what do we aggregate. In this case, by Sales. This argument must be a list even if there is only one item.
FUN: the function used to aggregate. In this case, we sum.

Note: The output of the aggregate function will not keep the column names so we need to rename them.

df.agg = aggregate(x = df$Revenue, by = list(df$Sales), FUN = sum)
colnames(df.agg) = c("Sales", "Total Revenue")
df.agg

##   Sales Total Revenue
## 1  Jane          3660
## 2  John          3220

Question 2: Total Revenue per Sales and per Product

We use the aggregate function the same way, except for the second argument where we group by Sales and Product.

df.agg = aggregate(x = df$Revenue, by = list(df$Sales, df$Product), FUN = sum)
colnames(df.agg) = c("Sales", "Product", "Total Revenue")
df.agg

##   Sales Product Total Revenue
## 1  Jane  Prod A           160
## 2  John  Prod A           220
## 3  John  Prod B          3000
## 4  Jane  Prod C          3500

Question 3: Total Revenue and Quantity sold by Sales and Product

We use the aggregate function the same way, except for the first argument where we aggregate the Revenue and the Quantity. In this case, x becomes a list.

df.agg = aggregate(x = list(df$Revenue, df$Quantity), by = list(df$Sales, df$Product), FUN = sum)
colnames(df.agg) = c("Sales", "Product", "Total Revenue", "Total Quantity")
df.agg

##   Sales Product Total Revenue Total Quantity
## 1  Jane  Prod A           160             15
## 2  John  Prod A           220             20
## 3  John  Prod B          3000              3
## 4  Jane  Prod C          3500              7

Question 4: Average Revenue per Rep (per Transaction)

We use the aggregate function the same way, except for the third argument where aggregate using the mean function.

df.agg = aggregate(x = df$Revenue, by = list(df$Sales), FUN = mean)
colnames(df.agg) = c("Sales", "Avg Revenue")
df.agg

##   Sales Avg Revenue
## 1  Jane    1220.000
## 2  John    1073.333

Question 5: List of Products sold per Sales

The aggregate can use a custom function. So let’s create our own function my.paste we will use to paste (comma separated) the strings of characters containing the Product.

We also need to remove duplicates as each rep sold the same product multiple times. To achieve it, we create a function that breaks a string of characters (str) and identify elements (using unlist) separated by a comma (using strsplit). The function will then return the unique (using unique) elements only. We now need to run the function on each row of the Product column with a loop (using for).

my.paste = function(str){
  output = paste(str, collapse = ", ")
}

df.agg = aggregate(x = df$Product, by = list(df$Sales), FUN = my.paste)
colnames(df.agg) = c("Sales", "Products")

my.remove.duplicate = function(str){
  d = unlist(strsplit(str, split=", "))
  paste(unique(d), collapse = ", ")
}

for (i in 1:nrow(df.agg)){
  df.agg[i, "Products"] = my.remove.duplicate(df.agg[i, "Products"])
}

df.agg