In this article, we will use the aggregate function to answer questions on a dataframe. First, let’s create the dataframe we will use in this example. It will represent products sold by sales reps.
Transaction.Nb <- c(1, 2, 3, 4, 5, 6)
Sales <- c("John", "John", "Jane", "Jane", "John", "Jane")
Product <- c("Prod A", "Prod B", "Prod A", "Prod C", "Prod A", "Prod A")
Quantity <- c(10, 3, 5, 7, 10, 10)
Revenue <- c(100, 3000, 50, 3500, 120, 110)
df <- data.frame(Transaction.Nb, Sales, Product, Quantity, Revenue)
df
## Transaction.Nb Sales Product Quantity Revenue
## 1 1 John Prod A 10 100
## 2 2 John Prod B 3 3000
## 3 3 Jane Prod A 5 50
## 4 4 Jane Prod C 7 3500
## 5 5 John Prod A 10 120
## 6 6 Jane Prod A 10 110
We need to sum the Revenue per Sales. We use the aggregate function with the following argument:
x: what we aggregate. In this case, the Revenue.by: by what do we aggregate. In this case, by Sales. This argument must be a list even if there is only one item.FUN: the function used to aggregate. In this case, we sum.Note: The output of the aggregate function will not keep the column names so we need to rename them.
df.agg = aggregate(x = df$Revenue, by = list(df$Sales), FUN = sum)
colnames(df.agg) = c("Sales", "Total Revenue")
df.agg
## Sales Total Revenue
## 1 Jane 3660
## 2 John 3220
We use the aggregate function the same way, except for the second argument where we group by Sales and Product.
df.agg = aggregate(x = df$Revenue, by = list(df$Sales, df$Product), FUN = sum)
colnames(df.agg) = c("Sales", "Product", "Total Revenue")
df.agg
## Sales Product Total Revenue
## 1 Jane Prod A 160
## 2 John Prod A 220
## 3 John Prod B 3000
## 4 Jane Prod C 3500
We use the aggregate function the same way, except for the first argument where we aggregate the Revenue and the Quantity. In this case, x becomes a list.
df.agg = aggregate(x = list(df$Revenue, df$Quantity), by = list(df$Sales, df$Product), FUN = sum)
colnames(df.agg) = c("Sales", "Product", "Total Revenue", "Total Quantity")
df.agg
## Sales Product Total Revenue Total Quantity
## 1 Jane Prod A 160 15
## 2 John Prod A 220 20
## 3 John Prod B 3000 3
## 4 Jane Prod C 3500 7
We use the aggregate function the same way, except for the third argument where aggregate using the mean function.
df.agg = aggregate(x = df$Revenue, by = list(df$Sales), FUN = mean)
colnames(df.agg) = c("Sales", "Avg Revenue")
df.agg
## Sales Avg Revenue
## 1 Jane 1220.000
## 2 John 1073.333
The aggregate can use a custom function. So let’s create our own function my.paste we will use to paste (comma separated) the strings of characters containing the Product.
We also need to remove duplicates as each rep sold the same product multiple times. To achieve it, we create a function that breaks a string of characters (str) and identify elements (using unlist) separated by a comma (using strsplit). The function will then return the unique (using unique) elements only. We now need to run the function on each row of the Product column with a loop (using for).
my.paste = function(str){
output = paste(str, collapse = ", ")
}
df.agg = aggregate(x = df$Product, by = list(df$Sales), FUN = my.paste)
colnames(df.agg) = c("Sales", "Products")
my.remove.duplicate = function(str){
d = unlist(strsplit(str, split=", "))
paste(unique(d), collapse = ", ")
}
for (i in 1:nrow(df.agg)){
df.agg[i, "Products"] = my.remove.duplicate(df.agg[i, "Products"])
}
df.agg
## Sales Products
## 1 Jane Prod A, Prod C
## 2 John Prod A, Prod B