In this article, we will use the aggregate
function to answer questions on a dataframe. First, let’s create the dataframe we will use in this example. It will represent products sold by sales reps.
Transaction.Nb <- c(1, 2, 3, 4, 5, 6)
Sales <- c("John", "John", "Jane", "Jane", "John", "Jane")
Product <- c("Prod A", "Prod B", "Prod A", "Prod C", "Prod A", "Prod A")
Quantity <- c(10, 3, 5, 7, 10, 10)
Revenue <- c(100, 3000, 50, 3500, 120, 110)
df <- data.frame(Transaction.Nb, Sales, Product, Quantity, Revenue)
df
## Transaction.Nb Sales Product Quantity Revenue
## 1 1 John Prod A 10 100
## 2 2 John Prod B 3 3000
## 3 3 Jane Prod A 5 50
## 4 4 Jane Prod C 7 3500
## 5 5 John Prod A 10 120
## 6 6 Jane Prod A 10 110
We need to sum the Revenue
per Sales
. We use the aggregate
function with the following argument:
x
: what we aggregate. In this case, the Revenue
.by
: by what do we aggregate. In this case, by Sales
. This argument must be a list
even if there is only one item.FUN
: the function used to aggregate. In this case, we sum
.Note: The output of the aggregate
function will not keep the column names so we need to rename them.
df.agg = aggregate(x = df$Revenue, by = list(df$Sales), FUN = sum)
colnames(df.agg) = c("Sales", "Total Revenue")
df.agg
## Sales Total Revenue
## 1 Jane 3660
## 2 John 3220
We use the aggregate
function the same way, except for the second argument where we group by Sales
and Product
.
df.agg = aggregate(x = df$Revenue, by = list(df$Sales, df$Product), FUN = sum)
colnames(df.agg) = c("Sales", "Product", "Total Revenue")
df.agg
## Sales Product Total Revenue
## 1 Jane Prod A 160
## 2 John Prod A 220
## 3 John Prod B 3000
## 4 Jane Prod C 3500
We use the aggregate
function the same way, except for the first argument where we aggregate the Revenue
and the Quantity
. In this case, x
becomes a list
.
df.agg = aggregate(x = list(df$Revenue, df$Quantity), by = list(df$Sales, df$Product), FUN = sum)
colnames(df.agg) = c("Sales", "Product", "Total Revenue", "Total Quantity")
df.agg
## Sales Product Total Revenue Total Quantity
## 1 Jane Prod A 160 15
## 2 John Prod A 220 20
## 3 John Prod B 3000 3
## 4 Jane Prod C 3500 7
We use the aggregate
function the same way, except for the third argument where aggregate using the mean
function.
df.agg = aggregate(x = df$Revenue, by = list(df$Sales), FUN = mean)
colnames(df.agg) = c("Sales", "Avg Revenue")
df.agg
## Sales Avg Revenue
## 1 Jane 1220.000
## 2 John 1073.333
The aggregate
can use a custom function. So let’s create our own function my.paste
we will use to paste (comma separated) the strings of characters containing the Product
.
We also need to remove duplicates as each rep sold the same product multiple times. To achieve it, we create a function that breaks a string of characters (str
) and identify elements (using unlist
) separated by a comma (using strsplit
). The function will then return the unique (using unique
) elements only. We now need to run the function on each row of the Product
column with a loop (using for
).
my.paste = function(str){
output = paste(str, collapse = ", ")
}
df.agg = aggregate(x = df$Product, by = list(df$Sales), FUN = my.paste)
colnames(df.agg) = c("Sales", "Products")
my.remove.duplicate = function(str){
d = unlist(strsplit(str, split=", "))
paste(unique(d), collapse = ", ")
}
for (i in 1:nrow(df.agg)){
df.agg[i, "Products"] = my.remove.duplicate(df.agg[i, "Products"])
}
df.agg
## Sales Products
## 1 Jane Prod A, Prod C
## 2 John Prod A, Prod B