In this article, we will use the merge
function to join 2 tables (dataframe). This is an equivalent of a join
in SQL or “a kind of” vlookup
in Excel. The 2 tables need a key (identifier) upon which the merge will operate. The objective is not to address all the possible uses of this function but just the most common usages.
The main arguments of the merge
function are:
x
: name of the first tabley
: name of the second tableby
: name of the column containing the key. If the 2 tables have different column names, we will use by.x
and by.y
.all.x
is TRUE
(left outer join), the ouput will contains all the table x rows even if there is no match with table y (red and purple).all.y
is TRUE
(right outer join), the ouput will contains all the table y rows even if there is no match with table x (purple and blue).all
is TRUE
(full outer join), the output will contains all the rows from table x and y even the rows with no match between the 2 tables (red, purple and blue).Let’s consider a first Table (Opportunity
) containing a list of sales opportunities with Opportunity Number, Opportunity Value and the ID of the employee in charge.
## Opportunity.Nb Employee.ID Opportunity.Value
## 1 8123 111 1000
## 2 8124 112 1500
## 3 8125 111 10000
## 4 8126 110 99
Let’s consider a second Table (Employee
) containing the Employee Name and Employee ID. We ensured this table does not contain any duplicate.
## Employee.ID Employee.Name
## 1 111 John
## 2 112 Mike
## 3 113 Mary
## 4 114 Jane
## 5 115 Bob
If we need to find the name the employee in charge of each opportunity, we perform a merge of Opportunity
and Employee
keeping all the Opportunity
rows (all.x = TRUE
and all.y = FALSE
). We note Employee
does not contain any Employee ID 110, so no employee name was found for Opportunity Number 8126.
Table.all.x <- merge(x = Opportunity, y = Employee, by = "Employee.ID", all.x = TRUE, all.y = FALSE)
Table.all.x
## Employee.ID Opportunity.Nb Opportunity.Value Employee.Name
## 1 110 8126 99 <NA>
## 2 111 8123 1000 John
## 3 111 8125 10000 John
## 4 112 8124 1500 Mike
If we keep all Employee
rows (all.x = FALSE
and all.y = TRUE
), we obtain the opportunities for which the employee ID was found (Opportunity number 8126 is not in the output) but all employees contained in Employee
are present in the output even if there is no corresponding opportunity in Opportunity
.
Table.all.y <- merge(x = Opportunity, y = Employee, by = "Employee.ID", all.x = FALSE, all.y = TRUE )
Table.all.y
## Employee.ID Opportunity.Nb Opportunity.Value Employee.Name
## 1 111 8123 1000 John
## 2 111 8125 10000 John
## 3 112 8124 1500 Mike
## 4 113 <NA> NA Mary
## 5 114 <NA> NA Jane
## 6 115 <NA> NA Bob
If we keep all Opportunity
and Employee
rows (all = TRUE
), the output contains all the opportunities and all the employees.
Table.all <- merge(x = Opportunity, y = Employee, by = "Employee.ID", all = TRUE )
Table.all
## Employee.ID Opportunity.Nb Opportunity.Value Employee.Name
## 1 110 8126 99 <NA>
## 2 111 8123 1000 John
## 3 111 8125 10000 John
## 4 112 8124 1500 Mike
## 5 113 <NA> NA Mary
## 6 114 <NA> NA Jane
## 7 115 <NA> NA Bob