Merging¶
To merge different datasets, various scenarios can be explored, as illustrated in the following diagram.

According to the figure presented, we can state
Left: Collects all data from the left dataframe and common data from both the left and right dataframes.
Right: Collects all data from the right dataframe and common data from both the left and right dataframes.
Inner: Collects only the common data from both the left and right dataframes, essentially performing an intersection operation.
Outer: Collects all data from both the left and right dataframes, including the common data.
Pandas is very useful for merging datasets. To merge data, consider the following datasets where 'id1' and 'id2' include the data IDs.
import pandas as pd
raw_data = {'id1': range(4),'income': [10,12,14,16]}
dat1 =pd.DataFrame(raw_data, columns = ['id1', 'income'])
raw_data = {'id2': range(6),'pay': [9,11,13,15,17,19]}
dat2 =pd.DataFrame(raw_data, columns = ['id2', 'pay'])
Obviously, the ID variables may not be the same, and they can be compared using:
dat1['id1'].isin(dat2['id2']).value_counts()
dat2['id2'].isin(dat1['id1']).value_counts()
pd.merge can merge different dataframes, and the merging is typically done based on the identities of the left dataset. If there is no match in the right dataframe, Python adds `NaN`` values for the missing data.
result = pd.merge(dat1, dat2, left_on='id1', right_on='id2', how='left')
On the contrary, you can treat the right dataset as the matching one, and if there's no match in the left dataset, Python adds NaN values accordingly.
result = pd.merge(dat1, dat2, left_on='id1', right_on='id2', how='right')
Since the IDs are not the same, one can perform merging based on the intersection of the IDs.
result = pd.merge(dat1, dat2, left_on='id1', right_on='id2',how='inner')
Merging can also be done based on the union of the IDs.
result = pd.merge(dat1, dat2, left_on='id1', right_on='id2',how='outer')
If the names of ID variables are the same in both datasets, you can use on=id_name instead of left_on= and right_on=. If you want to identify where the elements in rows are from, add the argument indicator=True, then a new column named _merge will be added to the merged data, which shows its origin.
result = pd.merge(dat1, dat2, left_on='id1', right_on='id2', how='outer', indicator=True)
Concating¶
To concatenate datasets row-wise, use concat.
result = pd.concat([dat1, dat2],axis=1)