In this section, we formalize the mobility prediction problem, present the main concepts related to it, and introduce the Ensemble Random Forest-Markov predictor (ERFM). ERFM consists of the following steps, each of which is detailed in the subsections below.
1. Data Acquisition/Preparation: In the first step, we collect the data and split it into two subsets: train and test.
2. Features Engineering: In the second step, we build trajectories of different lengths based on the model order k. We also extract features from the data, such as the bearing and the Haversine distance between every two consecutive locations.
3. Model Building/Training/Aggregation: This is the main step. Here, we build the base models for the ensemble predictor, tune the hyperparameters of each model with a Grid Search approach, train each model using the selected parameters, and aggregate them based on the Out-Of-Bag error.
4. Model Evaluation: In the last step, we evaluate the ensemble predictor (ERFM).
3.1 System model
In this article, we introduce an ensemble model based on LBSN data to predict the user’s next location. Therefore, for a better understanding, we provide a brief definition of the principal concepts related to it, including the mobility prediction problem formalization.
Definition 1
(check-ins) A check-in is defined as a 5-tuple c={id,lat,lon,loc,t}, where ‘id’ is the user id; ‘lat’ and ‘lon’ are the latitude and longitude of the location, respectively; ‘loc’ is the location id; and ‘t’ is the timestamp. We denote the set of check-ins of all users as \(\mathbb {C}\) and the set of check-ins of a specific user as \(\mathbb {C}_{id}\), where the index is the user id. For instance, \(\mathbb {C}_{i}\) is the check-in set of user i.
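As an illustration of Definition 1, the following minimal Python sketch represents a check-in as a named tuple and builds the per-user sets \(\mathbb {C}_{id}\). The names `CheckIn` and `group_by_user`, the use of string ids, and the integer timestamps are assumptions made for this example; they do not come from the dataset.

```python
from collections import defaultdict
from typing import NamedTuple


class CheckIn(NamedTuple):
    """One check-in c = {id, lat, lon, loc, t} (hypothetical field names)."""
    user_id: str
    lat: float
    lon: float
    loc: str
    t: int  # timestamp; integer seconds is an assumption of this sketch


def group_by_user(checkins):
    """Build the per-user check-in sets C_id, each sorted by timestamp."""
    per_user = defaultdict(list)
    for c in checkins:
        per_user[c.user_id].append(c)
    for visits in per_user.values():
        visits.sort(key=lambda c: c.t)
    return dict(per_user)


# Toy data, invented for the example.
all_checkins = [
    CheckIn("u1", 40.71, -74.00, "cafe", 200),
    CheckIn("u2", 34.05, -118.24, "gym", 150),
    CheckIn("u1", 40.73, -73.99, "office", 100),
]
by_user = group_by_user(all_checkins)
print([c.loc for c in by_user["u1"]])
```

Sorting each user's check-ins by time at this stage keeps the later trajectory construction a simple left-to-right pass.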
Definition 2
(trajectory) The trajectory \(tr_{m}(i)\) is defined as the m-th time-ordered sequence of locations that user ‘i’ has just visited. For instance, for a sequence of length three (k=3), \(tr_{1}(u)=\{loc_{1},loc_{2},loc_{3}\}\) is the first trajectory of user ‘u’, indicating that the user checked in at these locations in that order. The set of all trajectories of all users is defined as \(\mathbb {T}\), while the set of all trajectories of a specific user is defined as \(\mathbb {T}_{id}\), where the index is the user id.
Definition 3
(mobility prediction) We formalize the mobility prediction problem as follows. Given a user u whose current check-in is c={u,lat,lon,loc,t}, we aim to rank the set of possible locations so that the next location to be visited is ranked at the highest possible position in the list. Therefore, the mobility prediction problem is essentially a ranking task, where we compute a ranking score for every location in the location set \(\mathbb {L}\).
3.2 Data acquisition/preparation
We used the United States region of the Global-scale Check-in Dataset [26]. It has over 12 million check-ins by about 400 thousand users at about 2 million locations over a period of 22 months (from Apr. 2012 to Jan. 2014). The dataset consists of the following fields: (i) User ID (anonymized); (ii) Latitude; (iii) Longitude; (iv) Timestamp/DateTime; (v) Location ID; (vi) Category. Even though this dataset has a high number of users, only a few of them (<1%) were used, because we filtered users by the number of check-ins per location and by the total number of check-ins per user. Specifically, we considered only users who checked in at 10 or more different locations, at least 5 times at each, and we discarded users with fewer than 500 check-ins in total.
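The user filter can be sketched as follows. This is a minimal illustration rather than the exact preprocessing code: it reads the rule as "at least 10 distinct locations, each visited at least 5 times, and at least 500 check-ins in total", and the function name `select_users` and the `(user_id, loc_id)` input format are assumptions.

```python
from collections import Counter, defaultdict


def select_users(checkins, min_locations=10, min_visits=5, min_total=500):
    """Return the ids of the users that pass the filters.

    checkins: iterable of (user_id, loc_id) pairs (assumed input format).
    A user is kept when the total number of check-ins is at least min_total
    and at least min_locations locations were visited min_visits times or more.
    """
    visits = defaultdict(Counter)
    for user_id, loc_id in checkins:
        visits[user_id][loc_id] += 1

    kept = set()
    for user_id, per_loc in visits.items():
        total = sum(per_loc.values())
        frequent = sum(1 for n in per_loc.values() if n >= min_visits)
        if total >= min_total and frequent >= min_locations:
            kept.add(user_id)
    return kept


# Toy data with deliberately small thresholds so the effect is visible.
toy = [("a", "x")] * 3 + [("a", "y")] * 3 + [("b", "x")] * 5 + [("b", "y")]
kept_users = select_users(toy, min_locations=2, min_visits=3, min_total=6)
print(kept_users)
```

In the toy data, both users have six check-ins, but only user "a" has two locations with at least three visits each.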
Figure 1 illustrates the ERFM pipeline, in which the first process is data splitting. There are many ways to split the data into training and testing sets. The most common approach is to use some version of random sampling, since it is straightforward to implement and usually protects the process from being biased towards any characteristic of the data. However, this approach can be problematic when the outcome classes are not evenly distributed. In this context, a less risky splitting strategy is a stratified random sample based on the outcome. For classification models, this is accomplished by randomly selecting samples within each class, which ensures that the frequency distribution of the outcome is approximately equal in the training and test sets.
The data can also be split sequentially, so that the first p% of the data forms the training set and the remainder forms the testing set. However, sequential data such as mobility trajectories is subject to auto-correlation, so the i.i.d. assumption made by the random splitting approaches above does not hold. Therefore, techniques such as random sampling are not suitable for time series data, since they ignore its main aspect: time. Moreover, for large datasets such as Global-scale Check-in, splitting the whole data sequentially is not a good option either, since the testing set may not be correlated with the training set. For this reason, ERFM is based on the Block-rolling Time Series split (BRTS). It leverages the time dependence by splitting the data into N small partitions (folds) and applying, to each one, a sequential split with a given training/testing proportion (see Fig. 1, item 1).
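The block-wise split can be sketched as below. This is a minimal reading of BRTS based only on the description above: the data is assumed to be already time-ordered, the blocks are assumed to be contiguous and of near-equal size, and the function name `block_rolling_split` and the parameter `train_ratio` are hypothetical.

```python
def block_rolling_split(samples, n_blocks, train_ratio=0.8):
    """Cut time-ordered samples into n_blocks consecutive blocks and split
    each block sequentially into a (train, test) pair.

    The first train_ratio share of every block is used for training and the
    rest for testing, so the test samples always come later in time than
    the training samples of the same block.
    """
    if n_blocks < 1 or not 0.0 < train_ratio < 1.0:
        raise ValueError("n_blocks must be >= 1 and 0 < train_ratio < 1")
    n = len(samples)
    folds = []
    for b in range(n_blocks):
        start = b * n // n_blocks
        stop = (b + 1) * n // n_blocks
        block = samples[start:stop]
        cut = int(len(block) * train_ratio)
        folds.append((block[:cut], block[cut:]))
    return folds


# Twelve time-ordered samples, three blocks, 75% of each block for training.
folds = block_rolling_split(list(range(12)), n_blocks=3, train_ratio=0.75)
for train, test in folds:
    print(train, test)
```

Because each block is split on its own, every test portion sits right after the training portion it is evaluated against, which is the property the sequential split of the whole dataset lacks.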
3.3 Features engineering
In many cases, the assumption that “the next place to be visited depends only on the current location” becomes unsuitable or even false, because the current location alone may not be enough to capture the mobility patterns. For instance, a mobility pattern may be associated with several consecutive user movements rather than with low-order transitions. On the other hand, building higher-order transitions may lead to long trajectories that are not directly related to the user’s next location and to a reduced number of samples, which makes mobility prediction difficult. In this context, we used a varied-order approach, where for a defined model order k, we build trajectories of every size from 1 to k. For instance, for k=5, we build trajectories of size 5 and also of sizes 1 to 4, giving trajectory sets of five different sizes, each responsible for extracting a different pattern.
The user trajectories were built based on two aspects: (i) Individual and (ii) General. The former assumes that a user’s mobility is influenced only by that user’s own behavior, while the latter assumes that the behaviors of different users can be correlated in some way. First, we cluster the sequence of locations according to the day of the week. Then, for each cluster, we group the check-ins based on the timestamp difference between two consecutive check-ins of the same user. If the difference is less than or equal to a threshold β, we add the check-in to the current group; otherwise, we create a new group. After that, assuming the memoryless property and the maximum trajectory length k, we iterate over the groups up to k times using an overlapping rolling window of variable size (from 2 to k+1). Note that the rolling window length is fixed within each iteration. As a result, we split each group into overlapping subgroups of sizes 2 to k+1, where the last location of a subgroup is the destination and the preceding locations form the trajectory.
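The grouping and rolling-window steps can be sketched as follows for a single user. This is a simplified illustration: the day-of-week clustering is omitted for brevity, timestamps are assumed to be numeric, and the names `split_by_gap` and `build_samples` are hypothetical.

```python
def split_by_gap(visits, beta):
    """Group time-ordered (timestamp, loc) pairs of one user.

    A new group starts whenever the gap to the previous check-in is
    larger than the threshold beta.
    """
    groups = []
    for t, loc in visits:
        if groups and t - groups[-1][-1][0] <= beta:
            groups[-1].append((t, loc))
        else:
            groups.append([(t, loc)])
    return groups


def build_samples(groups, k):
    """Return {order: [(trajectory, destination), ...]} for orders 1..k.

    For each order, a window of fixed size order + 1 slides over every
    group with a step of one, so consecutive windows overlap.
    """
    samples = {order: [] for order in range(1, k + 1)}
    for group in groups:
        locs = [loc for _, loc in group]
        for order in range(1, k + 1):
            window = order + 1
            for start in range(len(locs) - window + 1):
                chunk = locs[start:start + window]
                samples[order].append((tuple(chunk[:-1]), chunk[-1]))
    return samples


# Toy check-ins: a long pause separates C from D.
visits = [(0, "A"), (10, "B"), (20, "C"), (500, "D"), (510, "E")]
groups = split_by_gap(visits, beta=60)
samples = build_samples(groups, k=2)
print(samples[1])
print(samples[2])
```

In the toy data, the second group holds only two check-ins, so it contributes an order-1 sample but no order-2 sample, which illustrates why higher orders yield fewer training samples.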
The general aspect is further divided into two classes: (i) Collective and (ii) Hybrid. In the collective approach, all the individual trajectory sets are merged into a single collective trajectory set. Hence, it assumes that the trajectories are the same for all users. For instance, let \(\mathbb {T}_{i}\) and \(\mathbb {T}_{j}\) be the individual trajectory sets of users i and j; the collective trajectory set is then given by \(\mathbb {T} = \mathbb {T}_{i} \cup \mathbb {T}_{j}\). The main advantage of this approach over the individual one is the larger number of possible next locations. For instance, Markov-based algorithms fail to correctly predict future movements if the next location has never been visited by the user. In the collective approach, the chance that a location has never been visited is lower. On the other hand, this approach may lead to incorrect predictions, since it does not take into account the individual movement behavior of each user.
In the hybrid approach, user similarity enhances the spatial and temporal information for mobility prediction, since the mobility of a user may be correlated with that of some users but not all. We therefore look for users with similar routines. As in Araujo et al. [18], we computed the similarity based on the spatial factor. First, we calculated the normalized frequency (f) of each user based on the number of times the user visited each location. The normalized frequency is given by Eq. (1):
$$ f_{uid,loc} = \frac{\#\,\text{visits of user } uid \text{ to } loc}{\text{total visits of user } uid} \quad \forall\, loc \in \mathbb{L} $$
(1)
where uid is the user, loc is the location, and \({\mathbb {L}}\) is the set of locations. Since the normalized frequencies of each user uid form a probability distribution over the locations, we computed the similarity between any two users i and j (i≠j), denoted as SRE(i,j), based on the Kullback-Leibler divergence (\(D_{KL}\)), more specifically on the Jensen-Shannon divergence (\(D_{JS}\)). Both measures are commonly used to quantify the divergence (or similarity) between two probability distributions. However, unlike \(D_{KL}\), the Jensen-Shannon divergence is symmetric and bounded (it ranges from 0 to 1 when the base-2 logarithm is used). We considered as similar those users whose SRE value was above a given threshold γ, where γ=0.7. We computed the threshold γ by rounding the average of all SRE values. The user similarity is given by Eq. (2):
$$ SRE\left(i, j\right) = 1 - D_{JS}\left(f_{i},f_{j}\right) $$
(2)
$$ D_{JS}\left(f_{i}, f_{j}\right) = \frac{D_{KL}\left(f_{i}, M\right) + D_{KL}\left(f_{j}, M\right)}{2} $$
(3)
$$ D_{KL}\left(f_{i}, f_{j}\right) = \sum\limits_{loc \in \mathbb{L}} f_{i,loc}\, \log\left(\frac{f_{i, loc}}{f_{j, loc}}\right) $$
(4)
where \(M=0.5(f_{i}+f_{j})\), while \(f_{i,loc}\) and \(f_{j,loc}\) are the normalized frequencies of users i and j, respectively, for the location loc. Therefore, in the hybrid approach, there is a trajectory set for each user, as in the individual approach. However, each hybrid trajectory set contains the user’s own individual trajectories together with the trajectories of similar users.
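Equations (1) to (4) can be computed with a few lines of Python, as in the minimal sketch below. The base-2 logarithm is an assumption, chosen so that the Jensen-Shannon divergence stays in [0, 1]; the function names and the dictionary representation of the distributions are likewise choices made for this example.

```python
import math
from collections import Counter


def normalized_frequency(visited_locs):
    """Eq. (1): share of a user's visits that went to each location."""
    counts = Counter(visited_locs)
    total = sum(counts.values())
    return {loc: n / total for loc, n in counts.items()}


def kl_divergence(p, q):
    """Eq. (4) with a base-2 logarithm; zero-probability terms add nothing.

    q must be positive wherever p is positive, which always holds when q
    is the mixture M used by the Jensen-Shannon divergence.
    """
    return sum(pv * math.log2(pv / q[loc]) for loc, pv in p.items() if pv > 0)


def js_divergence(p, q):
    """Eq. (3): average divergence of p and q from their mixture M."""
    locs = set(p) | set(q)
    m = {loc: 0.5 * (p.get(loc, 0.0) + q.get(loc, 0.0)) for loc in locs}
    return 0.5 * (kl_divergence(p, m) + kl_divergence(q, m))


def sre(p, q):
    """Eq. (2): similarity between two users' visit distributions."""
    return 1.0 - js_divergence(p, q)


# Toy visit histories, invented for the example.
f_i = normalized_frequency(["home", "work", "home", "gym"])
f_j = normalized_frequency(["home", "work", "home", "gym"])
f_k = normalized_frequency(["bar", "bar", "park"])
print(sre(f_i, f_j), sre(f_i, f_k))
```

Users with identical visit distributions get a similarity of 1, and users who share no location get 0, so the threshold γ is applied on a fixed scale.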
Figure 1 (item 2) illustrates the feature extraction process. In order to build a more sophisticated ML model, besides the coordinates of the sequence of locations, we added two more features for every two consecutive locations: the bearing and the distance. The bearing (θ) is the angle, measured clockwise from the north direction, of the path from one location to the next, and it is given by Eq. (5), where \(\varphi_{1}\) and \(\varphi_{2}\) are the latitudes of the two locations, \(\Delta \lambda\) is the difference between their longitudes, and arctan(A,B) denotes the two-argument arctangent. The distance is the geodesic (great-circle) distance in kilometers between the two locations, computed with the Haversine formula, which is commonly used with latitude and longitude values. These features are extracted for trajectories with length k≥2. For instance, for a trajectory with length k=3, two bearing features and two distance features are added, one pair for each user movement.
$$\begin{array}{*{20}l} &A = \sin \Delta \lambda \cdot \cos \varphi_{2} \\ &B = \cos \varphi_{1} \cdot \sin \varphi_{2} - \sin \varphi_{1} \cdot \cos \varphi_{2} \cdot \cos \Delta \lambda \\ &\theta=\operatorname{arctan}\left(A, B\right) \end{array} $$
(5)
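Both features can be computed as in the minimal sketch below. The bearing follows Eq. (5) using `math.atan2` as the two-argument arctangent; the Haversine implementation uses a mean Earth radius of 6371 km, which is an assumption of this example, as are the function names and the choice of degrees as the input and output unit.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch


def bearing_deg(lat1, lon1, lat2, lon2):
    """Eq. (5): bearing from point 1 to point 2, in degrees within [0, 360)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dlam) * math.cos(phi2)
    b = (math.cos(phi1) * math.sin(phi2)
         - math.sin(phi1) * math.cos(phi2) * math.cos(dlam))
    return math.degrees(math.atan2(a, b)) % 360.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers given by the Haversine formula."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    h = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    # min() guards against rounding pushing the square root slightly above 1.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(h)))


# One degree of longitude along the equator: heading due east.
east_bearing = bearing_deg(0.0, 0.0, 0.0, 1.0)
east_distance = haversine_km(0.0, 0.0, 0.0, 1.0)
print(east_bearing, east_distance)
```

Taking the result modulo 360 maps the (-180°, 180°] range returned by `atan2` onto a clockwise angle from north, which matches the definition of θ used in the text.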
3.4 Model building/training/aggregation
Ensemble learning is an ML technique where multiple predictors (often called “weak learners” or “base models”) are trained to solve the same problem and combined to get better results [17]. These base models often do not perform well by themselves, either because they have a high bias (e.g., models with few degrees of freedom) or because they have too much variance (e.g., models with many degrees of freedom). The idea of ensemble methods is to reduce the bias and/or variance of such weak learners by combining several of them into an aggregated learner (or ensemble model) that achieves better performance.
Traditional ensemble learning approaches only have one layer, i.e., they use ensemble learning once. In this article, we propose ERFM, a two-layer ensemble learning model, in which the weak learners are ensemble learning models. Therefore, in the inner layer, we combine collections of Decision Trees (DT) to create Random Forest models, each of which is based on a different trajectory set according to the trajectory length k. Hence, for an order-k model, there will be k different Random Forest models. In the outer layer, the outputs from the previous layer are aggregated based on the classification performance of each weak learner.
RF performs better than an individual DT in two respects: overfitting and anomaly isolation. During the RF training process, outliers appear in some of the trees but not in all of them, so the aggregation step tends to isolate the anomalies. Also, RF uses the Bagging (Bootstrap Aggregation) approach, in which each tree is trained on a random sample drawn with replacement from the training dataset (a bootstrap sample), resulting in different trees. The voting system therefore reduces overfitting relative to an individual decision tree. Also, since each DT takes a different set of training data as input, deviations in the original training dataset have little impact on the final result obtained from the aggregation of the DTs. In other words, bagging reduces the variance without changing the bias of the complete ensemble. Moreover, a Random Forest can be evaluated using the Out-Of-Bag (OOB) error, which is the average error over the training samples \(z_{i}\), where the prediction for each \(z_{i}\) uses only the trees that do not contain \(z_{i}\) in their bootstrap sample.
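To make the OOB error concrete, the following minimal sketch computes it from per-tree predictions. It assumes that the OOB prediction for a sample is the majority vote of the trees whose bootstrap sample does not contain it, and that samples contained in every bootstrap sample are skipped; the function name `oob_error` and the input layout are hypothetical.

```python
from collections import Counter


def oob_error(labels, tree_predictions, bootstrap_indices):
    """Out-Of-Bag error rate of a bagged ensemble.

    labels[i]: true class of training sample z_i.
    tree_predictions[t][i]: class predicted by tree t for sample z_i.
    bootstrap_indices[t]: set of sample indices drawn for tree t.
    """
    errors = 0
    scored = 0
    for i, truth in enumerate(labels):
        # Only trees that never saw z_i during training may vote on it.
        votes = Counter(
            preds[i]
            for preds, bag in zip(tree_predictions, bootstrap_indices)
            if i not in bag
        )
        if not votes:
            continue  # z_i was in every bootstrap sample
        scored += 1
        if votes.most_common(1)[0][0] != truth:
            errors += 1
    return errors / scored if scored else float("nan")


# Three samples, three trees; each sample is out-of-bag for exactly one tree.
labels = ["a", "b", "a"]
bags = [{0, 1}, {1, 2}, {0, 2}]
preds = [
    ["a", "b", "b"],  # tree 0 votes on sample 2 (wrong)
    ["a", "a", "a"],  # tree 1 votes on sample 0 (right)
    ["b", "b", "a"],  # tree 2 votes on sample 1 (right)
]
error_rate = oob_error(labels, preds, bags)
print(error_rate)
```

Since each sample is scored only by trees that were not trained on it, the OOB error acts as a built-in validation estimate and needs no separate hold-out set.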
For hyperparameter optimization, we used a Grid Search approach. We split the training set with the BRTS strategy into two equal subsets, training and validation, and, given a list of candidate values for each parameter, Grid Search chooses the combination that yields the best classification performance on the validation subset (see Fig. 1, item 3). In this article, we used the following parameters:
- n_estimator: It specifies the number of trees in the forest. The list of values used was [20,50,100].
- max_depth: It specifies the maximum depth of each tree. The list of values used was [5,10,20,50].
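The search over these two lists can be sketched as an exhaustive loop over all 12 combinations. In the sketch below, `fit_and_score` is a hypothetical callable that would train a Random Forest on the training subset and return its validation score; here it is replaced by a toy function so that the example runs on its own. The dictionary keys follow the parameter names as written above (scikit-learn's `RandomForestClassifier`, for instance, spells the first one `n_estimators`).

```python
from itertools import product

PARAM_GRID = {
    "n_estimator": [20, 50, 100],
    "max_depth": [5, 10, 20, 50],
}


def grid_search(param_grid, fit_and_score):
    """Evaluate every parameter combination and keep the best one.

    fit_and_score(params) is a hypothetical callable: it should train a
    model with params and return its score on the validation subset,
    where a higher score is better.
    """
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = fit_and_score(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    """Stand-in for real training; peaks at 50 trees of depth 10."""
    return -abs(params["n_estimator"] - 50) - abs(params["max_depth"] - 10)


best_params, best_score = grid_search(PARAM_GRID, toy_score)
print(best_params, best_score)
```

With only 12 combinations per base model, an exhaustive search stays cheap, which is one reason a grid is a reasonable choice over randomized search here.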
After the Grid Search, each RF is trained with the best parameters using the full training dataset (Fig. 1, item 4). Then, ERFM combines all RFs using a weighted average, where the weight of each base predictor is inversely proportional to its OOB error rate (Fig. 1, item 5). Therefore, RFs with a high error rate receive a low weight. Finally, we normalize the predictions using the output probabilities and rank the candidate locations from the most likely to the least likely.
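The aggregation step can be sketched as follows. The exact weighting formula is not stated in the text, so this example assumes a weight of 1/(OOB error) for each Random Forest, with a small constant `eps` to avoid division by zero; the function name `aggregate` and the dictionary format of the per-model probabilities are also assumptions.

```python
def aggregate(probabilities, oob_errors, eps=1e-9):
    """Combine per-model location probabilities into one ranked list.

    probabilities: one {location: probability} dict per Random Forest.
    oob_errors: the OOB error rate of each Random Forest, in the same order.
    Returns (location, normalized score) pairs, best candidate first.
    """
    # Assumed weighting: inversely proportional to the OOB error.
    weights = [1.0 / (err + eps) for err in oob_errors]
    scores = {}
    for probs, w in zip(probabilities, weights):
        for loc, p in probs.items():
            scores[loc] = scores.get(loc, 0.0) + w * p
    total = sum(scores.values())
    if total == 0:
        return []
    normalized = ((loc, s / total) for loc, s in scores.items())
    return sorted(normalized, key=lambda item: item[1], reverse=True)


# Two base models; the second has the lower OOB error and weighs more.
model_probs = [{"a": 0.6, "b": 0.4}, {"a": 0.2, "b": 0.8}]
ranking = aggregate(model_probs, oob_errors=[0.5, 0.25])
print(ranking)
```

In the toy example, the first model prefers location "a", but the second model has half the OOB error and therefore twice the weight, so "b" ends up ranked first.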
3.5 Model evaluation
Models can be distinguished by type: classification or regression. In classification, the output is a categorical class label, whereas in regression the model learns a continuous function. It is common for classification models to predict a continuous value as the probability of a given example belonging to each output class. The probabilities can be interpreted as the likelihood or confidence of a given example belonging to each class. A predicted probability can be converted into a class value by selecting the class label that has the highest probability. In this article, we return a vector containing the highest predicted probabilities. Finally, we select the location with the highest probability.
In order to evaluate the classification performance, we compare different ML methods using two metrics computed on the testing set (see Fig. 1, item 6): accuracy and F1-score (see Eqs. (6) and (7)). The former measures the fraction of correct predictions among all predictions made. The F1-score is the harmonic mean of precision and recall, where precision is the ratio of correctly predicted positive observations to the total predicted positive observations, and recall is the ratio of correctly predicted positive observations to the total number of actually positive observations.
$$ accuracy = \frac{\# \text { correctly predicted}}{\# \text { predictions}} $$
(6)
$$ F1 = 2 \times \frac{precision \times recall}{precision+recall} $$
(7)
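Equations (6) and (7) translate directly into code, as in the minimal sketch below. Since next-location prediction has more than two classes, the F1-score is computed per location and then averaged with equal weight (macro averaging); this averaging choice is an assumption of the example, as the text does not specify one.

```python
def accuracy(y_true, y_pred):
    """Eq. (6): fraction of predictions that match the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def f1_for_label(y_true, y_pred, label):
    """Eq. (7) for one class, treating label as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1-scores (assumed averaging)."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_label(y_true, y_pred, l) for l in labels) / len(labels)


# Toy labels, invented for the example.
y_true = ["a", "a", "b", "b"]
y_pred = ["a", "b", "b", "b"]
acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
print(acc, f1)
```

Accuracy alone can look good when a few locations dominate a user's check-ins; the per-class F1-score also penalizes a model that ignores rarely visited locations.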