Smartphone-based outlier detection: a complex event processing approach for driving behavior detection

Vasconcelos, Igor; Vasconcelos, Rafael Oliveira; Olivieri, Bruno; Roriz, Marcos; Endler, Markus; Junior, Methanias Colaço

doi:10.1186/s13174-017-0065-0

Research
Open access
Published: 26 September 2017

Igor Vasconcelos¹,²,
Rafael Oliveira Vasconcelos¹,²,
Bruno Olivieri¹,
Marcos Roriz¹,
Markus Endler¹ &
…
Methanias Colaço Junior³

Journal of Internet Services and Applications volume 8, Article number: 13 (2017)

Abstract

The majority of fatal car crashes are caused by reckless driving. With the sophistication of vehicle instrumentation, reckless maneuvers, such as abrupt turns, acceleration, and deceleration, can now be accurately detected by analyzing data related to the driver-vehicle interactions. Such analysis usually requires very specific in-vehicle hardware and infrastructure sensors (e.g. loop detectors and radars), which can be costly. Hence, in this paper, we investigated whether off-the-shelf smartphones can be used to detect and classify the driver’s behavior online, in near real-time. To do so, we first modeled and performed an intrinsic evaluation to assess the performance of three outlier detection algorithms formulated as a data stream processing network that receives and processes data streams from smartphone and vehicle sensors. Next, we implemented a novel scoring mechanism based on online outlier detection to quantitatively evaluate drivers’ maneuvers as either cautious or reckless. Thus, we adapted a data mining mechanism which takes into account a sensor’s data rates and power to determine driver behavior in the scoring process. Finally, as the intrinsic evaluation does not necessarily reveal how well an algorithm will perform in a real-world scenario, we evaluated the algorithm that achieved the best result in a real-world case study to assess drivers’ driving behavior. Our results indicate that the algorithm performs quickly and accurately; the algorithm classifies driver behavior with 95.45% accuracy. Moreover, such results are obtained within 100 milliseconds of processing time on average.

1 Introduction

Driving is an everyday task that has become a necessity for modern society, primarily in large cities. According to Owsley [1], in some cases driving is associated with quality of life. However, reckless driving has caused a growing number of traffic accidents. Tasca [2] defines reckless driving as behavior that “deliberately increases the risk of collision and is motivated by impatience, annoyance, hostility or an attempt to save time.” According to a global traffic safety report by the World Health Organization [3], 1.24 million people die each year in traffic accidents and an estimated 20–50 million are involved in non-fatal accidents. In addition, an estimated $518 billion is spent on the consequences of accidents [4]. However, studies indicate that drivers tend to be relatively safer when monitored or when feedback on their maneuvers is provided [5, 6].

Nevertheless, current intelligent transportation systems (ITS) continue to rely on an infrastructure composed of static sensors and cameras installed on roads, making it difficult to collect, aggregate, and analyze data, especially in real-time [7,8,9]. Moreover, due to the high cost of installation and maintenance, ITS are often restricted to particular roads or neighborhoods [9,10,11]. By contrast, the Internet of Things (IoT) aims to pervasively connect billions [12] of things or smart objects such as vehicles, sensors, actuators, and smartphones. The IoT poses a more complicated challenge in multi-stream environments where multiple data streams compete for available memory and processing resources, especially in resource-constrained systems such as sensors and mobile devices [13]. For this reason, several IoT solutions, including tools for safe driving analysis, have been designed and developed to perform data processing in a cloud environment, due to a cloud’s virtually unlimited capabilities and resources in terms of storage and processing power. For instance, Quintero, Lopez, and Cuervo [14] proposed an approach to classify driving behavior by collecting data onboard the vehicle and forwarding them for processing in the cloud. Leng and Zhao [15] and He, Yan and Xu [16] proposed a cloud computing middleware for the so-called Internet of Vehicles. However, due to the large volume of data generated by some mobile smart devices (i.e., sensors in vehicles), it has become impractical and costly to transmit all data to a cloud [17]. Among the many challenges of the IoT, such as heterogeneity and interoperability, the authors of [18] also highlighted the following: (i) middleware for communication with the cloud [19]; (ii) technologies supporting dynamic configuration [20]; and (iii) robust, real-time mechanisms for data filtering and data mining to cope with the large amount of raw data provided by smart devices and reduce the amount of data transmitted to the cloud.

Therefore, mining and processing mobile data streams is a key technique for real-time data analysis [21]. In this scenario, a mobile device is used to receive and analyze a vehicle’s sensor data streams, such as speed and rotation readings, rather than sending these data streams for analysis in the cloud. Furthermore, with this approach, the mobile device can also use its own sensor data streams (accelerometer and gyroscope readings) to enrich the vehicle’s data stream and analyze driver behavior. We argue that analysis of driving behavior using the received data streams can be mapped to the outlier detection problem, which refers to the problem of finding data patterns that do not conform to (or deviate sufficiently from) expected behaviors [22, 23], e.g., sudden lane changes and hard braking. Although outlier detection has been widely studied, little research has been done on detecting outliers in dynamic data streams on mobile platforms. Given that dynamic means a non-stationary context, the pattern discovery algorithm must adapt to the available data streams. For instance, approaches for modeling and recognizing driving behavior assume a fixed set of data extracted from onboard vehicle sensors. However, in a real-world scenario, modeled sensors may not be available in a specific type/make of vehicle. For instance, most automobile manufacturers have introduced, in addition to the original, standard onboard diagnostics (OBD-II) [24], another set of sensor data such as steering wheel angle, brakes, airbag triggers, and stability control [24, 25]. Thus, the format of vehicular sensor readings and the available data depend on either the manufacturer or the vehicle [26]. The outlier detection algorithm must perform accurately within the limited computational resources of mobile smart devices, in contrast to the virtually unlimited cloud environment.
Moreover, the outlier detection algorithm must operate online, continually processing data items as they are delivered in the input buffer. In contrast, offline algorithms require the complete and finite set of input data for processing. Offline algorithms usually buffer all the input data as a batch before processing them. Consequently, in offline algorithms the detection of a pattern is delayed until the buffer is filled [27].

With the goal of enabling online detection of reckless driving behavior, this paper investigates online outlier detection algorithms for dynamic data streams on mobile devices with limited computational resources. It is important to highlight that these algorithms need to adapt to the available vehicle’s sensors or mobile device’s sensors. Our study adapts and compares three classical offline outlier detection algorithms to perform online data stream processing using the Complex Event Processing (CEP) [28, 29] paradigm. Thus, we propose and evaluate a lightweight approach for detecting outliers through CEP in dynamic data streams generated from mobile devices’ sensors and the vehicle’s onboard sensors. Such an approach should (i) perform online data stream mining to identify outliers while respecting the intrinsic computational and storage limitations of mobile devices, and (ii) be able to adapt to the available input data (i.e., sensors) streams. Specifically, the main contributions of this research include a mechanism to perform online outlier detection over multiple data streams in a resource-constrained device, and a prototype application that implements these requirements to classify driving behavior. A case study was carried out in a real-world scenario in Brazil with the aim of validating the prototype. The results indicate a fast (i.e. ~100 milliseconds of processing time) and accurate (i.e., 95.45% accurate) performance.

1.1 Problem statement

Outlier detection refers to finding patterns in data that do not conform to expected behavior [23]. An outlier commonly contains useful information about abnormal characteristics of the system or entity [30]. Outlier detection is a multidisciplinary field of study that investigates how to extract patterns from large datasets covering a broad spectrum of techniques, such as statistical inference, machine learning, and data mining [23, 31, 32]. Moreover, it has been extensively applied in a variety of applications, for instance, detection of financial fraud, network intrusion, failures in critical systems, sensor faults in sensor networks, speech recognition, and traffic monitoring [23, 32].

Despite extensive research on outlier detection, most existing methods require the entire dataset (or at least a large portion of it) to detect outliers [23, 33], and are designed to perform offline analysis [22] for a large volume of data. These algorithms have no or only restricted support for real-time data analysis requirements (such as meeting timing constraints) and have difficulty adapting to continuous non-stationary data [34]. Online data analytics is particularly significant for applications that need real-time analysis of continuous data streams. This analysis needs to be performed in a manner enabling it to run with partial data and with the limited computational resources of mobile devices. It is challenging to adapt existing outlier detection solutions to mobile data streams since they were designed and developed for cloud environments with abundant available resources, in which they normally compute with the complete data input [35]. Furthermore, such solutions are treated as “black boxes” [34], wherein changes can scarcely be made to internal algorithms. The data mining community has conducted studies addressing outlier detection in data streams; however, these proposals mainly solve difficulties that are not the focus of the current paper, such as clustering [36, 37], mining frequent patterns [38, 39], data analysis [40, 41], and query processing [42] in the cloud environment. The aim of this paper is to investigate online outlier detection over multiple and dynamic data streams. Moreover, the outlier detection is performed in a smart object with limited processing and storage capabilities (unlike cloud environments) within a mobile scenario in which there is no guarantee that all data will always be available. Based on a systematic review of approaches conducted in this paper, we claim that, to the best of our knowledge, the literature offers no solutions to this problem.

A data stream is a continuous and online sequence of unbounded items for which it is not possible to control the order of the produced and processed data [43]. One characteristic of the data stream is its dynamic nature [44], meaning the properties of data instances may evolve or change over time. Additionally, context changes in a mobile scenario. For instance, it may modify the available data stream’s inputs, and therefore, it is necessary for an algorithm to adapt to the stream’s evolution [45]. Recently, online outlier detection within data streams has attracted attention in many constrained emerging applications, such as mobile crowd sensing, mobile activity recognition, ITS, and mobile healthcare [21]. In these applications, multiple and continuous streams are generated by mobile sensors and these streams need to be analyzed in real time. Based on this scenario, it can be seen that the adaptation of strategies for classical outlier detection algorithms to operate with mobile data streams, thus enabling their operation on mobile devices to be efficient, is a challenging research task [21, 35, 46]. This is because (i) outlier detection for data streams is restricted to the partial set of events within a time window; (ii) random access on the set is not possible; (iii) the algorithm must adapt to hardware resources and available sensor data; and (iv) patterns must be discovered within a single pass over the data stream. Moreover, Chandola, Banerjee and Kumar [23] highlight additional factors that make the outlier detection problem more difficult in such a situation:

Defining a region encompassing all possible normal driving behavior is difficult. Furthermore, the threshold between normal and abnormal driving behavior is not often precise. Thus, an outlier observation lying close to the boundary may be normal or abnormal.
The lack of availability of datasets for training and validation is often a major problem. The exact notion of an outlier differs depending on the application domain. An outlier detection formulation is generated by both the nature of data and the availability of labeled data.

1.2 Motivating scenario

Intelligent transportation systems have received increasing attention from academia, industry, and governments, and have been considered the next technological change in individuals’ daily lives [47]. Automobile manufacturers, in an attempt to overcome the aforementioned ITS limitations, have developed products that help drivers, called Advanced Driver Assistance Systems (ADAS). These systems [48, 49] obtain vehicle data from sensors or embedded devices (e.g., cameras and stability control sensors) for the prevention and detection of collisions (e.g., crash sensors can activate airbags), assisted driving, and the generation of offline driving reports. The advantages of ADAS include the rare occurrence of false positives [50] when accessing sensors and devices that are embedded in the vehicle. However, the key impediment of ADAS lies in the fact that they are typically available only in new and high-standard vehicles that have prohibitive prices for most drivers [50,51,52], even in developed countries. Furthermore, the installation of ADAS in older car models is either impossible or inordinately expensive. Finally, when ADAS become obsolete, upgrading or changing to a newer, more efficient system is a difficult task [50], and exorbitant for most drivers.

By contrast, studies have proposed the use of smartphones to understand and evaluate a driver’s behavior [5, 26, 49, 53,54,55,56,57]. The choice of using a smartphone is made due to its affordability and wide adoption, sufficient storage capacity and processing power, as well as its equipment with a variety of sensors. Moreover, a smartphone can act as a processing hub that receives and analyzes data from different vehicle sensors. For instance, with Bluetooth, a smartphone is able to connect and receive data from multiple in-vehicle sensors using the OBD-II standard, simultaneously receiving and processing speed and accelerometer data streams. Furthermore, in-vehicle sensors’ data streams can be combined with smartphone-embedded sensors (such as direction and location) to further enrich the analysis. Finally, smartphones allow for the development of ubiquitous and loosely connected systems that provide rich data for the analysis of driving behavior.

Currently, approaches that evaluate driving behavior in general use models and techniques (e.g., Neural Networks, Fuzzy Theory, and Hidden Markov models) with good accuracy [58]. However, they were not designed for data stream processing [40], and according to Lin et al. and Wang, Xi, and Chen [58, 59], have low processing performance and require a long training phase, artificial assumptions, or prior knowledge to formulate rules. Moreover, since these approaches are static (i.e., non-adaptable), they have difficulty quickly and accurately recognizing parameters [58]. For instance, neural networks have subjective methods for adjusting the topology (number of layers and neurons) and require a fixed number of input parameters. A final drawback, highlighted by Wang [59], is that these approaches are “black boxes” with little ability to identify causal relationships, making it impossible to understand physical behaviors. However, driving conditions (which are influenced by the state of the driver, traffic, and weather conditions) are dynamic and, as such, not all the information that a technique or algorithm needs as input will always be available onboard. Thus, we believe that an assessment of a driver’s driving behavior would benefit from an online outlier detection approach in dynamic mobile data streams.

1.3 Assumptions

This paper considers the following assumptions.

Most data instances in a data stream are normal. Only a small portion of the data consists of outliers [60, 61].
Outliers are statistically different from normal data [2, 62].
Battery power consumption is not a critical requirement because in a vehicle a smartphone can be charged easily when necessary.

However, it is important to note that the first two assumptions complement each other. Considering only the first assumption, some outliers may have behaviors similar to normal data. The second assumption, however, states that outliers are a set of data with behaviors that differ from normal data.

The remainder of the paper is organized as follows. Section 2 presents an overview of the key concepts and system modeling used throughout this work. Section 3 details the proposed approach to online outlier detection for mobile, dynamic data streams. Section 4 highlights definitions and planning of the case study. Section 5 summarizes the main results of the assessment conducted to evaluate the proposal. Section 6 discusses related work. Finally, Section 7 reviews and discusses the central ideas presented in this paper and proposes paths for future work on the subject.

2 Fundamentals

This section presents the main concepts of complex event processing, as well as outlier detection algorithms.

2.1 Complex event processing

Complex event processing (CEP) is a set of techniques and tools that provides an in-memory processing model for an asynchronous data stream in real time (i.e., minimum delay) for online detection of situations of interest [28]. Complex event processing offers [28]: (i) situation awareness through the use of continuous queries that correlate data from different sensor data streams; (ii) context awareness by subdividing data streams into different views, such as temporal windows or key partitions; and (iii) flexibility, since the specification of events can be dynamically changed while a system is running (i.e., on-the-fly).

The central concept of CEP is a declarative event processing language (EPL) used to express event processing rules (continuous queries and patterns). These rules are based on the event-condition-action triad, and use operators (e.g., logic, counting, temporal, causal, and spatial) on input events, searching for correlations, exceptional conditions, and the occurrence of patterns. The central task of CEP is to provide mechanisms for event pattern matching, i.e., from hundreds or even thousands of events, to identify significant patterns in the application domain [63]. Event processing and pattern detection are performed by so-called event processing agents (EPAs) that process an event stream. Essentially, an EPA filters, separates, aggregates, transforms, and synthesizes new complex events from simple events. A reckless maneuver (e.g., rapidly turning at a high speed) is an example of a complex event, insofar as it is based on the composition of primitive events, such as acceleration, speed, and wheel direction. To detect such complex events, it is necessary to collect and analyze the data streams generated by various primitive sensors, looking for patterns and correlations. To detect the pattern of a maneuver, it is necessary to use an important concept of CEP called the time window (or just window). A window is a temporal context that defines which portions of the input data stream are considered during the execution of an EPL rule [64], e.g., events in the last 30 s, or a snapshot of such recent events [63]. The most common time window models are the batch and sliding windows [64]. The former has a fixed lower bound while the upper bound advances every time a new information item enters the system, that is, the CEP engine buffers and processes all events in a time interval. The latter has a fixed size; however, both lower and upper bounds advance when new items enter the system. In other words, it is a moving batch window.
An event processing network (EPN) is a network of interconnected EPAs that implement the global processing logic for pattern detection through event processing [29]. In an EPN, EPAs are conceptually connected to each other—output events from one EPA are forwarded and further processed by other EPAs—without regard to the particular kind of underlying communication mechanism for event dissemination.
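
The difference between the two window models can be illustrated with a minimal Python sketch. It is not EPL and not the implementation used in this paper: the function names `batch_windows` and `sliding_windows` are hypothetical, and the windows are bounded by an event count rather than by time purely to keep the example short.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, List


def batch_windows(events: Iterable[float], size: int) -> Iterator[List[float]]:
    """Buffer events and release them together once `size` events have arrived."""
    buffer: List[float] = []
    for event in events:
        buffer.append(event)
        if len(buffer) == size:
            yield buffer
            buffer = []  # start a fresh batch


def sliding_windows(events: Iterable[float], size: int) -> Iterator[List[float]]:
    """Emit the most recent `size` events each time a new event arrives."""
    window: Deque[float] = deque(maxlen=size)  # the oldest event is dropped automatically
    for event in events:
        window.append(event)
        if len(window) == size:
            yield list(window)


# Hypothetical speed readings (km/h).
speeds = [50, 52, 51, 90, 53, 54]
batches = list(batch_windows(speeds, 3))
slides = list(sliding_windows(speeds, 3))
```

With six readings and a size of three, the batch model produces two disjoint windows, whereas the sliding model produces four overlapping ones, so a rule evaluated over a sliding window sees the abrupt reading of 90 in three consecutive evaluations.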

2.2 Outlier detection

Outlier detection techniques typically assume that outliers in data are rare compared to normal instances. A variety of outlier detection techniques have been developed in several research communities. Many of these techniques have been developed for specific application domains, while others are more generic. The techniques explained in this paper are used widely in several research areas for identifying outliers in data. The earliest algorithms used for outlier detection were statistical approaches which assume that normal instances occur in high probability regions, while anomalies occur in low probability regions. The standard score (more commonly referred to as the Z-score) is a simple statistical technique that enables one-pass computation over a data stream to identify outliers, making different kinds of data comparable and easier to interpret [65]. The Z-score describes a raw score’s location in terms of how far above or below the mean it is when measured in standard deviations [65]. A Z-score of 0 means that the raw data instance is equal to the mean. The Z-score is calculated as shown in eq. (1), where Z is the Z-score of a data instance, X stands for the sample value, μ stands for the mean of the sample, and σx stands for the standard deviation of the sample. This computation creates a unitless score that no longer relates to the original units (e.g., km/h and m/s²), as it measures the number of standard deviation units, and therefore can more readily be used for comparisons [66].

$$ Z=\frac{X-\mu }{\sigma_x} $$

(1)

According to Heiman [65], a Z-score basically has two components: (1) a sign, positive or negative, indicating whether the raw score is above or below the mean; and (2) the absolute Z-score value, indicating the score’s distance from the mean when measured in standard deviations. According to Chandola, Banerjee and Kumar [23], all data instances whose absolute Z-score is greater than 3 are declared outliers. After computing the Z-score for each data instance, the algorithm calculates the Z-distribution (i.e., the relative frequency of the raw Z-scores of a population or sample). Figure 1 shows a perfect normal Z-distribution (a.k.a., a standard normal curve). It should be noted that 50% of the scores fall below the mean, 50% fall above the mean, approximately 68% of the distribution is within ±1 σx of the mean, and Z-scores higher than +3 or lower than −3 occur less than 1% of the time. If these Z-scores were obtained from driving data, this would imply, for instance, that most of the time a driver maintained a driving behavior without abrupt changes in speed or direction. In cases where outliers are detected, the driver may have conducted evasive maneuvers to avoid accidents or indeed behaved recklessly, but the number of outliers would still be insufficient to consider the driver reckless. The strength of the Z-score arises from the fact that this technique does not require user parameters and outliers are discovered with a single pass over the data stream. However, it is sensitive to the number of data instances in the dataset and has a unidimensional nature [32].
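
A minimal Python sketch of a one-pass Z-score detector follows. It is an illustration under stated assumptions rather than the algorithm evaluated in this paper: the running mean and standard deviation are maintained with Welford's incremental method, which is our choice for the example, and the class name `RunningZScore` and the sample readings are hypothetical. Only the threshold of 3 comes from the text above.

```python
import math


class RunningZScore:
    """One-pass mean and standard deviation (Welford's method, assumed here)."""

    def __init__(self, threshold: float = 3.0) -> None:
        self.threshold = threshold
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        """Fold one new reading into the running statistics."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def z_score(self, x: float) -> float:
        """Z = (X - mu) / sigma, using the statistics seen so far."""
        if self.n < 2:
            return 0.0
        std = math.sqrt(self.m2 / self.n)  # population standard deviation
        return 0.0 if std == 0 else (x - self.mean) / std

    def is_outlier(self, x: float) -> bool:
        return abs(self.z_score(x)) > self.threshold


# Hypothetical speed readings (km/h) from steady driving.
detector = RunningZScore()
for reading in [50, 51, 49, 50, 52, 48, 50, 51, 49, 50]:
    detector.update(reading)

steady = detector.is_outlier(50.0)
abrupt = detector.is_outlier(90.0)
```

Each reading is scored against the statistics accumulated before it, so no past readings need to be stored, which matches the single-pass property described above.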

The box plot is likely the simplest statistical technique to detect outliers in both univariate and multivariate data sets that makes no assumptions about the data distribution model [23, 32]. The box plot has become a standard technique for presenting a simple display of a 5-number summary, which consists of the smallest non-anomaly observation (min), lower quartile (Q1), median (Q2), upper quartile (Q3), and largest non-anomaly observation (max), together with the interquartile range (IQR), i.e., the difference between Q3 and Q1. This means that 25% of observations are smaller than the first quartile, 50% are smaller than the second quartile, and 75% are smaller than the third quartile. Outliers are points beyond the upper and lower values of the box plot [32]. Laurikkala, Juhola and Kentala [67] suggest a heuristic of 1.5 × IQR beyond the upper and lower quartiles for outliers; however, according to [32], such a heuristic would need to vary across different datasets. A typical box plot can be seen in Fig. 2. Unlike the Z-score, box plots make no assumptions about the data distribution model. For multivariate datasets, outliers can be identified through a pairwise distance measure; however, this can have quadratic complexity in the worst case, since it is founded on the calculation of distances between all data instances [32].
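
The 1.5 × IQR heuristic for univariate data can be sketched in a few lines of Python. The function name `iqr_outliers`, the sample readings, and the choice of the "inclusive" quartile method are assumptions made for illustration; other quartile conventions give slightly different fences.

```python
from statistics import quantiles
from typing import List, Sequence


def iqr_outliers(values: Sequence[float], k: float = 1.5) -> List[float]:
    """Return the values lying more than k * IQR beyond the quartiles."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical speed readings (km/h) within one window.
readings = [48, 49, 50, 50, 51, 52, 53, 90]
outliers = iqr_outliers(readings)
```

Because the quartiles are order statistics, the window has to be sorted (or kept in an ordered structure) before the fences can be computed, unlike the Z-score, which only needs running sums.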

The clustering [68] approach is an exploratory data analysis technique in which a set of input objects, normally multidimensional, are classified into groups (i.e., clusters) of similar objects. Furthermore, it is essentially an unsupervised technique which is preceded by a short and semi-supervised testing and training phase [30, 69] used to identify outliers. Distance-based clustering approaches use a particular clustering measure, such as Euclidean distance. As in the box plot technique, distance-based clustering can have quadratic complexity. Such an approach is based on the following hypothesis according to Chandola, Banerjee and Kumar [23]: normal data instances lie close to their closest cluster centroid, while outliers are far away from their closest cluster centroid.

Following the aforementioned hypothesis and considering two clusters, as shown in Fig. 3, points P1 and P2 are considered outliers since they are far away from the clusters’ centroids. However, when outliers form clusters by themselves, this technique is not able to detect such outliers because data instances that lie close to a cluster centroid are considered normal data. To overcome this limitation, a second category of clustering relies on the following hypothesis [23]: normal data instances lie close to their closest cluster centroid, while outliers are far away from their closest cluster centroid. Based on this hypothesis, as P1 is closer to the cautious cluster centroid, it is considered normal data, while P2 is closer to the reckless cluster centroid and thus considered an outlier. However, it can be extremely costly to collect and label abnormal data [32]. For instance, collecting data that represents reckless driving behavior can even be dangerous, as it may cause traffic accidents. Thus, a clustering algorithm should be capable of identifying outliers with a few data instances that represent reckless driving behavior. For more details regarding outlier detection, we encourage readers to refer to [23, 31, 32].
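
The centroid-distance hypothesis can be illustrated with a short Python sketch. Everything specific in it is hypothetical: the centroid coordinates, the cluster names, the `max_distance` threshold, and the function name `classify` are chosen only to make the idea concrete and do not come from the paper.

```python
import math
from typing import Dict, Tuple

Point = Tuple[float, float]


def classify(point: Point, centroids: Dict[str, Point], max_distance: float) -> str:
    """Label a point with its nearest cluster, or 'outlier' if it is too far away."""
    label, distance = min(
        ((name, math.dist(point, centroid)) for name, centroid in centroids.items()),
        key=lambda item: item[1],
    )
    return label if distance <= max_distance else "outlier"


# Hypothetical centroids in a two-dimensional feature space.
centroids = {"cluster_a": (1.0, 1.0), "cluster_b": (8.0, 8.0)}

near = classify((1.5, 1.2), centroids, max_distance=2.0)
far = classify((4.0, 4.0), centroids, max_distance=2.0)
```

A point such as (4.0, 4.0) lies far from both centroids and is flagged, whereas a tight group of abnormal points would form its own centroid and pass as normal, which is the limitation noted above.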

The K-means algorithm is likely the most popular and widely used unsupervised clustering algorithm [68], which can classify multidimensional data into different groups on the basis of certain dissimilarity measures. The classical K-means algorithm initially chooses random cluster prototypes according to a user-defined selection process. Next, the input data is applied iteratively and the algorithm identifies the best matching cluster, updating the cluster centroid to reflect the new exemplar and minimize the sum-of-squares clustering function given by eq. (2), where μ_j is the mean of the points (xⁿ) in cluster S_j. However, other distance measurements can be used, such as Euclidean distance [23].

$$ \sum_{j=1}^{K}\sum_{n\in S_j}{\left\Vert x^n-\mu_j\right\Vert}^2 $$

(2)
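
The sum-of-squares function in eq. (2) and the centroid update can be sketched in plain Python. This is a batch variant (Lloyd's iteration) with fixed, hypothetical initial centroids so that the result is reproducible; it is not the EPL adaptation described in this paper, and the sample points are invented.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def squared_distance(a: Point, b: Point) -> float:
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def nearest(point: Point, centroids: Sequence[Point]) -> int:
    """Index j of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda j: squared_distance(point, centroids[j]))


def k_means(points: Sequence[Point], centroids: List[Point], iterations: int = 10) -> List[Point]:
    for _ in range(iterations):
        clusters: List[List[Point]] = [[] for _ in centroids]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # Move each centroid to the mean of its members; keep it if the cluster is empty.
        centroids = [
            tuple(sum(dim) / len(members) for dim in zip(*members)) if members else c
            for members, c in zip(clusters, centroids)
        ]
    return centroids


def sum_of_squares(points: Sequence[Point], centroids: Sequence[Point]) -> float:
    """Eq. (2): squared distance of every point to the centroid of its cluster."""
    return sum(squared_distance(p, centroids[nearest(p, centroids)]) for p in points)


points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
initial = [(0.0, 0.0), (10.0, 10.0)]  # hypothetical starting prototypes
final = k_means(points, initial)
sse = sum_of_squares(points, final)
```

For these points the objective drops from its value at the initial prototypes to 8/3 once each centroid sits at the mean of its three members.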

Through the combination of EPL rules, it is possible to write algorithms to classify drivers’ driving behaviors. Thus, we adapted three outlier detection algorithms to EPL rules to perform online processing of a data stream generated by sensors onboard a vehicle. Sections 3.1, 3.2, and 3.3 explain the algorithms.

3 Related work

Kontaki et al. [70] propose four distance-based algorithms for continuous outlier monitoring in data streams. The primary concerns are improving efficiency and reducing memory consumption. To do this, the algorithms use the concept of outliers and inliers. A data instance x is considered an outlier if there are fewer than k data instances within a distance of at most D from x, excluding x itself. On the other hand, if the number of data instances in the D-neighborhood of x is sufficient (i.e., more than k), then x is characterized as an inlier. To improve efficiency, the concept of micro-clusters is used to reduce the number of distance computations. The window size determines the memory size and the number of data instances considered in the time window approach, that is, all data instances in a time window are stored in the main memory and processed by an algorithm. However, the authors control the arrivals and departures of instances in the time window. In these events, if the number of neighbors of a given data instance x is greater than k, then x will never be an outlier and is called a safe inlier. Thus, safe inliers are not stored for further processing, which reduces computation and memory use. Each algorithm has a few variations of this process; however, none have been designed to run on devices with memory and processing constraints.
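
The distance-based definition above can be made concrete with a brute-force Python sketch over one window. It does not reproduce the algorithms of Kontaki et al. [70]: the micro-cluster and safe-inlier optimizations are omitted, and the function name `distance_outliers` and the sample window are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def distance_outliers(window: Sequence[Point], D: float, k: int) -> List[Point]:
    """Return the points with fewer than k neighbors within distance D."""
    outliers: List[Point] = []
    for i, x in enumerate(window):
        neighbors = sum(
            1 for j, y in enumerate(window) if j != i and math.dist(x, y) <= D
        )
        if neighbors < k:
            outliers.append(x)
    return outliers


# Hypothetical one-dimensional readings in the current window.
window = [(0.0,), (0.5,), (1.0,), (1.2,), (10.0,)]
flagged = distance_outliers(window, D=1.0, k=2)
```

The nested loop compares every pair of points in the window, which is the quadratic cost that the micro-cluster technique is intended to reduce.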

An online outlier exploration platform, or in short, ONION [71], is proposed for modeling and exploring outliers in large datasets based on a distance-based approach. ONION employs an offline preprocessing phase followed by an online exploration phase, enabling users to establish connections among outliers. As it is difficult to set appropriate D and k values [70, 71], the offline phase is a three-dimensional preprocessing phase that computes all possible combinations of D, k, and the entire dataset's instances. Because k can take any natural number, the user must specify lower and upper bounds for k. This phase outputs all outlier candidates. Then, the online phase, with some rules, determines which candidates are actually outliers.

Zhao et al. [55] propose a driver behavior evaluation scoring mechanism named Join Driving (based on the ISO 2631 standard [72]). This mechanism analyzes passengers’ comfort level based on their exposure to vibrations to classify drivers as cautious or reckless. As human responses to vibration depend on the level, frequency, and duration of acceleration, the mechanism analyzes three-axis accelerometer data. However, because a smartphone is likely to be in an arbitrary position inside a vehicle, the authors also developed a novel algorithm for reorientation using GPS and orientation sensor data. The evaluation shows that the mechanism can accurately score driving behaviors in high and mid-value smartphones. The main difference between this approach and that of the current paper is that in Zhao et al. [55] the analysis of the data is offline (performed when a driver reports arrival at a destination), while our approach is based on online processing and is able to help a driver while driving.

To understand and model reckless driving behavior, Hong, Margines and Dey [56] implemented a low-cost, in-vehicle sensing platform. Unlike Zhao et al. [55], who use only a smartphone’s sensors, this platform adds an OBD-II diagnostic device to collect data from the vehicle, such as speed, rpm, and throttle position. Furthermore, to detect steering wheel movement, an inertial measurement unit (IMU) was added. Both devices communicated via Bluetooth with a smartphone. To characterize driving behavior, a machine learning-based model analyzed data from acceleration (smartphone sensor), OBD-II, and the IMU. To determine driving behavior, a driver profile is constructed by summarizing the trip profiles obtained from the last three weeks. Finally, the driver’s driving behavior is determined from the driver profile and machine learning. As in the work proposed by Zhao et al. [55], analysis of driver behavior is offline. However, in Hong, Margines and Dey [56], templates of maneuvers such as acceleration, deceleration, and curves must be stored; these are used for comparison with the driver’s maneuvers and subsequent classification as cautious or reckless. According to Banovic et al. [73], this is a weakness because machine learning algorithms classify and predict only the most frequent behaviors. In this respect, infrequent variations in drivers’ behaviors are difficult to detect.

Vehicle data stream mining (VEDAS) [74] aims to identify outliers using a device with low computational power—low processing and storage capacity—and was designed to mine a vehicle’s data stream. Data are collected through an OBD-II device and stored in a data stream management system (DSMS) that provides mechanisms to control and access the data through queries. The DSMS provides operators with the ability to compute statistical aggregation such as mean, variance, and covariance. After pre-processing aggregate data, VEDAS constructs a representation of low dimensional data through three techniques: Incremental principal component analysis (PCA), Fourier transformations, or linear online segmentation. Although it is possible to dynamically choose which of the techniques will be used, the authors emphasize that PCA does not work well for online monitoring with limited computational resources. According to the authors, VEDAS implements a collection of techniques and algorithms, including proprietary ones, to perform data stream analysis. The authors discuss techniques based on clustering and statistical tests. First, OBD-II data is grouped by K-means to detect abnormal vehicle health monitoring patterns. The goal of clustering is to identify representations in space that correspond to safe vehicle operation. The detection of unusual driving patterns is performed through acceleration analysis with a linear approximation algorithm, the piecewise linear approximation [75]. In addition, a statistical test is performed on the smoothed data with the algorithm assuming a Gaussian distribution to identify unusual patterns. The data used for validation of the proposal were extracted from Live For Speed, a driving simulator. However, no driver behavior classification is performed.

The study of Aljaafreh, Alshabatat, and Najim [76] proposes the use of fuzzy logic inference for online identification of abnormal driving data and driver behavior classification based on acceleration and speed. Lateral and longitudinal acceleration are categorized into three intervals: low, medium, and high. Speed is categorized into five ranges, from very low to very high. The values of these outputs are used to classify drivers’ behavior. The proposal of Quintero, Lopez, and Cuervo [14] also uses fuzzy logic; however, the output variables are fed into a trained neural network to classify driver behavior. Because the neural network is on a remote server, all fuzzy system outputs must be sent to this server, which performs offline analysis and driver behavior classification. The authors used a backpropagation algorithm, and the best performing architecture was a two-layer neural network, with nine neurons in the intermediate layer and 31 inputs.

4 Online CEP-based outlier detection algorithms

This section presents the aforementioned outlier detection algorithms expressed as a set of CEP rules for online outlier detection over multiple mobile data streams, enabling their efficient operation on mobile devices. The algorithms are generic; however, as highlighted by Chandola, Banerjee, and Kumar [23], the exact notion of “outlierness” differs according to the application domain. Therefore, in our case study, we aim to classify driver behavior based on outlier detection. Our driving behavior characterization algorithms are based on a pattern-recognition approach. Although the modeling relies on an idea proposed by Zhang [77], the proposals differ in that Zhang’s work aims to identify the driver’s skill level (e.g., expert or novice) by receiving driver behavior measurements as input. Our research aims to identify driver behavior based on online outlier detection through measurements of the signals from different sensors embedded onboard the vehicle, as well as sensors of the mobile device onboard the vehicle.

The processing workflow begins with the interaction between the driver and the vehicle. Each driver exhibits behaviors that can be divided into two types, namely short-term and long-term driving behaviors. The former concerns drivers’ instantaneous behavior that should be taken into account separately, such as pressing the accelerator or the brake. The latter represents larger driving maneuvers, such as making a turn. In this case, it is necessary to consider several issues, namely how the driver accelerates or brakes, the steering wheel angle, and the driver’s speed [78]. From these behavior types, it is possible to detect a unique driving pattern for each driver, enabling the formulation of a profile representing the driver’s behavior [78].

To sense data from the driving behavior, the stream module is responsible for discovering, connecting to, and reading both onboard vehicle and built-in mobile device sensors. This module is able to communicate with onboard sensors via short-range wireless communication technologies, such as Bluetooth and Bluetooth Low Energy. More details are available in our previous work [79]. This module acts as a hub and forwards the data stream to the CEP engine for preprocessing. This raw data preprocessing consists of producing higher-level data (referred to in this paper as evidence) that best represent the driver behavior. This process is known as feature extraction. A feature is a measurable property that best represents a phenomenon, and feature extraction is the process of deriving the values of such features [80]. As discussed, to measure a driver’s long-term behavior, some available features need to be analyzed and correlated over a time period. These features include speed (S = [s₁, s₂, …, sₙ]^T) and acceleration (A = [a₁, a₂, …, aₙ]^T). The parameter n denotes the number of instantaneous sampled values and T denotes a specific time period. Additional features, such as mean speed excluding stops, mean acceleration, mean deceleration, acceleration/deceleration changes, yaw, and a combination of other physical measurements, may be used to measure a driver’s behavior, as discussed in [80].
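As an illustration of feature extraction, the following minimal sketch derives three of the additional features mentioned above from one window of speed samples. It is not the prototype's implementation: the function name `extract_features`, the fixed sampling interval `dt_s`, and the choice to treat a speed of zero as a stop are all assumptions made for the example.

```python
from statistics import mean


def extract_features(speeds_kmh, dt_s=1.0):
    """Derive simple evidence from one window of speed samples (km/h).

    Assumes samples are evenly spaced dt_s seconds apart and that a speed
    of exactly 0 km/h marks a stop; both are illustrative assumptions.
    """
    speeds_ms = [s / 3.6 for s in speeds_kmh]  # km/h -> m/s
    # Finite-difference acceleration between consecutive samples (m/s^2).
    accels = [(b - a) / dt_s for a, b in zip(speeds_ms, speeds_ms[1:])]
    moving = [s for s in speeds_kmh if s > 0]
    positive = [a for a in accels if a > 0]
    negative = [a for a in accels if a < 0]
    return {
        "mean_speed_excl_stops": mean(moving) if moving else 0.0,
        "mean_acceleration": mean(positive) if positive else 0.0,
        "mean_deceleration": mean(negative) if negative else 0.0,
    }


# Hypothetical one-second samples: accelerate from a stop, then brake.
print(extract_features([0, 36, 72, 36, 0]))
```

In the prototype, such values would be produced continuously by the CEP engine over a time window rather than by a batch function.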

The online outlier detection module is responsible for finding patterns in the available evidence that deviate sufficiently from expected behavior. The online outlier detection algorithm adapted for CEP rules runs in this module. An important feature of any outlier detection algorithm is the manner in which outliers are reported [23]. On one hand, scoring algorithms, such as Z-scores, assign a score to each evidence estimating the “outlierness”. On the other hand, label algorithms, such as box plots, assign a label (normal or outlier) to each evidence. Finally, these scores or labels are analyzed by the analysis module to classify the driver behavior (i.e., cautious or reckless) and update the driver profile. In practice, the mobile modules act as an EPN, that is, there is a set of interconnected EPAs in each of them with their respective set of EPL rules. The prototype application architecture is shown in Fig. 4.

To compare different analysis approaches, the three outlier detection algorithms shown in Section 2.2 were adapted for continuous outlier monitoring over data streams, as discussed in Sections 4.1, 4.2 and 4.3.

4.1 Online CEP-based Z-score algorithm

Because the online Z-score algorithm receives a stream of data instances and cannot wait until all evidence has been received, it needs to divide the stream into a sequence of windows, each of which contains a set of evidence. Therefore, the online Z-score, shown in Fig. 5, is calculated according to eq. (3). Unlike the classical Z-score algorithm, the online Z-score computes the sample mean and standard deviation over the evidence in a specific sliding window T. Temporal context rules therefore determine which data instances are admitted into which window. Then, the algorithm calculates the Z-score of the available evidence in each window. Finally, the Z-distribution is analyzed to classify the driver behavior.

$$ Z={\left(\frac{X-\mu }{\sigma x}\right)}^T $$

(3)

The EPL statement that implements the Z-score algorithm is illustrated in Fig. 6. The time clause in line 3 is a temporal operator that segments the evidence data stream instances into a sliding window of windowLength, a time period argument. The statement output is inserted in the stream of Z-score events for further processing, denoted here by z_score_event. Then, another EPL statement computes the Z-distribution from the z_score_event data stream according to Fig. 1.
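For readers without access to the figures, the sliding-window computation can be sketched in Python. This is an illustrative stand-in and not the EPL statement of Fig. 6: the class name `SlidingZScore`, the parameter `window_length_s`, the use of the population standard deviation, and the convention of returning 0 when the window has no spread are assumptions.

```python
from collections import deque
from statistics import mean, pstdev


class SlidingZScore:
    """Z-score of each new evidence value relative to a time-based sliding window."""

    def __init__(self, window_length_s):
        self.window_length_s = window_length_s
        self.window = deque()  # (timestamp, value) pairs, oldest first

    def update(self, timestamp, value):
        self.window.append((timestamp, value))
        # Expire evidence that has fallen out of the sliding window.
        while timestamp - self.window[0][0] > self.window_length_s:
            self.window.popleft()
        values = [v for _, v in self.window]
        mu = mean(values)
        sigma = pstdev(values)  # population standard deviation (assumption)
        # Assumption: with no spread in the window, report a score of 0.
        return 0.0 if sigma == 0 else (value - mu) / sigma


# Hypothetical evidence: (timestamp in seconds, acceleration value).
detector = SlidingZScore(window_length_s=10)
for t, a in [(0, 1.0), (1, 1.0), (2, 4.0)]:
    print(t, detector.update(t, a))
```

Each returned score plays the role of one z_score_event; a later stage would aggregate these scores into the Z-distribution used for classification.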

4.2 Online CEP-based box plot algorithm

The design of the algorithm for driving behavior detection with a box plot technique is shown in Fig. 7. First, as with the Z-score, a temporal context needs to be established. Additionally, to avoid computing pairwise distances for all evidence, which can have quadratic complexity [23], we chose to perform the computations for each dimension individually. For the last step, the analysis EPA then just needs to correlate the outliers. The EPL statement that implements these two steps is illustrated in Fig. 8. As an output, this statement inserts the computed median into the box_plot_q2_event stream. Second, Q1 and Q3 are computed. To do so, we designed an EPL statement that subscribes to box_plot_q2_event and uses these events as a threshold to compute Q1 and Q3. The result is inserted into the box_plot_q1_q3_event stream, as shown in Fig. 9.

Third, three computations are performed simultaneously. (i) Min, max, and IQR are computed as shown in Fig. 10. The non_outlier_expression selects all data instances in the stream that are not outliers, and the outlier_expression selects all outlier instances in the data flow. These two expressions use the heuristic proposed by Laurikkala, Juhola and Kentala [67], as explained in Section 2.2, and are expressed by eqs. (4) and (5), respectively. The fmin and fmax functions return the lowest and highest values, respectively, that are not considered outliers in the box_plot_q1_q3_event data stream, respecting the restrictions imposed by non_outlier_expression. As an output, this statement inserts the computed five-number summary into the box_plot_event stream. (ii) An EPL rule selects all data instances in the flow that are not outliers using the non_outlier_expression. (iii) Another EPL rule selects all outlier instances in the data flow using the outlier_expression. These output data are forwarded to an EPA responsible for the analysis. Finally, similar to the Z-score analysis, if most of the time the evidence is close to the median, then the driver is classified as cautious. Otherwise, if most of the time the median is close to Q1 or Q3, or the IQR is high, the driver is classified as reckless.

$$ rawValue\ge \left(q1-\left(1.5\ast IQR\right)\right)\ \&\ rawValue\le \left(q3+\left(1.5\ast IQR\right)\right) $$

(4)

$$ rawValue<\left(q1-\left(1.5\ast IQR\right)\right)\ \mid \mid\ rawValue>\left(q3+\left(1.5\ast IQR\right)\right) $$

(5)
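The conditions in eqs. (4) and (5) can be sketched for a single dimension as follows. This is an illustration and not the EPL rules of Figs. 8 to 10: the function name `box_plot_labels` is hypothetical, and the quartiles are taken from Python's `statistics.quantiles` with the inclusive method, which is an assumption and may differ from the median-threshold computation used in the prototype.

```python
from statistics import quantiles


def box_plot_labels(values):
    """Split one window of a single dimension into non-outliers and outliers.

    Applies the 1.5 * IQR fences of eqs. (4) and (5).
    """
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    non_outliers = [v for v in values if low <= v <= high]  # eq. (4)
    outliers = [v for v in values if v < low or v > high]   # eq. (5)
    summary = {
        "min": min(non_outliers),  # lowest value that is not an outlier
        "q1": q1,
        "median": q2,
        "q3": q3,
        "max": max(non_outliers),  # highest value that is not an outlier
        "iqr": iqr,
    }
    return summary, non_outliers, outliers


# Hypothetical evidence values for one dimension (e.g., speed in km/h).
summary, normal, flagged = box_plot_labels([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
print(summary)
print(flagged)
```

For this sample, the fences are -3.5 and 14.5, so only the value 100 is labeled an outlier, and the reported min and max are 1 and 9.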

4.3 Online CEP-based K-means algorithm

An overview of the K-means algorithm workflow for online driving behavior detection is shown in Fig. 11. Unlike the previous modeling approaches, this is an iterative algorithm. Therefore, it is necessary to use a batch window so that the algorithm can iterate until it converges. Thus, the incoming evidence data stream is first separated into different temporal contexts (batch windows). Second, cluster centroids are chosen. The traditional implementation of the K-means algorithm chooses K random instances and defines them as the clusters’ centroids. The main disadvantage of this method lies in its sensitivity to the initial values of the cluster centroids. Our development tests showed poor performance with the traditional random choice of initial centroid values. To overcome this problem, the algorithm starts with knowledge previously acquired in the training phase. More details are given in Section 3.5.

Third, the distances between the evidence and the cluster centroids are calculated, and each piece of evidence is assigned to the nearest cluster, as shown in Fig. 12. While some evidence changes from one cluster to another (Fig. 13, lines 1–6), the algorithm performs two processing steps in parallel, only if there is a min_distance_matrix event followed by a loopEvent equal to true in a specific time window: (i) new cluster centroids are calculated (Fig. 13, lines 8–14) and (ii) the evidence is put back into the flow for another calculation of the distances to the cluster centroids (Fig. 13, lines 16–19). Otherwise, the clusters are forwarded for driver behavior analysis. At the end of each batch window, if the evidence belongs to the cautious cluster most of the time, then the driver is classified as cautious. Otherwise, the driver is classified as reckless. Thus, if a driver maintains a driving behavior without abrupt changes of speed or direction, for instance, the percentage of evidence that belongs to the cautious cluster is higher than that of the reckless cluster and the driver is classified as cautious.
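The steps described in this section can be sketched as a conventional loop over one batch window. This is a plain Python illustration and not the EPL rules of Figs. 12 and 13: the function names, the labels `cautious` and `reckless` used as dictionary keys, the `max_iter` cap, and the sample seed centroids are hypothetical, with the seeds standing in for the knowledge acquired in the training phase.

```python
from math import dist


def kmeans_seeded(evidence, seed_centroids, max_iter=100):
    """Run K-means on one batch window, starting from pre-trained centroids."""
    centroids = {label: tuple(c) for label, c in seed_centroids.items()}
    assignment = None
    for _ in range(max_iter):
        # Assign every piece of evidence to its nearest centroid.
        new_assignment = [
            min(centroids, key=lambda label: dist(e, centroids[label]))
            for e in evidence
        ]
        if new_assignment == assignment:
            break  # no evidence changed cluster: converged
        assignment = new_assignment
        # Recompute each centroid as the mean of its members.
        for label in centroids:
            members = [e for e, a in zip(evidence, assignment) if a == label]
            if members:
                centroids[label] = tuple(
                    sum(col) / len(members) for col in zip(*members)
                )
    return assignment, centroids


def classify_driver(assignment):
    """Majority rule: cautious if most evidence falls in the cautious cluster."""
    cautious = sum(1 for a in assignment if a == "cautious")
    return "cautious" if cautious > len(assignment) / 2 else "reckless"


# Hypothetical evidence (speed in km/h, acceleration in m/s^2) and seeds.
batch = [(50, 0.2), (52, 0.3), (48, 0.1), (90, 3.0)]
seeds = {"cautious": (50, 0.0), "reckless": (90, 3.0)}
labels, _ = kmeans_seeded(batch, seeds)
print(labels, classify_driver(labels))
```

Seeding the centroids makes the result deterministic for a given batch, which avoids the sensitivity to random initialization noted above.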

4.4 Limitations

While CEP provides several benefits for data stream processing, such as continuous queries, pattern detection, and temporal windows, it is difficult to express iterative control structures, e.g., while and for loops, using its primitives. Typically, CEP-based applications follow a pipeline stage topology, with data flowing in a given direction from one stage to one or more stages, but without returning to previous stages. This can be troublesome when describing iterative algorithms, such as K-means, which require iterations to converge. To overcome this problem, we simulated the loop check by using two EPL rules. If the loop check is false, i.e., if the evidence has not changed its centroid, we push the event to the next processing stages. However, if the loop check is true, i.e., the evidence has changed its centroid, we reinsert these events into the initial loop stage, which recomputes the centroid distance. We do so by translating the events to EvidenceStreamEvents, the event type that the initial loop phase (distance computation) expects.

Although useful, the independence of each CEP processing stage can make coordination between stages difficult. For instance, the proposed K-means algorithm buffers the received evidence in batches of a Δ time period, which are sent to the next processing stage, as shown in Fig. 12. This is required so that, during iterations, the algorithm analyzes the same set of evidence to partition it into clusters. However, even though the events are buffered, the batch events are analyzed one by one by default in the subsequent processing stages. To address this, the stages buffer the incoming events in a minimum time window so they can be output as a batch. This minimum time window depends on the mobile device’s memory and processing power and is usually less than a second. This limitation is shown in the EPL rule at the top of Fig. 13. The timeWindowLength parameter is the minimum time needed to buffer all output streams in the min_distance_matrix event and check whether the evidence has changed its centroid.

5 Defining and planning the case study

In this section, the case study is presented with a focus on the definition and planning of the objective.

5.1 Objectives and contributions

The aim of this case study is to evaluate the designed and implemented algorithms, as shown in Section 4, to identify drivers’ driving behavior based on outlier detection. More specifically, the objectives are as follows.

- Evaluate the effectiveness of online outlier detection algorithms. That is, evaluate the performance of the pattern recognition algorithms for online outlier detection in the context of limited computational resources (i.e., with a smartphone).
- Perform a case study to assess a driver’s driving behavior on roadway sections, such as roundabouts, turns, tangent sections, traffic lights, intersections (all-way stop), and crosswalks, based on online outlier detection.
- Provide an open dataset of driver behavior with a rich set of sensed data, such as speed, rpm, throttle position, accelerometer, and gyroscope.

5.2 Research questions

The research questions that need to be answered through the case study are:

Which is the best algorithm in terms of accuracy?
Which is the best algorithm in terms of precision?
Which is the best algorithm in terms of recall?
Which is the best algorithm in terms of F-measure?
Which is the best algorithm in terms of average execution time?
Which is the best algorithm in terms of resource consumption (memory and CPU)?

5.3 Drivers and route selection

Due to the difficulty of recruiting drivers and the costs associated with assessing driving behavior, the process of driver selection was a matter of convenience and sampling was completed by quota. However, we attempted to establish a sample that represented a broad swath of drivers, preserving the same behavioral characteristics. Thus, 25 drivers were chosen for the study. Sixteen were male and nine were female, and their ages ranged from 20 to 60 years. Another important factor is driver experience. In our sample, driver experience ranged from 2 to 42 years. Finally, all drivers were familiar with local traffic conditions and regulations. This is important so that, during the assessment, their behavior reflects their daily driving habits.

Regarding route selection, several potential test locations were evaluated. We chose a paved route composed of streets and avenues ranging from one to three lanes and covering approximately 19.4 km in Aracaju-SE, Brazil. In addition, the route, shown in Fig. 14, contains traffic lights, pedestrian crossings, and turns (including 45° and 90° turns). The speed limit on the route was 60 km/h. A pilot study was conducted with all 25 drivers on the chosen route, and this pilot study provided insight into drivers’ behaviors.

5.4 Instrumentation

The instrumentation process began with the implementation of the algorithms through CEP rules as described in Section 4. The algorithms were implemented in EPL, an SQL-like language in which streams replace tables as the source of data and events replace rows as the basic unit of data. They run in ASPER, a CEP engine based on ESPER (an open source CEP engine [81]) and adapted for Android.

A Brazilian version of the Citroën C3 with manual transmission was equipped with a Samsung Galaxy SIII (1.4 GHz quad-core processor, 1 GB of RAM) and a Bluetooth OBD-II device. Our prototype was installed on the smartphone running the online Z-score algorithm. Further details regarding the choice of algorithm are given in Section 6.4. The data collected by the prototype and processed by the Z-score were stored in SQLite [82], an embedded and free SQL database engine.

5.5 Measurement metrics

A confusion matrix is a suitable technique to evaluate the predictive ability of an algorithm to classify data instances [83]. For n classes, the confusion matrix is an n × n table. The actual class column corresponds to the correct classifications and the predicted class represents the algorithm’s classifications. When there are only two classes, one is considered positive (in our case, cautious driving) and the other is considered negative (reckless driving) [83], as shown in Table 1.
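The metrics named in the research questions of Section 5.2 follow from the four cells of a two-class confusion matrix using their standard definitions. The sketch below takes cautious driving as the positive class, as stated above; the function name `classification_metrics` and the counts in the usage example are hypothetical and are not results from the case study.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F-measure from a 2 x 2 confusion matrix.

    tp: cautious classified as cautious   fn: cautious classified as reckless
    fp: reckless classified as cautious   tn: reckless classified as reckless
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f_measure = (
        2 * precision * recall / (precision + recall)
        if (precision + recall) else 0.0
    )
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


# Hypothetical counts, for illustration only.
print(classification_metrics(tp=8, fp=2, fn=1, tn=9))
```

The guards against empty denominators are a convenience of the sketch; they return 0 when a metric is undefined.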

Table 1 Confusion matrix
