Design and implementation of a system for recognizing patterns and unique events in IoT data streams

Bibliographic record details
Main author: Καρράς, Χρήστος
Other authors: Karras, Christos
Language: English
Published: 2022
Subjects:
Available online: http://hdl.handle.net/10889/15850
Description
Abstract: Data streams are perhaps one of the most important underlying concepts of the age of interconnected big data. However, data streams are generally difficult to manage because of their volatile nature, which stems from the combination of high speed and short information lifetime. This combination dictates the need for new and efficient stream analytics that yield useful insights into the actual stream dynamics without processing the stream in its entirety, since collecting and communicating stream samples is challenging, while storing the entire stream (or even a large segment of it) globally or locally, transmitting and relaying it, or otherwise computing a function over it is plainly out of the question.

In response to this research challenge, a number of strategies tailored explicitly to streaming scenarios have been designed. They differ from earlier approaches mainly in the computational resources they assume: streaming approaches take for granted that one or more resources, including processing power and memory, are limited, and they may additionally operate under time or accuracy constraints. The number and nature of these limitations is exactly what distinguishes the various families of streaming strategies. A prime example is the family of aggregating algorithms, which collect streaming data in windows of finite length, compute a small number of selected analytics over each window, typically in the form of a vector, and progressively construct a stream summary consisting of successive such vectors. A related family is that of sampling methods, where one or more representative stream values are kept as the stream description. Reservoir sampling methods belong to this category, selecting and storing values that are probabilistically determined to be important.

In this context, the primary research objective of this MSc thesis is to develop and describe the inner workings of a weighted random sampling technique, based on the generic reservoir sampling algorithmic framework, for identifying unique events. In brief, this is accomplished by keeping k stream elements deemed representative of the whole stream according to a progressively constructed estimate of the joint stream distribution over all possible elements. Once confidence in this estimate is high, the k samples are selected uniformly. The elements are selected with a complexity of O(min(k, n − k)), where n is the number of elements examined.
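As a hedged illustration of the reservoir framework described above, the sketch below implements the classic Efraimidis–Spirakis weighted reservoir scheme (each element receives a key u^(1/w) and the k largest keys are kept); the thesis's own estimator of the joint stream distribution and its O(min(k, n − k)) selection step are not reproduced here. The function name weighted_reservoir_sample and the generator of (value, weight) readings are illustrative assumptions.

```python
import heapq
import random

def weighted_reservoir_sample(stream, k):
    """Keep k elements from a weighted stream (Efraimidis-Spirakis A-Res sketch).

    `stream` yields (item, weight) pairs; the reservoir is a min-heap of
    (key, item) with key = u ** (1 / weight), so the k largest keys survive.
    """
    reservoir = []  # min-heap of (key, item)
    for item, weight in stream:
        if weight <= 0:
            continue  # zero or negative weights carry no sampling mass
        key = random.random() ** (1.0 / weight)
        if len(reservoir) < k:
            heapq.heappush(reservoir, (key, item))
        elif key > reservoir[0][0]:
            heapq.heapreplace(reservoir, (key, item))  # evict the smallest key
    return [item for _, item in reservoir]

if __name__ == "__main__":
    # Hypothetical IoT readings: (sensor value, importance weight).
    readings = ((random.gauss(20.0, 2.0), random.uniform(0.1, 1.0))
                for _ in range(10_000))
    print(weighted_reservoir_sample(readings, k=8))
```

A single pass and O(k) memory are the defining constraints here, matching the limited-resource setting the abstract describes for streaming strategies.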
The proposed algorithm proceeds as follows. Since events are, as a general rule, treated in the recent literature as outliers, it suffices to extract element patterns and feed them to an alternative version of the k-means technique developed in this work. To perform data stream clustering, the proposed algorithm chooses the locations of k centroids and then computes the distance between each item and the centroids. As in the standard k-means methodology, each item is assigned to the closest centroid. Assuming that the nature of the stream data allows averaging, a new centroid is derived by averaging the samples allocated to each cluster. The distances to the new centroids are recalculated and each item is repeatedly re-assigned until no centroid moves.

In addition to the steps of classical k-means, the sum of squared errors (SSE) is computed for each cluster at each step and is used not only as a measure of convergence but also as a quantification and indirect measure of how accurately the element distribution is approximated. This clustering allows outliers to be discovered as the stream progresses, with a probability based on the distances from the centroids representing normal events. Experimental results show that the weighted sampling, in conjunction with the k-means alternative, outperforms standard methods for stream event detection. As an additional step towards stream visualization, the detected events, together with the clusters of normal events, are depicted as knowledge graphs. This allows the proposed algorithms to be evaluated in the future, perhaps through a crowdsourcing platform, by a large number of individuals who can provide immediate feedback.
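The thesis's exact k-means alternative is not reproduced here; the following minimal sketch, written against the description above, illustrates the general idea under stated assumptions: Lloyd-style iterations over a one-dimensional reservoir sample, the per-cluster SSE recorded at every iteration as both a convergence check and an approximation measure, and items lying far from their nearest centroid (relative to the cluster's root-mean-square spread) flagged as candidate events. The names stream_kmeans_sse and flag_events, and the 3-sigma-style threshold, are illustrative assumptions, not identifiers from the thesis.

```python
import math
import random

def stream_kmeans_sse(samples, k, max_iter=100, tol=1e-9):
    """Lloyd-style k-means over a 1-D sample with per-iteration SSE tracking.

    Returns (centroids, assignments, sse_history); sse_history records the
    per-cluster SSE at every iteration, serving as a convergence check and
    an indirect measure of how well the sample approximates the stream.
    """
    centroids = random.sample(samples, k)
    assignments = [0] * len(samples)
    sse_history = []
    for _ in range(max_iter):
        # Assign each item to its closest centroid.
        assignments = [min(range(k), key=lambda j: (x - centroids[j]) ** 2)
                       for x in samples]
        # Per-cluster SSE for this iteration.
        sse_history.append([sum((x - centroids[j]) ** 2
                                for x, a in zip(samples, assignments) if a == j)
                            for j in range(k)])
        # Recompute each centroid as the mean of its cluster (averaging assumed valid).
        new_centroids = []
        for j in range(k):
            members = [x for x, a in zip(samples, assignments) if a == j]
            new_centroids.append(sum(members) / len(members) if members
                                 else centroids[j])
        converged = all(abs(c - n) < tol for c, n in zip(centroids, new_centroids))
        centroids = new_centroids
        if converged:
            break
    return centroids, assignments, sse_history

def flag_events(samples, centroids, assignments, factor=3.0):
    """Flag items far from their centroid relative to the cluster's RMS spread."""
    flagged = []
    for j in range(len(centroids)):
        members = [x for x, a in zip(samples, assignments) if a == j]
        if not members:
            continue
        rms = math.sqrt(sum((x - centroids[j]) ** 2 for x in members) / len(members))
        flagged += [x for x in members if abs(x - centroids[j]) > factor * rms]
    return flagged

if __name__ == "__main__":
    random.seed(7)
    # Mostly "normal" readings plus two injected extremes acting as events.
    data = [random.gauss(0.0, 1.0) for _ in range(200)] + [9.5, -8.7]
    cents, assign, history = stream_kmeans_sse(data, k=3)
    print("centroids:", cents)
    print("candidate events:", flag_events(data, cents, assign))
```

Using the per-cluster SSE both for stopping and for scoring how far an item sits from its centroid mirrors the dual role the abstract assigns to it; in a full streaming pipeline this routine would be re-run as the reservoir sample is refreshed.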