Anomaly Detection using Proximity Graph and PageRank Algorithm


Anomaly Detection using Proximity Graph and PageRank Algorithm

Project Descriptions

Anomaly detection, refers to the problem of discovering data points or patterns in a given dataset that do not conform to some normal behavior. Anomaly detection techniques are applied in a variety of domains, including credit card fraud prevention, financial turbulence detection, virus or system intrusion discovery, and network monitoring, to name a few.

In this project, we propose an unsupervised anomaly detection scheme using proximity graphs and the PageRank algorithm (ADPP). Our algorithm takes as input a set of unlabeled data points and determines a ranking of which are most anomalous. We construct a proximity graph from data measurements, with one node for each data point and edges between nodes indicating similar data points. We then examine the stationary distribution of a random walk on this graph, following a variation of the well-known PageRank algorithm. The stationary distribution of the random walk is used as a surrogate for density estimates at the locations of data points, allowing us to bypass running more intensive kernel density estimation procedures and leading to a faster and more scalable algorithm.


Detailed Description

A proximity graph is a weighted graph G = (V, E) with n vertices x1 , . . . , xn. Two vertices are connected by an edge if and only if they satisfy certain particular geometric requirements, which implies the points are close to each other or are somehow related. We view high dimensional measurements as vertices, and form a meaningful proximity graph on top of them. Whether or not a particular edge connects two end vertices depends on their Euclidean distance in between. The resulting proximity graph therefore contains the probability density information. We then invoke a random walk on this graph, treating the stationary distribution as the anomaly indicator. Two criteria are proposed for constructing the proximity graph, one (critical threshold) has nice theoretical supports, the other (minimum spanning tree) paves the way for actual implementation.

In a classic discrete time finite state random walk model, it is well known that a unique stationary distribution s exists if the transition matrix P corresponds to an aperiodic and irreducible Markov chain. PageRank is a modified version of the random walk model, incorporating a teleport vector t and a damping parameter α, which ensures that the stationary distribution exists. By choosing a specific teleport t whose elements are proportional to the degree of each vertex, we prove that any choice of α always converges to the same equilibrium s. We also prove that this particular s is a pointwise consistent estimate for the probability distribution in an asymptotic sense. Moreover, t and α provide us an possibility to incorporate partial label information in our framework to try out other semi-supervised learning algorithms.

In this work, we analyze the time complexity of our approach to be O(n2). We also give out the analytic solution for controlling false alarm rate. The most popular algorithm PCA can run extremely fast in high dimensions, however it is limited to detect linear patterns. Nonlinear extensions, such as Kernel PCA, are able to overcome this constraint, but only with an unacceptable running speed. Experiments show that our approach preserves the nonlinear power while retaining a fast performance, even in a high dimensional case.



Detected Anomalies on Different Datasets


Running Time in Seconds for Phases of ADPP (2 dimensions)
n Distance Computation Graph Formation Weight Assignment PageRank Total
200 0.0012 ± 0.0001 0.0025 ± 0.0003 5.2257×10-4 ± 5.5551×10-5 1.0274×10-4 ± 1.2637×10-5 0.0042
400 0.0054 ± 0.0005 0.0083 ± 0.0008 0.0024 ± 0.0003 3.7261×10-4 ± 3.9830×10-5 0.0165
800 0.0289 ± 0.0029 0.0343 ± 0.0035 0.0151 ± 0.0016 0.0012 ± 0.0001 0.0795
1600 0.1172 ± 0.0118 0.1307 ± 0.0132 0.0618 ± 0.0065 0.0041 ± 0.0004 0.3138
3200 0.4252 ± 0.0430 0.4627 ± 0.0468 0.1869 ± 0.0211 0.0155 ± 0.0016 1.0903
6400 1.7147 ± 0.1732 1.8077 ± 0.1827 0.7231 ± 0.0819 0.0589 ± 0.0060 4.3044
12800 7.0034 ± 0.7076 7.1144 ± 0.7188 2.8038 ± 0.3142 0.2308 ± 0.0235 17.1524


ROC curves for ADPP and PCA on USPS dataset (256 dimensions)
Digit 3 Digit 8 Digit 0