Week 8 Review: 7/23 - 7/27
After week 7's setback, our research group had to rethink its data analysis strategy. Instead of comparing the degree-of-change metric with 4 clusters against log(throughput), we chose a method using 2 clusters per window. In a very general sense, the performance of a single data transfer can be considered "good" or "bad." We hypothesized that with two clusters, one would contain the normal transfers, the "good" ones, and the other would contain the anomalous points, the "bad" ones. By taking the proportion of "good" points among all the points in one time window, we can approximate what the throughput should be for that window and, most notably, whether it is abnormally low.
We didn't work out a way to tell the normal cluster from the anomalous one, because we weren't sure how well the approach would work in the first place. For the sake of time, we bypassed this hurdle by only considering the smaller of the two clusters. This simplification means that when throughput is exceptionally high, the smaller cluster will almost certainly be the anomalous cluster, and vice versa for the normal cluster. We formulated the method and tested it by inspecting graphs to see whether it could track log(throughput), as we did last week.
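The two-cluster idea above can be sketched roughly as follows. This is only an illustration, not our actual pipeline: it assumes k-means as the clustering algorithm (via scikit-learn's `KMeans`), the function name `good_fraction_per_window` is made up for this post, and the windows are synthetic data generated on the spot. The smaller cluster is treated as the anomalous one, matching the shortcut described above.

```python
import numpy as np
from sklearn.cluster import KMeans


def good_fraction_per_window(windows, seed=0):
    """For each window (array of shape [n_transfers, n_features]),
    split the transfers into two clusters and return the fraction of
    points that are NOT in the smaller cluster."""
    fractions = []
    for X in windows:
        X = np.asarray(X, dtype=float)
        # Assumption: k-means with k=2; any two-cluster method could be swapped in.
        labels = KMeans(n_clusters=2, n_init=10, random_state=seed).fit_predict(X)
        counts = np.bincount(labels, minlength=2)
        smaller = counts.min()  # treated as the anomalous ("bad") cluster
        fractions.append(1.0 - smaller / counts.sum())
    return np.array(fractions)


# Synthetic stand-in data: each window has a tight "normal" blob
# plus a varying number of far-away "anomalous" points.
rng = np.random.default_rng(0)
windows = []
for n_bad in (2, 10, 20):
    good = rng.normal(0.0, 0.1, size=(50 - n_bad, 3))
    bad = rng.normal(5.0, 0.1, size=(n_bad, 3))
    windows.append(np.vstack([good, bad]))

fractions = good_fraction_per_window(windows)
print(fractions)
```

The resulting sequence of per-window fractions is what would be plotted against log(throughput) to see whether the two move together.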
As I'm sure you can tell, there is a much clearer relationship between the two sequences, except when throughput is higher than average. This is okay, though, because it's what we anticipated. This example comes from one of the 4 feature subsets that performed well compared to the others.
For the unsuccessful feature sets, we considered whether similar points were being labeled inconsistently across long spans of adjacent windows. We used t-SNE to see whether or not this was happening in certain cases.
The C2S mapping shows points of different colors overlapping. This indicates that the points are similar but labeled differently. As a result, the C2S subset performed poorly at tracking throughput. S2C, which stands for server to client, has no overlapping points with different labels. We can clearly see that this set should perform well, which it does in the first figure.
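For anyone who wants to try a similar check, here is a rough sketch of the t-SNE step. It assumes scikit-learn's `TSNE` and uses synthetic points rather than our transfer data. We judged overlap by eye from the colored scatter plots; the `neighbor_label_mismatch` helper below is a hypothetical numeric stand-in that was not part of our analysis. It counts how often a point's nearest neighbor in the embedding carries a different label.

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.neighbors import NearestNeighbors


def embed(X, seed=0):
    """Project the pooled points to 2-D with t-SNE."""
    tsne = TSNE(n_components=2, perplexity=20, init="pca", random_state=seed)
    return tsne.fit_transform(X)


def neighbor_label_mismatch(embedding, labels):
    """Fraction of points whose nearest neighbor in the embedding
    has a different label (hypothetical overlap measure)."""
    nn = NearestNeighbors(n_neighbors=2).fit(embedding)
    _, idx = nn.kneighbors(embedding)
    nearest = idx[:, 1]  # column 0 is the point itself
    return float(np.mean(labels[nearest] != labels))


# Synthetic stand-in data: two well-separated groups of points.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 0.5, size=(60, 5)),
    rng.normal(6.0, 0.5, size=(60, 5)),
])
consistent_labels = np.repeat([0, 1], 60)      # similar points share a label
mixed_labels = rng.integers(0, 2, size=120)    # similar points labeled at random

embedding = embed(X)
consistent_score = neighbor_label_mismatch(embedding, consistent_labels)
mixed_score = neighbor_label_mismatch(embedding, mixed_labels)
print(consistent_score, mixed_score)
```

A low mismatch corresponds to a plot where the colors stay separated, as with S2C, and a high mismatch corresponds to the overlapping colors seen with C2S.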