On Saturday, Alina and I went on a hike up to Wildcat Peak. It took about 2 hours round trip, but the view was worth it. I definitely plan on going on more hikes around the area because of this experience.
This week has been very busy. The results section of my paper was due, so the majority of my time was spent typing, editing and re-editing that. We also made some final decisions about future steps. Instead of the NA cost strategy we've been following, we've decided to employ a delete strategy. In this strategy (which only works because missing values occur only at the end of the sequences) we will delete all missing values and be left with sequences of different lengths. These sequences of unequal length will then be normalized and compared. This method gives faster and more accurate results than simply picking a substitution cost. Other than that change to the initial procedure, the rest of our methods have not changed. In addition to this decision, we have also decided on an ending age of 39 for our sequences. The new variables are finally fixed and we have a final dataset. We are ready to receive our final clustering results and have ensured they will be accurate wi...
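To make the delete strategy concrete, here is a minimal Python sketch. It is only an illustration under assumptions: the state labels ("S", "M"), the use of `None` for missing values, the plain edit distance, and the choice to normalize by the longer sequence's length are all hypothetical, since the post does not say which distance or normalization we actually use.

```python
def trim_trailing_missing(seq, missing=None):
    """Drop missing values from the end of a sequence (delete strategy)."""
    end = len(seq)
    while end > 0 and seq[end - 1] == missing:
        end -= 1
    return seq[:end]


def edit_distance(a, b):
    """Plain Levenshtein distance with unit costs (an assumed measure)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def normalized_distance(a, b):
    """Edit distance divided by the longer length (assumed normalization)."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


# Hypothetical sequences: the younger respondent has trailing missing values.
older = trim_trailing_missing(["S", "S", "M"])
younger = trim_trailing_missing(["S", "M", None])
distance = normalized_distance(older, younger)
```

Because trimming leaves the two sequences with different lengths, dividing by a length keeps a short sequence from looking artificially close to everything else.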
This week, much effort went into comparing average window throughput and degree of change. One challenge to overcome was how to scale the throughput data. To best understand the distribution of the average throughputs for every window, we made a histogram using the unscaled throughputs. It was evident that the very high throughputs dwarfed the small ones by a factor of at least 10^2. Two methods were considered to scale the throughput: taking the reciprocal (^-1) or applying a logarithmic function. The logarithmic function represents small and high throughput most evenly. After choosing our scaling method, we graphed log(throughput) with the degree of change over time. From these figures, it seems there is no discernible relationship between degree of change and log(throughput). This was a big step back, as Alina and I had to rethink our approach to this project.
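A small Python sketch of the two scaling options, for illustration only. The sample throughput values are made up, and base 10 is an assumption, since the post does not say which logarithm base we used.

```python
import math


def log_scale(throughputs):
    """Return log10 of each (strictly positive) throughput value."""
    return [math.log10(t) for t in throughputs]


def reciprocal_scale(throughputs):
    """Return the reciprocal (^-1) of each throughput value."""
    return [1.0 / t for t in throughputs]


# Hypothetical window averages spanning more than two orders of magnitude.
sample = [2.0, 5.0, 40.0, 300.0, 2500.0]

log_scaled = log_scale(sample)
recip_scaled = reciprocal_scale(sample)

# Gap between the two largest windows under each scaling.
log_gap = log_scaled[-1] - log_scaled[-2]
recip_gap = recip_scaled[-2] - recip_scaled[-1]
```

With the reciprocal, the two largest windows end up almost on top of each other near zero, while the logarithm keeps them clearly separated. That matches why the logarithm represents small and high throughput more evenly.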
The plots work! It took all of Monday and most of Tuesday, but we finally found the problems in the code. Ironically, as we were trying to fix the plots, they progressed from not clustering at all, to all clustering together, and finally to what they should look like. Once those problems were figured out, we were able to make slight alterations and use different information until we were left with a robust graphical interpretation of the optimal NA cost. Here are a couple of the graphs: These 2 are colored by age group. For example, the red in the top plot are the people born before 1983 and the green those born after. So the plots tell us that age is playing a significant role in the placement of individuals and is thus creating the clusters. Because age is our alignment variable, different age groups are going to have a different level of missing values and thus the higher cost for replacing missing values will affect the younger survey participants much more than the older part...