Research update 3 ~Alex
I was looking over what I posted last time and honestly couldn't believe how much we have accomplished in just one week. I also realized I never explained what Optimal Matching is. Optimal Matching is a non-Euclidean distance metric that can be used on large sequences of data. It works through a series of substitutions: the more substitutions it takes to match two sequences, the "further apart" they are. Determining the substitution cost for missing values can be tricky, especially for values that are missing because of how the data is aligned. Too high a cost can cause artificial clusters to form around the alignment variable, and too low a cost can underrepresent possible similarities in the missing data. So determining the optimal cost for missing values, the "NA cost", will be an important part of our analysis.
Since I last posted there have been good times and bad times. Monday and Tuesday were good days. We worked through the methods used in the last publication on our data set to get a feel for the data and to help me understand all of the twists and turns involved in machine learning algorithms. Throughout that process we ran two internal clustering measures, ASW (average silhouette width) and PBC (point biserial correlation), on the clusters described last week. These two measures are used to check the stability of the clusters and can help determine the optimal number of clusters. After a lot of failed attempts these tests produced the following plots for our 5 demographic variables.
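To make the Optimal Matching idea from earlier in this post a bit more concrete, here is a minimal Python sketch of an edit-style distance with a separate NA cost. It is only an illustration, not the code we actually ran: the function name `om_distance`, the `NA` marker, the state labels, and the cost values (substitution 2, insertion/deletion 1) are all assumptions chosen for the example. It also allows insertions and deletions, which Optimal Matching generally uses alongside substitutions.

```python
NA = None  # hypothetical marker for a missing state in a sequence


def om_distance(seq_a, seq_b, sub_cost=2.0, indel_cost=1.0, na_cost=0.0):
    """Cheapest total cost of turning seq_a into seq_b.

    Allowed operations: substitute one state for another, or insert/delete
    a state. Any substitution involving NA is charged na_cost instead of
    sub_cost. All cost values here are illustrative assumptions.
    """
    def substitution(x, y):
        if x == y:
            return 0.0
        if x is NA or y is NA:
            return na_cost
        return sub_cost

    n, m = len(seq_a), len(seq_b)
    # cost[i][j] = cheapest way to turn the first i states of seq_a
    # into the first j states of seq_b
    cost = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i * indel_cost
    for j in range(1, m + 1):
        cost[0][j] = j * indel_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(
                cost[i - 1][j] + indel_cost,  # delete from seq_a
                cost[i][j - 1] + indel_cost,  # insert into seq_a
                cost[i - 1][j - 1] + substitution(seq_a[i - 1], seq_b[j - 1]),
            )
    return cost[n][m]


# Made-up example sequences (the state labels are hypothetical)
complete = ["S", "S", "M", "M"]
similar = ["S", "M", "M", "M"]
gappy = ["S", NA, NA, "M"]

print(om_distance(complete, similar))             # 2.0
print(om_distance(complete, gappy, na_cost=0.0))  # 0.0
print(om_distance(complete, gappy, na_cost=2.0))  # 4.0
```

With an NA cost of 0 the gappy sequence looks identical to the complete one, and with an NA cost of 2 it looks twice as far away as a sequence that really does differ in one state. That is the trade-off the NA cost is supposed to balance.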
As you can see in the PBC plot, the lines start to stabilize around 4 or 5 clusters, which suggests this is the optimal number of clusters. You can also see that the NA 2 lines are much higher than the NA 0 lines, which means the NA 2 runs are clustering better than the NA 0 runs (this is expected, since a high NA cost can create false clusters). You also probably noticed that the NA 0 lines are miraculously missing from the ASW plots. Apparently the numbers with NA cost 0 were divided by 0 somewhere in ASW. We're not quite sure why or how, but Alina is doing her best to figure it out. (Also, this plot has given me nothing but struggles. Shortly after it was saved, lines started disappearing, randomly, for no reason. The plot was malicious and for now it has won. I had to move on to more important things.)
After the internal clustering measures we started looking at dimensionality reduction techniques to visualize our data more easily. We decided t-SNE would be the best algorithm to start with. It is fairly new and seems to work almost like magic, producing 2-dimensional data (below) from hundreds of dimensions. It's not perfect, and interpreting the result can be a little tricky, as distances aren't exactly to scale and cluster density can be exaggerated.
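Back to the missing ASW lines for a moment. One way a divide-by-zero can show up in a silhouette calculation (not necessarily what happened in our case, since that is still unexplained) is when a sequence is at distance 0 from everything in its own cluster and from everything in the nearest other cluster. The silhouette width is (b - a) / max(a, b), so if both averages are 0 the denominator is 0. The sketch below is a plain-Python ASW on a precomputed distance matrix. The function names and the choice to score that case as 0 are assumptions for illustration, not what any particular library does.

```python
def mean_distance(dist, i, members):
    """Average distance from point i to the points listed in members."""
    return sum(dist[i][j] for j in members) / len(members)


def average_silhouette_width(dist, labels):
    """ASW from a square distance matrix and one cluster label per point.

    Needs at least two clusters. Singletons and the a == b == 0 case are
    scored as 0.0 here, which is an assumed convention for this sketch.
    """
    n = len(labels)
    cluster_ids = sorted(set(labels))
    widths = []
    for i in range(n):
        own = [j for j in range(n) if labels[j] == labels[i] and j != i]
        if not own:
            widths.append(0.0)  # point is alone in its cluster
            continue
        # a: average distance to the rest of the point's own cluster
        a = mean_distance(dist, i, own)
        # b: average distance to the closest other cluster
        b = min(
            mean_distance(dist, i, [j for j in range(n) if labels[j] == c])
            for c in cluster_ids
            if c != labels[i]
        )
        denom = max(a, b)
        # Without this guard, a == b == 0 raises ZeroDivisionError
        widths.append((b - a) / denom if denom > 0 else 0.0)
    return sum(widths) / n


# Two tight, well separated clusters (made-up distances)
separated = [
    [0, 1, 5, 5],
    [1, 0, 5, 5],
    [5, 5, 0, 1],
    [5, 5, 1, 0],
]
# Every pair at distance 0, the degenerate case described above
all_zero = [[0] * 4 for _ in range(4)]

print(average_silhouette_width(separated, [0, 0, 1, 1]))
print(average_silhouette_width(all_zero, [0, 0, 1, 1]))
```

If many sequences collapse to distance 0 from each other, which an NA cost of 0 could plausibly cause when sequences differ only in their missing values, an unguarded version of this calculation would fail in exactly this spot.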
Wednesday we had our weekly meeting with Ling (she's a partner from another group on our project). She gave us some new directions and a to-do list for the week. It was a pretty extensive list, and I knew I had a lot of work to do, so I started right away. At first it went OK, but it went downhill from there. Everything on the list was new, so unlike following along with the paper, the coding and results were entirely up to me. I was starting from scratch. I had to rearrange data, create 5 new variables, compare those variables with preexisting ones, and then run t-SNE on several new groups of data. There were so many questions to answer, plus the papers and PowerPoints we always read and present. I tried and tried but could not figure out how to create the variables for the life of me. I worked for hours on Wednesday and all of Thursday with no luck. I really was questioning my abilities and even my right to be in the program. I've always felt young and inexperienced in the office, but this was the first time that I really felt like I didn't belong. Everyone else's research was going so well, and despite Alina's attempts to help, I was still floundering. I left Thursday feeling horrible about my abilities and very worried I would disappoint everyone who was counting on my data.
When I came in Friday I went right back to attacking the data and suddenly my code worked! It did what I asked it to do! There were no weird missing graph lines or strange misplaced NAs; it just worked. After that breakthrough the rest flew by. I was accomplishing tasks in minutes instead of days. I feel like I have a better handle on our project and coding in general after this week. I'm very glad to say I survived my first real test. Now on to the next.
Also, this is unrelated to research, but I saw my first turkey at the lab! It was just wandering around right outside our building.