Cambridge, Massachusetts, United States
In summary, I applied machine learning techniques to classify computer programs (projects) shared in the Scratch online community. In particular, I applied NLP techniques in a new context and demonstrated that they can accomplish this task with reasonable accuracy.
My work included:
• Applying unsupervised learning methods (k-means clustering) to better understand the structure of Scratch projects, and to motivate the use of supervised learning methods for type classification.
• Leading the effort to construct a labeled dataset of Scratch projects and their types (i.e., “animation”, “game”, “slideshow”, “other”), via a collective process of consensus-based annotation by experts.
• Applying unsupervised NLP techniques to find optimal representations for Scratch blocks and projects, by training word embeddings on a dataset of 500,000 Scratch projects (using the fastText library).
• Applying supervised NLP techniques to train a classifier that categorizes Scratch projects by type, by training on a high-quality labeled dataset of 873 projects and using the unsupervised word embeddings as the foundation (using the fastText library).
• Tuning hyperparameters to find a configuration that yields reasonable classifier performance.
• Conducting an in-depth analysis of the trained unsupervised and supervised models, and exploring what they learned during training.
(This work culminated in a thesis for my Master's degree.)
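
As an illustration of how Scratch projects can be handed to an NLP library, the sketch below treats each project as a "sentence" whose "words" are block opcodes, and formats it for fastText. This is a minimal sketch under stated assumptions, not the thesis pipeline: the toy projects, the helper names (`to_unlabeled_line`, `to_labeled_line`), and the choice of opcodes as tokens are all illustrative. The only fastText convention relied on is the `__label__` prefix used in its supervised training files.

```python
# Hypothetical toy data: each project is a label plus a sequence of block opcodes.
# These examples are illustrative and are not drawn from the actual datasets.
PROJECTS = [
    ("game", ["event_whenkeypressed", "motion_movesteps", "sensing_touchingobject"]),
    ("animation", ["event_whenflagclicked", "looks_nextcostume", "control_wait"]),
]


def to_unlabeled_line(blocks):
    """One project as a whitespace-separated token line (for embedding training)."""
    return " ".join(blocks)


def to_labeled_line(label, blocks):
    """One project as a fastText supervised line: '__label__<type> <tokens...>'."""
    return "__label__" + label + " " + to_unlabeled_line(blocks)


# Corpus for unsupervised embedding training (labels are ignored here).
unlabeled_corpus = [to_unlabeled_line(blocks) for _, blocks in PROJECTS]

# Training lines for the supervised type classifier.
labeled_corpus = [to_labeled_line(label, blocks) for label, blocks in PROJECTS]

for line in labeled_corpus:
    print(line)
```

Written to text files, lines like these could be passed to fastText's `train_unsupervised` and `train_supervised` functions, with the latter's `pretrainedVectors` option pointing at the saved embeddings. The file names and hyperparameter values would be choices of the person running the experiment and are not specified here.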