The Million Song Dataset:

An Exploratory Analysis and Year Predictive Model

Presented on behalf of the Data Science Student Society at UCSD




Intro

For our final portfolio project in Dr. Jelena's vibrant Math 189: Data Analysis and Inference class of Winter 2015, one of UCSD's newest and most exciting Data Science courses designed for students from a variety of interdisciplinary backgrounds, the members of the Data Science Student Society at UCSD accepted the challenge of deriving valuable insights from one of the largest collections of music data online: the Million Song Dataset.


Data

We utilized a combination of different, smaller subsets of the original 300 GB dataset:




Basic Univariate Analysis

Since our general predictive task is to model year, we decided to study the year information first in order to see if we could spot some general trends.

Exploring the distribution of labels is a good approach before developing a machine learning model since some models are designed to implicitly learn the structure or probability distribution of our data.

Below is a simple discrete histogram produced in R showing that song counts have increased over time, in what appears to be a nonlinear pattern.

Q1: How many songs are in our dataset and how are they distributed?

We notice that the frequency of song counts starts to take off from the late 1970s to the dawn of the 1980s, likely coinciding with the invention of the CD by Philips and Sony!
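For reproducibility, here is a minimal sketch of how such a histogram might be produced in R; the data frame `songs` and its `year` column are hypothetical names of our own, not ones defined by the dataset files:

```r
# Minimal sketch: discrete histogram of song years.
# `songs` and its `year` column are assumed names.
songs <- subset(songs, year > 0)   # the MSD stores an unknown year as 0
hist(songs$year,
     breaks = seq(min(songs$year), max(songs$year) + 1, by = 1),
     main   = "Song Counts by Year",
     xlab   = "Year")
```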


Speaking of certain circular objects like the CD, which can be considered one of the first mass-manufactured digital media for music recordings, another excellent visualization we can use to quickly eyeball the proportion of songs per decade is the classic pie chart, easily created through the power of Google Fusion Tables (you can hover your mouse over a slice to show the absolute counts in the subset).


Before building our year-prediction model later down the road, we take to heart the wise aphorism:

"Essentially, all models are wrong, but some are useful."

George E.P. Box, statistician

Box's words remind us that we need to study the year information of our dataset closely so that we can develop a useful, though not necessarily correct, year-prediction model for it. Fittingly, his surname also happens to be a great reminder of how we can visualize outliers in our year data with the box-and-whisker plot below.

What we can clearly see is that the central mass between the 1st and 3rd quartiles lies between the mid 1990s and the mid 2000s. We also computed the proportion of songs from 1922 to 1976 and found that they represent only a small fraction of our sample -- 5%! Other summary statistics on year are tabulated using R's fancy stargazer package, located to the right of the boxplot.
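For the curious, a sketch of how the boxplot, proportion, and summary table might be produced, continuing with the hypothetical `songs` data frame from the histogram sketch above:

```r
# Boxplot of years plus the early-song proportion and a summary table.
boxplot(songs$year, horizontal = TRUE, main = "Song Year")
mean(songs$year >= 1922 & songs$year <= 1976)   # proportion of early songs (~5%)

library(stargazer)
stargazer(songs["year"], type = "text")         # summary-statistics table
```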

In the above density plot, when we drop the pre-1976 outliers (around the time of the introduction of the CD), we still wonder what kind of distribution it represents and whether or not it is normal. What we do know, given the moment values calculated by R, is that there is a slight negative (left) skew and that the distribution is very peaked, with a kurtosis roughly twice the 3.0 value of a normal distribution!
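A sketch of that moment calculation, assuming the `moments` package (the write-up does not name which R package was actually used):

```r
# Skewness and kurtosis of the post-1976 years.
library(moments)
recent <- songs$year[songs$year > 1976]   # drop the pre-1976 outliers
skewness(recent)                          # slightly negative (left skew)
kurtosis(recent)                          # compare against 3.0 for a normal
```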

Q2: Why does the song count peak in 2007?

We see from the distribution that over 50% of the songs in the dataset fall in the 2000-2010 range. Looking further into the advancements in technology over time, we observed that the development of technology used to play and share music, such as mp3 players, iPods, and iPhones, during this period can explain this trend.

As a team, we looked over the history of music recording hardware and found that 2007 is the year the first-generation iPhone was released, but whether that is a significant contributing factor remains another question to be investigated. And why is there a sudden decline after 2007? Is it related to how the data was processed and gathered? Could another factor, like the economic climate for music industry sales, be a better explanation than the actual hardware?

Q3: If the song count distribution is not normal, what is it?

We are guessing it looks exponential and thought that Moore's Law might apply. For those who are not tech-savvy, the phenomenon of increasing hardware memory can be summarized in a sentence provided by Wikipedia:

"Moore's law" is the observation that, over the history of computing hardware, the number of transistors in a dense integrated circuit doubles approximately every two years.

attribution: Wikipedia on Moore's Law

The idea is that if hardware memory doubles biennially, then the number of songs stored on devices like the iPhone should also grow exponentially, with base equal to 2.

We remembered that in one of our homeworks in MATH 189: Data Analysis and Inference we read Nina Zumel's blog post on log transforming skewed distributions, so we tried to apply that to our year data as well.

  1. First, we compute the absolute song counts divided into two-year bins.

  2. Next, we take the base-2 logarithm of the Y-axis and plot a regression line.

  3. Finally, we summarize the regression by calculating the correlation (with a 95% confidence interval),

     $r^2_{\textrm{Year, } \log_2(\textrm{Count})} = 0.965153$

     and finding its equational form (the steps are sketched in R below):

     $\log_2(\textrm{Count}) = 0.14 \cdot \textrm{Year}_{bi} - 272.35$
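Here is a minimal sketch of those three steps, again using the hypothetical `songs` data frame from earlier:

```r
# Two-year bins, log2-transformed counts, and a regression summary.
songs$bin <- 2 * floor(songs$year / 2)          # two-year bins
counts    <- as.data.frame(table(songs$bin))
names(counts) <- c("bin", "count")
counts$bin    <- as.numeric(as.character(counts$bin))

fit <- lm(log2(count) ~ bin, data = counts)     # log2(Count) ~ Year
summary(fit)$r.squared                          # coefficient of determination
cor.test(counts$bin, log2(counts$count))        # correlation with 95% CI
```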

Although the relationship between the log-transformed song count and year is very close to linear, we still have not found a good answer to the question of what sort of distribution the song count follows or whether Moore's Law applies.

Additional reading:

  1. Zumel, Nina, and John Mount. Practical Data Science with R. Manning, 2014.
  2. Clauset, Aaron, Cosma Rohilla Shalizi, and Mark E. J. Newman. "Power-law distributions in empirical data." SIAM Review 51.4 (2009): 661-703.

A final note about studying the distribution of any given data: on these kinds of tough questions, it is often a good strategy to seek help from field experts, ask questions about the possible sources the data originated from, and listen to words of wisdom handed down by more experienced Data Science (wo)mentors. In particular, a friend kindly imparted some sound advice to us in regards to the Million Song Dataset:

"Your vibe attracts your tribe"

Kevin Hung, Founder of Data Science Student Society at UCSD


Related research: Relationship Resonance (http://www.humanthermodynamics.com/The_Physics_of_Relationships.pdf)

Temporal Data

Although comparing the other feature variables over time qualifies as neither univariate analysis nor, strictly speaking, time series analysis, we decided to proceed and visualize them for the sake of discovering any possible insights, and it made a huge difference. Some of the more interesting relationships are captured in the super scatter-boxplot-lineplot graphs below. By "interesting," we mean that the blue line plot, which connects the mean values of every two-year bin, is not flat and horizontal. Each relationship also possesses an interesting historical context.

Loudness Wars

The increase of music loudness over time is basically a result of the economic principle of competition: everyone is trying to get their voice heard!

Modern recordings that use extreme dynamic range compression and other measures to increase loudness therefore can sacrifice sound quality to loudness. The competitive escalation of loudness has led music fans and members of the musical press to refer to the affected albums as "victims of the loudness war."

attribution: Wikipedia on the Loudness Wars

Disco in the 70s

It is hard to draw conclusions here because of the high variance, but we can see the general tempo increasing until it peaks in the 1970s (disco music), after which it stays fairly constant. One thing we can gather from the higher variance is that people can now easily find something matching their preference or current mood, so all types of music lovers are satisfied. In the 1960s, pop, rock, and R&B were slower in tempo than their modern counterparts.

Album Oriented Rock Format

After a historic low around the late 1950s, we see an increase in (and maintenance of) longer song durations. This coincides with the rise of "album-oriented rock" during the 1960s and 1970s, when radio DJs were given the freedom to play long sets of songs, which consequently gave rise to song formats that weren't necessarily created as "singles," but rather constructed with whole albums in mind.

A great deal of thanks to music history expert and captivating storyteller Vicente Malave for helping us discover the above insights!


Feature Analysis

We calculated the correlations between features from both the extended-year and the UCI subsets and visualize their values in the heatmap below.
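A sketch of how such a heatmap might be generated in base R; `features`, standing for the numeric feature columns, is our own name:

```r
# Correlation heatmap over the numeric feature columns.
corr <- cor(features, use = "pairwise.complete.obs")
heatmap(corr, symm = TRUE, scale = "none", main = "Feature Correlations")
```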



Q5: Excluding the year labels, what are good features in our datasets we can use in our model?

Now that we have a more convenient view of the correlations, we can attempt to answer the above question. We see from the map of the 36 numerical variables that the most highly correlated features are the pairs examined in the sections below.

In the following sections, we propose descriptive as well as linear explanations of the variance in the response variables (y-axis) by finding the coefficient of determination $r^2$ and plotting simple linear models in R.
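Each of those fits follows the same recipe; a minimal sketch, where `df` and its column names are hypothetical:

```r
# Simple linear model between two features, with the fitted line overlaid.
fit <- lm(artist_hotness ~ artist_familiarity, data = df)
summary(fit)                                  # coefficients and r^2
plot(df$artist_familiarity, df$artist_hotness)
abline(fit, col = "blue")                     # fitted regression line
```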

Hotness & Familiarity

This is perhaps the most obvious correlation. The more people who are familiar with an artist, the hotter (more popular) that artist will be!


From the summary of the linear regression model between artist hotness and artist familiarity, we can see that Artist Familiarity and Hotness are positively correlated. This is reasonable to assume as listeners are generally more familiar with ‘hot’ artists.

$\hat{\textrm{hotness}} = 0.742 \cdot \textrm{familiarity} - 0.027$
$r^2_{\textrm{familiarity, hotness}} = 0.6623$


Loudness & Familiarity

This plot suggests a positive, albeit weak, correlation between loudness and familiarity, i.e., that as "loudness" increases, so does familiarity. This is an interesting result since it associates louder songs with greater recognition amongst listeners, though the small $r^2$ below means loudness explains little of the variance.

$\hat{\textrm{familiarity}} = 0.006 \cdot \textrm{loudness} + 0.693$
$r^2_{\textrm{loudness, familiarity}} = 0.05582$

Tag & Hotness


From the regression we see a slightly positive association between the number of tags and artist hotness, but it is hard to draw conclusions from these two variables, especially considering the inconsistent data for artist tags. We tried to make observations on how increasing the number of tags would increase an artist's hotness. Since the number of tags over the decades is increasing, this may explain the trend.

$\hat{\textrm{hotness}} = 0.299 \cdot \textrm{tags} + 0.428$
$r^2_{\textrm{tags, hotness}} = 0.2089$

Tags & Decade


The number of tags over the decades has seen a general increase. This is very likely a result of feasibility; new technology, especially in terms of information storage and the digitization of music, has placed more importance on music being properly tagged, labeled, and stored. This is especially true for the selling of music. Furthermore, this could also be a result of increased musical experimentation and greater exposure of this experimental music to the masses. As new genres gain more exposure, it becomes increasingly important to be able to identify their "sound" concisely.

Hotness & Decade


The plot above offers an interesting insight into how artists are recognized and heard by consumers. There is a very high density of "artist hotness" around the line y = 0.4, and we can see that this trend persists from the start of the timeline to the end. What this suggests is that artists considered "hot" are about equally popular in each of their respective decades. However, we do see that as time goes on, the variance of the data points increases significantly. This increasing deviation, especially around the turn of the millennium, could be explained by the increasing number of ways artists are able to broaden their audiences and self-advertise more easily and effectively, especially through the internet. Furthermore, this also suggests a change in attitude toward music and other forms of media, where "sharing" and "socializing" have become the norm.

Familiarity & Decade


We can see an increase in variance and a slight increase in familiarity over the decades. As digital music and the internet became more viable options, it became easier to access and share music. In addition, marketing and media consumption have also increased, leading to an audience hungry for more media. We can attribute the higher variance to the increased consumption of music. However, there is no sharp increase in familiarity because of the competitive nature of the music industry and the splintering of genres.

Name Length & Year


This graph displays a slightly downward trend in artist name length over the years. However, as with the "artist hotness" data, there is also an increase in variance as time progresses. We see that artists in the later decades were more experimental in trying out long names due to the increased use of stage names.




Linear Regression Models for Year Prediction: Ridge & LASSO

We have a wealth of data to work with, and model building can quickly become complicated, so we stick with our original objective and develop simple linear regression models using only the 12 timbre mean segment features in the UCI dataset to predict year.

Many people are familiar with the idea of the linear equation in one dimension:
$$y = ax + b$$

To generalize this idea in $p$ dimensions, we introduce the linear model as:
$$f(X) = \beta_0 + \sum\limits_{j=1}^p X_j \beta_j$$ where the $\beta$'s are the coefficients or parameters of our model, the $X_j\textrm{'s}$ are our explanatory/predictor variables and $f(X) = \hat{y}$ is our response variable.

The regression aspect refers to our goal in predicting year values -- a continuous numerical value. And to make sure that our predictions for year values are good, we introduce another function $\mathcal{L}(X, \beta, y)$, called a "loss", that gives us a sense of the holistic error value of our model. A typical and well known loss function is the infinitely differentiable squared error: $$(f(X; \beta) - y)^2$$ where $y$ is the training label (in our case the year of the song).

To estimate the error over our $p$-dimensional data points ranging from $i = 1, 2, ..., n$ in our sample, we use the residual sum of squares: $$\begin{align} RSS(\beta) &= \sum\limits_{i = 1}^n (f(X^{(i)}) - y^{(i)})^2 \\ &= \sum\limits_{i = 1}^n ((\beta_0 + \sum\limits_{j=1}^p {X_j}^{(i)} \beta_j) - y^{(i)})^2 \\ &= {{\|\hat{y} - y\|}}_2^2 \end{align}$$

The third line is simply the square of the $\ell$-2 norm of our residuals $\hat{y} - y$, and $X_0$ can be considered to implicitly have a constant value of 1 so that we can also write $RSS(\beta)$ as ${\|X\beta - y\|}_2^2$.

Now that we have our $RSS$ loss function, everything is great and we can optimize our model by minimizing the error: $$\hat{\beta} = \underset{\beta}{\arg\min} {\|X\beta - y\|}_2^2$$

Solving analytically, we set the gradient of $RSS(\beta)$ with respect to $\beta$ to zero, which yields the normal equations and the $\hat{\beta}$ we're looking for: $$\begin{align} \nabla_\beta RSS(\hat{\beta}) &= 2X^T(X\hat{\beta} - y) = 0 \\ (X^{T}X)\hat{\beta} &= X^{T} y \\ \hat{\beta} &= (X^TX)^{-1}X^{T} y \end{align}$$ assuming that $X^{T}X$ is nonsingular, i.e. invertible (a good reference on linear regression is Chapter 3 of The Elements of Statistical Learning by Hastie, Tibshirani, and Friedman).
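In code, the closed-form solution is a one-liner; a sketch assuming the design matrix `X` already carries a leading column of 1s for the intercept:

```r
# Solve the normal equations (X'X) beta = X'y directly...
beta_hat <- solve(t(X) %*% X, t(X) %*% y)
# ...or via a QR decomposition, which is numerically friendlier.
beta_hat <- qr.solve(X, y)
```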

Ordinary Least Squares

Since the last 51,630 examples of the UCI dataset were set apart for testing, we randomly selected 20% of the 463,715 training data points to serve as a cross-validation set for a task further on and used the remaining 80% as training data. Using R's lm command for linear models, we obtained a training error of $$MSE_{train}(OLS) = 100.7227$$ and a testing error of $$MSE_{test}(OLS) = 100.1526$$, which can be interpreted to say that the average residual is approximately 10 years, since the $MSE$ has units of squared years. For those who are new to $MSE$, it is simply the $RSS$ divided by the number of data points $n$, giving us a sense of how good our predictions are: $MSE = {RSS \over n}$.
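A sketch of that fit and evaluation; the `train`/`test` data frames, their `year` column, and the timbre feature columns are assumed names:

```r
# OLS on the 12 timbre means, evaluated by mean squared error.
ols <- lm(year ~ ., data = train)

mse <- function(pred, actual) mean((pred - actual)^2)
mse(predict(ols, newdata = train), train$year)   # training MSE
mse(predict(ols, newdata = test),  test$year)    # testing MSE
```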

A residual plot below visualizes the distribution of the errors between predicted and actual years on the training set, with our OLS model colored in blue. It looks like most of the residuals come from songs between the mid 1980s and the 2000s, which is reasonable since the pie chart we created in the explore section showed that those decades account for roughly 90% of the training sample.

We have now finished developing our first linear regression model: Ordinary Least Squares! And the testing error is surprisingly lower than the training error, so everything seems good so far, right? Not quite, because there might still be room for improvement!

The OLS model has a reputation of being a "high variance" learner, which is a nice way to say that it has a habit of overfitting the training data and not being able to generalize well to newer and unseen data down the line. We'll try two more linear regression models that we hope will increase the accuracy of our predictions.

Ridge Regression

In the effort to combat overfitting, we incorporate an extra "tuning" parameter $\lambda$, which we choose in the cross-validation stage, into a "penalty" function $\mathcal{R}(\lambda, \beta)$ that helps regularize the model. Adding it onto the original loss function from OLS gives a new minimization objective: $$\hat{\beta} = \underset{\beta}{\arg\min} \big\{ {{\|\hat{y} - y\|}}_{2}^2 + \lambda {{\|\beta\|}}_{2}^2 \big\}$$

Running cross-validation on the randomly selected 20% of the training set, we see in the plot below that $\lambda_{min} = 740$, although the $MSE$ values for both OLS and Ridge stay close to $101 \textrm{ year}^2 \approx (10 \textrm{ years})^2$.
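One way to reproduce such a cross-validation curve is the glmnet package, where alpha = 0 selects the ridge penalty (our original run used the R code accompanying the lecture resources mentioned in the next paragraph); `timbre_cols` is an assumed name:

```r
# Cross-validated ridge regression over a lambda grid.
library(glmnet)
x_train  <- as.matrix(train[, timbre_cols])
cv_ridge <- cv.glmnet(x_train, train$year, alpha = 0)
cv_ridge$lambda.min          # the lambda minimizing CV error
plot(cv_ridge)               # MSE versus log(lambda)
```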

The cross-validation plot above was brought to you by the wisdom of Mr. Tibshirani the younger (Ryan, son of Robert, who is a coauthor of the ESL book) and his great online resources (lecture slides and accompanying R code for linear regression) from his Carnegie Mellon data mining class. In lecture 17, he explains the concept of degrees of freedom ($df$) to estimate the number of effective parameters to use in our model. By plotting the coefficient path of Ridge regression and intersecting it with $df(\lambda)$ for the $\lambda$ we found in the cross-validation stage, we see that nearly all 12 timbre segments are useful predictors in our regression task. Since little shrinking of coefficients was accomplished given our $\lambda$, the $MSE_{test}(Ridge) = 100.1542$ is close to $MSE_{test}(OLS) = 100.1526$ without much notable difference.

The LASSO: Least Absolute Shrinkage and Selection Operator

We attempt the LASSO as another option to improve the $MSE$ of our linear regression models, changing the penalty function to use the $\ell$-1 norm: $$\hat{\beta} = \underset{\beta}{\arg\min} \big\{ {{\|\hat{y} - y\|}}_2^2 + \lambda {{\|\beta\|}_1} \big\}$$

Applying cross-validation steps similar to the ridge's, this time using the lars package in R, the plot below shows no improvement at all in terms of $MSE$ when applying LASSO (though it's possible we made a mistake somewhere). The OLS error is still better, and as a result $\lambda = 0$.
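A sketch of those steps with the lars package named above, reusing the `x_train` matrix from the ridge sketch:

```r
# Cross-validated LASSO path and the full coefficient profile.
library(lars)
cv_out   <- cv.lars(x_train, train$year, K = 10, type = "lasso")  # CV curve
fit_lars <- lars(x_train, train$year, type = "lasso")             # full path
plot(fit_lars)               # coefficients versus shrinkage factor
```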

The following plot confirms the cross-validation error described above in that no coefficients are shrunk, since $\lambda = 0$ and OLS is preferred. What is noticeable in both coefficient plots for Ridge and LASSO, however, is that out of the 12 timbre segment features, loudness, b6, brightness, and flatness seem to be the last ones shrunk as $df \rightarrow 0$.

Q6: Can predictor variables, even if they are all considered effective, be more or less effective relative to one another?

Based on this observation and plotting the timbre segment values over the decades, we see that loudness is consistently increasing, b6 and flatness are consistently decreasing, and brightness alternates. The other timbre segments appear almost unchanged across time. Although the tuning/regularization parameter $\lambda$ found in cross-validation and the coefficient plots with $df(\lambda)$ show that all 12 timbre segments are effective predictors, is there something special about the loudness, b6, brightness, and flatness features in particular?


Comparison with Classic Model: k-NN Regression

Looking over the original research paper describing the Million Song Dataset, there is an interesting section (4.2) near the end explaining how the first benchmarked model used is the simple, classic, and non-parametric k-nearest neighbors algorithm. At UCSD, the CSE department's foundational undergraduate machine learning class is Introduction to AI: A Statistical Approach, where students are often introduced to k-NN as their first algorithm and use it on the very famous Iris flower dataset for classification.

For the song dataset we're working with, however, we will use k-NN for regression by averaging the year labels of the k closest data points under the Euclidean distance metric: $$ \textrm{YearLabel}(\textrm{song}_x) = {1 \over{k} } \sum \limits_{\textrm{song}_z \in \mathcal{N}_k} \textrm{YearLabel}(\textrm{song}_z) $$ where $\mathcal{N}_k$ is the neighborhood, i.e. the set of the $k$ data points $\textrm{song}_z$ with the smallest squared Euclidean ($\ell$-2) distances $d(\textrm{song}_x, \textrm{song}_z)$.

With the power of scikit-learn in Python, k-NN training and prediction on our 370,972 data points took around 20 minutes, possibly because the naive complexity is $O(N^2)$ without optimizing structures or hashing techniques, so we are looking at roughly $370972^2 \approx 137 \textrm{ billion}$ pairwise comparisons (we can examine the implementation more in depth at another point in time).
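For readers who want to stay in R, here is an equivalent sketch using the FNN package (our actual run used scikit-learn); `x_train`, `x_cv`, and the label columns are assumed names:

```r
# k-NN regression: predict year as the mean year of the k nearest neighbors.
library(FNN)
knn_fit <- knn.reg(train = x_train, test = x_cv, y = train$year, k = 10)
mean((knn_fit$pred - cv$year)^2)   # cross-validation MSE at k = 10
```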

The $MSE$ values on the cross-validation set, however, were more informative and showed a decreasing trend as the number of neighbors $k$ increased in intervals of 10. Again, the majority of the $MSE$ values, with the exception of $k = 1$, are around $100 \textrm{ year}^2 \approx (10 \textrm{ years})^2$, as found by our linear regression models. Since the elbow in the scree plot above is located at $k = 10$, we choose that as our number of neighbors on the testing set instead of the larger values of $k$, which have lower $MSE$ but do not differ much in error past the knee point. With 10 neighbors, the testing error of our k-NN regression model is $MSE_{test}(kNN) = 106.7306 \textrm{ year}^2 \approx (10.33 \textrm{ years})^2$, which is not far from the 10.20 value reported in Table 5 of the Million Song Dataset paper using 50 neighbors. Our ordinary least squares linear regression model has a small victory of around 0.33 years less error on average, and it's also much faster to train!



Experimenting and Combining Models

To conclude our presentation, we reflect this theme of teamwork by creating a visual model that combines the efforts and findings from linear regression (the three interesting timbre segments: loudness, b6, and flatness) with the results from k-nearest neighbors regression on the testing set.

A fleeting observation from this plot brings to light a possible separation boundary between 2000s and 1990s songs when $b6 \approx -40$, and more observations can likely be drawn simply because we've decided to see the data in a novel way. New ideas are created by visualizing from different perspectives, and we're curious to hear: what's yours?