A short tutorial on using change point analysis on Spotify data, showing how I should listen to at least one more Jackson Browne album.
Jackson Browne is one of my favorite songwriters. I discovered him after hearing Late for the Sky in the iconic and harrowing Taxi Driver scene, and immediately got my hands on his entire discography.
Browne’s got a way of combining graceful and often melancholic lyrics with simple and honest music that I’ve yet to hear anyone else match. Most of his early work is concept albums, where tracks flow into each other, such as in the beautiful transition between the last two tracks on 1973’s For Everyman. However, the quality has varied over his 55 years in the business.
I’m very fond of Browne’s first five albums, before everything went a bit belly up in the 1980s. Sure, songs like Lawyers in Love and In the Shape of a Heart are still great songs. However, the albums no longer speak to me. His last two albums, Time the Conqueror and Standing in The Breach, show a bit of a return to the good old days, though not with the strength of his early work.
In this short Python tutorial, we’ll see if my preconception that his discography follows a good-bad-good trajectory is reflected in audio features from the Spotify API. We’ll use the Generalized Spotify Analyzer to get metadata including audio features from Jackson Browne’s entire discography, and then use change point analysis to see how well my subjective preference is reflected in the audio features Spotify provides.
To follow along, you should first go through my 3-part introduction to the GSA.
If you have already downloaded GSA, please download it again, as we’ll be using some new features.
Building a dataset
First, let’s build our dataset. In this project we collect Browne’s discography into a CSV file with four columns: Title, AlbumURI, AlbumOrder, and Like.
The first three columns are simply the name of the album, its associated URI on Spotify (in the format spotify:album:hashvalue), and its chronological order. The last column is a simple binary value indicating whether I like the album (coded as 1) or dislike it (coded as 0).
You can view the dataset here.
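As a rough illustration of that layout, here is a minimal sketch that builds a two-row version of the file using only the standard library. The titles, order, and Like values follow the text; the URIs are made-up placeholders (not real Spotify IDs), and the file is kept in memory rather than written to disk.

```python
import csv
import io

# Two example rows. The 22-character hashes are placeholders, not real Spotify IDs.
rows = [
    {"Title": "For Everyman", "AlbumURI": "spotify:album:" + "A" * 22, "AlbumOrder": 2, "Like": 1},
    {"Title": "Hold Out", "AlbumURI": "spotify:album:" + "B" * 22, "AlbumOrder": 6, "Like": 0},
]

# Write the rows as CSV text with the four columns used in this tutorial.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["Title", "AlbumURI", "AlbumOrder", "Like"])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)

# Read the text back to check that the layout round-trips.
parsed = list(csv.DictReader(io.StringIO(csv_text)))
```

Note that values read back from a CSV are strings, which is one reason the script uses pandas to load the real file.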
Now we’re ready to start our script. Everything listed below is available on GitHub, specifically in GSA_exampleJacksonBrowne.py.
Setting up our script
We start off by importing the packages we need. If you’ve already done the 3-part introduction to GSA you’ll have most of these packages installed. There are two new ones in this example: scikit-learn (sklearn) and ruptures. We’ll use sklearn for its preprocessing module, and ruptures to do the change point analysis.
#%% Do imports
# for handling data
import pandas as pd
import numpy as np
from sklearn import preprocessing
from sklearn.decomposition import PCA
# plot-related libraries
import seaborn as sns
import matplotlib.pyplot as plt
# for easier access to file system
import os
# for breakpoints
import ruptures as rpt
# import GSA
import GSA
# multiprocessing for improved speed
from joblib import Parallel, delayed
# tqdm for progressbar
from tqdm import tqdm
# make a folder for plots
if not os.path.exists('Plots'):
    os.makedirs('Plots')
We’ll then use GSA to authenticate with Spotify’s API.
Next step is reading in the dataset, and doing a bit of cleaning to get only the actual album ID from the URI.
NOTE: Medium automatically changes quotation marks, so depending on your system you may have to change them back if you’re copy-pasting code from here.
#%% Read in dataset
dataset = pd.read_csv('Data/JacksonBrowne.csv', encoding='UTF-8', na_values='', index_col=None)
# This CSV contains a column "AlbumURI" which we'll use to get the album IDs
allAlbums = dataset.AlbumURI.tolist()
# Extract just ID, by taking index 14:36.
allAlbumsID = [thisAlbum[14:36] for thisAlbum in allAlbums]
# Add back into the dataset
dataset['AlbumID'] = allAlbumsID
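The slice 14:36 works because the prefix spotify:album: is 14 characters long and the ID that follows is 22 characters. If you would rather not rely on those positions, a small helper can split on the colons instead. This is only a sketch: album_id_from_uri is a hypothetical name, not part of GSA, and the URI is a placeholder.

```python
def album_id_from_uri(uri):
    """Return the ID part of a 'spotify:album:<id>' URI (hypothetical helper, not from GSA)."""
    parts = uri.strip().split(":")
    if len(parts) != 3 or parts[0] != "spotify" or parts[1] != "album":
        raise ValueError(f"Not a Spotify album URI: {uri!r}")
    return parts[2]

# Placeholder URI; the hash is made up for illustration.
example_uri = "spotify:album:" + "x" * 22
example_id = album_id_from_uri(example_uri)

# The split-based result matches the fixed 14:36 slice.
same_as_slice = example_id == example_uri[14:36]
print(example_id, same_as_slice)
```

The upside is that a malformed row in the CSV raises an error right away instead of silently producing a wrong ID.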
We’re now ready to get metadata for the albums. For this we’ll use GSA.getAlbumInformation, and run it in parallel using the joblib package. For a small project like this it’s not really necessary, but for larger projects it gives a noticeable increase in speed.
#%% Get tracks from album
IDlist_tqdm = tqdm(allAlbumsID, desc='Getting audio features')
results = Parallel(n_jobs=6, require='sharedmem')(delayed(GSA.getAlbumInformation)(thisAlbum) for thisAlbum in IDlist_tqdm)
# set n_jobs to the number of threads you want to use on your CPU.
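If the Parallel(...)(delayed(...)) idiom is new to you, here is a minimal sketch of the same pattern with a stand-in function. fake_fetch is hypothetical and only mimics a slow download; it says nothing about how GSA.getAlbumInformation works internally. Passing require="sharedmem" asks joblib for a thread-based backend, which suits work that mostly waits on the network.

```python
import time
from joblib import Parallel, delayed

def fake_fetch(album_id):
    """Hypothetical stand-in for a network call: wait briefly, then return a label."""
    time.sleep(0.05)
    return f"fetched:{album_id}"

# Placeholder IDs, not real Spotify album IDs.
album_ids = ["id_a", "id_b", "id_c", "id_d"]

# Two worker threads share the four calls.
fetched = Parallel(n_jobs=2, require="sharedmem")(
    delayed(fake_fetch)(album_id) for album_id in album_ids
)
print(fetched)
```

joblib returns the results in the same order as the inputs, so they can be matched back to the album IDs afterwards.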
Having downloaded the metadata, we’ll read it into a dataframe and add back the supplementary information from the original dataset.
#%% Add the supplementary information to the dataframe
# First collect all the albums, as not all might have been successfully downloaded
output = []
for thisList in results:
    if thisList == 'error':
        print('Found an album not downloaded.')
    else:
        thisFrame = pd.read_pickle(thisList)
        output.append(thisFrame)
output = pd.concat(output)
# Remove any where TrackName is EMPTYDATAFRAME
empties = output[output['TrackName'] == 'EMPTYDATAFRAME']
output.drop(empties.index, inplace=True)
# Merge with original dataset to get supplementary information
merged_output = dataset.merge(output, on='AlbumID', how='left')
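To make the left merge concrete: each album row is matched to every track row with the same AlbumID, so the album-level columns get repeated for each track. Here is a toy sketch with made-up IDs and track names. The validate argument is an optional extra check, not something the original script uses.

```python
import pandas as pd

# One row per album (made-up IDs).
albums = pd.DataFrame({"AlbumID": ["a1", "b2"], "Title": ["Album A", "Album B"], "Like": [1, 0]})

# One row per track; album "b2" has no downloaded tracks in this toy example.
tracks = pd.DataFrame({"AlbumID": ["a1", "a1"], "TrackName": ["Track 1", "Track 2"]})

# A left merge keeps every album, even if no tracks were found for it.
# validate="one_to_many" raises an error if an AlbumID appears twice in `albums`.
merged = albums.merge(tracks, on="AlbumID", how="left", validate="one_to_many")
print(merged)
```

Albums without any tracks show up with missing values in the track columns, which is an easy way to spot failed downloads.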
The goal of our analysis is to see whether my subjective preference is reflected in the audio features. To do that we’ll use change point detection, an analysis method that aims to identify whether, where, and how many times a signal changes. In this particular case we’re working from the assumption that two such changes occur: one at 1980’s Hold Out album, which starts the string of albums I don’t like, and another at the 2008 album Time the Conqueror, where the albums I like begin again. I’ve illustrated this in the image below:
We’re interested in album level characteristics, but audio features are given per track. A simple way of summarizing an album is by taking the mean values of the audio features. This captures the general trend in the audio features. In addition, we’ll also take the standard deviation of the audio features, to get a sense of the variation present in each album.
Do note that this is a simplification, as many of the audio features do not adhere to a normal distribution. If you intend to do this type of analysis for a real scientific project, you’ll need to think hard about how you summarize album-level data.
Code for how this summary is calculated can be found in the section marked “Getting album level data” in the GSA_exampleJacksonBrowne script.
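For readers who want the gist without opening the script, here is a minimal sketch of a per-album mean and standard deviation with pandas. It is not the code from the script: the column names and values are assumptions made up for illustration.

```python
import pandas as pd

# Toy track-level data. Column names and values are made up for illustration;
# the real frame is merged_output with Spotify's audio features.
track_data = pd.DataFrame({
    "AlbumOrder":   [1, 1, 1, 2, 2],
    "Acousticness": [0.80, 0.60, 0.70, 0.20, 0.40],
    "Valence":      [0.30, 0.50, 0.40, 0.90, 0.70],
})

# Mean and standard deviation per album, sorted chronologically.
grouped = track_data.groupby("AlbumOrder")[["Acousticness", "Valence"]]
album_summary = grouped.agg(["mean", "std"]).sort_index()

# Flatten the two-level column names, e.g. ("Valence", "mean") -> "Valence_mean".
album_summary.columns = ["_".join(col) for col in album_summary.columns]
print(album_summary.round(3))
```

pandas computes the sample standard deviation by default, so an album with a single track would get a missing value in the std columns.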
Change point detection
For an introduction and overview of change point detection methods, see the excellent “Selective review of offline change point detection” paper. Here we’ll use a multivariate offline algorithm with a pre-specified number of change points, with a radial basis function (RBF) kernel as the cost function that the algorithm minimizes.
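To give a feel for what a dynamic programming search with a fixed number of change points does, here is a simplified sketch. It is not the ruptures implementation: it uses a plain squared-error cost around each segment’s mean instead of an RBF kernel cost, and the function names and toy signal are made up for illustration. The idea is to build the best split of every prefix of the signal with k change points from the best splits with k - 1 change points.

```python
import numpy as np

def segment_cost(signal, start, end):
    """Sum of squared distances to the segment mean for signal[start:end]."""
    segment = signal[start:end]
    return float(((segment - segment.mean(axis=0)) ** 2).sum())

def best_breakpoints(signal, n_bkps):
    """Find the n_bkps breakpoints that minimize the total segment cost.

    Simplified illustration of the dynamic programming idea; not the ruptures code.
    """
    n = len(signal)
    # best[(k, end)] = (cost, breakpoints) for the cheapest split of signal[:end]
    # that uses exactly k breakpoints.
    best = {(0, end): (segment_cost(signal, 0, end), []) for end in range(1, n + 1)}
    for k in range(1, n_bkps + 1):
        for end in range(k + 1, n + 1):
            candidates = []
            for last in range(k, end):  # position of the last breakpoint
                prev_cost, prev_bkps = best[(k - 1, last)]
                total = prev_cost + segment_cost(signal, last, end)
                candidates.append((total, prev_bkps + [last]))
            best[(k, end)] = min(candidates, key=lambda c: c[0])
    # Like ruptures, finish the list with the end of the signal.
    return best[(n_bkps, n)][1] + [n]

# Toy signal with two features and level shifts at positions 4 and 9 (made-up values).
rng = np.random.default_rng(0)
levels = np.repeat([[0.0, 1.0], [1.0, 0.0], [0.0, 0.5]], [4, 5, 3], axis=0)
toy_signal = levels + rng.normal(scale=0.05, size=levels.shape)

found = best_breakpoints(toy_signal, n_bkps=2)
print(found)
```

Because every split is evaluated, the result is the exact minimum of the chosen cost for the requested number of change points. That is affordable here since a discography is a very short signal.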
As the audio features vary quite a bit in their range of values, we’ll first normalize the data using a MinMax scaler from sklearn.
# Normalize data using a MinMax scaler
scaler = preprocessing.MinMaxScaler()
scaledData = scaler.fit_transform(audioFeatures.values)
scaledAudioFeatures = pd.DataFrame(scaledData, columns=audioFeatures.columns)
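As a quick check of what the scaler does, this sketch compares sklearn’s MinMaxScaler with the formula (x - min) / (max - min) applied per column. The numbers are made up and only stand in for two features on very different scales.

```python
import numpy as np
from sklearn import preprocessing

# Made-up values for two features on very different scales.
raw = np.array([[0.10, -20.0],
                [0.50, -10.0],
                [0.90,  -5.0]])

# Manual min-max scaling: (x - min) / (max - min), computed per column.
manual = (raw - raw.min(axis=0)) / (raw.max(axis=0) - raw.min(axis=0))

# The same thing with sklearn's MinMaxScaler.
scaled = preprocessing.MinMaxScaler().fit_transform(raw)

print(np.allclose(manual, scaled))
```

After scaling, every feature runs from 0 to 1, so no single feature dominates the distance calculations in the change point cost just because of its units.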
Then we’ll run the change point detection algorithm.
# Calculate breakpoints
model = rpt.Dynp(model='rbf', min_size=1).fit(np.array(scaledAudioFeatures))
breakpoints = model.predict(n_bkps=2)
In the last line here, we ask the model to predict two change points (n_bkps=2) in the signal. The results are stored in the variable breakpoints. The last value in the list is the end of the signal, so the first two are the ones we care about. The algorithm detects the first change point at array position 5 (album number 6), and the second at array position 10 (album number 11).
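To make the mapping from array positions to album numbers explicit, here is a small sketch that turns a breakpoint list into segments. The helper name is hypothetical. The values 5 and 10 are the change points reported in this post, while 14 as the signal length is an assumption (the album count implied by the album numbers in the text), not a value copied from the output.

```python
def breakpoints_to_segments(breakpoints):
    """Turn [5, 10, 14]-style breakpoints into (start, end) index pairs, end exclusive."""
    starts = [0] + breakpoints[:-1]
    return list(zip(starts, breakpoints))

# 5 and 10 are the reported change points; 14 is an ASSUMED signal length.
example_breakpoints = [5, 10, 14]
segments = breakpoints_to_segments(example_breakpoints)

for start, end in segments:
    # Array positions are 0-based, album numbers are 1-based.
    print(f"positions {start}-{end - 1} -> albums {start + 1}-{end}")
```

One thing worth double-checking in your own runs: as far as I can tell from the ruptures documentation, search methods such as Dynp take a jump argument (default 5) that restricts candidate change points to multiples of that value. For a signal as short as a discography, passing jump=1 lets every album be considered as a possible change point.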
To inspect how the results match my subjective taste we’ll make a plot using seaborn. As plotting all the audio features gives a messy graph, we’ll just use Acousticness and Valence for this plot.
We’ll plot these values as a line plot, draw vertical bars at the change points, and color the background of the plot for a nicer visual presentation. On the x-axis we’ll plot the album names, and color them according to my preferences. A blue color indicates albums I like, and a red color indicates albums I dislike.
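Here is a rough sketch of such a plot. To keep it short it calls matplotlib directly rather than seaborn (seaborn’s line plots draw onto the same kind of axes), and every value in it is made up: the album labels, the feature values, the Like flags, and the change points are toy data, not the real results.

```python
import io
import matplotlib
matplotlib.use("Agg")  # render without a display; drop this line in an interactive session
import matplotlib.pyplot as plt

# Made-up values for six albums; these are not the real audio features.
album_titles = ["A", "B", "C", "D", "E", "F"]
acousticness = [0.7, 0.6, 0.2, 0.3, 0.6, 0.7]
liked = [1, 1, 0, 0, 1, 1]
change_points = [2, 4]  # hypothetical change points for the toy data

fig, ax = plt.subplots(figsize=(8, 3))
positions = list(range(len(album_titles)))
ax.plot(positions, acousticness, marker="o", label="Acousticness")

# Shade the background of each segment, then draw a vertical bar at each change point.
edges = [0] + change_points + [len(album_titles) - 1]
for i, (left, right) in enumerate(zip(edges[:-1], edges[1:])):
    ax.axvspan(left, right, alpha=0.1, color="grey" if i % 2 else "gold")
for cp in change_points:
    ax.axvline(cp, color="black", linestyle="--")

# Album names on the x-axis, blue for liked and red for disliked.
ax.set_xticks(positions)
ax.set_xticklabels(album_titles)
for tick_label, like in zip(ax.get_xticklabels(), liked):
    tick_label.set_color("blue" if like else "red")

ax.legend()

# Render to an in-memory PNG; use fig.savefig("Plots/some_name.png") to write a file instead.
png_buffer = io.BytesIO()
fig.savefig(png_buffer, format="png")
```

Each vertical bar sits on the first album of a new segment, which matches how the change point positions are reported.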
As we can see, the change point detection is a good fit for my subjective preference. It clearly marks Hold Out as a change from the previous five albums. However, the change back occurs earlier than I expected, at the Looking East album, instead of at Time the Conqueror.
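To put numbers on that comparison, the expected change points can be read off the Like column and set against the detected ones. This is a sketch under an assumption: the Like vector is reconstructed from the text as 14 albums (5 liked, 7 disliked, 2 liked), not loaded from the actual CSV.

```python
import numpy as np

# Like values in chronological order. ASSUMPTION: 14 albums, split 5 liked,
# 7 disliked, 2 liked, as implied by the album numbers mentioned in the text.
like = np.array([1] * 5 + [0] * 7 + [1] * 2)

# A change point sits at every position where the value differs from the previous one.
expected = (np.flatnonzero(np.diff(like) != 0) + 1).tolist()

# Change points reported by the algorithm (array positions).
detected = [5, 10]

# Negative offset = detected earlier than expected.
offsets = [d - e for d, e in zip(detected, expected)]
print(expected, offsets)
```

Under that assumption the expected change points are array positions 5 and 12. The first detected change point matches exactly, and the second comes two albums early, at Looking East rather than Time the Conqueror.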
Perhaps I should revisit Looking East and The Naked Ride Home to see if they’re to my liking.
In this little tutorial we used the GSA to analyze Jackson Browne’s discography. We tested whether my subjective preference was reflected in the audio features available from Spotify’s API by using change point detection analysis. It turns out my taste is pretty well reflected in these values calculated from the audio waveforms!
The scripts and data used here are available on GitHub, and I’m happy to receive pull requests if you have suggestions to improve the code.
If this post has made you curious about Jackson Browne, take a look at this classic live recording!