Playlist Recommendation System

Using the Spotify Million Playlist Dataset, the goal is to build a song recommendation model that can assemble playlists with a consistent vibe.

The project processes the Spotify Million Playlist Dataset (MPD) through an Apache Spark ETL pipeline, then serves analytics and recommendation data via a FastAPI backend and a Vite + React dashboard.


Project Structure

backend/          Python package — ETL pipeline + FastAPI
  etl/            Spark + pandas ETL logic
  api/            FastAPI application
  utils/          Shared config, logging, Spark session, schema
  tests/          pytest test suite
frontend/         Vite + React dashboard (Recharts)
dataset/data/     MPD JSON slices (not tracked in git)
output/           ETL outputs (not tracked in git)
hadoop/           Windows winutils for local Spark (not tracked in git)

Dataset

The Spotify Million Playlist Dataset is stored as ~1000 JSON slices under dataset/data/, each containing 1000 playlists.

{
  "playlists": [{
    "pid": 0,
    "tracks": [{
      "track_uri": "spotify:track:...",
      "track_name": "...",
      "artist_name": "...",
      "album_name": "...",
      "album_uri": "spotify:album:...",
      "artist_uri": "spotify:artist:..."
    }]
  }]
}
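
To make the nesting concrete, here is a minimal sketch of flattening one slice into one record per playlist/track pair. It uses plain Python rather than Spark, the function name `flatten_slice` and the `position` field are hypothetical, and the sample values are made up for illustration.

```python
import json
from typing import Iterator


def flatten_slice(slice_data: dict) -> Iterator[dict]:
    """Yield one flat record per (playlist, track) pair in an MPD slice."""
    for playlist in slice_data.get("playlists", []):
        for position, track in enumerate(playlist.get("tracks", [])):
            yield {
                "pid": playlist["pid"],
                "position": position,  # hypothetical: order within the playlist
                "track_uri": track["track_uri"],
                "track_name": track["track_name"],
                "artist_name": track["artist_name"],
                "album_name": track["album_name"],
            }


# Tiny in-memory stand-in for one slice file (made-up values).
sample = json.loads("""
{
  "playlists": [
    {"pid": 0, "tracks": [
      {"track_uri": "spotify:track:aaa", "track_name": "Song A",
       "artist_name": "Artist X", "album_name": "Album 1"},
      {"track_uri": "spotify:track:bbb", "track_name": "Song B",
       "artist_name": "Artist Y", "album_name": "Album 2"}
    ]},
    {"pid": 1, "tracks": [
      {"track_uri": "spotify:track:aaa", "track_name": "Song A",
       "artist_name": "Artist X", "album_name": "Album 1"}
    ]}
  ]
}
""")

rows = list(flatten_slice(sample))
```

Each record carries its playlist's `pid`, so the flat table can be grouped back into playlists later.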

ETL Pipeline

Run once to populate output/ before starting the API.

python -m backend.run_etl

The pipeline reads all JSON slices with Spark, flattens the nested playlists and tracks, strips Spotify URI prefixes, and writes six outputs:

File                                  Description
output/tracks.parquet                 Flattened, cleaned track records
output/id_mappings.parquet            Stable track_uri → track_id (contiguous ints for models)
output/stats/track_counts.parquet     Track appearance frequency across playlists
output/stats/artist_counts.parquet    Artist appearance frequency
output/stats/playlist_sizes.parquet   Per-playlist track counts
output/interaction_matrix.npz         Sparse playlist–track COO matrix for collaborative filtering
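
The URI stripping, ID mapping, and interaction matrix steps can be sketched in plain Python as follows. This is an illustration under assumptions, not the Spark code in backend/etl/: the helper names are hypothetical, IDs are made stable by sorting the URIs, and the matrix is treated as binary (1 if a track appears in a playlist at all).

```python
def strip_prefix(uri: str) -> str:
    """Drop the 'spotify:<kind>:' prefix by keeping the part after the last colon."""
    return uri.rsplit(":", 1)[-1]


def build_id_mapping(track_uris) -> dict:
    """Map each distinct URI to a contiguous int; sorting keeps IDs stable across runs."""
    return {uri: idx for idx, uri in enumerate(sorted(set(track_uris)))}


def build_coo(pairs, track_ids):
    """Return (rows, cols, data) coordinate lists for a binary playlist x track matrix."""
    # De-duplicate so a track repeated within a playlist is counted once (assumption).
    seen = sorted({(pid, track_ids[uri]) for pid, uri in pairs})
    rows = [pid for pid, _ in seen]
    cols = [tid for _, tid in seen]
    data = [1] * len(seen)
    return rows, cols, data


# Made-up (pid, track_uri) pairs; playlist 1 contains the same track twice.
pairs = [
    (0, "spotify:track:bbb"),
    (0, "spotify:track:aaa"),
    (1, "spotify:track:bbb"),
    (1, "spotify:track:bbb"),
]

cleaned = [(pid, strip_prefix(uri)) for pid, uri in pairs]
track_ids = build_id_mapping(uri for _, uri in cleaned)
rows, cols, data = build_coo(cleaned, track_ids)

# With SciPy installed, the three lists can be turned into a sparse matrix and saved:
#   from scipy.sparse import coo_matrix, save_npz
#   matrix = coo_matrix((data, (rows, cols)), shape=(2, len(track_ids)))
#   save_npz("interaction_matrix.npz", matrix)
```

Contiguous integer IDs matter because the track_id doubles as the column index of the matrix, so there are no gaps or wasted columns.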

ETL history: the pipeline started as in-memory CSV processing, moved to streaming Parquet generators, was parallelized with ProcessPoolExecutor (~277s → ~50s on M3), and was then migrated to Apache Spark.


Backend API

uvicorn backend.api.main:app --reload

Runs at http://localhost:8000 and loads the Parquet files from output/ at startup.

Endpoint                          Description
GET /api/stats?top_n=10           Summary counts + top tracks and artists
GET /api/top-tracks?limit=20      Ranked track list
GET /api/top-artists?limit=20     Ranked artist list
GET /api/tracks?page=0&size=50    Paginated full track table

Interactive docs: http://localhost:8000/docs
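
For scripted access, a minimal client sketch using only the standard library might look like this. The helper names `build_url` and `fetch_json` are hypothetical, and the response shape is not documented here, so the sketch only parses the JSON body without assuming its fields.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://localhost:8000"  # default uvicorn address from this README


def build_url(path: str, **params) -> str:
    """Join the base URL, an endpoint path, and optional query parameters."""
    query = urlencode(params)
    return f"{BASE_URL}{path}?{query}" if query else f"{BASE_URL}{path}"


def fetch_json(path: str, **params):
    """GET an endpoint and parse the JSON body (requires the API to be running)."""
    with urlopen(build_url(path, **params), timeout=10) as response:
        return json.load(response)


# Example calls, with the server started via uvicorn:
#   fetch_json("/api/top-tracks", limit=5)
#   fetch_json("/api/tracks", page=0, size=50)
```

Note that the page parameter in the endpoint list starts at 0, so the first page of /api/tracks is page=0.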


Frontend Dashboard

cd frontend
npm install      # first time only
npm run dev      # → http://localhost:5173

Shows four stat cards (unique tracks, artists, playlists, avg playlist size), a top-tracks table, and a top-artists bar chart. Reads from the API at http://localhost:8000.


Setup

API only (no ETL, no PySpark):

pip install -r backend/api/requirements.txt
uvicorn backend.api.main:app --reload

Full local dev (ETL + API):

pip install -r requirements.txt

PySpark requires Java 8 or later. On Windows it also needs Hadoop winutils; see the hadoop/ directory.


Tests

pytest -q