A PySpark ETL pipeline that cleans the MovieLens dataset and exports structured CSVs for Neo4j knowledge graph construction and Power BI visual analytics.
Course: CSC 501 — University of Victoria
Movie-Knowledge-Graph-Analysis/
├── data/
│ ├── raw/ # Original MovieLens CSVs (not tracked by git)
│ │ ├── movies.csv
│ │ ├── ratings.csv
│ │ └── tags.csv
│ └── manual fixes/
│ └─── movieId_genre.json # Manual genre patches for movies missing genre info
├── scripts/
│ ├── preprocessor.py # Data loading & cleaning functions
│ ├── data_analysis.py # Standalone analysis report functions
│ ├── io_utils.py # CSV export utilities (Neo4j / Power BI)
│ └── main.py # Pipeline orchestrator (entry point)
├── cypher/
│ ├── schema.cypher # Constraints & indexes
│ ├── import_nodes.cypher # LOAD CSV — create nodes
│ ├── import_relationships.cypher # LOAD CSV — create relationships
│ └── queries.cypher # Sample analytical queries
├── output/ # Generated at runtime
│ ├── neo4j/ # 6 CSVs for Neo4j LOAD CSV
│ └── powerbi/ # 3 wide-table CSVs for Power BI
├── .gitignore
├── requirements.txt
├── README.md
└── README_zh.md
Raw CSVs ──▶ Clean & Normalize ──▶ Remove Junk ──▶ Manual Fixes ──▶ Explore ──▶ Export
│ │ │ │ │
type casting null year + patch title, row counts Neo4j CSVs
deduplication no genre + year, genre genre dist. Power BI CSVs
null filtering no tags from JSON top movies exploration_log
year extraction → delete top users
genre explosion top tags
| Step | What it does |
|---|---|
| Load & Clean | Cast data types, trim strings, deduplicate, filter invalid ratings (outside 0.5–5.0), extract release_year from title, explode pipe-delimited genres into rows, remove IMAX pseudo-genre |
| Remove Junk | Delete movies that have null release year AND (no genres listed) AND no user tags — these are data noise |
| Manual Fixes | Patch 6 known movies with correct titles/years; replace (no genres listed) entries using manual_fixes.json |
| Explore | Print summary statistics — row counts, distinct values, genre distribution, most-rated movies, most active users, highest-rated movies, top tags |
| Export | Write final CSVs to output/neo4j/ and output/powerbi/ |
All exploration output is simultaneously printed to console and saved to output/exploration_log.txt.
git clone https://github.com/<your-username>/Movie-Knowledge-Graph-Analysis.git
cd Movie-Knowledge-Graph-Analysis
pip install -r requirements.txt
Download the MovieLens Latest Small dataset and place the three CSVs into data/raw/:
data/raw/movies.csv
data/raw/ratings.csv
data/raw/tags.csv
cd scripts
python main.py
Copy the CSVs from output/neo4j/ into your Neo4j import/ directory, then run the Cypher scripts in order:
neo4j/schema.cypher ← constraints & indexes
neo4j/import_nodes.cypher ← create User, Movie, Genre nodes
neo4j/import_relationships.cypher ← create RATED, HAS_GENRE, TAGGED relationships
neo4j/queries.cypher ← sample analytical queries
(:User) ─[:RATED {rating, rating_datetime, rating_date}]─▶ (:Movie {movieId, title, release_year})
(:User) ─[:TAGGED {tag, tag_datetime, tag_date}]──────────▶ (:Movie)
(:Movie) ─[:HAS_GENRE]───────────────────────────────────▶ (:Genre {name})
output/neo4j/)| File | Columns | Description |
|---|---|---|
users_node.csv |
userId |
Distinct user IDs |
movies_node.csv |
movieId, title, release_year |
Movie metadata |
genres_node.csv |
genre |
Distinct genre labels |
ratings_rel.csv |
userId, movieId, rating, rating_datetime, rating_date |
User → Movie ratings |
movie_genre_rel.csv |
movieId, genre |
Movie → Genre mapping |
tags_rel.csv |
userId, movieId, tag, tag_datetime, tag_date |
User → Movie tags |
| Script | Purpose |
|---|---|
schema.cypher |
Create uniqueness constraints and indexes for fast lookups |
import_nodes.cypher |
LOAD CSV commands to create :User, :Movie, :Genre nodes |
import_relationships.cypher |
LOAD CSV commands to create :RATED, :HAS_GENRE, :TAGGED relationships |
queries.cypher |
Sample queries — top-rated movies, genre co-occurrence, user similarity, recommendation paths |
| Module | Responsibility |
|---|---|
preprocessor.py |
load_csv(), load_json(), clean_movies(), clean_ratings(), clean_tags(), explode_genres() |
data_analysis.py |
run_all() — standalone analysis reports (genre distribution, top movies, active users, etc.) |
io_utils.py |
save_csv(), export_neo4j(), export_powerbi() — coalesce Spark partitions into single CSVs |
main.py |
Pipeline orchestrator — runs all steps in order, writes exploration log |
MovieLens Latest Small — 100,000 ratings and 3,600 tag applications across 9,700 movies by 600 users. Last updated 9/2018.
F. Maxwell Harper and Joseph A. Konecny. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems 5, 4, Article 19.
This project is for academic use.