Movie Recommendation System

1 devlog
41m 57s

This is a machine learning-based Movie Recommendation System designed to solve the problem of “what to watch next.” Using Content-Based Filtering, the system analyzes the metadata of movies, including genres, keywords, cast, and crew, to find similarities between them.

techwizard

Shipped this project!

I just completed a movie recommender using Python, Pandas, and Scikit-Learn. The system uses Natural Language Processing (Bag of Words) to convert movie tags into vectors and calculates similarity scores to find the closest matches in n-dimensional space.

The Build: I processed the TMDB 5000 dataset, saved the processed data and similarity matrix with pickle so the app loads without recomputing them, and integrated the TMDB API to fetch movie posters for the frontend at runtime.

The Lesson: I learned a ton about vectorization and CountVectorizer. It was fascinating to see how the angle between two tag vectors can work as a proxy for human taste in movies.

techwizard

Ever finished a movie and wondered, “What should I watch next?” I spent this week building a Content-Based Recommendation Engine to answer exactly that.
Here is how I built it, day by day.
📅 Day 1: The Setup
I skipped the classic MovieLens dataset (ratings-based) and chose the TMDB 5000 Dataset to focus on content analysis. Using Pandas, I merged the movie and credits files and filtered for the essentials: genres, keywords, overview, cast, and crew.
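
A rough sketch of the Day 1 merge, using two tiny stand-in DataFrames instead of the real CSV files. The column names (`id`, `movie_id`, `title`, `budget`) and joining on `title` are assumptions about the dataset layout, and the rows are made-up placeholders.

```python
import pandas as pd

# Stand-ins for the movies and credits files; the rows are placeholders.
movies = pd.DataFrame({
    "id": [1, 2],
    "title": ["Movie A", "Movie B"],
    "overview": ["A marine travels to a distant moon.", "Two strangers meet in Paris."],
    "genres": ['[{"id": 28, "name": "Action"}]', '[{"id": 18, "name": "Drama"}]'],
    "keywords": ['[{"id": 1, "name": "space war"}]', '[{"id": 2, "name": "love letter"}]'],
    "budget": [1000, 2000],  # example of a column that gets dropped
})
credits = pd.DataFrame({
    "movie_id": [1, 2],
    "title": ["Movie A", "Movie B"],
    "cast": ['[{"name": "Actor One"}]', '[{"name": "Actor Two"}]'],
    "crew": ['[{"name": "Jane Doe", "job": "Director"}]',
             '[{"name": "John Roe", "job": "Director"}]'],
})

# Join the two tables on the shared column (assumed here to be "title").
merged = movies.merge(credits, on="title")

# Keep only the columns needed for content analysis.
keep = ["movie_id", "title", "overview", "genres", "keywords", "cast", "crew"]
movies_small = merged[keep]
```

One caveat with this sketch: if two different movies share a title, joining on `title` would produce extra rows, so joining on an ID column may be safer.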

📅 Day 2: Wrangling Data
Data cleaning was the heavy lifting. Columns like genres were stored as JSON strings (e.g., [{"id": 28, "name": "Action"}]).

The Fix: Used ast.literal_eval to parse them into Python lists.
Feature Engineering: I extracted the top 3 actors and the director. I also removed the spaces inside multi-word names (e.g., “Science Fiction” → “sciencefiction”) so that each name becomes a single, unique tag.
Result: A single “Super Column” called tags that summarizes the entire movie.
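
The Day 2 steps can be sketched in plain Python as below. The helper names (`names`, `directors`, `collapse`, `build_tags`), the `job` key on crew entries, and the sample row are assumptions for illustration; on a DataFrame the same functions could be applied per row.

```python
import ast

def names(raw, limit=None):
    """Parse a stringified list of dicts and return the 'name' values."""
    picked = [item["name"] for item in ast.literal_eval(raw)]
    return picked if limit is None else picked[:limit]

def directors(raw_crew):
    """Return the names of crew entries whose job is 'Director' (assumed key)."""
    return [m["name"] for m in ast.literal_eval(raw_crew) if m.get("job") == "Director"]

def collapse(words):
    """Remove inner spaces so 'Science Fiction' becomes one token."""
    return [w.replace(" ", "") for w in words]

def build_tags(row):
    """Combine overview, genres, keywords, top 3 cast, and director into one string."""
    parts = (
        row["overview"].split()
        + collapse(names(row["genres"]))
        + collapse(names(row["keywords"]))
        + collapse(names(row["cast"], limit=3))
        + collapse(directors(row["crew"]))
    )
    return " ".join(parts).lower()

# Made-up sample row; the people and keyword IDs are placeholders.
row = {
    "overview": "A marine travels to a distant moon.",
    "genres": '[{"id": 28, "name": "Action"}, {"id": 878, "name": "Science Fiction"}]',
    "keywords": '[{"id": 1, "name": "space war"}]',
    "cast": '[{"name": "Actor One"}, {"name": "Actor Two"}, '
            '{"name": "Actor Three"}, {"name": "Actor Four"}]',
    "crew": '[{"name": "Jane Doe", "job": "Director"}, {"name": "John Roe", "job": "Editor"}]',
}
tags = build_tags(row)
```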
📅 Day 3: The Math (Vectorization)
To measure similarity, I needed to turn text into numbers.
Vectorization: Used Scikit-Learn’s CountVectorizer (Bag of Words) to convert tags into 5,000-dimensional vectors, removing stop words.
Similarity: Used cosine similarity, which scores two vectors by the cosine of the angle between them. This generated a square matrix comparing every movie against every other movie.
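
Here is a minimal sketch of the Day 3 step on three made-up tag strings. Setting `max_features=5000` and `stop_words="english"` is my reading of "5,000-dimensional vectors, removing stop words", not a confirmed copy of the project's settings.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder tag strings standing in for the real "tags" column.
tags = [
    "action sciencefiction space marine alien",
    "action sciencefiction space robot war",
    "romance drama paris love letter",
]

# Bag of Words: count how often each vocabulary word appears in each tag string.
vectorizer = CountVectorizer(max_features=5000, stop_words="english")
vectors = vectorizer.fit_transform(tags)

# Pairwise cosine similarity: entry [i][j] compares movie i with movie j.
similarity = cosine_similarity(vectors)
```

In this toy example the first two strings share three words, so they score higher against each other than either does against the third, which shares none.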
📅 Day 4: The Interface
I used Streamlit to build a frontend.
Logic: The user selects a movie → the app finds its index → sorts that movie's row of the similarity matrix → returns the top 5 matches.
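
The lookup logic can be sketched without Streamlit as a plain function. The name `recommend`, its parameters, and the tiny similarity matrix are hypothetical. Skipping the selected movie itself is my assumption, since a movie always scores 1.0 against itself.

```python
def recommend(title, titles, similarity, top_n=5):
    """Return the top_n titles most similar to `title`, excluding the title itself."""
    index = titles.index(title)                     # find the selected movie's row
    scores = list(enumerate(similarity[index]))     # (movie index, score) pairs
    ranked = sorted(scores, key=lambda pair: pair[1], reverse=True)
    matches = [i for i, _ in ranked if i != index][:top_n]
    return [titles[i] for i in matches]

# Placeholder titles and a hand-written symmetric similarity matrix.
titles = ["Movie A", "Movie B", "Movie C", "Movie D"]
similarity = [
    [1.0, 0.2, 0.8, 0.1],
    [0.2, 1.0, 0.3, 0.6],
    [0.8, 0.3, 1.0, 0.0],
    [0.1, 0.6, 0.0, 1.0],
]

top_two = recommend("Movie A", titles, similarity, top_n=2)
```

In a Streamlit app, the selected title would typically come from a widget such as `st.selectbox` and the returned titles would be displayed below it.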
📅 Day 5: API Integration
Text-only lists are boring. I signed up for the TMDB API and wrote a script to fetch real-time movie posters. Displaying them side-by-side made the app feel like a real product.
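
A sketch of what the poster lookup might look like using only the standard library. The function names and the `w500` image size are assumptions. The API key is a value you supply yourself, and `fetch_poster` makes a network request, so it is defined here but not called.

```python
import json
import urllib.request

API_URL = "https://api.themoviedb.org/3/movie/{movie_id}?api_key={api_key}"
IMAGE_BASE = "https://image.tmdb.org/t/p/w500"  # w500 is one of several image sizes

def poster_url(details):
    """Build a full image URL from a movie-details response, or None if no poster."""
    path = details.get("poster_path")
    return f"{IMAGE_BASE}{path}" if path else None

def fetch_poster(movie_id, api_key, timeout=10):
    """Request movie details from TMDB and return the poster URL (network call)."""
    url = API_URL.format(movie_id=movie_id, api_key=api_key)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        details = json.load(response)
    return poster_url(details)

# Offline example with a hand-written response fragment.
example = poster_url({"poster_path": "/abc.jpg"})
```

Keeping the URL-building step separate from the network call makes the missing-poster case easy to handle, for example by showing a placeholder image when `poster_url` returns `None`.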
📅 Day 6: Optimization
Re-calculating the model on every reload was too slow.
Solution: I used pickle to save the processed data and similarity matrix. The app now loads pre-computed files instantly.
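
The save-and-load step can be sketched like this. The file names `movies.pkl` and `similarity.pkl`, the helper names, and the toy data are assumptions; a temporary directory is used so the example cleans up after itself.

```python
import os
import pickle
import tempfile

def save_artifacts(directory, movies, similarity):
    """Write the processed movie data and similarity matrix to disk once."""
    with open(os.path.join(directory, "movies.pkl"), "wb") as f:
        pickle.dump(movies, f)
    with open(os.path.join(directory, "similarity.pkl"), "wb") as f:
        pickle.dump(similarity, f)

def load_artifacts(directory):
    """Read the pre-computed files back instead of rebuilding them."""
    with open(os.path.join(directory, "movies.pkl"), "rb") as f:
        movies = pickle.load(f)
    with open(os.path.join(directory, "similarity.pkl"), "rb") as f:
        similarity = pickle.load(f)
    return movies, similarity

# Toy data standing in for the real DataFrame and matrix.
movies = [{"movie_id": 1, "title": "Movie A"}, {"movie_id": 2, "title": "Movie B"}]
similarity = [[1.0, 0.4], [0.4, 1.0]]

with tempfile.TemporaryDirectory() as tmp:
    save_artifacts(tmp, movies, similarity)
    loaded_movies, loaded_similarity = load_artifacts(tmp)
```

Only unpickle files you created or trust, because loading a pickle can execute arbitrary code.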


Comments

Chibueze Benneth 2 months ago

oh that’s really cool! One thing I want to learn is how to properly integrate APIs into my workflow, so I am impressed you scaled your project to more than just a text based model. Good job!