Ever finished a movie and wondered, “What should I watch next?” I spent this week building a Content-Based Recommendation Engine to answer exactly that.
Here is how I built it, day by day.
📅 Day 1: The Setup
I skipped the classic MovieLens dataset (ratings-based) and chose the TMDB 5000 Dataset to focus on content analysis. Using Pandas, I merged the movie and credits files and filtered for the essentials: genres, keywords, overview, cast, and crew.
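A minimal sketch of the merge-and-filter step. The two small DataFrames stand in for the movie and credits files (in practice they would presumably come from pd.read_csv), and the join key and the exact column names are assumptions for illustration, not something the dataset description above confirms.

```python
import pandas as pd

# Tiny stand-ins for the two dataset files; column names are assumptions.
movies = pd.DataFrame({
    "movie_id": [1, 2],
    "title": ["Avatar", "Spectre"],
    "overview": ["A marine on an alien moon.", "A cryptic message sends Bond on a mission."],
    "genres": ['[{"id": 28, "name": "Action"}]', '[{"id": 12, "name": "Adventure"}]'],
    "keywords": ['[{"id": 1, "name": "space"}]', '[{"id": 2, "name": "spy"}]'],
    "budget": [237000000, 245000000],  # example of a column that gets dropped
})
credits = pd.DataFrame({
    "title": ["Avatar", "Spectre"],
    "cast": ['[{"name": "Sam Worthington"}]', '[{"name": "Daniel Craig"}]'],
    "crew": ['[{"job": "Director", "name": "James Cameron"}]',
             '[{"job": "Director", "name": "Sam Mendes"}]'],
})

# Join the two tables on the shared title column (assumed join key).
merged = movies.merge(credits, on="title")

# Keep only the columns that describe the content of a movie.
keep = ["movie_id", "title", "overview", "genres", "keywords", "cast", "crew"]
movies_df = merged[keep]
```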
📅 Day 2: Wrangling Data
Data cleaning was the heavy lifting. Columns like genres were stored as JSON strings (e.g., [{"id": 28, "name": "Action"}]).
The Fix: Used ast.literal_eval to parse them into Python lists.
Feature Engineering: I extracted the top 3 actors and the director. I also collapsed spaces (e.g., “Science Fiction” → “sciencefiction”) to create unique tag entities.
Result: A single “Super Column” called tags that summarizes the entire movie.
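The cleaning steps above can be sketched roughly as follows. The helper names (names, director, collapse, build_tags), the "job" field used to find the director, and the sample strings are all hypothetical; only ast.literal_eval, the top-3 cast rule, the director, and the space collapsing come from the description.

```python
import ast

def names(json_like, limit=None):
    """Parse a JSON-style string and return the 'name' of each entry."""
    items = ast.literal_eval(json_like)
    return [item["name"] for item in items][:limit]

def director(crew_json):
    """Return the first crew member whose job is 'Director', as a list."""
    crew = ast.literal_eval(crew_json)
    return [m["name"] for m in crew if m.get("job") == "Director"][:1]

def collapse(values):
    """Remove spaces so 'Science Fiction' becomes one token."""
    return [v.replace(" ", "") for v in values]

def build_tags(overview, genres, keywords, cast, crew):
    """Combine all content fields into one lowercase tags string."""
    parts = (
        overview.split()
        + collapse(names(genres))
        + collapse(names(keywords))
        + collapse(names(cast, limit=3))  # top 3 actors only
        + collapse(director(crew))
    )
    return " ".join(parts).lower()

# Made-up sample row for illustration.
genres = '[{"id": 878, "name": "Science Fiction"}]'
keywords = '[{"id": 1, "name": "space war"}]'
cast = '[{"name": "A One"}, {"name": "B Two"}, {"name": "C Three"}, {"name": "D Four"}]'
crew = '[{"job": "Editor", "name": "E Five"}, {"job": "Director", "name": "James Cameron"}]'

tags = build_tags("A marine on a moon", genres, keywords, cast, crew)
print(tags)
```

Collapsing spaces keeps "James Cameron" from being split into two unrelated tokens, so a movie matches on the full name rather than on every "James" in the dataset.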
📅 Day 3: The Math (Vectorization)
To measure similarity, I needed to turn text into numbers.
Vectorization: Used Scikit-Learn’s CountVectorizer (Bag of Words) to convert each movie’s tags into a 5,000-dimensional vector of word counts, removing stop words.
Similarity: Used Cosine Similarity to measure the angle between vectors; the smaller the angle, the more alike two movies are. This produced a square matrix scoring every movie against every other movie.
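Here is a small sketch of both steps on three made-up tag strings. The max_features=5000 and stop_words="english" settings are my reading of "5,000-dimensional" and "removing stop words"; treat them as assumptions rather than the exact configuration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up tag strings standing in for the real tags column.
tags = [
    "space marine alien moon action sciencefiction jamescameron",
    "alien spaceship crew horror sciencefiction ridleyscott",
    "wedding romance comedy newyork",
]

# Bag of Words: one column per word, capped at the 5,000 most frequent words.
vectorizer = CountVectorizer(max_features=5000, stop_words="english")
vectors = vectorizer.fit_transform(tags)

# Pairwise cosine similarity: entry [i, j] compares movie i with movie j.
similarity = cosine_similarity(vectors)

print(similarity.shape)
print(similarity.round(2))
```

The first two strings share "alien" and "sciencefiction", so they score higher against each other than either does against the romantic comedy.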
📅 Day 4: The Interface
I used Streamlit to build a frontend.
Logic: The user selects a movie → App finds its index → Sorts that movie’s row of the similarity matrix → Returns the top 5 matches.
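The lookup logic might look something like this. The recommend function, its parameters, and the toy matrix are hypothetical; in the app the selected title would come from a Streamlit widget such as st.selectbox, which is left out so the sketch runs on its own. Skipping the selected movie itself is an assumption added here, since a movie always scores 1.0 against itself.

```python
def recommend(title, titles, similarity, top_n=5):
    """Return the top_n titles most similar to the given title."""
    idx = titles.index(title)                  # find the movie's index
    scores = list(enumerate(similarity[idx]))  # (index, score) pairs for its row
    ranked = sorted(scores, key=lambda pair: pair[1], reverse=True)
    # Skip the movie itself, then keep the best matches.
    return [titles[i] for i, _ in ranked if i != idx][:top_n]

# Toy data for illustration only.
titles = ["Avatar", "Aliens", "Titanic", "Notting Hill"]
similarity = [
    [1.0, 0.6, 0.3, 0.0],
    [0.6, 1.0, 0.1, 0.0],
    [0.3, 0.1, 1.0, 0.4],
    [0.0, 0.0, 0.4, 1.0],
]

picks = recommend("Avatar", titles, similarity, top_n=2)
print(picks)  # ['Aliens', 'Titanic']
```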
📅 Day 5: API Integration
Text-only lists are boring. I signed up for the TMDB API and wrote a script to fetch real-time movie posters. Displaying them side-by-side made the app feel like a real product.
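A rough sketch of the poster lookup, using only the standard library. The movie-details endpoint, the poster_path field, and the w500 image size are assumptions about which parts of the TMDB API were used; the function names and the "KEY" placeholder are hypothetical.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.themoviedb.org/3/movie"  # assumed details endpoint
IMAGE_BASE = "https://image.tmdb.org/t/p/w500"   # assumed image size

def details_url(movie_id, api_key):
    """Build the request URL for one movie's details."""
    query = urllib.parse.urlencode({"api_key": api_key})
    return f"{API_BASE}/{movie_id}?{query}"

def poster_url(details):
    """Turn a details payload into a full poster URL, or None if missing."""
    path = details.get("poster_path")
    return f"{IMAGE_BASE}{path}" if path else None

def fetch_poster(movie_id, api_key, timeout=10):
    """Call the API and return the poster URL (requires network access)."""
    with urllib.request.urlopen(details_url(movie_id, api_key), timeout=timeout) as resp:
        return poster_url(json.load(resp))

# Offline example with a made-up payload; no request is sent here.
print(details_url(19995, "KEY"))
print(poster_url({"poster_path": "/abc.jpg"}))
```

Keeping the URL building separate from the network call makes it easy to handle movies with no poster and to check the logic without an API key.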
📅 Day 6: Optimization
Re-calculating the model on every reload was too slow.
Solution: I used pickle to save the processed data and similarity matrix. The app now loads pre-computed files instantly.
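The save-once, load-many pattern can be sketched like this. The file names movies.pkl and similarity.pkl and the two helper functions are hypothetical, and the temporary directory is only there so the example cleans up after itself.

```python
import pickle
import tempfile
from pathlib import Path

def save_artifacts(directory, movies, similarity):
    """Write the processed data and similarity matrix to disk once."""
    directory = Path(directory)
    with open(directory / "movies.pkl", "wb") as f:
        pickle.dump(movies, f)
    with open(directory / "similarity.pkl", "wb") as f:
        pickle.dump(similarity, f)

def load_artifacts(directory):
    """Read the pre-computed files back; no model work happens here."""
    directory = Path(directory)
    with open(directory / "movies.pkl", "rb") as f:
        movies = pickle.load(f)
    with open(directory / "similarity.pkl", "rb") as f:
        similarity = pickle.load(f)
    return movies, similarity

# Toy data for illustration only.
movies = {"title": ["Avatar", "Aliens"]}
similarity = [[1.0, 0.6], [0.6, 1.0]]

with tempfile.TemporaryDirectory() as tmp:
    save_artifacts(tmp, movies, similarity)            # done once, offline
    loaded_movies, loaded_similarity = load_artifacts(tmp)  # done at app start
```

The expensive vectorization and similarity steps then run only when the data changes, and the app itself just deserializes two files.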