BibRec
This is a recommender system project for recommending books. It was completed together with other students at university.
Users can rate a book on a scale of 1-10 stars. Based on the information the user provides on login, the system recommends books the user might like. When viewing a book, similar books are recommended as well.
A Random Forest is used as the model-based collaborative filtering algorithm. It predicts the rating a potential user would give a book from the user's age, country and state and from the book's year of publication and publisher.
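A minimal sketch of this idea, assuming Scikit-Learn's `RandomForestRegressor` and `OneHotEncoder`. The toy rows, the helper name `predict_rating` and the hyperparameters are invented for illustration and are not taken from the project:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import OneHotEncoder

# Invented toy data: one row per (user, book) rating.
numeric = np.array(  # columns: user age, year of publication
    [[23, 1999], [35, 2001], [41, 1995], [19, 2003], [52, 1988], [30, 2000]],
    dtype=float,
)
categorical = np.array([  # columns: country, state, publisher
    ["usa", "california", "penguin"],
    ["usa", "texas", "harpercollins"],
    ["germany", "bavaria", "penguin"],
    ["canada", "ontario", "other"],
    ["usa", "california", "other"],
    ["germany", "berlin", "harpercollins"],
])
ratings = np.array([8, 6, 9, 4, 7, 5], dtype=float)

# One-hot encode the categorical columns; unseen categories become all zeros.
encoder = OneHotEncoder(handle_unknown="ignore")
features = np.hstack([numeric, encoder.fit_transform(categorical).toarray()])

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(features, ratings)


def predict_rating(age, country, state, year, publisher):
    """Predict a rating for one hypothetical user/book combination."""
    encoded = encoder.transform([[country, state, publisher]]).toarray()
    row = np.hstack([[[age, year]], encoded])
    return float(model.predict(row)[0])
```

Calling `predict_rating(28, "usa", "california", 1998, "penguin")` returns a value within the range of the training ratings, because a regression forest averages the ratings found in its leaves.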
A content-based filtering approach is used to recommend similar books. Similarity is inferred by calculating the term frequency–inverse document frequency (TF-IDF) from each book's title and genres.
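The following sketch shows how such a similarity lookup could work, assuming Scikit-Learn's `TfidfVectorizer` and `cosine_similarity`. The book list, the genre strings and the helper `similar_books` are made up for the example:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Invented sample: title -> genres.
books = {
    "The Hobbit": "fantasy adventure",
    "The Lord of the Rings": "fantasy adventure epic",
    "A Brief History of Time": "science physics",
    "Cosmos": "science astronomy",
}
titles = list(books)

# One text document per book, built from the title and its genres.
documents = [f"{title} {genres}" for title, genres in books.items()]
tfidf = TfidfVectorizer(stop_words="english").fit_transform(documents)

# Pairwise cosine similarity between all TF-IDF vectors.
similarity = cosine_similarity(tfidf)


def similar_books(title, top_n=2):
    """Return the top_n titles most similar to the given one."""
    index = titles.index(title)
    ranked = similarity[index].argsort()[::-1]  # most similar first
    return [titles[i] for i in ranked if i != index][:top_n]
```

With this sample, `similar_books("The Hobbit", top_n=1)` picks the other fantasy title, since the two books share the terms "fantasy" and "adventure".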
System Description
- Frontend: Vite, React, Material UI, Axios
- Backend: REST API with Flask
- Algorithms: Random Forest & Content Based Filtering
- Libraries
- Initial research with the CaseRecommender
- Recommender system algorithms from Scikit-Learn
- Python to train models
- Jupyter Notebooks for experiments
- Build: Makefile, Docker Compose
This project requires around 13GB of free RAM.
Dataset
The project is based on the BookCrossing dataset, which is enhanced with genres taken from OpenLibrary. Normalized and one-hot-encoded versions are stored as files to speed up startup.
Feature Engineering
Data Cleaning
- Remove invalid entries
- Remove ISBN duplicates and convert ISBNs to the ISBN-13 standard
- Split Location into Country, State, City
- Data reduction
- Ratings: reduced by 66.6%, from 1,149,780 to 383,962
- Books: reduced by 0.16%, from 271,379 to 270,944
- Users: unchanged
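Two of the cleaning steps above can be sketched in plain Python. The function names are hypothetical, and the location layout "city, state, country" is an assumption about the raw data:

```python
def isbn10_to_isbn13(isbn10):
    """Convert an ISBN-10 to ISBN-13; return None for malformed input."""
    digits = isbn10.replace("-", "").strip()
    if len(digits) != 10 or not digits[:9].isdigit():
        return None
    # Drop the old check digit and prepend the 978 prefix.
    core = "978" + digits[:9]
    # ISBN-13 check digit: weights alternate 1, 3, 1, 3, ...
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(core))
    return core + str((10 - total % 10) % 10)


def split_location(location):
    """Split 'city, state, country' (assumed layout) into separate fields."""
    parts = [part.strip() for part in location.split(",")]
    parts += [""] * (3 - len(parts))  # missing fields become empty strings
    return {"city": parts[0], "state": parts[1], "country": parts[-1]}
```

For example, `isbn10_to_isbn13("0306406152")` yields `"9780306406157"`. Mapping every ISBN to one standard first makes duplicates easier to detect.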
Data Normalization
- Replace missing values with mean value (Age, Year_of_publication)
- Publication year offset by -2005 (year minus 2005)
- Only explicit ratings & rating bias correction
- Extend by average rating and number of ratings
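The normalization steps could look roughly like the sketch below. It assumes that implicit ratings are stored as 0 and that "rating bias correction" means subtracting each user's mean rating. Both are assumptions, and all function names are hypothetical:

```python
from statistics import mean


def fill_missing_with_mean(values):
    """Replace None entries with the mean of the known values."""
    average = mean(v for v in values if v is not None)
    return [average if v is None else v for v in values]


def offset_year(year, reference=2005):
    """Shift the publication year so that the reference year maps to 0."""
    return year - reference


def explicit_bias_corrected(ratings):
    """Keep explicit ratings (> 0) and subtract each user's mean rating.

    ratings: list of (user_id, isbn, rating) tuples.
    """
    explicit = [r for r in ratings if r[2] > 0]
    per_user = {}
    for user, _, rating in explicit:
        per_user.setdefault(user, []).append(rating)
    user_mean = {user: mean(values) for user, values in per_user.items()}
    return [(user, isbn, rating - user_mean[user]) for user, isbn, rating in explicit]
```

A user who rates everything highly and a user who rates everything low then end up on a comparable scale centered on 0.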
One-Hot Encoding
- Categorization of publisher/country/state into the most common values and “other”
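As an illustration of this bucketing, here is a small dependency-free sketch. The helper `one_hot_top_n` and the cut-off `top_n` are hypothetical; the source does not state how many categories the project keeps:

```python
from collections import Counter


def one_hot_top_n(values, top_n):
    """One-hot encode values, keeping the top_n most frequent categories.

    Every other value is mapped to the shared category "other".
    Returns the category names and one 0/1 row per input value.
    """
    most_common = [value for value, _ in Counter(values).most_common(top_n)]
    categories = most_common + ["other"]
    rows = []
    for value in values:
        label = value if value in most_common else "other"
        rows.append([1 if label == category else 0 for category in categories])
    return categories, rows


countries = ["usa", "usa", "germany", "usa", "germany", "spain", "italy"]
categories, rows = one_hot_top_n(countries, top_n=2)
```

Here `categories` is `["usa", "germany", "other"]`, so "spain" and "italy" share the last column. Capping the number of categories keeps the feature matrix narrow, which helps with memory use.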
Data extension
- Genre, Subject from OpenLibrary data
Recommendation Strategy
- Item Recommendations: Content Based: TF-IDF
- Data used: Title, Genres, Subjects
- Similarity measure: cosine similarity
- User Recommendations: MBCF: Random Forest
- Features: Country, State, Age, Year-of-Publication, Publisher
- Hybrid approach (Mixed): Collaborative Filtering + Content based
- Used in the evaluation API
- Ratio: 70% to 30%
- Most Popular (Cold Start)
- 80% Most Rated and 20% Least Rated
- Merge and mix
- Top in Country
- Top 50 most rated books in a country
- sorted by rating
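The ratio-based strategies above (70/30 for the hybrid approach, 80/20 for the cold start) can be sketched with one helper. The function `mix` is hypothetical, and reading the 70% share as the collaborative filtering part is an assumption based on the order in which the two parts are listed:

```python
def mix(primary, secondary, size, primary_share):
    """Build a list of `size` items from two ranked recommendation lists.

    Takes round(size * primary_share) items from `primary`, then fills
    the remaining slots from `secondary`, skipping duplicates.
    """
    n_primary = round(size * primary_share)
    result = list(primary[:n_primary])
    for item in secondary:
        if len(result) >= size:
            break
        if item not in result:
            result.append(item)
    return result


# Invented book ids for illustration.
collaborative = [f"cf{i}" for i in range(1, 11)]
content_based = ["cf3", "cb1", "cb2", "cb3", "cb4"]
hybrid = mix(collaborative, content_based, size=10, primary_share=0.7)

most_rated = [f"top{i}" for i in range(1, 11)]
least_rated = [f"rare{i}" for i in range(1, 6)]
cold_start = mix(most_rated, least_rated, size=10, primary_share=0.8)
```

The result keeps the two sources in separate blocks. A shuffle (for example `random.shuffle`) could be applied afterwards to interleave them, which may be what "Merge and mix" refers to.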
Lessons
- Python was partly difficult
- Time-consuming (new territory, tinkering with DataFrames)
- Extraction into separate files partly didn’t work
- Jupyter Notebooks are great for interactive experiments + documentation
- Performance issues with large amounts of data
- RAM limits cause problems (OOM, system crash)
- Data processing must be very precise (example: per-user split)
- Development
- Introduce shared code base / normalization earlier
- Loading models isn’t feasible (MODEL_FILE_PKL)
- Start implementation earlier in the semester
- Less is forgotten / overlooked / left out