Web Crawling at Large Scale
Web crawling at scale is essential for aggregating data, monitoring websites, and feeding downstream applications such as machine learning models and analytics. This project built a scalable web crawling and re-crawling pipeline with Scrapy, stored the crawled data in MongoDB, and served a machine learning model through a Flask API. The entire pipeline was Dockerized and deployed on AWS EC2 for robust, high-availability operation.
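To make the re-crawling step concrete, here is a minimal sketch of one common way to decide whether a re-crawled page needs to be rewritten in storage: hash the page body and compare it with the hash saved on the previous crawl. This is an illustration under stated assumptions, not the project's actual code. The names `fingerprint` and `upsert_page` and the record fields are hypothetical, and a plain dictionary stands in for the MongoDB collection so the example runs without a database.

```python
import hashlib
from datetime import datetime, timezone


def fingerprint(body: str) -> str:
    """Return a stable SHA-256 hex digest of the page body."""
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def upsert_page(store: dict, url: str, body: str, now=None) -> str:
    """Insert or update a crawled page, keyed by URL.

    `store` is a hypothetical in-memory stand-in for a MongoDB collection.
    Returns "new", "changed", or "unchanged".
    """
    now = now or datetime.now(timezone.utc)
    digest = fingerprint(body)
    existing = store.get(url)

    if existing is None:
        status = "new"
    elif existing["hash"] == digest:
        status = "unchanged"
    else:
        status = "changed"

    if status == "unchanged":
        # Content is identical: only record that the page was visited again.
        existing["last_crawled"] = now
    else:
        # New or modified content: write the full record.
        store[url] = {
            "url": url,
            "body": body,
            "hash": digest,
            "last_crawled": now,
        }
    return status


if __name__ == "__main__":
    pages = {}
    print(upsert_page(pages, "https://example.com/", "<p>v1</p>"))
    print(upsert_page(pages, "https://example.com/", "<p>v1</p>"))
    print(upsert_page(pages, "https://example.com/", "<p>v2</p>"))
```

With a real MongoDB backend, the same idea would typically map to an upsert keyed on the URL (for example PyMongo's `update_one` with `upsert=True`), called from a Scrapy item pipeline. Skipping the full write for unchanged pages keeps re-crawls cheap when most content is stable.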