Web Crawling at Large Scale

Web crawling at scale is essential for aggregating data, monitoring websites, and feeding downstream applications such as machine learning models and analytics. This project developed a scalable crawling and re-crawling pipeline with Scrapy, stored the results in MongoDB, and exposed a machine learning model through a Flask API. The entire pipeline was Dockerized and deployed on AWS EC2 for robust, highly available operation.
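The re-crawling half of such a pipeline needs a policy for deciding when a page is due for another visit. The source does not specify the policy used here, so the following is a minimal, stdlib-only sketch of one common approach: a priority queue keyed by next-due timestamp, with a fixed revisit interval. The class name, the fixed interval, and the example URLs are all hypothetical; a production version would persist this state (e.g. in MongoDB) and vary the interval per site.

```python
import heapq
import time


class RecrawlScheduler:
    """Hypothetical sketch: schedule URLs for re-crawl after a
    fixed revisit interval, using an in-memory min-heap keyed by
    the timestamp at which each URL next becomes due."""

    def __init__(self, interval_seconds):
        self.interval = interval_seconds
        self._heap = []  # entries are (next_due_timestamp, url)

    def mark_crawled(self, url, now=None):
        """Record that `url` was just crawled; it becomes due
        again `interval_seconds` from `now`."""
        now = time.time() if now is None else now
        heapq.heappush(self._heap, (now + self.interval, url))

    def due_urls(self, now=None):
        """Pop and return every URL whose due time has passed,
        in order of how overdue it is."""
        now = time.time() if now is None else now
        due = []
        while self._heap and self._heap[0][0] <= now:
            _, url = heapq.heappop(self._heap)
            due.append(url)
        return due


scheduler = RecrawlScheduler(interval_seconds=3600)
scheduler.mark_crawled("https://example.com/a", now=0)
scheduler.mark_crawled("https://example.com/b", now=1800)
print(scheduler.due_urls(now=3600))  # only /a is due an hour in
print(scheduler.due_urls(now=5400))  # /b becomes due at 5400
```

In the real pipeline this logic would sit between the data store and the crawler: a periodic job asks the scheduler for due URLs and feeds them back to Scrapy as fresh requests.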