Design a web crawler

Question

Web Crawler

Step 1 - Understand the problem and establish design scope

Questions to ask

Functional requirements

Non-functional requirements

Back-of-the-envelope estimation

Step 2 - Propose high-level design and get buy-in

[Figure: high-level design diagram]

Seed URLs
URL Frontier
HTML Downloader
DNS Resolver
Content Parser
Content Seen
Content Storage
URL Extractor
URL Filter
URL Seen
URL Storage

Step 3 - Design deep dive

DFS vs BFS
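
A minimal sketch of the two traversal orders, assuming a hypothetical in-memory link graph (`LINKS`) in place of real pages. The function name `crawl_order` and the `breadth_first` flag are illustrative only. The only difference between the two modes is which end of the frontier the next URL is taken from.

```python
from collections import deque

# Hypothetical link graph standing in for real pages: URL -> outgoing links.
LINKS = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d"],
    "d": [],
}


def crawl_order(seed, links, breadth_first=True):
    """Return the order in which pages are visited starting from `seed`."""
    frontier = deque([seed])
    seen = {seed}  # URLs already queued, so each page is visited once
    order = []
    while frontier:
        # FIFO (popleft) gives breadth-first; LIFO (pop) gives depth-first.
        url = frontier.popleft() if breadth_first else frontier.pop()
        order.append(url)
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return order
```

With this graph, the breadth-first order visits both direct links of the seed before going deeper, while the depth-first variant follows one branch to its end first. On a real web graph the depth of a branch is unbounded, which is the usual concern with depth-first crawling.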

URL frontier

[Figure: URL frontier diagram]

Politeness
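
A minimal sketch of one way to enforce politeness, assuming one FIFO queue per host and a fixed minimum delay between requests to the same host. The class name `PoliteFrontier`, the `delay_seconds` parameter, and the explicit `now` argument are all assumptions made for illustration; a real crawler would read the clock itself and would likely map queues to worker threads.

```python
import collections
from urllib.parse import urlsplit


class PoliteFrontier:
    """Per-host FIFO queues with a minimum delay between hits to one host."""

    def __init__(self, delay_seconds=1.0):
        self.delay = delay_seconds
        # host -> URLs waiting to be fetched from that host
        self.queues = collections.defaultdict(collections.deque)
        # host -> earliest time at which the host may be fetched again
        self.next_allowed = {}

    def add(self, url):
        host = urlsplit(url).netloc
        self.queues[host].append(url)

    def pop_ready(self, now):
        """Return one URL whose host may be fetched at time `now`, else None."""
        for host, queue in self.queues.items():
            if queue and self.next_allowed.get(host, 0.0) <= now:
                self.next_allowed[host] = now + self.delay
                return queue.popleft()
        return None
```

Passing `now` in explicitly keeps the sketch deterministic; in practice the caller would pass something like `time.monotonic()`.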

Priority

Freshness

Storage for URL Frontier

HTML Downloader

Robots.txt
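
A minimal sketch of checking URLs against robots.txt rules with Python's standard `urllib.robotparser`. The robots.txt body, the `ExampleBot` user agent, and the `build_parser` helper are hypothetical; a real crawler would download the file from each host and cache the parsed result rather than re-fetching it per URL.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body used here instead of a network fetch.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_body):
    """Parse a robots.txt body that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_body.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
allowed = parser.can_fetch("ExampleBot", "https://example.com/index.html")
blocked = parser.can_fetch("ExampleBot", "https://example.com/private/data.html")
delay = parser.crawl_delay("ExampleBot")  # seconds, or None if not specified
```

Here `allowed` and `blocked` are booleans the downloader can check before issuing a request, and `delay` can feed the per-host wait used for politeness.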

Performance optimisation

Robustness

Extensibility

Detect and avoid problematic content

Step 4 - Wrap up