⚠ Switch to EXCALIDRAW VIEW in the MORE OPTIONS menu of this document. ⚠ You can decompress Drawing data with the command palette: ‘Decompress current Excalidraw file’. For more info check in plugin settings under ‘Saving’

Excalidraw Data

Text Elements

seed url queue

web crawler: download, extract text, extract urls

S3 blob

DNS

website

consumer url

save downloaded file

download file

resolve host

send newly fetched urls

Problems

  1. No retry
  2. No fault tolerance
  3. Single responsibility principle is not followed
  4. Not scalable enough for 10B pages to be crawled

seed url queue

web crawler: fetch and store pages

S3 blob store: raw HTML data, text data

consumer url

save raw HTML data to S3

put extracted urls back in the queue

Website

DNS
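
A minimal sketch of the fetch-and-store side of this design, assuming SQS for the frontier queue, S3 for raw HTML, and a Python worker using boto3. The queue URLs, bucket name, and key layout are made up for illustration:

```python
# Fetcher worker sketch: pull a URL from the frontier queue, download the page,
# save raw HTML to S3, and hand off to the parsing service via the parser queue.
import hashlib
import json
import urllib.request

import boto3

sqs = boto3.client("sqs")
s3 = boto3.client("s3")

FRONTIER_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/frontier"  # hypothetical
PARSER_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/parser"      # hypothetical
RAW_BUCKET = "crawler-raw-html"                                                    # hypothetical

def fetch_one():
    # Long-poll one URL from the frontier queue.
    resp = sqs.receive_message(QueueUrl=FRONTIER_QUEUE_URL,
                               MaxNumberOfMessages=1, WaitTimeSeconds=20)
    for msg in resp.get("Messages", []):
        url = msg["Body"]
        html = urllib.request.urlopen(url, timeout=10).read()

        # Store raw HTML under a content-addressed key.
        key = "raw/" + hashlib.sha256(url.encode()).hexdigest() + ".html"
        s3.put_object(Bucket=RAW_BUCKET, Key=key, Body=html)

        # Hand off to the parsing service.
        sqs.send_message(QueueUrl=PARSER_QUEUE_URL,
                         MessageBody=json.dumps({"url": url, "s3Key": key}))

        # Delete only after the page is safely stored (at-least-once semantics).
        sqs.delete_message(QueueUrl=FRONTIER_QUEUE_URL,
                           ReceiptHandle=msg["ReceiptHandle"])
```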

parser queue

parsing service: extract urls, extract text

url metadata

{url: s3Url}

url (PK), textDataUrl, status, lastCrawlTime
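
A rough sketch of writing this URL metadata row to DynamoDB; the table name and exact attribute names are assumptions that follow the schema above:

```python
# Mark a URL as crawled in the URL metadata table (`url` is the partition key).
import datetime

import boto3

table = boto3.resource("dynamodb").Table("UrlMetadata")  # hypothetical table name

def mark_crawled(url, text_data_s3_url):
    table.put_item(Item={
        "url": url,                          # partition key
        "textDataUrl": text_data_s3_url,     # where the parsed text lives in S3
        "status": "CRAWLED",
        "lastCrawlTime": datetime.datetime.utcnow().isoformat(),
    })
```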

website domain metadata

Download HTML from S3, save parsed text to S3
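
A matching sketch of the parsing service under the same hypothetical bucket and queue names as the fetcher above: it pulls the raw HTML back out of S3, extracts text and links with a naive HTMLParser pass (a real crawler would use a proper extraction library), writes the text to S3, and puts newly found URLs back on the frontier queue.

```python
# Parsing-service worker sketch: S3 in, text + new frontier URLs out.
from html.parser import HTMLParser
from urllib.parse import urljoin

import boto3

s3 = boto3.client("s3")
sqs = boto3.client("sqs")

RAW_BUCKET = "crawler-raw-html"        # hypothetical
TEXT_BUCKET = "crawler-parsed-text"    # hypothetical
FRONTIER_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/frontier"  # hypothetical

class LinkAndTextExtractor(HTMLParser):
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links, self.text = [], []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))

    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())

def parse_one(url, s3_key):
    html = s3.get_object(Bucket=RAW_BUCKET, Key=s3_key)["Body"].read().decode("utf-8", "ignore")
    extractor = LinkAndTextExtractor(url)
    extractor.feed(html)

    # Save extracted text next to the raw HTML.
    text_key = s3_key.replace("raw/", "text/").replace(".html", ".txt")
    s3.put_object(Bucket=TEXT_BUCKET, Key=text_key, Body="\n".join(extractor.text))

    # Put newly discovered URLs back on the frontier queue.
    for link in extractor.links:
        sqs.send_message(QueueUrl=FRONTIER_QUEUE_URL, MessageBody=link)
    return text_key
```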

Kafka / Amazon SQS

Multiple DNS providers to avoid overloading a single DNS resolver, or use DNS caching
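
A tiny sketch of the caching idea: memoize resolutions so repeated URLs on the same host skip the resolver. Rotating across multiple provider endpoints would sit behind this same function; here the lookup just uses the system default resolver.

```python
# DNS caching sketch: cache hostname -> IP lookups in-process.
import functools
import socket

@functools.lru_cache(maxsize=100_000)
def resolve_host(hostname: str) -> str:
    # socket.gethostbyname uses the system resolver; a production setup could
    # swap in a client that round-robins over multiple DNS providers.
    return socket.gethostbyname(hostname)
```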

Amazon SQS provides retry with a delay. We can add some jitter to the delay so retries are randomized and we do not bombard the same website's URLs all at once. Use a retry limit with exponential backoff so a URL is not retried indefinitely by a forever-running process, and use a DLQ to keep the unprocessed URLs.
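
A sketch of the backoff-with-jitter calculation, applied by extending the SQS message's visibility timeout. The base and cap constants are arbitrary assumptions; SQS's redrive policy then moves the message to the DLQ after the configured maxReceiveCount is exceeded.

```python
# Exponential backoff with full jitter, applied via SQS visibility timeout.
import random

import boto3

sqs = boto3.client("sqs")
BASE_DELAY_S = 5
MAX_DELAY_S = 15 * 60   # well under the 12-hour SQS visibility timeout limit

def backoff_delay(attempt: int) -> int:
    capped = min(MAX_DELAY_S, BASE_DELAY_S * (2 ** attempt))
    return random.randint(0, capped)   # full jitter avoids synchronized retries

def delay_retry(queue_url: str, receipt_handle: str, attempt: int) -> None:
    # Leave the message on the queue but hide it for the backoff period.
    sqs.change_message_visibility(QueueUrl=queue_url,
                                  ReceiptHandle=receipt_handle,
                                  VisibilityTimeout=backoff_delay(attempt))
```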

update url status

frontier queue

DLQ

retry on failure with backoff

DynamoDB

lastTimeWeCrawledTheDomain, robots.txt metadata for that domain
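
A politeness-check sketch combining the two: consult robots.txt for disallow rules and crawl delay, then compare against the domain's last crawl time stored in DynamoDB. The table, attribute, and user-agent names are assumptions, and a real crawler would cache the parsed robots.txt in the domain-metadata table rather than fetch it on every check.

```python
# Decide whether a URL may be fetched right now, based on robots.txt and the
# last time we crawled this domain.
import time
import urllib.robotparser
from urllib.parse import urlparse

import boto3

domain_table = boto3.resource("dynamodb").Table("DomainMetadata")  # hypothetical
USER_AGENT = "my-crawler"                                          # hypothetical
DEFAULT_CRAWL_DELAY_S = 1.0

def allowed_to_fetch(url: str) -> bool:
    domain = urlparse(url).netloc

    # robots.txt: respect Disallow rules and Crawl-delay if present.
    robots = urllib.robotparser.RobotFileParser()
    robots.set_url(f"https://{domain}/robots.txt")
    robots.read()
    if not robots.can_fetch(USER_AGENT, url):
        return False
    delay = robots.crawl_delay(USER_AGENT) or DEFAULT_CRAWL_DELAY_S

    # Domain metadata: only fetch if enough time has passed since the last crawl.
    item = domain_table.get_item(Key={"domain": domain}).get("Item", {})
    last = float(item.get("lastTimeWeCrawledTheDomain", 0))
    return time.time() - last >= delay
```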

Lambda functions, ECS tasks, or any other serverless technology.