⚠ Switch to EXCALIDRAW VIEW in the MORE OPTIONS menu of this document. ⚠ You can decompress Drawing data with the command palette: ‘Decompress current Excalidraw file’. For more info check in plugin settings under ‘Saving’
Excalidraw Data
Text Elements
seed url queue
web crawler: download, extract text, extract urls
S3 blob
DNS
website
consumer url
save downloaded file
download file
resolve host
send newly fetched urls back to the queue
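A minimal sketch of this first, single-worker design in Python; the queue URL, bucket name, and libraries (boto3, requests, BeautifulSoup) are assumptions, not part of the diagram.

```python
import socket
from urllib.parse import urljoin, urlparse

import boto3                    # assumed AWS SDK
import requests                 # assumed HTTP client
from bs4 import BeautifulSoup   # assumed HTML parser

sqs = boto3.client("sqs")
s3 = boto3.client("s3")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/url-queue"  # hypothetical
BUCKET = "crawler-pages"                                                  # hypothetical

def crawl_once() -> None:
    # consumer: pull one url from the seed url queue
    msgs = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=1).get("Messages", [])
    for msg in msgs:
        url = msg["Body"]
        socket.gethostbyname(urlparse(url).hostname)    # resolve host (DNS)
        html = requests.get(url, timeout=10).text       # download file from website
        soup = BeautifulSoup(html, "html.parser")
        s3.put_object(Bucket=BUCKET, Key=url, Body=soup.get_text())  # extract text, save to S3 blob
        for a in soup.find_all("a", href=True):
            # extract urls and send newly fetched urls back to the queue
            sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=urljoin(url, a["href"]))
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```

Everything happens in one process, and a failure at any step drops the url on the floor, which is exactly what the list below calls out.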
Problems
- No retry
- No fault tolerance
- Violates the single responsibility principle
- Not enough throughput for 10B pages to be crawled
seed url queue
web crawler: fetch and store pages
S3 blob store: raw HTML data, text data
consumer url
save html file data
put extracted urls back in the queue
Website
DNS
parser queue
parsing service: extract urls, extract text
url metadata
{url: s3Url}
url (PK), textDataUrl, status, lastCrawlTime
website domain metadata
Download HTML from S3, save parsed text to S3
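A sketch of the split pipeline under the same assumptions: the crawler only fetches and stores raw HTML, then hands off to the parser queue; the parsing service pulls the HTML back from S3, saves the extracted text, and puts extracted urls back in the queue. Table, queue, and attribute names are hypothetical, loosely mirroring the diagram labels.

```python
import hashlib
import json
import time
from urllib.parse import urljoin

import boto3
import requests
from bs4 import BeautifulSoup

sqs = boto3.client("sqs")
s3 = boto3.client("s3")
url_metadata = boto3.resource("dynamodb").Table("url-metadata")  # hypothetical table

FRONTIER_QUEUE = "https://sqs.us-east-1.amazonaws.com/123456789012/frontier-queue"  # hypothetical
PARSER_QUEUE = "https://sqs.us-east-1.amazonaws.com/123456789012/parser-queue"      # hypothetical
BUCKET = "crawler-pages"                                                            # hypothetical

def fetch_and_store(url: str) -> None:
    """Crawler stage: fetch the page and store only the raw HTML (single responsibility)."""
    html = requests.get(url, timeout=10).text
    key = "raw/" + hashlib.sha256(url.encode()).hexdigest()
    s3.put_object(Bucket=BUCKET, Key=key, Body=html)
    url_metadata.put_item(Item={"url": url, "s3Url": key, "status": "FETCHED"})  # {url: s3Url}
    sqs.send_message(QueueUrl=PARSER_QUEUE, MessageBody=json.dumps({"url": url, "s3Url": key}))

def parse(msg: dict) -> None:
    """Parsing stage: download HTML from S3, extract text and urls, save parsed text to S3."""
    html = s3.get_object(Bucket=BUCKET, Key=msg["s3Url"])["Body"].read()
    soup = BeautifulSoup(html, "html.parser")
    text_key = msg["s3Url"].replace("raw/", "text/")
    s3.put_object(Bucket=BUCKET, Key=text_key, Body=soup.get_text())
    for a in soup.find_all("a", href=True):
        # put extracted urls back in the frontier queue
        sqs.send_message(QueueUrl=FRONTIER_QUEUE, MessageBody=urljoin(msg["url"], a["href"]))
    # update url status in the url metadata table
    url_metadata.update_item(
        Key={"url": msg["url"]},
        UpdateExpression="SET #s = :s, textDataUrl = :t, lastCrawlTime = :ts",
        ExpressionAttributeNames={"#s": "status"},  # "status" is a DynamoDB reserved word
        ExpressionAttributeValues={":s": "PARSED", ":t": text_key, ":ts": int(time.time())},
    )
```

With the stages decoupled, a parsing bug no longer forces a re-download: the raw HTML is already in S3 and can be re-parsed from there.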
Kafka / Amazon SQS
Multiple DNS providers to avoid overloading DNS, or DNS caching
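One way to sketch that, assuming dnspython: pick a random upstream provider per lookup and cache answers until their TTL expires. The provider list is illustrative.

```python
import random
import time

import dns.resolver  # assumed: dnspython

# Hypothetical pool of upstream providers so no single resolver gets overloaded.
PROVIDERS = [["8.8.8.8"], ["1.1.1.1"], ["9.9.9.9"]]
_cache: dict[str, tuple[float, str]] = {}  # host -> (expiry, ip)

def resolve(host: str) -> str:
    now = time.time()
    hit = _cache.get(host)
    if hit and hit[0] > now:
        return hit[1]                                   # DNS caching: reuse until TTL expires
    resolver = dns.resolver.Resolver()
    resolver.nameservers = random.choice(PROVIDERS)     # spread load across providers
    answer = resolver.resolve(host, "A")
    ip = answer[0].to_text()
    _cache[host] = (now + answer.rrset.ttl, ip)         # honor the record's TTL
    return ip
```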
Amazon SQS provides retry with delay. We can add jitter to the delay so retries of the same website's urls don't all bombard it at once, and enforce a retry limit with exponential backoff so nothing retries indefinitely in a forever-running process. A DLQ keeps the urls that still fail.
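A sketch of that wiring with boto3: a redrive policy caps retries before messages land in the DLQ, and on failure the consumer stretches the message's visibility timeout with exponential backoff plus jitter. Queue names and limits are assumptions.

```python
import json
import random

import boto3

sqs = boto3.client("sqs")

# DLQ first, then the main queue with a redrive policy: after maxReceiveCount
# failed receives, SQS moves the message to the DLQ instead of retrying forever.
dlq = sqs.create_queue(QueueName="frontier-dlq")
dlq_arn = sqs.get_queue_attributes(
    QueueUrl=dlq["QueueUrl"], AttributeNames=["QueueArn"]
)["Attributes"]["QueueArn"]
main = sqs.create_queue(
    QueueName="frontier-queue",
    Attributes={
        "RedrivePolicy": json.dumps(
            {"deadLetterTargetArn": dlq_arn, "maxReceiveCount": "5"}
        )
    },
)

def backoff_with_jitter(receive_count: int) -> int:
    """Exponential backoff capped at 15 min, with jitter so retries for the
    same website's urls don't all land at the same moment."""
    base = min(2 ** receive_count * 10, 900)
    return random.randint(base // 2, base)

def nack(queue_url: str, receipt_handle: str, receive_count: int) -> None:
    # On failure, delay the *next* retry by extending the visibility timeout.
    # receive_count comes from the message's ApproximateReceiveCount attribute.
    sqs.change_message_visibility(
        QueueUrl=queue_url,
        ReceiptHandle=receipt_handle,
        VisibilityTimeout=backoff_with_jitter(receive_count),
    )
```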
update url status
frontier queue
DLQ
retry on failure with backoff
dynamo db
lastTimeWeCrawledTheDomain, robots.txt metadata for that domain
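A sketch of the per-domain politeness check against that table; the table and attribute names echo the diagram but are otherwise assumptions.

```python
import time
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

import boto3

domain_meta = boto3.resource("dynamodb").Table("website-domain-metadata")  # hypothetical
CRAWL_DELAY_SECONDS = 5  # assumed default politeness gap between hits to one domain

def allowed_to_crawl(url: str) -> bool:
    domain = urlparse(url).hostname
    item = domain_meta.get_item(Key={"domain": domain}).get("Item")
    if item:
        # lastTimeWeCrawledTheDomain: back off if we hit this domain too recently
        if time.time() - float(item.get("lastCrawlTime", 0)) < CRAWL_DELAY_SECONDS:
            return False
        # robots.txt metadata cached for that domain
        robots = RobotFileParser()
        robots.parse(item.get("robotsTxt", "").splitlines())
        if not robots.can_fetch("my-crawler", url):
            return False
    domain_meta.update_item(
        Key={"domain": domain},
        UpdateExpression="SET lastCrawlTime = :t",
        ExpressionAttributeValues={":t": int(time.time())},
    )
    return True
```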
Lambda functions, ECS tasks, or any other serverless technology.
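If the workers run as Lambda functions behind SQS event source mappings, each stage collapses to a small handler. A sketch reusing the hypothetical functions from the earlier snippets:

```python
import json

def crawler_handler(event, context):
    """Lambda entry point for the frontier queue (SQS event source mapping)."""
    for record in event["Records"]:
        url = record["body"]
        if allowed_to_crawl(url):   # politeness check sketched above
            fetch_and_store(url)    # fetch stage sketched above
        # an unhandled exception here leaves the message on the queue,
        # so SQS drives the retry/backoff/DLQ behaviour described earlier

def parser_handler(event, context):
    """Lambda entry point for the parser queue."""
    for record in event["Records"]:
        parse(json.loads(record["body"]))
```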