The Webcrawling Robot Swarm War is On!

15 points · 07 Jul 2021 11:24 · by u/vYu7495

3 comments

3 points
> If I cannot block all robots, I want to at least be able to identify them to prevent my script from counting robot-generated page views as being from real people.

Nginx and Apache2 log the user agent, IP, and page accessed. Web crawlers usually include identifiers like "bingbot" in their User-Agent header. I still respect your other justifications for blocking the crawlers.

edit: OP is using Lighttpd
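A minimal sketch of what that looks like in practice, assuming the combined log format that Nginx and Apache2 commonly write; the regex and token list below are illustrative, not taken from the OP's setup:

```python
import re

# Combined log format ends with: "request" status bytes "referer" "user-agent"
# (illustrative pattern; adjust to whatever format your access log actually uses)
LOG_LINE = re.compile(
    r'"(?P<request>[^"]*)" \d+ \S+ "(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"$'
)

# Tokens that crawlers commonly put in their User-Agent header (non-exhaustive)
CRAWLER_TOKENS = ("bot", "crawl", "spider", "slurp", "archive")

def looks_like_crawler(line: str) -> bool:
    """Return True if the log line's user agent contains a known crawler token."""
    match = LOG_LINE.search(line)
    if not match:
        return False
    user_agent = match.group("user_agent").lower()
    return any(token in user_agent for token in CRAWLER_TOKENS)

# Example: a Bingbot request in combined log format
sample = ('203.0.113.7 - - [07/Jul/2021:11:24:00 +0000] "GET / HTTP/1.1" 200 512 '
          '"-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"')
print(looks_like_crawler(sample))  # True
```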
2 points
Maybe it was somebody trying to archive the site. I think the Wayback Machine also crawls and archives websites.
2 points
> Unfortunately, I do not record the user agents of robots that do not contain one of the following special strings: "bot", "Bot", "rawl", "slurp", "pider", "oogle", "rchive", "acebook", "earch", or "http". This is because I don't log user agents that I think are coming from real people. This speeds up my script and reduces the wear on my storage media.

Wow, you've potentially saved literal microseconds and increased the lifetime of your drives by the same amount.
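The rule the OP quotes boils down to a substring check before logging. A rough sketch of that check, with a hypothetical should_record helper standing in for whatever the actual script does:

```python
# Substrings the OP checks for before recording a user agent
# (partial strings like "rawl" catch "crawl"/"Crawl", "oogle" catches "Google"/"google")
ROBOT_STRINGS = ("bot", "Bot", "rawl", "slurp", "pider", "oogle",
                 "rchive", "acebook", "earch", "http")

def should_record(user_agent: str) -> bool:
    """Record the user agent only if it looks like a robot, per the quoted rule."""
    return any(s in user_agent for s in ROBOT_STRINGS)

# A presumed-human browser UA is skipped; a crawler UA is kept
print(should_record("Mozilla/5.0 (X11; Linux x86_64) Firefox/89.0"))   # False
print(should_record("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"))  # True
```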