I was seeing it more as discussing the problem while waiting for a solution than as arguing about it.
That is me too. Just a little friendly discourse to pass the time.
Are the bots literal in what they look for?
Copyright “bots” are automated systems that match content against a database of reference files of copyrighted material.
A swarm of tech companies is rushing in to provide technical solutions to enforce copyright in online sharing communities and video-streaming sites. Those players include Vobile, Attributor, Audible Magic, and Gracenote. And they’re thriving, despite the fact that U.S. copyright law, as modified by the 1998 Digital Millennium Copyright Act, doesn’t require sites that host user-created content to preemptively patrol for copyright violations.
Exactly how do these bots work?
I haven't come across any technical data on how it would apply here. In this case the bots likely search metadata for specific matches to keywords, such as the DVD codes themselves. How they would know to scrape the links that accompany a post and then report them is beyond anything I have found, so I am not sure it is something they are capable of. Internet bots are like blind bats flying around eating insects: they are very specific in their targets and in how they react to them.
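To make that concrete, here is a minimal sketch (in Python) of the kind of literal keyword matching a simple bot might do against a fetched page. The URL and keyword list are made up for illustration; this is a guess at the mechanism, not how any particular copyright bot actually works.

```python
import re
import urllib.request

# Hypothetical example: the URL and keyword list are made up for illustration.
URL = "https://example.com/forum/thread/12345"
KEYWORDS = ["dvd screener", "xvid", "dvdrip"]  # stand-ins for whatever terms a bot is tuned to match

def scrape_and_match(url, keywords):
    """Fetch a page and report which keywords appear in its text."""
    with urllib.request.urlopen(url, timeout=10) as response:
        html = response.read().decode("utf-8", errors="replace")
    # A real scraper would parse the HTML properly; a plain case-insensitive
    # search is enough to show the idea of literal keyword matching.
    return [kw for kw in keywords if re.search(re.escape(kw), html, re.IGNORECASE)]

if __name__ == "__main__":
    print(scrape_and_match(URL, KEYWORDS))
```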
The basic term for what these bots do is "web scraping" (also called web harvesting or web data extraction). It is closely related to what search engines do, which is called web indexing: indexing information on the web using a bot or web crawler.
Technical measures to stop bots (from the Wiki which knows all):
Blocking an IP address. This will also block all browsing from that address.
Disabling any web service API that the website's system might expose.
Bots sometimes declare who they are (using user agent strings) and can be blocked on that basis (using robots.txt); 'googlebot' is an example. Some bots make no distinction between themselves and a human browser.
Bots can be blocked by excess traffic monitoring (see the rate-limiting sketch after this list).
Bots can sometimes be blocked with tools that verify a real person is accessing the site, like a CAPTCHA. Bots are sometimes coded to explicitly break specific CAPTCHA patterns.
Commercial anti-bot services: Companies offer anti-bot and anti-scraping services for websites. A few web application firewalls have limited bot detection capabilities as well.
Locating bots with a honeypot or other method to identify the IP addresses of automated crawlers.
Using CSS sprites to display such data as phone numbers or email addresses, at the cost of accessibility to screen reader users.
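On the "excess traffic monitoring" item above, here is a minimal sketch of the idea, assuming a simple per-IP sliding window; the thresholds are illustrative, not anything a real site necessarily uses.

```python
import time
from collections import defaultdict, deque

# Illustrative thresholds; a real site would tune these to its own traffic.
WINDOW_SECONDS = 60
MAX_REQUESTS_PER_WINDOW = 120

_request_log = defaultdict(deque)  # client IP -> timestamps of recent requests

def is_excess_traffic(ip):
    """Return True if this IP has made too many requests in the sliding window."""
    now = time.time()
    timestamps = _request_log[ip]
    timestamps.append(now)
    # Drop entries that have fallen out of the window.
    while timestamps and now - timestamps[0] > WINDOW_SECONDS:
        timestamps.popleft()
    return len(timestamps) > MAX_REQUESTS_PER_WINDOW

# Inside a request handler, something like:
#   if is_excess_traffic(client_ip):
#       return "429 Too Many Requests"
```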
Why these techniques don't work (from the ShieldSquare anti-bot service):
Setting up robots.txt – Surprisingly, this technique is used against malicious bots! Why this wouldn’t work is pretty straightforward – robots.txt is an agreement between websites and search engine bots to prevent search engine bots from accessing sensitive information. No malicious bot (or the scraper behind it) in its right mind would obey robots.txt. This is the most ineffective method to prevent scraping.
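For reference, a robots.txt file is nothing more than a plain text file of polite requests served at the site root; the paths and bot name below are purely illustrative. Nothing enforces it, which is exactly why a malicious scraper can simply ignore it.

```
# robots.txt served at the site root; the paths are purely illustrative
User-agent: *
Disallow: /private/
Disallow: /search

User-agent: BadScraperBot
Disallow: /
```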
Filtering requests by user agent – The user agent string of a client is set by the client itself. One method is to obtain this from the HTTP header of a request; that way, a request can be filtered even before the content is served. We observed that very few bots (fewer than 10%) used a default user agent string that belonged to a scraping tool or was an empty string. Once their requests to the website were filtered based on the user agent, it didn’t take long for scrapers to realize this and change their user agent to that of any well-known browser. This method merely stops new bots written by inexperienced scrapers for a few hours.
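A sketch of what user agent filtering amounts to, assuming a hypothetical blocklist of user-agent substrings; the comment at the end shows why it is so easy to defeat.

```python
# Hypothetical blocklist of user-agent substrings.
BLOCKED_AGENT_SUBSTRINGS = ["python-requests", "curl", "scrapy"]

def should_block(user_agent):
    """Filter a request by its User-Agent header before any content is served."""
    ua = (user_agent or "").strip().lower()
    if not ua:              # an empty user agent string is almost certainly a script
        return True
    return any(marker in ua for marker in BLOCKED_AGENT_SUBSTRINGS)

# The client controls this header, so a scraper defeats the filter with one line,
# e.g. with the requests library:
#   requests.get(url, headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"})
```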
Blacklisting the IP address – Turning to an IP blacklisting service is much easier than the hectic process of capturing more metrics from page requests and analyzing server logs. There are plenty of third-party services which maintain a database of blacklisted IPs. In our hunt for a suitable blacklisting service, we found that using a third-party DNSBL/RBL service was not effective: these services blacklist only email spambot servers and do little to prevent scraping bots. Less than 2% of scraping bots were detected for one of our customers when we did a trial run.
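For what it's worth, a DNSBL check is just a DNS lookup with the IP's octets reversed. A minimal sketch follows, using the real Spamhaus ZEN zone purely as an example; as the post says, such lists are tuned for mail spam sources rather than scraping bots.

```python
import socket

# Spamhaus ZEN is a real, widely used DNSBL zone, used here only as an example.
DNSBL_ZONE = "zen.spamhaus.org"

def is_listed(ip, zone=DNSBL_ZONE):
    """Check an IPv4 address against a DNS-based blacklist (DNSBL).

    The octets are reversed and queried as a hostname under the blocklist zone:
    203.0.113.7 -> 7.113.0.203.zen.spamhaus.org. Any answer means "listed";
    NXDOMAIN means "not listed".
    """
    query = ".".join(reversed(ip.split("."))) + "." + zone
    try:
        socket.gethostbyname(query)
        return True
    except socket.gaierror:
        return False

print(is_listed("203.0.113.7"))  # a TEST-NET address, normally not listed
```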
Throwing CAPTCHA – A very well-known practice to stop bots is to throw a CAPTCHA on pages with sensitive content. Although effective against bots, the CAPTCHA is shown to all clients requesting the web page, whether human or bot. This method often antagonizes users and hence reduces traffic to the website. Some more insights into Google's new No CAPTCHA reCAPTCHA can be found in our previous blog post.
Honey pot or honey trap – Honey pots are a brilliant trap mechanism to capture new bots (scrapers who are not well versed with the structure of every page) on the website. But this approach poses a lesser-known threat of reducing the page rank on search engines. Here’s why – search engine bots visit these links and might get trapped accidentally. Even if exceptions were made by disallowing a set of known user agents, the links to the traps might still be indexed by a search engine bot. These links are interpreted as dead, irrelevant, or fake links by search engines, and with more such traps the ranking of the website decreases considerably. Furthermore, filtering requests based on user agent can be exploited as discussed above. In short, honey pots are a risky business which must be handled very carefully.
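A bare-bones sketch of the honeypot idea, assuming a hypothetical /do-not-follow/ trap URL; the hidden link and the robots.txt rule are there to keep legitimate crawlers out, which is exactly the part that, as described above, can still go wrong.

```python
# Bare-bones honeypot sketch; the /do-not-follow/ trap URL is hypothetical.
#
# The page carries a link no human should ever see or click:
#   <a href="/do-not-follow/" style="display:none" rel="nofollow">trap</a>
# and robots.txt tells well-behaved crawlers to stay away:
#   Disallow: /do-not-follow/
# Anything that still requests the trap URL is treated as a scraper.

_trapped_ips = set()

def handle_trap_request(client_ip):
    """Record the IP of any client that follows the hidden trap link."""
    _trapped_ips.add(client_ip)

def is_known_scraper(client_ip):
    return client_ip in _trapped_ips
```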
Sorry to go on and on but it is a fascinating subject.........