Java – Check if web request is from Google crawler
If a web request comes from the Google crawler (Googlebot), its “user agent” header should look similar to this :
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
or
(rarely used): Googlebot/2.1 (+http://www.google.com/bot.html)
Source : Google crawlers
1. Java Example
In Java, you can get the “user agent” from HttpServletRequest.
@Autowired
private HttpServletRequest request;

//...
String userAgent = request.getHeader("user-agent");
System.out.println("User Agent : " + userAgent);

if (!StringUtils.isEmpty(userAgent)) {
    if (userAgent.toLowerCase().contains("googlebot")) {
        System.out.println("This is Google bot");
    } else {
        System.out.println("Not from Google");
    }
}
The above solution works well, but it fails to detect a fake or spoofed user agent.
2. Fake User Agent
It’s easy to send a request with a fake (spoofed) user agent. For example :
package com.mkyong.web;

import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.HttpClientBuilder;

public class test {

    public static void main(String[] args) throws Exception {

        HttpClient client = HttpClientBuilder.create().build();
        // the URL needs a scheme, otherwise HttpClient cannot determine the target host
        HttpGet request = new HttpGet("http://abcdefg.com");
        // pretend to be Googlebot
        request.setHeader("user-agent", "fake googlebot");
        HttpResponse response = client.execute(request);

    }

}
Output on the abcdefg.com server :
User Agent : fake googlebot
This is Google bot
3. Verifying Googlebot
To verify the real Googlebot, you can use “reverse DNS lookup” manually like this :
> host 66.249.66.1
1.66.249.66.in-addr.arpa domain name pointer crawl-66-249-66-1.googlebot.com.
> host crawl-66-249-66-1.googlebot.com
crawl-66-249-66-1.googlebot.com has address 66.249.66.1
Source : Verifying Googlebot
4. Verifying Googlebot – Java Example
Based on the above theory, we can simulate the first part of the “reverse DNS lookup”: use the host command to find out which host name the request IP points to.
If the request comes from Googlebot, the output contains a host name matching the pattern *.googlebot.com.
P.S. The host command is available on *nix systems only.
@Autowired
private HttpServletRequest request;

//...
String requestIp = getRequestIp();
String userAgent = request.getHeader("user-agent");
System.out.println("User Agent : " + userAgent);

if (!StringUtils.isEmpty(userAgent)) {
    if (userAgent.toLowerCase().contains("googlebot")) {
        //check fake user agent
        String output = executeCommand("host " + requestIp);
        System.out.println("Output : " + output);
        if (output.toLowerCase().contains("googlebot.com")) {
            System.out.println("This is Google bot");
        } else {
            System.out.println("This is fake user agent");
        }
    } else {
        System.out.println("Not from Google");
    }
}
// get the request IP
private String getRequestIp() {
    String ipAddress = request.getHeader("X-FORWARDED-FOR");
    if (ipAddress == null) {
        ipAddress = request.getRemoteAddr();
    }
    return ipAddress;
}

// execute an external command and return its standard output
private String executeCommand(String command) {
    StringBuilder output = new StringBuilder();
    try {
        Process p = Runtime.getRuntime().exec(command);
        // read the output before waiting, so a full pipe buffer cannot block the process
        try (BufferedReader reader =
                new BufferedReader(new InputStreamReader(p.getInputStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                output.append(line).append("\n");
            }
        }
        p.waitFor();
    } catch (Exception e) {
        e.printStackTrace();
    }
    return output.toString();
}
Try the “step 2” fake user agent example again. Now, you get this output:
Output : Host 142.1.168.192.in-addr.arpa. not found: 3(NXDOMAIN) //this output may vary.
User Agent : fake googlebot
This is fake user agent
This simple solution may not be able to stop fake/spoofed user agents 100%, but this extra layer of security should stop most basic user agent spoofing attacks.
If you have a better solution, do share below, thanks.
I wouldn’t use X-FORWARDED-FOR because anyone can spoof it. If you do use it, it shouldn’t be passed directly to .exec() as a string. What would happen if X-FORWARDED-FOR is ” bla | rm -rf /” ?
Definitely a good catch on your part, Andrei. I think a good solution to prevent that from happening would be to use a regex and to capture only the part that matched a correctly formed IP address.
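As a rough illustration of the idea in this comment, the sketch below accepts only a strictly formed IPv4 address and passes it to the host command as a separate argument instead of building a command string. The class name IpInputGuard and its methods are made up for this example, and IPv6 addresses are deliberately not handled.

```java
import java.util.regex.Pattern;

// Hypothetical helper: validates an IP string before it gets anywhere near a process.
final class IpInputGuard {

    // One octet in the range 0-255, digits only.
    private static final String OCTET = "(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])";

    // Exactly four dot-separated octets, nothing before or after.
    private static final Pattern IPV4 =
            Pattern.compile(OCTET + "\\." + OCTET + "\\." + OCTET + "\\." + OCTET);

    // True only if the whole string is a well-formed IPv4 address.
    static boolean isIpv4(String candidate) {
        return candidate != null && IPV4.matcher(candidate).matches();
    }

    // Builds the lookup command as an argument list; rejects anything that is not IPv4.
    static ProcessBuilder hostCommand(String ip) {
        if (!isIpv4(ip)) {
            throw new IllegalArgumentException("Not an IPv4 address: " + ip);
        }
        return new ProcessBuilder("host", ip);
    }
}
```

With this in place, a header value such as " bla | rm -rf /" is rejected by isIpv4 before any process is started.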
«Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.» 😉
* I wouldn’t start an executable for every request (and the result should of course be cached)
* I would use X-Forwarded-For only if it is set on my own network (I doubt Google uses (visible) proxies for crawling)
* The X-Forwarded-For header can list several proxies, “X-Forwarded-For: client, proxy1, proxy2”. What happens when one of those is IPv6? What happens if a port is appended (not very common, it seems, but it may happen)?
* The code misses the second part of the verification:
> host crawl-66-249-66-1.googlebot.com
crawl-66-249-66-1.googlebot.com has address 66.249.66.1
One can spoof the reverse DNS lookup ( http://en.wikipedia.org/wiki/Forward-confirmed_reverse_DNS )
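One possible way to act on the X-Forwarded-For points above is sketched here. It assumes a single reverse proxy that you operate yourself and that appends the address of its peer to the header. The class name ClientIpResolver and the proxy address in the usage note are hypothetical, and ports and IPv6 formats are not handled.

```java
import java.util.Set;

// Hypothetical helper: decides which address to treat as the client IP.
final class ClientIpResolver {

    // Addresses of reverse proxies that you run yourself (an assumption of this sketch).
    private final Set<String> trustedProxies;

    ClientIpResolver(Set<String> trustedProxies) {
        this.trustedProxies = Set.copyOf(trustedProxies);
    }

    // remoteAddr: the direct TCP peer, e.g. request.getRemoteAddr()
    // forwardedFor: the raw X-Forwarded-For header value, may be null
    String resolve(String remoteAddr, String forwardedFor) {
        // Ignore the header unless the request arrived through one of our own proxies.
        if (forwardedFor == null || !trustedProxies.contains(remoteAddr)) {
            return remoteAddr;
        }
        // Keep trailing empty entries so a malformed header cannot produce an empty array.
        String[] hops = forwardedFor.split(",", -1);
        // The last entry is the one our proxy appended; earlier entries are client-supplied.
        String last = hops[hops.length - 1].trim();
        return last.isEmpty() ? remoteAddr : last;
    }
}
```

For example, new ClientIpResolver(Set.of("10.0.0.5")).resolve(request.getRemoteAddr(), request.getHeader("X-Forwarded-For")) would ignore the header for any request that does not arrive through 10.0.0.5.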
All I meant was that it would prevent someone from running an arbitrary command on your system shell and limit the input to the correct form. Sometimes a regex is not appropriate, but matching an arbitrary IP is simple enough.
So now the new question at hand, how would you go about actually verifying a google bot? Are you saying that it is not possible in any case at all?
You can. I said that the steps two and three are missing (steps from wikipedia):
1. First a reverse DNS lookup (PTR query) is performed on the IP address, which returns a list of zero or more PTR records.
2. For each domain name returned in the PTR query results, a regular ‘forward’ DNS lookup (type A or AAAA query) is then performed on that domain name.
3. Any A or AAAA record returned by the second query is then compared against the original IP address, and if there is a match, then the FCrDNS check passes.
Or from original article from Google: “You can verify that a bot accessing your server really is Googlebot (or another Google user-agent) by using a reverse DNS lookup, verifying that the name is in the googlebot.com domain, **and then doing a forward DNS lookup using that googlebot name** ”
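The three steps could look roughly like this in Java, using the JDK's InetAddress instead of the host command. This is a sketch under assumptions, not a reference implementation: the GooglebotVerifier class and its Dns interface are invented for the example (the interface exists only so the logic can be tried without network access), only one PTR name is checked, and addresses are compared as plain strings, which is adequate for IPv4 only.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: forward-confirmed reverse DNS check for Googlebot.
final class GooglebotVerifier {

    // Minimal DNS abstraction so the check can be exercised with a fake resolver.
    interface Dns {
        String reverse(String ip) throws UnknownHostException;
        List<String> forward(String host) throws UnknownHostException;
    }

    // Resolver backed by the JDK.
    static final Dns SYSTEM_DNS = new Dns() {
        @Override
        public String reverse(String ip) throws UnknownHostException {
            // For an IP literal, getCanonicalHostName performs the reverse lookup;
            // in practice it returns the IP text itself when no name is found.
            return InetAddress.getByName(ip).getCanonicalHostName();
        }

        @Override
        public List<String> forward(String host) throws UnknownHostException {
            List<String> addresses = new ArrayList<>();
            for (InetAddress address : InetAddress.getAllByName(host)) {
                addresses.add(address.getHostAddress());
            }
            return addresses;
        }
    };

    static boolean isGooglebot(String ip, Dns dns) {
        try {
            // Step 1: reverse lookup (PTR) on the request IP.
            String host = dns.reverse(ip).toLowerCase(Locale.ROOT);
            if (host.endsWith(".")) {
                host = host.substring(0, host.length() - 1); // drop the trailing root dot
            }
            // The name must end with the domain, not merely contain it.
            if (!host.endsWith(".googlebot.com")) {
                return false;
            }
            // Steps 2 and 3: forward lookup, then compare with the original IP.
            return dns.forward(host).contains(ip);
        } catch (UnknownHostException e) {
            return false;
        }
    }
}
```

A call such as GooglebotVerifier.isGooglebot(requestIp, GooglebotVerifier.SYSTEM_DNS) would then replace the output.contains("googlebot.com") test. As suggested earlier in this thread, caching the result per IP would avoid a DNS round trip on every request.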
I don’t know if it’s such a good idea to verify if the bot comes from Google though.
Nice catch! The author still mentions that this is just a SIMPLE solution (not to stop fake user agents 100%). I guess it’s more to showcase the idea… But again your example is really nice ))
This really helps, but I wanted to know how to write Java code to detect fake websites.
Thanks for this short and easy to understand (as always) tutorial ))
nice. Thanks for this short and easy to understand tutorial. Thanks for sharing this information.