jsoup – Basic web crawler example


A Web Crawler is a program that navigates the Web and finds new or updated pages for indexing. The Crawler starts with seed websites or a wide range of popular URLs (also known as the frontier) and searches in depth and width for hyperlinks to extract.

A Web Crawler must be kind and robust. Kindness for a Crawler means that it respects the rules set by the robots.txt and avoids visiting a website too often. Robustness refers to the ability to avoid spider traps and other malicious behavior. Other good attributes for a Web Crawler are distribution across multiple machines, extensibility, continuity and the ability to prioritize based on page quality.
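
To make the "kindness" rule a little more concrete, here is a minimal sketch of a robots.txt check in plain Java. It is deliberately simplified: it only looks at groups that start with `User-agent: *` and at `Disallow` prefixes, and it ignores `Allow`, wildcards and agent-specific groups that real robots.txt files may use. The class and method names (`RobotsSketch`, `parseDisallowed`, `isAllowed`) are made up for this illustration and are not part of jsoup.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: a deliberately simplified robots.txt check.
class RobotsSketch {

    // Collects the Disallow prefixes listed under "User-agent: *".
    static List<String> parseDisallowed(String robotsTxt) {
        List<String> disallowed = new ArrayList<>();
        boolean appliesToUs = false;
        for (String rawLine : robotsTxt.split("\\r?\\n")) {
            // Strip comments and surrounding whitespace
            String line = rawLine.replaceAll("#.*$", "").trim();
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // blank or malformed line
            }
            String field = line.substring(0, colon).trim();
            String value = line.substring(colon + 1).trim();
            if (field.equalsIgnoreCase("user-agent")) {
                appliesToUs = value.equals("*");
            } else if (field.equalsIgnoreCase("disallow") && appliesToUs && !value.isEmpty()) {
                disallowed.add(value);
            }
        }
        return disallowed;
    }

    // A path is allowed when it does not start with any disallowed prefix.
    static boolean isAllowed(String path, List<String> disallowed) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String robotsTxt = "User-agent: *\nDisallow: /private/\nDisallow: /tmp/ # scratch\n";
        List<String> rules = parseDisallowed(robotsTxt);
        System.out.println(isAllowed("/tutorials/java/", rules));     // prints true
        System.out.println(isAllowed("/private/report.html", rules)); // prints false
    }
}
```

In a real crawler the robots.txt text would be downloaded once per host and cached, and a delay between requests to the same host would cover the "not too often" part.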

1. Steps to create web crawler

The basic steps to write a Web Crawler are:

  1. Pick a URL from the frontier
  2. Fetch the HTML code
  3. Parse the HTML to extract links to other URLs
  4. Check if you have already crawled the URLs and/or if you have seen the same content before
    • If not, add it to the index
  5. For each extracted URL
    • Confirm that it agrees to be checked (robots.txt, crawling frequency)
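
The steps above can be sketched as a simple loop over a queue that plays the role of the frontier. To keep the sketch runnable without a network connection, the "web" here is an in-memory map from a URL to the links found on that page, and the robots.txt check is reduced to a set of blocked URLs. Both are assumptions made for illustration, as are the names `FrontierSketch` and `crawl`.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the crawl loop; fetching and parsing are replaced by a map lookup.
class FrontierSketch {

    static Set<String> crawl(String seed, Map<String, List<String>> web, Set<String> blocked) {
        Deque<String> frontier = new ArrayDeque<>();
        Set<String> index = new LinkedHashSet<>(); // keeps insertion order for readable output
        frontier.add(seed);

        while (!frontier.isEmpty()) {
            // 1. Pick a URL from the frontier
            String url = frontier.poll();
            // 4. Skip URLs that were already crawled, otherwise add them to the index
            if (!index.add(url)) {
                continue;
            }
            // 2 + 3. "Fetch" the page and extract its links (here: a map lookup)
            List<String> linksOnPage = web.getOrDefault(url, Collections.emptyList());
            // 5. Queue every extracted URL that agrees to be crawled
            for (String link : linksOnPage) {
                if (!blocked.contains(link)) {
                    frontier.add(link);
                }
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, List<String>> web = new HashMap<>();
        web.put("http://example.com/", Arrays.asList("http://example.com/a", "http://example.com/b"));
        web.put("http://example.com/a", Arrays.asList("http://example.com/", "http://example.com/private"));
        Set<String> blocked = Collections.singleton("http://example.com/private");
        // prints [http://example.com/, http://example.com/a, http://example.com/b]
        System.out.println(crawl("http://example.com/", web, blocked));
    }
}
```

Using a queue instead of recursion also means the call stack does not grow with the number of pages visited.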

Truth be told, developing and maintaining a single Web Crawler across all pages on the internet is difficult, if not impossible, considering that there are over 1 billion websites online right now. If you are reading this article, chances are you are not looking for a guide to create a Web Crawler but a Web Scraper. Why is the article called ‘Basic Web Crawler’ then? Well, because it’s catchy. Really! Few people know the difference between crawlers and scrapers, so we all tend to use the word “crawling” for everything, even for offline data scraping. Also, because to build a Web Scraper you need a crawl agent too. And finally, because this article intends to inform as well as provide a viable example.

2. The skeleton of a crawler

For HTML parsing we will use jsoup. The examples below were developed using jsoup version 1.10.2.

pom.xml

        <dependency>
            <groupId>org.jsoup</groupId>
            <artifactId>jsoup</artifactId>
            <version>1.10.2</version>
        </dependency>

So let’s start with the basic code for a Web Crawler.

BasicWebCrawler.java

package com.mkyong.basicwebcrawler;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;
import java.util.HashSet;

public class BasicWebCrawler {

    private HashSet<String> links;

    public BasicWebCrawler() {
        links = new HashSet<String>();
    }

    public void getPageLinks(String URL) {
        //4. Check if you have already crawled the URLs 
        //(we are intentionally not checking for duplicate content in this example)
        if (!links.contains(URL)) {
            try {
                //4. (i) If not add it to the index
                if (links.add(URL)) {
                    System.out.println(URL);
                }

                //2. Fetch the HTML code
                Document document = Jsoup.connect(URL).get();
                //3. Parse the HTML to extract links to other URLs
                Elements linksOnPage = document.select("a[href]");

                //5. For each extracted URL... go back to Step 4.
                for (Element page : linksOnPage) {
                    getPageLinks(page.attr("abs:href"));
                }
            } catch (IOException e) {
                System.err.println("For '" + URL + "': " + e.getMessage());
            }
        }
    }

    public static void main(String[] args) {
        //1. Pick a URL from the frontier
        new BasicWebCrawler().getPageLinks("http://www.mkyong.com/");
    }

}
Note
Don’t let this code run for too long. It can take hours without ending.

Sample Output:


http://www.mkyong.com/
http://www.mkyong.com/tutorials/android-tutorial/
http://www.mkyong.com/tutorials/java-io-tutorials/
http://www.mkyong.com/tutorials/java-xml-tutorials/
http://www.mkyong.com/tutorials/java-json-tutorials/
http://www.mkyong.com/tutorials/java-regular-expression-tutorials/
http://www.mkyong.com/tutorials/jdbc-tutorials/
...(+ many more links)
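
One thing to keep in mind about the duplicate check in getPageLinks(): it compares raw strings, so http://www.mkyong.com and http://www.mkyong.com/#top would count as two different pages. Below is a minimal sketch of how URLs could be normalized before they go into the HashSet, using only java.net.URI. The helper name `normalize` and the chosen rules (lowercase scheme and host, drop the fragment, treat an empty path as "/") are assumptions for illustration, not something jsoup does for you.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: builds a canonical key for the "already crawled" check.
class UrlNormalizerSketch {

    // Returns a canonical form of the URL, or null if it cannot be used for crawling.
    static String normalize(String url) {
        try {
            URI uri = new URI(url.trim()).normalize();
            if (uri.getScheme() == null || uri.getHost() == null) {
                return null; // relative or non-hierarchical links (e.g. mailto:) are skipped
            }
            String scheme = uri.getScheme().toLowerCase(Locale.ROOT);
            String host = uri.getHost().toLowerCase(Locale.ROOT);
            String rawPath = uri.getRawPath();
            String path = (rawPath == null || rawPath.isEmpty()) ? "/" : rawPath;

            // Rebuild the URL without the fragment; port and query are kept as they were
            StringBuilder sb = new StringBuilder(scheme).append("://").append(host);
            if (uri.getPort() != -1) {
                sb.append(':').append(uri.getPort());
            }
            sb.append(path);
            if (uri.getRawQuery() != null) {
                sb.append('?').append(uri.getRawQuery());
            }
            return sb.toString();
        } catch (URISyntaxException e) {
            return null; // not a syntactically valid URI
        }
    }

    public static void main(String[] args) {
        Set<String> seen = new HashSet<>();
        String[] found = {"http://www.mkyong.com", "HTTP://WWW.mkyong.com/#top", "http://www.mkyong.com/page/2"};
        for (String link : found) {
            String key = normalize(link);
            if (key != null && seen.add(key)) {
                System.out.println(key);
            }
        }
        // prints http://www.mkyong.com/ and http://www.mkyong.com/page/2 (the second input is a duplicate)
    }
}
```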

As we mentioned before, a Web Crawler searches in width and depth for links. If we imagine the links on a website as a tree-like structure, the root node or level zero would be the link we start with, the next level would be all the links that we found on level zero, and so on.
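
The idea of levels can be illustrated with a small breadth-first walk that records the level at which each link is first seen. As a sketch only, the site is modelled as an in-memory map from a page to the links on it; the paths and the names `LinkLevelsSketch` and `levels` are invented for this example.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: assigns a level to every link reachable from the root.
class LinkLevelsSketch {

    // Breadth-first walk: every URL gets the level at which it was first seen.
    static Map<String, Integer> levels(String root, Map<String, List<String>> site) {
        Map<String, Integer> level = new LinkedHashMap<>();
        Deque<String> queue = new ArrayDeque<>();
        level.put(root, 0); // the link we start with is level zero
        queue.add(root);
        while (!queue.isEmpty()) {
            String url = queue.poll();
            for (String link : site.getOrDefault(url, Collections.emptyList())) {
                if (!level.containsKey(link)) {
                    // links found on a page sit one level below that page
                    level.put(link, level.get(url) + 1);
                    queue.add(link);
                }
            }
        }
        return level;
    }

    public static void main(String[] args) {
        Map<String, List<String>> site = new HashMap<>();
        site.put("/", Arrays.asList("/java", "/spring"));
        site.put("/java", Arrays.asList("/java/io", "/"));
        // prints {/=0, /java=1, /spring=1, /java/io=2}
        System.out.println(levels("/", site));
    }
}
```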

3. Taking crawling depth into account

We will modify the previous example to limit the depth of link extraction. The only real difference between this example and the previous one is that the recursive getPageLinks() method takes an integer argument representing the depth of the link, which is also added as a condition in the if statement.

WebCrawlerWithDepth.java

package com.mkyong.depthwebcrawler;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;
import java.util.HashSet;

public class WebCrawlerWithDepth {
    private static final int MAX_DEPTH = 2;
    private HashSet<String> links;

    public WebCrawlerWithDepth() {
        links = new HashSet<>();
    }

    public void getPageLinks(String URL, int depth) {
        if ((!links.contains(URL) && (depth < MAX_DEPTH))) {
            System.out.println(">> Depth: " + depth + " [" + URL + "]");
            try {
                links.add(URL);

                Document document = Jsoup.connect(URL).get();
                Elements linksOnPage = document.select("a[href]");

                depth++;
                for (Element page : linksOnPage) {
                    getPageLinks(page.attr("abs:href"), depth);
                }
            } catch (IOException e) {
                System.err.println("For '" + URL + "': " + e.getMessage());
            }
        }
    }

    public static void main(String[] args) {
        new WebCrawlerWithDepth().getPageLinks("http://www.mkyong.com/", 0);
    }
}
Note
Feel free to run the above code. It only took a few minutes on my laptop with depth set to 2. Please keep in mind, the higher the depth the longer it will take to finish.

Sample Output:


...
>> Depth: 1 [https://docs.gradle.org/current/userguide/userguide.html]
>> Depth: 1 [http://hibernate.org/orm/]
>> Depth: 1 [https://jax-ws.java.net/]
>> Depth: 1 [http://tomcat.apache.org/tomcat-8.0-doc/index.html]
>> Depth: 1 [http://www.javacodegeeks.com/]
For 'http://www.javacodegeeks.com/': HTTP error fetching URL
>> Depth: 1 [http://beust.com/weblog/]
>> Depth: 1 [https://dzone.com]
>> Depth: 1 [https://wordpress.org/]
>> Depth: 1 [http://www.liquidweb.com/?RID=mkyong]
>> Depth: 1 [http://www.mkyong.com/privacy-policy/]

4. Data Scraping vs. Data Crawling

So far so good for a theoretical approach to the matter. The fact is that you will hardly ever build a generic crawler, and if you want a “real” one, you should use tools that already exist. Most of what the average developer does is extract specific information from specific websites, and even though that includes building a Web Crawler, it’s actually called Web Scraping.

There is a very good article by Arpan Jha for PromptCloud on Data Scraping vs. Data Crawling which personally helped me a lot to understand this distinction and I would suggest reading it.

To summarize it with a table taken from this article:

Data Scraping | Data Crawling
Involves extracting data from various sources including the web | Refers to downloading pages from the web
Can be done at any scale | Mostly done at a large scale
Deduplication is not necessarily a part | Deduplication is an essential part
Needs crawl agent and parser | Needs only crawl agent

Time to move out of theory and into a viable example, as promised in the intro. Let’s imagine a scenario in which we want to get all the URLs for articles that relate to Java 8 from mkyong.com. Our goal is to retrieve that information in the shortest time possible and thus avoid crawling through the whole website. Crawling the whole website would waste not only the server’s resources but our time as well.

5. Case Study – Extract all articles for ‘Java 8’ on mkyong.com

5.1 The first thing we should do is look at the code of the website. A quick look at mkyong.com shows that the front page is paginated and that each page follows a /page/xx pattern.

[Image: jsoup-web-crawler-example-1]

That brings us to the realization that the information we are looking for is easily accessed by retrieving all the links that include /page/. So instead of running through the whole website, we will limit our search using document.select("a[href^=\"http://www.mkyong.com/page/\"]"). With this CSS selector we collect only the links that start with http://www.mkyong.com/page/.

5.2 The next thing we notice is that the titles of the articles, which is what we want, are wrapped in <h2></h2> and <a href=""></a> tags.

[Image: jsoup-web-crawler-example-2]

So to extract the article titles, we use a CSS selector that restricts the select method to exactly those elements: document.select("h2 a[href^=\"http://www.mkyong.com/\"]");

5.3 Finally, we will only keep the links in which the title contains ‘Java 8’ and save them to a file.

Extractor.java

package com.mkyong.extractor;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.FileWriter;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;

public class Extractor {
    private HashSet<String> links;
    private List<List<String>> articles;

    public Extractor() {
        links = new HashSet<>();
        articles = new ArrayList<>();
    }

    //Find all URLs that start with "http://www.mkyong.com/page/" and add them to the HashSet
    public void getPageLinks(String URL) {
        if (!links.contains(URL)) {
            try {
                Document document = Jsoup.connect(URL).get();
                //Record the page only after it has been fetched successfully
                if (links.add(URL)) {
                    //Comment out the line below if you do not want to see progress on the console
                    System.out.println(URL);
                }
                Elements otherLinks = document.select("a[href^=\"http://www.mkyong.com/page/\"]");

                for (Element page : otherLinks) {
                    getPageLinks(page.attr("abs:href"));
                }
            } catch (IOException e) {
                System.err.println(e.getMessage());
            }
        }
    }

    //Connect to each link saved in the HashSet and find all the articles in the page
    public void getArticles() {
        links.forEach(x -> {
            Document document;
            try {
                document = Jsoup.connect(x).get();
                Elements articleLinks = document.select("h2 a[href^=\"http://www.mkyong.com/\"]");
                for (Element article : articleLinks) {
                    //Only retrieve the titles of the articles that contain Java 8
                    if (article.text().matches("^.*?(Java 8|java 8|JAVA 8).*$")) {
                        //Remove the comment from the line below if you want to see it running on your editor, 
                        //or wait for the File at the end of the execution
                        //System.out.println(article.attr("abs:href"));

                        ArrayList<String> temporary = new ArrayList<>();
                        temporary.add(article.text()); //The title of the article
                        temporary.add(article.attr("abs:href")); //The URL of the article
                        articles.add(temporary);
                    }
                }
            } catch (IOException e) {
                System.err.println(e.getMessage());
            }
        });
    }

    public void writeToFile(String filename) {
        //try-with-resources closes the writer even if a write fails
        try (FileWriter writer = new FileWriter(filename)) {
            for (List<String> a : articles) {
                String temp = "- Title: " + a.get(0) + " (link: " + a.get(1) + ")\n";
                //display to console
                System.out.println(temp);
                //save to file
                writer.write(temp);
            }
        } catch (IOException e) {
            System.err.println(e.getMessage());
        }
    }

    public static void main(String[] args) {
        Extractor bwc = new Extractor();
        bwc.getPageLinks("http://www.mkyong.com");
        bwc.getArticles();
        bwc.writeToFile("Java 8 Articles");
    }
}

Output:

[Image: jsoup-web-crawler-example-3]
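
The title filter in getArticles() lists three spellings of "Java 8" explicitly. As a possible variation, and only a sketch, the same check can be written with a case-insensitive pattern. Note that this version behaves slightly differently from the original: it also accepts "Java8" and rejects "Java 80". The names `TitleFilterSketch` and `keepJava8` and the sample titles are invented for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical variation of the title check used in Extractor.getArticles().
class TitleFilterSketch {

    // CASE_INSENSITIVE covers "java 8", "JAVA 8", etc.; \b stops "Java 80" from matching.
    private static final Pattern JAVA_8 =
            Pattern.compile("\\bjava\\s*8\\b", Pattern.CASE_INSENSITIVE);

    // Keeps only the titles that mention Java 8 somewhere in the text.
    static List<String> keepJava8(List<String> titles) {
        List<String> kept = new ArrayList<>();
        for (String title : titles) {
            if (JAVA_8.matcher(title).find()) {
                kept.add(title);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> titles = Arrays.asList(
                "Java 8 Stream filter examples",
                "JAVA 8 forEach",
                "Java 11 var",
                "Spring Boot");
        // prints [Java 8 Stream filter examples, JAVA 8 forEach]
        System.out.println(keepJava8(titles));
    }
}
```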

References

  1. PromptCloud – Data Scraping vs Data Crawling
  2. PromptCloud – How a Web Crawler Works
  3. Wikipedia – Web crawler
  4. What is a Crawl Frontier?

About the Author

Marilena Panagiotidou is a senior at the University of the Aegean, in the department of Information and Communication Systems Engineering. She is passionate about programming in a wide range of languages. You can contact her at an3liram@gmail.com or through her LinkedIn.

Comments

Anna

Hello, what if i dont know the MAX DEPTH of the website?

Kiran

I am getting java.net.SocketException: Connection reset error when crawling https website

Mat

I get “connect timed out” errors for many, but not all of the URLs, which this program attempts to connect to using this code.
Any idea why?
Thanks in advance!

Prateek

Hello,
Really nice article, it helped us to get starting.

Is there any way to scrape a dynamic website? A website which decides through JavaScript code what content needs to be displayed and does not use an href kind of structure for navigation?

thanks
-Prateek

Kiran

I am using jdk 7 and jsoup 1.11.3

Dinesh

Really Helpful and well explained. :)

Oren Hoffman

Thank you very much for all help Marilena !

Frank

for Document document = Jsoup.connect(URL).get(); I cannot get the full HTML elements. Even I do something like below I still cannot get the full elements. any idea on this? Document document = Jsoup.connect(URL).header("Accept-Encoding", "gzip, deflate") .userAgent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36") .maxBodySize(0) .timeout(600000) .get();