How to extract HTML Links with regular expression
Written on November 18, 2009 at 5:04 am by
mkyong
HTML A tag Regular Expression Pattern
(?i)<a([^>]+)>(.+?)</a>
Extract HTML link Regular Expression Pattern
\s*(?i)href\s*=\s*("([^"]*")|'[^']*'|([^'">\s]+))
Description
(?i)      # inline flag (not a capturing group): all matching is case insensitive
<a        # start with "<a"
(         # start of group #1
  [^>]+   #   anything except ">", at least one character
)         # end of group #1
>         # followed by ">"
(.+?)     # group #2: the link text, matched reluctantly
</a>      # end with "</a>"
\s*            # can start with whitespace
(?i)           # inline flag: all matching is case insensitive
href           # followed by the word "href"
\s*=\s*        # allow spaces on either side of the equals sign
(              # start of group #1
  "([^"]*")    #   a string enclosed in double quotes - "string"
  |            #   ..or
  '[^']*'      #   a string enclosed in single quotes - 'string'
  |            #   ..or
  ([^'">\s]+)  #   an unquoted value: no single quote, double quote, ">" or whitespace
)              # end of group #1
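To make the group numbering concrete, here is a minimal sketch that applies both patterns to one sample tag and returns what each group captures. The class name PatternGroupsDemo, the helper firstLink, and the sample HTML are made up for illustration; only the two pattern strings come from this article.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PatternGroupsDemo {
    // The two patterns described above, written as Java string literals.
    static final Pattern TAG = Pattern.compile("(?i)<a([^>]+)>(.+?)</a>");
    static final Pattern HREF = Pattern.compile(
        "\\s*(?i)href\\s*=\\s*(\"([^\"]*\")|'[^']*'|([^'\">\\s]+))");

    // Hypothetical helper: returns {attributes, link text, href value}
    // for the first link in the input, or null if there is no link.
    static String[] firstLink(String html) {
        Matcher tag = TAG.matcher(html);
        if (!tag.find()) {
            return null;
        }
        // Group 1 of the tag pattern is the raw attribute string.
        Matcher href = HREF.matcher(tag.group(1));
        String value = href.find() ? href.group(1) : null;
        return new String[] { tag.group(1), tag.group(2), value };
    }

    public static void main(String[] args) {
        String[] parts = firstLink(
            "see <A target='_blank' HREF='http://www.google.com'>google</A>");
        System.out.println("attributes: [" + parts[0] + "]");
        System.out.println("link text : [" + parts[1] + "]");
        System.out.println("href value: [" + parts[2] + "]");
    }
}
```

The attribute string keeps its leading space, and the href value keeps its surrounding quotes, because group #1 of the second pattern spans the quote characters.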
Here is a simple Java link extractor. It uses the first pattern to capture the attributes and the link text of each <a> tag, then applies the second pattern to the captured attributes to extract the 'href' value.
Java Regular Expression Example
package com.mkyong.regex;

import java.util.Vector;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HTMLLinkExtrator {

    private static final String HTML_A_TAG_PATTERN = "(?i)<a([^>]+)>(.+?)</a>";
    private static final String HTML_A_HREF_TAG_PATTERN =
        "\\s*(?i)href\\s*=\\s*(\"([^\"]*\")|'[^']*'|([^'\">\\s]+))";

    private final Pattern patternTag;
    private final Pattern patternLink;

    public HTMLLinkExtrator() {
        patternTag = Pattern.compile(HTML_A_TAG_PATTERN);
        patternLink = Pattern.compile(HTML_A_HREF_TAG_PATTERN);
    }

    /**
     * Extract links from html with regular expressions
     * @param html html content to search
     * @return Vector of links and link text
     */
    public Vector<HtmlLink> grabHTMLLinks(final String html) {
        Vector<HtmlLink> result = new Vector<HtmlLink>();

        Matcher matcherTag = patternTag.matcher(html);
        while (matcherTag.find()) {
            String href = matcherTag.group(1);     // attributes of the <a> tag
            String linkText = matcherTag.group(2); // link text

            Matcher matcherLink = patternLink.matcher(href);
            while (matcherLink.find()) {
                String link = matcherLink.group(1); // href value, quotes included
                result.add(new HtmlLink(link, linkText));
            }
        }
        return result;
    }

    class HtmlLink {
        String link;
        String linkText;

        HtmlLink(String link, String linkText) {
            this.link = link;
            this.linkText = linkText;
        }

        @Override
        public String toString() {
            return new StringBuilder("Link : ")
                .append(this.link)
                .append(" Link Text : ")
                .append(this.linkText).toString();
        }
    }
}
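Because group 1 of the href pattern spans the quote characters, a quoted link is returned with its quotes still attached. If the bare URL is wanted, one possible approach is to give each quoting style its own capturing group around the value only. The sketch below is a hypothetical variant, not the pattern used in this article; the class name UnquotedHref and the helper hrefValue are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnquotedHref {
    // Hypothetical variant of the href pattern: each alternative captures
    // only the value, so the surrounding quotes are left out of the groups.
    // The \b keeps names such as "xhref" from matching.
    private static final Pattern HREF = Pattern.compile(
        "(?i)\\bhref\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|([^'\">\\s]+))");

    // Returns the href value without quotes, or null if no href is present.
    static String hrefValue(String attributes) {
        Matcher m = HREF.matcher(attributes);
        if (!m.find()) {
            return null;
        }
        // Exactly one of the three groups takes part in a match.
        for (int g = 1; g <= 3; g++) {
            if (m.group(g) != null) {
                return m.group(g);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(hrefValue(" HREF='http://www.google.com' target='_blank'"));
        System.out.println(hrefValue(" href=\"http://www.google.com\""));
        System.out.println(hrefValue(" href=http://www.google.com"));
    }
}
```

All three calls print http://www.google.com without quotes. The input is the attribute string captured by group 1 of the A tag pattern, so this helper could stand in for the inner matching loop if unquoted values are preferred.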
Unit Test – HTMLLinkExtratorTest
package com.mkyong.regex;

import java.util.Vector;

import org.testng.Assert;
import org.testng.annotations.*;

import com.mkyong.regex.HTMLLinkExtrator.HtmlLink;

/**
 * HTML link extrator Testing
 * @author mkyong
 *
 */
public class HTMLLinkExtratorTest {

    private HTMLLinkExtrator htmlLinkExtrator;

    @BeforeClass
    public void initData() {
        htmlLinkExtrator = new HTMLLinkExtrator();
    }

    @DataProvider
    public Object[][] HTMLContentProvider() {
        return new Object[][] {
            new Object[] {"abc hahaha <a href='http://www.google.com'>google</a>"},
            new Object[] {"abc hahaha <a HREF='http://www.google.com'>google</a>"},
            new Object[] {"abc hahaha <A HREF='http://www.google.com'>google</A> , "
                + "abc hahaha <A HREF='http://www.google.com' target='_blank'>google</A>"},
            new Object[] {"abc hahaha <A HREF='http://www.google.com' target='_blank'>google</A>"},
            new Object[] {"abc hahaha <A target='_blank' HREF='http://www.google.com'>google</A>"},
            new Object[] {"abc hahaha <a HREF=http://www.google.com>google</a>"},
        };
    }

    @Test(dataProvider = "HTMLContentProvider")
    public void ValidHTMLLinkTest(String html) {
        Vector<HtmlLink> links = htmlLinkExtrator.grabHTMLLinks(html);

        // every sample must yield at least one link
        Assert.assertTrue(links.size() != 0);

        for (int i = 0; i < links.size(); i++) {
            HtmlLink htmlLinks = links.get(i);
            System.out.println(htmlLinks);
        }
    }
}
Unit Test – Result
[Parser] Running:
E:\workspace\mkyong\temp-testng-customsuite.xml
Link : 'http://www.google.com' Link Text : google
Link : 'http://www.google.com' Link Text : google
Link : 'http://www.google.com' Link Text : google
Link : 'http://www.google.com' Link Text : google
Link : 'http://www.google.com' Link Text : google
Link : 'http://www.google.com' Link Text : google
Link : http://www.google.com Link Text : google
PASSED: ValidHTMLLinkTest("abc hahaha <a href='http://www.google.com'>google</a>")
PASSED: ValidHTMLLinkTest("abc hahaha <a HREF='http://www.google.com'>google</a>")
PASSED: ValidHTMLLinkTest("abc hahaha <A HREF='http://www.google.com'>google</A> , abc hahaha <A HREF='http://www.google.com' target='_blank'>google</A>")
PASSED: ValidHTMLLinkTest("abc hahaha <A HREF='http://www.google.com' target='_blank'>google</A>")
PASSED: ValidHTMLLinkTest("abc hahaha <A target='_blank' HREF='http://www.google.com'>google</A>")
PASSED: ValidHTMLLinkTest("abc hahaha <a HREF=http://www.google.com>google</a>")
===============================================
com.mkyong.regex.HTMLLinkExtratorTest
Tests run: 6, Failures: 0, Skips: 0
===============================================
===============================================
mkyong
Total tests run: 6, Failures: 0, Skips: 0
===============================================
Want to learn more about regular expressions? I highly recommend the classic book "Mastering Regular Expressions".
This article was posted in Regular Expressions category.
Thanks for sharing. However, I am afraid there is an error in your description; please edit it. This: "([^"]*") # only two double quotes are allow - "string" should be: \"([^\"]*\") # double quote can start and end.
Janus
Thanks for the suggestion. The description is changed to: allow string with double quotes enclosed - "string"