In the last part we looked at the Lucene architecture and how Lucene documents are composed. In this part we will concentrate on crawling a website and then adding the crawled pages to a Lucene index. In part 1 we discussed the notion of recursively crawling web pages to eventually find all of the pages and links in a website. We can achieve this by implementing a recursive crawling and indexing function. There are four important things going on in the following piece of code:
- Go to the starting page, index it, then follow all links on the page.
- After following the links, index the linked pages, and follow the links on those pages.
- Repeat step 2.
- Do not index pages that have already been indexed, or pages from other sites.
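The four steps above can be sketched without any network access or Lucene at all. The following is only an illustration of the recursion, assuming a hypothetical in-memory map (`site`) that stands in for the web, and made-up example.com URLs. The names `CrawlSketch`, `crawl` and `crawlAll` are ours and do not appear in the real crawler. It uses `List.of` and `Map.of`, so it needs Java 9 or later.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class CrawlSketch {
    // Visit a page, record it, then recurse into every link on it that
    // belongs to the same site and has not been visited before.
    // `site` is a hypothetical stand-in for the web: page URL -> links on that page.
    static void crawl(String page, String domain, Map<String, List<String>> site,
                      Set<String> visited, List<String> order) {
        if (!visited.add(page)) {
            return; // step 4: this page has already been indexed
        }
        order.add(page); // steps 1 and 2: "index" the page
        for (String link : site.getOrDefault(page, List.of())) {
            if (link.contains(domain)) { // step 4: skip pages from other sites
                crawl(link, domain, site, visited, order); // step 3: repeat
            }
        }
    }

    // Convenience wrapper that returns the pages in the order they were indexed.
    static List<String> crawlAll(String start, String domain, Map<String, List<String>> site) {
        List<String> order = new ArrayList<>();
        crawl(start, domain, site, new HashSet<>(), order);
        return order;
    }

    public static void main(String[] args) {
        // Page /b has no entry in the map, so it is treated as having no links.
        Map<String, List<String>> site = Map.of(
            "http://example.com/", List.of("http://example.com/a", "http://other.org/"),
            "http://example.com/a", List.of("http://example.com/", "http://example.com/b"));
        // prints [http://example.com/, http://example.com/a, http://example.com/b]
        System.out.println(crawlAll("http://example.com/", "example.com", site));
    }
}
```

Note that the link from /a back to the start page is ignored because the start page is already in the visited set. Without that check the recursion would never end.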
The recursive crawler/indexer also makes use of an object called LinkParser. This is an object that uses the HTML Parser library to extract all of the links from a particular web page.
In order to write a document to the index, Lucene requires the use of an Analyzer. An analyzer processes the content before it is added to the index. In this example we use the StandardAnalyzer, which is little more than a string tokeniser: it splits the text into tokens, lowercases them and drops common stop words.
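To get a feel for what an analyzer does to the text, here is a toy version written with nothing but the Java standard library. It is not Lucene's StandardAnalyzer and does not use the Lucene API. The class name `ToyAnalyzer`, the splitting rule and the short stop-word list are all our own assumptions, chosen only to show the idea.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class ToyAnalyzer {
    // A tiny illustrative stop-word list; this is not Lucene's list.
    private static final Set<String> STOP_WORDS = Set.of("a", "an", "and", "the", "of", "to");

    // Lowercase the text, split it on anything that is not a letter or digit,
    // and drop empty strings and stop words.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!raw.isEmpty() && !STOP_WORDS.contains(raw)) {
                tokens.add(raw);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [crawling, website, indexing, pages]
        System.out.println(analyze("Crawling a Website and Indexing the Pages"));
    }
}
```

The tokens that come out of this step, rather than the raw page text, are what end up being searchable in an index.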
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import java.net.URL;
import java.util.ArrayList;
public class Main {
    private static IndexWriter writer; // new index being built
    private static ArrayList<String> indexed; // paths of the pages indexed so far
    private static String beginDomain; // domain of the starting page

    public static void main(String[] args) throws Exception {
        String index = "/opt/lucene/index";
        boolean create = true;
        String link = "http://www.tek-dev.com/";
        beginDomain = Domain(link);
        System.out.println(beginDomain);
        writer = new IndexWriter(index, new StandardAnalyzer(), create,
                new IndexWriter.MaxFieldLength(1000000));
        indexed = new ArrayList<String>();
        indexDocs(link);
        System.out.println("Optimizing…");
        writer.optimize();
        writer.close();
    }
    // Index the page at the given url, then recurse into the links found on it
    private static void indexDocs(String url) throws Exception {
        // index page
        Document doc = HTMLDocument.Document(url);
        System.out.println("adding " + doc.get("path"));
        try {
            indexed.add(doc.get("path"));
            writer.addDocument(doc); // add docs unconditionally
            // TODO: only add html docs
            // and create other doc types
            // get all links on the page then index them
            LinkParser lp = new LinkParser(url);
            URL[] links = lp.ExtractLinks();
            for (URL l : links) {
                String target = l.toURI().toString();
                // make sure the url hasn't already been indexed
                // make sure the url contains the home domain
                // ignore urls with a querystring by excluding "?"
                if (!indexed.contains(target)
                        && target.contains(beginDomain)
                        && !target.contains("?")) {
                    // don't index zip files
                    if (!target.endsWith(".zip")) {
                        System.out.print(target);
                        indexDocs(target);
                    }
                }
            }
        } catch (Exception e) {
            System.out.println(e.toString());
        }
    }
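    // Return the text between the first and the last dot of the url,
    // e.g. "tek-dev" for "http://www.tek-dev.com/"
    private static String Domain(String url) {
        int firstDot = url.indexOf(".");
        int lastDot = url.lastIndexOf(".");
        return url.substring(firstDot + 1, lastDot);
    }
}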
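The checks that decide whether a link should be followed are simple string tests, and the Domain helper assumes a URL shaped like http://www.name.tld/. As a possible alternative, the same rules can be written against java.net.URI, comparing host names instead of substrings. This is only a sketch and not part of the crawler above. The class `LinkFilter` and the method `shouldFollow` are hypothetical names, and comparing full host names is stricter than the original check (a link to a different subdomain of the same site would be rejected).

```java
import java.net.URI;
import java.util.Locale;
import java.util.Set;

class LinkFilter {
    // Decide whether the crawler should follow `link`, given the page the
    // crawl started from and the set of URLs that were already indexed.
    static boolean shouldFollow(String link, String startUrl, Set<String> indexed) {
        URI target;
        URI start;
        try {
            target = URI.create(link);
            start = URI.create(startUrl);
        } catch (IllegalArgumentException e) {
            return false; // not a syntactically valid URI
        }
        String host = target.getHost();
        return host != null                                   // e.g. mailto: links have no host
            && host.equalsIgnoreCase(start.getHost())         // same site only
            && target.getQuery() == null                      // ignore URLs with a query string
            && !target.getPath().toLowerCase(Locale.ROOT).endsWith(".zip") // don't index zip files
            && !indexed.contains(link);                       // not indexed yet
    }

    public static void main(String[] args) {
        String start = "http://www.tek-dev.com/";
        Set<String> indexed = Set.of(start);
        System.out.println(shouldFollow("http://www.tek-dev.com/about.html", start, indexed)); // true
        System.out.println(shouldFollow("http://www.tek-dev.com/page?id=1", start, indexed));  // false
        System.out.println(shouldFollow("http://www.example.org/", start, indexed));           // false
    }
}
```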
The following is the link parser object used in the above code:
import org.htmlparser.NodeFilter;
import org.htmlparser.Parser;
import org.htmlparser.filters.NodeClassFilter;
import org.htmlparser.tags.LinkTag;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;
import java.util.Vector;
import java.net.URL;
import java.net.MalformedURLException;

/**
 *
 * @author Stephen.Lane
 */
public class LinkParser {
    String url;
    Parser parser;
    NodeFilter filter;
    NodeList list;
    LinkTag link;
    URL[] linkArray;
    Vector<URL> vector;

    public LinkParser(String Url) {
        url = Url;
    }

    // Return every link found on the page, or an empty array if the page cannot be parsed
    public URL[] ExtractLinks() {
        filter = new NodeClassFilter(LinkTag.class);
        linkArray = new URL[0];
        try {
            parser = new Parser(url);
            list = parser.extractAllNodesThatMatch(filter);
            vector = new Vector<URL>();
            for (int i = 0; i < list.size(); i++) {
                try {
                    link = (LinkTag) list.elementAt(i);
                    vector.add(new URL(link.getLink()));
                } catch (MalformedURLException murle) {
                    // skip links that are not valid URLs
                }
            }
            linkArray = new URL[vector.size()];
            vector.copyInto(linkArray);
        } catch (ParserException e) {
            e.printStackTrace();
        }
        return linkArray;
    }
}
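If you want to experiment with link extraction without downloading the HTML Parser library, the idea can be roughly approximated with a regular expression and java.net.URI. This is a deliberately naive stand-in and not how LinkParser works. It only finds double-quoted href attributes on anchor tags, and the names `HrefSketch` and `extractLinks` are hypothetical. Like the code above, it silently skips links that are not valid.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HrefSketch {
    // Matches <a ... href="..."> and captures the attribute value.
    private static final Pattern HREF =
        Pattern.compile("<a\\s[^>]*?href\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    // Return the links found in `html`, resolved against the URL of the page.
    static List<URI> extractLinks(String baseUrl, String html) {
        URI base = URI.create(baseUrl);
        List<URI> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            try {
                links.add(base.resolve(m.group(1).trim())); // relative links become absolute
            } catch (IllegalArgumentException e) {
                // skip links that are not valid URIs
            }
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href=\"/about.html\">About</a> "
                    + "<A HREF=\"http://other.org/x\">Other</A> "
                    + "<a href=\"bad link\">Broken</a>";
        // prints [http://www.tek-dev.com/about.html, http://other.org/x]
        System.out.println(extractLinks("http://www.tek-dev.com/index.html", html));
    }
}
```

A real HTML parser copes with single quotes, unquoted attributes and malformed markup, which is why the crawler relies on a library for this step.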
Now that we have crawled the pages and indexed each of them, all we have to do is query the index to display some results. We will look at that in part 4.