In the first part of this four-part series, we looked at some of the fundamentals of enterprise search. In this second part, we will look at some of the open source technologies we can use to implement those concepts.
Without a doubt, the most challenging part of a search engine is how we index content and how we subsequently interrogate that index. One of the main challenges is returning results that are relevant to a search term. Take the word “Run”, for example: if we search for “Run”, we will also be interested in documents that contain words like “Runnable”, “Running”, “Runs” and so on. Thankfully, there are indexing engines we can use that handle all of this logic transparently. Lucene is an Apache Software Foundation indexing project, and is widely regarded as the best open source indexing project available.
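To make the word-variant problem concrete, here is a minimal sketch of suffix stripping in plain Java. It is not Lucene's algorithm; the class name NaiveStemmer and its short suffix list are assumptions made purely for illustration. Real analyzers use far more careful rules, but the idea is the same: reduce each variant to a common root before it is indexed or searched.

```java
import java.util.Locale;

/** Toy suffix stripper for illustration only; this is NOT how Lucene stems words. */
final class NaiveStemmer {
    // Hypothetical, deliberately tiny suffix list.
    private static final String[] SUFFIXES = {"able", "ing", "s"};

    /** Lower-cases the word and strips at most one known suffix. */
    static String stem(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        for (String suffix : SUFFIXES) {
            int stemLength = w.length() - suffix.length();
            // Keep at least three characters so short words are left alone.
            if (w.endsWith(suffix) && stemLength >= 3) {
                return collapseDoubledConsonant(w.substring(0, stemLength));
            }
        }
        return w;
    }

    /** Turns a stem such as "runn" (from "running") into "run". */
    private static String collapseDoubledConsonant(String stem) {
        int n = stem.length();
        if (n >= 4 && stem.charAt(n - 1) == stem.charAt(n - 2)
                && "aeiou".indexOf(stem.charAt(n - 1)) < 0) {
            return stem.substring(0, n - 1);
        }
        return stem;
    }

    public static void main(String[] args) {
        // Every variant of "Run" mentioned in the text reduces to the same root.
        for (String word : new String[] {"Run", "Runnable", "Running", "Runs"}) {
            System.out.println(word + " -> " + stem(word));
        }
    }
}
```

If both the indexed text and the search term pass through the same reduction, a search for “Run” can match a document that only contains “Running”.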
In order to implement Lucene we must first examine its basic architecture. Lucene is very cleverly designed: it appears simple enough in nature, but the devil is definitely in the detail! The objects that Lucene adds to its index are intuitively called documents. A document is made up of fields, and each field has a field name and a field value. For example, if we consider a web page, each of its attributes/tags can be mapped to a field, such as its title, body and so on. For the title field, the field name is “title” and the field value is the actual title. When we query the index, we are returned an array of documents that match the search term. Any of the stored fields in a document can be returned as part of the result set and bound to a pager or repeater. The following code shows an HTML document type with path (the URL), title, summary and contents fields. The example uses the HTML Parser library (org.htmlparser) to extract text from web pages.
import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.htmlparser.Parser;
import org.htmlparser.beans.StringBean;
import org.htmlparser.filters.TagNameFilter;
import org.htmlparser.util.ParserException;

/**
 * A utility for making Lucene Documents for HTML documents.
 * Uses the Field.Store / Field.Index style of field API (Lucene 2.9/3.x).
 */
public class HTMLDocument {

    /** Maximum number of characters of body text kept as the summary. */
    private static final int SUMMARY_LENGTH = 200;

    public static Document Document(String url)
            throws IOException, InterruptedException {
        // Make a new, empty document.
        Document doc = new Document();

        // Add the url as a field named "path". Use a field that is
        // indexed (i.e. searchable), but don't tokenize the field into words.
        doc.add(new Field("path", url, Field.Store.YES, Field.Index.NOT_ANALYZED));

        // Add the tag-stripped contents as a Reader-valued field so it will
        // get tokenized and indexed (but not stored).
        StringBean sb = new StringBean();
        sb.setLinks(false);
        sb.setURL(url);
        doc.add(new Field("contents", new StringReader(sb.getStrings())));

        String title = extractText(url, "TITLE");

        // Use the first SUMMARY_LENGTH characters of the body as the summary;
        // a shorter body is kept whole rather than discarded.
        String body = extractText(url, "BODY");
        String summary = body.substring(0, Math.min(SUMMARY_LENGTH, body.length()));

        // Add the title as a field that can be searched and that is stored.
        doc.add(new Field("title", title, Field.Store.YES, Field.Index.ANALYZED));
        // The summary is stored for display only; it is not searchable.
        doc.add(new Field("summary", summary, Field.Store.YES, Field.Index.NO));
        return doc;
    }

    /** Returns the text inside the given tag, or an empty string if parsing fails. */
    private static String extractText(String url, String tagName) {
        try {
            Parser parser = new Parser();
            parser.setResource(url);
            return parser.parse(new TagNameFilter(tagName)).asString();
        } catch (ParserException e) {
            e.printStackTrace();
            return "";
        }
    }

    private HTMLDocument() {
    }
}
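The document and field model described above, and the difference between a field that is searchable and one that is only stored, can be sketched without Lucene at all. The toy index below is plain Java and is not Lucene's API: the names TinyIndex and Doc and the example.com URLs are hypothetical, and splitting on non-word characters stands in for a real analyzer. It only illustrates that a query returns whole documents, from which any stored field can be read.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy in-memory index for illustration only; this is NOT Lucene's API. */
final class TinyIndex {

    /** A document is a set of named field values, all of them stored. */
    static final class Doc {
        private final Map<String, String> fields = new LinkedHashMap<>();

        Doc add(String name, String value) {
            fields.put(name, value);
            return this;
        }

        String get(String name) {
            return fields.get(name);
        }
    }

    private final List<Doc> docs = new ArrayList<>();
    // Maps "field:term" to the ids of the documents containing that term.
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    /** Adds a document; only the fields listed in searchableFields are tokenized. */
    void add(Doc doc, String... searchableFields) {
        int id = docs.size();
        docs.add(doc);
        for (String field : searchableFields) {
            String value = doc.get(field);
            if (value == null) {
                continue;
            }
            for (String term : value.toLowerCase(Locale.ROOT).split("\\W+")) {
                if (!term.isEmpty()) {
                    postings.computeIfAbsent(field + ":" + term, k -> new TreeSet<>()).add(id);
                }
            }
        }
    }

    /** Returns every document whose given field contains the term. */
    List<Doc> search(String field, String term) {
        String key = field + ":" + term.toLowerCase(Locale.ROOT);
        Set<Integer> ids = postings.getOrDefault(key, Collections.emptySet());
        List<Doc> hits = new ArrayList<>();
        for (int id : ids) {
            hits.add(docs.get(id));
        }
        return hits;
    }

    public static void main(String[] args) {
        TinyIndex index = new TinyIndex();
        // The title is searchable; path and summary are stored for display only.
        index.add(new Doc()
                .add("path", "http://example.com/a")
                .add("title", "Getting started with Lucene")
                .add("summary", "A short introduction."), "title");
        for (Doc hit : index.search("title", "lucene")) {
            System.out.println(hit.get("title") + " (" + hit.get("path") + ")");
        }
    }
}
```

A search on the summary field finds nothing here because that field was never tokenized, which mirrors the role of the summary field in HTMLDocument: kept for display in the results, but not searchable.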
To crawl a web site and create these HTML documents, we need to implement some form of crawling mechanism. We will look at this in part three.