Open Source Enterprise Search Review Part Three

March 20th, 2009 by steve

In the last part we looked at the Lucene architecture and how Lucene documents are composed. In this part we will concentrate on crawling a website and adding the crawled pages to a Lucene index. In part 1 we discussed the notion of recursively crawling web pages to eventually find all of the pages and links in a website. We can achieve this by implementing a recursive crawling/indexing function. There are four important things going on in the following piece of code:

  1. Go to the starting page, index it, then follow all links on the page.
  2. After following the links, index the linked pages, and follow the links on those pages.
  3. Repeat step 2 until no new links are found.
  4. Do not index pages that have already been indexed, or pages from other sites.
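Before looking at the real crawler, the four steps can be sketched without any network access or Lucene at all. This is only an illustration under stated assumptions: the class name CrawlSketch, the in-memory map standing in for a website, and the example.com URLs are all made up for this sketch, and "indexing" is reduced to recording the visit.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: the "site" is a map from a page URL to the links found on that page.
class CrawlSketch {

    // Visits url, then recursively every unvisited link that starts with sitePrefix.
    static void crawl(String url, String sitePrefix,
                      Map<String, List<String>> pages, Set<String> visited) {
        if (!visited.add(url)) {
            return; // step 4: this page has already been indexed
        }
        // Step 1: "index" the page (here the visit is simply recorded in visited).
        List<String> links = pages.get(url);
        if (links == null) {
            return; // nothing known about this page, so no links to follow
        }
        for (String link : links) {
            // Step 4: ignore pages from other sites.
            if (link.startsWith(sitePrefix)) {
                crawl(link, sitePrefix, pages, visited); // steps 2 and 3
            }
        }
    }

    // A made-up three-page site with one external link and one back-link.
    static Map<String, List<String>> samplePages() {
        Map<String, List<String>> pages = new HashMap<String, List<String>>();
        pages.put("http://example.com/",
                Arrays.asList("http://example.com/a", "http://other.org/"));
        pages.put("http://example.com/a",
                Arrays.asList("http://example.com/", "http://example.com/b"));
        pages.put("http://example.com/b", new ArrayList<String>());
        return pages;
    }

    public static void main(String[] args) {
        Set<String> visited = new LinkedHashSet<String>();
        crawl("http://example.com/", "http://example.com/", samplePages(), visited);
        System.out.println(visited);
    }
}
```

Running the sketch prints the three example.com pages in the order they were visited. The back-link from /a to the start page is skipped because it is already in the visited set, and other.org is skipped because it is on another site.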

The recursive crawler/indexer also makes use of an object called LinkParser. This object uses the HTML Parser library (org.htmlparser) to extract all of the links from a particular web page.
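To give a feel for what "extracting the links" means, here is a rough stand-in that uses only the JDK. It is not how LinkParser works and it does not use the HTML Parser library: it simply looks for href attributes with a regular expression and resolves them against the page URL, which is good enough for an illustration but much less robust than a real parser. The class name HrefSketch and the example.com URLs are assumptions made for this sketch.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for a link extractor, JDK only.
class HrefSketch {

    // Matches href="..." or href='...' in any letter case.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);

    // Returns every href found in html as an absolute URL string.
    static List<String> extractLinks(String baseUrl, String html) {
        List<String> links = new ArrayList<String>();
        URI base = URI.create(baseUrl);
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            try {
                // Relative links such as /about.html are resolved against the page URL.
                links.add(base.resolve(m.group(1).trim()).toString());
            } catch (IllegalArgumentException e) {
                // Skip hrefs that are not valid URIs.
            }
        }
        return links;
    }
}
```

For example, calling extractLinks with the base http://example.com/index.html on a page containing a link to /about.html yields http://example.com/about.html.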

In order to write a document to the index, Lucene requires an Analyzer, which processes the content before it is added to the index. In this example we use the StandardAnalyzer, which is little more than a string tokeniser: it splits the text into tokens, lowercases them and drops common English stop words.
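As a rough picture of what an analyzer does to the text, the following sketch tokenises a string by hand. It does not use Lucene and does not reproduce StandardAnalyzer's exact rules; the class name AnalyzerSketch and the tiny stop-word list are assumptions made purely for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch of analysis: lowercase, split into tokens, drop stop words.
class AnalyzerSketch {

    // A tiny illustrative stop-word list, not the one Lucene ships with.
    private static final Set<String> STOP_WORDS =
            new HashSet<String>(Arrays.asList("a", "an", "and", "of", "the", "to"));

    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<String>();
        // Split on anything that is not a lowercase letter or a digit.
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (raw.length() > 0 && !STOP_WORDS.contains(raw)) {
                tokens.add(raw);
            }
        }
        return tokens;
    }
}
```

With this sketch, analyze("The Lucene index, and the Crawler!") returns the tokens lucene, index and crawler. These tokens, rather than the raw text, are what end up as searchable terms.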

package jsearchengine;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;

import java.net.URL;
import java.util.ArrayList;

public class Main {

    private static IndexWriter writer;          // new index being built
    private static ArrayList<String> indexed;   // pages that have already been indexed
    private static String beginDomain;          // domain fragment of the start URL

    public static void main(String[] args) throws Exception {

        String index = "/opt/lucene/index";
        boolean create = true;
        String link = "http://www.tek-dev.com/";
        beginDomain = Domain(link);
        System.out.println(beginDomain);

        writer = new IndexWriter(index, new StandardAnalyzer(), create,
                new IndexWriter.MaxFieldLength(1000000));
        indexed = new ArrayList<String>();

        indexDocs(link);

        System.out.println("Optimizing...");
        writer.optimize();
        writer.close();

    }

    private static void indexDocs(String url) throws Exception {

        //index page (HTMLDocument is a helper class that is not shown in this listing)
        Document doc = HTMLDocument.Document(url);
        System.out.println("adding " + doc.get("path"));
        try {
            indexed.add(doc.get("path"));
            writer.addDocument(doc);          // add docs unconditionally
            //TODO: only add html docs
            //and create other doc types

            //get all links on the page then index them
            LinkParser lp = new LinkParser(url);
            URL[] links = lp.ExtractLinks();

            for (URL l : links) {
                String target = l.toURI().toString();
                //make sure the url hasn't already been indexed
                //make sure the url contains the home domain
                //ignore urls with a query string by excluding "?"
                if (!indexed.contains(target)
                        && target.contains(beginDomain)
                        && !target.contains("?")) {
                    //don't index zip files
                    if (!target.endsWith(".zip")) {
                        System.out.println(target);
                        indexDocs(target);
                    }
                }
            }

        } catch (Exception e) {
            System.out.println(e.toString());
        }
    }

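    // Returns the text between the first and last dots of the URL,
    // e.g. "tek-dev" for "http://www.tek-dev.com/"
    private static String Domain(String url) {
        int firstDot = url.indexOf(".");
        int lastDot = url.lastIndexOf(".");
        return url.substring(firstDot + 1, lastDot);
    }
}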
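One thing worth noting about the listing above is that the "same site" test is a plain substring check, so a URL on another site whose path happens to contain tek-dev would also be crawled. A stricter variant, sketched here as an assumption rather than as part of the original crawler, compares host names with java.net.URI. The class name LinkFilterSketch and the method shouldCrawl are made up for this sketch, and because it compares whole hosts it would also reject tek-dev.com without the www prefix.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

// Hypothetical alternative to the substring-based checks in indexDocs.
class LinkFilterSketch {

    // True if candidateUrl is on the same host as startUrl,
    // has no query string and does not point at a .zip file.
    static boolean shouldCrawl(String startUrl, String candidateUrl) {
        try {
            String startHost = new URI(startUrl).getHost();
            URI candidate = new URI(candidateUrl);
            String host = candidate.getHost();
            if (startHost == null || host == null) {
                return false; // not an absolute URL with a host
            }
            if (!startHost.equalsIgnoreCase(host)) {
                return false; // page from another site
            }
            if (candidate.getQuery() != null) {
                return false; // ignore URLs with a query string
            }
            String path = candidate.getPath();
            return path == null || !path.toLowerCase(Locale.ROOT).endsWith(".zip");
        } catch (URISyntaxException e) {
            return false; // malformed URLs are never crawled
        }
    }
}
```

With http://www.tek-dev.com/ as the start URL, this accepts http://www.tek-dev.com/about.html but rejects http://example.org/tek-dev/x, which the substring check would let through.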

The following is the link parser object used in the above code:

 

package jsearchengine;

import org.htmlparser.NodeFilter;
import org.htmlparser.Parser;
import org.htmlparser.filters.NodeClassFilter;
import org.htmlparser.tags.LinkTag;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

import java.util.Vector;
import java.net.URL;
import java.net.MalformedURLException;

/**
 *
 * @author Stephen.Lane
 */
public class LinkParser {

    String url;
    Parser parser;
    NodeFilter filter;
    NodeList list;
    LinkTag link;
    URL[] linkArray;
    Vector<URL> vector;

    public LinkParser(String Url) {
        url = Url;
    }

    public URL[] ExtractLinks() {
        filter = new NodeClassFilter(LinkTag.class);
        linkArray = new URL[0];   // returned as-is if the page cannot be parsed

        try {
            parser = new Parser(url);
            list = parser.extractAllNodesThatMatch(filter);

            vector = new Vector<URL>();
            for (int i = 0; i < list.size(); i++) {
                try {
                    link = (LinkTag) list.elementAt(i);
                    vector.add(new URL(link.getLink()));
                } catch (MalformedURLException murle) {
                    // skip links that are not valid URLs
                }
            }
            linkArray = new URL[vector.size()];
            vector.copyInto(linkArray);
        } catch (ParserException e) {
            e.printStackTrace();
        }

        return (linkArray);
    }
}

Now that the pages have been crawled and indexed, all that remains is to query the index and display some results. We will look at that in part 4.

-> part 1
-> part 2
-> part 4
