Open Source Enterprise Search Review Part Four

March 20th, 2009 by steve

In the final part of this series we will look at a very simple web page that takes a search term and posts it to a results page, which executes the query and displays the results. The Lucene object that reads the index is called an IndexReader; the results page below works with it through an IndexSearcher. Before the query is executed it needs to be processed by our friend the analyser. Depending on the analyser, this strips punctuation and exotic characters, lower-cases the text, drops common stop words or reduces plurals to their stem, so that clean terms are tokenised and used to query the index. It is important to use the same analyser to read the index as we used to write to it. We used the StandardAnalyzer to process the terms that were written to the index, so we will use it again to process the search terms. The following is the simple JSP page used to take the search term and post it to the results page.

<html>
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
        <title>JSP Search Page</title>
    </head>
    <body>
        <form name="search" action="results.jsp" method="get">
        <p>
            <input name="query" size="44"/> Search Criteria
        </p>
        <p>
            <input name="maxresults" size="4" value="100"/> Results Per Page 
            <input type="submit" value="Search"/>
        </p>
        </form>
    </body>
</html>
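
Before moving on to the results page, the earlier point about using the same analyser for indexing and searching can be illustrated with a toy example. The sketch below is plain Java and does not use Lucene at all: the class `ToyAnalyzerSketch`, its stop-word list and its tiny in-memory index are assumptions made up for illustration, and they do not reproduce the behaviour of Lucene's StandardAnalyzer. It only shows that a query analysed the same way as the indexed text finds the document, while an unanalysed query misses it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Toy stand-in for an analyser plus index. NOT Lucene: every name here is hypothetical.
class ToyAnalyzerSketch {

    // Stop words chosen for the example only.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "a", "an", "of", "and"));

    // "Analysed" terms: lower-case, split on non-alphanumerics, drop stop words.
    static List<String> analyse(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                terms.add(token);
            }
        }
        return terms;
    }

    // "Raw" terms: split on whitespace only, with no normalisation.
    static List<String> raw(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    // Build a map of term -> ids of the documents containing it, using analyse().
    static Map<String, Set<Integer>> buildIndex(List<String> docs) {
        Map<String, Set<Integer>> index = new HashMap<>();
        for (int id = 0; id < docs.size(); id++) {
            for (String term : analyse(docs.get(id))) {
                index.computeIfAbsent(term, k -> new HashSet<>()).add(id);
            }
        }
        return index;
    }

    // Return the ids of documents that contain every one of the given terms.
    static Set<Integer> search(Map<String, Set<Integer>> index, List<String> terms) {
        Set<Integer> result = null;
        for (String term : terms) {
            Set<Integer> postings = index.getOrDefault(term, new HashSet<>());
            if (result == null) {
                result = new HashSet<>(postings);
            } else {
                result.retainAll(postings);
            }
        }
        return result == null ? new HashSet<>() : result;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("The Lucene Index, explained", "Searching a web site");
        Map<String, Set<Integer>> index = buildIndex(docs);

        // Same analysis on both sides: document 0 is found.
        System.out.println(search(index, analyse("Lucene INDEX!")));  // prints [0]
        // Query left unanalysed: "Lucene" and "INDEX!" are not in the index.
        System.out.println(search(index, raw("Lucene INDEX!")));      // prints []
    }
}
```

The index only ever contains the terms the analyser produced, so anything that is not run through the same steps at query time simply has nothing to match against.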

The following is the results page that takes the search parameters and displays the results. Most of the code is concerned with paging the results; the piece of interest is the index searcher, which takes the search term once it has been processed by the standard analyser and returns the results. This results page is taken directly from the Lucene documentation with a few modifications:

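<%@ page import="java.net.URLEncoder" %>
<%@ page import="org.apache.lucene.analysis.Analyzer" %>
<%@ page import="org.apache.lucene.analysis.standard.StandardAnalyzer" %>
<%@ page import="org.apache.lucene.document.Document" %>
<%@ page import="org.apache.lucene.queryParser.ParseException" %>
<%@ page import="org.apache.lucene.queryParser.QueryParser" %>
<%@ page import="org.apache.lucene.search.Hits" %>
<%@ page import="org.apache.lucene.search.IndexSearcher" %>
<%@ page import="org.apache.lucene.search.Query" %>
<%
/*
        Author: Andrew C. Oliver, SuperLink Software, Inc. (acoliver2@users.sourceforge.net)

        This jsp page is deliberately written in the horrible java directly embedded
        in the page style for an easy and concise demonstration of Lucene.
        Do note…if you write pages that look like this…sooner or later
        you'll have a maintenance nightmare.  If you use jsps…use taglibs
        and beans!  That being said, this should be acceptable for a small
        page demonstrating how one uses Lucene in a web app.

        This is also deliberately overcommented. ;-)

 */
%>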
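<%!
public String escapeHTML(String s) {
        if (s == null) {                        //e.getMessage() can return null
            return "";
        }
        s = s.replaceAll("&", "&amp;");         //escape & first so the entities added below survive
        s = s.replaceAll("<", "&lt;");
        s = s.replaceAll(">", "&gt;");
        s = s.replaceAll("\"", "&quot;");
        s = s.replaceAll("'", "&#39;");
        return s;
}
%>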

<%
        boolean error = false;                  //used to control flow for error messages
        String indexName = "/opt/lucene/index"; //local copy of the configuration variable
        IndexSearcher searcher = null;          //the searcher used to open/search the index
        Query query = null;                     //the Query created by the QueryParser
        Hits hits = null;                       //the search results
        int startindex = 0;                     //the first index displayed on this page
        int maxpage = 50;                       //the maximum items displayed on this page
        String queryString = null;              //the query entered in the previous page
        String startVal = null;                 //string version of startindex
        String maxresults = null;               //string version of maxpage
        int thispage = 0;                       //used for the for/next either maxpage or
                                                //hits.length() - startindex - whichever is
                                                //less

        try {
            searcher = new IndexSearcher(indexName);    //create an IndexSearcher for our page
                                                        //NOTE: this operation is slow for large
                                                        //indices (much slower than the search itself)
                                                        //so you might want to keep an IndexSearcher
                                                        //open
        } catch (Exception e) {                         //any error that happens is probably due
                                                        //to a permission problem or non-existent
                                                        //or otherwise corrupt index
%>
<p>ERROR opening the Index - contact sysadmin!</p>
<p>Error message: <%=escapeHTML(e.getMessage())%></p>
<%
            error = true;                               //don't do anything up to the footer
        }
%>
<%
        if (error == false) {                                   //did we open the index?
            queryString = request.getParameter("query");        //get the search criteria
            startVal = request.getParameter("startat");         //get the start index
            maxresults = request.getParameter("maxresults");    //get max results per page
            try {
                maxpage = Integer.parseInt(maxresults);         //parse the max results first
                startindex = Integer.parseInt(startVal);        //then the start index
            } catch (Exception e) {
            }                                                   //we don't care if something happens we'll just start at 0
                                                                //or end at 50

            if (queryString == null) {
                throw new ServletException("no query " +        //if you don't have a query then
                        "specified");                           //you probably played on the
            }                                                   //query string so you get the
                                                                //treatment

            Analyzer analyzer = new StandardAnalyzer();         //construct our usual analyzer
            try {
                QueryParser qp = new QueryParser("contents", analyzer);
                query = qp.parse(queryString);                  //parse the
            } catch (ParseException e) {                        //query and construct the Query
                                                                //object
                                                                //if it's just "operator error"
                                                                //send them a nice error HTML
%>
<p>Error while parsing query: <%=escapeHTML(e.getMessage())%></p>
<%
                error = true;                                   //don't bother with the rest of
                                                                //the page
            }
        }
%>
<%
        if (error == false && searcher != null) {       // if we've had no errors
                                                        // searcher != null was to handle
                                                        // a weird compilation bug
            thispage = maxpage;                         // default last element to maxpage
            hits = searcher.search(query);              // run the query
            if (hits.length() == 0) {                   // if we got no results tell the user
%>
<p>I'm sorry, I couldn't find what you were looking for.</p>
<%
                error = true;                           // don't bother with the rest of the
                                                        // page
            }
        }

        if (error == false && searcher != null) {
%>
<table>
    <tr>
        <td>Document</td>
        <td>Summary</td>
    </tr>
    <%
    if ((startindex + maxpage) > hits.length()) {
        thispage = hits.length() - startindex;          // set the max index to maxpage or last
    }                                                   // actual search result whichever is less

    for (int i = startindex; i < (thispage + startindex); i++) {  // for each element
%>
    <tr>
        <%
        Document doc = hits.doc(i);                     //get the next document
        String doctitle = doc.get("title");             //get its title
        String url = doc.get("path");                   //get its path field
        if (url != null && url.startsWith("../webapps/")) { // strip off ../webapps prefix if present
            url = url.substring(10);
        }
        if ((doctitle == null) || doctitle.equals("")) { //use the path if it has no title
            doctitle = url;
        }
        //then output!
%>
        <td><a href="<%=url%>"><%=doctitle%></a></td>
        <td><%=doc.get("summary")%></td>
    </tr>
    <%
    }
    %>
    <%
    if ((startindex + maxpage) < hits.length()) {   //if there are more results…display
                                                    //the more link
        String moreurl = "results.jsp?query=" +
                URLEncoder.encode(queryString, "UTF-8") +   //construct the "more" link
                "&maxresults=" + maxpage +
                "&startat=" + (startindex + maxpage);
%>
    <tr>
        <td></td><td><a href="<%=moreurl%>">More Results&gt;&gt;</a></td>
    </tr>
    <%
    }
    %>
</table>

<%
        }                                       //then include our footer.
        if (searcher != null) {
            searcher.close();
        }
%>
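
Since most of the page above is paging logic, it may help to see that arithmetic on its own. The following is a minimal sketch in plain Java with no servlet or Lucene dependencies. The class `PagingSketch` and its method names are hypothetical; only the roles of startindex, maxpage and the hit count, and the `results.jsp` parameter names, are taken from the JSP. As an extra assumption, the sketch also falls back to the default for negative values, which the page above does not do.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Paging arithmetic from the results page, pulled out into testable methods.
// All names in this class are hypothetical.
class PagingSketch {

    // Parse a request parameter, falling back to a default when it is
    // missing, malformed or negative.
    static int parseOrDefault(String value, int fallback) {
        try {
            int parsed = Integer.parseInt(value);   // throws NumberFormatException for null too
            return parsed < 0 ? fallback : parsed;
        } catch (NumberFormatException e) {
            return fallback;
        }
    }

    // Rows to render on this page: maxPage, or whatever is left if fewer remain.
    static int rowsOnPage(int startIndex, int maxPage, int totalHits) {
        int remaining = Math.max(0, totalHits - startIndex);
        return Math.min(maxPage, remaining);
    }

    // True when a "More Results" link should be shown.
    static boolean hasMore(int startIndex, int maxPage, int totalHits) {
        return startIndex + maxPage < totalHits;
    }

    // Build the link to the next page, URL-encoding the user's query.
    static String moreUrl(String query, int startIndex, int maxPage) {
        try {
            return "results.jsp?query=" + URLEncoder.encode(query, "UTF-8")
                    + "&maxresults=" + maxPage
                    + "&startat=" + (startIndex + maxPage);
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        int totalHits = 120;                        // pretend the search returned 120 hits
        int maxPage = parseOrDefault("50", 50);
        for (int start = 0; start < totalHits; start += maxPage) {
            String next = hasMore(start, maxPage, totalHits)
                    ? moreUrl("open source", start, maxPage)
                    : "(last page)";
            System.out.println("start=" + start
                    + " rows=" + rowsOnPage(start, maxPage, totalHits)
                    + " next=" + next);
        }
    }
}
```

With 120 hits and 50 per page this walks through three pages of 50, 50 and 20 rows, and only the first two get a "more" link.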


In order to see all of these components working together I have created two sample NetBeans projects: a standalone (binary) application that does the indexing/crawling, and a web project with the search pages. The web application uses the GlassFish application server (whose web container is derived from Tomcat), but this can easily be changed to your preferred application server. Even though these applications are basic, they are a perfectly adequate solution for indexing and searching sites with many thousands of pages. With the current configuration a new index will overwrite the previous index, but this can be changed using the IndexWriter properties, which allows multiple sites to be indexed into one index.

 If there is sufficient interest and I have the time, I will create a C#/.NET web client that can query the index created in Java. This can be achieved using the Lucene.Net API.

 The code for the two projects can be downloaded by clicking on the "Download" button here, or checked out of the following svn repository (no user name or password is required): http://svn.assembla.com/svn/JSearchEngine

The code is provided "as is" with no express or implied warranty; however, every attempt has been made to ensure accuracy.

You are free to use this code for commercial/non-commercial use as long as you abide by the terms of the Lucene and Html Parser licences.

 

-> part 1
-> part 2
-> part 3
