In the final part of this series we will look at a very simple web page that takes a search term and posts it to a results page, which executes the query for us and displays the results. The Lucene object that searches the index is called an IndexSearcher. Before the query is run, it needs to be processed by our friend the analyser. The analyser splits the search text into tokens, lower-cases them and drops punctuation and common stop words, so that clean terms are used to query the index. It is important to use the same analyser to read the index as we used to write to it. We used the StandardAnalyzer to process the terms that were written to the index, so we will use it again to process the search terms.
The following is the simple JSP page used to take the search term and post it to the results page.
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>JSP Search Page</title>
</head>
<body>
<form name="search" action="results.jsp" method="get">
<p>
<input name="query" size="44"/> Search Criteria
</p>
<p>
<input name="maxresults" size="4" value="100"/> Results Per Page
<input type="submit" value="Search"/>
</p>
</form>
</body>
</html>
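To make the analyser step described earlier more concrete, here is a minimal sketch of the idea in plain Java. It is a toy stand-in, not Lucene's StandardAnalyzer: the class name `ToyAnalyzer`, the method `analyze` and the stop-word list are all assumptions made for this illustration. It only shows why the text being indexed and the search term have to go through the same normalisation before their tokens can match.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy stand-in for an analyser; this is NOT Lucene's StandardAnalyzer.
class ToyAnalyzer {

    // Hypothetical stop-word list, chosen only for this illustration.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "the", "of", "to"));

    // Lower-case the text, split on anything that is not a letter or digit,
    // and drop empty tokens and stop words.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        String lower = text.toLowerCase(Locale.ROOT);
        for (String token : lower.split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The same method is used for the indexed text and for the search term.
        List<String> indexed = analyze("The Apache Lucene Search-Engine!");
        List<String> searched = analyze("LUCENE search");
        System.out.println(indexed);
        System.out.println(searched);
        // The search matches only if its tokens are among the indexed tokens.
        System.out.println(indexed.containsAll(searched));
    }
}
```

If the search term were normalised differently from the indexed text (for example, not lower-cased), the token "LUCENE" would never equal the indexed token "lucene", which is the practical reason for using one analyser on both sides.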
The following is the results page that takes the search parameters and displays the results. Most of the code is concerned with paging the results; the piece of interest is the IndexSearcher, which takes the search term, processes it with the standard analyser and outputs the results. This results page is taken directly from the Lucene documentation with a few modifications:
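<%-- Imports assume the Lucene 2.x package layout; adjust them for other versions. --%>
<%@ page import="java.net.URLEncoder, org.apache.lucene.analysis.Analyzer, org.apache.lucene.analysis.standard.StandardAnalyzer, org.apache.lucene.document.Document, org.apache.lucene.queryParser.ParseException, org.apache.lucene.queryParser.QueryParser, org.apache.lucene.search.Hits, org.apache.lucene.search.IndexSearcher, org.apache.lucene.search.Query" %>
<%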
/*
Author: Andrew C. Oliver, SuperLink Software, Inc. (acoliver2@users.sourceforge.net)
This jsp page is deliberately written in the horrible java directly embedded
in the page style for an easy and concise demonstration of Lucene.
Do note…if you write pages that look like this…sooner or later
you'll have a maintenance nightmare. If you use jsps…use taglibs
and beans! That being said, this should be acceptable for a small
page demonstrating how one uses Lucene in a web app.
This is also deliberately overcommented. ;-)
*/
%>
<%!
public String escapeHTML(String s) {
s = s.replaceAll("&", "&amp;");
s = s.replaceAll("<", "&lt;");
s = s.replaceAll(">", "&gt;");
s = s.replaceAll("\"", "&quot;");
s = s.replaceAll("'", "&apos;");
return s;
}
%>
<%
boolean error = false; //used to control flow for error messages
String indexName = "/opt/lucene/index"; //local copy of the configuration variable
IndexSearcher searcher = null; //the searcher used to open/search the index
Query query = null; //the Query created by the QueryParser
Hits hits = null; //the search results
int startindex = 0; //the first index displayed on this page
int maxpage = 50; //the maximum items displayed on this page
String queryString = null; //the query entered in the previous page
String startVal = null; //string version of startindex
String maxresults = null; //string version of maxpage
int thispage = 0; //used for the for/next either maxpage or
//hits.length() - startindex - whichever is
//less
try {
searcher = new IndexSearcher(indexName); //create an indexSearcher for our page
//NOTE: this operation is slow for large
//indices (much slower than the search itself)
//so you might want to keep an IndexSearcher
//open
} catch (Exception e) { //any error that happens is probably due
//to a permission problem or non-existent
//or otherwise corrupt index
%>
<p>ERROR opening the Index - contact sysadmin!</p>
<p>Error message: <%=escapeHTML(e.getMessage())%></p>
<% error = true; //don't do anything up to the footer
}
%>
<%
if (error == false) { //did we open the index?
queryString = request.getParameter("query"); //get the search criteria
startVal = request.getParameter("startat"); //get the start index
maxresults = request.getParameter("maxresults"); //get max results per page
try {
maxpage = Integer.parseInt(maxresults); //parse the max results first
startindex = Integer.parseInt(startVal); //then the start index
} catch (Exception e) {
} //we don't care if something happens we'll just start at 0
//or end at 50
if (queryString == null) {
throw new ServletException("no query " + //if you don't have a query then
"specified"); //you probably played on the
} //query string so you get the
//treatment
Analyzer analyzer = new StandardAnalyzer(); //construct our usual analyzer
try {
QueryParser qp = new QueryParser("contents", analyzer);
query = qp.parse(queryString); //parse the
} catch (ParseException e) { //query and construct the Query
//object
//if it's just "operator error"
//send them a nice error HTML
%>
<p>Error while parsing query: <%=escapeHTML(e.getMessage())%></p>
<%
error = true; //don't bother with the rest of
//the page
}
}
%>
<%
if (error == false && searcher != null) { // if we've had no errors
// searcher != null was to handle
// a weird compilation bug
thispage = maxpage; // default last element to maxpage
hits = searcher.search(query); // run the query
if (hits.length() == 0) { // if we got no results tell the user
%>
<p> I'm sorry I couldn't find what you were looking for. </p>
<%
error = true; // don't bother with the rest of the
// page
}
}
if (error == false && searcher != null) {
%>
<table>
<tr>
<td>Document</td>
<td>Summary</td>
</tr>
<%
if ((startindex + maxpage) > hits.length()) {
thispage = hits.length() - startindex; // set the max index to maxpage or last
} // actual search result whichever is less
for (int i = startindex; i < (thispage + startindex); i++) { // for each element
%>
<tr>
<%
Document doc = hits.doc(i); //get the next document
String doctitle = doc.get("title"); //get its title
String url = doc.get("path"); //get its path field
if (url != null && url.startsWith("../webapps/")) { // strip off ../webapps prefix if present
url = url.substring(10);
}
if ((doctitle == null) || doctitle.equals("")) //use the path if it has no title
{
doctitle = url;
}
//then output!
%>
<td><a href="<%=url%>"><%=doctitle%></a></td>
<td><%=doc.get("summary")%></td>
</tr>
<%
}
%>
<% if ((startindex + maxpage) < hits.length()) { //if there are more results…display
//the more link
String moreurl = "results.jsp?query=" +
URLEncoder.encode(queryString, "UTF-8") + //construct the "more" link
"&maxresults=" + maxpage +
"&startat=" + (startindex + maxpage);
%>
<tr>
<td></td><td><a href="<%=moreurl%>">More Results>></a></td>
</tr>
<%
}
%>
</table>
<% } //then include our footer.
if (searcher != null) {
searcher.close();
}
%>
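Since most of the results page is paging arithmetic, it may help to see that logic on its own. The following is a minimal sketch, not the code from the page: the class `ResultPager` and its method names are made up for this illustration, and unlike the JSP it also guards against a negative or out-of-range start index. It shows how the number of rows for the current page and the "more" link are derived from the start index, the page size and the total number of hits.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper that isolates the paging arithmetic of the results page.
class ResultPager {

    // Number of hits to show on the page that starts at startIndex:
    // either a full page or whatever is left, whichever is less.
    static int pageSize(int startIndex, int maxPerPage, int totalHits) {
        if (startIndex < 0 || startIndex >= totalHits || maxPerPage <= 0) {
            return 0; // extra guard, not present in the JSP page
        }
        return Math.min(maxPerPage, totalHits - startIndex);
    }

    // True when there are hits beyond the current page, so a "more" link is needed.
    static boolean hasMore(int startIndex, int maxPerPage, int totalHits) {
        return startIndex + maxPerPage < totalHits;
    }

    // Build the link to the next page, URL-encoding the query as UTF-8.
    static String moreUrl(String query, int startIndex, int maxPerPage) {
        try {
            return "results.jsp?query=" + URLEncoder.encode(query, "UTF-8")
                    + "&maxresults=" + maxPerPage
                    + "&startat=" + (startIndex + maxPerPage);
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is a required charset on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        int totalHits = 250;
        int maxPerPage = 100;
        // Walk through the pages the way repeated clicks on "more" would.
        for (int start = 0; start < totalHits; start += maxPerPage) {
            System.out.println("start=" + start
                    + " rows=" + pageSize(start, maxPerPage, totalHits)
                    + " more=" + hasMore(start, maxPerPage, totalHits));
        }
        System.out.println(moreUrl("lucene search", 0, maxPerPage));
    }
}
```

With 250 hits and 100 results per page, the pages hold 100, 100 and 50 rows, and only the last page has no "more" link.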
In order to see all of these components working together I have created two sample NetBeans projects: a standalone application that does the crawling and indexing, and a web project with the search pages. The web application uses the GlassFish application server (whose web container is derived from Tomcat), but this can easily be changed to your preferred application server. Even though these applications are basic, they are a perfectly adequate solution for indexing and searching sites with many thousands of pages. With the current configuration, a new index overwrites the previous one. This can be changed through the index writer settings, so that documents are added to an existing index, which allows multiple sites to be indexed into one index.
If there is sufficient interest and I have the time, I will create a C#.Net web client that can query the index created in Java. This can be achieved using the Lucene.Net API.
The code for the two projects can be downloaded by clicking on the "Download" button, or checked out of the following svn repository (no user name or password is required): http://svn.assembla.com/svn/JSearchEngine
The code is provided "as is" with no express or implied warranty; however, every attempt has been made to ensure accuracy.
You are free to use this code for commercial/non-commercial use as long as you abide by the terms of the Lucene and Html Parser licences.
-> part 1
-> part 2
-> part 3