Archive for the ‘Uncategorized’ Category

Directory of IT services for Ireland & UK

Wednesday, April 1st, 2009

I am currently putting together a directory of IT services at http://www.tek-dev.com/dir/ for use by companies in Ireland and the UK. I am hoping to keep the directory concise while at the same time including as many quality services as possible.

If you are an Irish or UK based IT company that does not specialise in software development (conflict of interest!), please submit your link and a short description to the directory for inclusion using the submit link.

If you cannot find a suitable category, leave a comment here and I will add one as appropriate.

C# Open Source Mail and Calendaring System

Monday, March 23rd, 2009

I have spent the last few days looking at the Zimbra open source mail and calendaring system. This project is hosted by Yahoo and, on closer examination, seems to be only semi-open source. Let me explain. Although the source code is available and the project seems to be seeking collaboration from the wider community, there are several problems. The first is code availability: the Zimbra project has multiple versions, from subscription-based supported versions all the way down to an open source version. The main issue here is that the open source version is stripped of some of the nice features that come with the subscription version, which seems to go against the ethos of open source. The second problem I have with the project is the way in which contributors from outside Zimbra/Yahoo can contribute code changes and bug fixes. If you are not a Zimbra employee you cannot make changes to the code repositories directly; you have to send your contributions along with a signed contract to Zimbra, where a decision will be made on whether or not your contributions will be accepted.

This seems to me an odd way to do business, and coders who make valuable contributions to the project are still only entitled to a copy of the cut-down "open source" version, not the feature-rich one that subscribers have to pay for. At this rate I am surprised there are any contributors at all! One would be forgiven for speculating that Zimbra have fulfilled the bare minimum necessary to brand their product open source in order to benefit from a wide variety of open source tools such as Lucene, Postfix, OpenLDAP, MySQL etc.

Aside from the particular model the project uses to operate, the software is very, very good and is the only serious open source competitor to the Exchange servers of this world! What I like most about the Zimbra Collaboration Suite is its web client. The interface is constructed from almost 100% JavaScript, which is contained within a number of Java JSP pages. What struck me about this architecture is that if most of the nuts and bolts are JavaScript, then how difficult would it be to launch this type of interface from a few .aspx .NET pages?

What I am proposing to do is to launch a new C# open source mail/calendar web-based application using this type of technology. The project would be much simpler and would contain the following components:

A free database with an ADO.NET client, such as Firebird.

A server component which contains all of the data access and business logic.

An open source SMTP mailer such as CSES, which will be used by the server component.

An ASP.NET front end which makes use of the Zimbra JavaScript calendar and mail client objects.

Possibly a desktop client which provides similar functionality to the web client.

As a first pass I don't intend to integrate security with Active Directory; a simple forms-based security model could be implemented in the server to authenticate potential users of the system.

I have already been playing around with a couple of C# prototypes to see what sort of architecture the system should take on, but what I will really need help with is porting the JavaScript client from Zimbra's Java platform to .NET. I am looking for as many interested .NET/Java/JavaScript developers as possible to help me get this thing off the ground, so if you are interested please leave a comment at the end of this entry and I will be in touch.

 

When to use rentacoder or elance for outsourcing a software project

Friday, March 20th, 2009

A lot of people have by now heard of rentacoder.com, elance.com and similar websites. These are online marketplaces where software buyers can pay coders to create desktop and web applications using an auction system similar to eBay. A software buyer places his requirements on the site, where coders review them and bid on the work depending on how much they are willing to complete it for. At a glance this seems like a good service, where buyers get good value from the competing coders while freelancers get a steady stream of work, but is it?

I would say the answer is both yes and no, depending on project type and size.

The first important point to note about these services is the profile of the programmers and freelancers that are registered on the site. Because the sites have an international audience, projects tend to get lower bids from developing countries. Projects generally get awarded to the lower bids, and because jobs generally go to emerging countries, coders from developed countries tend not to participate in the auctions. At these sites there will typically be a lot of bids from countries such as India, Romania, Pakistan and Sri Lanka; occasionally there will be bids from developed countries such as the United States or Ireland. This does not mean, however, that the work will be substandard. It can be quite the contrary in fact, with countries like India having some of the best software engineering schools in the world.

One of the main difficulties of using offshore development teams that are not native English speakers is communication, and even when English is the native language, dialects can vary a lot depending on location. A lot of the time this obstacle can be overcome by using written forms of communication such as email and messenger, however this becomes impractical on larger projects. One of the more important aspects of the written forms of communication used in these services is that they can be used as evidence by site mediators should the project go into arbitration. This is, however, dependent on the site-approved messaging forums being used as opposed to external email/messenger.

Another important point to note about the coders on this type of service is that they are often individuals with other full-time jobs, which gives them limited bandwidth for their freelance work. Having said this, a lot of the bidders are companies with multiple coders (or at least claim to have multiple coders!). The elance service seems to have many more companies making bids than individuals. I suspect this bias is brought about by the expensive subscription fees charged by elance; companies with a high turnover of work/revenue seem to favour this site as the fees are a flat monthly rate compared to a % of the profit, which is the model employed by rentacoder.

A point to note when placing a job on one of these services is the level of technical expertise held by the poster. If the person posting the job does not have at least some grounding in software development/engineering, this can lead to problems. One of the main complaints from coders on these sites is that software buyers provide ambiguous or unrealistic requirements. This can lead to the job being ignored by coders or, worse still, the job being won by a coder who fails to deliver due to the non-specific requirements. If the job goes into arbitration before it is complete, the buyer risks losing money because the software vaguely meets the buyer's written requirements without delivering what the buyer actually wants.

So in summary this can be a good service which provides low-cost software, however there is still some risk as discussed above. One of the main deciders of whether or not to use one of these services is the size and complexity of the project. If it is a large and complex project I would say hire a local developer/team where you can at least have some face-to-face meetings. Similarly, if you do not have knowledge of basic software development processes then I would hire a systems analyst or similar professional to at least help you with the requirements. But if you have a smaller project where you know what you want, and failure of the project will not expose you or your company to substantial risk, then I would say go for it :)

So whatever you decide, good luck with your software endeavours!

 



Open Source Enterprise Search Review Part Four

Friday, March 20th, 2009

In the final part of this series we will look at a very simple web page that takes a search term, then posts the term to a search page that executes the query for us and displays the results. The Lucene object that reads the index is called an IndexReader; the results page below works through an IndexSearcher, which wraps one. Before the query is executed it needs to be processed by our friend the analyser. The analyser normalises the text (lower-casing, stripping punctuation and, depending on the analyser chosen, reducing plurals to their stem) so that clean terms are tokenised and used to query the index. It is important to use the same analyser to read the index as we used to write to it. We used the standard analyser to process the terms that were written to the index, so we will use it again to process the search terms. The following is the simple JSP page used to take the search term and post it to the results page.

  <html>


    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
        <title>JSP Search Page</title>
    </head>
    <body>
        <form name="search" action="results.jsp" method="get">
        <p>
            <input name="query" size="44"/> Search Criteria
        </p>
        <p>
            <input name="maxresults" size="4" value="100"/> Results Per Page 
            <input type="submit" value="Search"/>
        </p>
        </form>
    </body>
</html>

The following is the results page that takes the search parameters and displays the results. Most of the code is concerned with paging the results; the piece of interest is the index searcher that takes the search term, processes it with the standard analyser and outputs the results. This results page is taken directly from the Lucene documentation with a few modifications:

<%@ page import="java.net.URLEncoder,
                 org.apache.lucene.analysis.Analyzer,
                 org.apache.lucene.analysis.standard.StandardAnalyzer,
                 org.apache.lucene.document.Document,
                 org.apache.lucene.queryParser.ParseException,
                 org.apache.lucene.queryParser.QueryParser,
                 org.apache.lucene.search.Hits,
                 org.apache.lucene.search.IndexSearcher,
                 org.apache.lucene.search.Query" %>
<%
/*
        Author: Andrew C. Oliver, SuperLink Software, Inc. (acoliver2@users.sourceforge.net)

        This jsp page is deliberately written in the horrible java directly embedded
        in the page style for an easy and concise demonstration of Lucene.
        Do note...if you write pages that look like this...sooner or later
        you'll have a maintenance nightmare.  If you use jsps...use taglibs
        and beans!  That being said, this should be acceptable for a small
        page demonstrating how one uses Lucene in a web app.

        This is also deliberately overcommented. ;-)
 */
%>
<%!
    public String escapeHTML(String s) {
        s = s.replaceAll("&", "&amp;");
        s = s.replaceAll("<", "&lt;");
        s = s.replaceAll(">", "&gt;");
        s = s.replaceAll("\"", "&quot;");
        s = s.replaceAll("'", "&apos;");
        return s;
    }
%>
<%
    boolean error = false;                   //used to control flow for error messages
    String indexName = "/opt/lucene/index";  //local copy of the configuration variable
    IndexSearcher searcher = null;           //the searcher used to open/search the index
    Query query = null;                      //the Query created by the QueryParser
    Hits hits = null;                        //the search results
    int startindex = 0;                      //the first index displayed on this page
    int maxpage = 50;                        //the maximum items displayed on this page
    String queryString = null;               //the query entered in the previous page
    String startVal = null;                  //string version of startindex
    String maxresults = null;                //string version of maxpage
    int thispage = 0;                        //used for the for/next either maxpage or
                                             //hits.length() - startindex - whichever is
                                             //less

    try {
        searcher = new IndexSearcher(indexName);  //create an indexSearcher for our page
                                                  //NOTE: this operation is slow for large
                                                  //indices (much slower than the search itself)
                                                  //so you might want to keep an IndexSearcher
                                                  //open
    } catch (Exception e) {                       //any error that happens is probably due
                                                  //to a permission problem or non-existent
                                                  //or otherwise corrupt index
%>
<p>ERROR opening the Index - contact sysadmin!</p>
<p>Error message: <%=escapeHTML(e.getMessage())%></p>
<%
        error = true;                             //don't do anything up to the footer
    }
%>
<%
    if (error == false) {                                  //did we open the index?
        queryString = request.getParameter("query");       //get the search criteria
        startVal = request.getParameter("startat");        //get the start index
        maxresults = request.getParameter("maxresults");   //get max results per page
        try {
            maxpage = Integer.parseInt(maxresults);        //parse the max results first
            startindex = Integer.parseInt(startVal);       //then the start index
        } catch (Exception e) {
        }   //we don't care if something happens we'll just start at 0
            //or end at 50

        if (queryString == null) {
            throw new ServletException("no query " +       //if you don't have a query then
                    "specified");                          //you probably played on the
        }                                                  //query string so you get the
                                                           //treatment

        Analyzer analyzer = new StandardAnalyzer();        //construct our usual analyzer
        try {
            QueryParser qp = new QueryParser("contents", analyzer);
            query = qp.parse(queryString);                 //parse the
        } catch (ParseException e) {                       //query and construct the Query
                                                           //object
                                                           //if it's just "operator error"
                                                           //send them a nice error HTML
%>
<p>Error while parsing query: <%=escapeHTML(e.getMessage())%></p>
<%
            error = true;                                  //don't bother with the rest of
                                                           //the page
        }
    }
%>
<%
    if (error == false && searcher != null) {              // if we've had no errors
                                                           // searcher != null was to handle
                                                           // a weird compilation bug
        thispage = maxpage;                                // default last element to maxpage
        hits = searcher.search(query);                     // run the query
        if (hits.length() == 0) {                          // if we got no results tell the user
%>
<p>I'm sorry I couldn't find what you were looking for.</p>
<%
            error = true;                                  // don't bother with the rest of the
                                                           // page
        }
    }

    if (error == false && searcher != null) {
%>
<table>
    <tr>
        <td>Document</td>
        <td>Summary</td>
    </tr>
<%
        if ((startindex + maxpage) > hits.length()) {
            thispage = hits.length() - startindex;         // set the max index to maxpage or last
        }                                                  // actual search result whichever is less

        for (int i = startindex; i < (thispage + startindex); i++) {  // for each element
%>
    <tr>
<%
            Document doc = hits.doc(i);                    //get the next document
            String doctitle = doc.get("title");            //get its title
            String url = doc.get("path");                  //get its path field
            if (url != null && url.startsWith("../webapps/")) { // strip off ../webapps prefix if present
                url = url.substring(10);
            }
            if ((doctitle == null) || doctitle.equals("")) {    //use the path if it has no title
                doctitle = url;
            }
            //then output!
%>
        <td><a href="<%=url%>"><%=doctitle%></a></td>
        <td><%=doc.get("summary")%></td>
    </tr>
<%
        }
%>
<%
        if ((startindex + maxpage) < hits.length()) {      //if there are more results...display
                                                           //the more link
            String moreurl = "results.jsp?query=" +
                    URLEncoder.encode(queryString) +       //construct the "more" link
                    "&maxresults=" + maxpage +
                    "&startat=" + (startindex + maxpage);
%>
    <tr>
        <td></td><td><a href="<%=moreurl%>">More Results>></a></td>
    </tr>
<%
        }
%>
</table>

<%
    }                                                      //then include our footer.
    if (searcher != null) {
        searcher.close();
    }
%>
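Since most of the results page is paging logic, it may help to see that logic on its own. The following is only a sketch: the class and method names (PagingSketch, pageSize, hasMore) are made up for illustration and do not appear in the Lucene demo, and the clamp to zero is an extra safety check that the page above does not perform.

```java
/** Sketch of the paging arithmetic used by the results page (names are hypothetical). */
class PagingSketch {

    /** How many hits to show on the page that starts at startIndex. */
    static int pageSize(int totalHits, int startIndex, int maxPage) {
        // Assumption: never return a negative count if startIndex is past the end.
        int remaining = Math.max(0, totalHits - startIndex);
        // Either a full page, or whatever is left, whichever is less.
        return Math.min(maxPage, remaining);
    }

    /** True when a "More Results" link should be displayed. */
    static boolean hasMore(int totalHits, int startIndex, int maxPage) {
        return startIndex + maxPage < totalHits;
    }

    public static void main(String[] args) {
        int total = 120;
        for (int start = 0; start < total; start += 50) {
            System.out.println("start=" + start
                    + " size=" + pageSize(total, start, 50)
                    + " more=" + hasMore(total, start, 50));
        }
    }
}
```

In the JSP the same values are called hits.length(), startindex and maxpage, and the "more" link simply passes startindex + maxpage back in as the next startat parameter.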


In order to see all of these components working together I have created two sample NetBeans projects: one binary application that does the indexing/crawling, and a web project with the search pages. The web application uses the GlassFish application server (whose web container is derived from Tomcat), but this can easily be changed to your preferred application server. Even though these applications are basic, they are a perfectly adequate solution for indexing and searching sites with many thousands of pages. With the current configuration a new index will overwrite the previous index, but this can be changed through the create flag passed to the index writer, which allows multiple sites to be indexed into one index.

If there is sufficient interest and I have the time, I will create a C#.NET web client that can query the index created in Java. This can be achieved using the Lucene.Net API.

The code for the two projects can be downloaded by clicking on the "Download" button here.

Or can be checked out of the following svn repository, there is no user name or password required: http://svn.assembla.com/svn/JSearchEngine

The code is provided "as is" with no express or
implied warranty, however every attempt has been made to ensure accuracy.

You are free to use this code for commercial/non-commercial use as long as you abide by the terms of the Lucene and Html Parser licences.

 

-> part 1
-> part 2
-> part 3

Open Source Enterprise Search Review Part Three

Friday, March 20th, 2009

In the last part we looked at the Lucene architecture and how Lucene documents are composed. In this part we will concentrate on crawling a website and then adding the crawled site to a Lucene index. In part 1 we discussed the notion of recursively crawling web pages to eventually find all of the pages/links in a website. We can achieve this by implementing a recursive indexing/crawling function. There are four important things going on in the following piece of code:

  1. Go to starting page, index the page then follow all links on the page 
  2. After following the links, index the linked pages, and follow the links on those pages.
  3. Repeat step 2 
  4. Do not index pages that have already been indexed, or pages from other sites.

The recursive crawler/indexer also makes use of an object called LinkParser. This is an object that uses the HTML Parser library to extract all of the links from a particular web page.

In order to write a document to the index, Lucene requires the use of an Analyzer. An analyser processes the content before it is added to the index. In this example we use the standard analyser, which is little more than a string tokeniser.

package jsearchengine;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;

import java.net.URL;
import java.util.ArrayList;

public class Main {

    private static IndexWriter writer;          // new index being built
    private static ArrayList indexed;
    private static String beginDomain;

    public static void main(String[] args) throws Exception {

        String index = "/opt/lucene/index";
        boolean create = true;
        String link = "http://www.tek-dev.com/";
        beginDomain = Domain(link);
        System.out.println(beginDomain);

        writer = new IndexWriter(index, new StandardAnalyzer(), create,
                new IndexWriter.MaxFieldLength(1000000));
        indexed = new ArrayList();

        indexDocs(link);

        System.out.println("Optimizing...");
        writer.optimize();
        writer.close();
    }

    private static void indexDocs(String url) throws Exception {

        //index page
        Document doc = HTMLDocument.Document(url);
        System.out.println("adding " + doc.get("path"));
        try {
            indexed.add(doc.get("path"));
            writer.addDocument(doc);          // add docs unconditionally
            //TODO: only add html docs
            //and create other doc types

            //get all links on the page then index them
            LinkParser lp = new LinkParser(url);
            URL[] links = lp.ExtractLinks();

            for (URL l : links) {
                //make sure the url hasn't already been indexed
                //make sure the url contains the home domain
                //ignore urls with a querystring by excluding "?"
                if ((!indexed.contains(l.toURI().toString())) &&
                    (l.toURI().toString().contains(beginDomain)) &&
                    (!l.toURI().toString().contains("?"))) {
                    //don't index zip files
                    if (!l.toURI().toString().endsWith(".zip")) {
                        System.out.print(l.toURI().toString());
                        indexDocs(l.toURI().toString());
                    }
                }
            }

        } catch (Exception e) {
            System.out.println(e.toString());
        }
    }

    private static String Domain(String url) {
        int firstDot = url.indexOf(".");
        int lastDot = url.lastIndexOf(".");
        return url.substring(firstDot + 1, lastDot);
    }
}
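As an aside, the "same site, not yet indexed" test in step 4 is done above with a substring match on the domain, which would also accept a URL on another host that merely contains the same text. A stricter variant might compare hosts instead. The following is a rough sketch only: CrawlFilter and shouldCrawl are hypothetical names that are not part of the downloadable project, and the rules (exact host match, no query string, no .zip files) are simply my reading of the checks in indexDocs.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper: decides whether a link should be followed by the crawler. */
class CrawlFilter {

    static boolean shouldCrawl(String link, String startHost, Set<String> indexed) {
        try {
            URI uri = new URI(link);
            String host = uri.getHost();          // null for mailto: and similar links
            return host != null
                    && host.equalsIgnoreCase(startHost)   // stay on the starting site
                    && uri.getQuery() == null              // skip URLs with a query string
                    && !uri.getPath().endsWith(".zip")     // skip zip downloads
                    && !indexed.contains(link);            // skip pages already indexed
        } catch (URISyntaxException e) {
            return false;                         // unparsable links are ignored
        }
    }

    public static void main(String[] args) {
        Set<String> indexed = new HashSet<String>();
        indexed.add("http://www.tek-dev.com/");
        String[] links = {
            "http://www.tek-dev.com/",
            "http://www.tek-dev.com/dir/",
            "http://www.tek-dev.com/page?id=1",
            "http://www.tek-dev.com.other.example/",
        };
        for (String link : links) {
            System.out.println(link + " -> " + shouldCrawl(link, "www.tek-dev.com", indexed));
        }
    }
}
```

The trade-off is that an exact host match treats www.tek-dev.com and tek-dev.com as different sites, so a real crawler would probably want to normalise the host first.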

The following is the link parser object used in the above code:

 

import org.htmlparser.NodeFilter;
import org.htmlparser.Parser;
import org.htmlparser.filters.NodeClassFilter;
import org.htmlparser.tags.LinkTag;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

import java.util.Vector;
import java.net.URL;
import java.net.MalformedURLException;

/**
 *
 * @author Stephen.Lane
 */
public class LinkParser {

    String url;
    Parser parser;
    NodeFilter filter;
    NodeList list;
    LinkTag link;
    URL[] linkArray;
    Vector vector;

    public LinkParser(String Url) {
        url = Url;
    }

    public URL[] ExtractLinks() {
        filter = new NodeClassFilter(LinkTag.class);

        try {
            parser = new Parser(url);
            list = parser.extractAllNodesThatMatch(filter);

            vector = new Vector();
            for (int i = 0; i < list.size(); i++) {
                try {
                    link = (LinkTag) list.elementAt(i);
                    vector.add(new URL(link.getLink()));
                } catch (MalformedURLException murle) {
                }
            }
            linkArray = new URL[vector.size()];
            vector.copyInto(linkArray);
        } catch (ParserException e) {
            e.printStackTrace();
        }

        return (linkArray);
    }
}

Now that we have the pages crawled and each of the pages indexed, all we have to do is query the index to display some results. We will look at that in part 4.

-> part 1
-> part 2
-> part 4

Open Source Enterprise Search Review Part Two

Friday, March 20th, 2009

In the first part of this four-part series we looked at some of the fundamentals of enterprise search. In this second part we will take a look at some of the available open source technologies that we can leverage to implement those concepts.

Without a doubt the most challenging part of a search engine is the way we index content, and then subsequently how we interrogate that index. One of the main challenges within search is providing results that are relevant to a search term. Take for example the word “Run”: if we search for the word “Run” we will also be interested in documents that contain words like “Runnable”, “Running”, “Runs” etc. Thankfully there are indexing engines that we can make use of that handle all of this logic transparently. Lucene is an Apache Foundation indexing project, and is widely accepted to be the best open source indexing project available.
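To make the “Run” example a little more concrete, here is a deliberately naive sketch of how different forms of a word could be reduced to one common term before indexing and searching. This is not how Lucene does it internally; the class name NaiveStemmer and its three-suffix rule list are invented purely for illustration, and a real stemmer handles far more cases.

```java
import java.util.Locale;

/** Toy suffix stripper, for illustration only (not Lucene's algorithm). */
class NaiveStemmer {

    // Assumption: a tiny, hand-picked suffix list that covers the "Run" example.
    private static final String[] SUFFIXES = {"able", "ing", "s"};

    static String stem(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        for (String suffix : SUFFIXES) {
            // Only strip a suffix when at least three letters would remain.
            if (w.endsWith(suffix) && w.length() - suffix.length() >= 3) {
                w = w.substring(0, w.length() - suffix.length());
                // Undo a doubled final letter, e.g. "runn" becomes "run".
                int n = w.length();
                if (n >= 2 && w.charAt(n - 1) == w.charAt(n - 2)) {
                    w = w.substring(0, n - 1);
                }
                break;
            }
        }
        return w;
    }

    public static void main(String[] args) {
        for (String word : new String[] {"Run", "Runs", "Running", "Runnable"}) {
            System.out.println(word + " -> " + stem(word));
        }
    }
}
```

If both the indexed text and the search term pass through the same function, a search for “Run” will also match documents that only contain “Running”.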

In order to implement Lucene we must first examine its basic architecture. Lucene is very cleverly designed: it appears to be simple enough in nature, but the devil is definitely in the detail! The objects that Lucene adds to its index are intuitively called documents. A document is made up of fields, and each field has a field name and a field value. For example, if we consider a web page, each of its attributes/tags can be mapped to a field, such as its title, body etc. If we take the title field, the field name is “title” and the field value is the actual title. When we query the index, we are returned a collection of documents that match the search term. Any of the stored fields in the document can be returned as part of the result set into a pager or repeater. The following code shows an HTML document type with title, summary, contents and path (URL) fields. The example uses the HTML Parser library to extract text from web pages.

import org.htmlparser.beans.StringBean;
import org.htmlparser.Parser;
import org.htmlparser.NodeFilter;
import org.htmlparser.filters.TagNameFilter;
import org.htmlparser.util.ParserException;

import java.io.*;

import org.apache.lucene.document.*;

/** A utility for making Lucene Documents for HTML documents. */
public class HTMLDocument {

    public static Document Document(String url)
            throws IOException, InterruptedException {

        // make a new, empty document
        Document doc = new Document();
        String title = new String();
        String summary = new String();

        // Add the url as a field named "path".  Use a field that is
        // indexed (i.e. searchable), but don't tokenize the field into words.
        doc.add(new Field("path", url, Field.Store.YES, Field.Index.NOT_ANALYZED));

        // Add the tag-stripped contents as a Reader-valued Text field so it will
        // get tokenized and indexed.
        StringBean sb = new StringBean();
        sb.setLinks(false);
        sb.setURL(url);
        StringReader sr = new StringReader(sb.getStrings());
        doc.add(new Field("contents", sr));

        Parser bParser;
        NodeFilter bFilter;

        try {
            bParser = new Parser();
            bFilter = new TagNameFilter("TITLE");
            bParser.setResource(url);
            title = bParser.parse(bFilter).asString();
        } catch (ParserException e) {
            e.printStackTrace();
        }

        try {
            bParser = new Parser();
            bFilter = new TagNameFilter("BODY");
            bParser.setResource(url);
            try {
                summary = bParser.parse(bFilter).asString().substring(0, 200);
            } catch (StringIndexOutOfBoundsException e) {
                summary = "";
            }
        } catch (ParserException e) {
            e.printStackTrace();
        }

        // Add the title as a field that it can be searched and that is stored.
        doc.add(new Field("title", title, Field.Store.YES, Field.Index.ANALYZED));
        doc.add(new Field("summary", summary, Field.Store.YES, Field.Index.NO));

        return doc;
    }

    private HTMLDocument() {
    }
}

To create these HTML documents for a whole website, we need to implement some form of crawling mechanism. We will look at this in part three.

-> part 1
-> part 3
-> part 4

Open Source Enterprise Search Review Part One

Friday, March 20th, 2009

When we type our search term into a site search box and hit enter, we expect to be provided with a list of relevant results from the site in question relating to our search term. I am not talking about a search engine such as G£$Gle or Y$%oo; I’m talking about the search boxes that decorate most good websites and forums. So how do they work? Excellent question! Most users don’t know or care, but as software developers we are intrigued by this seemingly simple yet important functionality. Whether it be a tool for searching the entire internet, a corporate intranet or a small website, the principle is the same. If we think about it long enough most of us will come up with some sort of approximation as to how the process works. It might go something like “Find all the available data sources and move them into an archive, store the data in some sort of searchable format, and make a client to search this database...” Although this sounds simple, it is the basic principle behind how all major search engines work. Put more formally, the three steps are Crawling, Indexing and Searching. There are a variety of tools which can be mixed and matched to complete these activities, and of course they sound much better when they are free or open source, so that’s what we will focus on in this article.

  • Crawling

Crawling, as mentioned above, is the process of finding all available data sources and storing them in some archive. The method used for finding all available data sources will vary depending on whether we are crawling a network folder, a website or the internet.

If we want to crawl a file system folder such as a SharePoint™ archive, all we need to do is loop through all of the files in the directory and place them into our archive.

If we want to take a snapshot of the data in one website to search, we can start by recursively following all of the links on the homepage of the site and storing a copy of each page as we go. Following links recursively means following all the links on a page, then following all the links on the target pages, and so on. It is important not to follow links that leave the site, as that would cause data external to the site to be stored.

Searching the entire internet is achieved in a similar manner to a single site search. Most search engines have a facility for webmasters to submit their site for inclusion in the search. The large search engines then recursively crawl both the internal and external links on these submitted sites. This means that even if a site isn’t submitted to the search engine, it will still get crawled if it is linked to by a site that is included in the search engine. Of course this takes a lot of CPU and bandwidth (hundreds of thousands of servers)!
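The single-site case can be sketched in a few lines of Java. To keep the example self-contained, the “web” here is just an in-memory map from each page to the links found on it; the class name CrawlSketch, the linksOn helper and the example.* addresses are all made up for illustration, and a real crawler would download and parse each page instead.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Sketch of a recursive single-site crawl over an in-memory link graph. */
class CrawlSketch {

    /** Hypothetical stand-in for "download the page and extract its links". */
    static List<String> linksOn(Map<String, List<String>> web, String page) {
        List<String> links = web.get(page);
        return links != null ? links : new ArrayList<String>();
    }

    /** Stores the page, then recursively follows every link that stays on the site. */
    static void crawl(Map<String, List<String>> web, String page,
                      String sitePrefix, Set<String> visited) {
        // Stop at external links and at pages we have already stored.
        if (!page.startsWith(sitePrefix) || !visited.add(page)) {
            return;
        }
        for (String link : linksOn(web, page)) {
            crawl(web, link, sitePrefix, visited);
        }
    }

    public static void main(String[] args) {
        Map<String, List<String>> web = new HashMap<String, List<String>>();
        web.put("http://site.example/",
                Arrays.asList("http://site.example/a", "http://other.example/"));
        web.put("http://site.example/a",
                Arrays.asList("http://site.example/", "http://site.example/b"));

        Set<String> visited = new LinkedHashSet<String>();
        crawl(web, "http://site.example/", "http://site.example/", visited);
        System.out.println(visited);
    }
}
```

The visited set is what stops the recursion from looping forever when two pages link to each other, and the prefix check is what keeps the crawl from wandering off the site.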

  • Indexing

Once all of the data has been stored to a repository, the fun can start. At this stage the files and data are not very searchable, as the data is stored in a so-called “heap”. To search all of this unstructured data would be very inefficient and slow. In order to make the content more accessible, the data needs to be stored in a structured format called an index, which is why this process is called indexing. In its simplest form an index is a sorted list of all of the words and phrases that are found in the content that has been retrieved. The words and phrases are stored in alphabetical order along with their source and rank or popularity. One of the primary issues with adding content to the index is its format: all content needs to be converted to readable text in order to add it to the index. This is a problem when the data is stored in complex or proprietary file types. However, this can be overcome by the creation of content parsers. Often these parsers are created by the owners of proprietary file types to make their file types more accessible.

Now that the content is added to a sorted index, a keyword search on that index will quickly retrieve all of the sources for that word or phrase. When there are multiple sources they will be ranked based on the popularity or rank of the source. This raises the question: how is the popularity of a page determined? There are countless algorithms from a variety of vendors, most of which are proprietary. A very famous algorithm named PageRank™ is used by the Google™ search engine. Its exact ranking/popularity scoring technique is a well-guarded secret, but it works on the basic premise that if a page has a lot of incoming links then it is considered popular. If the links are coming from pages that are themselves considered popular, then the receiving page is considered to be more popular still, or to have a higher page rank.

Depending on the engine, crawling and indexing can either be done as a single step or as two separate processes. If the two are combined then the content is added to the index immediately after it is retrieved. This is more common in smaller solutions, as it does not scale very well. For larger enterprise or internet engines it is much easier to manage one set of high-bandwidth servers to crawl and download content, and another set of servers with high CPU capacity to parse and index the content.
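As a rough illustration of the “sorted list of words with their sources” idea, here is a minimal sketch in Java. The class name InvertedIndexSketch and its methods are invented for this example, the word splitting is deliberately crude, and rank/popularity is left out entirely; real engines use far more compact structures.

```java
import java.util.Collections;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Minimal word-to-sources index; illustration only, with no ranking. */
class InvertedIndexSketch {

    // word -> sources containing it; a TreeMap keeps the words in alphabetical order
    private final Map<String, Set<String>> index = new TreeMap<String, Set<String>>();

    /** Splits the text into lower-case words and records the source for each one. */
    void add(String source, String text) {
        for (String word : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (word.isEmpty()) {
                continue; // split() can yield an empty first element
            }
            Set<String> sources = index.get(word);
            if (sources == null) {
                sources = new TreeSet<String>();
                index.put(word, sources);
            }
            sources.add(source);
        }
    }

    /** Returns every source that contains the given word. */
    Set<String> lookup(String word) {
        Set<String> sources = index.get(word.toLowerCase(Locale.ROOT));
        return sources != null ? sources : Collections.<String>emptySet();
    }

    /** All indexed words, in alphabetical order. */
    Set<String> words() {
        return index.keySet();
    }

    public static void main(String[] args) {
        InvertedIndexSketch idx = new InvertedIndexSketch();
        idx.add("page1.html", "Open source search");
        idx.add("page2.html", "Search engines index content");
        System.out.println(idx.words());
        System.out.println(idx.lookup("search"));
    }
}
```

Because the words are kept sorted, finding the sources for a keyword is a quick lookup rather than a scan through every stored page.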

  • Searching

Now that all of the content has been indexed, the hard work has been done. All a client has to do to get some search results is submit a search word or term to the index. If the index is stored in a relational database, then an SQL query can be used to retrieve the results. However, search engines tend to use specially designed, highly efficient data structures to store the index. In this case the method used to retrieve the results will depend on the implementation.
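For a word-to-sources index held in memory, a multi-word search could be sketched as follows. This is just one possible approach (an AND search that intersects the source sets of each term); the class name SearchSketch and the sample data are assumptions made for the example.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Sketch of an AND search over a word-to-sources map (hypothetical structure). */
class SearchSketch {

    /** Returns the sources that contain every term in the query. */
    static Set<String> search(Map<String, Set<String>> index, String query) {
        Set<String> results = null;
        for (String term : query.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
            if (term.isEmpty()) {
                continue; // an empty query yields a single empty term
            }
            Set<String> sources = index.get(term);
            if (sources == null) {
                return new TreeSet<String>(); // unknown term, so nothing can match
            }
            if (results == null) {
                results = new TreeSet<String>(sources); // copy so the index is untouched
            } else {
                results.retainAll(sources);             // keep only sources with every term
            }
        }
        return results != null ? results : new TreeSet<String>();
    }

    public static void main(String[] args) {
        Map<String, Set<String>> index = new HashMap<String, Set<String>>();
        index.put("open", new TreeSet<String>(Arrays.asList("a.html", "b.html")));
        index.put("source", new TreeSet<String>(Arrays.asList("b.html", "c.html")));
        System.out.println(search(index, "open source"));
    }
}
```

A production engine would also order the matching sources by rank, which this sketch leaves out.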

So, now that we know all of the boring stuff, let’s get stuck in and look at some example implementations in part 2.

-> part 2
-> part 3
-> part 4