When we type out search term into that site search box and hit enter, we expect to be provided with a list of relevant results from the site in question relating to our search term. I am not talking about a search engine such as G£$Gle or Y$%oo I’m talking about the search boxes that decorate most good websites and forums. So how do they work ? Excellent question!, most users don’t know or care but as a software developer we are intrigued by this seemingly simple yet important functionality. Whether it be a tool for searching the entire internet, a corporate intranet or a small website, the principal is the same. If we think about it long enough most of us will come up with some sort of approximation as to how the process works. It might go something like “Find all the available data sources and move them into an archive, store the data in some soft of searchable format, and make a client to search this database ……….” Although this sounds simple this is basic principal of how all major search engines. Put more formally the three steps are Crawling, Indexing and Searching. There are a variety of tools which can be mixed and matched to complete these activities, and of course they sound much better when they are free or open source, so that’s what we will focus for this article.
- Crawling
Crawling as mentioned above is the process of finding all available data sources and storing in it in some archive. The method used for finding all available data sources will vary depending if we are crawling a network folder, a website or the internet. If we want to crawl a files system folder such as a SharePoint™ archive all we need to do is loop through all of the files in the directory and place them into our archive. If we want to take a snapshot of the data in one website to search, we can start by recursively following all of the links on the homepage of the site and storing copy of each page as we go. Following links recursively means following all the links on a page, and then following all the links on the target pages and so on. It important not to follow links that leave the site as that will cause data external to the site getting stored. Searching the entire internet is achieved in a similar manner to a single site search. Most search engines have the facility for webmasters to submit their site to those engines for inclusion in their search. The large search engines then recursively crawl both the internal and external links on these submitted sites. This means that even if a site isn’t submitted to the search engine, it will still get crawled if it is linked to by a site that is included in the search engine. Of course this takes a lot of cpu and bandwidth !. (hundreds of thousands of servers)
- Indexing
Once all of the data has been stored to a repository, then the fun can start. At this stage the files and data are not very searchable as the data is stored in a so called “heap”. To search all of this unstructured data would be very inefficient and slow. In order to make the content more accessible the data need to be stored in a structured format called an index. Thus this is why this process is called indexing. In it simplest form an index is a sorted list of all of the words and phrases that are found in the content that has been retrieved. The words and phrases will be stored in alphabetical order along with their source and rank or popularity. One of the primary issues with adding content to the index is its format, all content needs to be converted to readable text in order to add it to the index. This is a problem when the data is stored in complex or proprietary file types. However this can be over come be the creation of content parsers. Often these parsers are created by owners or proprietary files types to make there file types more accessible. Now that the content is added to a sorted index, a keyword search on that index will quickly retrieve all of the sources for that word or phrase. When there are multiple sources they will be ranked based on the popularity or rank of the source. This then begs the question, so how is the popularity of a page determined. Well there are countless algorithms from a variety of vendors, most of which are propitiatory. A very famous algorithm named Page Rank™ is used by the Google™ search engine. Its exact ranking/popularity scoring technique is a well guarded secret but it works on the basic premise that if a page has a lot of incoming links then it is considered popular. If the links are coming from pages that are considered popular, then the receiving page is considered to be more popular or have a higher page rank. Depending on the engine, crawling and indexing can either be done as a single step or as two separate processes. If the two are combined then the content is added to the index immediately after it is retrieved. This is more common on smaller solutions as it does not scale very well. For larger enterprise or internet engines it is much easier to manage one set of high bandwidth servers to crawl and download content, and have set of servers with high cpu capacity to parse and index the content.
- Searching
Now that all of the content has been indexed the hard work has been done. Now all a client has to do to get some search results is submit a search word or term to the index. If the index is stored in a relational database, then and sql query can be used to retrieve the results. However search engines tend to use specially designed highly efficient data structures to store the index. In this case the method used to retrieve the results will depend on the implementation.
So, now that we know all of the boring stuff, let’s get stuck in and look at some example implementations in part 2.