How can you make your website searchable?

How to install an index search for your website

Thorsten Eggeling

Self-built HTML pages can have a fast search function too. All you need is a tool for creating the index and a few JavaScript files.

The search result shows links to the web pages, an info text on the content and a relevance indicator.

Content management systems (CMS) usually offer visitors a search field. That is easy to provide because most of the content is stored in a database. If, on the other hand, a website consists of static HTML pages that are not generated from a database, a search function is harder to implement. One possibility is to integrate the Google search (https://cse.google.de) into the website. However, this assumes that Google has already indexed all pages, so new pages are initially not taken into account. This article takes a different approach, with its own indexer and a search function implemented in JavaScript. The process is particularly suitable for content that consists of many individual pages, such as scientific papers, documentation or product catalogs.

1. Search and find information

In the age of Google & Co., it is hard to imagine not being able to find something. But even on your own hard drive, you will not be able to locate specific content among thousands of files without special search tools. Different file formats such as DOC, ODT or PDF make the search more difficult. A website made up only of HTML files is easier to handle. Nevertheless, searching the files directly would usually be too slow here as well. An index speeds up the search significantly: it contains all the words from the indexed documents, along with references to the places where they were found.
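
The idea of an index can be illustrated with a few lines of JavaScript. The following is only a minimal sketch and not the Webhelp indexer's actual data format: the function names and the two sample pages are made up for illustration, and instead of exact positions it stores only the file name and the number of occurrences per word.

```javascript
// Build a tiny inverted index: word -> Map(file -> number of occurrences).
// Assumption: illustrative structure only, not the format the indexer writes.
function buildIndex(pages) {
  const index = new Map();
  for (const [file, text] of Object.entries(pages)) {
    // Lowercase the text and split on anything that is not a letter or digit.
    const words = text.toLowerCase().split(/[^\p{L}\p{N}]+/u).filter(Boolean);
    for (const word of words) {
      if (!index.has(word)) index.set(word, new Map());
      const postings = index.get(word);
      postings.set(file, (postings.get(file) || 0) + 1);
    }
  }
  return index;
}

// Look a word up in the index without reading the page texts again.
function lookup(index, word) {
  const postings = index.get(word.toLowerCase());
  return postings ? Object.fromEntries(postings) : {};
}

// Hypothetical sample pages, not the article's sample files.
const pages = {
  "a.html": "The student reads. The student writes.",
  "b.html": "A teacher reads.",
};
const index = buildIndex(pages);
console.log(lookup(index, "Student")); // { 'a.html': 2 }
console.log(lookup(index, "reads"));   // { 'a.html': 1, 'b.html': 1 }
```

The expensive step, reading every page, happens once when the index is built. Each later search is a single lookup by word.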

The HTML indexer we use was originally developed for documentation and manuals in HTML format under the name Webhelp Indexer. Since it does not require a web server, the files generated are also suitable for delivery on DVD or USB stick.

You can find the indexer and the sample files at www.pcwelt.de/PF8JZA. The Java source code of the Webhelp indexer that we have adapted can be found at www.pcwelt.de/druQN1, the original code at http://sourceforge.net/projects/docbook.

2. Install and test the HTML indexer

Unzip the downloaded ZIP archive into your home directory. It contains the “lib” folder with the required Java tools, the sample files are located under “doc”, and the “indexer.sh” script is used to generate the index. Make the script executable by changing to the unpacked folder in a terminal window and running “chmod +x indexer.sh”.

If Java is not yet available, install a Java Runtime Environment (JRE). For Linux Mint and Ubuntu, for example, the package is called “default-jre”. Try out the indexer by starting it on the command line with “./indexer.sh”.

The bash script recursively processes all HTML files under “doc”. It stores the search index under “doc/search”. To try out the search, open the file “doc/index.html” in the browser. Enter a search term, for example “student”. No distinction is made between upper and lower case. The search results page shows the headings of the pages that contain the term. Next to each heading is a row of dots that indicates the relevance, i.e. the frequency of the search term on the respective page. When you click the link to a page, the matches are highlighted. You will notice that not only “Student” itself but also inflected forms of the word are highlighted. The indexer uses a stemming algorithm that tries to trace the variants of a word back to the word stem.
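
How stemming, case-insensitive matching and a frequency-based relevance fit together can be sketched as follows. This is a toy example under stated assumptions: "toyStem" and "rankPages" are hypothetical names, the suffix rules are deliberately crude and English-only, and the real indexer relies on its bundled Java libraries rather than on anything like this.

```javascript
// Toy stemmer: lowercases a word and strips a few common English suffixes.
// Assumption: purely illustrative; real stemmers use far more rules.
function toyStem(word) {
  const w = word.toLowerCase();
  for (const suffix of ["ing", "es", "s"]) {
    if (w.length > suffix.length + 2 && w.endsWith(suffix)) {
      return w.slice(0, -suffix.length);
    }
  }
  return w;
}

// Count how often the stem of the search term occurs on each page and
// sort the pages so that the most frequent match comes first.
function rankPages(pages, term) {
  const stem = toyStem(term);
  const results = [];
  for (const [file, text] of Object.entries(pages)) {
    const words = text.split(/[^\p{L}\p{N}]+/u).filter(Boolean);
    const hits = words.filter((w) => toyStem(w) === stem).length;
    if (hits > 0) results.push({ file, hits });
  }
  return results.sort((a, b) => b.hits - a.hits);
}

// Hypothetical sample pages.
const samplePages = {
  "one.html": "Students help a student.",
  "two.html": "Every student and all students met other students.",
  "three.html": "No match here.",
};

console.log(rankPages(samplePages, "STUDENT"));
// two.html (3 hits) is listed before one.html (2 hits); three.html is omitted.
```

Because both the search term and the page words are reduced to the same stem, a search for one form also finds the others. The price is that a crude stemmer sometimes merges words that do not belong together.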

Only content that is surrounded by div tags with id="content_idx" is taken into account.

3. Prepare web pages for indexing

Rename the example directory “doc” to something like “doc.bak”, and create a new folder “doc”. Copy the folders “common” and “search” from “doc.bak” to “doc”. In order for the indexer to do its job, you need to adapt your web pages to it. Content that the indexer should take into account must be within the tags <div id="content_idx"> and </div>. Typically, you will surround articles with these tags, but not footers or other content that is repeated on every page. Also include a description line in the head section, for example <meta name="description" content="description">.

Replace “description” with a sentence or two about the respective page. This text then appears in the search result under the page title. Copy the HTML files into the “doc” folder and start the indexer again with “./indexer.sh”.

Look at "search/htmlFileInfoList.js" in an editor. The file contains the list of indexed files. The files "index-1.js", "index-2.js" and "index-3.js" contain the indexed words.
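
To picture which parts of a prepared page end up in the index, the following sketch pulls the marked content and the description out of an HTML string. It is not the indexer's code, which is written in Java. The function names and the sample page are hypothetical, and the meta description form used here is an assumption. Only the div with id="content_idx" is taken from this article.

```javascript
// Return the text inside <div id="content_idx"> ... </div> with tags removed.
// Assumption: regex-based sketch for well-formed pages, not a full HTML parser.
function extractIndexedText(html) {
  const open = /<div\b[^>]*\bid\s*=\s*["']content_idx["'][^>]*>/i.exec(html);
  if (!open) return "";
  const start = open.index + open[0].length;
  // Walk through the following div tags and track nesting depth
  // until the closing tag that matches the marker div is found.
  const tag = /<(\/?)div\b[^>]*>/gi;
  tag.lastIndex = start;
  let depth = 1;
  let m;
  while ((m = tag.exec(html)) !== null) {
    depth += m[1] === "/" ? -1 : 1;
    if (depth === 0) {
      return html
        .slice(start, m.index)
        .replace(/<[^>]+>/g, " ")
        .replace(/\s+/g, " ")
        .trim();
    }
  }
  return ""; // unbalanced markup
}

// Return the content of a meta description tag (assumed form, name before content).
function extractDescription(html) {
  const m = /<meta\b[^>]*\bname\s*=\s*["']description["'][^>]*\bcontent\s*=\s*(["'])(.*?)\1/i.exec(html);
  return m ? m[2] : "";
}

// Hypothetical sample page.
const samplePage = `<html><head><title>Demo</title>
<meta name="description" content="A page about students.">
</head><body>
<div id="content_idx"><h1>Students</h1><div class="box">Every student counts.</div></div>
<div id="footer">Imprint</div>
</body></html>`;

console.log(extractIndexedText(samplePage)); // Students Every student counts.
console.log(extractDescription(samplePage)); // A page about students.
```

The footer text never appears in the result, which is why content repeated on every page belongs outside the marked div.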

Install the JavaScript files: So that the search function can be called up on all pages of the website, add a few JavaScript references. Open the file "index.html" from the backup folder "doc.bak" in an editor and copy the script block from it into your HTML files.

You can also reuse the search form (the form element) from this file. It calls the JavaScript function "Verifie(searchForm)" from "nwSearchFnt.js". This function first checks whether the search term is at least one character long and otherwise issues an error message. The search term is then saved in a cookie. The actual search and the display of the results take place in "searchresult.html", which is loaded automatically next. This file must contain a specific line for the search to work, so copy it from the examples folder and customize it for your website. Note the slightly changed search form in "searchresult.html": in contrast to the other pages, the form there only calls the JavaScript function and does not forward to another page.

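The hand-over from the search form to the results page can be sketched as follows. This is not the code from "nwSearchFnt.js": the helper names and the cookie name "searchTerm" are assumptions made for illustration, and trimming the input before the length check is an addition of this sketch.

```javascript
// Hypothetical helper names; not the code from nwSearchFnt.js.
const COOKIE_NAME = "searchTerm"; // assumed cookie name

// Check that the search term is at least one character long.
// Assumption: surrounding whitespace is trimmed before the check.
function validateTerm(raw) {
  const term = raw.trim();
  if (term.length < 1) {
    return { ok: false, message: "Please enter a search term." };
  }
  return { ok: true, term };
}

// Build the string a page would assign to document.cookie.
function toCookieString(term) {
  return COOKIE_NAME + "=" + encodeURIComponent(term) + "; path=/";
}

// Read the term back from a cookie string such as document.cookie.
function readTermFromCookies(cookieString) {
  for (const part of cookieString.split(";")) {
    const [name, ...rest] = part.trim().split("=");
    if (name === COOKIE_NAME) return decodeURIComponent(rest.join("="));
  }
  return null;
}

// In a browser, the form handler could then do something like:
//   const check = validateTerm(form.elements[0].value);
//   if (!check.ok) { alert(check.message); return false; }
//   document.cookie = toCookieString(check.term);
//   window.location.href = "searchresult.html";

console.log(toCookieString("web help"));                        // searchTerm=web%20help; path=/
console.log(readTermFromCookies("a=1; searchTerm=web%20help")); // web help
```

Passing the term through a cookie means that the results page needs no server-side logic, which fits a site that is also meant to run from a DVD or USB stick.
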
4. Customize the indexer script

The script "indexer.sh" mainly contains the paths to the necessary Java tools such as Saxon, Xerces and Lucene. You do not have to worry about these any further, because the associated JAR files are in the "lib" folder and you do not have to install them yourself. If necessary, you can change the name of the folder with the HTML files after "OUTPUT_DIR=". If your website offers English-language texts, enter the value "en" after "-DindexerLanguage=". Other available languages are French ("fr"), Japanese ("ja") and Chinese ("zh"). The stemmer function, which does not always work without errors (see section 2), can be switched off with "-DdoStem=false".

Make PDF files searchable

The indexer presented here only takes HTML files into account. You would therefore have to convert other formats to HTML. In the case of PDF files, this works with the pdftohtml tool. It is included in the "poppler-utils" package on Ubuntu and related systems. Start the tool in a terminal window in the form "pdftohtml file.pdf", where "file.pdf" stands for the PDF to convert.

There are several options to customize the output. Call the tool without parameters to see an overview, or use "man pdftohtml". For example, try the "-c" option: the original layout is then retained as far as possible, with images forming a page-filling background. If necessary, you can link to the original PDF file from the converted pages. You can find an example of this in the folder "doc/wording".