How can you make your website searchable?
How to install an index search for your website
1. Search and find information
In the age of Google & Co, it is hard to imagine not being able to find something. But even on your own hard drive you will not be able to locate a specific piece of content among thousands of files without special search tools. Different file formats such as DOC, ODT or PDF make the search more difficult. A website made up only of HTML files is easier to search. Even so, searching directly in the files would usually be too slow here as well. An index speeds up the search significantly. It contains all the words from the indexed documents, together with references to the places where they were found.
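To illustrate the principle of an index, here is a minimal sketch in shell. It is not how the Webhelp Indexer works internally: the function names build_index and lookup and the tab-separated index format are assumptions made up for this illustration, and the tag stripping only handles tags that open and close on the same line.

```shell
# Build a tiny word index: one "word<TAB>file" line per word and file.
# $1: folder containing the HTML files
build_index() {
  dir="$1"
  for f in "$dir"/*.html; do
    [ -f "$f" ] || continue
    sed 's/<[^>]*>/ /g' "$f" |            # replace HTML tags with spaces
      tr '[:upper:]' '[:lower:]' |        # ignore upper and lower case
      tr -cs '[:alnum:]' '\n' |           # one word per line
      sort -u |                           # each word once per file
      awk -v file="$f" 'NF { print $0 "\t" file }'
  done | sort
}

# Print the files that contain a word.
# $1: index file written by build_index, $2: search term
lookup() {
  term=$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]')
  awk -F '\t' -v w="$term" '$1 == w { print $2 }' "$1"
}
```

With these two functions, something like `build_index doc > index.tsv` followed by `lookup index.tsv student` answers a query by reading only the index, not the documents themselves.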
The HTML indexer we use was originally developed for documentation and manuals in HTML format under the name Webhelp Indexer. Since it does not require a web server, the files generated are also suitable for delivery on DVD or USB stick.
You can find the indexer and the sample files at www.pcwelt.de/PF8JZA. The Java source code of the Webhelp indexer that we have adapted can be found at www.pcwelt.de/druQN1, the original code at http://sourceforge.net/projects/docbook.
2. Install and test the HTML indexer
Unzip the downloaded ZIP archive into your home directory. It contains the “lib” folder with the required Java tools, the sample files are located under “doc”, and the “indexer.sh” script is used to generate the index. Make it executable in a terminal window with the following command line:
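A minimal sketch of this step, assuming the script is called indexer.sh as stated above and sits directly in the unpacked folder. The function name make_indexer_executable is only a wrapper invented for this example; typing the chmod line by hand in the unpacked folder has the same effect.

```shell
# Make indexer.sh executable.
# $1: folder containing indexer.sh (default: the current folder)
make_indexer_executable() {
  chmod +x "${1:-.}/indexer.sh"
}
```

Called without an argument from inside the unpacked folder, this corresponds to `chmod +x indexer.sh`.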
If no Java Runtime Environment (JRE) is available on your system, install one. For Linux Mint and Ubuntu, for example, the package is called “default-jre”. Try out the indexer by starting it on the command line:
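The call could look roughly like the following sketch. It assumes that you are in the unpacked folder and that the script has already been made executable; the function name run_indexer and the wording of the hint are invented for this example.

```shell
# Check for a Java runtime first, then start the indexer script.
run_indexer() {
  if ! command -v java >/dev/null 2>&1; then
    echo "No Java runtime found. On Ubuntu or Linux Mint try: sudo apt install default-jre" >&2
    return 1
  fi
  ./indexer.sh
}
```

If Java is already installed, this boils down to typing `./indexer.sh`.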
The bash script recursively processes all HTML files under "doc". It stores the search index under “doc/search”. To try out the search, open the file "doc/index.html" in the browser. Enter a search term, for example "student". No distinction is made between upper and lower case. The search results page shows the headings of the pages that contain the term. Each heading is followed by a row of dots that indicates the relevance, i.e. the frequency of the search term on the respective page. When you click the link to a page, the matches are highlighted. You will notice that not only the exact term “Student” but also inflected variants of the word are highlighted. The indexer uses a stemming algorithm that tries to trace the variants of a word back to the word stem.
3. Prepare web pages for indexing
Rename the example directory “doc” to something like “doc.bak” and create a new folder “doc”. Copy the folders “common” and “search” from “doc.bak” to “doc”. For the indexer to do its job, you need to adapt your web pages to it. Content that the indexer should take into account must be within the tags "
Replace “description” with a sentence or two with information about the respective page. This text then appears in the search result under the page title. Copy the HTML files into the "doc" folder and start the indexer:
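The folder preparation described in this section and the subsequent indexer run can be sketched as follows. The function name prepare_and_index and its argument, the folder holding your prepared HTML pages, are assumptions for this example; the sketch also assumes that you start it from the unpacked folder.

```shell
# Back up the sample files, set up a fresh "doc" folder and rebuild the index.
# $1: folder with your own, already prepared HTML files
prepare_and_index() {
  site_dir="$1"
  mv doc doc.bak &&                             # keep the samples as a backup
    mkdir doc &&                                # new, empty document folder
    cp -r doc.bak/common doc.bak/search doc/ && # files the search page needs
    cp -r "$site_dir"/. doc/ &&                 # your own HTML pages
    ./indexer.sh                                # regenerate the search index
}
```

Each step only runs if the previous one succeeded, so a failed copy does not lead to an index over an incomplete folder.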
Look at "search/htmlFileInfo-List.js" in an editor. The file contains the list of indexed files. The files “index-1.js”, “index-2.js” and “index-3.js” contain the indexed words.
You can also use the search form ("
4. Customize the indexer script
The script "indexer.sh" mainly contains the path information for the necessary Java tools such as Saxon, Xerces and Lucene. You do not have to worry about this any further, because the associated JAR files are in the "lib" folder and you do not have to install them yourself. If necessary, you can change the name of the folder with the HTML files after “OUTPUT_DIR=”. If your website offers English-language texts, enter the value "en" after "-DindexerLanguage=". Other available languages are French (“fr”), Japanese (“ja”) and Chinese (“zh”). The stemming function (see section 2), which is not always error-free, can be switched off with "-DdoStem=false".
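To find the places to edit, you can have the relevant lines of the script listed with their line numbers. This is only a sketch: the function name show_indexer_settings is invented for this example, and the search pattern allows optional spaces before the equals sign because the exact spelling inside your copy of the script may differ.

```shell
# List the lines of the indexer script that contain the three settings.
# $1: path to the script (default: indexer.sh in the current folder)
show_indexer_settings() {
  grep -n -E -e 'OUTPUT_DIR *=|-DindexerLanguage *=|-DdoStem *=' "${1:-indexer.sh}"
}
```

Open the script in an editor afterwards and change the values in the lines shown.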
Make PDF files searchable
The indexer presented here only takes HTML files into account. You therefore have to convert other formats to HTML first. In the case of PDF files, this works with the pdftohtml tool. It is included in the "poppler-utils" package on Ubuntu and related systems. Start the tool in a terminal window in the following form:
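The simplest form of the call passes only the PDF file; pdftohtml then writes the HTML output next to it. The following is a sketch under that assumption. The function name convert_pdf and the file name manual.pdf in the usage note are placeholders, not files from the archive.

```shell
# Convert a single PDF file to HTML with pdftohtml.
# $1: path to the PDF file
convert_pdf() {
  if ! command -v pdftohtml >/dev/null 2>&1; then
    echo "pdftohtml not found. On Ubuntu try: sudo apt install poppler-utils" >&2
    return 1
  fi
  pdftohtml "$1"
}
```

A call such as `convert_pdf manual.pdf` corresponds to typing `pdftohtml manual.pdf` directly.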
Several options let you customize the output. Call the tool without parameters to see an overview, or use man pdftohtml. For example, try the "-c" option: the original layout is then retained as far as possible, with images forming a page-filling background. If necessary, you can link from the converted pages to the original PDF file. You can find an example of this in the folder "doc/wording".
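If you have several PDF files, the conversion with the "-c" option can be run in a loop. The following sketch assumes that all PDFs are in one folder and that the HTML output may be written next to them; the function name convert_all_pdfs is invented for this example.

```shell
# Convert every PDF in a folder to HTML, keeping the layout with -c.
# $1: folder containing the PDF files
convert_all_pdfs() {
  (
    cd "$1" || exit 1
    for pdf in *.pdf; do
      [ -f "$pdf" ] || continue     # skip if the folder holds no PDFs
      pdftohtml -c "./$pdf"
    done
  )
}
```

Afterwards, copy the generated HTML files and images into the "doc" folder and run the indexer again so that the new pages end up in the index.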