One challenge of building a search engine is that we cannot load all the data into the memory at once. Therefore retrieving relevant information about a given term requires multiple steps. The goal is to know the location information of this term in the whole collection and this information is stored on a physical disk or multiple physical disks somewhere in the system.
The general process flow is illustrated by the figure blow: A query comes in and it contains multiple terms. For each team, we need to retrieve its postings information. By retrieving, we mean the information is load into the memory. This corresponds to the last box in the flow and the color green indicates that the information is in the memory. To process a given term, we first look it up in some in-memory index. The in-memory index is a data structure that allows us to figure out the physical location of saved posting information of the given term while minimizing the number of disk reads.
1. Data Structure
Dictionary and postings list are two core data structures used in search engine. Standard built-in in-memory implementation is not sufficient due to the following reasons:
- The data of locations of terms are huge and we cannot load all of them into memory. Therefore, disk-based data structure and algorithms are needed in a search engine implementation.
- The distribution of terms in documents is not uniform. In fact, depending on the task the search engine may need to handle distribution with very specific characteristics. Therefore, the standard build-in implementation may not always be optimal and a customized version could usually have better performance.
There are two ways to implement a dictionary:
- Hash-based dictionary: hash map
- Sort-based dictionary: binary search tree
For a hash-based dictionary, the size of the hash table matters because the data volume is large and too many collisions can result in poor performance. Otherwise, the hash-based approach is generally faster than the sort based approach because it doesn't need to perform binary search. However, hash-based dictionary does not support support prefix queries well.
1.2 Postings List
One of the challenges of implementing postings list is again the fact the we may not be able to load the full postings list into memory. Moreover, the implementation should support efficient random access. The ideas of solving this type of problems are similar. We store some auxiliary information in memory, which provides approximated location of data that contains the target. Then we perform a linear scan starting from the returned location. For example, suppose for term \(t\), the posting list has 20,000 items. Instead of loading all 10,000 items into the memory, we could construct a list by adding every 100 items. The resulting list only has 200 items. In order to find the next occurrence of the term \(t\) after position \(k\), we could first use the helper list to approximately locate the position and then perform a linear scan in the file stored on the disk. Items in the helper list are called synchronization point.
1.3 Interleaving Dictionary
An interleaving dictionary is to putting term and its postings list together in the file.
Here is the basic form of an Interleaving dictionary:
An optimized form:
2. Index Construction
There are two types of index construction
- In-memory index construction
- Disk-based index construction
In-memory index construction is essentially to create a hash table. The standard hash table implementation could work but it may not be optimal. According to Zipf's law, the distribution of terms are not uniform and we could leverage the distribution in a language to optimize the hash table performance.
There are two ways to construct a disk-based index:
- Sort-based Index Construction
- Merge-base Index Construction
The idea of a sort-based index construction is to consider the collection of documents as a collection of
(term, position) and we could apply disk-based sorting algorithm to sort these pairs. Note that this approach requires generating global term ID.
The idea of a merge-based index construction is to first build the index in memory. If there is no more memory available or the consumed memory reaches a threshold, we simply write the index to disk, clear the in-memory index and move forwards. After all the documents are processed, we can then merge the individual index saved on the disk. An individual index saved on the disk in this context is called an index partition.
----- END -----
©2019 - 2022 all rights reserved