Solution1: Scan each page

Solution2: Term-Document Incidence Matrix

image-20210318200615547

Solution3: Compact Version - Inverted File Index 倒排文件索引

[Definition] Index is a mechanism for locating a given term in a text.

[Definition] Inverted file contains a list of pointers (e.g. the number of a page) to all occurrences of that term in the text

image-20210318200810862

Note: Inverted because it lists for a term, all documents that contain the term.

image-20210318202149883
  • AND 操作从最小频率的 term 开始求交,速度快

Index Generator

1
2
3
4
5
6
7
8
9
10
11
while ( read a document D ) /*1*/
{
while ( read a term T in D ) /*2*/
{
if ( Find( Dictionary, T ) == false ) /*3*/
Insert( Dictionary, T );
Get T’s posting list;
Insert a node to T’s posting list;
}
}
Write the inverted index to disk; /*4*/
  1. Token Analyzer + Stop Filter
  2. Vocabulary Scanner
  3. Vocabulary Insertor
  4. Memory Management

Token Analyzer + Stop Filter

Word Stemming

  • Process a word so that only its stem or root form is left.
image-20210508162359043

Stop Words

  • Some words are so common that almost every document contains them, such as “a” “the” “it”. It is useless to index them. They are called stop words. We can eliminate them from the original documents.

Vocabulary Scanner and Insertor

  • Solution 1: Search trees(B- trees, B+ trees, Tries, …)
  • Solution 2: Hashing is faster for one word, but scanning in sequential order is not possible
image-20210508163733855

Memory Management

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
BlockCnt = 0; 
while ( read a document D )
{
while ( read a term T in D )
{
if ( out of memory )
{
Write BlockIndex[BlockCnt] to disk;
BlockCnt++;
FreeMemory;
}
if ( Find( Dictionary, T ) == false )
Insert( Dictionary, T );
Get T’s posting list;
Insert a node to T’s posting list;
}
}
for ( i = 0; i < BlockCnt; i++ )
Merge( InvertedIndex, BlockIndex[i] );
  • 在写磁盘之前先排序,归并效率高

Distributed indexing

  • Each node contains index of a subset of collection

  • Solution 1: Term-partitioned index

  • Solution 2: Document-partitioned index

Dynamic indexing

  • Docs come in over time

    • postings updates for terms already in dictionary
    • new terms added to dictionary
  • Docs get deleted

    image-20210319173346408

Compression

image-20210319191834565

Thresholding

  • Document: only retrieve the top xx documents where the documents are ranked by weight

    • Not feasible for Boolean queries
    • Can miss some relevant documents due to truncation
  • Query: Sort the query terms by their frequency in ascending order; search according to only some percentage of the original query terms

    image-20210508163029245

Measures for a search engine

image-20210319194234895 image-20210508163628653 image-20210319194057834 image-20210319194120352