Inverted File Index
Solution1: Scan each page
Solution2: Term-Document Incidence Matrix
Solution3: Compact Version - Inverted File Index 倒排文件索引
[Definition] Index is a mechanism for locating a given term in a text.
[Definition] Inverted file contains a list of pointers (e.g. the number of a page) to all occurrences of that term in the text
Note: Inverted because it lists for a term, all documents that contain the term.
- AND 操作从最小频率的 term 开始求交,速度快
Index Generator
1 | while ( read a document D ) /*1*/ |
- Token Analyzer + Stop Filter
- Vocabulary Scanner
- Vocabulary Insertor
- Memory Management
Token Analyzer + Stop Filter
Word Stemming
- Process a word so that only its stem or root form is left.
Stop Words
- Some words are so common that almost every document contains them, such as “a” “the” “it”. It is useless to index them. They are called stop words. We can eliminate them from the original documents.
Vocabulary Scanner and Insertor
- Solution 1: Search trees(B- trees, B+ trees, Tries, …)
- Solution 2: Hashing is faster for one word, but scanning in sequential order is not possible
Memory Management
1 | BlockCnt = 0; |
- 在写磁盘之前先排序,归并效率高
Distributed indexing
-
Each node contains index of a subset of collection
-
Solution 1: Term-partitioned index
-
Solution 2: Document-partitioned index
Dynamic indexing
-
Docs come in over time
- postings updates for terms already in dictionary
- new terms added to dictionary
-
Docs get deleted
Compression
Thresholding
-
Document: only retrieve the top documents where the documents are ranked by weight
- Not feasible for Boolean queries
- Can miss some relevant documents due to truncation
-
Query: Sort the query terms by their frequency in ascending order; search according to only some percentage of the original query terms
Measures for a search engine
本博客所有文章除特别声明外,均采用 CC BY-NC-SA 4.0 许可协议。转载请注明来自 OE.Heart's Blog!