Reading Notes For Week 2

IIR Section 1.2

In the resulting index, storage is needed for both the dictionary (usually kept in memory) and the postings lists (usually kept on disk), so the size of each matters. The postings lists need a data structure optimized for both storage efficiency and access speed.
A variable-length array avoids the overhead of pointers, and its contiguous memory layout increases speed on modern processors with memory caches. All in all, it is a good solution for space and time efficiency.
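
To make the variable-length array idea concrete, here is a minimal sketch in Python. The function name build_index and the toy documents are made up for illustration, and array("I") is only a stand-in for a contiguous block of unsigned integers; a real system would also compress the postings and keep them on disk.

```python
from array import array
from collections import defaultdict

def build_index(docs):
    """Map each term to a sorted, contiguous array of document IDs."""
    # array("I") stores unsigned ints back to back, with no per-element pointers.
    index = defaultdict(lambda: array("I"))
    for doc_id in sorted(docs):
        # set() so that a term is recorded once per document.
        for term in set(docs[doc_id].lower().split()):
            index[term].append(doc_id)  # doc IDs arrive in increasing order
    return dict(index)

# Hypothetical toy collection: doc ID -> text.
docs = {1: "new home sales", 2: "home sales rise", 3: "new rise"}
index = build_index(docs)
```

Here the Python dict plays the role of the in-memory dictionary and each array is one postings list. Appending in doc ID order keeps every list sorted without a separate sort step.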

IIR Section 2

Steps of index processing: convert the byte sequence into a linear sequence of characters, then determine what the document unit for indexing is.
Parsing a document raises several problems, such as identifying its format and the language it is written in. These are all classification problems.
We may need to normalize indexed words and query words into the same form.
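
A small sketch of the decode-then-normalize steps, under simplifying assumptions: the encoding is taken as known (UTF-8 here) rather than detected, tokens are just runs of word characters, and normalization is only case folding. The helper names decode and normalize are illustrative, not from the book.

```python
import re

def decode(raw, encoding="utf-8"):
    """Turn a byte sequence into a character sequence (encoding assumed known)."""
    return raw.decode(encoding)

def normalize(text):
    """Split into word tokens and lowercase them (case folding only)."""
    return [tok.lower() for tok in re.findall(r"\w+", text)]

# The same normalize() is applied to documents and to queries,
# so both sides end up in the same form.
raw = "Information Retrieval, information RETRIEVAL!".encode("utf-8")
doc_terms = normalize(decode(raw))
query_terms = normalize("INFORMATION retrieval")
```

The point of the design is that the document side and the query side go through one shared function, so a query term can only miss a document term for reasons other than surface form.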

IIR Section 3

There are two main data structures for dictionaries: hash tables and trees. The best-known search tree is the binary tree. Efficient search hinges on the tree being balanced. When the vocabulary keeps growing, a hash table occasionally has to rehash everything, which is expensive. The principal issue for trees is rebalancing.
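
A rough sketch of the two lookup styles. Python's dict stands in for the hash table, and a sorted list searched with bisect stands in for a balanced search tree (an assumption made to keep the example short; it is not a real tree and does not show rebalancing). The term list and helper names are made up.

```python
from bisect import bisect_left

terms = ["index", "posting", "query", "term", "token"]  # kept in sorted order

# Hash-style dictionary: exact-match lookup only.
hash_dict = {t: i for i, t in enumerate(terms)}

def sorted_lookup(sorted_terms, term):
    """Binary search for an exact term; returns its position or None."""
    pos = bisect_left(sorted_terms, term)
    if pos < len(sorted_terms) and sorted_terms[pos] == term:
        return pos
    return None

def prefix_range(sorted_terms, prefix):
    """Collect all terms starting with prefix; relies on the sorted order."""
    start = bisect_left(sorted_terms, prefix)
    matches = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break
        matches.append(term)
    return matches
```

Both structures answer exact lookups, but only the ordered one can return every term sharing a prefix, for example prefix_range(terms, "t"), without scanning the whole vocabulary.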
Maintain a second inverted index from bigrams to the dictionary terms that contain each bigram.
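
The bigram index could be sketched as follows. The "$" boundary marker and the function names are choices made for this example, and the tiny vocabulary is only for illustration.

```python
from collections import defaultdict

def bigrams(term):
    """Return the set of bigrams of a term, with '$' marking its boundaries."""
    padded = f"${term}$"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}

def build_bigram_index(vocabulary):
    """Map each bigram to the set of dictionary terms that contain it."""
    index = defaultdict(set)
    for term in vocabulary:
        for gram in bigrams(term):
            index[gram].add(term)
    return index

vocabulary = ["border", "lord", "morbid", "sordid"]
bigram_index = build_bigram_index(vocabulary)
candidates = sorted(bigram_index["or"])  # every term containing "or"
```

Looking up a bigram returns candidate terms, which can then be checked against the original pattern or misspelled word before going to the main inverted index.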
Document correction is necessary for OCR'ed documents, but usually we don't change the documents and instead fix the query-document mapping.

2140 - Information Storage and Retrieval