Sunday, September 13, 2015

Reading Notes For Week 3

IIR Chapter 4

  1. When building an information retrieval system, many decisions are based on the characteristics of the computer hardware on which the system runs.
  2. When main memory is insufficient, we need to use an external sorting algorithm. Its central requirement is to minimize the number of random disk seeks during sorting, because sequential disk reads are far faster than seeks.
  3. Blocked sort-based indexing (BSBI) has excellent scaling properties, but it needs a data structure for mapping terms to termIDs. For very large collections, this data structure does not fit into memory. A more scalable alternative is single-pass in-memory indexing (SPIMI).
  4. Difference between BSBI and SPIMI: SPIMI adds a posting directly to its postings list. Each postings list is dynamic (it grows as needed) and is immediately available to collect postings.
  5. Advantages of SPIMI: it is faster because no sorting is required, and it saves memory because we keep track of the term a postings list belongs to, so the termIDs of postings need not be stored.
  6. The map phase of MapReduce consists of mapping splits of the input data to key-value pairs.
    For the reduce phase, we want all values for a given key to be stored close together, so that they can be read and processed quickly.
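
The SPIMI idea from items 3 to 5 can be sketched in a few lines of Python. This is only a minimal illustration, not the book's pseudocode: the names `spimi_invert` and `merge_blocks` are made up for these notes, `max_postings` is a stand-in for "memory is full", and blocks are kept in a Python list where a real indexer would write them to disk and merge them with sequential reads.

```python
from collections import defaultdict


def spimi_invert(token_stream, max_postings):
    """Turn (term, doc_id) pairs into a list of sorted blocks.

    max_postings is an assumed stand-in for running out of memory.
    """
    blocks = []
    dictionary = defaultdict(list)  # term -> postings list, no termIDs
    count = 0
    for term, doc_id in token_stream:
        postings = dictionary[term]  # created on first sight of the term
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)  # posting goes straight into its list
            count += 1
        if count >= max_postings:
            # Only the terms are sorted, and only when the block is closed.
            blocks.append(sorted(dictionary.items()))
            dictionary = defaultdict(list)
            count = 0
    if dictionary:
        blocks.append(sorted(dictionary.items()))
    return blocks


def merge_blocks(blocks):
    """Merge blocks into one index (done in memory here for brevity)."""
    merged = defaultdict(list)
    for block in blocks:
        for term, postings in block:
            merged[term].extend(postings)
    return {term: sorted(set(p)) for term, p in sorted(merged.items())}


tokens = [("caesar", 1), ("brutus", 1), ("caesar", 2),
          ("calpurnia", 2), ("brutus", 3)]
index = merge_blocks(spimi_invert(tokens, max_postings=3))
```

Note that postings are never sorted as (termID, docID) pairs the way BSBI does it; each posting is appended to the list of its term as soon as it is seen.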

IIR Chapter 5

  1. The main benefit of compression is that we need less disk space. Two more subtle benefits are increased use of caching and faster transfer of data from disk to memory.
  2. The primary factor in determining the response time of an IR system is the number of disk seeks necessary to process a query.
  3. Using fixed-width entries for terms is clearly wasteful: short terms leave most of their slot unused, and terms longer than the slot cannot be stored. We can overcome these shortcomings by storing the dictionary terms as one long string of characters. The pointer to the next term also demarcates the end of the current term, and we can still locate terms by binary search in the (now smaller) table. This scheme saves about 60% of the space that fixed-width storage allots to the terms themselves.
  4. To encode small numbers in less space than large numbers, we look at two types of methods: bytewise compression and bitwise compression.
  5. Bytes offer a good compromise between compression ratio and speed of decompression.
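
To make items 4 and 5 concrete, here is a small sketch of variable byte (VB) encoding, the bytewise method covered in this chapter. Each byte carries 7 bits of payload, and the high bit is set only on the last byte of a number, so small numbers take one byte and larger ones take more. The function names are my own, and this is a sketch for the notes rather than a tuned implementation.

```python
def vb_encode_number(n):
    """Encode one non-negative integer as a variable byte code."""
    if n < 0:
        raise ValueError("VB encoding needs a non-negative integer")
    chunks = [n % 128]
    n //= 128
    while n > 0:
        chunks.insert(0, n % 128)  # higher-order 7-bit groups go in front
        n //= 128
    chunks[-1] += 128  # high bit marks the last byte of this number
    return bytes(chunks)


def vb_encode(numbers):
    """Concatenate the VB codes of a sequence of numbers."""
    return b"".join(vb_encode_number(n) for n in numbers)


def vb_decode(data):
    """Decode a VB-encoded byte string back into a list of integers."""
    numbers = []
    n = 0
    for byte in data:
        if byte < 128:
            n = 128 * n + byte  # more bytes of this number follow
        else:
            numbers.append(128 * n + (byte - 128))
            n = 0
    return numbers


encoded = vb_encode([824, 5, 214577])
decoded = vb_decode(encoded)
```

Here 5 fits in a single byte, 824 needs two, and 214577 needs three, which is the "small numbers in less space" property. Decoding only needs byte comparisons and additions, which is why bytes are a reasonable trade-off against the tighter but slower bitwise codes.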
