2140 - Information Storage and Retrieval : September 2015

Wednesday, September 23, 2015

Reading Notes For Week 4

MIR Chapter 2.1-2.5.3

How do users go about search tasks? The process is usually a cycle of four main activities: problem identification, articulation of information needs, query formulation, and results evaluation.
The standard model of the information seeking process is out of date. Because users learn as they search, models should emphasize the dynamic nature of the search process: the information need is adjusted as the retrieval results come in.
Many search engines allow users to peruse an information structure of some kind to select a starting point for search. However, I think users prefer browsing with less recall and seeing what the search engine returns.
According to the research, the less expert users are about a topic, the more likely they are to feel confident that all of the relevant information has been accessed.
Web search engines have become more sophisticated about dropping terms that would result in empty results.
An increasingly common strategy within the search form is to show hints, via greyed-out text, about what kind of information should be entered into each field. For instance, on one search form, the first box is labeled “what are you looking for?” while the second box is labeled “when (tonight, this weekend, ...)”. When the user places the cursor into the entry form, the grey text disappears, and the user can type in their query terms.
Since relevance feedback indicates which documents are relevant to the query, it can greatly improve rank ordering.

Muddiest Point For Lecture 3

When we do spell checking, do we need to compute the edit distance against every term in the dictionary?
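To make the question concrete, here is a minimal sketch of Levenshtein edit distance together with a brute-force scan over a dictionary. The function names and the tiny word list in the usage note are my own, made up for illustration; the lecture may use a different formulation or cost model.

```python
from typing import List


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit-cost insert, delete, and substitute."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def closest_terms(query: str, dictionary: List[str]) -> List[str]:
    """Brute force: compare the query against every dictionary term."""
    distances = [(edit_distance(query, term), term) for term in dictionary]
    best = min(d for d, _ in distances)
    return [term for d, term in distances if d == best]
```

For example, closest_terms("bord", ["board", "border", "lord"]) compares the query with all three terms. That is one distance computation per dictionary term, which is exactly the cost the question is asking about.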

Monday, September 14, 2015

Muddiest Point For Lecture 2

If the collection or the query includes multiple languages or formats, what algorithm should be used for stemming?

Sunday, September 13, 2015

Reading Notes For Week 3

IIR Chapter 4

When building an information retrieval system, many decisions are based on the characteristics of the computer hardware on which the system runs.
When main memory is insufficient, we need to use an external sorting algorithm, that is, one that uses disk. It must minimize the number of random disk seeks during sorting, because sequential disk reads are far faster than seeks.
Blocked sort-based indexing has excellent scaling properties, but it needs a data structure for mapping terms to termIDs. For very large collections, this data structure does not fit into memory. A more scalable alternative is single-pass in-memory indexing or SPIMI.
Difference between BSBI and SPIMI: SPIMI adds a posting directly to its postings list. Each postings list is dynamic (it grows as needed) and is immediately available to collect postings.
Advantages of SPIMI: it is faster because no sorting of postings is required, and it saves memory because we keep track of the term a postings list belongs to, so the termIDs of postings need not be stored.
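A rough sketch of the SPIMI idea for a single block, to help me remember it. This is my own simplification, not the book's pseudocode: the function name is made up, the block is returned in memory instead of being written to disk, and documents are assumed to arrive in increasing docID order.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def spimi_invert(token_stream: Iterable[Tuple[str, int]]) -> List[Tuple[str, List[int]]]:
    """Invert one block: (term, doc_id) pairs in, sorted (term, postings) out."""
    dictionary: Dict[str, List[int]] = defaultdict(list)
    for term, doc_id in token_stream:
        postings = dictionary[term]      # a new term gets an empty list on first sight
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)      # add the posting directly, no termID involved
    # Only the terms are sorted, once, so that blocks can be merged later.
    return sorted(dictionary.items())
```

The postings themselves are never sorted here; they stay in arrival order, which is why the docID-order assumption matters.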
The map phase of MapReduce consists of mapping splits of the input data to key-value pairs.

For the reduce phase, we want all values for a given key to be stored close together, so that they can be read and processed quickly.
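The map and reduce phases can be imitated on one machine in plain Python. This is only a toy sketch under my own assumptions: the function names are hypothetical, a split is a list of (doc_id, text) pairs, and tokenization is just lowercasing and splitting on whitespace. A real MapReduce job would run the phases on many machines.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_split(split: Iterable[Tuple[int, str]]) -> List[Tuple[str, int]]:
    """Map phase: turn one split of (doc_id, text) into (term, doc_id) pairs."""
    return [(term, doc_id) for doc_id, text in split for term in text.lower().split()]


def group_by_key(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Bring all values for the same key together before reducing."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for term, doc_id in pairs:
        grouped[term].append(doc_id)
    return grouped


def reduce_postings(term: str, doc_ids: List[int]) -> Tuple[str, List[int]]:
    """Reduce phase: collapse one key's values into a sorted postings list."""
    return term, sorted(set(doc_ids))


def build_index(splits: Iterable[Iterable[Tuple[int, str]]]) -> Dict[str, List[int]]:
    """Run map over every split, group by term, then reduce each group."""
    pairs = [pair for split in splits for pair in map_split(split)]
    return dict(reduce_postings(term, ids) for term, ids in group_by_key(pairs).items())
```

Grouping by key before reducing is the in-memory stand-in for storing all values of a key close together.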

IIR Chapter 5

The main benefit of compression is that we need less disk space. Two subtler benefits are increased use of caching and faster transfer of data from disk to memory.
One of the primary factors in determining the response time of an IR system is the number of disk seeks necessary to process a query.
Using fixed-width entries for terms is clearly wasteful. We can overcome these shortcomings by storing the dictionary terms as one long string of characters. The pointer to the next term is also used to demarcate the end of the current term. Then, we can locate terms in the data structure by way of binary search in the table. This scheme saves us 60% compared to fixed-width storage.
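Here is a small sketch of the dictionary-as-a-string idea. The function names are my own, and a Python string plus a list of offsets stands in for the real byte-level layout, so this shows the lookup logic rather than the actual space savings.

```python
from typing import List, Tuple


def build_dictionary(terms: List[str]) -> Tuple[str, List[int]]:
    """Concatenate the sorted terms; pointers[i] is where term i starts."""
    blob, pointers = "", []
    for term in sorted(terms):
        pointers.append(len(blob))
        blob += term
    pointers.append(len(blob))  # sentinel so the last term also has an end
    return blob, pointers


def term_at(blob: str, pointers: List[int], i: int) -> str:
    """Term i ends where term i + 1 begins."""
    return blob[pointers[i]:pointers[i + 1]]


def lookup(blob: str, pointers: List[int], term: str) -> int:
    """Binary search over the pointer table; returns the term's index or -1."""
    lo, hi = 0, len(pointers) - 2
    while lo <= hi:
        mid = (lo + hi) // 2
        candidate = term_at(blob, pointers, mid)
        if candidate == term:
            return mid
        if candidate < term:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1
```

No term needs its own length field or padding, because the next term's pointer marks where the current term stops.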
To encode small numbers in less space than large numbers, we look at two types of methods: bytewise compression and bitwise compression.
Bytes offer a good compromise between compression ratio and speed of decompression.
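As an example of the bytewise approach, below is a sketch of variable byte encoding as I understand it from the chapter: each byte carries 7 bits of payload, and the high bit is set only on the last byte of a number. The function names are mine, and in practice the numbers being encoded would be the gaps between docIDs.

```python
from typing import List


def vb_encode_number(n: int) -> bytes:
    """Encode one non-negative integer; the high bit marks the final byte."""
    chunks = [n % 128]
    n //= 128
    while n > 0:
        chunks.insert(0, n % 128)
        n //= 128
    chunks[-1] += 128  # set the continuation bit on the last byte only
    return bytes(chunks)


def vb_encode(numbers: List[int]) -> bytes:
    """Encode a list of numbers back to back."""
    return b"".join(vb_encode_number(n) for n in numbers)


def vb_decode(stream: bytes) -> List[int]:
    """Decode a byte stream produced by vb_encode."""
    numbers, n = [], 0
    for byte in stream:
        if byte < 128:
            n = 128 * n + byte  # more bytes of this number follow
        else:
            numbers.append(128 * n + (byte - 128))
            n = 0
    return numbers
```

Small numbers such as 5 take one byte, while 824 takes two, so small gaps cost less space than large ones.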

Monday, September 7, 2015

Reading Notes For Week 2

IIR Section 1.2

In the resulting index, storage is needed for both the dictionary (memory) and the postings lists (disk), so the size of each is very important. An optimized data structure should be used for postings lists. We also need to optimize the efficiency of storage and access.
A variable-length array avoids the overhead of pointers, and its contiguous memory layout increases speed on modern processors with memory caches. All in all, it's a good solution for both space and time efficiency.
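A toy version of the index described in this section, with each postings list kept as a contiguous, variable-length array of docIDs. This is a sketch under my own assumptions: the function names are hypothetical, Python's array module stands in for the array layout, and tokenization is just lowercasing and splitting on whitespace.

```python
from array import array
from typing import Dict, List, Sequence


def build_inverted_index(docs: Dict[int, str]) -> Dict[str, array]:
    """Dictionary maps each term to a sorted postings array of doc IDs."""
    index: Dict[str, array] = {}
    for doc_id in sorted(docs):  # visiting docs in order keeps postings sorted
        for term in set(docs[doc_id].lower().split()):
            index.setdefault(term, array("I")).append(doc_id)
    return index


def intersect(p1: Sequence[int], p2: Sequence[int]) -> List[int]:
    """Merge two sorted postings lists in a single linear pass."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer
```

An AND query such as Brutus AND Caesar is then intersect(index["brutus"], index["caesar"]), which walks both arrays once.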

IIR Section 2

Steps of index processing: convert the byte sequence into a linear sequence of characters, then determine what the document unit for indexing is.
Parsing a document involves many problems, such as determining its format and what language it is in. All of these are classification problems.
We may need to normalize document words and query words into the same form.

IIR Section 3

There are two main data structures for dictionaries: hash tables and search trees. The best-known search tree is the binary tree. Efficient search hinges on the tree being balanced. When the vocabulary keeps growing, a hash table occasionally needs to rehash everything, which is expensive. The principal issue for trees is rebalancing.
Maintain a second inverted index from bigrams to the dictionary terms that contain each bigram.
Document correction is necessary for OCR'ed documents, but usually we don't change the documents and instead fix the query-document mapping.
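The bigram index noted in this section can be sketched as below. The names are my own, and padding each term with a "$" boundary marker is one common convention that I am assuming here. The example retrieves the terms that match a prefix by intersecting the postings of the prefix's bigrams.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def bigrams(term: str) -> Set[str]:
    """All 2-character substrings of the term padded with '$' boundary markers."""
    padded = "$" + term + "$"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def build_bigram_index(vocabulary: List[str]) -> Dict[str, Set[str]]:
    """Second inverted index: bigram -> dictionary terms containing it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for term in vocabulary:
        for gram in bigrams(term):
            index[gram].add(term)
    return index


def terms_with_all(grams: Iterable[str], index: Dict[str, Set[str]]) -> Set[str]:
    """Terms whose bigram sets include every requested bigram."""
    postings = [index.get(gram, set()) for gram in grams]
    return set.intersection(*postings) if postings else set()
```

For the prefix "mo" the bigrams to look up are "$m" and "mo", so a term like "among" is excluded even though it contains "mo".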

Sunday, September 6, 2015

According to the lecture, the basic approaches to IR mostly do not start from the user side; for example, they rely on statistics and automatic methods. What if we approached it from the user side and offered the user step-by-step instructions on how to query? Would that be more efficient?

Reading Notes For Week 1

FOA:

The FOA process of browsing readers involves three steps: asking a question (the user's information need), constructing an answer (done by the answering source, e.g. a search engine), and assessing the answer (how relevant the answer is to the question).
Relevance feedback gives the asker an opportunity to provide more information through their reaction to each retrieved document, so researchers should learn more about how to make use of relevance feedback judgments.

IES:

The main components of an IR system: the user's information need drives the search process. The user then constructs a query to the system. The user interface mediates between the user and the IR system. The user's query is processed by a search engine, which accepts queries from users, processes them, and returns ranked lists of results.
A major task of the search engine is to maintain and manipulate an inverted index for a document collection. How the index is updated affects search efficiency.
Building an IR system requires an understanding of electronic text formats and the characteristics of the text they encode. The content and structure of a document have an impact on indexing and retrieval.

MIR:

Information retrieval can be studied from two points of view: computer-centered (building efficient indexes, processing user queries, developing ranking algorithms) and human-centered (studying user behavior, understanding users' main needs, and determining how that understanding affects the operation of the system).
An index, a kind of specialized data structure, is necessary for fast searching. It provides fast access to the data and speeds up query processing. The most used index structure is the inverted index.
The reading also mentions information retrieval across three generations of library systems (card catalog searching, then added search functionality, then graphical interfaces). Not only has search functionality been extended with query operators, but the interface and system architecture have also improved. This means more attention is being paid to human-centered improvements that offer users a better experience.
The primary goal of an IR system is to retrieve information that is useful to the user, as opposed to retrieving data. IR system users are more concerned with retrieving information about a subject than with retrieving data that satisfies a query.