
Friday, November 27, 2015

Muddiest Point For Week 13

What would be a good example of an adaptive information retrieval system?

Reading Notes For Week 13

IIR

  • Most language-modeling work in IR has used unigram language models. IR is not the place where you most immediately need complex language models. Unigram models are often sufficient to judge the topic of a text.
  • Language modeling is a quite general formal approach to IR, with many variant realizations. The original and basic method for using language models in IR is the query likelihood model (see the sketch after this list).
  • Vector space systems have generally preferred more lenient matching, though recent web search developments have tended more toward doing searches with conjunctive (AND) semantics.
  • Group-average agglomerative clustering avoids the pitfalls of the single-link and complete-link criteria, which equate cluster similarity with the similarity of a single pair of documents.
  • Flat clustering creates a flat set of clusters without any explicit structure that would relate clusters to each other. Hierarchical clustering creates a hierarchy of clusters.
  •  The inverted index supports fast nearest-neighbor search for the standard IR setting. However, sometimes we may not be able to use an inverted index efficiently.
  • Feature selection makes training and applying a classifier more efficient by decreasing the size of the effective vocabulary.
  • Differential cluster labeling selects cluster labels by comparing the distribution of terms in one cluster with that of other clusters.
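
As a concrete illustration of the query likelihood model and unigram language models mentioned above, here is a minimal sketch in Python; the toy collection, the Jelinek-Mercer smoothing weight, and the function name are my own assumptions rather than anything prescribed in the reading.

```python
from collections import Counter

def query_likelihood_scores(query, documents, lam=0.5):
    """Rank documents by P(query | document) under a unigram language model.

    Smoothing mixes the document model with the collection model
    (Jelinek-Mercer style); lam=0.5 is an assumed value.
    """
    collection = Counter()
    for doc in documents:
        collection.update(doc.split())
    coll_len = sum(collection.values())

    scores = []
    for doc in documents:
        terms = doc.split()
        counts, doc_len = Counter(terms), len(terms)
        score = 1.0
        for q in query.split():
            p_doc = counts[q] / doc_len if doc_len else 0.0
            p_coll = collection[q] / coll_len if coll_len else 0.0
            score *= lam * p_doc + (1 - lam) * p_coll
        scores.append(score)
    return scores

docs = ["information retrieval with unigram language models",
        "hierarchical clustering of documents"]
print(query_likelihood_scores("language models", docs))
```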

Friday, November 20, 2015

Reading Notes For Week 12

User Profiles

  1. Classification of user profiles: the way the information is collected, the life period of the profile, and its structure.
  2. There are five basic approaches to user identification: software agents, logins, enhanced proxy servers, cookies, and session IDs.
  3. The search is not limited to the Web; it can also include databases to which the user has access and the user's personal documents. Such search systems are implemented in tools like Google Desktop Search.
  4. User identification can be obtained using mechanisms such as session IDs or cookies that provide anonymity. Even methods requiring a login process can be anonymous if users are allowed to use pseudonyms rather than their true identity.
  5. In user customization, a recommendation system provides an interface that allows users to construct a representation of their own interests. Often check boxes are used to allow a user to select from the known values of attributes.
  6. Content-based recommendation systems recommend an item to a user based upon a description of the item and a profile of the user's interests. While a user profile may be entered by the user, it is commonly learned from feedback the user provides on items (see the sketch after this list).
  7. Personalized Web search has emerged as one of the hottest topics for both the Web industry and academic researchers. 
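
To make point 6 above concrete, here is a minimal sketch of a content-based recommender in Python; the item descriptions, the liked items, and the bag-of-words profile are my own illustrative assumptions, not something taken from the reading.

```python
from collections import Counter
import math

def bag_of_words(text):
    return Counter(text.lower().split())

def cosine(a, b):
    common = set(a) & set(b)
    dot = sum(a[t] * b[t] for t in common)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical item descriptions and positive feedback from the user.
items = {
    "doc1": "deep learning for image recognition",
    "doc2": "classical music concert schedule",
    "doc3": "neural networks and machine learning tutorial",
}
liked = ["doc1"]

# The user profile is learned from feedback: sum of liked item vectors.
profile = Counter()
for item_id in liked:
    profile.update(bag_of_words(items[item_id]))

# Recommend unseen items most similar to the profile.
candidates = [(cosine(profile, bag_of_words(text)), item_id)
              for item_id, text in items.items() if item_id not in liked]
print(sorted(candidates, reverse=True))
```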

Muddiest Point For Week 12

How to evaluate the quality of a web search?

Friday, November 13, 2015

Reading Notes For Week 11

IIR Chapters 19 & 21
  1. Although such words are invisible to the human user, a search engine indexer would parse the invisible words out of the HTML representation of the web page and index these words as being present in the page.
  2. Web search engines frown on this business of attempting to decipher and adapt to their proprietary ranking techniques and indeed announce policies on forms of SEO behavior they do not tolerate.
  3. Current search engines follow precisely this model: they provide pure search results (generally known as algorithmic search results) as the primary response to a user’s search, together with sponsored search results displayed separately and distinctively to the right of the algorithmic results.
Authoritative Sources in a Hyperlinked Environment
  1. The central issue the authors address within their framework is the distillation of broad search topics through the discovery of "authoritative" information sources on such topics.
  2. The motivation of the algorithm is highly intuitive and is, in itself, an interesting and insightful contribution.
  3. The formulation in this paper has connections to the eigenvectors of certain matrices associated with the link graph; these connections in turn motivate additional heuristics for link-based analysis.
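
A minimal sketch of the hub/authority style of link analysis behind this paper; the toy link graph and the number of iterations are my own assumptions, and this only illustrates the core iteration rather than reproducing the paper's full query-focused procedure.

```python
import math

def hits(graph, iterations=20):
    """Compute hub and authority scores on a directed link graph.

    `graph` maps each page to the list of pages it links to.
    """
    pages = set(graph) | {q for targets in graph.values() for q in targets}
    hubs = {p: 1.0 for p in pages}
    auths = {p: 1.0 for p in pages}

    for _ in range(iterations):
        # Authority score: sum of hub scores of pages pointing to the page.
        auths = {p: sum(hubs[q] for q in pages if p in graph.get(q, []))
                 for p in pages}
        # Hub score: sum of authority scores of the pages it points to.
        hubs = {p: sum(auths[q] for q in graph.get(p, [])) for p in pages}
        # Normalize so the scores stay bounded.
        for d in (auths, hubs):
            norm = math.sqrt(sum(v * v for v in d.values())) or 1.0
            for p in d:
                d[p] /= norm
    return hubs, auths

# Hypothetical toy link graph.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
print(hits(graph))
```
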
The Anatomy of a Large-Scale Hypertextual Web Search Engine
  1. The final design goal was to build an architecture that can support novel research activities on large-scale web data.
  2. PageRank can be thought of as a model of user behavior. The probability that the random surfer visits a page is its PageRank, and the damping factor d is the probability that at each page the "random surfer" will get bored and request another random page (see the sketch after this list).
  3. While a complete user evaluation is beyond the scope of this paper, our own experience with Google has shown it to produce better results than the major commercial search engines for most searches.
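
A minimal sketch of the PageRank iteration described in point 2 of this list; the toy graph, the damping factor value d=0.85, the iteration count, and the handling of dangling pages are my own assumptions for illustration.

```python
def pagerank(graph, d=0.85, iterations=50):
    """Power iteration for PageRank.

    `graph` maps each page to the list of pages it links to. With
    probability d the random surfer follows a link; with probability
    1 - d it gets bored and jumps to a random page.
    """
    pages = set(graph) | {q for targets in graph.values() for q in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}

    for _ in range(iterations):
        new_rank = {p: (1 - d) / n for p in pages}
        for p, targets in graph.items():
            if targets:
                share = rank[p] / len(targets)
                for q in targets:
                    new_rank[q] += d * share
            else:
                # Dangling page: spread its rank over all pages.
                for q in pages:
                    new_rank[q] += d * rank[p] / n
        rank = new_rank
    return rank

# Hypothetical toy link graph.
print(pagerank({"a": ["b"], "b": ["a", "c"], "c": ["a"]}))
```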

Saturday, October 31, 2015

Reading Notes For Week 9

MIR Chapter 10

  1. The design principles for information access interfaces are: offer informative feedback, reduce working memory load, and provide alternative interfaces for novice and expert users.
  2. Precision and recall measures have been widely used for comparing the ranking results of non-interactive systems, but are less appropriate for assessing interactive systems. The standard evaluations emphasize high recall levels.
  3. The user interface should also support methods for monitoring the status of the current strategy in relation to the user's current task and high-level goals.
  4. Studies show that users tend to start out with very short queries, inspect the results, and then modify those queries in an incremental feedback cycle.
The Design of Search User Interfaces
  1. The most understandable and transparent way to order search results is according to how recently they appeared.
  2. Another important issue in the tradeoff between system cleverness and user control lies with query transformations.
  3. Keyboard shortcuts can save time and effort when the user is typing, as the shortcuts remove the need to move hands away from the keyboard to the mouse. But there is a barrier to using shortcuts, as they require memorization.
Information Visualization For Text Analysis
  1. One of the most common strategies used in text mining is to identify important entities within the text and attempt to show connections among those entities.
  2. Standard data graphics can be an effective tool for understanding frequencies of usage of terms within documents.
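
As a small illustration of point 2 above, here is a sketch that computes the term frequencies one might then plot with standard data graphics; the sample documents and the stop word list are made-up assumptions.

```python
from collections import Counter

# Hypothetical documents and a tiny stop word list.
documents = [
    "text mining finds entities and connections in text",
    "standard data graphics show term frequencies in documents",
]
stop_words = {"and", "in", "the", "a"}

# Count how often each term is used across the documents.
frequencies = Counter(
    term
    for doc in documents
    for term in doc.lower().split()
    if term not in stop_words
)

# These counts could then be drawn as a bar chart or term cloud.
for term, count in frequencies.most_common(5):
    print(f"{term}: {count}")
```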

Friday, October 16, 2015

Reading Notes For Week 7

IIR Chapter 8

  1. The chapter begins with a discussion of measuring the effectiveness of IR systems and the test collections that are most often used for this purpose. It then presents the straightforward notion of relevant and nonrelevant documents and the formal evaluation methodology that has been developed for evaluating unranked retrieval results.
  2. The relevance of retrieval results is the most important factor: blindingly fast but useless answers do not make a user happy. However, user perceptions do not always coincide with system designers' notions of quality.
  3. The standard way to measure human satisfaction is by various kinds of user studies. These might include quantitative measures, both objective, such as time to complete a task, as well as subjective, such as a score for satisfaction with the search engine, and qualitative measures, such as user comments on the search interface.
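
To make the evaluation methodology for unranked retrieval results concrete, here is a minimal sketch of precision, recall, and F1; the retrieved and relevant sets are made-up examples.

```python
def precision_recall_f1(retrieved, relevant):
    """Standard set-based effectiveness measures for unranked retrieval."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical judgments: the system retrieved d1-d4; d1, d3, d5 are relevant.
print(precision_recall_f1({"d1", "d2", "d3", "d4"}, {"d1", "d3", "d5"}))
```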

Improving the Effectiveness of Information Retrieval with Local Context Analysis

  1. The paper describes experiments with a range of collections of different sizes and languages, comparing a no-expansion baseline with conventional local feedback expansion. The experimental results are helpful for understanding query expansion in its various forms.
  2. The comparison between pseudo-relevance feedback based on local context analysis and real relevance feedback is also very interesting; the two strategies suit different retrieval situations.

A Study of Methods for Negative Relevance Feedback & Relevance Feedback Revisited

  1. The paper conducts a study of methods for negative relevance feedback, comparing representative negative feedback methods that cover both vector space models and language models.
  2. The authors also discuss how to evaluate negative feedback, which requires a test set with a sufficient number of difficult topics. From the results reported in the paper, I think language-model-based negative feedback methods are generally more effective than those based on vector space models.
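
Since the paper covers vector space approaches, here is a minimal sketch of Rocchio-style feedback with a negative component; the alpha/beta/gamma weights and the toy term-weight vectors are my own assumptions and are not the specific methods evaluated in the paper.

```python
from collections import Counter

def rocchio(query_vec, relevant_docs, nonrelevant_docs,
            alpha=1.0, beta=0.75, gamma=0.25):
    """Move the query toward relevant documents and away from non-relevant ones."""
    new_query = Counter()
    for term, weight in query_vec.items():
        new_query[term] += alpha * weight
    for doc in relevant_docs:
        for term, weight in doc.items():
            new_query[term] += beta * weight / len(relevant_docs)
    for doc in nonrelevant_docs:
        for term, weight in doc.items():
            new_query[term] -= gamma * weight / len(nonrelevant_docs)
    # Negative weights are usually clipped to zero.
    return {t: w for t, w in new_query.items() if w > 0}

# Hypothetical term-weight vectors.
query = {"jaguar": 1.0}
relevant = [{"jaguar": 0.8, "cat": 0.6}]
nonrelevant = [{"jaguar": 0.7, "car": 0.9}]
print(rocchio(query, relevant, nonrelevant))
```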

Muddiest Point For Lecture 6

How to balance the efficiency and the quality of ranking in real world?

Friday, October 2, 2015

Reading Notes For Week 5

 Djoerd Hiemstra and Arjen de Vries
  1. The paper shows the existence of efficient retrieval algorithms that use only the matching terms in their computation, and that language models can be interpreted as belonging to the family of tf.idf term weighting algorithms (see the sketch after this list).
  2. It introduces three traditional retrieval models: the vector space model (rank documents by the similarity between the query and each document), the probabilistic model (rank documents by the probability of relevance given a query), and the Boolean model (use the operations of Boolean algebra for query formulation).
  3. The vector space model and the probabilistic model stand for different approaches to information retrieval. The former is based on the similarity between query and document, the latter is based on the probability of relevance, using the distribution of terms over relevant and non-relevant documents.
  4. The paper differs considerably from other publications that also compare retrieval models within one framework, because its goal is not to show that the language modelling approach to information retrieval is so flexible that it can be used to model or implement many other approaches to information retrieval.
  5. As a side effect of the introduction of language models for retrieval, this paper introduced new ways of thinking about two popular information retrieval tools: the use of stop words and the use of a stemmer.
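
As a reference point for the tf.idf interpretation mentioned in point 1, here is a minimal sketch of simple tf.idf scoring; the toy collection and the exact weighting variant are my own assumptions, and the paper relates language models to this family of weightings rather than prescribing this particular code.

```python
import math
from collections import Counter

def tfidf_scores(query, documents):
    """Score documents against a query with a simple tf.idf dot product."""
    n = len(documents)
    doc_terms = [Counter(doc.split()) for doc in documents]
    # idf: rarer terms get a higher weight.
    df = Counter()
    for counts in doc_terms:
        df.update(set(counts))
    idf = {t: math.log(n / df[t]) for t in df}

    scores = []
    for counts in doc_terms:
        score = sum(counts[t] * idf.get(t, 0.0) for t in query.split())
        scores.append(score)
    return scores

docs = ["the vector space model ranks documents by similarity",
        "the probabilistic model ranks documents by probability of relevance"]
print(tfidf_scores("vector space similarity", docs))
```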

Muddiest Point For Lecture 4

What's the effect of idf on ranking for one-term queries?

Wednesday, September 23, 2015

Reading Notes For Week 4

MIR Chapter 2.1-2.5.3


  1. How do users go about doing search tasks? Usually through four main cyclical activities: problem identification, articulation of the information need, query formulation, and results evaluation.
  2. The standard model of the information seeking process is out of date. Because users learn as they search, the models should emphasize the dynamic nature of the search process; the information need is adjusted as retrieval results come up.
  3. Many search engines allow users to peruse an information structure of some kind to select a starting point for search; however, I think users prefer browsing with less recall, just to see what the search engine returns.
  4. According to the studies cited, the less expert the users are about a topic, the more likely they are to feel confident that all of the relevant information has been accessed.
  5. Web search engines have become more sophisticated about dropping terms that would result in empty results.
  6. An increasingly common strategy within the search form is to show hints about what kind of information should be entered into each form via greyed-out text. For instance, on the search box, the first box is labeled “what are you looking for?” while the second box is labeled “when (tonight, this weekend, ...)”. When the user places the cursor into the entry form, the grey text disappears, and the user can type in their query terms.
  7. Since relevance feedback indicates which documents are relevant to the query, this method is able to help greatly improve rank ordering.

Muddiest Point For Lecture 3

When we do spelling correction, do we need to compute the edit distance to every term in the dictionary?

Monday, September 14, 2015

Muddiest Point For Lecture 2

If the collection or the query includes multiple languages or formats, what algorithm should be used for stemming?

Sunday, September 13, 2015

Reading Notes For Week 3

IIR Chapter 4

  1. When building an information retrieval system, many decisions are based on the characteristics of the computer hardware on which the system runs.
  2. When main memory is insufficient, we need to use an external sorting algorithm that minimizes the number of random disk seeks during sorting by relying on sequential disk reads.
  3. Blocked sort-based indexing has excellent scaling properties, but it needs a data structure for mapping terms to termIDs. For very large collections, this data structure does not fit into memory. A more scalable alternative is single-pass in-memory indexing, or SPIMI (see the sketch after this list).
  4. Difference between BSBI and SPIMI: SPIMI adds a posting directly to its postings list. Each postings list is dynamic and is immediately available to collect postings.
  5. Advantages of SPIMI: it is faster because no sorting is required, and it saves memory because we keep track of the term a postings list belongs to, so termIDs of postings need not be stored.
  6. The map phase of MapReduce consists of mapping splits of the input data to key-value pairs.
    For the reduce phase, we want all values for a given key to be stored close together, so that they can be read and processed quickly.
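
A minimal sketch of the SPIMI idea from point 3: postings are appended directly to per-term lists in memory, and the block is sorted only when it is written out. The (term, docID) token stream format and the single-block scope are assumptions for illustration.

```python
def spimi_invert(token_stream):
    """Build one in-memory block of an inverted index, SPIMI-style.

    `token_stream` yields (term, doc_id) pairs. Postings are appended
    directly to each term's list, with no global term-to-termID mapping
    and no sorting of postings while the block is being built.
    """
    dictionary = {}
    for term, doc_id in token_stream:
        postings = dictionary.setdefault(term, [])
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)
    # Terms are sorted only once, when the block is written to disk.
    return sorted(dictionary.items())

tokens = [("caesar", 1), ("brutus", 1), ("caesar", 2), ("calpurnia", 2)]
print(spimi_invert(tokens))
```
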
IIR Chapter 5

  1. The main benefit of compression is that we need less disk space. The more subtle benefits are increased use of caching and faster transfer of data from disk to memory.
  2. The primary factor in determining the response time of an IR system is the number of disk seeks necessary to process a query.
  3. Using fixed-width entries for terms is clearly wasteful. We can overcome this shortcoming by storing the dictionary terms as one long string of characters; the pointer to the next term is also used to demarcate the end of the current term. We can then locate terms in the data structure by way of binary search in the table. This scheme saves us about 60% compared to fixed-width storage.
  4. To encode small numbers in less space than large numbers, we look at two types of methods: bytewise compression and bitwise compression.
  5. Bytes offer a good compromise between compression ratio and speed of decompression.
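
A minimal sketch of bytewise (variable byte) compression, as a concrete example of points 4 and 5; the sample gap values are illustrative.

```python
def vb_encode(number):
    """Variable byte encoding: 7 payload bits per byte, high bit marks the last byte."""
    bytes_out = []
    while True:
        bytes_out.insert(0, number % 128)
        if number < 128:
            break
        number //= 128
    bytes_out[-1] += 128  # set the terminator bit on the last byte
    return bytes(bytes_out)

def vb_decode(data):
    """Decode a stream of variable-byte-encoded numbers."""
    numbers, n = [], 0
    for byte in data:
        if byte < 128:
            n = 128 * n + byte
        else:
            numbers.append(128 * n + (byte - 128))
            n = 0
    return numbers

# Hypothetical postings gaps (differences between successive docIDs).
gaps = [824, 5, 214577]
encoded = b"".join(vb_encode(g) for g in gaps)
print(len(encoded), vb_decode(encoded))  # 6 bytes instead of 12 with 4-byte ints
```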

Monday, September 7, 2015

Reading Notes For Week 2

IIR Section 1.2

  1. In the resulting index, storage is needed for both the dictionary (in memory) and the postings lists (on disk), so the size of each is very important. An optimized data structure should be used for the postings lists, and we also need to optimize the efficiency of storage and access.
  2. A variable-length array avoids the overhead of pointers, and its contiguous memory increases speed on modern processors with memory caches. All in all, it is a good solution for space and time efficiency.
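
A minimal sketch of the dictionary-plus-postings structure described above, with each postings list kept as a plain variable-length list; the tiny collection and the example AND query are my own illustration.

```python
def build_inverted_index(documents):
    """Map each term to a sorted list of docIDs (the postings list).

    The dictionary lives in memory; in a real system the postings lists
    would be stored on disk.
    """
    index = {}
    for doc_id, text in enumerate(documents):
        for term in set(text.lower().split()):
            index.setdefault(term, []).append(doc_id)
    for postings in index.values():
        postings.sort()
    return index

docs = ["new home sales top forecasts",
        "home sales rise in july",
        "increase in home sales in july"]
index = build_inverted_index(docs)
print(index["home"])                                       # [0, 1, 2]
print(sorted(set(index["sales"]) & set(index["july"])))    # AND query: [1, 2]
```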

IIR Section 2

  1. Steps of index preprocessing: convert the byte sequence into a linear sequence of characters, then determine what the document unit for indexing is.
  2. Parsing a document raises a lot of problems, such as determining its format and what language it is written in. All of these are classification problems.
  3. We may need to normalize words and query words into the same form.

IIR Section 3

  1. There are two main data structures for dictionaries: hash tables and search trees. The best-known search tree is the binary tree, and efficient search hinges on the tree being balanced. When the vocabulary keeps growing, a hash table eventually needs to rehash everything, which is expensive; the principal issue for trees is rebalancing.
  2. Maintain a second inverted index from bigrams to the dictionary terms that contain each bigram; this supports wildcard queries (see the sketch after this list).
  3. Document correction is necessary for OCR'ed documents, but usually we don't change the documents; instead we fix the query-document mapping.
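
A minimal sketch of the bigram index from point 2, mapping character bigrams (with $ as a boundary marker) to the dictionary terms that contain them, and using it to find candidate terms for a wildcard query; the vocabulary and the pattern are made-up examples, and candidates would still need a final check against the pattern.

```python
def build_bigram_index(vocabulary):
    """Map each character bigram (with $ as a boundary marker) to matching terms."""
    index = {}
    for term in vocabulary:
        padded = f"${term}$"
        for i in range(len(padded) - 1):
            index.setdefault(padded[i:i + 2], set()).add(term)
    return index

def wildcard_candidates(pattern, bigram_index):
    """Candidate terms for a pattern like 'mo*er': intersect the bigram postings."""
    pieces = pattern.split("*")
    bigrams = []
    for i, piece in enumerate(pieces):
        padded = ("$" if i == 0 else "") + piece + ("$" if i == len(pieces) - 1 else "")
        bigrams += [padded[j:j + 2] for j in range(len(padded) - 1)]
    sets = [bigram_index.get(bg, set()) for bg in bigrams]
    return set.intersection(*sets) if sets else set()

vocab = ["moon", "mother", "molder", "other", "money"]
index = build_bigram_index(vocab)
print(wildcard_candidates("mo*er", index))  # {'mother', 'molder'}
```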

Sunday, September 6, 2015

Muddiest Point For Lecture 1

According to the lecture, the basic approaches to IR are mainly not from the user side; for example, they rely on statistics and automatic methods. What if we try to approach it from the user side and offer the user step-by-step instructions on how to query; would that be more efficient?

Reading Notes For Week 1

FOA:

  1. The FOA process for browsing readers involves three steps: asking a question (the user's information need), constructing an answer (done by the search engine), and assessing the answer (how relevant the answer is to the question).
  2. Relevance feedback gives the asker an opportunity to provide more information through their reaction to each retrieved document, so researchers should learn more about how to make use of relevance feedback judgments.

IES:

  1. The main components of an IR system: the user's information need drives the search process, and the user constructs a query to the system. The user interface mediates between the user and the IR system. The query is processed by a search engine, which accepts queries from the user, processes them, and returns ranked lists of results.
  2. The core task of a search engine is to maintain and manipulate an inverted index for a document collection. How the index is updated affects searching efficiency.
  3. Building an IR system requires understanding electronic text formats and the characteristics of text encoding. The content and structure of a document have an impact on indexing and retrieval.

MIR: 

  1. Information Retrieval can be studied from two points of view: computer-centered (building efficient indexes, processing user queries, developing ranking algorithms) and human-centered (user behavior, the user's main needs, and how the user's understanding affects the operation of the system).
  2. The index, a kind of specialized data structure, is necessary for fast searching: it provides fast access to the data and speeds up query processing. The most used index structure is the inverted index.
  3. The book also describes information retrieval across three generations of library systems (card catalog searching, search functionality, and graphical interfaces). Not only has search functionality been added through query operators, but the interface and system architecture have been improved as well. This means more attention is being paid to human-centered improvements that offer users a better experience.
  4. The primary goal of an IR system is to retrieve information that is useful to the user, as opposed to the retrieval of data: IR system users are concerned more with retrieving information about a subject than with retrieving data that satisfies a query.