2140 - Information Storage and Retrieval : November 2015

Friday, November 27, 2015

What would be a good example of an adaptive information retrieval system?

IIR

Most language-modeling work in IR has used unigram language models. IR is not the place where you most immediately need complex language models. Unigram models are often sufficient to judge the topic of a text.
Language modeling is a quite general formal approach to IR, with many variant realizations. The original and basic method for using language models in IR is the query likelihood model.
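As a rough illustration of the query likelihood model with unigram language models, the sketch below scores each document by the log probability of generating the query. The function name `query_likelihood`, the toy documents, and the smoothing choice (linear interpolation with the collection model, weight `lam`) are assumptions made for this example, not something the notes above prescribe.

```python
import math
from collections import Counter

def query_likelihood(query, doc, collection, lam=0.5):
    """Return log P(query | doc) under a smoothed unigram model.

    Smoothing is a linear interpolation between the document model and
    the collection model (an assumption of this sketch).
    """
    doc_counts = Counter(doc)
    coll_counts = Counter(collection)
    doc_len = len(doc)
    coll_len = len(collection)
    score = 0.0
    for term in query:
        p_doc = doc_counts[term] / doc_len if doc_len else 0.0
        p_coll = coll_counts[term] / coll_len if coll_len else 0.0
        p = lam * p_doc + (1 - lam) * p_coll
        if p == 0.0:
            # Term unseen in the whole collection: the query cannot be generated.
            return float("-inf")
        score += math.log(p)
    return score

# Hypothetical toy collection of two tokenized documents.
docs = {
    "d1": "click go the shears boys click click click".split(),
    "d2": "metal shears click here".split(),
}
collection = [term for doc in docs.values() for term in doc]
query = "shears click".split()

# Rank documents by descending query likelihood.
ranking = sorted(
    docs,
    key=lambda name: query_likelihood(query, docs[name], collection),
    reverse=True,
)
print(ranking)
```

Each query term is treated independently, which is what makes the model unigram: word order in the query and the document plays no role in the score.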
Vector space systems have generally preferred more lenient matching, though recent web search developments have tended more in the direction of searches with conjunctive semantics, where every query term must occur in a matching document.
Group-average agglomerative clustering avoids the pitfalls of the single-link and complete-link criteria, which equate cluster similarity with the similarity of a single pair of documents.
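The difference between the three criteria can be sketched in a few lines. The helper names (`cosine`, `single_link`, `complete_link`, `group_average`) and the toy vectors are assumptions for illustration. The group-average function here averages over all pairs of distinct documents in the merged cluster, which is one common formulation.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def single_link(a, b):
    """Similarity of the most similar cross-cluster pair."""
    return max(cosine(x, y) for x in a for y in b)

def complete_link(a, b):
    """Similarity of the least similar cross-cluster pair."""
    return min(cosine(x, y) for x in a for y in b)

def group_average(a, b):
    """Average similarity over all pairs of distinct documents in the merged cluster."""
    merged = list(a) + list(b)
    pairs = list(combinations(merged, 2))
    return sum(cosine(x, y) for x, y in pairs) / len(pairs)

# Hypothetical clusters of 2-dimensional document vectors.
cluster_a = [(1.0, 0.0)]
cluster_b = [(1.0, 0.0), (0.0, 1.0)]

sl = single_link(cluster_a, cluster_b)
cl = complete_link(cluster_a, cluster_b)
ga = group_average(cluster_a, cluster_b)
print(sl, cl, ga)
```

Single-link and complete-link each depend on exactly one pair of documents, while the group-average value reflects every pair in the merged cluster.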
Flat clustering creates a flat set of clusters without any explicit structure that would relate clusters to each other. Hierarchical clustering creates a hierarchy of clusters.
The inverted index supports fast nearest-neighbor search for the standard IR setting. However, sometimes we may not be able to use an inverted index efficiently.
Feature selection makes training and applying a classifier more efficient by decreasing the size of the effective vocabulary.
Differential cluster labeling selects cluster labels by comparing the distribution of terms in one cluster with that of other clusters.
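A minimal sketch of differential cluster labeling follows. The scoring rule, the difference between a term's relative frequency inside the target cluster and in all other clusters, is a simplifying assumption chosen for readability. Feature selection measures such as mutual information are a more common choice. The function name `differential_labels` and the toy clusters are hypothetical.

```python
from collections import Counter

def differential_labels(clusters, target, k=1):
    """Return the k terms most over-represented in `target` relative to other clusters.

    `clusters` maps a cluster name to a list of tokenized documents.
    """
    inside = Counter(t for doc in clusters[target] for t in doc)
    outside = Counter(
        t
        for name, docs in clusters.items()
        if name != target
        for doc in docs
        for t in doc
    )
    n_in = sum(inside.values())
    n_out = sum(outside.values())

    def score(term):
        p_in = inside[term] / n_in
        p_out = outside[term] / n_out if n_out else 0.0
        return p_in - p_out

    # Sort by descending score, breaking ties alphabetically for determinism.
    return sorted(inside, key=lambda t: (-score(t), t))[:k]

# Hypothetical clusters; "news" is frequent everywhere and so makes a poor label.
clusters = {
    "sports": [["match", "team", "goal"], ["team", "coach", "news"]],
    "finance": [["market", "stock", "news"], ["stock", "bank", "news"]],
}
labels = differential_labels(clusters, "sports", k=1)
print(labels)
```

A term that is frequent in every cluster receives a low or negative score, so it is pushed down the ranking even if it is common in the target cluster.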

User Profiles

Classification criteria: the way information is collected, the lifetime of the profile, and its structure.
Five basic approaches to user identification: software agents, logins, enhanced proxy servers, cookies, and session ids.
The searches are not limited to the Web; they also include databases to which the user has access and the user's personal documents. Such search systems are implemented in tools like Google Desktop Search.
User identification can be obtained using mechanisms such as session ids or cookies that provide anonymity. Even methods requiring a login process can be anonymous if users are allowed to use pseudonyms rather than their true identity.
In user customization, a recommendation system provides an interface that allows users to construct a representation of their own interests. Often check boxes are used to allow a user to select from the known values of attributes.
Content-based recommendation systems recommend an item to a user based upon a description of the item and a profile of the user’s interests. While a user profile may be entered by the user, it is commonly learned from feedback the user provides on items.
Personalized Web search has emerged as one of the hottest topics for both the Web industry and academic researchers.

How to evaluate the quality of a web search?

IIR Chapter 19 & 21

Despite these words being consequently invisible to the human user, a search engine indexer would parse the invisible words out of the HTML representation of the web page and index these words as being present in the page.
Web search engines frown on this business of attempting to decipher and adapt to their proprietary ranking techniques and indeed announce policies on forms of SEO behavior they do not tolerate.
Current search engines follow precisely this model: they provide pure search results (generally known as algorithmic search results) as the primary response to a user’s search, together with sponsored search results displayed separately and distinctively to the right of the algorithmic results.

Authoritative Sources in a Hyperlinked Environment

The central issue we address within our framework is the distillation of broad search topics, through the discovery of “authoritative” information sources on such topics.
The motivation of the algorithm is highly intuitive and is, in itself, an interesting and insightful contribution.
The formulation of this paper has connections to the eigenvectors of certain matrices associated with the link graph; these connections in turn motivate additional heuristics for link-based analysis.
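The link-based computation in this paper is usually presented as an iterative update of hub and authority scores, and the sketch below follows that presentation. A page's authority score sums the hub scores of the pages linking to it, and its hub score sums the authority scores of the pages it links to. The function name `hits`, the fixed iteration count, and the toy graph are assumptions for illustration. With the link graph written as an adjacency matrix A, the authority scores approach the principal eigenvector of AᵀA.

```python
import math

def hits(links, iterations=50):
    """Iteratively compute hub and authority scores.

    `links` maps a page to the list of pages it links to.
    Returns (hub, authority) dictionaries with L2-normalized scores.
    """
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority update: sum hub scores of pages pointing at each page.
        new_auth = {p: 0.0 for p in pages}
        for p, targets in links.items():
            for q in targets:
                new_auth[q] += hub[p]
        # Hub update: sum the fresh authority scores of pages each page points to.
        new_hub = {p: sum(new_auth[q] for q in links.get(p, [])) for p in pages}
        # Normalize so the scores stay bounded across iterations.
        a_norm = math.sqrt(sum(v * v for v in new_auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in new_hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in new_auth.items()}
        hub = {p: v / h_norm for p, v in new_hub.items()}
    return hub, auth

# Hypothetical link graph: three pages link to "x", one of them also links to "y".
links = {"h1": ["x", "y"], "h2": ["x"], "h3": ["x"]}
hub, auth = hits(links)
best_authority = max(auth, key=auth.get)
best_hub = max(hub, key=hub.get)
print(best_authority, best_hub)
```

In this toy graph, "x" ends up as the strongest authority because all three hubs point to it, and "h1" is the strongest hub because it points to both authorities.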

The Anatomy of a Large-Scale Hypertextual Web Search Engine

The final design goal was to build an architecture that can support novel research activities on large-scale web data.
PageRank can be thought of as a model of user behavior. The probability that the random surfer visits a page is its PageRank. And, the d damping factor is the probability at each page the "random surfer" will get bored and request another random page.
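The random surfer model can be sketched as a simple power iteration. In this sketch the parameter `jump_prob` is the probability that the surfer gets bored and requests a random page. In the paper's PageRank formula, d multiplies the link-following term and is usually set to 0.85, so `jump_prob` corresponds to 1 - d there. The function name `pagerank`, the normalization of scores to sum to 1, and the treatment of pages without out-links are assumptions of this example.

```python
def pagerank(links, jump_prob=0.15, iterations=100):
    """Approximate PageRank by power iteration.

    `links` maps a page to the list of pages it links to.
    Scores are normalized so that they sum to 1.
    """
    pages = sorted(set(links) | {q for targets in links.values() for q in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives the mass of surfers who got bored and jumped.
        new_rank = {p: jump_prob / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                # Surfers who keep clicking split evenly over the out-links.
                share = (1 - jump_prob) * rank[p] / len(targets)
                for q in targets:
                    new_rank[q] += share
            else:
                # Assumption: from a page with no out-links the surfer jumps anywhere.
                share = (1 - jump_prob) * rank[p] / n
                for q in pages:
                    new_rank[q] += share
        rank = new_rank
    return rank

# Hypothetical three-page web.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
rank = pagerank(links)
top_page = max(rank, key=rank.get)
print(top_page, rank)
```

Page "c" receives links from both other pages and ends up with the highest score, while "b" is linked only from "a" and ranks lowest.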
While a complete user evaluation is beyond the scope of this paper, our own experience with Google has shown it to produce better results than the major commercial search engines for most searches.