Information Retrieval und Text Mining

Wintersemester 2012/13
Wiltrud Kessler, Thomas Müller, Max Kisselew, Hinrich Schütze
Tu 14:00-15:30, Pfaffenwaldring 5b, V 5.01 & V 5.02
Fr 14:00-15:30, Pfaffenwaldring 5b, V 5.01 & V 5.02

Schedule and Resources

Please sign up for this course in ILIAS for the assignments. You will need these style files to compile the LaTeX sources. Note that some of the slide links still point to last year's slides; they will be updated over time.

Day Topic * Chapter Slides Resources
IIR01 Tu 10/16 Boolean retrieval (WK) V pdf html students
instructors
source
information retrieval links
search Shakespeare
IIR02 Fr 10/19 Term vocabulary & postings lists (WK) V pdf html students
instructors
source
Porter stemmer
IIR13 Tu 10/23 Text classification, Naive Bayes (HS) V pdf html students
instructors
source
Weka (includes Naive Bayes)
Reuters-21578
vulgarity text classifier fail
Fr 10/26 Practical exercise U Assignment 1
Supporting files for assignment 1
Solutions for assignment 1 (Exercises 1, 3, 4)
Solution for assignment 1 - exercise 2
Slides from the practical exercise 1
IIR03 Tu 10/30 Dictionaries & tolerant retrieval (WK) V pdf html students
instructors
source
trie vs hash vs ternary tree
wild card search on Google
edit distance demo
P. Norvig's spell corrector
spelling correction gone wrong
freq(misspelling)>freq(correct)
soundex demo
IIR12 Fr 11/2 Language models for IR (HS) V pdf html students
instructors
source
Ponte & Croft paper on LMs in IR
Zhai & Lafferty
Lemur Toolkit
IIR05 Tu 11/6 Index compression (WK) V pdf html students
instructors
source
variable byte codes
word-aligned binary codes
pos/freq compression
Fr 11/9 Practical exercise U Assignment 2
Solutions for assignment 2
IIR06 Tu 11/13 Scores, weights, vector spaces (WK) V pdf html students
instructors
source
vector space for dummies
exploring the similarity space
Okapi BM25
Lillian Lee on pivoted document length normalization
IIR09 Fr 11/16 Relevance feedback, query expansion (HS) V pdf html students
instructors
source
original relevance feedback paper
relevance feedback at Excite
Justin Bieber: related searches fail
WordSpace
automatic word sense discrimination
Tu 11/20 Practical exercise U Assignment 3
Solutions for assignment 3
Slides from the practical exercise 3
IIR07 Fr 11/23 Computing scores (WK) V pdf html students
instructors
source
how Google tweaks ranking
interview with Google's Udi Manber
Amit Singhal on Google ranking
SEO perspective: ranking factors
Yahoo BOSS: opening up search
compare Google/Yahoo rankings
eye tracking at Google
IIR14 Tu 11/27 Vector space classification (HS) V pdf html students
instructors
source
perceptron example
TC overview by Sebastiani
FSNLP (decision trees, perceptrons)
The elements of statistical learning
IIR08 Fr 11/30 Evaluation & result summaries (WK) V pdf html students
instructors
source
TREC at NIST
van Rijsbergen's definition of F
A/B testing
too much A/B testing?
early paper on dynamic summaries
search quality evaluation at Google
IIR15-1 Tu 12/4 Support vector machines (HS) V pdf html students
instructors
source
Explanation for distance
Fr 12/7 Practical exercise U Assignment 4
Corpus for exercise 2
Solution for assignment 4 (exercise 1)
Solution for assignment 4 (exercise 2)
Slides from the practical exercise 4
IIR16 Tu 12/11 Flat clustering (HS) V pdf html students
instructors
source
van Rijsbergen: Cluster Hypothesis
search result clustering: Yippy
search result clustering: Carrot2
search result clustering: Bing
number of clusterings: Stirling number
IIR18 Fr 12/14 Latent semantic indexing (HS) V pdf html students
instructors
source
Original LSI paper
Probabilistic LSI
Dimensions of meaning: LSI for words
Tu 12/18 Practical exercise U Assignment 5
Solutions for assignment 5
IIR19 Tu 1/8 Web information retrieval (CS) V pdf html students
instructors
source
how ads are priced
most expensive keywords
Geico search ca. 2004
geo-targeted ad
size of the web in 2007
size of the web in 2008
ad monitoring at Google
fighting webspam
IIR20 Fr 1/11 Crawling (FL) V pdf html students
instructors
source
Mercator web crawler
robots.txt standard
Google data centers
Tu 1/15 Practical exercise U Assignment 6
Solutions for assignment 6
IIR21 Fr 1/18 Link analysis (CS) V pdf html students
instructors
source
more on PageRank math
Jon Kleinberg (inventor of HITS)
Google bomb (January 2008)
defused Google bomb (June 2009)
Tu 1/22 Semantic Search (WK) V students
instructors
source
CleverSearch
Yummly
SWSE
Ask The Wiki
Evi
PizzaFinder
Semantic Media Wiki
Fr 1/25 Practical exercise U Assignment 7
Solutions for assignment 7
Tu 1/29 Mock exam U Review questions
Review exercises
Exam topics
Fr 2/1 Discussion of mock exam, questions U
Fr 2/8 Exam U

* V = lecture (Vorlesung); U = exercise session (Übung)

Exam-related Information

To get admitted to the exam ("Vorleistung"), you will have to

For all students (except M.Sc. CL students who take the course as part of the StatNLP concentration), there will be a written exam on Fr 2/8/13. Your final grade for the course will be the grade you get in the exam.

M.Sc. CL students who take the course as part of the StatNLP concentration do not have to take the written exam; the course will be examined as part of the oral exam for the concentration.

More Resources

Textbook (IIR): Introduction to Information Retrieval. Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze. Cambridge University Press, 2008. Web publication at http://informationretrieval.org

Old assignments and slides:

Topic Chapter Slides Resources
IIR04 Index construction pdf html students
instructors
source
SPIMI paper
Google data center tour
IIR10 XML retrieval pdf html students
instructors
source
IIR11 Probabilistic information retrieval pdf html students
instructors
source
IIR15-2 Learning to rank (LTR) pdf html students
instructors
source
Microsoft LTR datasets
IIR17 Hierarchical clustering pdf html students
instructors
source
GoogleNews precursor: Newsblaster
Bisecting K-means
PDDP algorithm


You can find more information on the pages of previous courses: