COEN 169 Web Information Management Project 1

Introduction
In this assignment, you will build indexes for a corpus and implement several information
retrieval algorithms with the Lemur Toolkit (http://www.lemurproject.org/) in C/C++.
Both the text document collection and the incomplete implementation of a retrieval system can
be found in a light version of the Lemur toolkit (see Instructions below). The light version only
runs on Unix/Linux. You should remotely log in to the Engineering Computing Center's Linux
machine (ftp.engr.scu.edu) using ssh or PuTTY to complete this assignment.
Instructions
Follow the procedures below to install the light Lemur toolkit:
1. Log in to the Linux machine (ftp.engr.scu.edu) and use the following command to download the toolkit:
wget http://www.cse.scu.edu/~yfang/lemur.tar.gz
2. Use "gzip -d lemur.tar.gz" and "tar -xvf lemur.tar" to uncompress and unpack the package.
3. Enter the lemur root directory and use "./configure" to configure the Lemur toolkit.
4. Use "gmake" in the lemur root directory to compile the package.
5. Every time you change the code, return to the lemur root directory and rebuild with "gmake".
There are two directories under the lemur root directory that are particularly useful. The
"eval_data/" directory contains the text document collection (i.e., database.sgml) to analyze and
several parameter files for running the retrieval system. The "app/src/" directory contains the
application code: "BuildIndex.cpp" is the (complete) application for building indexes, while
"RetrievalEval.cpp" is incomplete and is the file you will work on.
Implementation of several retrieval algorithms
In this task, you will use the Lemur toolkit for retrieval and implement several retrieval
algorithms yourself. First, build the inverted index files. Enter the "eval_data" directory
under the light Lemur root directory. Use the command "../app/obj/BuildIndex
build_param database.sgml" to build the index files from the raw text data. Use the command
"../app/obj/BuildIndex build_stemmed_nostopw_param database.sgml" to build the index files
from text data processed by stemming and stopword removal.
The source file "RetrievalEval.cpp" under "app/src/" contains several retrieval algorithms.
Please read through the annotated code to understand the framework. Each retrieval
algorithm has its own implementation of the functions computeMETHODWeight and/or
computeMETHODAdjustedScore (METHOD should be replaced by the name of the
retrieval method, e.g., RawTF). The retrieval score of a document D with respect to a query Q
is calculated as follows:
S(Q, D) = scoreAdjust( Weight(q_1, d_1, Q, D) + … + Weight(q_N, d_N, Q, D), Q, D )
where computeMETHODWeight calculates the value of the Weight function for a term
matched between a document and a query, and computeMETHODAdjustedScore calculates
the value of the scoreAdjust function (e.g., for score normalization).
One simple retrieval algorithm, "RawTF", has already been implemented. You can refer to
its code when implementing the other retrieval algorithms.
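For orientation, here is a simplified sketch of the two pieces the RawTF method supplies. The real functions in RetrievalEval.cpp take more arguments (e.g., the index and term/document ids), so the signatures below are assumptions meant only to show the division of labor, not the actual interface:

// Simplified sketch (assumed signatures): the per-term weight and the
// final score adjustment for the RawTF method.
double computeRawTFWeight(double termFreqInDoc, double queryTermWeight) {
    // RawTF: frequency of the matched term in the document, scaled by
    // the term's weight in the query.
    return termFreqInDoc * queryTermWeight;
}

double computeRawTFAdjustedScore(double score) {
    // RawTF does no normalization: scoreAdjust is the identity.
    return score;
}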
To get familiar with the Lemur toolkit and implement your own retrieval algorithms, please
complete the following experiments:
1. Test the performance of the retrieval algorithm "RawTF" on the two types of text data
(i.e., raw text data and text data processed by stemming and stopword removal). (20 points)
Enter the "eval_data" directory under the light Lemur root directory. Use the command
"../app/obj/RetrievalEval eval_param query" to generate the result file (i.e., result_rawtf)
of the RawTF retrieval algorithm for the raw text data. Use the command
"../app/obj/RetrievalEval eval_stemmed_nostopw_param query_stemmed_nostopw"
to generate the result file (i.e., result_rawtf_stemmed_nostopw) for the stemmed,
stopword-removed text data; before running it, set the weighting scheme parameter in the
parameter file to RawTF and set the corresponding result file name via the result parameter.
2. Evaluate the results using "../trec_eval qrel result_rawtf" and "../trec_eval qrel
result_rawtf_stemmed_nostopw". Please include the results in your report. Can you
tell which result is better? If one is better than the other, please provide a short
analysis. Please also state which stemmer is used in the index. Can you use another
stemmer and compare the results? Instructions on how to define the BuildIndex
parameter file can be found at
http://www.cs.cmu.edu/~lemur/3.1/indexing.html#BuildIndex
3. Evaluate the results when stopwords are NOT removed. A stopword list is contained in
eval_data/stopwordlist. You need to modify the parameter file (e.g., remove the stopwords
entry from build_stemmed_nostopw_param) when applying BuildIndex; a sketch of such a
parameter file follows this item.
Please provide a short analysis of whether removing stopwords helps or not.
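For reference, classic Lemur 3.x parameter files use simple "name = value;" lines, roughly as sketched below. The actual contents of build_stemmed_nostopw_param may differ, so treat the parameter names and values here as assumptions and check the file itself together with the documentation linked above:

index = stemmed_nostopw;
stopwords = stopwordlist;
stemmer = porter;

Deleting the stopwords line builds the index without stopword removal; changing the stemmer value is how you would experiment with a different stemmer.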
2. Implement three different retrieval algorithms and evaluate their performance.
(80 points)
You only need to change the computeMETHODWeight functions of the three algorithms.
The weight functions of the three algorithms and of the (already implemented) RawTF method are:

RawTF:
weight(q_t, d_t, Q, D) = term_freq(d_t) * weight(q_t)

RawTFIDF:
weight(q_t, d_t, Q, D) = term_freq(d_t) * log( totalNumDocs / numDocsContain(term_t) ) * weight(q_t)

LogTFIDF:
weight(q_t, d_t, Q, D) = ( log(term_freq(d_t)) + 1 ) * log( totalNumDocs / numDocsContain(term_t) ) * weight(q_t)

Okapi:
weight(q_t, d_t, Q, D) = [ term_freq(d_t) / ( term_freq(d_t) + 0.5 + 1.5 * len(D)/avgDocLen ) ]
* log( ( totalNumDocs - numDocsContain(term_t) + 0.5 ) / ( numDocsContain(term_t) + 0.5 ) )
* ( 8 + weight(q_t) ) / ( 7 + weight(q_t) )

Here term_freq(d_t) is the frequency of term t in document D, weight(q_t) is the weight of term t in
the query Q, len(D) is the length of document D, avgDocLen is the average document length in the
collection, totalNumDocs is the number of documents in the collection, and numDocsContain(term_t)
is the number of documents containing term t.
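As a starting point, here is a minimal sketch of the three weight functions written against plain numbers rather than the actual RetrievalEval.cpp signatures. The function names and parameter lists are illustrative assumptions, and the log base is assumed to be natural (std::log); adapt them to the framework's computeMETHODWeight interface:

#include <cmath>

// Illustrative sketches of the weight formulas above, taking the raw
// statistics as plain doubles.
double rawTFIDFWeight(double termFreq, double queryWeight,
                      double totalNumDocs, double numDocsContain) {
    return termFreq * std::log(totalNumDocs / numDocsContain) * queryWeight;
}

double logTFIDFWeight(double termFreq, double queryWeight,
                      double totalNumDocs, double numDocsContain) {
    return (std::log(termFreq) + 1.0)
           * std::log(totalNumDocs / numDocsContain) * queryWeight;
}

double okapiWeight(double termFreq, double queryWeight,
                   double docLen, double avgDocLen,
                   double totalNumDocs, double numDocsContain) {
    // Length-normalized term frequency.
    double tf  = termFreq / (termFreq + 0.5 + 1.5 * docLen / avgDocLen);
    // Okapi-style inverse document frequency.
    double idf = std::log((totalNumDocs - numDocsContain + 0.5)
                          / (numDocsContain + 0.5));
    // Dampened query-term weight.
    double qtf = (8.0 + queryWeight) / (7.0 + queryWeight);
    return tf * idf * qtf;
}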
In order to obtain statistics such as the average document length, please check the reference
for the lemur::api::Index class. More general information about the Lemur classes can be found
in the Lemur class documentation.
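For example, the statistics used by the weight functions can be pulled from the index roughly as follows. This sketch assumes the namespaced lemur::api::Index interface from the Lemur reference; in the 3.x light toolkit the class may simply be named Index, so verify the class name and method signatures against your headers:

#include <iostream>
#include "Index.hpp"

// Assumed sketch: print the collection statistics needed by the
// weight functions for one term/document pair.
void printStatistics(lemur::api::Index &ind, int termID, int docID) {
    std::cout << "totalNumDocs:      " << ind.docCount()       << std::endl; // documents in the collection
    std::cout << "numDocsContain(t): " << ind.docCount(termID) << std::endl; // documents containing term t
    std::cout << "avgDocLen:         " << ind.docLengthAvg()   << std::endl; // average document length
    std::cout << "len(D):            " << ind.docLength(docID) << std::endl; // length of document D
}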
Please check the weightingScheme and resultFile parameters within
"eval_stemmed_nostopw_param" for the different retrieval algorithms. Follow the steps in
task 1 to generate the evaluation results.
In summary, you need to generate the results (Average Precision) for the different combinations of
retrieval models and preprocessing steps and fill in the table below:

             | Stopwords Removed, | Stopwords Removed, | Stopwords Kept, | Stopwords Kept,
             | Stemming           | No Stemming        | Stemming        | No Stemming
-------------------------------------------------------------------------------------------
RawTF        |                    |                    |                 |
RawTFIDF     |                    |                    |                 |
LogTFIDF     |                    |                    |                 |
Okapi        |                    |                    |                 |
Please compare the results and provide a short discussion of the advantages and disadvantages
of the algorithms.
What to turn in
1. Please write a report of your experiments.
2. Please attach a hardcopy of your code (RetrievalEval.cpp) to the report.