Instructor
Shannon Bradshaw, Ph.D.
302 Hall of Sciences
973.408.3198
sbradsha at drew dot edu
http://users.drew.edu/sbradsha
Office Hours
Schedule (Subject to change)
Week 1
Week 2
- Reading: Learning Perl, Chapters 1-7
- Development: Assignment 1 (due Mon 17 Sep @ 1:15pm)
- No class: Professor Bradshaw will be presenting a paper at Hypertext 2007 in Manchester, England.
Week 3
- Reading: Learning Perl, "Matching with Regular Expressions"
- Examples: /home/sbradsha/CSCI_100_For_Students/RosterProcessor on bob.
- Introduction to the Bash shell continued
- Perl fundamentals
Week 4
- Development: Assignment 2 (multiple due dates)
- Reading: Learning Perl, "Processing Text with Regular Expressions"
- Perl fundamentals continued
- Perl regular expressions
Week 5
- Development: Assignment 3 (Due 8 Oct @ 1:15pm): Write a Perl script that determines which values are used as subfield identifiers in MARC records. Your script should identify all ids used.
- Reading: Learning Perl, "Hashes"
- Reading: Introduction to Information Retrieval (Book), Information Retrieval Using the Boolean Model (Chapter)
- Slides: Information Retrieval Using the Boolean Model (ppt)
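One way to approach Assignment 3 is to scan for the MARC subfield delimiter. The sketch below assumes the raw MARC convention that each subfield begins with the delimiter byte \x1F followed by a one-character id; the sample fragment is invented for illustration, and your actual input format may differ.

```perl
#!/usr/bin/perl
# Hedged sketch for Assignment 3: collect the subfield identifiers that
# appear in MARC record data. Assumes subfields start with \x1F followed
# by a one-character id (the standard binary MARC convention).
use strict;
use warnings;

sub subfield_ids {
    my ($record) = @_;
    my %seen;
    # Every \x1F delimiter is followed by the subfield's one-character id.
    while ($record =~ /\x1F(.)/g) {
        $seen{$1}++;
    }
    return sort keys %seen;
}

# Tiny made-up record fragment for illustration.
my $fragment = "\x1FaTitle\x1FbSubtitle\x1FcAuthor\x1FaAnother";
print join(",", subfield_ids($fragment)), "\n";   # a,b,c
```

A hash is the natural accumulator here: the keys de-duplicate the ids for free, and the counts in the values are a bonus if you also want frequencies.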
Week 6
- Development: Assignment 4 (Due Fri 19 Oct @ 1:15pm): Download the
Amazon comments page for Andrew Keen's book, The
Cult of the Amateur. Write a Perl script that will extract all the
comments from this page. Store each comment in a hash that has key/value
pairs for title, date, commenter, and comments. Store all comment hashes in
an array. Print out your array of hashes using the prettyPrint method of my
Util::Output module.
I recommend that you parse the file by first extracting
an entire comment entry and then breaking it apart into the four pieces you
need to store. This part of the assignment will exercise your regular
expression skills.
This assignment will require some experimentation with various regular
expressions. Begin now. Make certain you understand how the comments are
delimited in the data file before you get started. You may work in
teams of two. Send me an email before class on the 19th telling me who you
worked with, where to find your code, and how well it works. If you get
stuck, send me an email. CC the class, in case someone else is able to lend a
hand before I check my email.
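The two-stage parse recommended above can be sketched as follows. The `<comment>`-style markup here is invented so the example is self-contained; the real Amazon page uses different (and messier) HTML, so only the structure of the approach carries over: grab a whole entry first, then break it into its four pieces.

```perl
#!/usr/bin/perl
# Hedged sketch for Assignment 4: extract comment entries and store them
# as an array of hashes. The markup below is made up for illustration.
use strict;
use warnings;

my $page = <<'HTML';
<comment><title>Great read</title><date>1 Oct 2007</date>
<by>Alice</by><text>Loved it.</text></comment>
<comment><title>Not for me</title><date>3 Oct 2007</date>
<by>Bob</by><text>Too gloomy.</text></comment>
HTML

my @comments;
# Stage 1: pull out each complete comment entry.
while ($page =~ m{<comment>(.*?)</comment>}gs) {
    my $entry = $1;
    # Stage 2: break the entry into its four fields.
    my %c;
    ($c{title})     = $entry =~ m{<title>(.*?)</title>}s;
    ($c{date})      = $entry =~ m{<date>(.*?)</date>}s;
    ($c{commenter}) = $entry =~ m{<by>(.*?)</by>}s;
    ($c{comments})  = $entry =~ m{<text>(.*?)</text>}s;
    push @comments, \%c;
}

# In the assignment you would hand @comments to the prettyPrint method of
# Util::Output; here we just print a line per comment.
print "$_->{commenter}: $_->{title}\n" for @comments;
```

Note the non-greedy `(.*?)` and the `/s` modifier: comments span multiple lines, so `.` must be allowed to match newlines, and non-greedy matching keeps one entry from swallowing the next.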
Week 7
- Development/Reading: Assignment 5 (Due Fri 26 Oct @ 11:59pm):
Work through the command-line Lucene demo. Read through the discussion of the
command-line demo.
Lucene requires that we create a directory containing files to be
indexed. Each file, or document as they are called in an indexing and retrieval
system, will be named with a unique identifier (e.g., 1.txt, 2.txt,
3.txt, ...). The contents of each file/document are the terms that Lucene will
use to create an inverted index (see Introduction to IR slides). We must
answer several questions in order to build our files to be indexed correctly:
- How does Lucene enable indexing and retrieval of documents with fields?
- How should we generate files to be indexed in order to take advantage of
this functionality?
- How does Lucene uniquely identify documents?
- Can we add documents to an existing index? If so, how?
- Can we delete documents from an index? If so, how?
Answer the questions above, then use your answers to do the following:
- Create a directory of files of your own making (containing at least
10 documents). The files should be structured with title, author, and body
fields.
- Build a simple Lucene index that employs the title, author, and body
fields and any other fields you deem appropriate for our documents.
- Demonstrate that you can query your index by specific fields.
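For the first step, a short script can generate the directory of uniquely named files. The field layout below (one `name: value` line per field) is only an assumption; choose whatever layout matches how you decide to feed title, author, and body fields to Lucene.

```perl
#!/usr/bin/perl
# Hedged sketch for Assignment 5, step 1: generate a directory of files
# named 1.txt .. 10.txt, each carrying title, author, and body fields.
# The directory name and field layout are placeholders.
use strict;
use warnings;

my $dir = "toIndex";
mkdir $dir unless -d $dir;

for my $i (1 .. 10) {
    open my $fh, '>', "$dir/$i.txt" or die "cannot write $dir/$i.txt: $!";
    print $fh "title: Sample Document $i\n";
    print $fh "author: Author $i\n";
    print $fh "body: This is the body text of document $i.\n";
    close $fh;
}
print "wrote 10 files to $dir\n";
```

The numeric file names double as the unique identifiers the demo expects, which also answers part of the "how does Lucene uniquely identify documents" question for your own index design.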
Week 8
Week 9
Week 10
- Development: Assignment 6 (Due Mon 12 Nov @ 1:15pm):
Complete parsing code for the MARC fields to which you have been
assigned. (See many days of class discussion.)
Week 11
- Development Assignment: Ungraded Assignment (Due Wed 14 Nov @ noon): Check
out a working copy of the MARC parsing script from the SVN repository. The url for
the project is:
https://bob.drew.edu/repos/Project
Add the subs you wrote to your working copy and commit them back to the
repository. Before committing, make sure to update your working copy and merge
your changes in the event that someone else has committed changes since you
checked out your working copy.
- Development Assignment: Assignment 7 (Due Fri 16 Nov @ noon): Make the
small coding additions you were assigned to the MARC parsing code. See the coding assignments.
- Writing Assignment: Assignment 8 (Due Mon 26 Nov): Write a one page
reflection on lessons learned with regard to library cataloging and working
with a real customer.
Week 12
- Code Review:
- Look at the merged MARC parsing Perl code.
- Determine what fields to use in searching our catalog index.
Week 13
- Slides: Java Threads
- Development Assignment: Assignment 9 (due Wed 28 Nov @ 1:15pm)
- Development Assignment: Assignment 10, due Wed 28 Nov @ 1:15pm, (10
points). Using the classes below (among others), write a simple Java program
that will retrieve the course homepage and save a local copy to a directory
other than the one from which your program runs.
- java.net.URL
- java.net.HttpURLConnection
- java.io.BufferedInputStream
- java.io.InputStreamReader
- java.io.Reader
You will need to explore the documentation for each of these classes in the Java API in order to
understand how to use them. Be prepared to explain each line of your
program. Demonstrated understanding of your code will, in part, determine your
grade.
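A sketch of the retrieve-and-save program is below. The class name `PageSaver` and the `save` method are my own invention; the assignment only requires that you use the listed java.net/java.io classes and be able to explain each line. This version copies raw bytes rather than using InputStreamReader/Reader, which is one reasonable design choice for saving a page verbatim.

```java
import java.io.BufferedInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLConnection;

// Hedged sketch for Assignment 10: fetch a url and save a copy to a
// file in some other directory.
public class PageSaver {

    // Copies the resource at urlString into the file at destPath.
    public static void save(String urlString, String destPath) throws IOException {
        URL url = new URL(urlString);
        URLConnection conn = url.openConnection();
        // For http: urls this is an HttpURLConnection, so we can check
        // the response code before reading the body.
        if (conn instanceof HttpURLConnection) {
            int code = ((HttpURLConnection) conn).getResponseCode();
            if (code != HttpURLConnection.HTTP_OK) {
                throw new IOException("HTTP response code " + code);
            }
        }
        try (InputStream in = new BufferedInputStream(conn.getInputStream());
             FileOutputStream out = new FileOutputStream(destPath)) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.out.println("usage: java PageSaver <url> <destination file>");
            return;
        }
        save(args[0], args[1]);
        System.out.println("saved " + args[0] + " to " + args[1]);
    }
}
```

Because `URL.openConnection()` also handles `file:` urls, you can test the copy logic locally before pointing the program at the course homepage.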
- Development Assignment: Assignment 11, due Fri 30 Nov @
1:15pm, (5 points). Using SimpleCrawler.java and seedURLs.txt in the
Crawler sub-directory of my CSCI_100_For_Students directory, complete a very simple multi-threaded
application. The individual threads will simply retrieve the next url
from the urlsToBeCrawled PriorityQueue, print it to System.out, and
sleep for a brief period of time. I'll give you 1 extra credit point if
you make it sleep for a random number of seconds between 0 and 5. As
command line arguments your code should accept the name of a seed url
file and the number of threads to create. As a review of multi-threaded
applications, review the slides above and follow the examples in the
Threads sub-directory of my CSCI_100_For_Students directory.
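The thread behavior described above can be sketched as follows. SimpleCrawler.java supplies the real urlsToBeCrawled queue; a `PriorityBlockingQueue` stands in for it here (my choice, since it can be shared across threads without extra locking), and the seed urls and thread count are hard-coded where the assignment reads them from the command line.

```java
import java.util.Random;
import java.util.concurrent.PriorityBlockingQueue;

// Hedged sketch of the worker-thread logic for Assignment 11: each
// thread pulls the next url, prints it, and sleeps briefly.
public class CrawlThreadDemo {

    // Starts numThreads workers that pull urls until the queue is empty,
    // then waits for all of them to finish.
    static void drain(PriorityBlockingQueue<String> urlsToBeCrawled, int numThreads)
            throws InterruptedException {
        Thread[] workers = new Thread[numThreads];
        for (int i = 0; i < numThreads; i++) {
            workers[i] = new Thread(() -> {
                Random rand = new Random();
                String url;
                // poll() returns null once the queue is empty, ending the loop.
                while ((url = urlsToBeCrawled.poll()) != null) {
                    System.out.println(Thread.currentThread().getName() + " -> " + url);
                    try {
                        // Extra-credit variant sleeps 0-5 seconds; scaled
                        // down here so the demo finishes quickly.
                        Thread.sleep(rand.nextInt(6) * 10L);
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) t.join();
    }

    public static void main(String[] args) throws InterruptedException {
        PriorityBlockingQueue<String> urls = new PriorityBlockingQueue<>();
        urls.add("http://example.com/a");
        urls.add("http://example.com/b");
        urls.add("http://example.com/c");
        drain(urls, 2);   // thread count would come from the command line
        System.out.println("all urls consumed");
    }
}
```

Run it a few times: the interleaving of thread names in the output changes from run to run, which is a good concrete illustration of why shared state between threads needs a thread-safe data structure.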
Week 15
- Development Assignment: Assignment 10 (10 points, due Mon 10 Dec @
1:15pm) Please complete your simple crawler. It should have the following
features:
- The ability to specify the number of threads to invoke from the
command line. (The crawler should actually start that number of
threads.)
- The crawler should initialize a queue of urls to crawl from a seeds
file also specified on the command line.
- Each thread should continue to run, pulling the next url from the
queue of urls to crawl until all urls are consumed.
- Each thread should ignore urls that have already been crawled.
- Each thread should request, download, and save to disk each new url it
encounters. You do NOT at this point need to harvest new urls from
downloaded pages.
- Your program should print appropriate messages to indicate what each
thread is doing as it runs.
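The "ignore urls that have already been crawled" feature needs a record of seen urls that is safe to share across threads. One way to get that (an implementation choice of mine, not prescribed by the assignment) is a concurrent set whose `add()` atomically answers "was this new?":

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hedged sketch of crawl de-duplication: ConcurrentHashMap.newKeySet()
// gives a thread-safe set, and add() returns false for a url that was
// already present, so check-and-record is a single atomic step.
public class CrawledSetDemo {
    private static final Set<String> crawled = ConcurrentHashMap.newKeySet();

    // Returns true if this thread should crawl the url (first sighting).
    static boolean claim(String url) {
        return crawled.add(url);
    }

    public static void main(String[] args) {
        System.out.println(claim("http://example.com/a"));  // true
        System.out.println(claim("http://example.com/a"));  // false: already claimed
    }
}
```

A plain `HashSet` guarded by `synchronized` blocks would also work; the atomic `add()` just avoids the separate contains-then-add race.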
- Development Assignment: Assignment 11 (10 points, due Mon 10 Dec @
1:15pm by email to me) Comment each block of code in the precisionAtK-OurSystem.pl Perl
file. This file is located in the svn repository in the
PerformanceEvaluation subdirectory. A block of code is defined as a sequence
of lines that is preceded by a blank line. Your comments should describe each
block in sufficient detail to convince me that you understand what the code is
doing. This file contains code that is representative of nearly every
Perl-related topic we have discussed.
Week 16
- Development Assignment: Assignment 12 (20 points, due Mon 17 Dec @
11:59pm): All files referenced in this assignment are found in the svn
repository. Modify my PerformanceEvaluation/precisionAtK-OurSystem.pl script
so that it measures the precision at top 10 and recall at top 10 of the
Library's retrieval system. You will NOT be gauging
precision/recall by sending queries to the catalog
system. Rather, you will save the results pages for a few queries and run your
script against them.
For this assignment, complete the following steps.
- By hand, query the Library's search system for each of the queries
found in PerformanceEvaluation/Data/goldData-Assignment12.txt. Save the
files as 1.html, 2.html, ... 5.html.
- Copy PerformanceEvaluation/precisionAtK-OurSystem.pl to a file called
precisionAtK-Library-yourUsername.pl in the same directory. If you
name the file with the actual text "yourUsername" I won't grade your
assignment and may be forced to ridicule you. :-)
- Modify your copy of my script so that it compares the search results
for each query to the gold data provided for that query. Following the
example in my script, you should generate precision and recall numbers for
each query and total precision and recall numbers for the entire set of
queries. In order to make the comparison you will need to figure out a 99%
reliable way of matching a document in the search results to a document in
the gold data. Please evaluate your code by hand to make sure that all
relevant documents are matched to their corresponding entry in the search
results. I recommend that you begin this step by focusing on a comparison
sub (and possibly helpers) that takes two arguments, a gold data document description,
and a single search result, and returns true if they reference the same document.
- Check in to svn your completed solution to this assignment.
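For reference, the precision@10 and recall@10 arithmetic itself is small. The hard parts of the assignment (parsing the saved results pages and the gold data, and the 99%-reliable document matcher) are left out of this sketch; it only shows the counting once you know which of the top 10 results are relevant, and the function name and inputs are my own invention.

```perl
#!/usr/bin/perl
# Hedged sketch of the Assignment 12 metrics, given lists of document ids.
use strict;
use warnings;

sub precision_recall_at_10 {
    my ($results, $gold) = @_;          # array refs of document ids
    my %is_gold = map { $_ => 1 } @$gold;
    my @top  = @$results > 10 ? @{$results}[0 .. 9] : @$results;
    my $hits = grep { $is_gold{$_} } @top;
    # Convention assumed here: precision@10 divides by 10 even when the
    # system returned fewer than 10 results.
    my $precision = $hits / 10;
    my $recall    = @$gold ? $hits / @$gold : 0;
    return ($precision, $recall);
}

# Made-up example: 4 results returned, 2 of them in a 3-document gold set.
my ($p, $r) = precision_recall_at_10([qw(d1 d2 d3 d4)], [qw(d2 d4 d9)]);
printf "precision\@10 = %.2f, recall\@10 = %.2f\n", $p, $r;  # 0.20, 0.67
```

Your comparison sub would replace the simple `$is_gold{$_}` lookup: instead of exact id equality, it takes a gold data description and a single search result and decides whether they reference the same document.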