Instructor
Shannon Bradshaw, Ph.D.
302 Hall of Sciences
973.408.3198
sbradsha at drew dot edu
http://users.drew.edu/sbradsha
Office Hours
Schedule (Subject to change)
Week 1
Week 2
- Reading: Learning Perl, Chapters 1-7
- Development: Assignment 1 (due Mon 17 Sep @ 1:15pm)
- No class: Professor Bradshaw will be presenting a paper at Hypertext 2007 in Manchester, England.
Week 3
- Reading: Learning Perl, "Matching with Regular Expressions"
- Examples: /home/sbradsha/CSCI_100_For_Students/RosterProcessor on bob.
- Introduction to the Bash shell continued
- Perl fundamentals
Week 4
- Development: Assignment 2 (multiple due dates)
- Reading: Learning Perl, "Processing Text with Regular Expressions"
- Perl fundamentals continued
- Perl regular expressions
Week 5
- Development: Assignment 3 (Due 8 Oct @ 1:15pm): Write a Perl script that determines which values are used as subfield identifiers in MARC records. Your script should identify all ids used.
- Reading: Learning Perl, "Hashes"
- Reading: Introduction to Information Retrieval (Book), Information Retrieval Using the Boolean Model (Chapter)
- Slides: Information Retrieval Using the Boolean Model (ppt)
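One way to approach Assignment 3 is to scan for the MARC subfield delimiter. The sketch below assumes the raw MARC convention that each subfield begins with the delimiter byte \x1F followed by a one-character id; the sample fragment is invented for illustration, and your actual input format may differ.

```perl
#!/usr/bin/perl
# Hedged sketch for Assignment 3: collect the subfield identifiers that
# appear in MARC record data. Assumes subfields start with \x1F followed
# by a one-character id (the standard binary MARC convention).
use strict;
use warnings;

sub subfield_ids {
    my ($record) = @_;
    my %seen;
    # Every \x1F delimiter is followed by the subfield's one-character id.
    while ($record =~ /\x1F(.)/g) {
        $seen{$1}++;
    }
    return sort keys %seen;
}

# Tiny made-up record fragment for illustration.
my $fragment = "\x1FaTitle\x1FbSubtitle\x1FcAuthor\x1FaAnother";
print join(",", subfield_ids($fragment)), "\n";   # a,b,c
```

A hash is the natural accumulator here: the keys de-duplicate the ids for free, and the counts in the values are a bonus if you also want frequencies.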
Week 6
- Development: Assignment 4 (Due Fri 19 Oct @ 1:15pm): Download the
Amazon comments page for Andrew Keen's book, The
Cult of the Amateur. Write a Perl script that will extract all the
comments from this page. Store each comment in a hash that has key/value
pairs for title, date, commenter, and comments. Store all comment hashes in
an array. Print out your array of hashes using the prettyPrint method of my
Util::Output module.
I recommend that you parse the file by first extracting
an entire comment entry and then breaking it apart into the four pieces you
need to store. This part of the assignment will exercise your regular
expression skills.
This assignment will require some experimentation with various regular
expressions. Begin now. Make certain you understand how the comments are
delimited in the data file before you get started. You may work in
teams of two. Send me an email before class on the 19th telling me who you
worked with, where to find your code, and how well it works. If you get
stuck, send me an email. CC the class, in case someone else is able to lend a
hand before I check my email.
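The two-stage parse recommended above can be sketched as follows. The `<comment>`-style markup here is invented so the example is self-contained; the real Amazon page uses different (and messier) HTML, so only the structure of the approach carries over: grab a whole entry first, then break it into its four pieces.

```perl
#!/usr/bin/perl
# Hedged sketch for Assignment 4: extract comment entries and store them
# as an array of hashes. The markup below is made up for illustration.
use strict;
use warnings;

my $page = <<'HTML';
<comment><title>Great read</title><date>1 Oct 2007</date>
<by>Alice</by><text>Loved it.</text></comment>
<comment><title>Not for me</title><date>3 Oct 2007</date>
<by>Bob</by><text>Too gloomy.</text></comment>
HTML

my @comments;
# Stage 1: pull out each complete comment entry.
while ($page =~ m{<comment>(.*?)</comment>}gs) {
    my $entry = $1;
    # Stage 2: break the entry into its four fields.
    my %c;
    ($c{title})     = $entry =~ m{<title>(.*?)</title>}s;
    ($c{date})      = $entry =~ m{<date>(.*?)</date>}s;
    ($c{commenter}) = $entry =~ m{<by>(.*?)</by>}s;
    ($c{comments})  = $entry =~ m{<text>(.*?)</text>}s;
    push @comments, \%c;
}

# In the assignment you would hand @comments to the prettyPrint method of
# Util::Output; here we just print a line per comment.
print "$_->{commenter}: $_->{title}\n" for @comments;
```

Note the non-greedy `(.*?)` and the `/s` modifier: comments span multiple lines, so `.` must be allowed to match newlines, and non-greedy matching keeps one entry from swallowing the next.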
Week 7
- Development/Reading: Assignment 5 (Due Fri 26 Oct @ 11:59pm):
Work through the command-line Lucene demo. Read through the discussion of the
command-line demo.
Lucene requires that we create a directory containing files to be
indexed. Each file, or document as they are called in an indexing and retrieval
system, will be named with a unique identifier (e.g., 1.txt, 2.txt,
3.txt, ...). The contents of each file/document are the terms that Lucene will
use to create an inverted index (see Introduction to IR slides). We must
answer several questions in order to build our files to be indexed correctly:
- How does Lucene enable indexing and retrieval of documents with fields?
- How should we generate files to be indexed in order to take advantage of
this functionality?
- How does Lucene uniquely identify documents?
- Can we add documents to an existing index? If so, how?
- Can we delete documents from an index? If so, how?
Answer the questions above, then use your answers to do the following:
- Create a directory of files of your own making (containing at least
10 documents). The files should be structured with title, author, and body
fields.
- Build a simple Lucene index that employs the title, author, and body
fields and any other fields you deem appropriate for our documents.
- Demonstrate that you can query your index by specific fields.
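For the first step, a short script can generate the directory of uniquely named files. The field layout below (one `name: value` line per field) is only an assumption; choose whatever layout matches how you decide to feed title, author, and body fields to Lucene.

```perl
#!/usr/bin/perl
# Hedged sketch for Assignment 5, step 1: generate a directory of files
# named 1.txt .. 10.txt, each carrying title, author, and body fields.
# The directory name and field layout are placeholders.
use strict;
use warnings;

my $dir = "toIndex";
mkdir $dir unless -d $dir;

for my $i (1 .. 10) {
    open my $fh, '>', "$dir/$i.txt" or die "cannot write $dir/$i.txt: $!";
    print $fh "title: Sample Document $i\n";
    print $fh "author: Author $i\n";
    print $fh "body: This is the body text of document $i.\n";
    close $fh;
}
print "wrote 10 files to $dir\n";
```

The numeric file names double as the unique identifiers the demo expects, which also answers part of the "how does Lucene uniquely identify documents" question for your own index design.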
Week 8
Week 9
Week 10
- Development: Assignment 6 (Due Mon 12 Nov @ 1:15pm):
Complete parsing code for the MARC fields to which you have been
assigned. (See many days of class discussion.)
Week 11
- Development Assignment: Ungraded Assignment (Due Wed 14 Nov @ noon): Check
out a working copy of the MARC parsing script from the SVN repository. The url for
the project is:
https://bob.drew.edu/repos/Project
Add the subs you wrote to your working copy and commit them back to the
repository. Before committing, make sure to update your working copy and merge
your changes in the event that someone else has committed changes since you
checked out your working copy.
- Development Assignment: Assignment 7 (Due Fri 16 Nov @ noon): Make the
small coding additions you were assigned to the MARC parsing code. See the coding assignments.
- Writing Assignment: Assignment 8 (Due Mon 26 Nov): Write a one page
reflection on lessons learned with regard to library cataloging and working
with a real customer.
Week 12
- Code Review:
- Look at the merged MARC parsing Perl code.
- Determine what fields to use in searching our catalog index.
Week 13
- Slides: Java Threads
- Development Assignment: Assignment 9 (due Wed 28 Nov @ 1:15pm)
- Development Assignment: Assignment 10, due Wed 28 Nov @ 1:15pm, (10
points). Using the classes below (among others), write a simple Java program
that will retrieve the course homepage and save a local copy to a directory
other than the one from which your program runs.
- java.net.URL
- java.net.HttpURLConnection
- java.io.BufferedInputStream
- java.io.InputStreamReader
- java.io.Reader
You will need to explore the documentation for each of these classes in the Java API in order to
understand how to use them. Be prepared to explain each line of your
program. Demonstrated understanding of your code will, in part, determine your
grade.
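A sketch of the retrieve-and-save program is below. The class name `PageSaver` and the `save` method are my own invention; the assignment only requires that you use the listed java.net/java.io classes and be able to explain each line. This version copies raw bytes rather than using InputStreamReader/Reader, which is one reasonable design choice for saving a page verbatim.

```java
import java.io.BufferedInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLConnection;

// Hedged sketch for Assignment 10: fetch a url and save a copy to a
// file in some other directory.
public class PageSaver {

    // Copies the resource at urlString into the file at destPath.
    public static void save(String urlString, String destPath) throws IOException {
        URL url = new URL(urlString);
        URLConnection conn = url.openConnection();
        // For http: urls this is an HttpURLConnection, so we can check
        // the response code before reading the body.
        if (conn instanceof HttpURLConnection) {
            int code = ((HttpURLConnection) conn).getResponseCode();
            if (code != HttpURLConnection.HTTP_OK) {
                throw new IOException("HTTP response code " + code);
            }
        }
        try (InputStream in = new BufferedInputStream(conn.getInputStream());
             FileOutputStream out = new FileOutputStream(destPath)) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.out.println("usage: java PageSaver <url> <destination file>");
            return;
        }
        save(args[0], args[1]);
        System.out.println("saved " + args[0] + " to " + args[1]);
    }
}
```

Because `URL.openConnection()` also handles `file:` urls, you can test the copy logic locally before pointing the program at the course homepage.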
- Development Assignment: Assignment 11, due Fri 30 Nov @
1:15pm, (5 points). Using SimpleCrawler.java and seedURLs.txt in the
Crawler sub-directory of my CSCI_100_For_Students directory, complete a very simple multi-threaded
application. The individual threads will simply retrieve the next url
from the urlsToBeCrawled PriorityQueue, print it to System.out, and
sleep for a brief period of time. I'll give you 1 extra credit point if
you make it sleep for a random number of seconds between 0 and 5. As
command line arguments your code should accept the name of a seed url
file and the number of threads to create. As a review of multi-threaded
applications, review the slides above and follow the examples in the
Threads sub-directory of my CSCI_100_For_Students directory.
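The thread behavior described above can be sketched as follows. SimpleCrawler.java supplies the real urlsToBeCrawled queue; a `PriorityBlockingQueue` stands in for it here (my choice, since it can be shared across threads without extra locking), and the seed urls and thread count are hard-coded where the assignment reads them from the command line.

```java
import java.util.Random;
import java.util.concurrent.PriorityBlockingQueue;

// Hedged sketch of the worker-thread logic for Assignment 11: each
// thread pulls the next url, prints it, and sleeps briefly.
public class CrawlThreadDemo {

    // Starts numThreads workers that pull urls until the queue is empty,
    // then waits for all of them to finish.
    static void drain(PriorityBlockingQueue<String> urlsToBeCrawled, int numThreads)
            throws InterruptedException {
        Thread[] workers = new Thread[numThreads];
        for (int i = 0; i < numThreads; i++) {
            workers[i] = new Thread(() -> {
                Random rand = new Random();
                String url;
                // poll() returns null once the queue is empty, ending the loop.
                while ((url = urlsToBeCrawled.poll()) != null) {
                    System.out.println(Thread.currentThread().getName() + " -> " + url);
                    try {
                        // Extra-credit variant sleeps 0-5 seconds; scaled
                        // down here so the demo finishes quickly.
                        Thread.sleep(rand.nextInt(6) * 10L);
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) t.join();
    }

    public static void main(String[] args) throws InterruptedException {
        PriorityBlockingQueue<String> urls = new PriorityBlockingQueue<>();
        urls.add("http://example.com/a");
        urls.add("http://example.com/b");
        urls.add("http://example.com/c");
        drain(urls, 2);   // thread count would come from the command line
        System.out.println("all urls consumed");
    }
}
```

Run it a few times: the interleaving of thread names in the output changes from run to run, which is a good concrete illustration of why shared state between threads needs a thread-safe data structure.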
Week 15
- Development Assignment: Assignment 10 (10 points, due Mon 10 Dec @
1:15pm) Please complete your simple crawler. It should have the following
features:
- The ability to specify the number of threads to invoke from the
command line. (The crawler should actually start that number of
threads.)
- The crawler should initialize a queue of urls to crawl from a seeds
file also specified on the command line.
- Each thread should continue to run, pulling the next url from the
queue of urls to crawl until all urls are consumed.
- Each thread should ignore urls that have already been crawled.
- Each thread should request, download, and save to disk each new url it
encounters. You do NOT at this point need to harvest new urls from
downloaded pages.
- Your program should print appropriate messages to indicate what each
thread is doing as it runs.
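The "ignore urls that have already been crawled" feature needs a record of seen urls that is safe to share across threads. One way to get that (an implementation choice of mine, not prescribed by the assignment) is a concurrent set whose `add()` atomically answers "was this new?":

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hedged sketch of crawl de-duplication: ConcurrentHashMap.newKeySet()
// gives a thread-safe set, and add() returns false for a url that was
// already present, so check-and-record is a single atomic step.
public class CrawledSetDemo {
    private static final Set<String> crawled = ConcurrentHashMap.newKeySet();

    // Returns true if this thread should crawl the url (first sighting).
    static boolean claim(String url) {
        return crawled.add(url);
    }

    public static void main(String[] args) {
        System.out.println(claim("http://example.com/a"));  // true
        System.out.println(claim("http://example.com/a"));  // false: already claimed
    }
}
```

A plain `HashSet` guarded by `synchronized` blocks would also work; the atomic `add()` just avoids the separate contains-then-add race.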
- Development Assignment: Assignment 11 (10 points, due Mon 10 Dec @
1:15pm by email to me) Comment each block of code in the precisionAtK-OurSystem.pl Perl
file. This file is located in the svn repository in the
PerformanceEvaluation subdirectory. A block of code is defined as a sequence
of lines that is preceded by a blank line. Your comments should describe each
block in sufficient detail to convince me that you understand what the code is
doing. This file contains code that is representative of nearly every
Perl-related topic we have discussed.
Week 16
- Development Assignment: Assignment 12 (20 points, due Mon 17 Dec @
11:59pm): All files referenced in this assignment are found in the svn
repository. Modify my PerformanceEvaluation/precisionAtK-OurSystem.pl script
so that it measures the precision at top 10 and recall at top 10 of the
Library's retrieval system. You will NOT be gauging
precision/recall by sending queries to the catalog
system. Rather, you will save the results pages for a few queries and run your
script against them.
For this assignment, complete the following steps.
- By hand, query the Library's search system for each of the queries
found in PerformanceEvaluation/Data/goldData-Assignment12.txt. Save the
files as 1.html, 2.html, ... 5.html.
- Copy PerformanceEvaluation/precisionAtK-OurSystem.pl to a file called
precisionAtK-Library-yourUsername.pl in the same directory. If you
name the file with the actual text "yourUsername" I won't grade your
assignment and may be forced to ridicule you. :-)
- Modify your copy of my script so that it compares the search results
for each query to the gold data provided for that query. Following the
example in my script, you should generate precision and recall numbers for
each query and total precision and recall numbers for the entire set of
queries. In order to make the comparison you will need to figure out a 99%
reliable way of matching a document in the search results to a document in
the gold data. Please evaluate your code by hand to make sure that all
relevant documents are matched to their corresponding entry in the search
results. I recommend that you begin this step by focusing on a comparison
sub (and possibly helpers) that takes two arguments, a gold data document description,
and a single search result, and returns true if they reference the same document.
- Check in to svn your completed solution to this assignment.
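For reference, the precision@10 and recall@10 arithmetic itself is small. The hard parts of the assignment (parsing the saved results pages and the gold data, and the 99%-reliable document matcher) are left out of this sketch; it only shows the counting once you know which of the top 10 results are relevant, and the function name and inputs are my own invention.

```perl
#!/usr/bin/perl
# Hedged sketch of the Assignment 12 metrics, given lists of document ids.
use strict;
use warnings;

sub precision_recall_at_10 {
    my ($results, $gold) = @_;          # array refs of document ids
    my %is_gold = map { $_ => 1 } @$gold;
    my @top  = @$results > 10 ? @{$results}[0 .. 9] : @$results;
    my $hits = grep { $is_gold{$_} } @top;
    # Convention assumed here: precision@10 divides by 10 even when the
    # system returned fewer than 10 results.
    my $precision = $hits / 10;
    my $recall    = @$gold ? $hits / @$gold : 0;
    return ($precision, $recall);
}

# Made-up example: 4 results returned, 2 of them in a 3-document gold set.
my ($p, $r) = precision_recall_at_10([qw(d1 d2 d3 d4)], [qw(d2 d4 d9)]);
printf "precision\@10 = %.2f, recall\@10 = %.2f\n", $p, $r;  # 0.20, 0.67
```

Your comparison sub would replace the simple `$is_gold{$_}` lookup: instead of exact id equality, it takes a gold data description and a single search result and decides whether they reference the same document.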