Assignment 2

Simple search engines.

All changes to the assignment specification and files will be listed here.

  • [02/11 10:00] Assignment released
  • [02/11 22:00] Added additional clarifications and assumptions
  • [07/11 16:20] Updated the Makefile by adding the -f option to rm. Download the new version here. Also added information about testing.

In this assignment, your task is to implement a simple search engine using a simplified version of the well-known Weighted PageRank algorithm. You should start by reading the Wikipedia article on PageRank (read up to the section Damping Factor). Later, I will release a video lecture introducing this assignment and discussing the required topics.

The main focus of this assignment is to build a graph structure from a set of web pages, calculate Weighted PageRanks and rank pages. To make things easier for you, you don't need to spend time crawling, collecting and parsing web pages for this assignment. Instead, you will be provided with a collection of mock web pages in the form of plain text files. Each web page has two sections:

  • Section 1 contains URLs representing outgoing links. The URLs are separated by whitespace, and may be spread across multiple lines.
  • Section 2 contains the actual content of the web page, and consists of one or more words. Words are separated by whitespace, and may be spread across multiple lines.

Here is an example web page:
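For instance, a page file such as url31.txt might look like this (the exact line breaks within each section may vary):

#start Section-1
url2 url34 url1 url26 url52
url21 url74 url6 url82
#end Section-1
#start Section-2
Mars has long been the subject of human interest. Early telescopic
observations revealed color changes on the surface that were attributed
to seasonal vegetation and apparent linear features were ascribed to
intelligent design.
#end Section-2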

Sections will always be appropriately delimited by lines containing #start Section-1 , #end Section-1 , #start Section-2 and #end Section-2 . There will be no other characters on these lines (no leading or trailing whitespace). There may be lines containing whitespace before or after any section. Web pages contain exactly one of each section, and Section 1 will always precede Section 2.

Note that it is completely valid for Section 1 to be empty - this simply means that the web page does not have any outgoing links.

Change into the directory you created for the assignment and run the following command:

If you're working at home, download files.zip by clicking on the above link and then run the unzip command on the downloaded file.

If you've done the above correctly, you should now have the following files:

Each test directory contains the following files:

First, compile the original version of the files using the make command. You'll see that it produces three executables: pagerank, searchPagerank and scaledFootrule. It also copies these executables to the ex1, ex2 and ex3 directories, because when you test your program, you will need to change into one of these directories first.

Note that in this assignment, you are permitted to create as many supporting files as you like. This allows you to compartmentalise your solution into different files. For example, you could implement a Graph ADT in Graph.c and Graph.h . To ensure that these files are actually included in the compilation, you will need to edit the Makefile to include these supporting files; the provided Makefile contains instructions on how to do this.

Part 1 - Weighted PageRanks

  • How to Implement Part 1 (hints)
  • Weighted PageRank Algorithm (paper)
  • How to Calculate Wout

Your task is to implement a program in a file called pagerank.c that calculates Weighted PageRanks for a collection of pages.

The URLs of the pages in the collection are given in the file collection.txt. The URLs in collection.txt are separated by whitespace, and may be spread across multiple lines. Here is an example collection.txt:
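For instance, a collection of eight pages might be listed as:

url25 url31 url2
url102 url78
url32 url98 url33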

To get the name of the file that contains the page associated with some URL, simply append .txt to the URL. For example, the page associated with the URL url24 is contained in url24.txt .
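As a minimal sketch (the function name openPageFile is illustrative, not part of the spec), building the filename and opening the page file might look like this:

    #include <stdio.h>

    // Open the page file for a given URL by appending ".txt".
    // Assumes URLs are at most 100 characters (see the assumptions below).
    static FILE *openPageFile(const char *url) {
        char filename[105];                       // 100 chars + ".txt" + '\0'
        snprintf(filename, sizeof(filename), "%s.txt", url);
        return fopen(filename, "r");              // NULL if the file cannot be opened
    }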

For each URL in the collection file, you need to read its corresponding page and build a graph structure using a graph representation of your choice . Then, you must use the algorithm below to calculate the Weighted PageRank for each page.

The simplified Weighted PageRank algorithm you need to implement is shown below. Please note that the formula to calculate PageRank values is slightly different to the one provided in the above paper.

  • Read web pages from the collection in file collection.txt and build a graph structure using a graph representation of your choice
  • \( N \) = number of URLs in the collection
  • For each URL \( p_i \) in the collection: \( PR(p_i, 0) = \frac{1}{N} \)
  • \( iteration = 0 \)
  • \( diff = diffPR \) (this ensures the loop below is entered)
  • While \( iteration < maxIterations \) and \( diff \geq diffPR \):
      • \( t = iteration \)
      • For each URL \( p_i \) in the collection $$ PR(p_i, t + 1) = \frac{1 - d}{N} + d \ast \sum_{p_j \in M(p_i)} PR(p_j, t) \ast W_{(p_j, p_i)}^{in} \ast W_{(p_j, p_i)}^{out} $$
      • $$ diff = \sum_{i = 1}^{N} |PR(p_i, t + 1) - PR(p_i, t)| $$
      • \( iteration \)++

In the above algorithm:

  • \( M(p_i) \) is a set of pages with outgoing links to \( p_i \) (ignore self-loops and parallel edges)
  • \( W_{(p_j, p_i)}^{in} \) and \( W_{(p_j, p_i)}^{out} \) are defined in the paper above (see the link "Weighted PageRank")
  • \( t \) and \( t + 1 \) correspond to iteration numbers. For example, \( PR(p_i, 2) \) means "the PageRank of \( p_i \) after iteration 2".

Note: When calculating \( W_{(p_j, p_i)}^{out} \), if a page \( k \) has zero outdegree (zero outlinks), \( O_k \) should be 0.5, not zero. This will avoid issues related to division by zero.
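To make the above concrete, here is a minimal sketch of the iteration, assuming an adjacency-matrix representation (adj[j][i] is non-zero when page j links to page i, with self-loops and duplicate links already removed) and hypothetical helpers calcWin and calcWout for \( W_{(p_j, p_i)}^{in} \) and \( W_{(p_j, p_i)}^{out} \); none of these names are mandated by the spec:

    // Sketch only; requires <math.h> for fabs. oldPR and newPR are arrays of length N.
    for (int i = 0; i < N; i++) oldPR[i] = 1.0 / N;
    int iteration = 0;
    double diff = diffPR;                       // guarantees at least one pass
    while (iteration < maxIterations && diff >= diffPR) {
        for (int i = 0; i < N; i++) {
            double sum = 0.0;
            for (int j = 0; j < N; j++) {
                if (j != i && adj[j][i]) {      // p_j is in M(p_i)
                    sum += oldPR[j] * calcWin(j, i) * calcWout(j, i);
                }
            }
            newPR[i] = (1.0 - d) / N + d * sum;
        }
        diff = 0.0;
        for (int i = 0; i < N; i++) {
            diff += fabs(newPR[i] - oldPR[i]);
            oldPR[i] = newPR[i];
        }
        iteration++;
    }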

Your program in pagerank.c must take three command-line arguments: d (damping factor), diffPR (sum of PageRank differences) and maxIterations (maximum number of iterations). Using the algorithm described in this section, it must calculate the Weighted PageRank for every web page in the collection.

Here is an example of how the program is run:
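./pagerank 0.85 0.00001 1000

Here d = 0.85, diffPR = 0.00001 and maxIterations = 1000.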

Each of the provided expected output files was generated using the above command.

Your program should output a list of URLs, one per line, to a file named pagerankList.txt, along with some information about each URL. Each line should begin with a URL, followed by its outdegree and Weighted PageRank (using the format string "%.7lf"). The values should be comma-separated, with a space after each comma. Lines should be sorted in descending order of Weighted PageRank. URLs with the same Weighted PageRank should be ordered alphabetically (ascending) by the URL.
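A minimal sketch of producing this ordering and format, assuming a hypothetical struct PageInfo that holds each page's URL, outdegree and PageRank (the struct and function names are illustrative only):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    struct PageInfo {
        char url[101];
        int outdegree;
        double pagerank;
    };

    // Descending by Weighted PageRank, then ascending by URL (strcmp order).
    static int cmpPage(const void *a, const void *b) {
        const struct PageInfo *x = a, *y = b;
        if (x->pagerank > y->pagerank) return -1;
        if (x->pagerank < y->pagerank) return 1;
        return strcmp(x->url, y->url);
    }

    // pages is an array of N entries with PageRanks already computed.
    static void writePagerankList(struct PageInfo *pages, int N) {
        qsort(pages, N, sizeof(struct PageInfo), cmpPage);
        FILE *out = fopen("pagerankList.txt", "w");
        for (int i = 0; i < N; i++) {
            fprintf(out, "%s, %d, %.7lf\n",
                    pages[i].url, pages[i].outdegree, pages[i].pagerank);
        }
        fclose(out);
    }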

Here is an example pagerankList.txt. Note that this was not generated from any of the provided test files; it is just an example to demonstrate the output format.
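url31, 3, 0.2623546
url21, 1, 0.1843112
url34, 6, 0.1576851
url22, 4, 0.1520093
url32, 6, 0.0925755
url23, 4, 0.0776758
url11, 3, 0.0733884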

We will use a tolerance of \( 10^{-4} \) to check PageRank values when automarking. This means your answer for a PageRank value will be accepted if it is within \( 10^{-4} \) of the expected value.

Assumptions/Constraints

  • All URLs in collection.txt are distinct.
  • URLs consist of alphanumeric characters only, and are at most 100 characters in length.
  • URLs do not necessarily start with url .
  • Web pages may have links to themselves and multiple links to the same web page. You must completely ignore self links and duplicate links, including when calculating the outdegree of each page.
  • There will be no missing web pages. This means each URL referenced in collection.txt or as an outlink in a web page will have a text file associated with it that is appropriately named. Also, every URL referenced as an outlink in a web page will be listed in collection.txt .
  • The provided command-line arguments will be valid. This means \( d \) will be a double between 0 and 1, \( diffPR \) will be a double greater than 0, and \( maxIterations \) will be a positive integer.
  • When we say "ordered alphabetically", we mean as determined by strcmp , which will compare the ASCII values of the characters.
  • All the text files required by the program will be in the current working directory when the program is executed. For Part 1, this includes the collection.txt and all the files containing the web pages.

Part 2 - Simple Search Engine

Your task is to implement a simple search engine in a file called searchPagerank.c that takes one or more search terms and outputs relevant URLs.

The program must make use of the data available in two files: invertedIndex.txt and pagerankList.txt. No other text files are relevant to this task.

invertedIndex.txt contains data about what pages contain what words. Each line of the file contains one word followed by an alphabetically ordered list of URLs of pages that contain that word. Lines are ordered alphabetically by the initial word. Here is an example invertedIndex.txt :
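design url2 url25 url31
mars url101 url25 url31
vegetation url31 url61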

pagerankList.txt contains information about web pages including their URL, outdegree and Weighted PageRank. The format of this file is the same as that produced by your pagerank program in Part 1.

The program should take search terms as command-line arguments, find pages with one or more matching search terms and output the top 30 pages in descending order of the number of matching search terms to stdout. Pages with the same number of matching search terms should be ordered in descending order by their Weighted PageRank. Pages with the same number of matching search terms and the same Weighted PageRank should be sorted in increasing lexicographic order by URL.

Here is an example output. Note that this was not necessarily generated from any of the provided test files; it is just an example to demonstrate the output format.
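For instance, with the invertedIndex.txt shown above, running ./searchPagerank mars design might produce:

url31
url25
url2
url101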

  • Relevant assumptions from previous parts apply.
  • Search terms will be unique.
  • At least one search term will be given to the searchPagerank program.
  • Search terms will consist entirely of lowercase letters.
  • The maximum length of a line in invertedIndex.txt is 1000 characters.
  • Words in invertedIndex.txt will consist entirely of lowercase letters.
  • It is not guaranteed that all the search terms will be in invertedIndex.txt . Search terms that are not in invertedIndex.txt should be ignored, since there are no pages that contain them.
  • pagerankList.txt will contain correct URLs, outdegrees and PageRanks. If you do not produce the correct output in Part 1 then you should copy the expected result from pagerankList.exp into pagerankList.txt before testing.
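Given the assumption above that lines in invertedIndex.txt are at most 1000 characters, here is a minimal sketch of reading the file line by line and tokenising each line with strtok (what you do with each word/URL pair is up to you; the printf is just a placeholder):

    #include <stdio.h>
    #include <string.h>

    // Read invertedIndex.txt: the first token on each line is a word, and the
    // remaining tokens are the URLs of the pages containing that word.
    static void readInvertedIndex(const char *filename) {
        FILE *fp = fopen(filename, "r");
        if (fp == NULL) return;
        char line[1002];                              // 1000 chars + '\n' + '\0'
        while (fgets(line, sizeof(line), fp) != NULL) {
            char *word = strtok(line, " \t\n");
            if (word == NULL) continue;               // skip blank lines
            for (char *url = strtok(NULL, " \t\n"); url != NULL;
                 url = strtok(NULL, " \t\n")) {
                printf("%s appears in %s\n", word, url);
            }
        }
        fclose(fp);
    }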

Part 3 - Hybrid/Meta Search Engine using Rank Aggregation

  • How to Get Started with Part 3

Your task is to implement a program in scaledFootrule.c that combines search results (ranks) from multiple sources using Scaled Footrule Rank Aggregation, described below.

Scaled Footrule Rank Aggregation

Suppose that \( \tau_1 \), \( \tau_2 \), ... , \( \tau_k \) are search results (rankings) obtained using \( k \) different criteria. We want to aggregate these results into a single search result that agrees with the original \( k \) results as much as possible.

A weighted bipartite graph for scaled footrule optimization \((C, P, W)\) is defined as:

  • \( C \) is the set of pages to be ranked (say, a union of \( \tau_1 \), \( \tau_2 \), ... , \( \tau_k \))
  • \( P \) is the set of positions available (say, \(\{1, 2, 3, 4, 5\}\))
  • \( n \) is the cardinality (size) of \( C \)
  • \( |\tau_i| \) is the cardinality (size) of \( \tau_i \)
  • \( \tau_i(c) \) is the position of \( c \) in \( \tau_i \)
  • \( k \) is the number of rank lists
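The weight \( W(c, p) \) is the scaled-footrule cost of placing page \( c \) at position \( p \). Reconstructing it from the definitions above and the rank-aggregation formulation referenced earlier (check the linked material for the official statement), it is $$ W(c, p) = \sum_{i = 1}^{k} \left| \frac{\tau_i(c)}{|\tau_i|} - \frac{p}{n} \right| $$ where the sum is taken over the rank lists \( \tau_i \) that actually contain \( c \).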

For example:

[Figure: a weighted bipartite graph matching the pages in \( C \) to the positions in \( P \), with edge weights \( W(c, p) \); in this example P = [1, 3, 2, 5, 4].]

The final ranking is derived by finding an assignment of positions 'P' such that the total scaled-footrule distance is minimised.

There are many different ways to assign possible values for 'P'. In the above example, P = [1, 3, 2, 5, 4]. Some other possible values are: P = [1, 2, 4, 3, 5], P = [5, 2, 1, 4, 3], P = [1, 2, 3, 4, 5], etc. When n = 5 (i.e., the number of distinct URLs is 5) there are 5! possible alternatives. When n = 10, there are 10! possible alternatives, which is 3,628,800. A very simple and obviously inefficient approach could use brute-force search: generate all possible alternatives, calculate the total scaled-footrule distance for each alternative, and find the alternative with the minimum scaled-footrule distance.
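A minimal sketch of evaluating one candidate assignment, assuming positions are stored so that ranks[i][c] is the 1-based position of page c in rank list i (or 0 if c does not appear in that list); all names here are illustrative:

    #include <math.h>

    // ranks[i][c]: 1-based position of page c in rank list i, 0 if absent
    // size[i]:     number of pages in rank list i
    // P[c]:        candidate position (1-based) assigned to page c
    // n:           number of distinct pages, k: number of rank lists
    static double totalFootrule(int **ranks, int *size, int *P, int n, int k) {
        double total = 0.0;
        for (int c = 0; c < n; c++) {
            for (int i = 0; i < k; i++) {
                if (ranks[i][c] != 0) {               // page c appears in list i
                    total += fabs((double) ranks[i][c] / size[i]
                                  - (double) P[c] / n);
                }
            }
        }
        return total;
    }

A brute-force solver would call this for every permutation P and keep the minimum; a "smart" algorithm prunes or avoids most of those permutations.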

If you use such a brute-force search, you will receive a maximum of 65% for Part 3. However, if you implement a "smart" algorithm that avoids generating unnecessary alternatives, you can receive up to 100% for Part 3. You must document your algorithm such that your tutor can easily understand your logic, and clearly outline how your algorithm reduces the search space; otherwise you will not be awarded marks for your algorithm. Yes, it's only 35% of the Part 3 marks, but if you try it, you will find it very challenging and rewarding.

Note: You are not permitted to adapt or copy algorithms from the Internet for this task.

You need to write a program in scaledFootrule.c that aggregates ranks from files given as command-line arguments, and outputs an aggregated rank list with the minimum scaled footrule distance.

Suppose that the file rankA.txt contains the following:
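url1
url3
url5
url4
url2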

And suppose that the file rankB.txt contains the following:
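url3
url2
url1
url4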

The following command will read ranks from the files rankA.txt and rankB.txt and output the minimum scaled footrule distance (using the format "%.6lf") on the first line, followed by the corresponding aggregated rank list.
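./scaledFootrule rankA.txt rankB.txt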

Two possible values for P with minimum distance are:
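C = [url1, url2, url3, url4, url5]: P = [1, 4, 2, 5, 3] and P = [1, 5, 2, 4, 3], both giving a minimum scaled footrule distance of 1.400000.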

Note that there may be multiple values of P that could result in the minimum scaled footrule distance. You only need to output one such list.

P = [1, 4, 2, 5, 3] corresponds to the following list:
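1.400000
url1
url3
url5
url2
url4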

P = [1, 5, 2, 4, 3] corresponds to the following list:
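1.400000
url1
url3
url5
url4
url2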

Your program should be able to handle more than two rank files, for example:
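./scaledFootrule rankA.txt rankB.txt newSearchRank.txt myRank.txt

(the extra file names here are only illustrative)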

Assumptions

  • All given rank files will be distinct.
  • Each ranking file will contain exactly one URL per line, and no other text. Furthermore, lines do not contain leading or trailing whitespace.

Implementation details.

Debugging is an important and inevitable aspect of programming. We expect everyone to have an understanding of basic debugging methods, including using print statements, knowing basic GDB commands and running Valgrind. In the forum and in help sessions, tutors will expect you to have made a reasonable attempt to debug your program yourself using these methods before asking for help, and you should be able to explain what you found during your debugging attempts.

You can learn about GDB and Valgrind in the Debugging with GDB and Valgrind lab exercise.

We have provided some test files for you to test your code for Part 1. Please note that you are expected to develop your own test cases for Parts 2 and 3 .

Here is an example of how to test Part 1.
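For instance, assuming the expected output in each test directory is named pagerankList.exp (as referenced in the Part 2 assumptions):

    cd ex1
    ./pagerank 0.85 0.00001 1000
    diff pagerankList.txt pagerankList.exp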

Note that you can test with other values of \( d \), \( diffPR \) and \( maxIterations \), but you will need to verify the output yourself, as there is no expected output for other values.

Here is an example of how to test Part 2.

Obviously, you can test with other search terms and other numbers of search terms, and you will also need to verify the output of this yourself.

Here is an example of how to test Part 3.

Note that if you create the rank files in your assignment 2 directory and not in any of the test directories (such as ex1 ), then you don't need to change into the test directories before running ./scaledFootrule .

You can also test with more rank files, for example:

Frequently Asked Questions

  • Can we assume X? If the spec does not say you can assume something, then you cannot assume it. However, if you have a reason to think that you should be able to assume something, but the spec does not mention it, then make a post on the forum.
  • Are we allowed to use helper functions? Of course. You may implement helper functions in any .c file.
  • Are we allowed to #include any additional libraries? No. However, you are allowed to #include your own .h files.
  • The specification states that we can assume something. Do we need to check whether this assumption is satisfied? No. If the specification says you can assume something, it means you don't need to check it.
  • What errors do we need to handle? No error handling is necessary. However, it is good practice to handle common errors such as: invalid number of command-line arguments, invalid command-line arguments, NULL returned from fopen , NULL returned from malloc , even though this is not required in this assignment.
  • Are we allowed to share tests? No. Testing and coming up with test cases is an important part of software design. Students that test their code more will likely have more correct code, so to ensure fairness, each student should independently develop their own tests.
  • What is the point of Section 2 of the web pages? Section 2 is not used in the assignment, since we have already provided invertedIndex.txt .

You must submit the three files: pagerank.c , searchPagerank.c and scaledFootrule.c . In addition, you can submit as many supporting files ( .c and .h ) as you want. Please note that none of your supporting files should contain a main function - only pagerank.c , searchPagerank.c and scaledFootrule.c should contain a main function.

You can submit via the command line using the give command:

or you can submit via WebCMS. You may not submit any other files. You can submit multiple times. Only your last submission will be marked. You can check the files you have submitted here .

Compilation

You must ensure that your final submission compiles on CSE machines. Submitting non-compiling code leads to extra administrative overhead and will result in a 10% penalty.

Here we describe how we will compile your files during autotesting.

To test Part 1, we will rename searchPagerank.c and scaledFootrule.c so they do not end with .c and then compile all the remaining .c files together as follows to produce an executable called pagerank .

To test Part 2, we will rename pagerank.c and scaledFootrule.c so they do not end with .c and then compile all the remaining .c files together as follows to produce an executable called searchPagerank .

To test Part 3, we will rename pagerank.c and searchPagerank.c so they do not end with .c and then compile all the remaining .c files together as follows to produce an executable called scaledFootrule .
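For reference, a command of roughly the following form has been used for this in the past (the exact flags may differ):

    gcc -Wall -lm -std=c11 *.c -o pagerank

with -o searchPagerank and -o scaledFootrule used for Parts 2 and 3 respectively.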

Every time you make a submission, a dryrun test will be run on your code to ensure that it compiles. Please ensure that your final submission successfully compiles, even for parts that you have not completed.

This assignment will contribute 20% to your final mark.

Correctness

80% of the marks for this assignment will be based on the correctness and performance of your code, and will be mostly based on autotesting. Marks for correctness will be distributed as follows:

  • Part 1: 50% of the correctness mark
  • Part 2: 25% of the correctness mark
  • Part 3: 25% of the correctness mark

Note: A correct brute-force algorithm is worth 65% of the Part 3 mark. You will only receive 100% of it if you devise a "smart" algorithm that avoids checking all possible permutations. See Part 3 for details.

Additionally, you must ensure that your code does not contain memory errors or leaks, as code that contains memory errors is unsafe and it is bad practice to leave memory leaks in your program. Submissions that contain memory errors or leaks will receive a penalty of 10% of the correctness mark. Specifically, there will be a separate memory error/leak check for each part, and the penalty will be 10% of the marks for that part. You can check for memory errors and leaks with the valgrind command:
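For example, something along these lines for each executable (the exact options used by the markers may differ):

    valgrind --leak-check=full ./pagerank 0.85 0.00001 1000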

Programs that contain no memory errors or leaks should produce no output from these commands. You can learn more about memory errors and leaks in the Debugging with GDB and Valgrind lab exercise.

20% of the marks for this assignment will come from hand marking of the readability of the code you have written. These marks will be awarded on the basis of clarity, commenting, elegance and style. The following is an indicative list of characteristics that will be assessed, though your program will be assessed holistically so other characteristics may be assessed too (see the style guide for more details):

  • Good separation of logic into files
  • Consistent and sensible indentation and spacing
  • Using blank lines and whitespace
  • Using functions to avoid repeating code
  • Decomposing code into functions and not having overly long functions
  • Using comments effectively and not leaving planning or debugging comments

The course staff may vary the assessment scheme after inspecting the assignment submissions but it will remain broadly similar to the description above.

Originality of Work

This is an individual assignment. The work you submit must be your own work and only your work apart from exceptions below. Joint work is not permitted. At no point should a student read or have a copy of another student's assignment code.

You may use small amounts (< 10 lines) of general purpose code (not specific to the assignment) obtained from sites such as Stack Overflow or other publicly available resources. You should clearly attribute the source of this code in a comment with it.

You are not permitted to request help with the assignment apart from in the course forum, help sessions or from course lecturers or tutors.

Do not provide or show your assignment work to any other person ( including by posting it on the forum) apart from the teaching staff of COMP2521. When posting on the course forum, teaching staff will be able to view the assignment code you have recently submitted with give.

The work you submit must otherwise be entirely your own work. Submission of work partially or completely derived from any other person or jointly written with any other person is not permitted. The penalties for such an offence may include negative marks, automatic failure of the course and possibly other academic discipline. Assignment submissions will be examined both automatically and manually for such issues.

Relevant scholarship authorities will be informed if students holding scholarships are involved in an incident of plagiarism or other misconduct. If you knowingly provide or show your assignment work to another person for any reason, and work derived from it is submitted, then you may be penalised, even if the work was submitted without your knowledge or consent. This may apply even if your work is submitted by a third party unknown to you.

COMP2521: Assignment 2 Simple Search Engines

[The specification may change. Please check the change log on this page.]

  • [08:00am 11/Oct] : In 1.B, for the expected output, earlier "url101" was at the end (incorrectly); revised to "mars url101 url25 url31"
  • [08:00am 11/Oct] : In the 1.A algorithm, "iteration++;" is moved to the end of the loop body to improve readability
  • Marks: 20 marks
  • Groups: This assignment is completed in groups, based on your current lab group.
  • Due: 11:59:00 pm Thursday Week-13
  • Late penalty: 2 marks per day off the ceiling. Last day to submit this assignment is 5pm Friday Week-13, of course with late penalty.
  • Submit: Read instructions in the section below.
  • PageRank (read up to the section "Damping factor")
  • Section-1 contains urls representing outgoing links. Urls are separated by one or more blanks, across multiple lines.
  • Section-2 contains selected words extracted from the url. Words are separated by one or more spaces, spread across multiple lines.

Hint: If you use fscanf to read the body of a section above, you do not need to impose any restriction on line length. I suggest you should try to use this approach - use fscanf! However, if you want to read line by line using, say, fgets, you can assume that the maximum length of a line would be 1000 characters.

Hint: You need to use dynamic data structure(s) to handle words within a file and across files; there is no need to know the maximum number of words beforehand.

Example file url31.txt

In Part-1: Graph structure-based search engine, you need to create a graph structure that represents a hyperlink structure of given collection of "web pages" and for each page (node in your graph) calculate Weighted PageRank and other graph properties. You need to create "inverted index" that provides a list of pages for every word in a given collection of pages. Your graph-structure based search engine will use this inverted index to find pages where query term(s) appear and rank these pages using their Weighted PageRank values.

In Part-2: Content-based search engine, you need to calculate tf-idf values for each query term in a page, and rank pages based on the summation of tf-idf values for all query terms. Use "inverted index" you created in Part-1 to locate matching pages for query terms.

Additional files: You can submit additional supporting files, *.c and *.h , for this assignment. For example, you may implement your graph adt in files graph.c and graph.h and submit these two files along with other required files as mentioned below.

Sample files

  • Sample1.zip

Part-1: Graph structure-based Search Engine

  • Hints on "How to Implement Ass2 (Part-1)" , to be discussed in the lecture.

Please follow the hints provided in the above documents; later in week-10 we will release APIs (files pagerank.h, invertedIndex.h and searchPagerank.h) based on the hints provided above, and you must implement the required functions in your .c files (pagerank.c, invertedIndex.c and searchPagerank.c) for this part. We will individually test these functions when awarding marks.

A: Calculate Weighted PageRanks

You need to write a program in the file pagerank.c that reads data from a given collection of pages in the file collection.txt and builds a graph structure using an Adjacency Matrix or List Representation. Using the algorithm described below, calculate the Weighted PageRank for every url in the file collection.txt. In this file, urls are separated by one or more spaces and/or newline characters. Add the suffix .txt to a url to obtain the file name of the corresponding "web page". For example, the file url24.txt contains the required information for url24.

Example file collection.txt


Your program in pagerank.c will take three arguments ( d - damping factor, diffPR - difference in PageRank sum, maxIterations - maximum iterations) and using the algorithm described in this section, calculate Weighted PageRank for every url.

For example,

Your program should output a list of urls in descending order of Weighted PageRank values (use the format string "%.7f") to a file named pagerankList.txt. The list should also include the outdegree (number of outgoing links) for each url, along with its Weighted PageRank value. The values in the list should be comma separated. For example, pagerankList.txt may contain the following:

Example file pagerankList.txt

Sample Files for 1A

You can download the following three sample files with expected pagerankList.txt files. For your reference, I have also included the file " log.txt " which includes values of Win, Wout, etc. Please note that you do NOT need to generate such a log file.

Use format string "%.7f" to output pagerank values. Please note that your pagerank values might be slightly different to that provided in these samples. This might be due to the way you carry out calculations. However, make sure that your pagerank values match to say first 6 decimal points to the expected values. For example, say an expected value is 0.1843112, your value could be 0.184311x where x could be any digit.

All the sample files were generated using the following command: % pagerank 0.85 0.00001 1000

B: Inverted Index

In each sorted list (set), duplicate urls are not allowed. Your program should output this "inverted index" to a file named invertedIndex.txt . One line per word, words should be alphabetically ordered, using ascending order. Each list of urls (for a single word) should be alphabetically ordered, using ascending order.

Example file invertedIndex.txt

C: Search Engine

Write a simple search engine in the file searchPagerank.c that, given search terms (words) as command-line arguments, finds pages with one or more search terms and outputs (to stdout) the top 30 pages in descending order of the number of search terms found and then, within each group, in descending order of Weighted PageRank. If the number of matches is less than 30, output all of them.

Your program must use the data available in the two files invertedIndex.txt and pagerankList.txt, and must derive its result from them. We will test this program independently of your solutions for "A" and "B".

Part-2: Content-based Search Engine

Please follow the hints provided in the document in Part-1; later in week-10 we will release an API (file searchTfIdf.h) based on the hints provided above, and you must implement the required functions in your searchTfIdf.c for this part. We will individually test these functions when awarding marks.

In this part, you need to implement a content-based search engine that uses tf-idf values of all query terms for ranking. You need to calculate tf-idf values for each query term in a page, and rank pages based on the summation of tf-idf values for all query terms. Use "inverted index" you created in Part-1 to locate matching pages for query terms.

Read the following wikipedia page that describes how to calculate tf-idf values:


Write a content-based search engine in file searchTfIdf.c that given search terms (words) as commandline arguments, outputs (to stdout) top 30 pages in descending order of number of search terms found and then within each group, descending order of summation of tf-idf values of all search terms found. Your program must also output the corresponding summation of tf-idf along with each page, separated by a space and using format "%.6f", see example below .

If the number of matches is less than 30, output all of them. Your program must use the data available in the two files invertedIndex.txt and collection.txt, and must derive its result from them. We will test this program independently of your solutions for Part-1.

Submit (Part-2)

... tba ...

Part-3: Hybrid/Meta Search Engine using Rank Aggregation

Please follow the hints provided in the document for Part-3; later in week-10 we will release an API (file scaledFootrule.h) based on the hints, and you must implement the required functions in your scaledFootrule.c for this part. We will individually test these functions when awarding marks.

In this part, you need to combine search results (ranks) from multiple sources (say from Part-1 and Part-2) using " Scaled Footrule Rank Aggregation " method, described below. All the required information for this method are provided below. However, if you are interested, you may want to check out this presentation on "Rank aggregation method for the web" .

Let T1 and T2 be search results (ranks) obtained using two different criteria (say Part-1 and Part-2). Please note that we could use any suitable criteria, including manually generated rank lists.

A weighted bipartite graph for scaled footrule optimization (C,P,W) is defined as,

  • C = set of nodes to be ranked (say a union of T1 and T2)
  • P = set of positions available (say {1, 2, 3, 4, 5})


  • k is number of rank lists.


If you use such a brute-force search, you will receive a maximum of 65% for Part-3. However, you will be rewarded 100% for Part-3 if you implement a "smart" algorithm that avoids generating unnecessary alternatives in the process of finding the minimum scaled-footrule distance. Please document your algorithm such that your tutor can easily understand your logic, and clearly outline how you plan to reduce the search space, otherwise you will not be awarded marks for your "smart" algorithm! Yes, it's only 35% of part-3 marks, but if you try it, you will find it very challenging and rewarding.

Write a program scaledFootrule.c that aggregates ranks from files given as commandline arguments, and output aggregated rank list with minimum scaled footrule distance.

  • How to Get Started, Part-3

Example, file rankA.txt

Example, file rankD.txt

The following command will read ranks from files "rankA.txt" and "rankD.txt" and outputs minimum scaled footrule distance (using format %.6f) on the first line, followed by the corresponding aggregated rank list.

Two possible values of P with minimum distance are: C = [url1, url2, url3, url4, url5] P = [1, 4, 2, 5, 3] and P = [1, 5, 2, 4, 3]

By the way, you need to select any one of the possible values of P that has minimum distance, so there could be multiple possible answers. Note that you need to output only one such list.

One possible answer for the above example, for P = [1, 4, 2, 5, 3] :

Another possible answer for the above example, P = [1, 5, 2, 4, 3] :

Please note that your program should also be able to handle multiple rank files, for example:

Assignment-2 Group Creation

Please create your Assignment-2 group under the group " ass2grp ", click on "Groups" in the left panel on the class web page.

  • You must create your group by 5pm Monday Week-12.
  • Your group name must start with an alphabetic character, followed by alphanumeric characters (no spaces or special characters; the characters _ and - are allowed), and be at most 15 characters long.

If your group name does not satisfy the above criteria, we will not be able to mark your Assignment-2!

Later, instructions on how to create groups will be posted here.

Additional files: You can submit additional supporting files, *.c and *.h , for this assignment.

IMPORTANT: Make sure that your additional files (*.c) DO NOT have "main" function.

For example, you may implement your graph adt in files graph.c and graph.h and submit these two files along with other required files as mentioned below. However, make sure that these files do not have "main" function.

I explain below how we will test your submission; hopefully this will answer all of your questions.

You need to submit the following files, along with your supporting files (*.c and *.h):

  • pagerank.c
  • inverted.c
  • searchPagerank.c
  • searchTfIdf.c
  • scaledFootrule.c

Now say we want to mark your pagerank.c program. The auto marking program will take all your supporting files (other *.h and *.c files), along with pagerank.c, and execute the following command to generate an executable file called, say, pagerank. Note that the other four files from the above list (inverted.c, searchPagerank.c, searchTfIdf.c and scaledFootrule.c) will be removed from the dir:

So we will not use your Makefile (if any). The above command will generate object files from your supporting files and the file to be tested (say pagerank.c), link these object files, and generate an executable file, say pagerank in the above example. Again, please make sure that you DO NOT have a main function in your supporting files (the other *.c files you submit).

We will use similar approach to generate other four executables (from inverted.c , searchPagerank.c , searchTfIdf.c and scaledFootrule.c ).

How to Submit

Later, instructions on how to submit will be posted here.

Go to the following page, select the tab "Make Submission", select "Browse" to select all the files you want to submit, and submit using the "Submit" button. The submission system will try to compile each required file, and report the outcome (ok or error). Please check the output and correct any errors. If you do not submit a file(s) for a task(s), it will be reported as an error(s).

You can now submit this assignment , click on " Make Submission " tab, and follow the instructions.

This is a group assignment. You are not allowed to use code developed by persons other than those in your group. In particular, it is not permitted to exchange code or pseudocode between groups. You are allowed to use code from the course material (for example, available as part of the labs, lectures and tutorials). If you use code from the course material, please clearly acknowledge it by including a comment(s) in your file. If you have questions about the assignment, ask your tutor.

Before submitting any work you should read and understand the sub section named Plagiarism in the section " Student Conduct " in the course outline. We regard unacknowledged copying of material, in whole or part, as an extremely serious offence. For further information, see the course outline .

COMP9024 18x1


Assignment 2: Simple Graph structure-based Search Engine

Please carefully read the instructions in the section "Submission" at:

http://www.cse.unsw.edu.au/~cs9024/18x1/assn/assn2/index.html#submission

  • You can submit text files - *.c and *.h , do NOT submit *.zip, *.gz, *.tar, etc. files!
  • All files must be in one directory, and you need to select all the files you want to submit.
  • Do not submit non-text files, you can only submit *.c or *.h files.
  • You can submit this assignment multiple times , your last submission will be assessed (marked).
  • Every time you submit, you need to submit all the required files.
  • Carefully read messages after submitting your files, and if required, correct any errors and resubmit all the required files.


website: http://www.cse.unsw.edu.au/~cs9024/18x1/assn/assn2/index.html


Search Engine

Assignment written by Julie Zelenski

An introduction to search engines


Search engines are one of the most influential developments of the modern internet age and have completely revolutionized how users interact with the web. What was once an intractable jumble of data with no semblance of organization has become a treasure trove of useful information due to the magic of search. Armed with the power of Map and Set , you can recreate this phenomenal achievement yourself. Once completed, you will understand the core technology underpinning Spotlight, Google, and Bing, and have solid practice in the use of the Map and Set ADTs.

Want to see the power of search right now? Click 🔍 Search in the top-right of the page navigation bar and search for easter egg to see what lurks deep in the recesses of the course website…

In your search engine, each web page has a URL ("Uniform Resource Locator") that serves as its unique id and a string containing the body text of the page. You will first write functions that pre-process the body text and populate the data structure. Next you'll implement the function to search for pages matching a search query. Finally, you will write a console program that allows the user to enter many search queries and retrieve the matching web pages. Put all together, you will have built your own mini search engine!

Understanding the web page data

We have harvested the body text from the pages of our course website into a database file that is provided in the starter project. The format of the database file is as follows:

  • The first line of a pair is a page URL.
  • The second line of a pair is the body text of that page, with all newlines removed (the entire text of the page in a single string).
  • The first two lines in the file form the first pair. The third and fourth lines form another pair, the fifth and sixth another, and so on, with alternating lines of page URL and page content.

To view an example database, open the file tiny.txt or website.txt in the folder Other files/res of the starter project.

Using an inverted index for searching

To make our search efficient, we need to be thoughtful about how we structure and store the data. A poor choice in data structures would make search painfully slow, but clever use of our wonderful ADTs can avoid that fate. A search engine typically uses a nifty arrangement known as an inverted index .

An inverted index is akin to the typical index in the back of a book. If you look up the keyword "internet" in the index of the CS106B textbook , it lists two page numbers, 18 and 821. The word internet occurs on page number 18 and again on page number 821. Thus, an inverted index is a mapping from word to locations. You look up a keyword in an inverted index to find all the locations where that keyword occurs.

In contrast a forward index operates in the other direction. For our book example, the forward index would be a mapping from a location (page number) to all the words that occur on that page. To build an inverted index, you typically start with the data in the form of a forward index and then process to convert to the inverted form. Inverting the index takes time up front, but once complete, the inverted index supports extremely fast operations to find query matches.

On to the code!

Decomposition is one of your best tools for tackling a complex problem. We'll guide you along by breaking the problem down into a sequence of steps. Follow our steps to success!

All of the functions and tests for the search engine are to be implemented in the file search.cpp.

1) Write helper function cleanToken ()

Start by writing a helper function for this small but key string-processing task:

cleanToken takes in a string token and returns a "cleaned" version of the token, ready to be stored in the index. To clean a token, extract only the letter and number characters and convert all letters to lowercase, i.e. "Wow!" becomes "wow", "mid-quarter" becomes "midquarter", and "CS106B" becomes "cs106b". Standardizing on this simple canonical form allows search queries to operate case-insensitively and ignore punctuation.

You might want to start by reworking the code from the lettersOnly function you wrote for Soundex and change/extend as necessary. The return value from cleanToken is the cleaned lowercase version, or will be empty string if the token is to be discarded (i.e. contains no letter or number characters).

Write test cases to confirm that cleanToken works correctly in all situations before moving on! You will make heavy use of cleanToken when building the inverted index and if this helper mishandles tokens, it will throw off your results and lead to sad times. It is definitely worth your time to confirm with a solid set of test cases before moving on. Remember to label your tests as STUDENT_TEST .

2) Write helper function gatherTokens ()

The helper function gatherTokens extracts the set of unique tokens from the body text.

The argument to gatherTokens is a string containing the body text from a single web page. The function returns a Set of the unique cleaned tokens that appear in the body text.

The function first tokenizes the body text — this means to separate the string into words using the space character as delimiter. Call the stringSplit function from the Stanford library for this task. Then call your cleanToken helper function on each token and store the cleaned tokens into a Set. No matter how many repeat occurrences of a given word are in the body text, repeatedly adding it to a set stores just the one copy, which makes this ADT ideal for gathering the unique tokens.

Time to test! Add test cases that confirm the output from gatherTokens , so you will later be able to call on this function with confidence that it does its job correctly.

3) Create inverted index in buildIndex ()

The function buildIndex reads the content from the database file and processes it into the form of an inverted index.

The first argument to buildIndex is the name of the database file of the web page data, the second argument is the Map to be populated with data for the inverted index. The return value of buildIndex is the number of documents processed from the database file.

Before starting to code, work through a small example on paper to ensure you understand the data structure you are trying to build. Open the res/tiny.txt database file and manually process the content to build the inverted index.

Q7. List the contents of the inverted index built from the res/tiny.txt database file.

When you are ready to start writing code, read the previous section Understanding the web page data to review the format of the database file. Look at the code we provided for readMazeFile in maze.cpp for an example of C++ code to open a file and read the contents into a Vector of lines. Feel free to reuse that code for buildIndex.

For each page, you will call gatherTokens to extract the set of unique tokens. For each token in the page, you record the match to the page's URL by adding to its entry in the inverted index. The index is of type Map<string, Set<string>>; this map associates each keyword with the set of the URLs where that word occurs. The function returns the count of web pages that were processed.

Our starter code contains some provided tests to get you started; add student tests of your own to ensure your coverage is comprehensive. Use a TIME_OPERATION test to confirm your function operates in reasonable time. Building the inverted index should generally not take more than 5 seconds.

4) Search using findQueryMatches ()

Next up is to implement the function:

The query string can either be a single search term or a compound sequence of multiple terms. A search term is a single word, and a sequence of search terms is multiple consecutive terms, each of which (besides the first one) may or may not be preceded by a modifier like + or - (see below for details). The same stringSplit function used to divide the body text into tokens will be used to divide the query sentence into search terms.

The matches to a query are found by following these rules:

  • If the query is a single search term, the matches are the web pages containing that term.
  • In a compound query, a search term without a modifier adds its matches to the result gathered so far. Hint: use the Set operation unionWith
  • A search term has a slightly altered meaning when the term is prefaced by certain modifiers:
  • A term prefaced by + keeps only the pages that also contain that term. Hint: use the Set operation intersect
  • A term prefaced by - removes the pages that contain that term. Hint: use the Set operation difference
  • The same token cleaning applied to the body text is also applied to query terms. Call your helper cleanToken to process each search term to strip punctuation and convert to lowercase before performing the search for matches.

Note that searching is case-insensitive , that is, a search for "binky" returns the same results as "Binky" or "BINKY". Be sure to consider what implications this has for how you create and search the index.

Here are some example queries and how they are interpreted:

  • quokka : matches all pages containing the term "quokka"
  • simple cheap : means simple OR cheap; matches pages that contain either "simple" or "cheap" or both
  • tasty +healthy : means tasty AND healthy; matches pages that contain both "tasty" and "healthy"
  • tasty -mushrooms : means tasty WITHOUT mushrooms; matches pages that contain "tasty" but do not contain "mushrooms"
  • tasty -mushrooms simple +cheap : means tasty WITHOUT mushrooms OR simple AND cheap; matches pages that match ((("tasty" without "mushrooms") or "simple") and "cheap")

There is no precedence for the operators, the query is simply processed from left to right . The matches for the first term are combined with matches for second, then combined with matches for third term and so on. In the last query shown above, the matches for tasty are first filtered to remove all pages containing mushrooms , then joined with all matches for simple and lastly intersected with all matches for cheap .

You may assume that the query sentence is well-formed, which means:

  • The query sentence is non-empty and contains at least one search term
  • A modifier will not appear on its own as a search term
  • A + or - character within a search term that is not the first character is not considered a modifier
  • The first search term in the query sentence will never have a modifier
  • You can assume that no search term will clean to the empty string (i.e. has at least one letter)

There is a lot of functionality to test in query processing, be sure you add an appropriate range of student tests to be sure you're catching all the cases.

5) Put it all together with searchEngine ()

Thus far, your amazing code has re-arranged a mass of unstructured text data into a highly-organized inverted index with instantaneous retrieval and fancy query-matching capability. Now take it over the finish line to build your own search engine!

The final function to implement is:

This function implements a console program that works as follows:

  • It first constructs an inverted index from the contents of the database file.
  • It prints how many web pages were processed to build the index and how many distinct words were found across all pages.
  • It enters a loop that prompts the user to enter a query
  • For each query entered by the user, it finds the matching pages and prints the URLs.
  • When the user enters the empty string ( "" ), this indicates they are done and the program finishes.

After you have completed this function, your program should behave as shown in the transcript shown below.

Example program run (executed by running searchEngine("res/website.txt") in main.cpp):

Way to go, 👍 you're well on your way to becoming the next internet search pioneer!

  • The res folder of the starter project includes two database files: tiny.txt is the small example used in the writeup and website.txt is the body text extracted from all of the pages in our course website (as of Jan 21). You can open these files in Qt Creator to view their contents. The project resource files are listed under Other files -> res. Your program can open a resource file by specifying the path as "res/myfilename".
  • Inverted Index on GeeksForGeeks
  • Wikipedia article on Inverted Indexes
  • Stanford Natural Processing Group on Tokenization

If you have completed the basic assignment requirements and want to go further, we encourage you to try adding an extension! A non-exhaustive list of potential extensions is listed below:

  • When you get the results from a Google search, they are ranked so that the more relevant results are first on the list. The current Google algorithm is a well-kept trade secret (though it was originally the Page Rank algorithm, named for its creator, then-Stanford graduate student Larry Page), but a simple approach is to give higher rank to pages with more occurrences of the search term. For this extension, you would need to re-think how you create your index to include the number of matches.
  • The assignment does not allow a search for multi-word terms, such as section leader . Searching for phrases is not trivial, as you cannot simply keep a mapping of all possible phrases in the document. You could, however, keep track of where in each document a word is, and then use that information to determine if words in a phrase are next to each other in any particular document.
  • The English language has many, many words that show up in text but are not particularly important for content, such as the , and , if , a , etc. These words are called Stop Words , and it would make your index smaller if you removed such stop words from the index. Here is more info about stop words .
  • In the current design, if a user searches for section they won't find matches for sections , even though pages that mention either might be a relevant match. Stemming is the process of reducing words to their base form, so that (for example) both section and sections would become, simply, section in the index.

If you have other creative ideas for extensions, run them by the course staff, and we'd be happy to give you guidance!

Assignment 2 Simple Search Engine(s)

COMP2521 (20T1): Assignment 2 Simple Search Engine(s)

[The specification may change, Please check the change log on this page.]

Objectives

  • to implement simple search engines using well known algorithms like (Weighted) PageRank and rank aggregation, simplified versions for this assignment!
  • to give you further practice with C and data structures (Graph ADT)

Admin

  • Marks: 24 marks (part-1: 11 marks, part-2: 4 marks, part-3: 5 marks and 4 marks for "Style and Complexity") (14 marks towards total course mark).
  • Individual Assignment: This assignment is an individual assignment.
  • Due: 08pm Friday of Week-10
  • Late Penalty: 2 marks per day off the ceiling. Last day to submit this assignment is 05pm Monday Week-11, of course with late penalty.
  • Submit: Read instructions in the "Submission" section below.

Aim

In this assignment, your task is to implement simple search engines using the well known algorithm (Weighted) PageRank, simplified for this assignment, of course! You should start by reading the wikipedia entries on these topics. Later I will also discuss these topics in the lecture.

  • PageRank (read up to the section "Damping factor")

The main focus of this assignment is to build a graph structure and calculate Weighted PageRanks, and rank pages based on these values. You don't need to spend time crawling, collecting and parsing weblinks for this assignment. You will be provided with a collection of "web pages" with the required information for this assignment in an easy to use format. For example, each page has two sections:

  • Section-1 contains urls representing outgoing links. Urls are separated by one or more blanks, across multiple lines.
  • Section-2 contains selected words extracted from the url. Words are separated by one or more spaces, spread across multiple lines.

Hint: If you use fscanf to read the body of a section above, you do not need to impose any restriction on line length. I suggest you should try to use this approach - use fscanf! However, if you want to read line by line using say fgets, you can assume that maximum length of a line would be 1000 characters.

Example file url31.txt

#start Section-1
url2 url34 url1 url26 url52 url21 url74 url6 url82
#end Section-1
#start Section-2
Mars has long been the subject of human interest. Early telescopic observations revealed color changes on the surface that were attributed to seasonal vegetation and apparent linear features were ascribed to intelligent design.
#end Section-2

Summary

  • In Part-1: You need to create a graph structure that represents a hyperlink structure of given collection of "web pages" and for each page (node in your graph) calculate Weighted PageRank and other graph properties.
  • In Part-2: Search Engine: Graph structure-based search engine.
  • In Part-3: Rank Aggregation (Hybrid search engine), you need to combine multiple ranks (for example, PageRank and tf-idf values) in order to rank pages.

Additional files: You can submit additional supporting files, *.c and *.h, for this assignment. For example, you may implement your graph adt in files graph.c and graph.h and submit these two files along with other required files as mentioned below.

Sample files

  • Sample1.zip

Part-1: Graph structure-based Search Engine (11 marks)

Read the following for this part:

  • Hints on "How to Implement Ass2 (Part-1)", to be discussed in the lecture.
  • Weighted PageRank Algorithm (paper)
  • Hints on Wout: How to calculate?
Calculate Weighted PageRanks

You need to write a program in the file pagerank.c that reads data from a given collection of pages in the file collection.txt and builds a graph structure using Adjacency Matrix or List Representation. Using the algorithm described below, calculate Weighted PageRank for every url in the file collection.txt. In this file, urls are separated by one or more spaces or/and new line character. Add suffix .txt to a url to obtain file name of the corresponding "web page". For example, file url24.txt contains the required information for url24.

Example file collection.txt

url25 url31 url2 url102 url78 url32 url98 url33

Simplified Weighted PageRank Algorithm you need to implement (for this assignment) is shown below. Please note that the formula to calculate PR values is slightly different to the one provided in the corresponding paper (for explanation, read Damping factor).

PageRankW(d, diffPR, maxIterations)
    Read "web pages" from the collection in file "collection.txt" and build a graph structure using Adjacency List or Matrix Representation
    N = number of urls in the collection
    For each url pi in the collection
        PR(pi, 0) = 1/N
    End For
    iteration = 0;
    diff = diffPR;   // to enter the following loop
    While (iteration < maxIterations AND diff >= diffPR)
        t = iteration;
        For each url pi in the collection
            PR(pi, t+1) = (1 - d)/N + d * sum over all pj in M(pi) of [ PR(pj, t) * Win(pj, pi) * Wout(pj, pi) ]
        End For
        diff = sum over all urls pi of | PR(pi, t+1) - PR(pi, t) |
        iteration++;
    End While

Your program in pagerank.c will take three arguments (d - damping factor, diffPR - difference in PageRank sum, maxIterations - maximum iterations) and using the algorithm described in this section, calculate Weighted PageRank for every url. For example,

% ./pagerank 0.85 0.00001 1000

Your program should output a list of urls in descending order of Weighted PageRank values (use format string "%.7f") to a file named pagerankList.txt. We will use a tolerance of 10e-4 to check pagerank values for auto-marking. The list should also include out degrees (number of out going links) for each url, along with its Weighted PageRank value. The values in the list should be comma separated. For example, pagerankList.txt may contain the following:

Example file pagerankList.txt

url31, 3, 0.2623546
url21, 1, 0.1843112
url34, 6, 0.1576851
url22, 4, 0.1520093
url32, 6, 0.0925755
url23, 4, 0.0776758
url11, 3, 0.0733884

Sample Files for 1A

You can download the following three sample files with expected pagerankList.txt files. For your reference, I have also included the file "log.txt" which includes values of Win, Wout, etc. Please note that you do NOT need to generate such a log file.

Use format string "%.7f" to output pagerank values. Please note that your pagerank values might be slightly different to that provided in these samples. This might be due to the way you carry out calculations. However, make sure that your pagerank values match to at least the first 4 decimal points to the expected values. For example, say an expected value is 0.1843112, your value could be 0.1843xxx where x could be any digit.

All the sample files were generated using the following command:

% ./pagerank 0.85 0.00001 1000

  • ex1
  • ex2
  • ex3

Part-2: Search Engine (4 marks)

Your program must use data available in two files invertedIndex.txt and pagerankList.txt, and must derive result from them for this part. We will test this program independently to your solutions for other sections.
The file invertedIndex.txt contains one line per word; words are ordered alphabetically, in ascending order. Each list of URLs (for a single word) is also ordered alphabetically, in ascending order. You should read the file invertedIndex.txt line by line, and can assume a maximum line length of 1,000 characters. Then tokenise each line into words (hint: use the function strtok from the string.h library; for more info see sample example-1 or sample example-2). You can assume that there will be no duplicate search terms, so "mars mars" is not possible.

Example file invertedIndex.txt:

design url2 url25 url31
mars url101 url25 url31
vegetation url31 url61

Write a simple search engine in the file searchPagerank.c that, given search terms (words) as command-line arguments, finds pages with one or more of the search terms and outputs (to stdout) the top 30 pages in descending order of the number of search terms found and then, within each group, in descending order of Weighted PageRank. If the number of matches is less than 30, output all of them. Example:

% ./searchPagerank mars design
url31
url25
url2
url101

Part-3: Hybrid/Meta Search Engine using Rank Aggregation (5 marks)

In this part, you need to combine search results (ranks) from multiple sources (say, from Part-1 and Assignment-1) using the "Scaled Footrule Rank Aggregation" method described below. All the required information for this method is provided below.

Let T1 and T2 be search results (rank lists) obtained using two different criteria (say Part-1 and Part-2). Please note that we could use any suitable criteria, including manually generated rank lists. A weighted bipartite graph for scaled footrule optimisation (C, P, W) is defined as:

  • C = set of nodes to be ranked (say, the union of T1 and T2)
  • P = set of positions available (say {1, 2, 3, 4, 5})
  • W(c, p) is the scaled-footrule distance (from T1 and T2) of a ranking that places element c at position p, given by

        W(c, p) = SUM over i = 1..k of | Ti(c)/|Ti| - p/n |

    where n is the cardinality (size) of C, |Ti| is the cardinality (size) of rank list Ti, Ti(c) is the position of c in Ti (terms for rank lists that do not contain c are omitted), and k is the number of rank lists.

The final ranking is derived by finding an assignment of positions P such that the total scaled-footrule distance is minimum. There are many different ways to assign positions in P; for example, with n = 5, one possible assignment is P = [1, 3, 2, 5, 4], and some other possible assignments are P = [1, 2, 4, 3, 5], P = [5, 2, 1, 4, 3], P = [1, 2, 3, 4, 5], etc. For n = 5 there are 5! = 120 possible alternatives; for n = 10 there would be 10!, that is 3,628,800 alternatives.

A very simple and obviously inefficient approach could use brute-force search: generate all possible alternatives, calculate the scaled-footrule distance for each alternative, and find the alternative with the minimum scaled-footrule distance. If you use such a brute-force search, you will receive a maximum of 65% for Part-3. However, you will be awarded 100% for Part-3 if you implement a "smart" algorithm that avoids generating unnecessary alternatives in the process of finding the minimum scaled-footrule distance. Please document your algorithm so that your tutor can easily understand your logic, and clearly outline how you plan to reduce the search space; otherwise you will not be awarded marks for your "smart" algorithm! Yes, it's only 35% of the Part-3 marks, but if you try it, you will find it very challenging and rewarding.
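To make the cost function above concrete, here is an illustrative C helper that evaluates W(c, p) for a single candidate and position. The data layout (arrays of positions and list sizes) is only an assumption for this sketch, not a required design; building the rank lists and searching over assignments are left out.

#include <math.h>

// Scaled-footrule cost W(c, p) for one candidate c and one position p.
// pos[i] is the 1-based position of c in rank list i (0 if c is absent),
// size[i] is the length of rank list i, k is the number of rank lists,
// and n is the size of C.
double footruleCost(int k, const int pos[], const int size[], int n, int p) {
    double w = 0.0;
    for (int i = 0; i < k; i++) {
        if (pos[i] == 0)
            continue;                                 // c does not appear in this rank list
        w += fabs((double)pos[i] / size[i] - (double)p / n);
    }
    return w;
}

A brute-force solution would sum these costs for every permutation of positions; one possible "smart" refinement (not prescribed here) is to abandon a partial assignment as soon as its accumulated cost already exceeds the best complete distance found so far.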
Write a program scaledFootrule.c that aggregates ranks from the files given as command-line arguments, and outputs an aggregated rank list with minimum scaled footrule distance.

How to Get Started, Part-3

Example file rankA.txt:

url1
url3
url5
url4
url2

Example file rankD.txt:

url3
url2
url1
url4

The following command will read ranks from the files rankA.txt and rankD.txt, and output the minimum scaled footrule distance (using format "%.6f") on the first line, followed by the corresponding aggregated rank list:

% ./scaledFootrule rankA.txt rankD.txt

For the above example, there are two possible answers, both with a minimum distance of 1.400000. For C = [url1, url2, url3, url4, url5], the two values of P with minimum distance are:

P = [1, 4, 2, 5, 3] and P = [1, 5, 2, 4, 3]

You need to select any one of the possible values of P that has minimum distance, so there could be multiple possible answers; note that you need to output only one such list. One possible answer for the above example, for P = [1, 4, 2, 5, 3]:

1.400000
url1
url3
url5
url2
url4

Another possible answer for the above example, for P = [1, 5, 2, 4, 3]:

1.400000
url1
url3
url5
url4
url2

Please note that your program should also be able to handle multiple rank files, for example:

% ./scaledFootrule rankA.txt rankD.txt newSearchRank.txt myRank.txt

Submission

Additional files: You can submit additional supporting files, *.c and *.h, for this assignment. IMPORTANT: Make sure that your additional files (*.c) DO NOT have a "main" function. For example, you may implement your graph ADT in files graph.c and graph.h and submit these two files along with the other required files mentioned below; however, make sure that these files do not have a "main" function. (A small illustration of such a supporting file is sketched at the end of this section.) I explain below how we will test your submission; hopefully this will answer all of your questions.

You need to submit the following files, along with your supporting files (*.c and *.h):

  • pagerank.c
  • searchPagerank.c
  • scaledFootrule.c

Now say we want to mark your pagerank.c program. The auto-marking program will take all your supporting files (other *.h and *.c files), along with pagerank.c, and execute the following command to generate an executable file called, say, pagerank. Note that the other two files from the above list (searchPagerank.c and scaledFootrule.c) will be removed from the directory:

% gcc -Wall -lm -std=c11 *.c -o pagerank

So we will not use your Makefile (if any). The above command compiles your supporting files together with the file to be tested (say pagerank.c), links the resulting object files, and generates an executable file, say pagerank in the above example. Again, please make sure that you DO NOT have a main function in your supporting files (the other *.c files you submit). We will use a similar approach to generate the other executables (from searchPagerank.c and scaledFootrule.c).

How to Submit

Go to the following page, select the tab "Make Submission", select "Browse" to select all the files you want to submit, and submit using the "Submit" button. The submission system will try to compile each required file and report the outcome (ok or error). Please check the output and correct any errors. If you do not submit a required file, it will be reported as an error.

You can now submit this assignment: click on the "Make Submission" tab and follow the instructions.
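As a purely illustrative example of the supporting-file requirement above (the names and interface are not prescribed), a hypothetical graph ADT header might look like this; note that it declares functions only and contains no main:

/* graph.h - illustrative supporting header only; no main function here */
#ifndef GRAPH_H
#define GRAPH_H

typedef struct graph *Graph;                    // opaque handle, defined in graph.c

Graph graphNew(int nVertices);                  // create a graph with nVertices nodes
void  graphInsertEdge(Graph g, int v, int w);   // add a directed edge v -> w
int   graphOutDegree(Graph g, int v);           // number of outgoing links from node v
void  graphFree(Graph g);                       // release all memory used by the graph

#endif

The matching graph.c would define struct graph and these functions; the only main functions in your submission should be in pagerank.c, searchPagerank.c and scaledFootrule.c.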
Plagiarism

This is an individual assignment. You are not allowed to use code developed by persons other than you. In particular, it is not permitted to exchange code or pseudocode between students. You are allowed to use code from the course material (for example, code available as part of the labs, lectures and tutorials); if you use such code, please clearly acknowledge it by including a comment in your file. If you have questions about the assignment, ask your tutor.

Before submitting any work you should read and understand the subsection named "Plagiarism" in the "Student Conduct" section of the course outline. We regard unacknowledged copying of material, in whole or in part, as an extremely serious offence. For further information, see the course outline.

-- end --
