Sale!

Numerical Computing Project 1 PageRank Algorithm solution

Original price was: $35.00.Current price is: $30.00. $25.50

Description

5/5 - (6 votes)

The purpose of this project1
is to learn the importance of numerical linear algebra algorithms to solve fundamental
linear algebra problems that occur in search engines.
Page-Rank Algorithm – Linear Systems
One of the reasons why GoogleTM is such an effective search engine is the Page-RankTM algorithm developed by
Google’s founders, Larry Page and Sergey Brin, when they were graduate students at Stanford University. PageRank
is determined entirely by the link structure of the World Wide Web. It is recomputed about once a month and does
not involve the actual content of any Web pages or individual queries. Then, for any particular query, Google finds
the pages on the Web that match that query and lists those pages in the order of their PageRank. Imagine surfing the
Web, going from page to page by randomly choosing an outgoing link from one page to get to the next. This can lead
to dead ends at pages with no outgoing links, or cycles around cliques of interconnected pages. So, a certain fraction
of the time, simply choose a random page from the Web. This theoretical random walk is known as a Markov chain
or Markov process. The limiting probability that an infinitely dedicated random surfer visits any particular page is its
PageRank. A page has high rank if other pages with high rank link to it.
Let W be the set of Web pages that can be reached by following a chain of hyperlinks starting at some root page
and let n be the number of pages in W. For Google, the set W highly fluctuates over time and is not exactly known,
but was estimated to be about 30 billion in 20162
. Let G be the n-by-n connectivity matrix of a portion of the Web,
that is gij = 1 if there is a hyperlink to page i from page j and zero otherwise. The matrix G can be huge, but it is
very sparse. Its j-th column shows the links on the j-th page. The number of nonzeros in G is the total number of
hyperlinks in W. Let ri and cj be the row and column sums of G:
ri =
X
j
gij and cj =
X
i
gij . (1)
The quantities rj and cj are the in-degree and out-degree of the j-th page. Let p be the probability that the random
walk follows a link. A typical value is p = 0.85. Then 1 − p is the probability that some arbitrary page is chosen and
δ = (1 − p)/n is the probability that a particular random page is chosen. Let A be a n-by-n matrix with elements
aij =

p gij/cj + δ cj 6= 0,
1/n cj = 0.
(2)
Notice that A comes from scaling the connectivity matrix by its column sums. The j-th column is the probability of
jumping from the j-th page to the other pages on the Web. If the j-th page is a dead end – i.e., it has no out-links –
then we assign a uniform probability of 1/n to all the elements in its column. Most of the elements of A are equal
to δ, the probability of jumping from one page to another without following a link. If n = 4 · 109
and p = 0.85,
then δ = 3.75 · 10−11. Matrix A is the transition probability matrix of the Markov chain. Its elements are all strictly
between zero and one and its column sums are all equal to one. An important result from matrix theory, known as the
Perron-Frobenius theorem, applies to such matrices: a nonzero solution of the equation
x = Ax (3)
1This project is originally based on a SIAM book chapter from Numerical Computing with Matlab by Clever B. Moler.
2
see ”Estimating search engine index size variability” by van den Bosch et al., Scientometrics,
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4833824/
1
Numerical Computing, Fall Semester 2022
Lecturer: Prof. O. Schenk
Assistants: E. Vecchi, L. Gaedke-Merzhauser ¨
exists and is unique to within a scaling factor. If this scaling factor is chosen so that
X
i
xi = 1, (4)
then x is the state vector of the Markov chain and corresponds to Google’s PageRank. The elements of x are all
positive and less than one. Vector x is the solution to the singular, homogeneous linear system
(I − A)x = 0 (5)
For modest n, an easy way to compute x in Julia is to start with some approximate solution, such as the PageRanks
from the previous month, or
x = ones(n, 1) / n
Then simply repeat the assignment statement
x = A * x
until the difference between successive vectors is within a specified tolerance. This is known as the power method
and is about the only possible approach for very large n. In practice, the matrices G and A are never actually formed.
One step of the power method would be done by one pass over a database of Web pages, updating weighted reference
counts generated by the hyperlinks between pages. The best way to compute PageRank in Julia is to take advantage
of the particular structure of the Markov matrix. Here is an approach that preserves the sparsity of G. The transition
matrix can be written
A = pGD + ez|
, (6)
where D is the diagonal matrix formed from the reciprocals of the out-degrees,
djj =

1/cj cj 6= 0
0 cj = 0.
(7)
e is the n-vector of all ones, and z is the vector with components
zj =

δ cj 6= 0
1/n cj = 0.
(8)
The rank-one matrix ez|
accounts for the random choices of Web pages that do not follow links. The equation
x = Ax (9)
can be written as
(I − pGD)x = γe (10)
where
γ = z
|x. (11)
We do not know the value of γ because it depends on the unknown vector x, but we can temporarily take γ = 1. As
long as p is strictly less than one, the coefficient matrix I − pGD is nonsingular and the equation
(I − pGD)x = e (12)
can be solved for x. Then the resulting x can be rescaled so that
X
i
xi = 1. (13)
Notice that vector z is not involved in this calculation. The following Julia statements implement this approach.
2
Numerical Computing, Fall Semester 2022
Lecturer: Prof. O. Schenk
Assistants: E. Vecchi, L. Gaedke-Merzhauser ¨
c = vec(sum(G, dims=1));
k = findall(!iszero, c);
D = sparse(k, k, map(x -> 1/x, c[k]), n, n);
e = ones(n, 1);
sI = sparse(I, n, n);
x = (sI – p * G * D) \ e;
x = x/sum(x);
The power method [1, 2] can also be implemented in a way that does not actually form the Markov matrix and so
preserves sparsity. Compute
G = p * G * D;
z = ((1- p) * map(!iszero, c) + map(iszero, c)) / n;
z = reshape(z, (1,500));
Start with
x = e / n;
Then repeat the statement
x = G * x + e * (z * x);
until x settles down to several decimal places.
It is also possible to use an algorithm known as inverse iteration [1, 3] where one defines
x = e / n;
A = p * G * D + e * z;
and then the following statement is repeated until convergence given a suitable value α.
x = (alpha * I – A) \ x;
x = x / sum(x);
For details of when and why this method works, please refer to the references cited above. However, we would like
to briefly observe what happens when we take α = 1 in the expression (αI − A). Since the resulting matrix (I − A)
is theoretically singular, with exact computation some diagonal elements of the upper triangular factor of (I − A)
would, in principle, be zero and this computation should fail. But with roundoff error, the computed matrix (I −A) is
probably not exactly singular. Even if it is singular, roundoff during Gaussian elimination will most likely prevent any
exact zero diagonal elements. We know that Gaussian elimination with partial pivoting always produces a solution
with a small residual, relative to the computed solution, even if the matrix is badly conditioned. The vector obtained
with the backslash operation, (I − A)\x, usually has very large components. If it is rescaled by its sum, the residual
is scaled by the same factor and becomes very small. Consequently, the two vectors x and Ax equal each other to
within roundoff error. In this setting, solving the singular system with Gaussian elimination blows up, but it blows up
in exactly the right direction.
Figure 1 is the graph for a tiny example, with n = 6 instead of n = 4 · 109
. The pages on the Web are identified by
strings known as uniform resource locators, or URLs. Most URLs begin with https because they use the hypertext
transfer protocol. In Julia, we can store the URLs as an array of strings in a cell array. The example in Figure 1 can
be written as follows by using a 6-by-1 cell array:
U = [“https://www.alpha.com”
“https://www.beta.com”
“https://www.gamma.com”
“https://www.delta.com”
“https://www.rho.com”
“https://www.sigma.com”]
In order to access the string contained in a cell we can simply write U[k], with k = 1, . . . , 6 in this small web graph.
Thus, U[1] would, e.g., correspond to the string ”https://www.alpha.com”.
We can generate the connectivity matrix by specifying the pairs of indices (i, j) of the nonzero elements. Because
there is a link to beta.com from alpha.com, the (2, 1) element of G is nonzero. The nine connections are described by
3
Numerical Computing,
Figure 1: A tiny web graph.
i = [ 2 6 3 4 4 5 6 1 1]
j = [ 1 1 2 2 3 3 3 4 6]
A sparse matrix is stored in a data structure that requires memory only for the nonzero elements and their indices.
This is hardly necessary for a 6-by-6 matrix with only 27 zero entries, but it becomes crucially important for larger
problems. The statements
n = 6;
G = sparse(vec(i), vec(j), vec(ones(Bool, 1, 9)));
Matrix(G)
generate the sparse representation of an n-by-n matrix with ones in the positions specified by the vectors i and j and
display its full representation.
0 0 0 1 0 1
1 0 0 0 0 0
0 1 0 0 0 0
0 1 1 0 0 0
0 0 1 0 0 0
1 0 1 0 0 0
The statement
c = sum(G, dims=1)
computes the column sums
2 2 3 1 0 1
Notice that c(5) = 0 because the 5th page, labeled rho, has no out-links. The statements
x = (sI – p * G * D) \ e;
x = x / sum(x);
solve the sparse linear system to produce
julia> x
0.3210
0.1705
0.1066
0.1368
0.0643
0.2007
4
Numerical Computing, Fall Semester 2022
Lecturer: Prof. O. Schenk
Assistants: E. Vecchi, L. Gaedke-Merzhauser ¨
Figure 2: Page Rank for the tiny web graph
The bar graph of x is shown in Figure 2. If the URLs are sorted in PageRank order and listed along with their in- and
out-degrees, the result is
index page-rank in out url
1 0.321017 2 2 https://www.alpha.com
6 0.200744 2 1 https://www.sigma.com
2 0.170543 1 2 https://www.beta.com
4 0.136793 2 1 https://www.delta.com
3 0.106592 1 3 https://www.gamma.com
5 0.0643118 1 0 https://www.rho.com
We see that alpha has a higher PageRank than delta or sigma, even though they all have the same number of
in-links. A random surfer will visit alpha over 32% of the time and rho only about 6% of the time.
For this tiny example with p = .85, the smallest element of the Markov transition matrix is δ = .15/6 = .0250.
A =
0.0250 0.0250 0.0250 0.8750 0.1667 0.8750
0.4500 0.0250 0.0250 0.0250 0.1667 0.0250
0.0250 0.4500 0.0250 0.0250 0.1667 0.0250
0.0250 0.4500 0.3083 0.0250 0.1667 0.0250
0.0250 0.0250 0.3083 0.0250 0.1667 0.0250
0.4500 0.0250 0.3083 0.0250 0.1667 0.0250
We can notice that all the columns of matrix A sum to one, thus agreeing with the definition of left stochastic matrix.
This mini-project includes the Julia file surfer.jl. A statement like
U, G = surfer(’https://www.xxx.zzz’,n)
starts at a specified URL and tries to surf the Web until it has visited n pages. If successful, it returns an n-by-1 cell
array of URLs and an n-by-n sparse connectivity matrix. Surfing the Web automatically is a dangerous undertaking
and this function must be used with care. Some URLs contain typographical errors and illegal characters. There is a
list of URLs to avoid that includes .gif files and Web sites known to cause difficulties. Most importantly, surfer can
get completely bogged down trying to read a page from a site that appears to be responding, but that never delivers the
complete page. When this happens, it may be necessary to have the computer’s operating system ruthlessly terminate
Julia. With these precautions in mind, you can use surfer to generate your own PageRank examples. The statement
U, G = surfer(’https://www.inf.ethz.ch’, 500);
5
Numerical Computing, Fall Semester 2022
Lecturer: Prof. O. Schenk
Assistants: E. Vecchi, L. Gaedke-Merzhauser ¨
accesses the home page of the Department of Computer Science at the Swiss Federal Institute of Technology in Zurich
(ETH) and generates a 500-by-500 test case. The file ETH500.mat included in the code folder was generated by
using this function in September 2014. In the following, it is indicated as the ETH500 data set. The statement
spy(G)
produces a spy plot (Figure 3) that shows the nonzero structure of the connectivity matrix. The statement
pagerank(U, G)
computes the page ranks, produces a bar graph (Figure 4) of the ranks, and prints the most highly ranked URLs in
PageRank order. The highly ranked pages are
julia> pagerank(U,G)
index page-rank in out url
57 0.145755 292 1 http://www.zope.org
466 0.049996 19 1 http://purl.org/dc/elements/1.1
39 0.028766 311 0 http://www.ethz.ch
58 0.021863 291 0 http://www.infrae.com
53 0.021202 288 0 http://www.cd.ethz.ch/services/web
101 0.016601 234 6 http://www.hk.ethz.ch/index_EN
72 0.016473 235 0 http://www.ethz.ch/index_EN
486 0.007374 14 2 http://www.handelszeitung.ch
463 0.006896 17 0 http://ns.adobe.com/xap/1.0
Figure 3: Spy plot of the Department of Computer Science (ETH Zurich) web graph.
Solve the following problems:
Preliminary: Read “A First Course on Numerical Methods”
Read Chapter 8 [1] carefully, in order to gain a better understanding of the topic summarized above.
1. Theory [20 points]
(a) What are an eigenvector, an eigenvalue and an eigenbasis?
(b) What assumptions should be made to guarantee convergence of the power method?
6
Numerical Computing, Fall Semester 2022
Lecturer: Prof. O. Schenk
Assistants: E. Vecchi, L. Gaedke-Merzhauser ¨
Figure 4: PageRank of the Department of Computer Science (ETH Zurich) web graph.
(c) What is the shift and invert approach?
(d) What is the difference in cost of a single iteration of the power method, compared to the inverse iteration?
(e) What is a Rayleigh quotient and how can it be used for eigenvalue computations?
2. Other webgraphs [10 points]
Use surfer.jl and pagerank.jl to compute PageRanks for some subset of the Web of your choice. Do you
see any interesting structure in the results (e.g., cliques, see next question)? Report on three PageRanks for different
webgraphs by showing the connectivity matrix, the PageRanks, and the ten most important entries in the graph.
3. Connectivity matrix and subcliques [5 points]
The connectivity matrix for the ETH500 data set (ETH500.mat) has various small, almost entirely nonzero, submatrices that produce dense patches near the diagonal of the spy plot. You can use the zoom button to find their indices.
The first submatrix has, e.g., indices around 80. Mathematically, a graph where all nodes are connected to each other
is known as a clique. Identify the organizations within the ETH community that are responsible for these near cliques.
4. Connectivity matrix and disjoint subgraphs [10 points]
Figure 5 reports the graph of a tiny six-node subset of the Web. Unlike the example reported in Figure 1, in this case
there are two disjoint subgraphs.
1. What is the connectivity matrix G? Which are its entries?
2. What are the PageRanks if the hyperlink transition probability p assumes the default value of 0.85?
3. Describe what happens with this example to both the definition of PageRank and the computation done by
pagerank in the limit p → 1.
5. PageRanks by solving a sparse linear system [40 points]
The function pagerank(U,G) computes PageRanks by solving a sparse linear system. It then plots a bar graph and
prints the dominant URLs.
7
Numerical Computing, Fall Semester 2022
Lecturer: Prof. O. Schenk
Assistants: E. Vecchi, L. Gaedke-Merzhauser ¨
Figure 5: Another tiny Web.
1. Create pagerank1.jl by modifying pagerank.jl to use the power method instead of solving the sparse
linear system. The key statements are
G = p * G * D;
z = ((1- p) * map(!iszero, c) + map(iszero, c)) / n;
z = reshape(z, (1,500));
while termination_test
x = G * x + e * (z * x);
end
What is an appropriate test for terminating the power iteration?
2. Create pagerank2.jl by modifying pagerank.jl to use the inverse iteration. The key statements are
while termination_test
x = (alpha * I – A) \ x;
x = x / sum(x);
end
Use your functions pagerank1.jl and pagerank2.jl (set α = 0.99) to compute the PageRanks of the
six-node example presented in Figure 1. Make sure you get the same result from each of your three functions.
3. We now want to analyse the impact of α on the inverse iteration. Using the ETH500 example, set α equal
to 0.8, 0.9, 0.95 and 1. Comment on the different number of iterations the four cases take until convergence.
Analyse your results and explain what you observe.
Hint: Check your solution x for all 4 cases. Are they always the same?
4. Use your functions pagerank1.jl and pagerank2.jl (set α = 0.99) to compute the PageRanks of the
three selected graphs from exercise 2. Report on the convergence of the two methods for these subgraphs and
summarize the advantages and disadvantages of the power method implemented in pagerank1.jl against
the inverse iteration in pagerank2.jl.
8
Numerical Computing, Fall Semester 2022
Lecturer: Prof. O. Schenk
Assistants: E. Vecchi, L. Gaedke-Merzhauser ¨
Quality of the code and of the report [15 Points]
The highest possible score of each project is 85 points and up to 15 additional points can be awarded based on the
quality of your report and code (maximum possible grade: 100 points). Your report should be a coherent document,
structured according to the template provided on iCorsi. If there are theoretical questions, provide a complete and
detailed answer. All figures must have a caption and must be of sufficient quality (include them either as .eps or .pdf).
If you made a particular choice in your implementation that might be out of the ordinary, clearly explain it in the
report. The code you submit must be self-contained and executable, and must include the set-up for all the results that
you obtained and listed in your report. It has to be readable and, if particularly complicated, well-commented.
Additional notes and submission details
Summarize your results and experiments for all exercises by writing an extended LaTeX report, by using the template provided on iCorsi (https://www.icorsi.ch/course/view.php?id=14666). Submit your gzipped
archive file (tar, zip, etc.) – named project number lastname firstname.zip/tgz – on iCorsi (see [NC
2022] Project 1 – Submission PageRank Algorithm) before the deadline. Submission by email or through any other
channel will not be considered. Late submissions will not be graded and will result in a score of 0 points for that
project. You are allowed to discuss all questions with anyone you like, but: (i) your submission must list anyone you
discussed problems with and (ii) you must write up your submission independently. Please remember that plagiarism
will result in a harsh penalization for all involved parties.
In-class assistance
If you experience difficulties in solving the problems above, please join us in class either on Tuesdays 16:30-18:15 in
room C1.03 or on Wednesdays 14:30-16:15 in room D1.15.
References
[1] The power method and variants, Chapter 8: Eigenvalues and Singular Values, SIAM Book “A First Course on
Numerical Methods”, C. Greif, U. Ascher, pp. 219-229.
[2] Power iteration, http://en.wikipedia.org/wiki/Power_iteration
[3] Inverse iteration, http://en.wikipedia.org/wiki/Inverse_iteration
9