PageRank

The scores were computed with the basic PageRank algorithm, using a random start probability of 0.15.
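For orientation, the basic algorithm can be sketched as a power iteration. This is a minimal illustration, not the code used to produce the files: the function name `pagerank`, the `out_links` dictionary format, the fixed iteration count, and the uniform redistribution of rank from nodes without outlinks are all assumptions made here. Only the random start probability of 0.15 comes from the description above.

```python
def pagerank(out_links, random_start=0.15, iterations=50):
    """Power-iteration PageRank over a small in-memory graph.

    out_links maps each node to the list of nodes it links to
    (a hypothetical format chosen for this sketch).
    """
    # Collect every node that appears as a source or as a link target.
    nodes = set(out_links)
    for targets in out_links.values():
        nodes.update(targets)
    n = len(nodes)

    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node receives an equal share of the random start mass.
        new_rank = {node: random_start / n for node in nodes}
        dangling = 0.0
        for node in nodes:
            targets = out_links.get(node, [])
            if targets:
                share = (1.0 - random_start) * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: rank of nodes without outlinks is spread uniformly.
                dangling += (1.0 - random_start) * rank[node]
        for node in nodes:
            new_rank[node] += dangling / n
        rank = new_rank
    return rank


if __name__ == "__main__":
    graph = {"a": ["b"], "b": ["a"], "c": ["a"]}
    for node, score in sorted(pagerank(graph).items()):
        print(node, round(score, 4))
```

In this toy graph, node `c` has no inlinks, so its score stays at the minimum value `random_start / n`, which is the kind of default minimum score that pages without inlinks receive.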

The WebGraphs are as provided with the collection. Their nodes include not only the in-collection pages but also all the outlinks from those pages. For example, the category A English portion has about 500 million (503,860,525) pages, and the graph includes roughly 4.8 billion (4,780,950,903) URLs/nodes.

Note that the PageRank files available on this page contain duplicate entries. In some cases a DocNO corresponded to multiple NodeIDs in the WebGraph. Because the PageRank scores are calculated per NodeID and the NodeIDs are then mapped back to DocNOs, the same DocNO can appear more than once. To correct for this, sum the PageRank scores of all occurrences of the same DocNO.
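The correction can be sketched as a simple aggregation. The on-disk layout of the PageRank files is not described here, so this example assumes the records have already been parsed into `(docno, score)` pairs; the function name `sum_duplicate_docnos` and the sample identifiers are hypothetical.

```python
from collections import defaultdict


def sum_duplicate_docnos(rows):
    """Sum the scores of all occurrences of each DocNO.

    rows is any iterable of (docno, score) pairs (an assumed input
    format; parsing the actual files is left out of this sketch).
    """
    totals = defaultdict(float)
    for docno, score in rows:
        totals[docno] += score
    return dict(totals)


if __name__ == "__main__":
    rows = [
        ("docno-A", 0.25),
        ("docno-B", 0.5),
        ("docno-A", 0.25),  # second NodeID mapped to the same DocNO
    ]
    print(sum_duplicate_docnos(rows))
```

Because the rows are consumed one at a time, the same approach works when streaming a large file, as long as the dictionary of distinct DocNOs fits in memory.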

Category A English portion

The lists contain 502,511,675 DOCNOs after deduplication. All duplicate URLs were removed from this list while translating node_ids to WARC DOCNOs. (For DOCNOs with the same URL, only the smallest DOCNO is kept as the DOCNO for all the node_ids corresponding to that URL.)
About 52% of the DOCNOs in the raw PageRank list have the default minimum PageRank because no inlinks point to them in the WebGraph. About 86% of the bottom DOCNOs in the prior file are in the last of the 10 PageRank prior bins.
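The "smallest DOCNO per URL" rule used during deduplication can be illustrated as follows. This is a sketch only: the `url_by_docno` mapping, the function name `canonical_docnos`, and the sample values are hypothetical, and "smallest" is assumed to mean smallest under plain string comparison, which the description does not confirm.

```python
def canonical_docnos(url_by_docno):
    """Return, for each URL, the smallest DOCNO that maps to it.

    url_by_docno maps DOCNO -> URL (an assumed input format).
    "Smallest" is taken to mean smallest by string comparison.
    """
    smallest = {}
    for docno, url in url_by_docno.items():
        if url not in smallest or docno < smallest[url]:
            smallest[url] = docno
    return smallest


if __name__ == "__main__":
    mapping = {
        "docno-0002": "http://example.com/",
        "docno-0001": "http://example.com/",
        "docno-0003": "http://example.org/",
    }
    print(canonical_docnos(mapping))
```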

Each line of the duplicate record list file contains duplicate DOCNOs that correspond to the same URL. The list also includes DOCNO prefixes, which are in the same format. If a DOCNO appears in the file as a complete DOCNO, or if a prefix of it appears, it is a duplicate. As noted above, only the smallest DOCNO is included in the PageRank data. A small number of DOCNOs are neither in the PageRank data nor in the duplicate record list. Malformed or incomplete HTML and parser read errors caused these DOCNOs to be left out of the node list and therefore out of the WebGraph.
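The duplicate test described above (a complete DOCNO match or a matching prefix) can be sketched like this. The exact file layout is not specified, so the example assumes that entries on a line are separated by whitespace; the function names `parse_entries` and `is_duplicate` and the sample identifiers are hypothetical.

```python
def parse_entries(lines):
    """Collect every entry (complete DOCNO or prefix) from the list.

    Assumption: entries on a line are separated by whitespace.
    """
    entries = set()
    for line in lines:
        entries.update(line.split())
    return entries


def is_duplicate(docno, entries):
    """True if docno equals an entry or starts with an entry (prefix match)."""
    # str.startswith also covers the exact-match case.
    return any(docno.startswith(entry) for entry in entries)


if __name__ == "__main__":
    entries = parse_entries(["docno-0001-07 docno-0002-11", "docno-0005"])
    print(is_duplicate("docno-0002-11", entries))  # complete DOCNO listed
    print(is_duplicate("docno-0005-42", entries))  # prefix listed
    print(is_duplicate("docno-0009-01", entries))  # not listed
```

The linear scan over all entries is only suitable for small lists. For the full duplicate record list, a sorted list with binary search or a prefix tree would likely be a better fit.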

Category B

Since Category B is a subset of Category A, PageRank scores for Category B documents can be found as a subset of the Category A scores.
The following two files contain PageRank scores computed only on the WebGraph of the Category B set.
The lists contain 148,148,553 DOCNOs. Duplicate URLs are removed from this list. The lists also contain about 100 million DOCNOs that are in Category A but not in Category B.