The ClueWeb12 Dataset:
Dataset Details

This document describes how the ClueWeb12 dataset was created, what it contains, and how it is organized for distribution to the research community.

 

Crawling the Web

Most of the web documents were collected by five instances of the Internet Archive's Heritrix web crawler running on five Dell PowerEdge R410 machines with 64GB RAM. Heritrix was configured to follow typical crawling guidelines. There is a FAQ page for the crawler, in case you are curious.

The crawl was initially seeded with 2,820,500 unique URLs. This list was generated by taking the 10 million ClueWeb09 urls that had the highest PageRank scores, and then removing any page that was not in the top 90% of pages as ranked by Waterloo spam scores (i.e., least likely to be spam). Two hundred sixty-two (262) seeds were added from the most popular sites in English-speaking countries, as reported by Alexa. The number of sites selected from each country depended on its relative population size, for example, United States (71.0%), United Kingdom (14.0%), Canada (7.7%), Australia (5.2%), Ireland (3.8%), and New Zealand (3.7%). Finally, Charles Clarke, University of Waterloo, provided 5,950 seeds specific to travel sites.

A blacklist was used to avoid sites that are reported to distribute pornography, malware, and other material that would not be useful in a dataset intended to support a broad range of research on information retrieval and natural language understanding. The blacklist was obtained from a commercial managed URL blacklist service, URLBlacklist.com, which was downloaded on 2012-02-03. The crawler blacklist consisted of urls in the malware, phishing, spyware, virusinfected, filehosting, and filesharing categories. The blacklist also included a small number (less than a dozen) of sites that opted out of the crawl.

Urls mentioned in English tweets from a Twitter Gardenhose stream were harvested each day by a sixth Heritrix crawler. The domains of tweeted urls were injected into the main web crawl on a regular basis, which was intended to create a more connected graph between the web crawl and the tweeted urls.

A seventh Heritrix crawler crawled the English part of WikiTravel. WikiTravel crawling began on April 5, 2012, and ended on May 1, 2012.

The crawlers were configured to capture page text, css, xml, and javascript files, any images on a page, and http response headers. The crawlers skipped multimedia files, for example, flash, audio, and video files, as well as compressed files (e.g., zip, tar, gz, sit, hqx). The crawlers also truncated any file larger than 10 MB.

Image files were collected during the crawl so that the Lemur Project can support user studies and offer a page rendering service to the research community (as we do for the ClueWeb09 dataset). However, the image files are not part of the ClueWeb12 dataset. They are not covered by the ClueWeb12 dataset license, and we do not distribute them.

 

Processing the Crawled Data

After all crawling was completed, the downloaded documents were transformed into a research dataset by a series of processing steps, described below.

 

 

Duplicate Records

After some of the ClueWeb12 version 1.0 datasets were shipped, we discovered an error in how the dataset was constructed. During post-processing, we inadvertently processed some of the crawler files twice, which caused the dataset to contain more than 100 million duplicate documents. Although duplicate documents are to be expected in any large web crawl, these duplicates were caused by our error, not by characteristics of the crawler or re-use on the web, so we decided to remove them from the dataset. We also took this opportunity to remove a smaller number of other duplicates that should have been filtered out by our crawling architecture. In total, about 140 million documents (16% of the collection) were removed. Removing them to create a new version of the dataset reduces everyone's storage and computational costs by 16% for years to come.

The new version of the ClueWeb12 dataset is v1.1. This is the standard version of the dataset; we are no longer distributing v1.0.

V1.1 has the same directory structure and document ids as v1.0. The only differences are that some files are a little smaller because duplicate documents are removed, and some files are missing because they contained only duplicate documents. Thus, document ids within some files are not consecutive (i.e., there may be gaps where the missing documents were previously).

We provided a Java program to run on the v1.0 dataset to produce a v1.1 dataset. You may download the program from http://lemurproject.org/clueweb12/ClueWeb12-RemoveduplicateRecords.php. This package includes a verification procedure to guarantee that you have a correct and complete v1.1 dataset. On our system, this conversion takes about 24 hours.

 

 

Creating the ClueWeb12 B13 Dataset

After all the processing was performed on the downloaded documents, a uniform, representative 7% sample was created by taking every 14th document (WARC response record) from each file of the full dataset. This sampling was done after the duplicate records were removed. The B13 dataset is shipped on one logical disk (a 500 GB drive), and its organization is similar to that of the full ClueWeb12 dataset described below. Beginning with ClueWeb12 v1.1, we will be distributing the B13 dataset on the same disk as the full dataset. If you have version 1.0, you can create the ClueWeb12 B13 dataset by using the tool provided here. This package includes a verification procedure to guarantee that you have a correct and complete ClueWeb12 B13 dataset. On our system, it took about 8 hours to create the ClueWeb12 B13 dataset.
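
For illustration only, the following Python sketch (using the warcio library, which is not part of the official tools) shows the sampling idea on a single file: copy every 14th WARC response record to a new file. The file names are hypothetical, and the official tool may start each group of 14 at a different offset.

# Illustrative only: use the official conversion tool to build ClueWeb12-B13.
# This sketch copies every 14th WARC response record (plus the warcinfo record)
# from one file to a new file, using the warcio library.
from warcio.archiveiterator import ArchiveIterator
from warcio.warcwriter import WARCWriter

def sample_every_14th(in_path, out_path):
    with open(in_path, 'rb') as fin, open(out_path, 'wb') as fout:
        writer = WARCWriter(fout, gzip=True)
        n_responses = 0
        for record in ArchiveIterator(fin):
            if record.rec_type == 'warcinfo':
                writer.write_record(record)        # keep the file-level warcinfo record
            elif record.rec_type == 'response':
                if n_responses % 14 == 0:          # keep one of every 14 pages (~7%)
                    writer.write_record(record)
                n_responses += 1

sample_every_14th('0000wb-00.warc.gz', '0000wb-00-b13.warc.gz')   # hypothetical file names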

ClueWeb12 B13 Summary Statistics

Size compressed: 389 GB
Size uncompressed: 1.95 TB
Number of WARC files: 33,447
Number of documents: 52,343,021

 

 

Dataset Organization

The dataset is organized hierarchically, as follows.

Each disk contains about 220 million pages. Each segment contains about 50 million pages. Each directory contains about 3 million pages. Each file contains about 30,000 pages, and requires about 1 GB of space when uncompressed.

 

Naming Conventions

Segment names: Segments are named ClueWeb12_<segment #>, where <segment #> is a 2-digit segment sequence number that begins with "00". For example, the first segment is named "ClueWeb12_00".

Directory names: Directories are named <segment #><directory #><type> where <directory #> is a 2-digit directory sequence number that begins with 00, and <type> is a two character representation of the type of data contained in the directory. (See below for the definition and description of the two character data types.) Each type of data has its own sequencing. For example, the first directories in the ClueWeb12_00 segment are named "0000tw", "0000wb", and "0000wt".

File names: Files are named <directory>-<file #>.warc.gz where <directory> is the directory where the file is located, and <file #> is a 2-digit sequence number that begins with "00". For example, the first file in the ClueWeb12_00/0000wb directory is named 0000wb-00.warc.gz.
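
For illustration, the naming scheme can be expressed as a small Python helper (a hypothetical function, not part of the dataset tooling) that builds a file's path from its segment, directory, data type, and file numbers:

# Illustrative helper (not part of the dataset tooling): build a WARC file's path,
# relative to the disk root, from its segment, directory, data type, and file numbers.
def warc_file_path(segment, directory, dtype, file_no):
    d = "%02d%02d%s" % (segment, directory, dtype)                      # e.g. "0000wb"
    return "ClueWeb12_%02d/%s/%s-%02d.warc.gz" % (segment, d, d, file_no)

print(warc_file_path(0, 0, "wb", 0))    # prints ClueWeb12_00/0000wb/0000wb-00.warc.gz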

 

File Format

Web pages are grouped together in files that conform to the WARC ISO 28500 version 1.0 standard ("WARC files"). The WARC file format is described at http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf. WARC files are compressed with gzip. When a WARC file is uncompressed, it requires about 1 GB of storage.

The WARC warcinfo header has three custom fields:

  1. WARC-Number-of-Documents: The number of ClueWeb12 documents contained in the file;

  2. WARC-File-Length: The length in bytes of the uncompressed file; and

  3. WARC-Data-Type: A short description of the type of documents contained in the file (see below under Data Type Identifiers).

The WARC response header has one custom field:

  1. WARC-TREC-ID: A unique identifier that describes the location of the individual record within the ClueWeb12 dataset. The WARC-TREC-ID value is in the format clueweb12-<directory>-<file>-<record>. <directory> and <file> are as described above (Naming Conventions). <record> is a 5-digit number beginning with "00000" that corresponds to the record's sequence within the file. (This is the sequence number in the full ClueWeb12 dataset file before duplicate records were removed.)

For example, one file taken from the ClueWeb12 dataset contains 967 pages (WARC response records).
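
One way to read these fields is with the warcio Python library (an illustrative choice; any WARC 1.0 parser will do). The sketch below prints a file's custom warcinfo fields and the WARC-TREC-ID of each response record; the file name is hypothetical.

# Minimal sketch using the warcio library: print a file's custom warcinfo fields
# and the WARC-TREC-ID of each response record.
from warcio.archiveiterator import ArchiveIterator

with open('0000wb-00.warc.gz', 'rb') as stream:                       # hypothetical file name
    for record in ArchiveIterator(stream):
        if record.rec_type == 'warcinfo':
            print('documents:', record.rec_headers.get_header('WARC-Number-of-Documents'))
            print('length:', record.rec_headers.get_header('WARC-File-Length'))
            print('data type:', record.rec_headers.get_header('WARC-Data-Type'))
        elif record.rec_type == 'response':
            print(record.rec_headers.get_header('WARC-TREC-ID'))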

 

Data Type Identifiers

The ClueWeb12 dataset contains data from three distinct sources, and the 2-letter data type identifiers indicate the source of each directory's data: wb for documents from the main web crawl, tw for documents whose urls were harvested from tweets, and wt for documents from the WikiTravel crawl.

 

Checksum Files

Files with the name "ClueWeb_*.md5" are the md5 sums of the individual WARC files in the dataset. These MD5 sums are in the format:


<md5 checksum hash> <file>

with multiple lines in the file - one line for each file in the dataset. For example, the following line:


f8ea2571adfdb792e83c6dfde0d5179b ./ClueWeb12_01/0100tw/0100tw-21.warc.gz

denotes the md5 checksum for the file 0100tw-21.warc.gz under the 0100tw directory in the ClueWeb12_01 segment.
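
For illustration, the checksums can be verified with a short Python script such as the sketch below; the checksum file name is hypothetical, and because the listed paths are relative, the script should be run from the disk's root directory.

# Minimal sketch: verify each WARC file listed in a ClueWeb_*.md5 checksum file.
# The paths inside the .md5 file are relative, so run this from the disk root.
import hashlib

def verify(md5_file):
    with open(md5_file) as listing:
        for line in listing:
            expected, path = line.split()
            digest = hashlib.md5()
            with open(path, 'rb') as warc:
                for chunk in iter(lambda: warc.read(1 << 20), b''):
                    digest.update(chunk)
            print('OK' if digest.hexdigest() == expected else 'MISMATCH', path)

verify('ClueWeb_disk1.md5')   # hypothetical checksum file name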

 

Record Count Files

Files with the name "ClueWeb12_*_counts.txt" are the record counts of each WARC file in the dataset. The record count files are in the format:


<file> <# of records>

with multiple lines in the file - one line for each file in the dataset. For example, the following line:


./0200wb/0200wb-24.warc.gz           33311

denotes that the file 0200wb-24.warc.gz under the 0200wb directory has 33,311 individual page records in it.
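
The record count files can similarly be checked against the WARC files themselves, for example with the Python sketch below (warcio is an illustrative choice of reader; the counts file name is hypothetical, and the listed paths are relative, so run the script from the matching segment directory).

# Minimal sketch: compare a ClueWeb12_*_counts.txt file with the actual number of
# WARC response records in each listed file.
from warcio.archiveiterator import ArchiveIterator

def check_counts(counts_file):
    with open(counts_file) as listing:
        for line in listing:
            path, expected = line.split()
            with open(path, 'rb') as stream:
                actual = sum(1 for r in ArchiveIterator(stream) if r.rec_type == 'response')
            print('OK' if actual == int(expected) else 'MISMATCH (%d)' % actual, path)

check_counts('ClueWeb12_02_counts.txt')   # hypothetical name; run from the matching segment directory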

 

Language Encodings

All content is stored in the character encoding reported by the web server that supplied the page. When available, the content encoding appears in the record's HTTP header information, in the "Content-Type" key/value pair.
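
For illustration, the declared charset of each page can be extracted with a short script such as the Python sketch below (again using warcio; the file name is hypothetical).

# Minimal sketch: report the charset declared in each page's HTTP Content-Type header
# (pages with no declared charset are reported as None).
from warcio.archiveiterator import ArchiveIterator

with open('0000wb-00.warc.gz', 'rb') as stream:                       # hypothetical file name
    for record in ArchiveIterator(stream):
        if record.rec_type != 'response' or record.http_headers is None:
            continue
        content_type = record.http_headers.get_header('Content-Type') or ''
        charset = None
        if 'charset=' in content_type.lower():
            charset = content_type.lower().split('charset=')[-1].strip()
        print(record.rec_headers.get_header('WARC-TREC-ID'), charset)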

 

Record Counts

Segment          # Records
ClueWeb12_00     45,278,522
ClueWeb12_01     44,389,316
ClueWeb12_02     44,069,951
ClueWeb12_03     42,491,359
ClueWeb12_04     36,026,724
ClueWeb12_05     21,720,416
ClueWeb12_06     23,101,855
ClueWeb12_07     30,503,029
ClueWeb12_08     39,712,288
ClueWeb12_09     38,540,335
ClueWeb12_10     39,802,260
ClueWeb12_11     40,754,618
ClueWeb12_12     38,606,284
ClueWeb12_13     31,329,242
ClueWeb12_14     32,950,768
ClueWeb12_15     37,716,513
ClueWeb12_16     34,996,028
ClueWeb12_17     34,051,249
ClueWeb12_18     40,074,978
ClueWeb12_19     36,903,637
Total           733,019,372

 

Summary Statistics

Size compressed: 5.54 TB
Size uncompressed: 27.3 TB
Number of WARC files: 33,447
Number of documents: 733,019,372

 

Other Data

Disk 1 includes three other datasets that may be useful to researchers working with the ClueWeb12 dataset. This data is not considered part of the ClueWeb12 dataset and does not contain ClueWeb12 document ids. It is provided "as is" as a convenience to researchers who may need copies of these datasets that are contemporaneous with the ClueWeb12 crawl.

Wikipedia English Language Dump

The Wikipedia English Language Dump is an XML version of the English Wikipedia that was provided by the Wikimedia Foundation, Inc. It was obtained from http://dumps.wikimedia.org/enwiki/. The information below was provided by the Wikimedia Foundation.

enwiki-20120502-pages-articles.xml.bz2 - (7.9 GB) contains recombine articles, templates, media/file descriptions, and primary meta-pages from 2012-05-03 08:51:39.
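
For illustration, the dump can be processed in a streaming fashion; the Python sketch below (not provided with the dataset) prints page titles directly from the compressed XML, assuming only the standard MediaWiki export format.

# Minimal sketch: stream page titles out of the compressed MediaWiki XML dump
# without unpacking it to disk.
import bz2
import xml.etree.ElementTree as ET

with bz2.open('enwiki-20120502-pages-articles.xml.bz2', 'rb') as dump:
    for _, elem in ET.iterparse(dump):
        tag = elem.tag.rsplit('}', 1)[-1]   # strip the MediaWiki export XML namespace
        if tag == 'title':
            print(elem.text)
        elif tag == 'page':
            elem.clear()                    # free each finished <page> subtree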

WikiTravel English Language Dump

The WikiTravel English Language Dump is an XML version of the English WikiTravel that was provided by Internet Brands, Inc. It was obtained from wikitravel.org between 2012-03-06 and 2012-04-05 using dumpgenerator.py, a script for generating backups of MediaWiki wikis that was acquired from http://archiveteam.org/index.php?title=WikiTeam. The information below was provided by Internet Brands, Inc.

There are two files:
  1. wikitravelorg_en_main_page-20120306-titles.txt.bz2 - A list of English WikiTravel page titles, generated by software.
  2. wikitravelorg_en_main_page-20120306-current.xml.bz2 - A compressed xml file of the dumped WikiTravel xml documents. There are 78,543 pages within the xml file.

Freebase RDF Dump

The Freebase Dump is an RDF version of Freebase that was provided by Google, Inc. It was obtained from Freebase Data Dumps (http://wiki.freebase.com/wiki/Data_dumps). The information below was provided by Google, Inc.

freebase-rdf-2012-12-09-00-00.gz - (8.4 GB) contains a full data dump of every fact and assertion in Freebase from 2012-12-09.