Shortcuts: WP:DD, WP:DUMP

Wikipedia offers free copies of all available content to interested users. These databases can be used for mirroring, personal use, informal backups, offline use or database queries (such as for Wikipedia:Maintenance). All text content is multi-licensed under the Creative Commons Attribution-ShareAlike 3.0 License (CC-BY-SA) and the GNU Free Documentation License (GFDL). Images and other files are available under different terms, as detailed on their description pages. For our advice about complying with these licenses, see Wikipedia:Copyrights.

Where do I get...

English-language Wikipedia

  • Dumps from any Wikimedia Foundation project: http://download.wikimedia.org/
  • English Wikipedia dumps in SQL and XML: http://download.wikimedia.org/enwiki/
    • pages-articles.xml.bz2 - Current revisions only, no talk or user pages. (This is probably the one you want; see the example download command after this list. WARNING: 5.6 GB compressed, up to 20 times that size uncompressed.)
    • pages-current.xml.bz2 - Current revisions only, all pages
    • pages-full.xml.bz2/7z - Current revisions, all pages (includes talk and user pages)
    • pages-meta-history.xml.bz2 - All revisions, all pages WARNING: expands to several Terabytes of text. Please only download this if you know you can cope with this quantity of data
    • pages-meta-history.xml.7z - All revisions, all pages WARNING: expands to several Terabytes of text. Please only download this if you know you can cope with this quantity of data
    • abstract.xml.gz - page abstracts
    • all_titles_in_ns0.gz - Article titles only
    • SQL files for the pages and links are also available
    • Caution: Some dumps may be incomplete - pay attention to such warnings (e.g. "Dump complete, 1 item failed") near the dump file.
  • To download a subset of the database in XML format, such as a specific category or a list of articles see: Special:Export, usage of which is described at Help:Export.
  • Wiki front-end software: Wikipedia:MediaWiki.
  • Database backend software: MySQL.
  • Image dumps: See below.
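
For example, a resumable download of the pages-articles dump with wget might look like the following (the exact date-stamped filename is an assumption based on the usual enwiki naming convention; check the dump directory listing for the current name):

 wget --continue http://download.wikimedia.org/enwiki/20100130/enwiki-20100130-pages-articles.xml.bz2

The --continue switch lets wget resume a partially downloaded file instead of starting over.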

Other languages

In the http://download.wikimedia.org/ directory you will find the latest SQL and XML dumps for all of the Wikimedia projects, not just the English-language Wikipedia. To find the dump for another project, select the directory named after the appropriate two-letter language code and project, for example frwiki for the French Wikipedia or dewiktionary for the German Wiktionary.

Some other directories (e.g. simple, nostalgia) exist, with the same structure.

Latest complete dump of the English Wikipedia

As of 12 March 2010, the latest complete dump of the English-language Wikipedia can be found at http://download.wikimedia.org/enwiki/20100130/. This is the first complete dump of the English-language Wikipedia to have been created since 2008.

The .bz2 file enwiki-20100130-pages-meta-history.xml.bz2 linked from that page contains the complete text of all publicly viewable revisions found in the current database. It has an unofficial MD5 sum of 65677bc275442c7579857cc26b355ded (Tomasz Finc, "enwiki Checksumming pages-meta-history.xml.bz2", wikitech-l mailing list, 11 March 2010, http://lists.wikimedia.org/pipermail/wikitech-l/2010-March/047117.html).

The .7z version of the same file is still being processed, and will be available soon.

Warning: the compressed file enwiki-20100130-pages-meta-history.xml.bz2 is over 280 GB in size and decompresses to several terabytes of text. Before consuming Wikipedia's bandwidth downloading it, ask yourself: do you really have enough hard disk space and computing resources to work on this file? Could you instead use Wikipedia's API and work on a small random sample of the dataset?
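
If a small sample is all you need, the MediaWiki API can return the current wikitext of a handful of random articles directly. The following is a minimal sketch using only Python's standard library and the public api.php endpoint; the parameter choices (five random main-namespace pages) and the User-Agent string are illustrative assumptions, not a prescribed method:

 import json
 import urllib.parse
 import urllib.request

 # Fetch the current wikitext of a few random articles via the MediaWiki API,
 # instead of downloading a multi-terabyte full-history dump.
 API = "https://en.wikipedia.org/w/api.php"
 params = {
     "action": "query",
     "generator": "random",
     "grnnamespace": 0,   # main (article) namespace only
     "grnlimit": 5,       # five random pages for this illustration
     "prop": "revisions",
     "rvprop": "content",
     "format": "json",
 }
 url = API + "?" + urllib.parse.urlencode(params)
 req = urllib.request.Request(url, headers={"User-Agent": "dump-sample-sketch/0.1"})
 with urllib.request.urlopen(req) as resp:
     data = json.loads(resp.read().decode("utf-8"))

 for page in data["query"]["pages"].values():
     wikitext = page["revisions"][0]["*"]   # raw wikitext of the latest revision
     print(page["title"], "-", len(wikitext), "characters of wikitext")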

Images and uploaded files

Wikipedia does not currently allow or provide facilities to download all images in bulk. As of 17 May 2007, Wikipedia had disabled or neglected all viable bulk downloads of images, including torrent trackers. Therefore, there is no way to obtain image dumps other than scraping Wikipedia pages or using Wikix, which converts a database dump into a series of scripts that fetch the images.

Unlike most article text, images are not necessarily licensed under the GFDL and CC-BY-SA 3.0. They may be under one of many free licenses, in the public domain, believed to be fair use, or even copyright infringements (which should be deleted). In particular, use of fair-use images outside the context of Wikipedia or similar works may be illegal. Images under most licenses require a credit, and possibly other attached copyright information. This information is included in the image description pages, which are part of the text dumps available from download.wikimedia.org. In conclusion, download these images at your own risk.

Dealing with large files

You may run into problems downloading files of unusual size. Some older operating systems, file systems, and web clients have a hard limit of 2GB on file size. If you seem to be hitting this limit, try using wget version 1.10 or greater, cURL version 7.11.1-1 or greater, or a recent version of lynx (using -dump).

It is recommended that you check the MD5 sums (provided in a file in the download directory) to make sure your download was complete and accurate. You can do this by running the "md5sum" command on the files you downloaded; given how large the files are, this may take some time. Because of the technical details of how files are stored, file sizes may be reported differently on different filesystems, so comparing sizes is not a reliable check that a download is complete. You may also, though it is unlikely, have experienced corruption during the download.
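
For example, to check the full-history file discussed above against the checksum quoted earlier (or against the checksum file published in the same download directory):

 md5sum enwiki-20100130-pages-meta-history.xml.bz2

The printed hash should match the published value exactly; any difference means the download is incomplete or corrupted and should be repeated or resumed.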

The file size limits for the various file systems are as follows:

  • FAT16 (MS-DOS version 6, Windows 3.1, and earlier) supports files up to 2 GB.
  • FAT32/VFAT (Windows 95, 98, 98SE, and ME) supports files up to 4 GB.
  • ext2 and ext3 filesystems can handle 16 GB files and larger, depending on your block size. See http://www.suse.com/~aj/linux_lfs.html for more information.
  • ext4 supports files up to 16 TB.
  • HFS Plus (Mac OS X 10.2+) and XFS both support files up to 8 exabytes.
  • NTFS (Windows NT 3.51+, 2000, XP, Server 2003, Vista and Windows 7) supports files up to 16 exabytes.

Many standard programming libraries and functions may also cause problems when accessing large files. For example, the standard C function fopen limits file sizes to 2 GB on many 32-bit systems, because the file offset is kept in a signed 32-bit integer, which cannot address positions beyond 2^31 - 1 bytes (about 2 GB).

Why not just retrieve data from wikipedia.org at runtime?

Suppose you are building a piece of software that at certain points displays information that came from Wikipedia. If you want your program to display the information in a different way than the live version does, you'll probably need the wikicode used to enter it, rather than the finished HTML.

Also, if you want to get all of the data, you'll probably want to transfer it in the most efficient way possible. The wikipedia.org servers need to do quite a bit of work to convert wikicode into HTML; that is time-consuming both for you and for the wikipedia.org servers, so simply spidering all pages is not the way to go.

To retrieve any single article in XML, request Special:Export/Title of the article.

Read more about this at Special:Export.
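
As a rough sketch of this approach (using only Python's standard library; the article title here is just an example), fetching the wikicode of a single article through Special:Export could look like this:

 import urllib.parse
 import urllib.request
 import xml.etree.ElementTree as ET

 # Fetch one article's current wikitext via Special:Export, which returns
 # the same XML format used by the database dumps.
 title = "Albert Einstein"   # example article title
 url = ("https://en.wikipedia.org/wiki/Special:Export/"
        + urllib.parse.quote(title.replace(" ", "_")))
 req = urllib.request.Request(url, headers={"User-Agent": "export-sketch/0.1"})
 with urllib.request.urlopen(req) as resp:
     tree = ET.parse(resp)

 # The export XML is namespaced; match elements by local name so the snippet
 # does not depend on a particular schema version.
 for elem in tree.iter():
     if elem.tag.rsplit("}", 1)[-1] == "text":
         print((elem.text or "")[:500])   # first 500 characters of wikitext
         break

Remember that this makes one request per article, so it is only appropriate for small numbers of pages; for anything larger, use the dumps.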

Please be aware that live mirrors of Wikipedia that are dynamically loaded from the Wikimedia servers are prohibited. Please see Wikipedia:Mirrors and forks.

Please do not use a web crawler

Please do not use a web crawler to download large numbers of articles. Aggressive crawling of the server can cause a dramatic slow-down of Wikipedia. Our robots.txt blocks many ill-behaved bots.

Sample blocked crawler email

IP address nnn.nnn.nnn.nnn was retrieving up to 50 pages per second from wikipedia.org addresses. Robots.txt has a rate limit of one per second set using the Crawl-delay setting. Please respect that setting. If you must exceed it a little, do so only during the least busy times shown in our site load graphs at http://stats.wikimedia.org/EN/ChartsWikipediaZZ.htm. It's worth noting that to crawl the whole site at one hit per second will take several weeks. The originating IP is now blocked or will be shortly. Please contact us if you want it unblocked. Please don't try to circumvent it - we'll just block your whole IP range.
If you want information on how to get our content more efficiently, we offer a variety of methods, including weekly database dumps which you can load into MySQL and crawl locally at any rate you find convenient. Tools are also available which will do that for you as often as you like once you have the infrastructure in place. More details are available at http://en.wikipedia.org/wiki/Wikipedia:Database_download.
Instead of an email reply you may prefer to visit #mediawiki at irc.freenode.net to discuss your options with our team.

Note that the robots.txt currently has a commented out Crawl-delay:

 ## *at least* 1 second please. preferably more :D
 ## we're disabling this experimentally 11-09-2006
 #Crawl-delay: 1

Please be sure to use an intelligent non-zero delay regardless.

Doing SQL queries on the current database dump

You can run SQL queries on the current database dump via wikisign.org (as a replacement for the disabled Special:Asksql page). For more information about this service, see de:Benutzer:Filzstift/wikisign.org (in German only).

Dealing with compressed files

Approximate file sizes are given for the compressed dumps; uncompressed they'll be significantly larger.

Some older archives are compressed with gzip, while newer archives are available in both bzip2 and 7z formats; free Windows tools that can handle these formats are noted below.

Windows users may not have a bzip2 decompressor on hand; a command-line Windows version of bzip2 (from here) is available for free under a BSD license.

The LGPL-licensed GUI file archiver 7-Zip is also able to open bz2-compressed files, and is available for free.

Mac OS X ships with the command-line bzip2 tool.

Please note that older versions of bzip2 may not be able to handle files larger than 2GB, so make sure you have the latest version if you experience any problems.
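
If disk space is tight, a .bz2 dump does not have to be fully decompressed before you can look at it; it can be read as a stream instead. A minimal Python sketch (the filename is assumed from the enwiki naming convention used above):

 import bz2

 # Stream the compressed dump without writing an uncompressed copy to disk.
 with bz2.open("enwiki-20100130-pages-articles.xml.bz2", "rt", encoding="utf-8") as dump:
     for lineno, line in enumerate(dump, 1):
         print(line.rstrip())
         if lineno >= 20:   # just peek at the first 20 lines of XML
             break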

Database schema

SQL schema

See also: mw:Manual:Database layout

The SQL file used to initialize a MediaWiki database can be found here.

XML schema

The XML schema for each dump is defined at the top of the file.

Help parsing dumps for use in scripts
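
The dumps are far too large to load into memory, so scripts should stream them. A minimal sketch in Python (standard library only; the filename is assumed from the naming convention above) that walks a pages-articles dump and prints the first few page titles:

 import bz2
 import xml.etree.ElementTree as ET

 def iter_titles(path):
     # Yield page titles from a pages-articles dump, streaming the .bz2 file.
     with bz2.open(path, "rb") as f:
         for _event, elem in ET.iterparse(f, events=("end",)):
             # Tags carry the export schema namespace, e.g.
             # {http://www.mediawiki.org/xml/export-0.4/}page, so compare by
             # local name to stay independent of the schema version.
             if elem.tag.rsplit("}", 1)[-1] == "page":
                 for child in elem:
                     if child.tag.rsplit("}", 1)[-1] == "title":
                         yield child.text
                         break
                 elem.clear()   # drop the parsed page to keep memory bounded

 if __name__ == "__main__":
     for i, title in enumerate(iter_titles("enwiki-20100130-pages-articles.xml.bz2"), 1):
         print(title)
         if i >= 10:   # show just the first ten titles
             break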

Help importing dumps into MySQL

See http://github.com/babilen/wp-tools/ [dead link] for extensive information on downloading and importing dumps.
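
One common route (a sketch; the exact paths depend on where MediaWiki is installed) is MediaWiki's own maintenance/importDump.php script, which reads the XML dump from standard input, so the dump can be decompressed on the fly:

 bzip2 -dc enwiki-20100130-pages-articles.xml.bz2 | php maintenance/importDump.php

Running maintenance/rebuildrecentchanges.php afterwards updates the recent-changes table for the newly imported pages.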

Static HTML tree dumps for mirroring or CD distribution

MediaWiki 1.5 includes routines to dump a wiki to HTML, rendering the HTML with the same parser used on a live wiki. Note that putting one of these dumps on the web unmodified will constitute a trademark violation; they are intended for private viewing in an intranet or desktop installation.



Dynamic HTML generation from a local XML database dump

Instead of converting a database dump file to many pieces of static HTML, one can also use a dynamic HTML generator. Browsing such pages feels just like browsing a live wiki site, but the content is fetched and converted from a local dump file on request from the browser.

WikiFilter

WikiFilter is a program which allows you to browse over 100 dump files without visiting a Wiki site.

WikiFilter system requirements

  • A recent Windows version (WinXP is fine; Win98 and WinME won't work because they don't have NTFS support)
  • A fair bit of hard drive space (to install you will need about 12-15 GB; afterwards you will only need about 10 GB)

How to set up WikiFilter

  1. Start downloading a Wikipedia database dump file such as an English Wikipedia dump. It is best to use a download manager such as GetRight so you can resume downloading the file even if your computer crashes or is shut down during the download.
  2. Download XAMPPLITE from [2] (you must get the 1.5.0 version for it to work). Make sure to pick the file whose filename ends with .exe
  3. Install/extract it to C:\XAMPPLITE.
  4. Download WikiFilter 2.3 from this site: https://sourceforge.net/projects/wikifilter. You will have a choice of files to download, so make sure that you pick the 2.3 version. Extract it to C:\WIKIFILTER.
  5. Copy the WikiFilter.so into your C:\XAMPPLITE\apache\modules folder.
  6. Edit your C:\xampplite\apache\conf\httpd.conf file, and add the following line:
    • LoadModule WikiFilter_module "C:/XAMPPLITE/apache/modules/WikiFilter.so"
  7. When your Wikipedia file has finished downloading, uncompress it into your C:\WIKIFILTER folder. (The WinRAR demo version, http://www.rarlab.com/, works well for this; so does BitZipper, http://www.bitzipper.com/winrar.html.)
  8. Run WikiFilter (WikiIndex.exe), and go to your C:\WIKIFILTER folder, and drag and drop the XML file into the window, click Load, then Start.
  9. After it finishes, exit the window, and go to your C:\XAMPPLITE folder. Run the setup_xampp.bat file to configure xampp.
  10. When you finish with that, run the Xampp-Control.exe file, and start apache.
  11. Browse to http://localhost/wiki and see if it works
    • If it doesn't work, see the forums.

WikiTaxi

WikiTaxi is an offline reader for wikis in MediaWiki format. It enables users to search and browse popular wikis like Wikipedia, Wikiquote, or WikiNews without being connected to the Internet. WikiTaxi works well with different languages such as English, German, and Turkish.

WikiTaxi system requirements

  • Any Windows version from Windows 95 onwards. Large file support (greater than 4 GB) is required for the huge wikis (English only at the time of this writing).
  • 16 MB RAM or less for the WikiTaxi reader, 128 MB recommended for the importer (more for speed).
  • Storage space for the Wiki database. At the moment, this grows to about 9.5 GB for the English Wikipedia, 2 GB for German, less for other Wikis. These figures are likely to grow in the future.

WikiTaxi usage

  1. Download WikiTaxi and extract to an empty folder. No installation is otherwise required.
  2. Download the XML database dump (*.xml.bz2) of your favorite wiki.
  3. Run WikiTaxi_Importer.exe to import the database dump into a WikiTaxi database. The importer uncompresses the dump as it imports, so to save drive space, do not uncompress it beforehand.
  4. When the import is finished, start up WikiTaxi.exe and open the generated database file. You can start searching, browsing, and reading immediately.
  5. After a successful import, the XML dump file is no longer needed and can be deleted to reclaim disk space.
  6. To update an offline Wiki for WikiTaxi, download and import a more recent database dump.

For WikiTaxi reading, only two files are required: WikiTaxi.exe and the .taxi database. Copy them to any storage device (memory stick or memory card) or burn them to a CD or DVD and take your Wikipedia with you wherever you go!

Offline wikipedia reader

(for Mac OS X, Linux, FreeBSD/OpenBSD/NetBSD, and other Unices)

Build a fast Wikipedia offline reader describes a step-by-step process that UNIX users can follow to install a fast offline reader for Wikipedia (and any other MediaWiki-based content). It depends on the pages-articles XML dump file, which Wikimedia publishes periodically, and uses a set of open-source tools and technologies to quickly index the data and offer a local, offline viewer.

Main features

  1. Very fast searching
  2. Keyword (actually, title words) based searching
  3. Search produces multiple possible articles: you can choose amongst them
  4. LaTeX-based rendering for mathematical formulae
  5. Minimal space requirements: the original .bz2 file plus the index
  6. Very fast installation (a matter of hours) compared to loading the dump into MySQL

BzReader and MzReader (for Windows)

BzReader is an offline Wikipedia reader with fast search capabilities. It renders the Wiki text into HTML and doesn't need to decompress the database. Requires Microsoft .NET framework 2.0.

MzReader by Mun206 works with (though is not affiliated with) BzReader, and allows further rendering of wikicode into better HTML, including an interpretation of the monobook skin. It aims to make pages more readable. Requires Microsoft Visual Basic 6.0 Runtime, which is not supplied with the download. Also requires Inet Control and Internet Controls (Internet Explorer 6 ActiveX), which are packaged with the download.

Rsync

This section is out of date.

You can use rsync to download the database. For example, this command will download the current English database:

rsync rsync://download.wikimedia.org/dumps/wikipedia/en/cur_table.sql.bz2 . --partial --progress

The "--partial" switch prevents rsync from deleting the file in the event the download is interrupted. You may then issue the very same command again to resume the download. The "--progress" switch will show the download progress; for less verbose output, do not use this switch.

The rsync utility is designed to synchronize files in a manner such that only the differences between the files are transferred. This provides a considerable performance enhancement, especially when synchronizing large files that have relatively few changes. However, if a file is compressed or encrypted, rsync will not perform well; in fact, it may perform worse than downloading a fresh copy of the file. Many of the database files are only available compressed, so there is little, if anything, to be gained by using rsync to expedite an update of an older SQL dump. If the SQL dumps were available uncompressed, this process would work extremely well, especially if rsync were invoked with the on-the-fly compression switch (-z). It is uncertain whether uncompressed database dumps will become available. However, rsync does remain a useful and expedient tool for resuming downloads that have been interrupted, repairing downloads that have become corrupted, or updating any files that are not compressed (e.g. upload.tar). For more information, see rsync.
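
For instance, a file that is distributed uncompressed, such as upload.tar, could be refreshed with on-the-fly compression (the remote path here is an assumption patterned on the command above; check the server's module listing for the real one):

 rsync -z --partial --progress rsync://download.wikimedia.org/dumps/wikipedia/en/upload.tar .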

Technical notes

  • There is some discussion about a modified gzip that can improve rsync performance. This patch to gzip resets the output stream at fixed intervals. This results in fixed-size blocks of compressed data, which are friendlier to rsync. Bzip2 is designed from the start to create blocks of compressed data, and works well with rsync. Since bzip2’s compression ratios are almost always better than gzip’s, there is no reason to switch.
  • Technically speaking, upload.tar is compressed, in the sense that it mostly contains compressed files such as images (which is why it should not be compressed otherwise). However, usually the files themselves do not change. The addition, removal, or reordering of static files in an uncompressed tarball should still yield excellent rsync performance, regardless of the content of those files.
