Plot of the percentage growth rate against predictions by Gompertz model, logistic model and extended growth model
Shortcut:
WP:GROWTH
Growth in total article text in English Wikipedia, measured in gigabytes (compressed)[1]

This page analyzes the article count data in Wikipedia:Size of Wikipedia and attempts to fit a simple numerical model of past and future growth to the observed article count size and growth data.

The growth of the English Wikipedia no longer appears to be exponential. It seems to fit other models better: a Gompertz function, which predicts an eventual size of perhaps 4.4 million articles; a logistic function, which can be projected to reach perhaps 3.5 million articles; or an extended-growth model, which predicts a much larger ultimate size.

Growth of the article count

The following graph shows the number of articles on the English Wikipedia from its creation in 2001 up to the present.

EnwikipediaArt.PNG

Here, several models are presented to attempt to explain the observed general trends in article growth.

Old exponential model for article count of Wikipedia

Note: Between 2003 and 2006/2007 this was the general model for article count of Wikipedia.

Graphs of the article count for the English Wikipedia, from January 10, 2001, to September 9, 2007, based on statistics from this page and Wikipedia:Announcements. The two graphs show both logarithmic and linear y-axes. The graphs also show the approximate rate of article increase per day, along with the projected number of articles based on annual doubling referenced to January 1, 2003.

EnglishWikipediaArticleCountGraphs.png

The growth in articles had been approximately 100% per year from 2003 through most of 2006, but has tailed off since roughly September 2006. The trend is no longer one of exponential growth, but has been closer to linear since that time.

Notes

A few notes on features of the graph:

  • The project started with a slow rise in article count, which gradually accelerated over time.
  • The big slowdown in the rate of article creation in June–July 2002 was caused by major server performance problems, remedied by extensive work on the software.
  • The sudden jump in article count in October 2002 is due to roughly 30,000 stub articles on U.S. towns and cities generated from a database being added by an auto-posting robot, Rambot, during an eight-day period. Although initially controversial as to whether these were "real" encyclopedia articles or merely "stubs", most of the Rambot articles have since been substantially expanded.
  • Not counting the Rambot operation, the true maximum rate of article creation was in August 2006, when about 2400 net new articles were being added each day. From September 2006 through May 2007, the article count has increased by an average of about 1670 articles per day.
  • During the first half of May 2007, the article growth rate dropped below 1500 articles per day, the lowest rate since October 2005. The growth rate has since rebounded to about 2000 articles per day from late July through early September 2007.

Critique of the exponential model

Note: This was developed from 2004 to 2006.
WPlogsize.png

The exponential model of Wikipedia growth is based on the following:

  • more content leads to more traffic
  • which leads to more edits
  • which generate more content

The average rate of growth is therefore assumed to be proportional to the size of Wikipedia, which makes the growth exponential.

The graph of article count on the right is plotted on a logarithmic scale, so exponential growth should manifest itself as linear behavior of the data. Between October 2002 and July 2006, the data do fit very well along the dotted line shown, while from July 2006 onwards there is a noticeable fall off from linear behaviour. Before October 2002, the behaviour is more complex.

Number of articles reg 01.jpg

The graph on the right below is a close-up of the data points that follow a linear trend; the best-fit line in red was computed using linear regression. From the slope of this best-fit line, the characteristic (e-folding) time of the exponential growth can be found, giving:

N(t)=N(0)\ e^{t/\tau};\quad\tau\approx 500\ \mathrm{days}

The previous expression means that the number of articles doubled once every τ ln 2 ≈ 346 days from October 2002 to October 2006, to a very good approximation. If Wikipedia had kept up with this trend, as shown on the graph, the number of articles would have been 1,900,000 by December 2006, 2,800,000 by June 2007 and 4,000,000 by December 2007. In fact the growth has slowed, and Wikipedia has apparently ceased growing exponentially.
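
The relation between the characteristic time and the doubling time can be checked numerically. The following is a minimal Python sketch; the function names are illustrative and not part of the original analysis, and τ = 500 days is the rounded value quoted above.

```python
import math

TAU_DAYS = 500.0  # characteristic (e-folding) time quoted in the text, rounded


def doubling_time(tau_days: float) -> float:
    """Time for N(t) = N(0) * exp(t / tau) to double: tau * ln(2)."""
    return tau_days * math.log(2)


def growth_factor(elapsed_days: float, tau_days: float = TAU_DAYS) -> float:
    """Factor by which the article count grows over elapsed_days."""
    return math.exp(elapsed_days / tau_days)


print(f"doubling time: {doubling_time(TAU_DAYS):.1f} days")
print(f"growth over 365 days: x{growth_factor(365):.2f}")
```

With the rounded τ, the doubling time comes out at about 346.6 days, in line with the figure of 346 days quoted above.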

Wikipedia growth and predictions from July 2006 to December 2008

The graph on the right is an exponential growth projection made in July 2006. The number of articles on the English Wikipedia up to July 2006 is shown in red, and this is extrapolated in blue using an exponential function (approximately 38000*exp(0.0017t) articles, where t is the number of days since January 1, 2001).

By the end of 2006, when there were 1.5 million articles, the projection was already overestimating the growth by 10-15%, and the prediction of over 3 million articles by the end of 2007 is significantly more than the actual figure of about 2.1 million articles.
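
The July 2006 projection can be re-evaluated from the approximate formula quoted above. This is a rough sketch only: both coefficients are given in the text to two significant figures, so the values it prints may differ somewhat from those read off the original graph. The constant and function names are illustrative.

```python
import math
from datetime import date

EPOCH = date(2001, 1, 1)   # t = 0 in the quoted formula
SCALE = 38000.0            # approximate prefactor quoted in the text
RATE_PER_DAY = 0.0017      # approximate exponent coefficient quoted in the text


def projected_articles(day: date) -> float:
    """Evaluate 38000 * exp(0.0017 * t), with t in days since 2001-01-01."""
    t = (day - EPOCH).days
    return SCALE * math.exp(RATE_PER_DAY * t)


for day in (date(2006, 12, 31), date(2007, 12, 31)):
    print(day.isoformat(), round(projected_articles(day)))
```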

It has been hypothesized that the growth rate of Wikipedia consists of a constant number of articles per day, submitted by "hard-core" wikipedians, with additional articles submitted by less enthusiastic wikipedians proportional to the current article count of Wikipedia. In this model the growth rate should be a linear function of the size of Wikipedia.
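
Under this hypothesis the model can be written down explicitly. As a sketch, let a be the constant daily contribution of the "hard-core" wikipedians and b the proportionality constant for the remaining contributions (both symbols are introduced here for illustration and do not appear in the original discussion):

\frac{dN}{dt}=a+bN \quad\Longrightarrow\quad N(t)=\left(N(0)+\frac{a}{b}\right)e^{bt}-\frac{a}{b}

For large N the constant term becomes negligible and the growth approaches the pure exponential model with \tau=1/b; for small N the growth is close to linear, at about a articles per day.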

Questions:

  • is this model even remotely valid?
  • how long can exponential growth go on, or is this just really the early part of a logistic curve?
  • what does this imply for server and traffic scaling?

Eventually there will probably be a point where the number of articles created each day begins to fall, due to a lack of things to write about. But it is probable that the amount of information in each article will then begin to increase in lieu of an increase in the number of articles. Limitations of the (current) Wikipedia interface will cause a bottleneck of sorts, limiting the type (and, by extension, the amount) of growth to vertical monolingual growth patterns, as opposed to lateral cross-lingual ones.

Note that from the beginning of December 2005, only registered users can create new pages.

Logistic model for growth in article count of Wikipedia

Note: This was developed in 2007.
Number of articles on en.wikipedia.org and logistic extrapolations to a max of 3, 3.5 and 4 million articles
Article growth per month (6-month average, smoothed at October 2002), with extrapolation to a max of 3, 3.5 and 4 million articles
Percentage growth per month

If Wikipedia's growth followed the exponential growth model, the average rate of growth would be proportional to the size of Wikipedia. The annual growth rate would stay constant, as would the average time taken for the number of articles to double. As can be seen here and in the third graph, this is not the case: the percentage growth is steadily declining.

Maybe Wikipedia's growth follows the logistic growth model better. This model is based on:

  • more content leads to more traffic, which in turn leads to more new content
  • however, more content also leads to less potential content, and hence less new content
  • the limit is the combined expertise of the possible participants.

Some characteristics of this model are:

  • there will be a maximum to the number of articles. On Wikipedia one can hardly imagine this as there will be new events and people to describe in the future. Compared to the large number of existing articles this is a very small effect though.
  • at the end the growth is zero.
  • at the pivot point (halfway to the maximum) the growth is at its peak. For the en.wikipedia this might have been in August 2006, with 60,000 new articles a month.
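
These characteristics can be illustrated with a small Python sketch of a logistic curve. The parameters are illustrative only and are not a fit to the data: the limit of 3.5 million articles and the peak of 60,000 new articles per month are figures mentioned in this section, and the rate constant R is derived from them on the assumption that they belong to the same curve.

```python
import math

# Illustrative values only (not a fit): a limit of 3.5 million articles and a
# peak of 60,000 new articles per month, both taken from figures in the text.
K = 3_500_000.0
PEAK_RATE = 60_000.0
R = 4.0 * PEAK_RATE / K   # the peak rate of a logistic curve is R * K / 4


def logistic(t: float, t_mid: float = 0.0) -> float:
    """Article count t months after (or before) the midpoint t_mid."""
    return K / (1.0 + math.exp(-R * (t - t_mid)))


def growth_rate(n: float) -> float:
    """New articles per month at count n: dN/dt = R * N * (1 - N / K)."""
    return R * n * (1.0 - n / K)


for months in (-24, 0, 24, 120):
    n = logistic(months)
    print(months, round(n), round(growth_rate(n)))
```

The growth rate is largest at the midpoint, is the same at equal times before and after it, and falls to zero as the count approaches the limit K.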

This model is related to the quantity (number of articles). The quality might still increase independently.

A best fit of the logistic model to the statistical data available by the end of 2008 suggested that the growth limit to the number of articles, where on average the creation and deletion of articles are in balance, will be between 3 million and 3.5 million articles, with the 3 million point being reached around March 1, 2010.

However, by July 2009 it was clear that 3 million would be achieved by the end of August 2009 and the plateau would likely be close to 3.5 million articles.

Critique of the logistic model

  • The model seems to imply that the number of articles that the Wikipedia will have is fixed.
  • The rate of creation in the Wikipedia is unlikely to eventually fall to zero as new articles on new topics that arise due to new events and discoveries will still be required. As of June 2010, about half of the new articles created that weren't immediately deleted seem to be articles that couldn't have been created before 2001 when the Wikipedia started. This would seem to imply that 'logistic growth plus linear' might be a better model in the long run.


Quadratic model for article count of Wikipedia

Note: At the end of 2008, WP:Size of Wikipedia#Annual growth rate used a simple model with a reducing rate of new articles to predict when growth would come to an end.
Date | Article count | Increase during preceding year | % increase during preceding year | Average increase per day during preceding year
2009-01-01 | 2,679,000 | 526,000 | 24% | 1437
2018-01-01^ | ~4,759,000 | ~0 | 0% | ~0

^ NOTE: January 2018 is projected from 2009/2008/2007 (adding 60,000 fewer articles each year). The final article count plateau is: 2,679k + (470 + 410 + 350 + 290 + 230 + 170 + 110 + 50)k = ~4,759,000 articles (deleted/merged articles will balance the number of added articles). Assumes the same attitudes about notability, merging & lists.
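
The arithmetic in the note can be reproduced with a short Python sketch. The yearly increase falls linearly, so the cumulative count follows a quadratic curve, which is where the model gets its name. The function name and its parameters are illustrative; the numbers are those given in the note, in thousands of articles.

```python
START_COUNT_K = 2679     # thousands of articles on 2009-01-01 (from the table)
FIRST_INCREASE_K = 470   # projected increase during 2009, in thousands (from the note)
STEP_K = 60              # each year adds this many thousand fewer articles


def project_plateau(start_k, first_increase_k, step_k, first_year=2009):
    """Sum yearly increases that shrink by step_k until they are no longer positive.

    Returns the plateau (in thousands) and the list of (year, increase) pairs.
    """
    total = start_k
    schedule = []
    year, increase = first_year, first_increase_k
    while increase > 0:
        schedule.append((year, increase))
        total += increase
        year += 1
        increase -= step_k
    return total, schedule


plateau_k, schedule = project_plateau(START_COUNT_K, FIRST_INCREASE_K, STEP_K)
for year, increase in schedule:
    print(year, increase)
print("plateau (thousands):", plateau_k)
```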

Extended-growth model

Past & projected monthly growth rate in articles per month.

In 2009, the continued strong growth indicated there was no obvious nearby mid-point in the growth for new articles. Although growth was slowing, it was slowing more gradually, and could be expected to continue beyond another 15 years, creating up to 10 million articles. The predicted date for the 3-million-article mark would be much earlier, in mid-August 2009. The growth was supported by the need for various spin-off articles, such as unseen-hand and lost-world articles, millions of missing red-link articles, plus many thousands of new disambiguation pages needed to connect the other millions of pages. The new projected mid-point might occur in year 2011, although any massive auto-upload of numerous articles could change the schedule, such as a mass, automated effort to auto-generate red-link stubs with sources suggested from search-engine results. The continued strong growth fits the model reaching about 10 million articles, before deletions and merges would offset the increase of new articles being added.


Two-phase exponential model

The growth rate N'(t) of Wikipedia (number of new articles per unit of time) can be accurately modeled by two exponentials, one increasing ("phase 1") and one decreasing ("phase 2"), with a fairly sharp crossover around January 2006. In the following plots, the dots are the observed counts N(t) (cleaned and resampled at equal 28-day "months") and the respective increments N'(t) (new articles per 28-day month). The solid lines are the values of N'(t) and N(t) computed by the model.

Wp-size-irr-2009-11-prd-p0-s0-dz-e0-y0.png
Growth rate N'(t) - linear scale
Wp-size-irr-2009-11-prd-p0-s0-dz-e0-y1.png
Growth rate N'(t) - log scale
Wp-size-irr-2009-11-prd-p0-s0-sz-e0-y0.png
Article count N(t) - linear scale
Wp-size-irr-2009-11-prd-p0-s0-sz-e0-y1.png
Article count N(t) - log scale

Seasonal modulation since 2006

Since 2006, there has also been a strong six-monthly variation in the new article rate, with peaks in February and August. The following plots include this modulating factor:

Wp-size-irr-2009-11-prd-p0-s1-dz-e0-y0.png
Growth rate N'(t) - linear scale
Wp-size-irr-2009-11-prd-p0-s1-dz-e0-y1.png
Growth rate N'(t) - log scale
Wp-size-irr-2009-11-prd-p0-s1-sz-e0-y0.png
Article count N(t) - linear scale
Wp-size-irr-2009-11-prd-p0-s1-sz-e0-y1.png
Article count N(t) - log scale

Implications

Some implications of this model:

  • The slowdown is not a "natural" phenomenon but rather the consequence of some change in Wikipedia policy and/or tools.
  • The "fertility" of Wikipedia's corps of editors (their output of new articles) is shrinking.
  • Wikipedia will stop growing before reaching 6 million articles.

Further info

Here is the text file with the data used to generate these plots. The first column is the time t, specifically the elapsed days since January 1, 2001. Columns 2, 3 and 4 are the year, month and day. Column 5 is the observed article count N(t) on that date (cleaned and resampled). Column 7 is the value of N(t) predicted by the model. Columns 9 and 11 are the observed and predicted growth rates N'(t) in articles per "lunar" month (28 days). There is also a technical report describing the model and the data set.

Gompertz model (2010–)

This model is based on the Gompertz function. The Gompertz function is like a logistic function, but the future value asymptote of the function is approached much more gradually, in contrast to the logistic function in which both asymptotes are approached by the curve symmetrically.

The reasons for this new model are:

  • The growth rate function does not seem to be time-symmetrical, unlike that of the logistic function.
  • The percentage of article growth per month appears to be linear in the logarithmic graphs ((1) and (2)), as it would be for a Gompertz function.

The formula for the Gompertz function for the en.wikipedia is y(t)=ae^{be^{ct}}, with

a = 4378449 (the predicted maximum of about 4.4 million articles)
b = -15.42677
c = -0.384124
t is the time in years since 1/1/2000 (so 1/1/2010 is t=10.00)
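
The formula and parameters above can be evaluated directly. The following Python sketch (function names are illustrative) also computes the relative growth rate y'(t)/y(t) = bc·e^{ct}, whose logarithm is linear in t; this is the property behind the straight line seen in the logarithmic percentage-growth graphs.

```python
import math

A = 4378449.0    # asymptotic article count (parameter a above)
B = -15.42677    # parameter b above
C = -0.384124    # parameter c above, per year


def gompertz(t_years: float) -> float:
    """Article count y(t) = a * exp(b * exp(c * t)), t in years since 1/1/2000."""
    return A * math.exp(B * math.exp(C * t_years))


def relative_growth_per_year(t_years: float) -> float:
    """y'(t) / y(t) = b * c * exp(c * t); its logarithm is linear in t."""
    return B * C * math.exp(C * t_years)


for t in (5.0, 10.0, 15.0, 30.0):
    print(t, round(gompertz(t)), f"{relative_growth_per_year(t):.4f}")
```

For t = 10 (1/1/2010) the formula gives roughly 3.1 million articles, and for large t the count approaches a.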

The expected maximum of the Gompertz model lies between those of the logistic model and the extended-growth model.

Below are 3 Gompertz model graphs, followed by the 3 corresponding graphs of the logistic model, a graph giving a general comparison between the logistic, Gompertz and extended-growth models, and a graph of the top 20 Wikipedias, which in general show the same behavior in percentage of article growth.

EnwikipediagrowthGom.PNG
Article growth on en.wikipedia.org and Gompertz extrapolation
EnwikipediaGom.PNG
Number of articles on en.wikipedia.org and Gompertz extrapolation
EnwikipediapercgrowthGom.PNG
Percentage of article growth per month on en.wikipedia.org and Gompertz extrapolation
Enwikipediagrowth6.PNG Enwikipedialin.PNG Enwikipediapercgrowth.PNG
Same graphs for logistic model with extrapolation to 3, 3.5 and 4 million articles
Enwikipediagrowthcomparison.PNG
Comparison of article growth on en.wikipedia.org and logistic, Gompertz and extended-growth extrapolations
WikigrowthTopPerc.PNG
Percentage of article growth per month of the top Wikipedias

Data set for number of articles

As Erik Zachte's statistics for the English-language Wikipedia have not been updated since October 2006, these are the figures I (HenkvD) use for generating the graphs. The data up to October 2006 were taken from one of Erik's downloads. Since then I have recorded the data manually each month, on the same date (or a day later), using the Special:Statistics page. See also Wikipedia:Size_of_Wikipedia#The_data_set for the official count, which is recorded at irregular intervals.

Other measurements of article growth

Edits per article

The following graph shows the mean number of edits per article, and is intended as a measure of the quality of the articles, assuming that editing improves the content.

Number of edits 01.jpg

The graph is plotted on a logarithmic scale, and these data also fit well with exponential growth starting from October 2002. The number of edits per article has since doubled once every 505 days.

Relationship of Usenet cites to article growth

The relationship of Usenet cites of the word "Wikipedia" to the official article count for the English language Wikipedia appears to show a curve, rather than a linear relationship. (See Wikipedia:Awareness statistics for data). Or does it show a line broken into two parts, one before and one (horizontally shifted) after the Rambot-created articles? If so, this would suggest that the Rambot articles do not stimulate significant comment on Usenet, but that the linear relationship does in fact hold for all other articles. As ever, more data are needed.

Usenet cites vs article count dec 2003.png

Modelling growth of Wikipedia page views per million

Using the Alexa page views per million data from Wikipedia:Awareness statistics (see [1] for a graph) in the period 1 January 2003 to 5 September 2005, filtering out all points less than 28 days away from the previous point (to avoid excessive weighting during time periods where points are densely sampled), and performing a linear least-squares fit of the logarithm of the data, gives the following approximate formula:

log_e(page_views_per_million) = -50 + 5e-08 * unix_epoch_of_date

for n = 21 points fitted

This implies a doubling period of (log_e(2) / 5e-08) / 86400 days, which is approximately 160 days, and an annual growth factor in page views per million of exp(5e-08*365*86400), which is approximately 5.
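
Both figures follow directly from the fitted slope and can be checked with a few lines of Python. This is a minimal sketch; the function names are illustrative, and the slope is the rounded value from the formula above.

```python
import math

SLOPE_PER_SECOND = 5e-08   # fitted slope of log_e(page views per million) vs Unix time
SECONDS_PER_DAY = 86400


def doubling_period_days(slope_per_second: float) -> float:
    """Days needed for exp(slope * t) to double."""
    return math.log(2) / slope_per_second / SECONDS_PER_DAY


def annual_growth_factor(slope_per_second: float) -> float:
    """Multiplicative growth over a 365-day year."""
    return math.exp(slope_per_second * 365 * SECONDS_PER_DAY)


print(f"doubling period: {doubling_period_days(SLOPE_PER_SECOND):.1f} days")
print(f"annual growth factor: {annual_growth_factor(SLOPE_PER_SECOND):.2f}")
```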

Playing around with different time periods and filter times, we get a range of results from which we can reasonably say that Wikipedia's estimated doubling time for page views per million is somewhere in the range 130 - 160 days. The recent (2005) doubling time of 156 days or so is within the range of the longest-term doubling time of about 155 - 159 days, with the 2004 period being the exception to the long-term and short-term trends.

Modelling improvement in Wikipedia's Alexa traffic rank

Applying a similar linear regression fit to the log of Wikipedia's Alexa traffic rank from October 2002 to September 2005 gives a similar result, with a halving period (lower is better for rank) of roughly 134 - 138 days over the long term, and a more recent (2005 data only) halving time of 114 days! Since the current page rank, as of September 2005, is roughly 40, this suggests, if taken to logical extremes, using the most cautious of the three figures and rounding it to 4.5 months, that Wikipedia will:

  • reach page rank 20 in 4.5 months
  • reach page rank 10 in 9 months
  • reach page rank 5 in 13.5 months
  • be fighting its way into the top 3 in 18 months, and
  • be fighting its way to the #1 spot in 22.5 months...

So, clearly this exponential growth has got to stop or slow down, or it's going to be a wild ride...
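
The schedule above is simply repeated halving of the rank. A minimal Python sketch of that extrapolation, using the starting rank of 40 and the rounded 4.5-month halving period from the text (the names are illustrative):

```python
START_RANK = 40.0      # approximate rank in September 2005 (from the text)
HALVING_MONTHS = 4.5   # most cautious halving period, rounded (from the text)


def projected_rank(months_ahead: float) -> float:
    """Rank after months_ahead months if it keeps halving every HALVING_MONTHS."""
    return START_RANK * 0.5 ** (months_ahead / HALVING_MONTHS)


for k in range(1, 6):
    months = k * HALVING_MONTHS
    print(f"{months:4.1f} months: rank {projected_rank(months):.2f}")
```

The last two values fall below 3 and close to 1, which is why the list speaks of the top 3 and the #1 spot.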

November 2005 — the daily page rank is averaging 34 and reached 31 in October.

January 2006 — the daily page rank has been averaging 20 for about a week; in line with the original predictions above.

April 2006 — averaging 16/17 this month, although in March it reached as high as rank 12, the current record.

July 2006 — deviating from predictions; Wikipedia was supposed to have reached rank 10 by now, yet for the whole of June we hovered between 16/18.

September 2006 — Heavily deviating from predictions; by the end of October, Wikipedia was supposed to reach rank 5, yet still only making small gains, hovering between 14/16 now. The climb up the rankings has slowed down - but for now we are still climbing! Wikipedia has broken the "50,000 reach" barrier, meaning we reach as many people as youtube.com and even more than myspace.com!

November 2006 - Alexa weekly rank is now 12 and is still climbing, with occasional daily blips up to 11. Wikipedia made the daily top 10 once, on the 12th!

February 2007 - 18 months after the predictions, I think it's safe to say the model is flawed. We should be ranked 3rd, but the current high is 8, with the average being 10/11. We're still gaining popularity, just not as fast as expected.

May 2008 — Swaying between 7 and 8 for the past few months with 8 being slightly more common. The rise, though slow, continues.

December 2008 — The traffic rank continues to be around 8. No clear trend is evident in the rank, but the number of daily pageviews displays a steady decline since June 2008.

March 2009 — The traffic rank is consistently 7 for more than 6 weeks now, and has not been below 8 for three months. The half-year graph suggests a transition period from October to February for the move from rank 8 to 7. Pageviews have slightly recovered, again reaching July 2008 levels, though still far from those of June 2008.

June 2009 — Fairly consistently 7, with only intermittent falls to 8. Pageviews are fairly steady at around 0.5% of global, with a very slight upward trend evident.

September 2009 - Spending more time at 6, with intermittent returns to 7. Pageviews are about 0.55-0.6% of global with an upward trend still evident.

November 2009 - Mostly at 6, with occasional returns to 7. Pageviews are level at about 0.53-0.6% of global.

April 2011 - currently at 8. However, ComScore results as of January 2010 put all Wikimedia properties collectively at 5: see http://meta.wikimedia.org/wiki/User:Stu/comScore_data_on_Wikimedia

Growth of Wikipedia network

In the context of complex network theory, there have been a number of efforts to model the growth of the Wikipedia network, in which the nodes represent the articles and the links are the hyperlinks between articles.[2][3] Models of this type are based on simple local probabilistic rules, which should reproduce the distributions of Wikipedia's statistical variables. Analyses show that the distribution of the number of hyperlinks pointing to a given article has a very stable power-law exponent for a number of Wikipedias in different languages. It was also confirmed that the reciprocity (the ratio of the number of hyperlinks connecting two articles in both directions to the total number of hyperlinks) is very stable across a number of different Wikipedias.

References

  1. ^ Data from en:Wikipedia:Database download
  2. ^ Zlatić, Vinko; Štefančić, Hrvoje (2009), Model of Wikipedia growth based on information exchange via reciprocal arcs, arXiv:0902.3548 
  3. ^ Capocci, A.; Servedio, V. D.; Colaiori, F.; Buriol, L. S.; Donato, D.; Leonardi, S.; Caldarelli, G. (2006), "Preferential attachment in the growth of social networks: the internet encyclopedia Wikipedia", Physical Review E 74 (3): 036116, doi:10.1103/PhysRevE.74.036116 
