Bloc-notes de Sylvain

Wikimedia metrics API

On 16 November 2015, the Analytics team of the Wikimedia Foundation officially announced the long-awaited Pageview API. Until then, if you wanted figures about Wikipedia article views, you either had to download huge data files, or use http://stats.grok.se/, the external tool suggested by many Wikipedia language versions for pageview statistics. http://www.wikifamo.us was using the monthly dump files, which meant we were:

  1. obliged to process big files each month. Wikifamo.us is a small tool; processing files with millions of entries each month when you really need only thousands of them is not very efficient.
  2. limited to articles above a certain number of visits each month (pages with fewer than 5 visits were excluded).

With the release of this new API, I decided to make the move and reduce the monthly processing time by switching to the API. The use case of wikifamo.us is one that is not directly handled by the API at the moment (but that might change). Our base ID is the Q number from http://www.wikidata.org, which allows us to get all the articles about a subject in the various languages. At the moment, the API can only be queried for the daily views of one article at a time (one call for fr.wikipedia.org/wiki/Londres, one call for en.wikipedia.org/wiki/London, and so on). As most articles are available in dozens of languages, this generates a lot of calls for each subject and can be complicated for our small server to handle (in particular in terms of concurrency, when a user wants to make a comparison while there is already a long queue on our server).
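As an illustration, here is a minimal TypeScript sketch of the two calls involved: resolving a Q number into per-language article titles through the Wikidata wbgetentities action API, then fetching the daily views of one article from the Pageview API. The function names and the sitelink filtering are my own; the endpoint URLs are the public ones documented by Wikimedia.

```typescript
// Minimal sketch, assuming fetch is available (browser or Node 18+).

// Resolve a Wikidata Q number to its Wikipedia article title per language,
// using the public wbgetentities action API (origin=* enables anonymous CORS).
async function getSitelinks(qid: string): Promise<Map<string, string>> {
  const url =
    `https://www.wikidata.org/w/api.php?action=wbgetentities&ids=${qid}` +
    `&props=sitelinks&format=json&origin=*`;
  const data = await (await fetch(url)).json();
  const links = new Map<string, string>();
  const sitelinks = data.entities[qid].sitelinks;
  for (const site of Object.keys(sitelinks)) {
    // Site keys look like "enwiki", "frwiki"; this crude filter keeps
    // Wikipedia links only and maps them to Pageview API project names.
    if (site.endsWith("wiki") && site !== "commonswiki" && site !== "specieswiki") {
      links.set(`${site.slice(0, -4)}.wikipedia`, sitelinks[site].title);
    }
  }
  return links;
}

// Fetch daily view counts for one article on one project from the
// Pageview API (one HTTP call per article, as described above).
async function getDailyViews(
  project: string, title: string, start: string, end: string,
): Promise<number[]> {
  const article = encodeURIComponent(title.replace(/ /g, "_"));
  const url =
    `https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/` +
    `${project}/all-access/all-agents/${article}/daily/${start}/${end}`;
  const data = await (await fetch(url)).json();
  return data.items.map((item: { views: number }) => item.views);
}
```

For the London example above (Q84 on Wikidata), getSitelinks("Q84") would return titles such as en.wikipedia → London and fr.wikipedia → Londres, each of which then needs its own getDailyViews call.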

So the workflow that has been developed delegates the API requests to the visitor's browser. But as we want the user to be able to share the result of the comparison, we need to be able to easily cache the stats. So what we are doing is:

  • the user creates a comparison on wikifamo.us and, for each of the subjects involved in the comparison, the browser requests from our server a list of 10 articles for which we need the number of views.
  • the browser queries the API for each article; once it has the data for those first 10 articles, it sends the values back to our server, which saves them in a local database before returning 10 more articles for which statistics are still needed, and so on.

After a few iterations, everything is available in the database and can easily be reused later. A sketch of this browser-side loop is shown below.
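This sketch reuses getDailyViews from the earlier example. The two server routes (/needed-articles and /save-views) are hypothetical names standing in for whatever endpoints wikifamo.us actually exposes; the batch size of 10 follows the description above.

```typescript
interface NeededArticle { project: string; title: string; }

// Browser-side loop: keep asking our server for the next batch of 10
// articles still missing statistics, fetch their views from the Pageview
// API, and post the values back so the server can cache them.
async function fillStats(subjectQid: string, start: string, end: string): Promise<void> {
  while (true) {
    // Hypothetical route: returns up to 10 articles still lacking stats.
    const resp = await fetch(`/needed-articles?subject=${subjectQid}`);
    const batch: NeededArticle[] = await resp.json();
    if (batch.length === 0) break; // server has everything cached

    // One Pageview API call per article, issued in parallel.
    const results = await Promise.all(batch.map(async (a) => ({
      project: a.project,
      title: a.title,
      views: await getDailyViews(a.project, a.title, start, end),
    })));

    // Hypothetical route: save this batch in the server's local database;
    // the next iteration will be handed 10 more articles.
    await fetch("/save-views", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(results),
    });
  }
}
```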

This new API is now live on the server, and it has another major advantage: it is more precise than the method used before. For articles that don't receive a lot of visits, it is now possible to get stats even when there are fewer than 5 visits a month.