In the pipeline

Over the years, the toolserver, mostly organised and financed by Wikimedia Deutschland, has become an invaluable asset to many Wikimedia projects. However, the open access policy of the toolserver, which enabled many volunteers to write helpful tools, comes at a price. Most tools are “stand-alone”; they perform a single function, and the results are usually a “dead end”, in the sense that they cannot be used by another tool. That is despite the fact that many tools output, in essence, page lists, with or without additional metadata.

The UNIX philosophy states that a program should perform only a single function, but do that very well; multiple programs can then be “chained” into a pipeline, where the output of one program becomes the input of the next one. The toolserver has single-purpose tools aplenty, but so far lacks the ability to pipeline them.

Here, I present my pipeline editor, an attempt to bring the pipeline part to the toolserver. It uses my asset management tools, which store assets such as lists of (annotated) pages on Wikimedia projects, to chain tools together. The individual tools are run asynchronously on the toolserver, in the “background”; the status of the currently running tool is updated every few seconds. A full pipeline consists of three parts:

  1. An initial step to generate input data. Currently, one can either paste a list of pages as text, or run my CatScan rewrite to generate such a list from category trees. You can also use an existing asset ID, which allows for computationally “cheap” experimentation on an existing dataset.
  2. Any number of intermediate steps. Currently, I only offer a filter based on links and page sizes, but there could be any number of filters or “annotators”: Filter for pages without images? Or those that were last edited by a human over a year ago? Or “annotate” the pages with potential free images from Commons or Flickr?
  3. The end or viewing stage. At the moment, this only shows a link to view the resulting asset (annotated page list), but there could be other viewing interfaces, or even Wiki(m|p)edia-editing bots.
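
To make the three stages concrete, here is a minimal sketch in Python. It is not the actual toolserver code: the `Page` fields, the `title|size|links` input format, and all function names are assumptions made up for illustration. It shows an initial step that parses a pasted list, one intermediate filter on page size and link count, and a plain-text viewing stage.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Page:
    """One entry of an annotated page list (hypothetical layout)."""
    title: str
    size: int   # page size in bytes
    links: int  # number of links on the page
    annotations: Dict[str, str] = field(default_factory=dict)


# An intermediate step takes a page list and returns a page list.
Step = Callable[[List[Page]], List[Page]]


def from_text(text: str) -> List[Page]:
    """Initial step: parse pasted lines of the form 'title|size|links'."""
    pages = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines in the pasted text
        title, size, links = line.split("|")
        pages.append(Page(title.strip(), int(size), int(links)))
    return pages


def size_and_link_filter(min_size: int, min_links: int) -> Step:
    """Intermediate step: keep pages that meet both thresholds."""
    def step(pages: List[Page]) -> List[Page]:
        return [p for p in pages if p.size >= min_size and p.links >= min_links]
    return step


def run_pipeline(pages: List[Page], steps: List[Step]) -> List[Page]:
    """Feed the output of each step into the next one, in order."""
    for step in steps:
        pages = step(pages)
    return pages


def view(pages: List[Page]) -> str:
    """Viewing stage: render the resulting list as plain text."""
    return "\n".join(f"{p.title} ({p.size} bytes, {p.links} links)" for p in pages)


pasted = "Berlin|52000|340\nStub article|800|3\nHamburg|41000|12"
result = run_pipeline(from_text(pasted), [size_and_link_filter(1000, 10)])
print(view(result))
```

Because every intermediate step has the same shape (page list in, page list out), further filters or annotators can be appended to the `steps` list without touching the rest of the pipeline.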

While the current selection of tools for the pipeline is rather sparse, basically any tool on the toolserver that processes or outputs lists of wiki pages can be integrated, either by altering the tool itself to work with the asset system, or by using a bespoke “wrapper” tool. Also, saving and re-using pipelines is something I plan to add soon. I am looking forward to your feedback and tool suggestions!
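
The “wrapper” idea can be sketched roughly as follows, again in Python. Everything here is hypothetical: `legacy_tool` stands in for an existing stand-alone tool that emits one page title per line, and the dictionary returned by `wrap_as_asset` is an invented layout, not the real asset format. The point is only that the wrapper converts a tool's “dead end” output into a structure the next step can consume.

```python
import json
from typing import Callable, Dict


def legacy_tool(category: str) -> str:
    """Stand-in for an existing tool: returns one page title per line."""
    return "Berlin\nHamburg\n\nMunich\n"


def wrap_as_asset(tool: Callable[[str], str], argument: str, source_name: str) -> Dict:
    """Run the tool and convert its text output into an annotated page list.

    The returned dictionary layout is an assumption for illustration only.
    """
    raw = tool(argument)
    # Drop blank lines and surrounding whitespace from the tool output.
    titles = [line.strip() for line in raw.splitlines() if line.strip()]
    return {
        "source": source_name,
        "pages": [{"title": t, "annotations": {}} for t in titles],
    }


asset = wrap_as_asset(legacy_tool, "Cities in Germany", "legacy_tool")
print(json.dumps(asset, indent=2))
```

The wrapped tool itself stays unchanged, which is the main appeal of this route compared to altering each tool to speak the asset format directly.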