More topics

After my recent blog post about the TopicMatcher tool, I had quite a few conversations about the general area of “main topic”, especially relating to the plethora of scientific publications represented on Wikidata. Here’s a round-up of related things I did since:

As a first attempt, I queried all subspecies items from Wikidata, searched for scientific publications, and added them to TopicMatcher.

That worked reasonably well, but didn’t yield a lot of results, and they need to be human-confirmed. So I came at the problem the other way: start with a scientific publication, try to find a taxon (species etc.) name, and then add the “main subject” match. Luckily, many such publications put taxon names in () in the title. Once I have the text in between, I can query P225 for an exact match (excluding cases where there is more than one!), and then add “main subject” directly to the paper item, without a user having to confirm it. I am aware that this will cause a few wrong matches, but I imagine those are few and far between, can be easily corrected when found, and are dwarfed by the usefulness of having publications annotated this way.
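To make the matching rule concrete, here is a minimal Python sketch of the idea, not the actual code behind the bot. The function names are made up for illustration, the regular expression assumes simple non-nested brackets, and the item IDs in the usage note are placeholders. Running the query against the Wikidata query service and writing the “main subject” statement are left out.

```python
import re

# Assumption: a candidate is whatever sits inside one pair of round brackets.
PAREN_RE = re.compile(r"\(([^()]+)\)")


def parenthesized_candidates(title):
    """Return the trimmed text of every (...) group in a title."""
    return [m.strip() for m in PAREN_RE.findall(title) if m.strip()]


def build_p225_query(name):
    """Build a SPARQL query for items whose taxon name (P225) equals `name`.

    LIMIT 2 is enough to tell "exactly one" from "more than one".
    """
    escaped = name.replace("\\", "\\\\").replace('"', '\\"')
    return 'SELECT ?item WHERE { ?item wdt:P225 "%s" } LIMIT 2' % escaped


def unique_match(item_ids):
    """Return the single item ID, or None for zero or several matches."""
    return item_ids[0] if len(item_ids) == 1 else None
```

For a title like "Notes on the house mouse (Mus musculus) in Europe", `parenthesized_candidates` returns `["Mus musculus"]`. If the query comes back with two items, `unique_match` returns `None` and the paper is left alone.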

There are millions of publications to check, so this is running on a cronjob, slowly going through all the scientific publications on Wikidata. I find quite a few topics in () that are not taxa, or have some issue with the taxon name; I am recording those, to run some analysis (and maybe other, advanced auto-matching) at a later date. So far, I see mostly disease names, which seem to be precise enough to match, in many cases.

Someone suggested using Mix’n’match sets to find e.g. chemical substances in titles that way, but this requires both “common name” and ID to be present in the title for a sufficient degree of reliability, which is rarely the case. Some edits have been made for E numbers, though. I have since started a similar mechanism running directly off Wikidata (initial results).

Then, I discovered some special cases of publications that lend themselves to automated matching, especially obituaries, as they often contain name, birth, and death date of a person, which is precise enough to automatically set the “main subject” property. For cases where there is no match found, I add them to TopicMatcher, for manual resolution.
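As an illustration of why obituaries are precise enough, here is a small Python sketch. It assumes a title shape like "Obituary: Name (1901-1980)", which is only one of many forms real titles take. The function names and the candidate record layout (`id`, `name`, `born`, `died`) are hypothetical, not what the actual tool uses.

```python
import re

# Assumed title shape: "Obituary: Name (1901-1980)"; real titles vary widely.
OBIT_RE = re.compile(
    r"^(?:obituary[:.]?\s+)?(?P<name>[^()]+?)\s*"
    r"\(\s*(?P<born>\d{4})\s*[-\u2013]\s*(?P<died>\d{4})\s*\)",
    re.IGNORECASE,
)


def parse_obituary_title(title):
    """Extract name, birth year and death year, or return None."""
    m = OBIT_RE.search(title.strip())
    if m is None:
        return None
    return {
        "name": m.group("name").strip(),
        "born": int(m.group("born")),
        "died": int(m.group("died")),
    }


def match_person(parsed, candidates):
    """Return the ID of the only candidate agreeing on name and both years.

    `candidates` is a list of dicts with keys id, name, born, died
    (a hypothetical layout). None means "no safe match".
    """
    hits = [
        c["id"]
        for c in candidates
        if c["name"].casefold() == parsed["name"].casefold()
        and c["born"] == parsed["born"]
        and c["died"] == parsed["died"]
    ]
    return hits[0] if len(hits) == 1 else None
```

A `None` from `match_person` corresponds to the case where the publication goes to TopicMatcher for manual resolution instead.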

I have also added “instance of:erratum” to ~8,000 papers indicating this from the title. This might be better placed in “genre”, but at least we have a better handle on those now.
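A title check of this kind could be as simple as the following Python sketch. The keyword list is my assumption for illustration and covers only "erratum" and "errata"; the real rule may differ.

```python
import re

# Assumption: only these two keywords, matched as whole words, any case.
ERRATUM_RE = re.compile(r"\berrat(?:um|a)\b", re.IGNORECASE)


def looks_like_erratum(title):
    """Return True if the title contains 'erratum' or 'errata' as a word."""
    return ERRATUM_RE.search(title) is not None
```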

Both errata and obituaries will run regularly, to update any new publications accordingly.

As always, I am happy to get more ideas to deal with this vast realm of publications versus topics.