symac

OpenRefine - detect language

In order to load references into HAL, we needed to detect the language of thousands of references. We decided to automate that process with a dedicated library. Because our source data were in a CSV file, and because working with tabular data made checking the results easier, we decided to go with OpenRefine. Below are some notes on what we did to get an extra column containing the language of each reference, detected from its title. The detection is not 100% successful, but the overall quality is quite satisfying.

There is an issue on the OpenRefine GitHub repository to add a language detection function to the core (Issue #642); it has been open since 2012. Once it is resolved, the instructions below won't be needed, but for the time being (as of April 2020 and OpenRefine 3.3) this is the best solution I have found.

These instructions have been tested on Ubuntu 18.04.

The first step is to install Jython (see the instructions on the OpenRefine wiki): download the installer from https://www.jython.org/download, then run:

java -jar jython-installer-2.7.1.jar

Then we need to install the langdetect Python library:

~/jython2.7.1/bin/pip install langdetect
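
Before opening OpenRefine, it can be worth confirming that the library is importable from the Jython install. The following is only a sketch: the helper name `can_import` is made up for this note, and the path assumes the default install location used above. It could be saved to a file and run with ~/jython2.7.1/bin/jython.

```python
import importlib
import os
import sys


def can_import(module_name, site_packages):
    """Return True if module_name can be imported once site_packages is on sys.path."""
    # Only add the directory if it exists and is not already on the path.
    if os.path.isdir(site_packages) and site_packages not in sys.path:
        sys.path.append(site_packages)
    try:
        importlib.import_module(module_name)
        return True
    except ImportError:
        return False


if __name__ == "__main__":
    # Assumed location: the default Jython 2.7.1 install directory in the home folder.
    path = os.path.expanduser("~/jython2.7.1/Lib/site-packages")
    print(can_import("langdetect", path))
```

If this prints False, the pip install most likely went to a different Python installation than the one Jython uses.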

From there, in OpenRefine, we add a column based on the title column, select Jython as the expression language, and enter the following script:

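import sys
# Make the packages installed with Jython's pip visible to OpenRefine;
# replace "username" with your own account name.
sys.path.append('/home/username/jython2.7.1/Lib/site-packages')

from langdetect import detect

# "value" is the content of the title cell; detect() returns a language code such as "en" or "fr".
return detect(value)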
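
Part of the imperfect detection rate comes from cells that langdetect cannot handle at all: blank cells may arrive as None, and langdetect raises an exception when a text has no usable features (for example a title made only of digits). Below is a minimal sketch of a more defensive wrapper. The function name `detect_language` and the `fallback` parameter are made up for this note, and the detector is passed in as an argument so that the function can be tried outside OpenRefine too.

```python
def detect_language(value, detector, fallback=None):
    """Return the language code for value, or fallback when detection is not possible."""
    # Skip blank cells and whitespace-only titles.
    if value is None or not value.strip():
        return fallback
    try:
        return detector(value)
    except Exception:
        # langdetect raises LangDetectException here; the broad catch keeps
        # this sketch usable without importing that class.
        return fallback


# In the OpenRefine expression box, after the sys.path.append(...) line,
# the script would then end with something like:
#
#   from langdetect import detect, DetectorFactory
#   DetectorFactory.seed = 0   # langdetect is non-deterministic unless seeded
#   return detect_language(value, detect, "unknown")
```

Setting a fallback such as "unknown" makes the rows that need a manual check easy to find with a text facet on the new column.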