A Peek at the ORCID Registry Public Data File

Proper authorship attribution of intellectual works is an ongoing and challenging problem for bibliographic services and the academic community alike. Often, an author’s signature in a publication is nothing more than a name and an affiliation, which might not suffice to confirm their identity beyond question, particularly when the name is a common one. A potential approach to this problem is the creation of a central registry of unique person identifiers, together with a way of linking the authors’ profiles to their scientific output. One such initiative is the Open Researcher and Contributor ID (ORCID) project, which was launched in October 2012 as an open, non-profit, community-driven effort.

Release of ORCID’s public data file

One year after the launch of the ORCID Registry, a first Public Data File containing a snapshot of the database has been released. The data can be downloaded from the ORCID site, and at The Data Science Lab we of course want to have a peek at it. The compressed file contains the public information associated with each user’s ORCID record. This is an example of an ORCID profile. Each person’s record is included as a separate file in both JSON and XML. For this table-top exploratory experiment we will be using the entries in the JSON format. To avoid performing too many I/O operations, we first concatenate all JSON files into one (this takes a little while, but it pays off later):

for f in /json/*.json; do (cat "${f}"; echo) >> orcid.json; done
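Reading the combined file line by line only works if every record occupies exactly one line. As a minimal sketch of the same concatenation in Python, the function below re-serializes each record onto a single line; the name `concat_json_files` and its arguments are hypothetical and not part of the original workflow.

```python
import glob
import json
import os


def concat_json_files(src_dir, out_path):
    """Write every *.json file in src_dir as one line of out_path."""
    count = 0
    with open(out_path, 'w', encoding='utf-8') as out:
        for path in sorted(glob.glob(os.path.join(src_dir, '*.json'))):
            with open(path, encoding='utf-8') as infile:
                record = json.load(infile)
            # json.dumps emits no raw newlines, so each record stays on one line
            out.write(json.dumps(record) + '\n')
            count += 1
    return count
```

The return value is the number of records written, which gives a quick sanity check against the number of files in the archive.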

Using python pandas to manage the ORCID dataset

We load the records into pandas simply by reading this file line by line and storing certain fields in a dictionary class created for this purpose. For our exploratory analysis we are interested in the timestamp of the profile creation, the researcher’s name and the number of publications listed in their profile, in addition to the unique alphanumeric ORCID identifier. A pandas data frame is then easily created from the dictionary.

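import pandas as pd

columns = ['orcid_id', 'family_name', 'first_name', 'num_pubs', 'date_created']
orcidDict = OrcidDict(columns)  # helper dictionary class; its definition is not shown in this post
with open('/path/to/orcid/orcid.json') as infile:
    for line in infile:
        ...  # loop body not shown: parse the line and store the selected fields in orcidDict
df = pd.DataFrame.from_dict(orcidDict.printItems())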
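Since the `OrcidDict` helper is not shown, here is a minimal sketch of what extracting those fields from one record could look like. Only the `orcid-profile`, `orcid-history` and `submission-date` keys appear in this post; every other key name below, as well as the functions `dig` and `extract_fields`, is an assumption about the JSON layout and should be checked against the actual files.

```python
import json


def dig(obj, *keys):
    """Follow nested dict keys, returning None as soon as one is missing."""
    for key in keys:
        if not isinstance(obj, dict):
            return None
        obj = obj.get(key)
    return obj


def extract_fields(line):
    """Turn one JSON line into a flat dict with the columns used in this post."""
    profile = json.loads(line).get('orcid-profile', {})
    # Assumed key names (not confirmed by the post), except 'orcid-history'
    details = dig(profile, 'orcid-bio', 'personal-details') or {}
    works = dig(profile, 'orcid-activities', 'orcid-works', 'orcid-work') or []
    return {
        'orcid_id': dig(profile, 'orcid-identifier', 'path'),
        'family_name': dig(details, 'family-name', 'value'),
        'first_name': dig(details, 'given-names', 'value'),
        'num_pubs': len(works),
        # Raw timestamp; formatting it is a separate step
        'date_created': dig(profile, 'orcid-history', 'submission-date', 'value'),
    }
```

A list of such dicts, one per line of the file, can be passed directly to `pd.DataFrame(rows, columns=columns)`.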

When reading timestamps from the json data, we drop the time of the day and keep only year, month, day:

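def format_json_timestamp(jsonTimestamp, myFormat):
    ...  # function body not shown in the original post

# For each record, where data is the parsed JSON of one profile:
history = data['orcid-profile']['orcid-history']
dateFormat = '%Y-%m-%d'
date_created = format_json_timestamp(history['submission-date']['value'], dateFormat)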
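The body of `format_json_timestamp` is not shown. One plausible version is sketched here under the assumption that the JSON value is a Unix timestamp in milliseconds; the post does not state the unit, so this should be verified against a known profile.

```python
from datetime import datetime, timezone


def format_json_timestamp(jsonTimestamp, myFormat):
    """Format a millisecond Unix timestamp as a date string in UTC."""
    # Assumption: the value counts milliseconds since 1970-01-01 UTC
    seconds = int(jsonTimestamp) / 1000.0
    return datetime.fromtimestamp(seconds, tz=timezone.utc).strftime(myFormat)
```

Using an explicit UTC time zone keeps the resulting date independent of the machine the script runs on.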

The ORCID project has grown at a steady rate over the last year. The data contained in the public file amounts to 361,209 unique profiles (compare with their latest, live statistics). A simple grouping by profile creation month shows this evolution:

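df['month_created'] = df.date_created.map(lambda t: t[:-3])  # 'YYYY-MM-DD' -> 'YYYY-MM'
# Count the unique profiles created in each month
gCrMon = df.drop_duplicates().groupby('month_created').size().to_frame('counts_month')
# pd.to_datetime replaces Index.to_datetime(), which recent pandas versions no longer provide
gCrMon.index = pd.to_datetime(gCrMon.index)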


If we group by creation day rather than month, we can investigate how the rate of profile creation evolved during the year.
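A minimal sketch of the daily grouping, assuming `df` has a `date_created` column in `YYYY-MM-DD` form as above. The toy data and the `is_weekend` column are illustrative only.

```python
import pandas as pd

# Toy stand-in for the real data frame (illustrative values only)
df = pd.DataFrame({
    'orcid_id': ['a', 'b', 'c', 'd'],
    'date_created': ['2012-11-02', '2012-11-03', '2012-11-03', '2012-11-05'],
})

# Number of profiles created on each day, indexed by a DatetimeIndex
daily = df.groupby('date_created').size().to_frame('counts_day')
daily.index = pd.to_datetime(daily.index)

# Fill days without registrations with zero so gaps are not hidden in a plot
daily = daily.asfreq('D', fill_value=0)

# Flag Saturdays (5) and Sundays (6) so weekends can be shaded
daily['is_weekend'] = daily.index.dayofweek >= 5
```

Calling `daily['counts_day'].plot()` would then give the daily registration curve, with `is_weekend` available for shading.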



In both plots, corresponding to mid-October 2012 and mid-September 2013 respectively, we observe the expected weekly ups and downs, indicating less activity during the weekends. While ORCID was gaining new users at a rate of a few hundred per day a year ago, about 2,000 new profiles are now registered daily. Weekends are shaded in the plot corresponding to October 2012, where a peak of unusually high activity on November 3rd 2012 is visible. This is most likely related to the Altmetrics workshop and hackathon held in San Francisco around that date, which some of the ORCID promoters attended.


If the ORCID project is to succeed at a global scale, we should expect that researchers not only reserve their ID in the system, but also use it as a repository for their publications. It is thus important that not only the researcher’s name be assigned an ORCID identifier, but that the corresponding profile be completed with biographical and bibliographic information. The plot on the left shows the proportion of ORCID profiles that have uploaded a list of publications so far, which amounts to slightly under 20% of all registered researchers. This is despite the fact that the procedure to upload a list of bibliographic records from Scopus is tremendously straightforward. Moving forward, this is an area that the ORCID promoters might want to emphasize.

Update 09.12.13: As Amir Aryani correctly points out, the figure of 20% of researchers with uploaded publications might be underestimated because of the ORCID privacy settings (it is possible to limit the visibility of parts of the profile, including publications). We can roughly estimate how many of the 80% of profiles that don’t show any publications do in fact have some that are simply private: according to the ORCID statistics at the project page, 24% of the current live ORCID IDs have at least one work. Compared with the just under 20% visible in the public file, this suggests that only a few percent of all profiles hold private works. This supports our claim that the majority of researchers at ORCID have not yet uploaded the list of their bibliographic records.
End update 09.12.13

Gender distribution of the researchers registered in ORCID

A sociologically important question we may ask in our little exploratory experiment concerns the gender distribution of the researchers registered in ORCID. Generally speaking, assigning gender based on given names is not a trivial task, since many names are gender neutral or culture specific. We can, however, attempt to obtain an approximate result.

For this task, we resort to a Python package that uses information from about 40,000 first names that have been manually curated and labeled by native speakers. Names that cannot be given a definite gender are classified as unknown.

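# Assumption: the package is SexMachine, whose Detector accepts unknown_value; the post does not name it
import sexmachine.detector as gender

d = gender.Detector(unknown_value=u'unknown')
names = ['Jill', 'Martin', 'Chris', 'Leslie', 'Wei']
print('\n'.join(['\t'.join([name, d.get_gender(name)]) for name in names]))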

Jill female
Martin male
Chris mostly_male
Leslie mostly_female
Wei unknown

The gender identifier is then applied to the first names of all available ORCID records. Barely 53% of all first names in the ORCID registry are unambiguously labeled male or female by this procedure. Of those, roughly 30% are female. The plot below shows the histogram of the number of publications, broken down by gender, for those profiles that were identified as either male or female. Very interestingly, we observe that the percentage of female authors diminishes with increasing number of publications. Also noteworthy is the anomalous bump at 20 publications, which might be some kind of technical artifact in the submission of the bibliographic data to the ORCID system. Other than that, the distribution shows exponential decay, corresponding to the fact that prolific authors (of both genders) are rare.
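The breakdown behind that observation could be computed along these lines. This is a sketch on toy data; the `gender` column is hypothetical and is assumed to hold the detector's label for each `first_name`.

```python
import pandas as pd

# Toy data; 'gender' is a hypothetical column holding the detector's labels
df = pd.DataFrame({
    'num_pubs': [1, 1, 1, 1, 5, 5, 5, 20],
    'gender': ['female', 'male', 'female', 'unknown',
               'male', 'male', 'female', 'mostly_male'],
})

# Keep only the profiles with an unambiguous label
labeled = df[df.gender.isin(['male', 'female'])]

# Profiles per (publication count, gender); missing combinations become 0
counts = labeled.groupby(['num_pubs', 'gender']).size().unstack(fill_value=0)

# Share of female-labeled profiles at each publication count
female_share = counts['female'] / counts.sum(axis=1)
```

Plotting `female_share` against `num_pubs` makes the trend easier to read than comparing two overlaid histograms, although bins with very few profiles will be noisy.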


Our exploratory analysis of the public data from the ORCID service shows that the initiative is growing its user base, and that the growth has so far been sustained. The relatively small number of profiles with accompanying publications shows that the majority of users are not uploading the list of their publications to the system. This is somewhat unfortunate, but not catastrophic, provided that a reliable mechanism is implemented to link the DOIs of those publications to the ORCIDs of their authors. Finally, a naive analysis of the gender of the given names behind the ORCID profiles provides a first look at the demographics of the researchers.

Table-top data experiment take-away message

Exploratory analysis is always a first step towards the understanding of any data science challenge. Exploration should serve to assess the quality and completeness of the data, and to inform further decisions in the process of devising more complicated data analysis techniques. At this stage, exploratory plots are very useful, for they can help formulate first hypotheses as well as clarify the task at hand.