Debian Conference 2025, Brest, Jul 14th - 19th

Debian in the Research Software Ecosystem:

An Exploratory Bibliometric Analysis

 
Joenio M. Costa and Christina von Flach
Institute of Computing, Federal University of Bahia

Joenio Marques da Costa

http://joenio.me

Institute of Computing, Federal University of Bahia (UFBA)

  • PhD candidate in Software Engineering
  • Work at Cortext Platform (www.cortext.net)
  • Debian Developer
  • Member of the SEED.br research group

Christina von Flach

https://christinaflach.github.io

Institute of Computing, Federal University of Bahia (UFBA)

  • Ph.D. in Computer Science (2004)
  • Professor at the UFBA Institute of Computing (1990-today)
  • Head of the SEED.br research group

Research Software is software that is designed and developed to support research activities.

Source: Morane Gruenpeter et al. 2021.
Defining Research Software: A controversial discussion.

Research Software Ecosystem

Source: James Howison and James D. Herbsleb. 2011.
Scientific software production: incentives and collaboration.

Bibliometric Analysis is a method for exploring and analyzing large volumes of scientific data.

Source: Naveen Donthu et al. 2021.
How to conduct a bibliometric analysis: An overview and guidelines.

Bibliometric Analysis data

Research Strategy

  • Search and collect data from Scopus
  • Filter data by inclusion and exclusion criteria
  • Explore the data with a bibliometric analysis

Data search and collection

Due to the exploratory nature of our work, we only collected data from one bibliographic research database, Scopus.

  • Results: 473 documents

Data filter and screening

Name     | Inclusion criteria | Exclusion criteria
Language | Is English         | Isn’t English
Year     | Between 1993-2024  | Isn’t between 1993-2024
Author   | Isn’t empty        | Is empty
Title    | Matches “debian”   | Doesn’t match “debian”
Abstract | Matches “debian”   | Doesn’t match “debian”

In this step 53 documents were removed in a semi-automated procedure.

  • Results: 420 documents
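The screening criteria above can be sketched as a small filter. This is an illustrative sketch, not the procedure actually used in the study: the record layout and field names are hypothetical (they are not the Scopus export schema), and the title and abstract criteria are treated as "either one matches", which is an assumption based on the included example matching only in its abstract.

```python
import re

# Hypothetical records; field names are assumptions, not the Scopus schema.
RECORDS = [
    {
        "title": "Including Diagnostic Information in Configuration Models",
        "abstract": "... the Debian GNU/Linux system is formalized ...",
        "authors": ["Syrjänen T."],
        "year": 2000,
        "language": "English",
    },
    {
        "title": "La perspectiva profesional en la reforma de la atención primaria de salud",
        "abstract": "... enfermedades prevalentes debían venir elaborados ...",
        "authors": ["Rodríguez R."],
        "year": 1995,
        "language": "English",
    },
]

# Strict, case-insensitive match: the accented "debían" does not match.
DEBIAN = re.compile(r"debian", re.IGNORECASE)


def is_included(record):
    """Apply the inclusion criteria from the table to one record."""
    mentions_debian = bool(
        DEBIAN.search(record["title"]) or DEBIAN.search(record["abstract"])
    )
    return (
        record["language"] == "English"
        and 1993 <= record["year"] <= 2024
        and bool(record["authors"])
        and mentions_debian
    )


included = [r for r in RECORDS if is_included(r)]
print(len(included), "of", len(RECORDS), "records kept")
```

A strict match like this one rejects the Spanish word "debían", which an accent-insensitive search could return as a hit for "debian".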

Paper excluded from our dataset:

Rodríguez R. et al. 1995.
La perspectiva profesional en la reforma de la atención primaria de salud: una aproximación cualitativa.

Abstract: “La mayor parte de los profesionales opinaban que los programas de enfermedades prevalentes debían venir elaborados de forma vertical…”

Paper metadata:
Type = Article, Language = English
Source = Gaceta Sanitaria

Paper included in our dataset:

Ahsan Ullah et al. 2024.
A Comparative Study on Vulnerabilities, Challenges, and Security Measures in Wireless Network Security.

Abstract: “Dataset, generated using Tpot within Debian operating system, serves as the cornerstone of this research, with…”

Paper metadata:
Type = Article, Language = English,
Source = Lecture Notes in Networks and Systems

Publication count per year

What is the annual number of publications?
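A minimal sketch of how the annual count could be computed from a list of publication years. The function name is hypothetical, and the sample years are only the ten years listed in the top-cited table of this talk, not the full 420-document dataset.

```python
from collections import Counter


def publications_per_year(years, first=1993, last=2024):
    """Count publications per year, keeping years with zero publications."""
    counts = Counter(years)
    return {year: counts.get(year, 0) for year in range(first, last + 1)}


# Illustrative input: publication years of the ten top-cited papers.
sample_years = [2010, 2016, 2013, 2014, 2016, 2012, 2006, 2016, 2011, 2014]
per_year = publications_per_year(sample_years)

for year, count in per_year.items():
    if count:
        print(year, count)
```

Keeping the zero-count years makes gaps visible when the series is plotted over the whole 1993-2024 window.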

What is the oldest publication found?

Tommi Syrjänen. 2000.
Including Diagnostic Information in Configuration Models.

Abstract: “As an example, a subset of the configuration problem for the Debian GNU/Linux system is formalized using the new rule-based language.”

The first paper indexed by Scopus that mentions Debian was published in 2000, only 7 years after the first Debian release in 1993.

Paper metadata:
Type = Conference paper, Language = English,
Source = Lecture Notes in Artificial Intelligence

Which papers have been cited the most by other papers?

The top-cited papers describe research software

Ardavan F. Oskooi et al. 2010.
Meep: A flexible free-software package for electromagnetic simulations by the FDTD method.
(cited 2,437 times)

Abstract: “Operating system: Any Unix-like system; developed under Debian GNU/Linux 5.0.2.”

Scopus metadata:
Type = Article, Language = English,
Source = Computer Physics Communications

The most cited paper describes research software

Role-based category: Modeling, Simulation, and Data Analytics

Ardavan F. Oskooi et al. 2010.
Meep: A flexible free-software package for electromagnetic simulations by the FDTD method.

Abstract: “This paper describes Meep, a popular free implementation of the finite-difference time-domain (FDTD) method for simulating electromagnetism. In particular, we focus on aspects of implementing a full-featured FDTD package that go beyond standard textbook descriptions of the algorithm, or ways in which Meep differs from typical FDTD implementations…
Operating system: Any Unix-like system; developed under Debian GNU/Linux 5.0.2…”

The top 10 most cited papers are about algorithm implementation and research software (tools).

#  | Title | Year | Cited by | Type
01 | Meep: A flexible free-software package for electromagnetic simulations by the FDTD method | 2010 | 2437 | Article
02 | SNP-sites: rapid efficient extraction of SNPs from multi-FASTA alignments | 2016 | 893 | Article
03 | Seamless R and C++ integration with Rcpp | 2013 | 281 | Book
04 | Herding cats: Modelling, simulation, testing, and data mining for weak memory | 2014 | 227 | Article
05 | Toward Large-Scale Vulnerability Discovery using Machine Learning | 2016 | 216 | Conference paper
06 | ReDeBug: Finding unpatched code clones in entire OS distributions | 2012 | 208 | Conference paper
07 | THERMINATOR: THERMal heavy-IoN generATOR | 2006 | 190 | Article
08 | Flip feng Shui: Hammering a needle in the software stack | 2016 | 189 | Conference paper
09 | A new monotonic, clone-independent, reversal symmetric, and condorcet-consistent single-winner election method | 2011 | 174 | Article
10 | FreeContact: Fast and free software for protein contact prediction from residue co-evolution | 2014 | 134 | Article

Role-based Category = Modeling, Simulation, and Data Analytics

Is there research software among the 10 oldest publications?

#  | Title (year) | Research software?
01 | Including diagnostic information in configuration models (2000) | Yes: Smodels lang
02 | Demudi: The Debian Multimedia Distribution (2001) | Yes: Debian-derivative
03 | Statistically based postprocessing of phylogenetic analysis by clustering (2002) | Yes: Matlab lang
04 | Towards intelligent support for managing evolution of configurable software product families (2003) | Yes: Product-line prototype
05 | A Framework for Blood Flow Analysis and Research (2003) | Yes: Ultrasound Doppler
06 | Open source software development should strive for even greater code maintainability (2004) | No
07 | Managing volunteer activity in free software projects (2004) | No
08 | Can we trust cryptographic software? Cryptographic flaws in GNU privacy guard v1.2.3 (2004) | No
09 | Demonstration abstract: BNF converter (2004) | Yes: BNF converter
10 | Configurable coprocessing with an ARC-PCI board (2004) | Yes: Hardware + Linux driver

(Yes = 7, No = 3)

Most active researchers

Who are the most active researchers, measured by number of published papers?

# | Author | No. of papers | Oldest paper
1 | Zacchiroli S. | 13 | The Ultimate Debian Database: Consolidating bazaar metadata for Quality Assurance and data mining (2010)
2 | German D.M. | 10 | A Model to Understand the Building and Running Inter-Dependencies of Software (2007)
3 | Di Cosmo R. | 9 | Predicting upgrade failures using dependency analysis (2011)
4 | Robles G. | 9 | Evolution of volunteer participation in libre software projects: Evidence from debian (2005)

“Among the most active researchers, there are publications with studies about Debian or for Debian.”

Active countries

Which countries are contributing the most, based on the affiliations of the researchers?

(MCP: Multiple Country Publications, SCP: Single Country Publications)
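One simple way to label a publication as SCP or MCP is to count the distinct countries among its author affiliations. The sketch below assumes that rule; bibliometric tools may assign countries differently (for example, by corresponding author), and the function name and country lists are invented for illustration.

```python
def classify_collaboration(countries):
    """Return "SCP" for one distinct country, "MCP" for several, None if unknown."""
    # Normalize so that "Brazil" and " brazil " count as the same country.
    distinct = {c.strip().lower() for c in countries if c.strip()}
    if not distinct:
        return None
    return "MCP" if len(distinct) > 1 else "SCP"


# Invented affiliation lists, one entry per author.
print(classify_collaboration(["Brazil", "Brazil"]))         # SCP
print(classify_collaboration(["Brazil", "France"]))         # MCP
print(classify_collaboration(["Spain", " spain ", "Italy"]))  # MCP
```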

Top-relevant terms and frequent words

What are the most relevant terms and concepts in the field?

Future work and next steps:

  1. Current study:
    • Add more databases besides Scopus: Web of Science (WoS), PubMed, etc.
    • Finish the bibliometric analysis.
  2. New study:
    • Cross bibliometric data with upstream source code metrics.
    • Cross bibliometric data with Debian source package metadata.

Thanks!

joenio@joenio.me


This presentation is available at:

http://joenio.me/debconf2025-academictrack-talk


(source-code: https://gitlab.com/joenio/joenio.gitlab.io)

License Creative Commons
