R for Healthcare Data Science

Introduction

Data science is the art of managing the process that transforms hypotheses and data into actionable predictions. Such predictions include who will win an election, how long patients are likely to stay in hospital, who is likely to suffer a complication of a procedure or disease, or what the population of a place will be in the next few years. It is an interdisciplinary field that draws on statistics, computer science, information technology, mathematics and other related fields (in the social, core and applied sciences).

Many statistical and graphing software packages exist to fulfill this purpose. Some have been in existence for about 50 years (e.g. SPSS). The more popular ones include Microsoft Excel, IBM SPSS, CDC Epi Info, SAS, Stata, Minitab, Python, MATLAB and R. It is generally believed that good data science software should be able to:

  • allow for easy data import
  • make data manipulation and transformation easy
  • generate great and aesthetically pleasing graphs
  • perform powerful data exploration and analysis
  • support multiple data modelling and prediction methods
  • allow for easy communication of results between data scientists and the general audience.

Why use R?

R is a programming language for statistical computing developed by Ross Ihaka and Robert Gentleman about 25 years ago as an implementation of the S language. It is more recent than most of the other popular data science software.

It meets all the criteria listed above and much more. Among the many reasons it has gained popularity despite its relatively short existence are:

Free and Open-source

It is completely free and open-source! Many data scientists and analysts (especially in low and middle-income countries) will relate to this. Imagine having to purchase a new version of a statistical package every time it is released, some of which cost a couple of hundred dollars, or having to pay an annual subscription to use the software. R also runs on a wide array of platforms, including Windows, Unix, Mac OS X and Linux distributions such as Ubuntu.

Free means that using the software incurs no cost (apart from the cost of data for downloading it, of course): no subscription and no additional charges whatsoever!

Its open-source nature gives users access to the source code of the program, so they can study it and possibly modify it to suit their purpose.

Graphic Capability

It has state-of-the-art graphic capabilities, allowing for static, dynamic and interactive data visualization. This is made possible by its integration with other languages and libraries, including JavaScript and Leaflet among many others, for generating such graphs. This site will capitalise on this to display some powerful graphics made possible with R.

Data Exploration, Inference and Modelling

R’s capacity for data exploration, inference and modelling is extensive. Because it is free and open-source, it is a favorite of many statisticians for building their statistical packages for data analysis. As a result, many statistical techniques become available in R much earlier than in most other data science software.

Collaboration

Being free, R is easily accessible to different data scientists, which means everyone on a project can work with the same software. It also integrates well with other data science software, importing and exporting data in formats the more popular packages can use. In essence, one can start a data science project with any other statistical package, save the data and continue with R, and vice versa.
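As a minimal sketch of this hand-off, the base R example below writes a small data set to CSV, a format most statistical packages can read, and then reads it back as if it had come from another program. The data frame and its column names are invented purely for illustration. Native formats such as SPSS or Stata files are commonly handled with add-on packages such as haven or foreign, which are not shown here.

```r
# Hypothetical patient data, invented purely for illustration
patients <- data.frame(
  id = 1:4,
  age = c(34, 51, 27, 63),
  length_of_stay = c(3, 7, 2, 10)
)

# Export to CSV, a format most statistical packages can open
csv_path <- tempfile(fileext = ".csv")
write.csv(patients, csv_path, row.names = FALSE)

# Import the file again, as if it had been saved by another package
imported <- read.csv(csv_path)

# Continue the analysis in R
mean_stay <- mean(imported$length_of_stay)
print(mean_stay)
```

Using a temporary file keeps the sketch self-contained; in a real project the path would point to a shared project folder.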

IBM SPSS leveraged this collaborative power by integrating R with its version 21. Microsoft is also leveraging it by incorporating R into its products (including Microsoft R Open, Microsoft R Client, and Microsoft R Server).

Reproducible Research

At the core of data science (and indeed research) is “reproducibility”. In an ideal setting, the goal is “replicability”. However, due to time and financial constraints, most research is not replicable. Even where those constraints are absent, the research settings are almost always going to be different, making replicability almost impossible or impractical. The next best thing is “reproducibility”. R supports reproducible research well, allowing for version control and a reproducible programming environment.
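One small ingredient of a reproducible analysis can be sketched in base R: fixing the random seed so that any step involving random numbers gives the same result on every run, and recording the R version next to the results. The function name `draw_sample` and the seed value are assumptions made for this illustration only.

```r
# Draw a random sample of hypothetical patient IDs using a fixed seed
draw_sample <- function(seed) {
  set.seed(seed)           # fix the random number generator state
  sample(1:100, size = 5)  # pick 5 IDs out of 100 without replacement
}

first_run <- draw_sample(2024)
second_run <- draw_sample(2024)

# The same seed gives the same sample on every run
print(identical(first_run, second_run))

# Record the R version alongside the results
r_version <- R.version.string
print(r_version)
```

Sharing the script together with the seed and the version information lets another analyst rerun it and compare the output directly.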

Communication of Results

It can easily generate reports in various formats, including Microsoft Word, PDF, HTML and PowerPoint, among others. This allows the R user to write a single script and convert it to whichever output formats are needed.

Conclusion

The choice of R stems from its powerful data science capabilities as well as its smooth integration with many other software packages. It is interesting to note that the WHO Trends in Maternal Mortality: 1990 to 2015 report was analysed with R. The abstract is available here and the full paper here.

The following articles make interesting reading and further support the choice of R:

  1. The popularity of data science software

  2. Quantitative analysis guide: which statistical software to use?

  3. UK government using R to modernize the reporting of official statistics

  4. Comparison of statistical packages