Leveraging R – The Good, The Bad, and The Better Way by Dr. Steve Poulin
Those who work in Predictive Analytics are always looking for a better, more effective way. We have chosen a field that can get better and better every day. And in the last few years, with the development of new technologies and approaches to acquiring data, we are finding the spotlight on us to do just that: find better ways. Traditionally, there are three primary ways to develop models and algorithms for predictive analytics: (1) the expensive SAS solution, (2) the cheaper but just as effective IBM SPSS, or (3) the open source “R”. We, at CompassRed, think there is a fourth: leverage the best of all three.
What is R?
R is a relatively new open source software program that is similar to the long-standing proprietary statistical software programs of IBM SPSS and SAS, both of which have been available for nearly 50 years. R includes a set of basic data preparation functions, statistical functions such as linear regression and ANOVA, and the ability to produce graphs. In addition to its standard features, a large number of “packages” developed by volunteers are available from the Comprehensive R Archive Network (https://cran.r-project.org). As is the case with all open source software, anyone can develop and edit R routines.
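To make the standard features concrete, here is a minimal sketch of the kind of analysis R handles out of the box, using the built-in mtcars dataset (the dataset and model choice are illustrative, not from the article):

```r
# Linear regression on the built-in mtcars dataset
fit <- lm(mpg ~ wt + hp, data = mtcars)
summary(fit)                      # coefficients, R-squared, p-values

# One-way ANOVA: does mpg differ across cylinder counts?
aov_fit <- aov(mpg ~ factor(cyl), data = mtcars)
summary(aov_fit)

# Base graphics: a scatterplot with the fitted regression line
plot(mpg ~ wt, data = mtcars, main = "MPG vs. Weight")
abline(lm(mpg ~ wt, data = mtcars), col = "red")

# Volunteer-contributed packages extend this base, e.g.:
# install.packages("randomForest")   # downloads from a CRAN mirror
```

Everything above ships with a base R installation; only the commented-out last line reaches out to CRAN.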
One of R’s most important advantages is that it is available free of cost. Anyone can download the software and its packages from CRAN. There are also no license restrictions on the sale of applications built with R code.
Another important advantage is that contributors are constantly uploading new routines, which provides access to many of the newest innovations in statistical analysis. For the proprietary statistical software programs, new routines are only available in the next release of the software, which may not occur for a year or more. Think of it as the new “Tesla” of statistical software: Why wait for next year’s model when you can constantly improve?
Proprietary statistical software companies also generally do not allow users to communicate directly with their development teams; in contrast, the names of R developers are published with each R routine. This ability to correspond directly with developers makes it possible to notify them of bugs, and makes it easier for users to modify R routines in order to fix bugs themselves and to add enhancements.
One of the major disadvantages of R is the time required to develop routines. Unlike IBM SPSS, R lacks a sophisticated graphical user interface (GUI), although some progress on this front was made by Revolution Analytics, which was recently purchased by Microsoft and renamed Microsoft R (Norman Nie, the developer of SPSS, was at one time the CEO of Revolution Analytics). Without a GUI, R is run from the command line.
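In practice, “run from the command line” means every step is a typed command at the R prompt or in a script, rather than a point-and-click dialog. A trivial session might look like this (the values and the script file name are hypothetical):

```r
# Everything happens by typing commands at the interactive prompt
x <- c(4.2, 5.1, 6.3, 7.0)       # create a numeric vector
mean(x)                          # 5.65
sd(x)                            # sample standard deviation

# Longer analyses are saved as scripts and replayed with source():
# source("my_analysis.R")        # hypothetical script file
```

The upside of this style is reproducibility: the script is a complete record of the analysis, something a sequence of GUI clicks cannot provide.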
Users of R must accept the principle of caveat emptor (let the buyer beware), because no organization guarantees the integrity of R routines. The R Foundation promotes the use of R and facilitates communication among R developers and users, but it does not evaluate the quality of R code. Much as with Wikipedia, users must trust that any flaws in R code will be continuously detected and fixed by the community. Using R requires faith in the power of crowdsourcing.
The openness of R code means that there is some risk of viruses within the software. R routines should always be downloaded over a secure connection, from a URL that begins with https, not http. The R Foundation makes the following statement regarding the security of R code:
“CRAN does some checks on these binaries for viruses, but cannot give guarantees. Use the normal precautions with downloaded executables.”
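One simple precaution is to point R at an HTTPS CRAN mirror before installing anything, so every package download travels over a secure connection. A minimal sketch:

```r
# Point R at the primary CRAN mirror over HTTPS before installing anything
options(repos = c(CRAN = "https://cran.r-project.org"))
getOption("repos")                 # confirm the https:// URL is in effect

# Subsequent installs now download over the secure connection:
# install.packages("ggplot2")
```

This can also be placed in a startup file such as .Rprofile so the secure mirror is used in every session.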
The original R code processes data in memory (i.e., within a computer’s RAM), which limits its ability to work with large datasets. Although developers are creating solutions that enable R to store data on disk, users should ensure that their version of R has this capability before using it with large datasets.
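The in-memory limit is easy to see for yourself: every object lives in RAM, and object.size() reports the cost. A small sketch (the disk-backed packages named in the comment, ff and bigmemory, are real CRAN packages offered here only as examples of the on-disk solutions the article mentions):

```r
# R keeps objects in RAM; a 10-million-element numeric vector
# alone occupies 8 bytes per element, i.e. ~76 MB
x <- numeric(1e7)
print(object.size(x), units = "MB")

# A data frame with many such columns quickly exhausts memory.
# Disk-backed alternatives on CRAN (e.g. the 'ff' or 'bigmemory'
# packages) keep the data on the hard drive and page it in as needed.
```
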
The Better Way
To overcome some of the disadvantages of R, one could consider running R within a proprietary software program. For instance, IBM SPSS makes it easy to add R routines to its GUI, so one could use IBM SPSS as a platform for running R, and R as an extension of IBM SPSS’s capabilities. It is possible to do the same from SAS, but it is not as straightforward.
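As a sketch of what this looks like in IBM SPSS (assuming IBM’s “Essentials for R” plug-in is installed; the variable names are hypothetical), R code is embedded directly in SPSS syntax, with the active dataset passed across:

```
BEGIN PROGRAM R.
dat <- spssdata.GetDataFromSPSS()          # pull the active SPSS dataset
fit <- lm(salary ~ jobtime, data = dat)    # hypothetical variables
print(summary(fit))                        # results appear in the SPSS Viewer
END PROGRAM.
```

This fragment runs inside an SPSS session, not in standalone R, so the SPSS GUI and data management remain available around the embedded R step.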
Whether R is run on its own or within proprietary software, it has become an integral part of statistical analysis. Advances that make all of these toolsets more valuable are arriving every day, and we are likely to see a resurgence in Predictive Intelligence as a result. Artificial Intelligence (AI) and machine learning are enablers of this resurgence, and further advances in Predictive Intelligence are predicated on the success of all these tools.