Distributed R

Distributed R
Developer(s)	HP

Stable release	1.2.0^[1] / 22 October 2015

Repository	github.com/vertica/DistributedR
Written in	C++, R
Operating system	Linux
Type	machine learning algorithms
License	GNU General Public License
Website	www.distributedr.org

Distributed R is an open-source, high-performance platform for the R language. It splits tasks across multiple processing nodes to reduce execution time and to analyze large data sets. Distributed R extends R with distributed data structures, parallelism primitives for running functions on distributed data, a task scheduler, and multiple data loaders.^[2] It is mostly used to implement distributed versions of machine learning tasks. Distributed R is written in C++ and R, and retains the familiar look and feel of R. As of February 2015, Hewlett-Packard (HP) provides enterprise support for Distributed R with proprietary additions such as a fast data loader from the Vertica database.^[3]

Distributed R was started in 2011 by Indrajit Roy, Shivaram Venkataraman, Alvin AuYoung, and Robert S. Schreiber as a research project at HP Labs.^[4] It was open-sourced in 2014 under the GPLv2 license and is available on GitHub.

In February 2015, Distributed R reached its first stable version, 1.0, accompanied by enterprise support from HP.^[5]

Components

Distributed R is a platform to implement and execute distributed applications in R. The goal is to extend R for distributed computing, while retaining the simplicity and look-and-feel of R. Distributed R consists of the following components:

Distributed data structures: Distributed R extends R's common data structures such as array, data.frame, and list to store data across multiple nodes. The corresponding Distributed R data structures are darray, dframe, and dlist. Many of the common data structure operations in R, such as colSums, rowSums, nrow and others, are also available on distributed data structures.
Parallel loop: Programmers can use the parallel loop, called foreach, to manipulate distributed data structures and execute tasks in parallel. Programmers specify only the data structure and the function to apply, while the runtime schedules the tasks and, if required, moves data between nodes.
Distributed algorithms: Distributed R includes distributed versions of common machine learning and graph algorithms, such as clustering, classification, and regression.
Data loaders: Users can leverage Distributed R constructs to implement parallel connectors that load data from different sources. Distributed R already provides implementations to load data from files and databases to distributed data structures.
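
The interplay between partitioned data structures and the parallel loop can be illustrated with a small conceptual sketch. It is written in Python rather than R and does not use the Distributed R API: the helper names (split_rows, partial_col_sums, parallel_col_sums) are hypothetical, threads on one machine stand in for worker nodes, and the example only assumes the general pattern described above, in which a function is applied to each partition independently and the partial results are then combined.

```python
from concurrent.futures import ThreadPoolExecutor


def split_rows(matrix, n_parts):
    """Split a list of rows into n_parts contiguous partitions (hypothetical helper)."""
    size, extra = divmod(len(matrix), n_parts)
    parts, start = [], 0
    for i in range(n_parts):
        # The first `extra` partitions receive one additional row.
        end = start + size + (1 if i < extra else 0)
        parts.append(matrix[start:end])
        start = end
    return parts


def partial_col_sums(partition):
    """Column sums of a single partition; needs no data from other partitions."""
    if not partition:
        return []
    return [sum(col) for col in zip(*partition)]


def parallel_col_sums(matrix, n_parts=3):
    """Apply partial_col_sums to every partition in parallel, then combine."""
    parts = split_rows(matrix, n_parts)
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        # Empty partitions (more parts than rows) contribute nothing.
        partials = [p for p in pool.map(partial_col_sums, parts) if p]
    return [sum(col) for col in zip(*partials)]


# 9x3 toy matrix whose entry at (r, c) is r * 3 + c.
matrix = [[r * 3 + c for c in range(3)] for r in range(9)]
result = parallel_col_sums(matrix)
print(result)  # [108, 117, 126]
```

As in the description above, the caller names only the data and the per-partition function; how the work is scheduled is left to the executor, which plays the role of the runtime in this simplified analogy.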

