The source code of this document is available at https://gitlab.com/rgaiacs/sheffield-r-users-group-2018-07-03.

Why use Python?

Python is a programming language that is popular in many domains and has a rich ecosystem of libraries.

Why mix Python and R?

Because R also has a rich ecosystem of libraries.

And you don’t want to reinvent the wheel.

Why is mixing languages hard?

There are two ways that we can mix languages: in the workflow and at execution time.

Workflow

Each language reads from or writes to files on our storage device. For example, we could have one Python script that writes a CSV file, such as

with open("data.csv", "w") as _file:
    _file.write("x,y\n1,1\n2,2\n3,2\n4,1\n")

and one R script that reads the CSV file and creates a visualisation, such as

data <- read.csv("data.csv")
plot(data)

The problem with this approach is that input/output through a storage device is slow compared with working in memory.
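To make the cost of this approach more concrete, here is a minimal Python sketch of the same file-based exchange, with the writer and the reader in a single script. The temporary directory and the names `path` and `rows` are illustrative assumptions only; in the workflow above, the reader would be the R script.

```python
import csv
import os
import tempfile

# Write the same CSV content as the Python script above,
# but into a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "data.csv")
    with open(path, "w", newline="") as _file:
        _file.write("x,y\n1,1\n2,2\n3,2\n4,1\n")

    # Read the file back, as the consumer in the other language would.
    with open(path, newline="") as _file:
        rows = list(csv.DictReader(_file))

# Every value comes back as text and has to be parsed again.
print(rows[0])
```

Both sides pay for serialising the data to text and parsing it back, on top of the access to the storage device itself.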

Execution Time

Each language will read or write data in memory that can be accessed by the other. For example, we could have one Fortran subroutine, such as

      subroutine sum(x, y, total)
              real :: x, y, total
              total = x + y
      end subroutine sum

and one C program that calls that Fortran subroutine, such as

#include <stdlib.h>
#include <stdio.h>

// Function declaration.
// GFortran adds _ at the end of the name of the subroutine.
void sum_(float *x, float *y, float *total);

int main()
{
    float x;
    float y;
    float total;

    x = 1.0;
    y = 1.0;
    sum_(&x, &y, &total);  // Call to Fortran subroutine.

    printf("x     = %f\n", x);
    printf("y     = %f\n", y);
    printf("total = %f\n", total);

    return 0;
}

In the previous C code, note that the variables are passed to the Fortran code by reference and not by value.

cd c-and-fortran && make -B all
gfortran -c sum.f
gcc main.c sum.o -o main
./main
x     = 1.000000
y     = 1.000000
total = 2.000000

Access to memory is faster than access to storage devices, but manipulating memory shared with another language requires extra attention because different languages store information in different ways, e.g. row-major order versus column-major order for storing multidimensional arrays in linear storage.
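The difference between the two orders can be sketched in a few lines of pure Python. The 2 x 3 matrix below is a made-up example; row-major order is the convention of C, and column-major order is the convention of Fortran and R.

```python
# A 2 x 3 matrix, written as a list of rows.
matrix = [
    [1, 2, 3],
    [4, 5, 6],
]
n_rows = len(matrix)
n_cols = len(matrix[0])

# Row-major order (C): rows are stored one after another.
row_major = [matrix[i][j] for i in range(n_rows) for j in range(n_cols)]

# Column-major order (Fortran, R): columns are stored one after another.
column_major = [matrix[i][j] for j in range(n_cols) for i in range(n_rows)]

print(row_major)     # [1, 2, 3, 4, 5, 6]
print(column_major)  # [1, 4, 2, 5, 3, 6]
```

If one language writes the buffer in one order and the other reads it assuming the other order, the values are all there but end up in the wrong positions.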

Why is bridging Python and R hard?

Python and R differ in many respects, and the fact that both are interpreted only makes memory access harder.

For example, R is a lazy loading/lazy evaluation language but Python is not. What does this mean? Let’s look at a few examples inspired by John D. Cook’s blog post.

f <- function(a) {
  a * log(a)
}
f(2)
## [1] 1.386294

This R function can be implemented in Python as

import math
def f(a):
    return a * math.log(a)
print(f(2))  # The print function is needed to include the output in the document.
## 1.3862943611198906

The same R function could be implemented as

f <- function(a, b=c) {
  c = log(a)
  a * b
}
f(2)
## [1] 1.386294

The default argument is a variable that does not exist until the body of the function executes! Python does not support constructions like that.

def f(a, b=c):
    c = math.log(a)  # We already imported math
    return a * b
## NameError: name 'c' is not defined
## 
## Detailed traceback: 
##   File "<string>", line 1, in <module>
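A common Python idiom for getting a similar effect is to use `None` as a sentinel default and compute the real value inside the body. The sketch below is one possible way to write the function, not a direct equivalent of R's lazy evaluation.

```python
import math


def f(a, b=None):
    # Python evaluates default values once, when the function is defined,
    # so a default cannot refer to a name created inside the body.
    # A None sentinel postpones the computation until the call.
    if b is None:
        b = math.log(a)
    return a * b


print(f(2))     # default b is computed from a
print(f(2, 3))  # explicit b is used as given
```

The caller can still pass `b` explicitly, and the default is only computed when it is actually needed.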

Reticulate

The reticulate package includes the following features:

  • Python chunks run in a single Python session, which means shared variables/state between Python chunks,
  • Printing of Python output, which also includes graphical output from matplotlib,
  • Access Python from R,
  • Access R from Python.

Access Python from R

Python objects can be accessed using the R py object.

x = 1
y = 1
py$x + py$y
## [1] 2

The R py object is a named list of R objects mapped to Python objects. If you change the value of py elements in R, the Python variable will change.

py$x <- 0
print(x)
## 0.0

It is also possible to create Python variables by adding new elements to the py object.

py$z <- 2
print(z)
## 2.0

Lists

fruits = ["apple", "banana", "melon"]
fruit_units = [53, None, 7]

In Python, fruits and fruit_units are both lists. But in R, fruits is a multi-element vector

py$fruits
## [1] "apple"  "banana" "melon"

and fruit_units is a list of multiple types

py$fruit_units
## [[1]]
## [1] 53
## 
## [[2]]
## NULL
## 
## [[3]]
## [1] 7

Dictionaries

fruits = {
    "apple": 53,
    "banana": None,
    "melon": 7,
}

In Python, fruits is a dict and, in R, fruits is a named list.

py$fruits
## $apple
## [1] 53
## 
## $banana
## NULL
## 
## $melon
## [1] 7

Data Frame

You can read tabular data using pandas.

import pandas
data = pandas.read_csv("data.csv")

reticulate will convert a pandas DataFrame into an R data frame.

py$data
##   x y
## 1 1 1
## 2 2 2
## 3 3 2
## 4 4 1

The data frame can then be visualised in R.

plot(py$data)

Access R from Python

R objects can be accessed using the Python r object.

x <- 1
y <- 1
print(r.x + r.y)
## 2.0

The same kinds of type conversion happen as demonstrated before.

False Friends

When working with Python and R, you will need to be careful with a few things.

Numeric Types

Python and R have different default numeric types.

print(type(1))
## <class 'int'>
typeof(1)
## [1] "double"

In Python, you can include a decimal point to ensure that the number is a floating point.

print(type(1.))
## <class 'float'>

And in R, you can use the L suffix to ensure that the number is an integer.

typeof(1L)
## [1] "integer"

Indexes

Python collections are addressed using 0-based indices rather than the 1-based indices used in R.

collection = ["apple", 53, "banana", None, "melon", 7]
print(collection[0])
## apple
collection <- c("apple", 53, "banana", NA, "melon", 7)
collection[1]
## [1] "apple"

You always need to use the index base of the language you are coding in. This means that if you are using Python you must use 0-based indices even if you are accessing an R object.

print(r.collection[0])
## apple

In a similar fashion, if you are using R you must use 1-based indices even if you are accessing a Python object.

py$collection[1]
## [[1]]
## [1] "apple"
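When a position has to travel from one language to the other, the translation is a shift by one. The sketch below uses a hypothetical helper, `from_one_based`, which is not part of reticulate or of the original example.

```python
collection = ["apple", 53, "banana", None, "melon", 7]


def from_one_based(position):
    # Hypothetical helper: translate a 1-based position (R style)
    # into a 0-based index (Python style).
    if position < 1:
        raise IndexError("1-based positions start at 1")
    return position - 1


print(collection[from_one_based(1)])  # apple
print(collection[from_one_based(6)])  # 7
```

Rejecting positions below 1 matters because Python would otherwise silently treat index -1 as the last element.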

Functions that take an index

When passing an argument that will be used as an index inside a function, you must follow the index base of the language in which the function was written. If you have a Python function

def get_from_list(i):
    fruits = ["apple", "banana", "melon"]
    return fruits[i]
print(get_from_list(0))
## apple

then in R you must still use 0-based indices when calling it

py$get_from_list(0L)
## [1] "apple"

And remember to use the L suffix to ensure that the number is an integer, otherwise

py$get_from_list(0)
## Error in py_call_impl(callable, dots$args, dots$keywords): TypeError: list indices must be integers or slices, not float
## 
## Detailed traceback: 
##   File "<string>", line 3, in get_from_list
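The same error can be reproduced in pure Python, because an R double such as 0 reaches Python as the float 0.0. The sketch below also shows one defensive option on the Python side; `get_from_list_safe` is a hypothetical variant of the function above, not something reticulate provides.

```python
fruits = ["apple", "banana", "melon"]

# Python lists refuse float indices, even whole-number ones.
try:
    fruits[0.0]
except TypeError as error:
    print(error)


def get_from_list_safe(i):
    # Hypothetical helper: accept whole-number floats, reject anything else.
    if isinstance(i, float):
        if not i.is_integer():
            raise TypeError("index must be a whole number")
        i = int(i)
    return fruits[i]


print(get_from_list_safe(0.0))  # apple
```

Whether to convert on the Python side or to insist on the L suffix on the R side is a design choice; converting silently is more forgiving but can hide mistakes.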

Copy of List

In R, lists are copied by value.

x <- list(1, 2, 3)
print(x)
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 2
## 
## [[3]]
## [1] 3
y <- x
print(y)
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 2
## 
## [[3]]
## [1] 3
y[1] <- 4
print(x)
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 2
## 
## [[3]]
## [1] 3
print(y)
## [[1]]
## [1] 4
## 
## [[2]]
## [1] 2
## 
## [[3]]
## [1] 3

But in Python, lists are copied by reference: assignment binds a second name to the same list.

x = [1, 2, 3]
print(x)
## [1, 2, 3]
y = x
print(y)
## [1, 2, 3]
y[0] = 4
print(x)
## [4, 2, 3]
print(y)
## [4, 2, 3]
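If you want the R behaviour in Python, you have to ask for a copy explicitly. The sketch below uses the standard `list.copy` method and `copy.deepcopy`; the variable names are illustrative.

```python
import copy

x = [1, 2, 3]

# Plain assignment binds a second name to the same list object.
alias = x
# A shallow copy creates a new list with the same elements.
shallow = x.copy()
# A deep copy also duplicates any nested containers.
deep = copy.deepcopy(x)

alias[0] = 4

print(x)        # [4, 2, 3]
print(shallow)  # [1, 2, 3]
print(deep)     # [1, 2, 3]
print(alias is x, shallow is x)  # True False
```

For a flat list of numbers a shallow copy is enough; a deep copy only matters when the list contains other mutable objects.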

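This means that you should be extremely careful when editing lists that are shared between Python and R. In the example below, the output suggests that the assignment from R replaces y with a new, converted list, so x and y no longer refer to the same object.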

x = [1, 2, 3]
y = x
py$y[1] <- 4
print(x)
## [1, 2, 3]
print(y)
## [4.0, 2.0, 3.0]
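The result above is consistent with a round trip in which the list is converted, modified and then bound to the name again. The pure-Python sketch below only emulates that behaviour under this assumption; it is not a description of reticulate's internals.

```python
x = [1, 2, 3]
y = x
print(y is x)  # True

# Emulate the round trip: build a converted copy (R numerics are doubles),
# change its first element, then bind the name y to the new list.
converted = [float(value) for value in y]
converted[0] = 4.0
y = converted

print(x)       # [1, 2, 3]
print(y)       # [4.0, 2.0, 3.0]
print(y is x)  # False
```

After the rebinding, the two names point to different lists, which is why the change is visible in y but not in x.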

More Examples

  1. Determining and plotting the altitude/azimuth of a celestial object
  2. Using Python in R to study Game Theory by Vince Knight

What Next?

Build a binder workshop

These are workshops funded by Mozilla and the Sloan Foundation and run by data scientist Tim Head. The UK edition will be at the University of Birmingham on 17th July 2018. It is a free one-day workshop to

  • Learn how to use mybinder.org to turn your GitHub repo into an interactive notebook,
  • Learn how others are using mybinder.org to make their research reproducible.

For more info and registration see https://build-a-binder.github.io/. Registration currently closes on 4th July.