I’ve recently answered a question on the German Quora (https://de.quora.com/Warum-sollte-man-in-Java-lieber-auf-die-Rekursion-verzichten/answer/Markus-Toman) asking: “Why should you refrain from recursion in Java?”.

I think we should avoid recursion by default, except if:

  • you are working in a language where recursion is essential and highly optimized (Haskell, LISP dialects, Erlang/Elixir and most other functional languages)
  • the recursive version is much easier to maintain and understand AND does not negatively affect performance to an unreasonable degree
  • the recursive algorithm is easier to parallelize (typically because of immutability)

In this post we discuss this topic, starting with high-level Python, going over to C++ and finally ending up in assembly.

Especially when a language does not support tail call optimization (TCO – see also Tail Recursion – GeeksforGeeks), we have to worry about memory usage and performance, and in the worst case a stack overflow. While recursion can often be beautiful and express a solution more clearly, many languages are not well-designed to handle it. So of course, if you’re a functional programmer – have fun and recurse your brain out. If you’re dealing with a language like Python, be a bit more careful.

Especially if you’re dealing with high recursion depth or many parallel tasks performing recursion.
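To see how quickly this bites in practice, here is a minimal sketch in CPython (function names are mine, for illustration): even a simple linear recursion fails once it exceeds the interpreter's recursion limit, which defaults to a value in the low thousands, while the iterative equivalent runs in constant stack space.

```python
import sys

def rec_sum(n):
    """Sum 0..n recursively -- each call adds a new stack frame."""
    if n == 0:
        return 0
    return n + rec_sum(n - 1)

def iter_sum(n):
    """The iterative equivalent uses constant stack space."""
    total = 0
    for i in range(n + 1):
        total += i
    return total

print(sys.getrecursionlimit())   # typically around 1000
print(iter_sum(100_000))         # fine: 5000050000
try:
    rec_sum(100_000)             # far beyond the default limit
except RecursionError as e:
    print("RecursionError:", e)
```

Raising the limit with `sys.setrecursionlimit` only defers the problem; each frame still costs memory, and go too far and you get a real crash instead of a clean exception.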

Let’s start at a high level

Dynamic time warping in Python

A couple years ago I used an existing, recursive implementation of Dynamic time warping (DTW) with cross-entropy to map sequences of probability distributions.

The relevant part is shown here:

    def calculate_backward(self, i1, i2):
        '''
        Calculate the dtw distance between
        seq1[:i1 + 1] and seq2[:i2 + 1]
        '''
        if self._map.get((i1, i2)) is not None:
            return self._map[(i1, i2)]
         
        if i1 == -1 or i2 == -1:
            self._map[(i1, i2)] = float('inf')
            return float('inf')
         
        min_i1, min_i2 = min((i1 - 1, i2), (i1, i2 - 1), (i1 - 1, i2 - 1),
                                key=lambda x: self.calculate_backward(*x) )
         
        self._map[(i1, i2)] = self.get_distance(i1, i2) + self.calculate_backward(min_i1, min_i2)
         
        return self._map[(i1, i2)]

(see also m-toman/AMTV @ github)

While I wasn’t dealing with very long sequences, this implementation blew up pretty quickly by exceeding Python’s recursion limit (a RecursionError rather than a literal stack overflow).

So I implemented the iterative version as follows:

        for i in range(0, len(self._seq1)):
            self._map[(i,-1)] = float('inf')
            
        for i in range(0, len(self._seq2)):
            self._map[(-1,i)] = float('inf')
            
        self._map[(-1,-1)] = 0

        for i in range(0, len(self._seq1)):
            for j in range(0, len(self._seq2)):
                cost = self.get_distance(i, j)
                self.distance_map[(i,j)] = cost
                self._map[(i,j)] = cost + min(self._map[ i-1 , j    ],      # insertion
                                              self._map[ i   ,  j-1 ],      # deletion
                                              self._map[ i-1 ,  j-1 ]       # match
                                             )
        return self._map

(see also m-toman/AMTV @ github)

And it worked like a charm.
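Stripped of the class around it, the iterative scheme is just filling the cost table bottom-up. Here is a self-contained sketch of the idea – note that I use a plain absolute difference as the local cost and a hypothetical `dtw` function name, whereas the original code computes cross-entropy between probability distributions:

```python
def dtw(seq1, seq2, dist=lambda a, b: abs(a - b)):
    """Iterative dynamic time warping: fill the table bottom-up."""
    table = {(-1, -1): 0.0}
    # borders are infinite so the minimum never steps outside the table
    for i in range(len(seq1)):
        table[(i, -1)] = float('inf')
    for j in range(len(seq2)):
        table[(-1, j)] = float('inf')

    for i in range(len(seq1)):
        for j in range(len(seq2)):
            cost = dist(seq1[i], seq2[j])
            table[(i, j)] = cost + min(table[(i - 1, j)],      # insertion
                                       table[(i, j - 1)],      # deletion
                                       table[(i - 1, j - 1)])  # match
    return table[(len(seq1) - 1, len(seq2) - 1)]

print(dtw([1, 2, 3], [1, 2, 2, 3]))  # 0.0 -- the sequences align perfectly
```

No recursion anywhere, so the depth of the table never touches the call stack.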

Let’s dig deeper…

Toy example in C++

I came up with a slightly simpler example so we have an easier time analyzing it.

The following program fills an array with its index times two and also calculates the sum of it (don’t ask…). We repeat this a couple of times so we can better compare the runtime costs.

So let’s look at an iterative and a recursive implementation:

// iterative.cpp
int main() {
  const int max_arr_len = 100000;
  const int max_runs = 10000;
  int arr[max_arr_len];
  for (int runs = 0; runs < max_runs; ++runs) {
      long sum = 0L;
      for(int i = 0; i < max_arr_len; ++i) {
         arr[i] = i + i;
         sum += (long)arr[i]; 
      }
   } 
} 

// recursive.cpp
long recur(int* arr, int idx, long sum, int max) {
   if(idx >= max) {
      return sum;
   }
   else {
      arr[idx] = idx + idx;
      return recur(arr, idx+1, sum+(long)arr[idx], max);
   }
}
int main() {
  const int max_arr_len = 100000;
  const int max_runs = 10000;
  int arr[max_arr_len];
  for (int runs = 0; runs < max_runs; ++runs) {
      long sum = recur(arr, 0, 0L, max_arr_len);
  }
}

Now we’ll time the run:

$ g++ recursive.cpp ; time ./a.out 
real	0m17.732s
user	0m16.236s
sys	0m0.139s
 
$ g++ iterative.cpp ; time ./a.out 
real	0m3.305s
user	0m3.254s
sys	0m0.016s

Wow, the iterative version is more than 5x faster.

Again, let’s dig a bit deeper.

Toy example in Assembly

Let’s first look at the (unoptimized) assembly of the recursive version. I’ve thrown out the non-essential parts and added a couple of comments:

## recursive.s
__Z5recurPiili:                         ## @_Z5recurPiili             
## Saving stackpointer, reserve 48 bytes for this call:
        pushq   %rbp                    
        movq    %rsp, %rbp
        subq    $48, %rsp
 
## ...
## go into the recursion part:
        cmpl    -36(%rbp), %ecx                                       
        jl      LBB0_2
 
## ...
## otherwise leave:
       jmp     LBB0_3
 
## do calculations and then call the recursion:
LBB0_2: 
        movl    -20(%rbp), %eax
        addl    -20(%rbp), %eax
        movq    -16(%rbp), %rcx                                       
        movslq  -20(%rbp), %rdx
        movl    %eax, (%rcx,%rdx,4)                                   
        movq    -16(%rbp), %rdi
        movl    -20(%rbp), %eax
        addl    $1, %eax
        movq    -32(%rbp), %rcx                                       
        movq    -16(%rbp), %rdx
        movslq  -20(%rbp), %rsi
        movslq  (%rdx,%rsi,4), %rdx
        addq    %rdx, %rcx
        movl    -36(%rbp), %r8d
        movl    %eax, %esi
        movq    %rcx, %rdx
        movl    %r8d, %ecx
        callq   __Z5recurPiili      ## <----
 
## return
LBB0_3:
        movq    -8(%rbp), %rax
        addq    $48, %rsp
        popq    %rbp
        retq

Each “callq __Z5recurPiili” pushes a return address onto the stack, and every call then reserves another 48 bytes for its local variables (subq $48, %rsp) – see CALL – Call Procedure for more details. This means stack usage grows linearly with recursion depth, as we gradually touch more and more of the virtual address space.

Now the iterative version. To make it a bit easier to compare, I’ve pulled out the inner loop into a function:

// iterative_inner.cpp
long iterate(int* arr, int len) {
  long sum = 0;
  for(int i = 0; i < len; ++i) {
      arr[i] = i + i;
      sum += (long)arr[i];
  }
  return sum; 
} 
//snip

And the resulting assembly, again pruned to the essentials and comments added:

## iterative_inner.s
__Z7iteratePii:                         ## @_Z7iteratePii
## save the stack pointer again
        pushq   %rbp
        movq    %rsp, %rbp
 
## ...
## check the loop condition and leave if done
LBB0_1:
        movl    -28(%rbp), %eax
        cmpl    -12(%rbp), %eax
        jge     LBB0_4
 
## otherwise do our calculations
        movl    -28(%rbp), %eax
        addl    -28(%rbp), %eax
        movq    -8(%rbp), %rcx
        movslq  -28(%rbp), %rdx
        movl    %eax, (%rcx,%rdx,4)
        movq    -8(%rbp), %rcx
        movslq  -28(%rbp), %rdx
        movslq  (%rcx,%rdx,4), %rcx
        addq    -24(%rbp), %rcx
        movq    %rcx, -24(%rbp)
        movl    -28(%rbp), %eax
        addl    $1, %eax
        movl    %eax, -28(%rbp)
## now continue the loop
        jmp     LBB0_1
LBB0_4:
        movq    -24(%rbp), %rax
        popq    %rbp
        retq

The function reserves stack space only once, for the call itself; beyond that, the loop doesn’t touch any further virtual address space.

To sum up, the iterative version is not only more than 5x faster, it also has constant memory requirements, no matter how many iterations.

But this is without optimization…

Toy example in Assembly, optimized

Now we activate the compiler optimizations and recompile:

 

$ g++ -O3 iterative.cpp ; time ./a.out 
real	0m0.005s
user	0m0.001s
sys	0m0.002s
 
$ g++ -O3 recursive.cpp ; time ./a.out 
real	0m0.004s
user	0m0.001s
sys	0m0.002s

Oh, wait, what happened here?
Since we never use any of the results, the compiler simply eliminated the whole computation as dead code.

So let’s modify the code a bit:

// iterative.cpp
#include <iostream>
int main() {
  const int max_arr_len = 100000;
  const int max_runs = 1000000;
  int arr[max_arr_len];
  long sumsum = 0;
  long randsum = 0;
  for (int runs = 0; runs < max_runs; ++runs) {
      long sum = 0;
      for(int i = 0; i < max_arr_len; ++i) {
        arr[i] = i + i;
        sum += (long)arr[i];
      }
      sumsum += sum;
      randsum += arr[100];
  }
  std::cout << sumsum << std::endl;
  std::cout << randsum << std::endl;
}
 
// recursive.cpp
#include <iostream>
long recur(int* arr, int idx, long sum, int max) {
  if(idx >= max) {
      return sum;
  }
  else {
      arr[idx] = idx + idx;
      return recur(arr, idx+1, sum+(long)arr[idx], max);
  }
}
 
int main() {
  const int max_arr_len = 100000;
  const int max_runs = 1000000;
  int arr[max_arr_len];
  long sumsum = 0;
  long randsum = 0;
  for (int runs = 0; runs < max_runs; ++runs) {
      long sum = recur(arr, 0, 0L, max_arr_len);
      sumsum += sum;
      randsum += arr[100];
  }
  std::cout << sumsum << std::endl;
  std::cout << randsum << std::endl;
}

I’ve also pumped up the number of iterations a bit for the optimized version.
So now we can run it again:

$ g++ -O3 iterative.cpp ; time ./a.out 
9999900000000000
200000000
real	0m21.900s
user	0m21.127s
sys	0m0.098s
 
$ g++ -O3 recursive.cpp ; time ./a.out 
9999900000000000
200000000
real	0m23.792s
user	0m21.582s
sys	0m0.172s

Neat, not much of a difference now.

So it seems the compiler was able to deal with the recursion.
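What the optimizer most likely does here is tail call elimination: since the recursive call is the very last thing `recur` does, its stack frame can be reused, effectively turning the recursion into a loop. The same transformation can be done by hand. Here is a Python rendition of the accumulator-style `recur` and its mechanically derived loop version (names are mine):

```python
def recur(arr, idx, acc, max_len):
    """Tail-recursive: the recursive call is in tail position."""
    if idx >= max_len:
        return acc
    arr[idx] = idx + idx
    return recur(arr, idx + 1, acc + arr[idx], max_len)

def recur_eliminated(arr, idx, acc, max_len):
    """Manual tail call elimination: rebind the parameters and loop."""
    while idx < max_len:
        arr[idx] = idx + idx
        acc += arr[idx]
        idx += 1
    return acc

a, b = [0] * 100, [0] * 100
assert recur(a, 0, 0, 100) == recur_eliminated(b, 0, 0, 100) == 9900
```

A C++ compiler at -O3 performs exactly this rewrite automatically; CPython never does, which is why the loop version is the safe choice there.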

Summary

What we saw in this example is that it’s not only functional-language compilers that can deal with recursion – but the Python example showed that we can’t necessarily rely on it, especially in dynamic and interpreted languages.

So coming back to my original statement: the iterative version of an algorithm is usually the safer bet in imperative languages. This is partly because of the inherent immutability of recursion: we never mutate the index or the sum, so without TCO every call has to store its own copies of them. On the other hand, that same property makes parallelization much easier – but we’ll deal with that topic another time.
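As a quick taste of that last point: because each piece of the work is a pure function of its inputs (no shared mutable state), the sum splits cleanly across workers. A sketch using `concurrent.futures` (in CPython the GIL limits the actual speedup for pure-Python code, so this mainly illustrates the decomposition):

```python
from concurrent.futures import ThreadPoolExecutor

def chunk_sum(lo, hi):
    """Pure function over a half-open range -- no shared state."""
    return sum(i + i for i in range(lo, hi))

def parallel_sum(n, workers=4):
    """Split [0, n) into disjoint chunks and combine the partial sums."""
    step = -(-n // workers)  # ceiling division
    bounds = [(k, min(k + step, n)) for k in range(0, n, step)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(lambda b: chunk_sum(*b), bounds))

print(parallel_sum(100_000))  # same result as the sequential loop
```

Because nothing is mutated, the chunks can run in any order and the partial results combine trivially.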

But as usual: “Measure, measure, measure!” 🙂