Rollendurchmesserzeitsammler

Rollendurchmesserzeitsammler

About
Rollendurchmesserzeitsammler is a conservative garbage collector for realtime audio processing. Its main purpose is to make language implementations using conservative garbage collectors work reliably and efficiently in realtime.
The collector is made to be a drop in replacement for Hans Boehm's conservative garbage collector for C and C++ (BDW-GC). This means that Rollendurchmesserzeitsammler makes it easier to use programming languages for realtime audio programming which uses BDW-GC.
The garbage collection runs in a parallel thread and it operates on snapshots taken of the heap between processing blocks of audio data. Predictability and hard realtime are ensured by simulating worst case. CPU cache issues for computers with multiple processors should be handled appropriately.
How it works
The collector works by first allocating a fixed size heap small enough to be fully copied within '2*(m-s)', where 'm' is the sound block period (i.e. the time between two soundcard interrupts), and 's' is the time it takes to process one block of audio.
Copying this fixed size heap happens in a parallel thread, and it should normally not take very much time. An important restriction is that it is not possible to use more pointer-holding memory than the size of this heap. A pointer-heap size of 1MB seems to currently be enough for most types of instruments, and on most computer it should take much less time than 1 ms to copy 1MB of memory. (On new computers, a pointer-heap size of up to 100MB should work.)
It is also important to note that pointerless memory ("atomic memory"), for instance audio sample buffers or FFT data, is allocated from a different heap, called the atomic heap. The atomic heap can be as large as possible without causing any increase in CPU.
After making a copy of the pointer-holding heap, a tweaked mark-and-sweep collector runs in a parallel thread independent of the rest of the program. Messages about unreferenced memory is sent back and thereby consequently freed from the heap at appropriate times in a non-blocking fashion.
News
Rollendurchmesserzeitsammler V0.1.0 is a complete reimplementation of the garbage collector.
The performance should be much higher and the code easier to read.
Download
http://archive.notam02.no/arkiv/src/?C=M;O=D.

Rollendurchmesserzeitsammler also uses some code from the Boehm-Demers-Weiser garbage collector (BDW-GC), and it is necessary to have the source of BDW-GC installed before compiling Rollendurchmesserzeitsammler. Homepage of BDW-GC is here.
Rollendurchmesserzeitsammler can optionally use TLSF for allocating memory. The TLSF allocator is a tiny bit slower than the supplied allocator in Rollendurchmesserzeitsammler, but on the other hand, the TLSF allocator should handle fragmentation better. Homepage of TLSF is here. (Rollendurchmesserzeitsammler V0.1.0 or higher is currently not optimized for TLSF, so you could currently see a significant decrease in CPU using TLSF!)
Documentation
- README file.
- Master's thesis in Informatics: Realtime Audio with Garbage Collection.
- Paper: Conservative Garbage Collectors for Realtime Audio Processing. From ICMC 2009. Additional documentation for the paper:
FAQ
Q: Why use Rollendurchmesserzeitsammler instead of a snapshot-at-the-beginning collector using copy-on-write in virtual memory (SATBCUCOWIVM)?
A: This is a common question that probably comes up since a SATBCUCOWIVM collector is simple to understand (just fork the process!), and that a SATBCUCOWIVM collector is quite similar to Rollendurchmesserzeitsammler. Although a SATBCUCOWIVM collector is likely to use much less CPU on average, worst-case for a SATBCUCOWIVM collector is worse than Rollendurchmesserzeitsammler (most obviously because of the fragmented copying scheme, but also because of VM overhead). Furthermore, a SATBCUCOWIVM collector has no guarantee about when or how often worst-case is going to happen, while Rollendurchmesserzeitsammler behaves predictably since it has a tweakable variable for setting how often worst-case is going to happen. In addition, it is uncertain how well the OS's support for copy-on-write is going to work in a hard realtime setting.
Q: Why not use the <X> realtime garbage collectors instead to avoid the memory limitiation of Rollendurchmesserzeitsammler? (The <X> collector requires a read barrier or a write barrier (which are not hardware supported).)
A:
- One advantage of Rollendurchmesserzeitsammler compared to most realtime garbage collectors, is that it simulates worst case and carefully measures the time it takes to simulate worst-case, making sure (using buzy-looping) that it will always spend exactly the same amount of time every time. (+/- some microseconds) Experiencing worst-case very often is necessary when doing experimental audio work so that the human can have a fairly accurate feeling about the limitations of the system. In other words, just guaranteeing that there is a worst-case is not enough.
- Rollendurchmesserzeitsammler is tuned to work with accuracy and deadlines in the range of 1/100 of a millisecond. This accuracy can be necessary for realtime audio work when working in conjunction with other audio software, but it is not common for other kinds of realtime usage. The more advanced the code running in the audio thread is, the harder it is to maintain that kind of accuracy. Avoiding read barriers or write barriers is one way to make the code simpler.
- Programs using the garbage collector becomes simpler when not having to add read or write barriers into the source code.
- Using Rollendurchmesserzeitsammler in already existing languages can be quite simple if the language only uses a normal conservative collector, or a straight-forward mark-and-sweep collector.
- For audio work, we want to avoid, almost at any cost, code that can interfere with the inner DSP loops, because that's where almost all of the CPU is normally spent. Especially (unoptimized) read barriers are likely to interfere with inner DSP loops. (However, the ICMC2009 paper above contains the description of a garbage collector which is quite similar to Rollendurchmesserzeitsammler, but using a write barrier to avoid limiting the amount of memory. The write barrier used in that paper is made in such a way that it is unlikely to occur inside inner DSP loops.)
Q: Doesn't taking full snapshots use a significant amount of CPU?
A: Depends on the size of the heap, but sometimes: Yes. However, we have about 20% CPU available within a sound period, which can probably be used occasionally without increasing the chance of glitches in sound or lower interactivity. In other words, if it takes less than 1/5 of the time of a sound period to take a full snapshot, the CPU spent on taking snapshots should not be noticed by the user. (More about this in the ICMC2009 paper.)

Last updated Fri Mar 18 11:26:23 CET 2011

API

/* CURRENT API
 *
 * (Look at the example below for usage of functions not documented here!)
*/


void tar_init(int atomic_heap_size,
              int heap_size,
              int max_mem_size,
              int audio_priority_setting,
              float time_between_full_snapshot);

tar_heap_t *tar_create_heap(void);
void tar_delete_heap(tar_heap_t *heap,bool audio_is_running);

/* 'tar_alloc' returns a memory block which can be used for storing
 * any kind of data.
 *
 * The returned memory only contains zeros.
 */
void *tar_alloc(tar_heap_t *heap,size_t size);

/* 'tar_alloc_atomic' returns a memory block which can be used for storing
 * any kind of data, except pointers to other memory blocks allocated by either
 * 'tar_alloc' or 'tar_alloc_atomic'.
 * 
 * It's a huge optimization to use 'tar_alloc_atomic' instead of
 * 'tar_alloc' whenever possible.
 *
 * The returned memory only contains zeros. (Unlike the atomic function
 * in BDW-GC!)
 */
void *tar_alloc_atomic(tar_heap_t *heap,size_t size);

size_t tar_get_used_mem(tar_heap_t *heap);

size_t tar_get_used_atomic_mem(tar_heap_t *heap);

void tar_init_block(float duration);

void tar_before_using_heap(tar_heap_t *heap);
bool tar_after_using_heap(tar_heap_t *heap);

void tar_add_root(tar_heap_t *heap,void *start,void *end);
void tar_add_root_concurrently(tar_heap_t *heap,void *start,void *end);

void tar_start_gc(tar_heap_t *heap);

/* If having a global variable in a dynamic library pointed to by 'pointer',
 * 'tar_get_dynamic_roots_for' will find the range of global variables in that library.
 *
 * Afterwards, it makes sense to use those values
 * with the function 'tar_add_root_concurrenty'.
 *
 * The function returns 0 if the dynamic root could not be found.
 *
 * (This function is implemented using code from BDW-GC)
 */
int tar_get_dynamic_roots_for(char *pointer,char **start,char **end);

Example

/* EXAMPLE OF USAGE */

#include <rollendurchmesserzeitsammler.h>

tar_heap_t *heap;

int audio_thread(int blocksize){

  int i;
  static void *root[50];


  /* 'tar_init_block' must be called before using any heap in the current block.        
   *
   * The function takes as argument the time (in seconds) since last call to tar_init_block.
   * Note that this number is only used to calculate when to do the next full snapshot,
   * so it should usually be more than good enough just to return the average time between each
   * call to 'tar_init_block'.
   */
  tar_init_block((float)blocksize/44100.0f);


  /* 'tar_before_using_heap' must be called before using a specific heap in the current block.
   *
   *  (In case a snapshot of the heap was scheduled to start at the end of the previous
   *  block, 'tar_before_using_heap' waits until the snapshot is finished.)
   */
  tar_before_using_heap(heap);


  /* Allocate and use memory */
  for(i=0;i<blocksize;i++){
    root[x]=tar_alloc(heap,20);
  }


  /* 'tar_after_using_heap' must be called when the heap is no longer used
   * in the current block.
   *
   * If 'tar_after_using_heap' returns true, roots must be added before calling 'tar_run_gc':
   */
  if(tar_after_using_heap(heap)){

     /* Roots must be added before calling tar_run_gc
      *
      * We can use the concurrent function ('tar_add_root_concurrently') since the 'root' variable
      * is static and can't disappear after calling 'tar_add_root_concurrently'. (at least
      * as long as the C compiler is not doing anything really weird, but it should be safe to assume
      * that it doesn't.)
      *
      * The advantage of using the concurrent function is that the values
      * will be scheduled to be copied in a parallel thread, similarly
      * to how snapshots are performed.
      */
    tar_add_root_concurrently(heap,&root[0],&root[50]); /* [2] */

     /* Any value stored in CPU registers or stack are likely
      * to disappear though, so for such variables, the function 'tar_add_root'
      * must be used instead [1]: (Look at the file 'stalinwrapper.c' for a simple example
      * on how to get hold of stack and register values, and BDW-GC for a more thorough
      * way to get hold of stack and register values.)
      */
    tar_add_root(heap,registers_start,registers_end); /* [2] */
    tar_add_root(heap,stack_start,stack_end);

     /* After all roots have been added, 'tar_run_gc' must be called:
      */
    tar_run_gc(heap);
  }


  return 0;
}


int main(){
   tar_init(10*1024*1024, /* Size of the atomic memory heap. (Can be large). All heaps share the same atomic memory heap. */
            1024*1024,    /* Size of non-atomic memory heap per garbage collector heap (Should be as small as possible)  */
            1024*64       /* Maximum size of memory block to allocate. (Should be as small as possible)                  */
            audio_prior   /* Priority of the audio thread. Using jack, this would be jack_client_real_time_priority(client) */
            1.0f          /* Time (in seconds) between doing full snapshots. 1 second should work fine.                  */
    );
    heap=tar_create_heap(); /* Creates a new heap contaning non-atomic memory. */
    start_audio_thread(audio_thread);
    wait_forever();
    tar_delete(heap);
    return 0;
}

/*
[1] Adding registers and stack to the roots is unnecessary in this example. It is just
    to point out that it can be done, infact it is very often necessary.
    But in this specific example, there can't be pointers we need in the registers or stack
    since this function is exited after calling tar_run_gc() and all data stored
    in the registers and stack are lost anyway.

[2] The order of the two last arguments for tar_add_root_concurrently and tar_add_root does not matter.
*/