
1 Learning Outcomes

OpenMP stands for “Open Multi-Processing” and is an API[1] with language extensions for C, C++, and Fortran. It enables multi-threaded, shared-memory parallelism with the fork-join model.

OpenMP is a portable, standardized API supported by many compilers, including GCC. We can write parallel programs in high-level code and then compile them easily.

In C, OpenMP uses compiler directives called pragmas. Pragmas are a C preprocessor mechanism provided for language extension.[2] OpenMP uses this mechanism because a compiler that doesn’t recognize a pragma is supposed to ignore it, meaning that C code with embedded OpenMP pragmas can still feasibly compile and run on a sequential computer.

2 OpenMP Hello World

OpenMP C program: hello_world.c

#include <stdio.h>
#include <omp.h>
int main() {
  /* Fork team of threads with private variable tid */
  #pragma omp parallel
  {
    int tid = omp_get_thread_num(); /* get thread id */
    printf("Hello World from thread = %d\n", tid);
    /* Only main thread does this */
    if (tid == 0) {
      printf("Number of threads = %d\n",
             omp_get_num_threads());
    }
  } /* All threads join main and terminate */
  return 0;
}

Output:

$ ./hello_world
Hello World from thread = 0
Number of threads = 12
Hello World from thread = 2
Hello World from thread = 7
Hello World from thread = 1
Hello World from thread = 9
Hello World from thread = 5
Hello World from thread = 8
Hello World from thread = 3
Hello World from thread = 11
Hello World from thread = 10
Hello World from thread = 4
Hello World from thread = 6

In the above code, the main thread forks into a team of parallel subthreads. Each subthread executes the parallel region (delineated by the parallel directive) concurrently, before control returns to the main thread.

3 OpenMP Constructs

3.1 Parallel region

A parallel region is a code section executed in parallel and is delineated by the parallel construct. In the above code, the parallel region spans Lines 6 to 14 of hello_world.c, i.e., the braces that follow the #pragma omp parallel directive.

3.2 OpenMP Thread

OpenMP creates as many software threads as specified in the environment variable OMP_NUM_THREADS. During execution, each OpenMP (software) thread is multiplexed onto the available hardware threads.

To specify the number of threads, use omp_set_num_threads(x); outside the parallel region. Otherwise, by default, the number of OpenMP threads is set to the maximum number of hardware threads on the machine. We saw earlier that the course hive machines have 12 hardware threads.

Table 1: OpenMP Software Threads

OpenMP Intrinsic                    Description
omp_set_num_threads(x);             Set the number of threads to x.
num_th = omp_get_num_threads();     Get the number of threads.
tid = omp_get_thread_num();         Get the thread ID number.

It is certainly possible to specify more OpenMP threads than there are hardware threads. There are likely other concurrent tasks running on the same machine, and if your OpenMP parallel region has significant I/O or memory access, some multiplexing is inevitable. Be wary of too much context switching on a shared machine: during peak times, timing OpenMP programs may inadvertently measure the shared machine’s workload, and not the benchmark target. We will read more about OpenMP timing when we build our OpenMP DGEMM benchmark.

3.3 Shared and Private Variables

OpenMP has both shared and private variables.

  int var1, var2;
  char *var3 = malloc(…);
  #pragma omp parallel private(var2)
  {
    int var4;
    // var1 shared (default)
    // var2 private
    // var3 shared (heap)
    // var4 private (thread’s stack)
    …
  }

4 OpenMP Worksharing with for

In our hello world example, the parallel region work was replicated across all OpenMP threads. In practice, we may want to write multi-threaded programs for worksharing, where we split/partition and distribute the work across OpenMP threads.

The OpenMP for construct can be associated with a for loop within a parallel region. Given this construct, the run-time system determines which chunk of loop iterations to assign to each thread. At a high level, the code

#include <omp.h>

omp_set_num_threads(4);
#pragma omp parallel for
for (int i=0; i<100; i++) { 
    ...
}

will fork four subthreads; under the typical default (static) schedule, each executes one contiguous chunk of iterations:

  1. for (int i=0; i<25; i++) { … }

  2. for (int i=25; i<50; i++) { … }

  3. for (int i=50; i<75; i++) { … }

  4. for (int i=75; i<100; i++) { … }

The above example of the #pragma omp parallel for directive is sufficient for most workloads in this class. There are actually two directives at play: parallel, which forks a team of threads, and for, which distributes the loop iterations among that team. If a parallel region is one giant for loop, we can combine the two into a single #pragma omp parallel for. For those curious, check out the related example.

5 Beyond OpenMP

There is no universal solution to parallel programming. OpenMP assumes a fork-join model, though different models are needed for different applications. Table 2 lists the benefits of the OpenMP parallel programming model.

Table 2: Thread-level parallelism with OpenMP: pros and cons.

Assumption: Threads are an explicit programming model, with full programmer control over parallelization.
  Pros:
  - Compiler directives are simple and easy to use.
  - Legacy serial code does not need to be rewritten.
  Cons:
  - The compiler must support OpenMP (e.g., gcc 4.2 or later).
  - Amdahl’s law is gonna get you after not too many cores.

Assumption: Multiple threads operate in a shared memory environment.
  Pros:
  - Reduces memory requirements.
  - The programmer need not worry (that much) about data placement.
  Cons:
  - Code can only be run in shared memory environments.
  - Synchronizing use of shared resources is hard.

Parallel programming needs can be very problem-specific (scientific computing, machine learning, web servers, I/O-heavy applications, etc.). As a result, other models may be needed, e.g., message passing for process-level parallelism with concurrent independent tasks.

Footnotes
  1. Application Programming Interface

  2. Commonly implemented pragmas not covered in this course: structure packing, symbol aliasing, floating point exception modes.