The purpose of CUDA applications is to speed up computation on GPUs. Writing CUDA code therefore naturally involves timing the application to measure its speedup over the CPU counterpart.

To profile a CUDA application, you can either use tools such as NVIDIA Nsight and the Visual Profiler, or you can use timing functions; in this article we will take the latter approach. CUDA runtime API calls and kernel launches can be timed accurately using the CUDA events available in the toolkit. For anything other than device code, we use host timers. The following example demonstrates how CUDA runtime calls can be timed using events.

cudaEvent_t start, end;

cudaEventCreate(&start);
cudaEventCreate(&end);

float milliseconds = 0.0f;

/* Record the start event on the default stream */
cudaEventRecord(start);

/* Kernel launch or memory copy to be timed */

/* Record the end event, then wait for it to complete */
cudaEventRecord(end);
cudaEventSynchronize(end);

cudaEventElapsedTime(&milliseconds, start, end);

fprintf(stderr, "Elapsed Time = %f milliseconds\n", milliseconds);

cudaEventDestroy(start);
cudaEventDestroy(end);

The code above profiles calls issued to the default CUDA stream. To measure the elapsed time of asynchronous work on any other stream, we have to record the events on that stream, like this:

cudaEventRecord(start, stream);
cudaEventRecord(end, stream);
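
Putting the pieces together, here is a minimal sketch of timing work issued to a user-created stream. The kernel myKernel, the buffer size N, and the launch configuration are placeholder assumptions for illustration, not part of the original example.

#include <stdio.h>
#include <cuda_runtime.h>

/* Placeholder kernel; the real application's kernel would go here */
__global__ void myKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;
}

int main(void)
{
    const int N = 1 << 20;
    size_t bytes = N * sizeof(float);

    float *h_data, *d_data;
    cudaMallocHost((void**)&h_data, bytes);  /* pinned memory so the copy can run asynchronously */
    cudaMalloc((void**)&d_data, bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    cudaEvent_t start, end;
    cudaEventCreate(&start);
    cudaEventCreate(&end);

    float milliseconds = 0.0f;

    /* Record the start event on the stream being timed */
    cudaEventRecord(start, stream);

    /* Asynchronous work issued to the same stream */
    cudaMemcpyAsync(d_data, h_data, bytes, cudaMemcpyHostToDevice, stream);
    myKernel<<<(N + 255) / 256, 256, 0, stream>>>(d_data, N);

    /* Record the end event, then wait for it so the elapsed time is valid */
    cudaEventRecord(end, stream);
    cudaEventSynchronize(end);

    cudaEventElapsedTime(&milliseconds, start, end);
    fprintf(stderr, "Stream Time = %f milliseconds\n", milliseconds);

    cudaEventDestroy(start);
    cudaEventDestroy(end);
    cudaStreamDestroy(stream);
    cudaFree(d_data);
    cudaFreeHost(h_data);

    return 0;
}

Note that cudaEventSynchronize(end) blocks the host until all work recorded before the end event on that stream has finished, so the measured time covers both the asynchronous copy and the kernel.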

Timing Host Code

To measure the elapsed time of host code on Windows, we can use the QueryPerformanceCounter function from the Windows SDK.

#include <stdio.h>
#include <windows.h>

LARGE_INTEGER start, end, frequency;
double milliseconds;

/* frequency gives the ticks per second of the high-resolution counter */
QueryPerformanceFrequency(&frequency);
QueryPerformanceCounter(&start);

/* Code to be timed */

QueryPerformanceCounter(&end);

milliseconds = (end.QuadPart - start.QuadPart) *
               1000.0 / frequency.QuadPart;

fprintf(stderr, "Host Time = %f milliseconds\n", milliseconds);

On Linux, we can use the gettimeofday function to measure elapsed time.

#include <stdio.h>
#include <sys/time.h>

struct timeval startTime;
struct timeval endTime;

gettimeofday(&startTime, NULL);

/* Code to be timed */

gettimeofday(&endTime, NULL);

long seconds, useconds;
double milliseconds;

seconds  = endTime.tv_sec  - startTime.tv_sec;
useconds = endTime.tv_usec - startTime.tv_usec;

/* Combine the seconds and microseconds components into milliseconds */
milliseconds = (seconds + useconds / 1000000.0) * 1000.0;

fprintf(stderr, "Host Time = %f milliseconds\n", milliseconds);

These routines serve best when you don't have enough time to profile the whole application with visual tools. They give you quick and accurate timings of any device or host call you wish to profile.