This post covers benchmarking in Julia through a specific case: comparing the functions that the CSV and CSVFiles packages provide for writing a CSV file.
Packages and Versioning
In this use case, I am using Julia v1.5.3 with the following packages:
using CSV # v0.8.2
using CSVFiles # v1.0.0
using DataFrames # v0.22.4
using BenchmarkTools # v0.5.0
Please refer to each package's documentation for more details: CSV, CSVFiles, DataFrames, BenchmarkTools.
Setting up the benchmark
In each of the benchmark methods below, I call the function that each package provides for writing a CSV file: CSV.write() and CSVFiles.save().
Although applied here to a specific use case, the process lets us
measure the run time and memory allocation of any expression.
The first run will typically take longer than subsequent runs. This is most likely
overhead from the initial compilation of a function or expression.
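The first-run overhead is easy to observe on a toy function. The sketch below is only an illustration: sum_of_squares is a hypothetical function that is not part of the benchmark in this post, and the timings will differ from machine to machine.

```julia
# Hypothetical toy function, used only to illustrate first-call overhead.
sum_of_squares(n) = sum(i^2 for i in 1:n)

# First call: the measurement includes the time Julia spends compiling the method.
t_first = @elapsed sum_of_squares(1_000)

# Second call: the already compiled method is reused.
t_second = @elapsed sum_of_squares(1_000)

println("first call:  ", t_first, " seconds")
println("second call: ", t_second, " seconds")
```

On most machines the first number should be noticeably larger than the second, which is why the first run is usually discarded or reported separately.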
Additional References for Performance Tips:
https://docs.julialang.org/en/v1/manual/performance-tips/
Example Data
The code below just sets up the directory and file information. This is purely administrative preparation for the subsequent sections.
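# Define the base directory to work out of.
# You can always make this the active directory
# by using:
# julia> Base.cd(base_dir)
base_dir = "/Some/Dir/Project"
base_dir_crime = "$base_dir/data/crime"
# Define the input filename and check its size in bytes
filename_in = "$base_dir_crime/crime_table_CY2009.csv"
filesize(filename_in)
# Read the input file into a DataFrame
df = CSV.read(filename_in, DataFrame)
# Define the output filename
# NOTE: dir_destination was never defined in this post;
# pointing it at the input directory is an assumption.
dir_destination = base_dir_crime
filename_out = "$dir_destination/crime_table_CY2009_write_bm.csv"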
The selected input file contains 31248 rows and 23 columns of various data types. You can get more information on the data in my post on the crime data. The file is 8.37 MB (8370840 bytes). Ideally, my output file should be the same size or close to it.
Method 1: @time
The @time macro is available in Julia's Base module. It prints the time the expression took to execute, the number of allocations, and the total number of bytes allocated during execution.
It is advised to run this at least twice. The first run may include compilation time in its results. If you run each expression multiple times, you can see a distribution of results.
Here we see the CSVFiles.save() function. I ran it four times to compare the first run with the second through fourth runs. The difference is significant: 14.77 seconds on the first run versus roughly 0.8 seconds afterwards. You should consider factoring this into your benchmarking process.
@time CSVFiles.save(filename_out, df)
"""
14.774528 seconds (22.77 M allocations: 1.116 GiB, 6.86% gc time)
0.831113 seconds (750.29 k allocations: 77.738 MiB, 3.37% gc time)
0.806468 seconds (750.29 k allocations: 77.738 MiB)
0.820039 seconds (750.29 k allocations: 77.738 MiB)
"""
filesize(filename_out)
# 9462617 (9.46MB)
Here we see the CSV.write() function.
@time CSV.write(filename_out, df)
"""
2.446969 seconds (6.29 M allocations: 239.346 MiB, 3.22% gc time)
0.635727 seconds (3.43 M allocations: 101.568 MiB, 20.35% gc time)
0.490933 seconds (3.43 M allocations: 101.568 MiB, 5.31% gc time)
0.559759 seconds (3.43 M allocations: 101.568 MiB, 5.00% gc time)
"""
filesize(filename_out)
# 8431379 (8.43MB)
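As we can see, the pattern is similar: the initial run is the slowest (2.45 seconds here versus 14.77 seconds for CSVFiles.save()), and subsequent runs are more consistent. The write times are faster (roughly 0.49 to 0.64 seconds versus 0.81 to 0.83 seconds), but CSV.write() allocates more memory (101.568 MiB vs 77.738 MiB).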
The output file size is another measure by which we can assess the two packages. CSV.write() outputs a file the same size as my original output from R's readr::write_csv(), 8.4MB, whereas CSVFiles.save() creates a file that is 9.4MB. That is a 12.2% increase. From a comparison standpoint with R's functionality, I would prefer to see similar numbers when looking at comparable functions in Julia. The matching sizes hint that the functions operate somewhat similarly at a high level, but a thorough and fair assessment would require digging in more.
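The 12.2% figure follows directly from the two filesize() results reported above. As a quick check, using only those byte counts:

```julia
size_csv      = 8_431_379   # bytes written by CSV.write()
size_csvfiles = 9_462_617   # bytes written by CSVFiles.save()

# Relative increase of the CSVFiles output over the CSV output, in percent.
pct_increase = (size_csvfiles - size_csv) / size_csv * 100

println(round(pct_increase, digits = 1), "%")   # prints 12.2%
```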
Method 2: @timev
The @timev macro is also available in Julia's Base module. It provides more verbose output than @time in Method 1 above, which gives more context. We don't need to run these multiple times since the functions have already been compiled.
In this code block I’m using the CSVFiles.save() function.
@timev CSVFiles.save(filename_out, df)
"""
1.037027 seconds (750.29 k allocations: 77.738 MiB, 4.48% gc time)
elapsed time (ns): 1037027394
gc time (ns): 46481980
bytes allocated: 81513984
pool allocs: 750290
GC pauses: 1
"""
Here we see the CSV.write() function. Again, it is much faster than CSVFiles.save() but requires more memory and system resources.
@timev CSV.write(filename_out, df)
"""
0.635651 seconds (3.43 M allocations: 101.568 MiB, 15.83% gc time)
elapsed time (ns): 635650783
gc time (ns): 100641690
bytes allocated: 106501648
pool allocs: 3425475
malloc() calls: 1
GC pauses: 2
"""
Between these two runs, CSV.write() took 38.7% less time to complete the task. From a resource perspective, CSVFiles.save() required less: 23.5% fewer bytes allocated and 78.1% fewer memory allocations.
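These percentages can be reproduced from the raw @timev figures. The helper pct_less below is a hypothetical name introduced just for this check:

```julia
# Percentage by which `smaller` is below `larger`.
pct_less(smaller, larger) = (larger - smaller) / larger * 100

# Figures copied from the two @timev outputs above.
time_saved   = pct_less(635_650_783, 1_037_027_394)   # elapsed time (ns)
bytes_saved  = pct_less(81_513_984, 106_501_648)      # bytes allocated
allocs_saved = pct_less(750_290, 3_425_475)           # pool allocs

println(round(time_saved, digits = 1))     # 38.7
println(round(bytes_saved, digits = 1))    # 23.5
println(round(allocs_saved, digits = 1))   # 78.1
```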
Method 3: @timed
The @timed macro is available in Julia's Base module. It returns the following results as an object, so they can be stored and inspected:
- value of the expression
- elapsed time
- total bytes allocated
- garbage collection time
- object with various memory allocation counters
Since the result values are the same as in the @time and @timev implementations above, I will only display one implementation in this section.
time_stat = @timed CSV.write(filename_out, df)
time_stat.value # value of the expression
time_stat.time # elapsed time
time_stat.bytes # total bytes allocated
time_stat.gctime # garbage collection time
time_stat.gcstats # object with various memory allocation counters
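Because the measurements come back as ordinary values, they can be post-processed. The sketch below assumes Julia 1.5 or later (where @timed returns a named tuple, as used above) and times a cheap Base-only expression as a stand-in for the CSV call:

```julia
# Time a cheap expression and keep the measurements.
stats = @timed sum(1:1_000_000)

println("value:        ", stats.value)          # result of the expression
println("seconds:      ", stats.time)           # elapsed time in seconds
println("MiB:          ", stats.bytes / 2^20)   # bytes allocated, converted to MiB
println("gc seconds:   ", stats.gctime)         # time spent in garbage collection
```

The same field names apply to the time_stat object above, so time_stat.bytes / 2^20 would give the allocation in MiB for the CSV.write() call.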
Method 4: @elapsed
The @elapsed macro is available in Julia's Base module. The macro returns only the number of seconds it took to execute an expression.
@elapsed CSVFiles.save(filename_out, df)
# 0.850206273 seconds
@elapsed CSV.write(filename_out, df)
# 0.6364408 seconds
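Since @elapsed returns a plain number, it is convenient for collecting several runs and looking at their spread. A minimal sketch, where workload is a hypothetical stand-in for the expression you want to measure and five runs is an arbitrary choice:

```julia
using Statistics  # standard library, provides median

# Hypothetical stand-in workload; swap in the expression you want to measure.
workload() = sum(sqrt(i) for i in 1:100_000)

workload()  # warm-up call so compilation time is not counted

# Collect five timings, in seconds.
timings = [@elapsed workload() for _ in 1:5]

println("minimum: ", minimum(timings))
println("median:  ", median(timings))
println("maximum: ", maximum(timings))
```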
Method 5: @benchmark
The @benchmark macro is available in the BenchmarkTools package. The macro runs a series of samples and outputs some basic statistical results.
Here we see the CSVFiles.save() function. In this instance the macro ran 6 samples, with times ranging from 776.531 ms to 1.007 s.
@benchmark CSVFiles.save(filename_out, df)
"""
BenchmarkTools.Trial:
memory estimate: 77.74 MiB
allocs estimate: 750290
--------------
minimum time: 776.531 ms (1.39% GC)
median time: 832.033 ms (2.11% GC)
mean time: 868.346 ms (2.50% GC)
maximum time: 1.007 s (4.85% GC)
--------------
samples: 6
evals/sample: 1
"""
Here we see the CSV.write() function. In this instance the macro ran 10 samples, with times ranging from 479.760 ms to 718.754 ms.
@benchmark CSV.write(filename_out, df)
"""
BenchmarkTools.Trial:
memory estimate: 101.57 MiB
allocs estimate: 3425476
--------------
minimum time: 479.760 ms (6.32% GC)
median time: 504.789 ms (6.01% GC)
mean time: 551.279 ms (5.28% GC)
maximum time: 718.754 ms (2.57% GC)
--------------
samples: 10
evals/sample: 1
"""
@btime
If you just want the minimum time, you can run the @btime macro from the BenchmarkTools package, which is an alternative to the @time macro from Base.
@btime CSVFiles.save(filename_out, df)
# 802.068 ms (750290 allocations: 77.74 MiB)
@btime CSV.write(filename_out, df)
# 388.340 ms (3425476 allocations: 101.57 MiB)
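The BenchmarkTools manual recommends interpolating global variables into the benchmarked expression with $, so that the cost of accessing an untyped global is not part of the measurement. This matters most for fast expressions. A minimal sketch on a stand-in workload follows; the vector xs and the samples and seconds values are assumptions chosen for illustration, not settings used for the results in this post.

```julia
using BenchmarkTools

xs = rand(1_000)  # hypothetical stand-in data

# `$xs` interpolates the global so it is treated like a local value.
# `samples` caps the number of samples; `seconds` caps the time budget.
trial = @benchmark sum($xs) samples=20 seconds=1

# The raw per-sample times (in nanoseconds) are stored on the trial object.
println("samples collected: ", length(trial.times))
println("minimum time (ns): ", minimum(trial.times))
println("memory estimate (bytes): ", trial.memory)
```

Applied to this post's case, the equivalent call would presumably be @benchmark CSV.write($filename_out, $df), although for an operation that takes about half a second the difference is likely to be small.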
Conclusion
Based on the repeated results, I think CSV.write() is the better function of the two for the current operation, since it was consistently faster. If you have a concern about memory allocation and system resources, CSVFiles.save() might be a better choice.
Compared with the readr::write_csv() function in R, I prefer CSV.write() because its output file is closer in size.
I will use the CSV package going forward until I see different results with CSVFiles for writing data.
Also, going through this example should give you some tools to benchmark your own code in Julia. There is a lot more you can do; I have only scratched the surface.