Benchmarking CSV vs CSVFiles packages: Write

This post covers benchmarking in Julia through a specific case: evaluating the functions that the CSV and CSVFiles packages provide for writing a CSV file.

Packages and Versioning

In this use case, I am using Julia v1.5.3 with the following packages:

using CSV # v0.8.2
using CSVFiles # v1.0.0
using DataFrames # v0.22.4
using BenchmarkTools # v0.5.0

Please refer to each package's documentation for more details: CSV, CSVFiles, DataFrames, BenchmarkTools.

Setting up the benchmark

In each of the benchmark methods below, I use the respective function from the CSV and CSVFiles packages to save/write a CSV file.

Although it is applied here to a specific use case, the process allows us to
measure the performance and memory allocation of any expression.

The first run will typically take longer than subsequent runs. This is largely
overhead from Julia compiling a function or expression the first time it is called.
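
To see this compile overhead in isolation, here is a minimal sketch using a small made-up function. The name sum_of_squares is a hypothetical stand-in and not part of CSV or CSVFiles. The first call includes compilation, while the second call should reuse the already compiled code, so it will usually be faster.

```julia
# Hypothetical stand-in workload; not part of CSV or CSVFiles.
sum_of_squares(v) = sum(x -> x^2, v)

data = rand(1_000_000)

# First call: includes the time Julia spends compiling the method.
t_first = @elapsed sum_of_squares(data)

# Second call: the compiled method is reused.
t_second = @elapsed sum_of_squares(data)

println("first call:  ", t_first, " seconds")
println("second call: ", t_second, " seconds")
```

The exact numbers depend on your machine, so treat them as an illustration rather than a measurement.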

Additional References for Performance Tips:
https://docs.julialang.org/en/v1/manual/performance-tips/

Example Data

The code below just sets up the directory and file information. This is purely administrative preparation for the subsequent sections to use.

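# Define the base directory to work out of.
# You can always make this the active directory
# by using:
# julia> Base.cd(base_dir)
base_dir = "/Some/Dir/Project"
base_dir_crime = "$base_dir/data/crime"

# Define the input filename and check its size in bytes
filename_in = "$base_dir_crime/crime_table_CY2009.csv"
filesize(filename_in)
# 8370840 (8.37MB)

# Read the input file into a DataFrame
df = CSV.read(filename_in, DataFrame)

# Define the output filename
# (assumption: the destination is the same crime data directory)
dir_destination = base_dir_crime
filename_out = "$dir_destination/crime_table_CY2009_write_bm.csv"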

The selected input file contains 31248 rows and 23 columns of various data types. You can get more information on the data in my post on the crime data. The file is 8.37 MB (8370840 bytes). Ideally, my output file should be the same or close to this size.

Method 1: @time

The @time macro is available in Julia’s Base package. The output of @time will print the time it took to execute, the number of allocations, and the total number of bytes its execution caused to be allocated.

It is advised to run this at least twice. The first instance may have some compiling time that is part of the output results. If you run each instance multiple times you can see a distribution of results.

Here we see the CSVFiles.save() function. I ran it four times to compare the first run with the 2nd-4th runs. The first run took 14.77 seconds, while the later runs took about 0.8 seconds each, so the difference is significant. You should consider factoring this into your benchmarking process.

@time CSVFiles.save(filename_out, df)
 """
 14.774528 seconds (22.77 M allocations: 1.116 GiB, 6.86% gc time)
 0.831113 seconds (750.29 k allocations: 77.738 MiB, 3.37% gc time)
 0.806468 seconds (750.29 k allocations: 77.738 MiB)
 0.820039 seconds (750.29 k allocations: 77.738 MiB)
 """
filesize(filename_out)
# 9462617 (9.46MB)

Here we see the CSV.write() function.

@time CSV.write(filename_out, df)
 """
 2.446969 seconds (6.29 M allocations: 239.346 MiB, 3.22% gc time)
 0.635727 seconds (3.43 M allocations: 101.568 MiB, 20.35% gc time)
 0.490933 seconds (3.43 M allocations: 101.568 MiB, 5.31% gc time)
 0.559759 seconds (3.43 M allocations: 101.568 MiB, 5.00% gc time)
 """
filesize(filename_out)
# 8431379 (8.43MB)

As with CSVFiles.save(), the initial run is much slower (2.45 seconds) and the subsequent runs produce more consistent results (about 0.49-0.64 seconds). The write times are faster than those of CSVFiles.save(), but CSV.write() uses more memory (101MB vs 77MB).

File output size is another measure by which we can assess the two packages. CSV.write() outputs a file (8.43MB) that is about the same size as my original output from R's readr::write_csv(), 8.4MB, whereas CSVFiles.save() creates a file that is 9.46MB. That is a 12.2% increase over the CSV.write() output. Since I am comparing against R's functionality, I would prefer to see similar numbers from comparable functions in Julia. The matching size hints that CSV.write() and readr::write_csv() operate somewhat similarly at a high level, but it would require more digging to make a thorough and fair assessment.
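
As a quick check on that percentage, here is a small sketch of the arithmetic. The variable names are my own, and the byte counts are copied from the filesize() results above.

```julia
# File sizes in bytes, copied from the filesize() results above.
size_csv = 8_431_379       # written by CSV.write()
size_csvfiles = 9_462_617  # written by CSVFiles.save()

# Relative increase of the CSVFiles output over the CSV output, in percent.
pct_increase = 100 * (size_csvfiles - size_csv) / size_csv

println(round(pct_increase, digits = 1), "% larger")
# prints: 12.2% larger
```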

Method 2: @timev

The @timev macro is available in Julia's Base package. It provides a more verbose response than @time in Method 1 above. We don't need to run these multiple times since the functions have already been compiled. The output provides more context.

In this code block I’m using the CSVFiles.save() function.

@timev CSVFiles.save(filename_out, df)
 """
 1.037027 seconds (750.29 k allocations: 77.738 MiB, 4.48% gc time)
 elapsed time (ns): 1037027394
 gc time (ns):      46481980
 bytes allocated:   81513984
 pool allocs:       750290
 GC pauses:         1
 """

Here we see the CSV.write() function. Again, it is much faster than CSVFiles.save() but requires more memory and system resources.

@timev CSV.write(filename_out, df)
 """
 0.635651 seconds (3.43 M allocations: 101.568 MiB, 15.83% gc time)
 elapsed time (ns): 635650783
 gc time (ns):      100641690
 bytes allocated:   106501648
 pool allocs:       3425475
 malloc() calls:    1
 GC pauses:         2
 """

Between these two runs, we observed that CSV.write() took 38.7% less time to complete the task. From a resource perspective, CSVFiles.save() required less, using 23.5% fewer bytes and 78.1% fewer memory allocations.
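
For anyone who wants to reproduce those percentages, the sketch below recomputes them from the raw @timev numbers. The helper pct_less is a name I made up for this illustration, and the values are copied from the two outputs above.

```julia
# Percentage by which `smaller` is below `larger`.
pct_less(smaller, larger) = 100 * (larger - smaller) / larger

# Values copied from the two @timev outputs above.
time_csvfiles, time_csv = 1_037_027_394, 635_650_783  # elapsed time (ns)
bytes_csvfiles, bytes_csv = 81_513_984, 106_501_648   # bytes allocated
allocs_csvfiles, allocs_csv = 750_290, 3_425_475      # pool allocs

println("time:   ", round(pct_less(time_csv, time_csvfiles), digits = 1), "% less for CSV.write()")
println("bytes:  ", round(pct_less(bytes_csvfiles, bytes_csv), digits = 1), "% less for CSVFiles.save()")
println("allocs: ", round(pct_less(allocs_csvfiles, allocs_csv), digits = 1), "% less for CSVFiles.save()")

# Bytes to MiB, matching the summary line that @timev prints.
println(round(bytes_csv / 2^20, digits = 3), " MiB")
```

The last line shows where the 101.568 MiB figure comes from: the byte count divided by 2^20.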

Method 3: @timed

The @timed macro is available in Julia's Base package. It returns the following results in an object:

value of the expression
elapsed time
total bytes allocated
garbage collection time
object with various memory allocation counters

Since the result values are the same as in the above implementations of @time and @timev, and the process is the same for both functions, I will only provide one example in this section.

time_stat = @timed CSV.write(filename_out, df)

time_stat.value # value of the expression
time_stat.time # elapsed time
time_stat.bytes # total bytes allocated
time_stat.gctime # garbage collection time
time_stat.gcstats # object with various memory allocation counters

Method 4: @elapsed

The @elapsed macro is available in Julia's Base package. The macro returns only the number of seconds it took to execute an expression.

@elapsed CSVFiles.save(filename_out, df)
# 0.850206273 seconds

@elapsed CSV.write(filename_out, df)
# 0.6364408 seconds
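
Because @elapsed returns a plain number, it is easy to collect several runs and look at their spread. The sketch below does this with a hypothetical stand-in function, write_lines, which is my own placeholder and not part of CSV or CSVFiles. It writes to a temporary file so it runs without the crime data.

```julia
using Statistics  # standard library, provides median

# Hypothetical stand-in workload: write some lines to a file.
function write_lines(path, n)
    open(path, "w") do io
        for i in 1:n
            println(io, i, ",", i^2)
        end
    end
end

path = tempname()

# Collect five timings; the first one includes compilation.
timings = [@elapsed(write_lines(path, 10_000)) for _ in 1:5]

# Summarize only the warmed-up runs.
warm = timings[2:end]
println("first run: ", timings[1], " seconds")
println("warm runs: min = ", minimum(warm),
        ", median = ", median(warm),
        ", max = ", maximum(warm))

rm(path)  # clean up the temporary file
```

Swapping write_lines for CSV.write or CSVFiles.save should work the same way, assuming filename_out and df are defined as in the setup above.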

Method 5: @benchmark

The @benchmark macro is available in the BenchmarkTools package. The macro runs a series of samples and outputs some basic statistical results.

Here we see the CSVFiles.save() function. In this instance the macro ran 6 samples with a spread of 776.531 ms – 1.007 sec.

@benchmark CSVFiles.save(filename_out, df)
"""
BenchmarkTools.Trial: 
  memory estimate:  77.74 MiB
  allocs estimate:  750290
  --------------
  minimum time:     776.531 ms (1.39% GC)
  median time:      832.033 ms (2.11% GC)
  mean time:        868.346 ms (2.50% GC)
  maximum time:     1.007 s (4.85% GC)
  --------------
  samples:          6
  evals/sample:     1
"""

Here we see the CSV.write() function. In this instance the macro ran 10 samples with a spread of 479.760 – 718.754 ms.

@benchmark CSV.write(filename_out, df)
"""
BenchmarkTools.Trial: 
  memory estimate:  101.57 MiB
  allocs estimate:  3425476
  --------------
  minimum time:     479.760 ms (6.32% GC)
  median time:      504.789 ms (6.01% GC)
  mean time:        551.279 ms (5.28% GC)
  maximum time:     718.754 ms (2.57% GC)
  --------------
  samples:          10
  evals/sample:     1
"""
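
If you want to work with these statistics in code rather than read them off the screen, the trial object can be queried directly. The sketch below is a minimal example under a few assumptions: write_rows is a hypothetical stand-in workload, and the samples and seconds values are arbitrary choices of mine. It also interpolates the global variables with $, which the BenchmarkTools manual recommends so that global variable access does not distort the timings.

```julia
using BenchmarkTools

# Hypothetical stand-in workload; not part of CSV or CSVFiles.
function write_rows(path, rows)
    open(path, "w") do io
        for r in rows
            println(io, join(r, ','))
        end
    end
end

rows = [(i, i^2, "row$i") for i in 1:1_000]
path = tempname()

# `$` interpolates the globals into the benchmarked expression.
# `samples` and `seconds` cap the number of samples and the time budget.
trial = @benchmark write_rows($path, $rows) samples=20 seconds=2

est = median(trial)  # a TrialEstimate
println("median time (ms): ", time(est) / 1e6)  # time() is in nanoseconds
println("memory (bytes):   ", memory(est))
println("allocations:      ", allocs(est))
println("samples taken:    ", length(trial.times))

rm(path)  # clean up the temporary file
```

The same pattern should apply to the calls in this post, for example @benchmark CSV.write($filename_out, $df), though I did not rerun my results that way.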

@btime

If you just want to get the minimum time, you can run the @btime macro from the BenchmarkTools package, which is an alternative to the @time macro from the Base package.

@btime CSVFiles.save(filename_out, df)
# 802.068 ms (750290 allocations: 77.74 MiB)

@btime CSV.write(filename_out, df)
# 388.340 ms (3425476 allocations: 101.57 MiB)

Conclusion

Based on the repeated results, I think CSV.write() is the better function of the two for the current operation, since it was consistently faster. If you have a concern about memory allocation and system resources, CSVFiles.save() might be a better choice.

Compared with the readr::write_csv() function in R, I also prefer how close the output file size of CSV.write() is.

I will use the CSV package going forward until I see different results with CSVFiles for writing data.

Also, going through this example should give you some tools to benchmark your code in Julia. There is a lot more you can do, and I only scratched the surface.