This post covers benchmarking in Julia, using a specific case: evaluating the functions in the CSV and CSVFiles packages for reading a CSV file.
Packages and Versioning
In this use case, I am using Julia v1.5.3 with the following packages:
using CSV # v0.8.2
using CSVFiles # v1.0.0
using DataFrames # v0.22.4
using BenchmarkTools # v0.5.0
Please reference each package's documentation for more details: CSV, CSVFiles, DataFrames, BenchmarkTools.
Setting up the benchmark
In each of the example benchmark methods below, I use each package's respective function to load/read a CSV file. The CSV package has an optional argument that specifies whether to use a threaded process, as well as an argument to convert the data into a DataFrame. I am looking at these to see how much they actually save (resources and time).
Although applied to a specific use case, the process allows us to
measure the performance and memory allocation of expressions.
The first run will typically be slower than subsequent runs. This is likely
overhead caused by the initial compilation of a function or expression.
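This first-run compilation overhead is easy to see with a minimal sketch (the function and data here are just an illustration, not part of the CSV example):

```julia
# A fresh function: the first call triggers JIT compilation.
f(x) = sum(abs2, x)

data = rand(10^6)
t1 = @elapsed f(data)  # first call: includes compilation time
t2 = @elapsed f(data)  # second call: compiled code only
println("first run: $(t1)s, second run: $(t2)s")
```

On most machines the first run will be dramatically slower than the second, which is exactly the pattern you will see in the @time results below.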
Additional References for Performance Tips:
https://docs.julialang.org/en/v1/manual/performance-tips/
Example Data
The code below just shows the directory and file information. This is purely administrative preparation that the subsequent sections utilize.
# Define the base directory to work out of.
# You can always make this the active directory
# by using :
# julia> Base.cd(base_dir)
base_dir = "/Some/Dir/Project"
base_dir_crime = "$base_dir/data/crime"
filename = "$base_dir_crime/crime_table_CY2009.csv"
filesize(filename)
The selected file contains 31,248 rows and 23 columns of various datatypes. You can get more information on the data in my post on the crime data. The file is 8.37 MB (8370840 bytes).
Method 1: @time
The @time macro is available in Julia's Base module. The output of @time
will print the time it took to execute, the number of allocations, and the total number of bytes its execution caused to be allocated.
It is advised to run this at least twice, as the first instance may include compilation time in its results. If you run each instance multiple times, you can see a distribution of results.
Here we see the CSVFiles.load() function. I ran it four times to show the first run's time in comparison to the 2nd-4th runs. The difference is significant, and you should factor it into your benchmarking process.
@time load(filename) |> DataFrame
# 42.636511 seconds (34.63 M allocations: 1.653 GiB, 3.68% gc time)
# 0.711944 seconds (1.40 M allocations: 51.374 MiB, 77.01% gc time)
# 0.292204 seconds (1.40 M allocations: 51.374 MiB, 34.44% gc time)
# 0.275875 seconds (1.40 M allocations: 51.374 MiB, 30.81% gc time)
Here we see the CSV.read() function. It has an argument to direct the output to a DataFrame. As we can see, we get similar results on the initial run, but subsequent runs produce more consistent results. The read times are much faster and it uses less memory (19.5 MiB vs 51 MiB).
@time CSV.read(filename, DataFrame)
# 37.971290 seconds (32.19 M allocations: 1.392 GiB, 3.53% gc time)
# 0.091355 seconds (345.09 k allocations: 19.530 MiB)
# 0.174240 seconds (345.13 k allocations: 19.531 MiB)
# 0.061774 seconds (345.09 k allocations: 19.530 MiB)
Here we see the CSV.read() function with the threaded option. The elapsed time, especially on the initial run, looks better than without the threaded option.
@time CSV.read(filename, DataFrame, threaded = true)
# 1.401817 seconds (1.15 M allocations: 63.545 MiB, 8.19% gc time)
# 0.075649 seconds (345.09 k allocations: 19.530 MiB)
# 0.065940 seconds (345.09 k allocations: 19.530 MiB)
# 0.089491 seconds (345.09 k allocations: 19.530 MiB)
Method 2: @timev
The @timev macro is available in Julia's Base module. It provides a more verbose version of the output from Method 1 above. We don't need to run these multiple times since the functions have already been compiled.
In this code block I’m using the CSVFiles.load() function.
@timev load(filename) |> DataFrame
# 0.144837 seconds (1.40 M allocations: 51.374 MiB)
# elapsed time (ns): 144837257
# bytes allocated: 53869296
# pool allocs: 1399250
# malloc() calls: 28
Here we see the CSV.read() function. Again, it is much faster than CSVFiles and uses less memory.
@timev CSV.read(filename, DataFrame)
# 0.085629 seconds (345.09 k allocations: 19.530 MiB)
# elapsed time (ns): 85629035
# bytes allocated: 20478568
# pool allocs: 345022
# non-pool GC allocs:22
# malloc() calls: 48
Here we see the CSV.read() function with the threaded option. The time here looks slower; this may just be noise, which is why it is better to run multiple samples and look at the distribution of metrics.
@timev CSV.read(filename, DataFrame, threaded = true)
# 0.264510 seconds (345.13 k allocations: 19.531 MiB, 50.96% gc time)
# elapsed time (ns): 264510100
# gc time (ns): 134800864
# bytes allocated: 20480120
# pool allocs: 345062
# non-pool GC allocs:22
# malloc() calls: 48
# GC pauses: 1
Method 3: @timed
The @timed macro is available in Julia's Base module. It returns an object containing the following results:
- value of the expression
- elapsed time
- total bytes allocated
- garbage collection time
- object with various memory allocation counters
Since the result values are the same as those from @time and @timev above, and the process is the same for each function, I will only show one example in this section.
time_stat = @timed load(filename) |> DataFrame
time_stat.value # value of the expression
time_stat.time # elapsed time
time_stat.bytes # total bytes allocated
time_stat.gctime # garbage collection time
time_stat.gcstats # object with various memory allocation counters
Method 4: @elapsed
The @elapsed macro is available in Julia's Base module. It returns only the number of seconds it took to execute an expression.
@elapsed load(filename) |> DataFrame
# 0.854006828 seconds
@elapsed CSV.read(filename, DataFrame)
# 0.081564938 seconds
@elapsed CSV.read(filename, DataFrame, threaded = true)
# 0.091888956 seconds
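Because @elapsed returns a plain number, it is also handy for collecting your own sample distribution, in line with the earlier advice to run things multiple times. A minimal sketch, using a cheap stand-in expression rather than the CSV file:

```julia
using Statistics

# Collect repeated timings of an expression into a vector,
# then summarize the distribution ourselves.
samples = [@elapsed sum(rand(10^5)) for _ in 1:20]
println("min: ", minimum(samples), "  median: ", median(samples))
```

You could do the same with `@elapsed CSV.read(filename, DataFrame)` to build your own spread of read times, though the @benchmark macro below automates exactly this.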
Method 5: @benchmark
The @benchmark macro is available in the BenchmarkTools package. The macro runs a series of samples and outputs some basic statistical results.
Here we see the CSVFiles.load() function. In this instance the macro ran 26 samples with a spread of 111.979 – 428.623 ms.
@benchmark load(filename) |> DataFrame
"""
BenchmarkTools.Trial:
memory estimate: 51.37 MiB
allocs estimate: 1399278
--------------
minimum time: 111.979 ms (0.00% GC)
median time: 204.326 ms (39.21% GC)
mean time: 196.098 ms (32.90% GC)
maximum time: 428.623 ms (61.51% GC)
--------------
samples: 26
evals/sample: 1
"""
Here we see the CSV.read() function. It has an argument to direct the output to a DataFrame. In this instance the macro ran 75 samples with a spread of 40.203 – 282.604 ms.
@benchmark CSV.read(filename, DataFrame)
"""
BenchmarkTools.Trial:
memory estimate: 19.53 MiB
allocs estimate: 345091
--------------
minimum time: 40.203 ms (0.00% GC)
median time: 50.150 ms (0.00% GC)
mean time: 67.209 ms (19.00% GC)
maximum time: 282.604 ms (68.51% GC)
--------------
samples: 75
evals/sample: 1
"""
Here we see the CSV.read() function with the threaded option. In this instance the macro ran 79 samples with a spread of 41.552 – 181.889 ms.
@benchmark CSV.read(filename, DataFrame, threaded = true)
"""
BenchmarkTools.Trial:
memory estimate: 19.53 MiB
allocs estimate: 345091
--------------
minimum time: 41.552 ms (0.00% GC)
median time: 51.281 ms (0.00% GC)
mean time: 63.418 ms (16.24% GC)
maximum time: 181.889 ms (63.86% GC)
--------------
samples: 79
evals/sample: 1
"""
@btime
If you just want the minimum time, you can run the @btime macro from the BenchmarkTools package, which is an alternative to the @time macro from Base.
@btime load(filename) |> DataFrame
# 138.617 ms (1399278 allocations: 51.37 MiB)
@btime CSV.read(filename, DataFrame)
# 42.112 ms (345091 allocations: 19.53 MiB)
@btime CSV.read(filename, DataFrame, threaded = true)
# 52.643 ms (345091 allocations: 19.53 MiB)
Conclusion
Based on the repeated results, I think CSV.read() is the better function of the two for this operation. Whether to use the threaded option is a toss-up, since the results are comparable in my example.
I will use the CSV package going forward, unless I see different results from CSVFiles for reading in data. I have not compared any other functionality between the two packages.
Going through this example should also give you some tools for benchmarking your code in Julia. There is a lot more you can do; I have only scratched the surface.