How to Use Ruby to Get CSV Headers

Daniel Pericich
4 min read · May 3, 2022


I’ve been working on a release for a while now that involves reading massive CSV files with unpredictable column ordering and then validating both the presence and correctness of their data. It’s led me to focus a lot more on efficiency when reading files.

One of the main keys to efficient file parsing is to avoid loading the whole file into memory. Loading everything at once takes a lot of memory to hold the data and a lot of processing to traverse it. Instead, we can stream the data in row by row. This keeps each memory allocation small and makes traversal faster.
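As a rough illustration of the difference, here is a minimal sketch that compares reading everything at once with streaming one row at a time. The sample file and its contents are made up so the example runs on its own.

```ruby
require "csv"
require "tempfile"

# Hypothetical sample file, created here so the sketch is self-contained.
sample = Tempfile.new(["sample", ".csv"])
sample.write("id,name\n1,Ada\n2,Grace\n")
sample.close

# CSV.read parses every row up front and returns them all in one array.
all_rows = CSV.read(sample.path)

# CSV.foreach yields one parsed row at a time, so our code only ever
# holds the current row.
row_count = 0
CSV.foreach(sample.path) do |row|
  row_count += 1
end
```

Both approaches see the same rows; only the streaming version avoids building the full array.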

This approach let me move through the data more quickly, but it did not solve my issue of inconsistent column order. To solve that, I had to write a method that reads the headers and builds a consistent header mapping no matter what order they arrive in. That meant getting the headers from my file, which raised a new question: how do you access a specific row of a file without loading the whole file?
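One way to picture such a header mapping is a hash from header name to column index. The sketch below is only an assumption about what that mapping could look like; the `header_index` helper and the sample rows are hypothetical.

```ruby
# Hypothetical header row from a file whose columns arrive in an
# unexpected order.
headers_b = ["email", "id", "name"]

# Hypothetical helper: map each header name to its column index.
def header_index(headers)
  headers.each_with_index.to_h
end

# Look a value up by header name instead of by fixed position.
row_b = ["ada@example.com", "1", "Ada"]
map_b = header_index(headers_b)
name = row_b[map_b["name"]]
```

With a mapping like this, the validation code can ask for "name" or "id" without caring which column they landed in.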

The Ruby CSV Class’s ‘shift’ Method

This is when I turned to the docs for CSV. CSV is a class in Ruby’s standard library that lets us interface with CSV files and their data. Having that interface lets you specify what you read from a CSV file and how you read it, which was important for my file data validations.

CSV lets you pass “headers: true” to a read, open or new call, but that option alone does not hand the headers back to you. Instead, it tells the parser to treat the first row as headers rather than as data, so each later row comes back as a CSV::Row keyed by header name instead of a plain array.
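A small sketch of what the option changes, using an in-memory string as made-up sample data:

```ruby
require "csv"

# Hypothetical sample data.
data = "id,name\n1,Ada\n"

# Without the option, the first row comes back like any other row.
plain = CSV.new(data)
first_plain = plain.shift

# With headers: true, the first row is consumed as headers, and the
# first shift returns a CSV::Row for the first data line.
with_headers = CSV.new(data, headers: true)
first_row = with_headers.shift
name = first_row["name"]
```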

What we want here is an array containing the header row values, without reading the whole file into memory. To get it, we will use the same method that reads our file in line by line: shift.

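If you are familiar with the shift method on Arrays, you know that it removes and returns the first element of the array. Calling shift on a CSV object behaves the same way for rows: it reads and returns the next row from the data source. That is great for us, because the first row it returns is the header row of our CSV file.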
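For reference, this is the Array behavior being compared against, shown on made-up rows:

```ruby
# Hypothetical rows held in a plain Ruby array.
rows = [["id", "name"], ["1", "Ada"], ["2", "Grace"]]

# Array#shift removes the first element and returns it.
headers = rows.shift
```

After the call, `headers` holds the first row and `rows` holds only the remaining data rows.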

This is the finished code snippet for our approach:
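A minimal sketch of that pattern follows. The sample file and the `valid_row?` helper are hypothetical stand-ins for the real data and validation rules.

```ruby
require "csv"
require "tempfile"

# Hypothetical sample file so the sketch runs on its own.
file = Tempfile.new(["people", ".csv"])
file.write("name,id\nAda,1\nGrace,2\n")
file.close

# Hypothetical validation: every cell must be present and non-empty.
def valid_row?(row)
  row.none? { |cell| cell.nil? || cell.strip.empty? }
end

csv = CSV.open(file.path)

# The first shift returns the header row as an array of strings.
headers = csv.shift

# Each later shift returns the next row, or nil once the data runs out.
invalid_rows = []
while (row = csv.shift)
  invalid_rows << row unless valid_row?(row)
end

csv.close
```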

Here we create a new CSV object from our CSV file. Then we call shift, which pulls a single row from our data source (the CSV file) and returns it as an array of strings. After that we move on to a while loop that validates each row of data for as long as there is another row to read from the data source.

This pattern works because we never have to load the entire CSV file into memory to access the header row. Also, because shift consumes the header row from our data source, we do not need to account for that row when we move on to validating the body data. It’s great!

A Prettier, but Slower, Approach

While I was researching this topic I came upon a Stack Overflow post that detailed what seemed like a shortcut. This is a recreated code snippet from the post:
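A sketch of that kind of shortcut, assuming it relies on `CSV.read` with `headers: true`. The sample file is hypothetical.

```ruby
require "csv"
require "tempfile"

# Hypothetical sample file so the sketch runs on its own.
file = Tempfile.new(["people", ".csv"])
file.write("name,id\nAda,1\nGrace,2\n")
file.close

# CSV.read parses the entire file into a CSV::Table before we ask
# that table for its headers.
headers = CSV.read(file.path, headers: true).headers
```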

This works great to get the headers, but completely misses one of the requirements of my desired solution. By calling CSV.read, we load the entire file and all of its data into memory. This is slow for large files and should be avoided, even if the syntax looks cleaner.
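If the one-line form is the appeal, it can probably be kept without reading the whole file. The sketch below assumes the block form of `CSV.open`, which yields the CSV object and returns the block’s result. The sample file is hypothetical.

```ruby
require "csv"
require "tempfile"

# Hypothetical sample file so the sketch runs on its own.
file = Tempfile.new(["people", ".csv"])
file.write("name,id\nAda,1\nGrace,2\n")
file.close

# The block calls shift once, so only the first row is parsed, and
# CSV.open closes the file when the block returns.
headers = CSV.open(file.path, &:shift)
```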

Final Thoughts

To be honest, the file documentation for Ruby is not great, which is unfortunate given that the CSV and File classes are so useful. I hope this helps you access the headers of whatever CSV files you are trying to read in efficiently. Let me know in the comments if there are other CSV or File tricks you’ve found!

Notes

https://ruby-doc.org/stdlib-2.6.1/libdoc/csv/rdoc/CSV.html#method-i-shift

https://stackoverflow.com/questions/18115985/whats-the-easiest-way-to-get-the-headers-from-a-csv-file-in-ruby

Written by Daniel Pericich

Former Big Beer Engineer turned Full Stack Software Engineer
