
Workaround for big Excel files and memory issues?

Jul 21, 2015 at 12:15 PM
I've been working on a C# program that creates Excel reports from several SQL tables, each in a separate sheet. But this particular report I'm working on now has a large amount of data: about 5 sheets of 15k rows with somewhere around 80 columns each.

The thing is, it's so large that manipulating a single workbook with all those cells in it spins the process out of control and throws out-of-memory errors.
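For illustration, the shape of the code is roughly this (query, table and sheet names are placeholders for the real SQL sources):

    // Simplified sketch of the current approach: everything is built up in one
    // XLWorkbook and only written out by a single SaveAs at the end.
    using ClosedXML.Excel;
    using System.Data;
    using System.Data.SqlClient;

    static void BuildReport(string connectionString, string outputPath)
    {
        var reportTables = new[] { "Report1", "Report2", "Report3", "Report4", "Report5" };

        using (var workbook = new XLWorkbook())
        using (var connection = new SqlConnection(connectionString))
        {
            connection.Open();

            foreach (var name in reportTables)
            {
                var table = new DataTable(name);
                using (var adapter = new SqlDataAdapter("SELECT * FROM " + name, connection))
                    adapter.Fill(table);

                // ~15k rows x ~80 columns per sheet, all held in memory as XLCell objects.
                var sheet = workbook.Worksheets.Add(name);
                sheet.Cell(1, 1).InsertTable(table);
            }

            // The whole workbook (all 5 sheets' worth of cells) only hits disk here.
            workbook.SaveAs(outputPath);
        }
    }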

I was wondering if there is a way to work around this, perhaps by working on a single sheet at a time and appending it to the file without having to load the whole workbook (as in: create a sheet, fill it, save it, then move on to the next one).

Any ideas?
Aug 11, 2015 at 10:39 AM
I'm having a similar issue (except with one very large sheet rather than 5 smaller ones). I've seen comments elsewhere that suggest moving to x64, but I'm not convinced that'll solve the issue for me, since users are likely to be on 4 GB RAM machines, and if I'm hitting 1.7 GB after around 1/4 of one sheet is loaded, I'd expect it to get considerably larger before all the data is there (let alone during the actual save process).

Looking at the object model, it looks like cells don't get written back to the file representation immediately (rather, they hang around until Save or SaveAs is called), which is obviously going to exacerbate the situation. That is to say, there are two points where OOM exceptions show up:
  1. When inserting data/manipulating the sheet (Too many XLCell objects floating around, I think)
  2. When saving the sheet to disk (Haven't dug into this; I'm just hoping that internally it's not using XmlSerializer, because this is OMGBAD for memory footprint)
I'm wondering whether it would be possible to optimize the ClosedXML cell model to explicitly flush cells on command, to reduce its memory footprint and potentially allow saving in blocks instead of all at once. After all, there are programs that handle this much data even on x86 machines (Access, in my experience, and probably most respectable databases, especially if Access can do it). Obviously the age-old adage of "don't use Excel as a database" applies here, but since Excel itself can handle that much data on users' machines, it would be nice if the library we're using to build the files could as well!
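For what it's worth, the underlying OpenXML SDK does expose a forward-only OpenXmlWriter that streams rows straight into the worksheet part instead of keeping every cell object around. A rough sketch of that approach (plain SDK, no ClosedXML, and it skips styles, shared strings and cell references):

    using DocumentFormat.OpenXml;
    using DocumentFormat.OpenXml.Packaging;
    using DocumentFormat.OpenXml.Spreadsheet;

    static void StreamLargeSheet(string path, int rows, int columns)
    {
        using (var document = SpreadsheetDocument.Create(path, SpreadsheetDocumentType.Workbook))
        {
            var workbookPart = document.AddWorkbookPart();
            var worksheetPart = workbookPart.AddNewPart<WorksheetPart>();

            // Stream the sheet data row by row; nothing is buffered beyond the current row.
            using (var writer = OpenXmlWriter.Create(worksheetPart))
            {
                writer.WriteStartElement(new Worksheet());
                writer.WriteStartElement(new SheetData());

                for (var r = 1; r <= rows; r++)
                {
                    writer.WriteStartElement(new Row { RowIndex = (uint)r });
                    for (var c = 1; c <= columns; c++)
                    {
                        writer.WriteElement(new Cell
                        {
                            DataType = CellValues.Number,
                            CellValue = new CellValue((r * c).ToString())
                        });
                    }
                    writer.WriteEndElement(); // Row
                }

                writer.WriteEndElement(); // SheetData
                writer.WriteEndElement(); // Worksheet
            }

            // Register the streamed worksheet in the workbook so Excel can find it.
            using (var writer = OpenXmlWriter.Create(workbookPart))
            {
                writer.WriteStartElement(new Workbook());
                writer.WriteStartElement(new Sheets());
                writer.WriteElement(new Sheet
                {
                    Name = "Data",
                    SheetId = 1U,
                    Id = workbookPart.GetIdOfPart(worksheetPart)
                });
                writer.WriteEndElement(); // Sheets
                writer.WriteEndElement(); // Workbook
            }
        }
    }

You give up ClosedXML's nicer cell model, but the working set stays flat no matter how many rows you write.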
Editor
Aug 11, 2015 at 5:48 PM
I've been doing some profiling and digging through the code to see what's going on with memory. I've got a test program that generates about 60 columns of dummy data, and it throws OOM exceptions when I try to generate 100,000 rows. This is on Windows 10, 64-bit, with 8 GB of memory.
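Roughly that shape of test program, for anyone who wants to reproduce the numbers (the sheet name and cell values here are just dummies):

    using ClosedXML.Excel;

    static void GenerateTestFile(string path, int rows, int columns)
    {
        using (var workbook = new XLWorkbook())
        {
            var sheet = workbook.Worksheets.Add("Test");

            for (var r = 1; r <= rows; r++)
            {
                for (var c = 1; c <= columns; c++)
                {
                    // Each assignment allocates an XLCell (plus its XLStyle and XLAddress).
                    sheet.Cell(r, c).Value = "Row " + r + " Col " + c;
                }
            }

            // With rows = 100000 and columns = 60, the OOM shows up before or during the save.
            workbook.SaveAs(path);
        }
    }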

It's possible that the underlying OpenXML libraries from Microsoft are using XmlSerializer, but that's not something that ClosedXML can do much about. As far as I can tell, ClosedXML doesn't use XmlSerializer directly.

The bulk of the problem does seem to be the number of XLCell objects allocated. Reducing their size would help, but I don't see an easy way to do that. Every single XLCell ultimately allocates its own XLStyle object (which in turn has an Alignment, Border, Fill, etc.); the library doesn't re-use XLStyle objects across multiple cells. Making that re-use happen would require a pretty significant change to the style design.
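For the sake of discussion, a sketch of what shared (interned) styles could look like; the SharedStyle name, its fields and the cache key are all made up here and are not the actual XLStyle design:

    using System.Collections.Generic;

    // Cells would hold a reference to an immutable, interned style rather than
    // each owning a full XLStyle graph (Alignment, Border, Fill, ...).
    public sealed class SharedStyle
    {
        private static readonly Dictionary<string, SharedStyle> Cache =
            new Dictionary<string, SharedStyle>();

        public string FontName { get; private set; }
        public double FontSize { get; private set; }
        public bool Bold { get; private set; }

        private SharedStyle(string fontName, double fontSize, bool bold)
        {
            FontName = fontName;
            FontSize = fontSize;
            Bold = bold;
        }

        public static SharedStyle Get(string fontName, double fontSize, bool bold)
        {
            var key = fontName + "|" + fontSize + "|" + bold;
            SharedStyle style;
            if (!Cache.TryGetValue(key, out style))
            {
                style = new SharedStyle(fontName, fontSize, bold);
                Cache[key] = style;
            }
            return style;
        }
    }

Two cells with identical formatting would then share one object, so a sheet that is mostly default-formatted pays for a handful of styles instead of one per cell.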

A little bit of breathing room could be gained by removing the _trimmedAddress and _columnLetter members from XLAddress; these are used to cache values that can be computed on demand. Removing them would reduce memory pressure: since every XLCell has an XLAddress, those two cached values are stored for every single cell allocated, which adds up when we're talking about hundreds of thousands (or even millions) of cells.
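For example, the column letter can be derived from the column number whenever it's asked for instead of being cached per cell (this helper is illustrative, not the actual XLAddress code):

    // Bijective base-26 conversion: 1 -> "A", 26 -> "Z", 27 -> "AA", 703 -> "AAA".
    private static string GetColumnLetter(int columnNumber)
    {
        var letters = string.Empty;
        while (columnNumber > 0)
        {
            var remainder = (columnNumber - 1) % 26;
            letters = (char)('A' + remainder) + letters;
            columnNumber = (columnNumber - 1) / 26;
        }
        return letters;
    }

That trades a small amount of CPU per lookup for not storing an extra string on every one of those millions of addresses.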