Photo by Cara Fuller on Unsplash (Fastest Mammal)
Even with 1 TB of disk storage and 8 GB/16 GB of RAM, pandas and many other data-loading APIs struggle to load a 2 GB file.
This is because when a process requests memory, memory is allocated in one of two ways:
- Contiguous memory allocation (consecutive blocks are assigned)
- Non-contiguous memory allocation (separate blocks at different locations)
Pandas uses contiguous memory to load data into RAM because read and write operations are much faster on RAM than on disk (or SSDs).
- Reading from SSDs: ~16,000 nanoseconds
- Reading from RAM: ~100 nanoseconds
Before going into multiprocessing, GPUs, etc., let us see how to use pd.read_csv() effectively.
Pandas is fine for loading data and preprocessing, but to train your models, start using DataLoader from TensorFlow or PyTorch or wherever you run your model.
Photo by Sebastian Yepes on Unsplash (fastest dog breed)
Note: If you are reading this on mobile, you may not be able to scroll through the code. (Open the Gist for better readability.)
1. usecols:
Rather than loading all the data and then dropping the columns that aren't useful, load only the useful columns.
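For example, a minimal sketch (the column names here are just placeholders for your own):

import pandas as pd

# Load only the columns you actually need instead of the whole file.
df = pd.read_csv("train.csv", usecols=["SalesID", "SalePrice", "YearMade"])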
2. Using correct dtypes for numerical data:
Every column has its own dtype in a pandas DataFrame; for example, integers can be int64, int32, int16, etc.
- int8 can store integers from -128 to 127.
- int16 can store integers from -32768 to 32767.
- int64 can store integers from -9223372036854775808 to 9223372036854775807.
Pandas assigns int64 to integer columns by default, so defining the correct dtypes can reduce memory usage significantly.
🔥 Pro Tip: Use converters to replace missing values or NANs while loading data, especially for the columns that have predefined datatypes using dtype.
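A minimal sketch of both ideas together (the column names and the converter logic are only assumptions for illustration):

import pandas as pd

# Downcast numeric columns at load time instead of keeping the default int64/float64.
dtypes = {"YearMade": "int16", "MachineHoursCurrentMeter": "float32"}

# A converter cleans a column while loading, e.g. replacing empty strings.
converters = {"UsageBand": lambda value: value if value else "unknown"}

df = pd.read_csv("train.csv", dtype=dtypes, converters=converters)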
3. Using correct dtypes for categorical columns:
In my dataset, I have a column Thumb which is parsed as a string by default, but it contains only a fixed number of values that remain unchanged across datasets.
Columns such as Gender can also be stored as categorical values, which reduces the memory from ~1000 KB to ~100 KB (check the stats).
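A minimal sketch, assuming the Thumb and Gender columns from above:

import pandas as pd

# Parse low-cardinality string columns as the category dtype at load time.
df = pd.read_csv("train.csv", dtype={"Thumb": "category", "Gender": "category"})

# Compare against the default object dtype to see the savings.
print(df.memory_usage(deep=True))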
🔥 Pro Tip: If your DataFrame contains lots of empty values or missing values or NANs you can reduce their memory footprint by converting them to Sparse Series.
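A minimal sketch using the modern sparse dtype (the older SparseSeries/.to_sparse() API is deprecated); the column name is illustrative:

import pandas as pd

df = pd.read_csv("train.csv")

# Store a mostly-missing float column sparsely; only the non-NaN values are kept in memory.
df["MachineHoursCurrentMeter"] = df["MachineHoursCurrentMeter"].astype(pd.SparseDtype("float64"))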
4. nrows, skiprows
Even before loading all the data into your RAM, it is always good practice to test your functions and workflows using a small dataset, and pandas makes it easy to choose precisely the number of rows to read (you can even skip the rows you do not need).
In most cases, for testing purposes, you don't need to load all the data when a sample will do just fine.
nrows: the number of rows to read from the file.
>>> import pandas as pd
>>> df = pd.read_csv("train.csv", nrows=1000)
>>> len(df)
1000
skiprows: line numbers to skip (0-indexed) or the number of lines to skip (int) at the start of the file.
# Can be either a list of row indices or the first N rows to skip.
df = pd.read_csv('train.csv', skiprows=[0, 2, 5])
# Note: skipping row 0 removes the header row.
🔥 Pro Tip: An effective use of nrows is when you have hundreds of columns to check and need to define proper dtypes for each one. All of this overhead can be reduced using nrows as shown below.
sample = pd.read_csv("train.csv", nrows=100)  # Load sample data
dtypes = sample.dtypes                        # Get the dtypes
cols = sample.columns                         # Get the columns
dtype_dictionary = {}
for c in cols:
    """
    Write your own dtypes using
    # rule 2
    # rule 3
    """
    if str(dtypes[c]) == 'int64':
        dtype_dictionary[c] = 'float32'  # Handle NaNs in int columns
    else:
        dtype_dictionary[c] = str(dtypes[c])

# Load data with increased speed and reduced memory usage.
df = pd.read_csv("train.csv", dtype=dtype_dictionary,
                 keep_default_na=False,
                 error_bad_lines=False,
                 na_values=['na', ''])
Note: as NaNs are treated as floats in pandas, don't forget to convert integer dtypes to float if your columns contain NaNs.
5. Loading Data in Chunks:
Memory issues with pandas read_csv() have been around for a long time, so one of the best workarounds for large datasets is to load the data in chunks.
Note: loading data in chunks is actually slower than reading the whole file at once, since you need to concatenate the chunks afterwards, but it lets you load files of 10+ GB with ease.
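A minimal sketch of chunked loading (the chunk size is arbitrary; tune it to your RAM):

import pandas as pd

# Read the file in chunks of one million rows, then stitch them back together.
chunks = pd.read_csv("train.csv", chunksize=1_000_000)
df = pd.concat(chunks, ignore_index=True)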
6. Multiprocessing using pandas:
As pandas doesn't have an n_jobs argument to make use of multiple cores, we can use the multiprocessing library to handle the chunk-wise operations asynchronously across processes, which can cut the run time roughly in half.
Note: you need to create the Pool inside the __main__ guard, because only the main process can distribute the work asynchronously among the worker processes.
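A minimal sketch, assuming a process_chunk function that stands in for your own preprocessing:

import pandas as pd
from multiprocessing import Pool

def process_chunk(chunk):
    # Placeholder: apply your own cleaning/feature engineering per chunk.
    return chunk

if __name__ == '__main__':
    # The Pool must be created under the __main__ guard.
    chunks = pd.read_csv("train.csv", chunksize=1_000_000)
    with Pool(processes=4) as pool:
        results = pool.map(process_chunk, chunks)
    df = pd.concat(results, ignore_index=True)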
7. Dask Instead of Pandas:
Although Dask doesn't provide as wide a range of data-preprocessing functions as pandas, it supports parallel computing and loads data faster than pandas.
import dask.dataframe as dd

data = dd.read_csv("train.csv",
                   dtype={'MachineHoursCurrentMeter': 'float64'},
                   assume_missing=True)
data.compute()
🔥 Pro Tip: To find out how long a Jupyter cell takes to run, just add the %%time magic at the start of the cell.
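For example, at the top of a notebook cell:

%%time
df = pd.read_csv("train.csv")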
Libraries to try out: Paratext, Datatable.
There's another way: you can rent a VM in the cloud with 64 cores and 432 GB of RAM for ~$3/hour, or an even better price with some googling. Caveat: you need to spend the next week configuring it.
Thanks for reading until the end; I hope you learned something new. Happy loading ✌️ (🌩 if you liked it).
Comment below with the tricks you use to load your data faster, and I will add them to the list.
References (add them to your blog list):
🔥 Itamar Turner-Trauring — Speed Python Master (Must ✅).
🔥 Gouthaman Balaraman — quantitative finance with python (Must ✅).
Connect with me on LinkedIn.