While trying to load a large dataset with the pandas library's pd.read_csv() method, my computer crashed, and that started my journey into exploring big data tools without upgrading my machine.
PySpark, which uses distributed computing, came to my rescue. Using a combination of basic SQL commands and Spark DataFrames, selecting files and establishing relationships with other CSV files was a breeze. It was easy to aggregate millions of rows down to a few thousand and convert the result back to pandas for data visualization. The cool thing was that we did not need to create a database and could still do the analysis. The downside: what if this data keeps growing (the classic big data problem)? That would require much more infrastructure, such as setting up AWS RDS with Postgres and running Spark on an EC2 instance.
But for now, to keep things moving, let's run this analysis by setting up Spark on our local machine.
After much research, I decided to use the open-source package manager Homebrew to install Spark and its dependencies.
For example, I used it to install Java, Hadoop, Scala, and Apache Spark, which simplified installation on my macOS machine. Then I set up my .bash_profile to ensure all of the above work together, so that when we run Spark on our local machine it is not throwing errors like 'unsupported file error 58', etc.
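The installs might look like the following; the exact formula names are assumptions and can differ between Homebrew versions, so check with brew search first:

```shell
# Hypothetical Homebrew installs; formula names may vary by Homebrew version
brew install openjdk       # Java runtime that Spark needs
brew install hadoop
brew install scala
brew install apache-spark
```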
You can type open -a TextEdit ~/.bash_profile in your terminal to open and modify your bash profile. Here you will need to export the paths for Spark, Hadoop, and Java to make everything work.
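As a sketch, the exports might look like this; the Cellar paths are assumptions, so replace <version> with whatever Homebrew actually installed on your machine (brew info apache-spark and brew info hadoop will tell you):

```shell
# Hypothetical ~/.bash_profile entries; replace <version> with your installed versions
export JAVA_HOME=$(/usr/libexec/java_home)
export HADOOP_HOME=/usr/local/Cellar/hadoop/<version>
export SPARK_HOME=/usr/local/Cellar/apache-spark/<version>/libexec
export PATH=$SPARK_HOME/bin:$HADOOP_HOME/bin:$PATH
```

After saving, run source ~/.bash_profile (or open a new terminal) so the exports take effect.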
Finally, once the above setup is done, we will go to our Jupyter notebook, initiate our Spark session, read the CSV file, and start exploring the large dataset.