Pandas is a very useful data science tool in Python to manipulate tables and time series data using its data structures and tools. If you input data in a CSV or TSV file, or a SQL database it creates a Python object with rows and columns called a data frame, which processes data very fast.
Pandas can allow Python programs to read and modify Excel spreadsheets. This can be used to automate data extraction and processing (ETL) for data residing in Excel files in a very fast manner.
VBA vs Pandas for Excel
Excel supports several automation options using VBA like User Defined Functions (UDF) and macros. If you use other tools like macros, VBA, etc, time to learn becomes an important factor. If you need fast results, VBA is a good option. Pandas will take time to learn, but it can do a lot more than VBA.
Pandas also contains a smarter machine learning backbone. Due to that working with large Excel file is becomes very fast.
Pandas and Excel come together in two ways:
- Control Excel from Python
- Call custom Python code from within Excel
Some typical examples of Excel ETL are:
- Copying only certain data from one spreadsheet and pasting it into another.
- Going through thousands of rows and finding just a few that need edits based on some criteria.
- Searching through hundreds of a budgeting spreadsheets, searching for any cells that are in the red.
Usually, a programming language supports working with a simple flat files (like .txt, .csv), but pandas even helps in formatting Excel files.
We did a POC for large Excel files. Incoming data collected on the field from pharma and medical staff was not in standard format in Excel files. Manually copying, cleaning, and analyzing this data was leading to a lot of human error. Pandas was used to reduce this and do the processing much faster.
We focused on reading and analyzing the file and then working with a subset of the original data, all using pandas.
The basic operations of Excel were required in our project:
- Merging, concatenation, renaming, dropping, indexing, etc.
- Creating the final formatted Excel file.
This article will focus on building a a final Excel worksheet that is created by our custom python code.
In this POC, we developed a simple modelling application that:
- Takes input as an Excel which may be in non standard columns, headers, and non-formatted dates.
- Returns rest Excel with users choice (for example, if new columns are detected in Excel file, user has option to add while generating final output Excel) that has been transformed via pandas.
The solution is simple but shows the power of this combination and how easily you could perform more complex data analysis.
Tkinter– For creating GUI (below is the GUI for choosing Excel files in our POC).
pyExcelerate – to store dataframe to Excel
(We used pyexcelerate over pandas here because it has more speed to store dataframe to Excel . For example, pandas take 10 sec to store dataframe to Excel while pyExcelerate take 3-4 sec to store dataframe to Excel.)
Read Excel Files.
The UI has option ( choose a file ) to select or browse input file.
Run Python script.
In this step when user browses file, it goes to our Python script and reads input Excel file. After reading the file, it validates each column of Excel file.
Here, our Python script does the following:
- Takes input as Excel file. When user selects the Excel file from UI (Choose a file button) it validates each columns of Excel (if columns are missing values, it can not be proceed to final output ).
- Then it changes the date columns with standard date format (mm/dd/yyy).
- It changes columns values where columns values are in yes and no into Yes or No.
- For total amount columns it takes special symbols like $ and gives them right and left alignment.
- Finally it returns the resulting Excel file.
Sample ETL Using our Script
Below is a sample Excel file where header position is not fixed. The names of headers , dates, time, and values of columns are not in standard form.
We are going to process this with our Python script and see the output Excel file.
When we run the script, this is the result:
- While reading Excel file it shows if new headers are detected.
- Script gets headers of Excel files and replaces with its standard name.
- In this step it recognizes headers of Excel file, even if header is not in top position. If you see in above sample file header position is on 8th line.
- In above output, we got headers and replaced header name with its standard name (for eg ID with Product ID, name with Product Name, Total with Total Amount, type with Produbt Type, etc).
- After changing headers names, it asks user if s/he wants to add new columns to Excel file.
- It changes values of columns to standard and does right and left alignment.
In above screenshot, if you see Header name columns, the values are different. The values are in Yes or No, not in y or N.
After processing, the output Excel is as shown below:
Challenges and Solutions
Getting header names if header is not on top.
— Read Excel file
— Count length of each row
— Get max of count
— Max count considered as header
— Remove head part of Excel from max count of first row.
— Get headers in tuples
Date formatting for all date formats to convert into mm/dd/yyyy format.
— Read date column into data frame
— Create instantaneous function with lambda
pandas Features to Handle Excel Data used in POC
- Full formatting: Can read Excel files with full formatting.
- Renaming column names: Can rename column names.
- Splitting and Merging: Can split Excel file data and merging into new Excel files.
- Dropping: Can remove null and empty cells.
- Concatenation: Can concatenate contents from two Excel cells from different files into one.
- Reset indexing: Can reset index positions of columns.
- Right and Left Alignment: Can change data alignment of columns from right to left and vice-versa.
- Styles and colors: Can apply styles and color formatting on cells.
- Fastest way to store modified cells to Excel file.
Unique pandas Features used in POC
- Can read header names from different positions, that is, find out where Excel file header position is.
- Can convert different date formats into standard format. For example (20200318 to 2020/03/18, 18/03/2020 to 2020/03/18, 18-03-2020 to 2020/03/18, 18-Mar-2020 to 2020/03/18 etc).
- Suppose name is small case, make into Sentence case (example: convert Copay into Copayment)
- Mapping with new headers of Excel sheet to replace headers with its standard name
- Can perform validations on columns of Excel files.
- Can create output JSON file with fields that have same order as that of columns names. For web application it helpful to show output via front-end.
- Can read special symbols in cells for example ($ for total).
Advantages of using Python ETL
To summarize the advantages of doing ETL with Python:
- Data validations
- Merging spreadsheets
- Reduce errors related of spreadsheets
- Standardize processes
- Efficiency and Accuracy
- Faster execution