Python – read csv, tsv, excel files with pandas
I will focus on data input and output using pandas here, but there are many other libraries that can help you read and write data in various file formats.
Inputs and outputs generally fall into several broad categories.
Patterns for reading text files or more efficiently formatted files from disk, patterns for reading data from databases, and patterns for reading data from network sources such as Web APIs.
Reading and writing data in text format
A feature of pandas is that it has many functions that read tabular data as DataFrame objects. Probably the most commonly used ones are read_csv and read_table.
|read_csv||Reads delimited data from files, URLs, and other file-based objects. The default delimiter is a comma.|
|read_table||Reads delimited data from files, URLs, and other file-based objects. The default delimiter is tab (‘t’).|
|read_excel||Import tabular data from Excel XLS and XLSX files.|
0. Get Iris Dataframe from sklearn
First, get the iris data from sklearn.datasets as a preparation. Create a CSV file.
from sklearn.datasets import load_iris df_iris = pd.DataFrame(iris.data, columns = iris.feature_names) df_iris.head()
1. Output Dataframe as CSV file
Now, output the acquired iris Dataframe as a CSV file.
# Output CSV file from df df_iris.to_csv('iris_out.csv', encoding='utf8', index=False)
2. Output Dataframe to TSV file
Since it will be used in subsequent work, it will also be output as a TSV file this time.
# Output TSV file from df df_iris.to_csv('iris_out.tsv', encoding='utf8', index=False, sep='t')
3. Read CSV file with pandas
Now, read the CSV file (comma delimited). You can read Data frame (data frame) as a return value with read_csv of pandas. Also, check the top 5 results of Dataframe after loading.
# Read CSV file df = pd.read_csv('iris_out.csv') df.head()
4. Read TSV file with pandas
Try reading a TSV file (tab delimited). Read_csv of pandas can be used to read TSV files as well as CSV files. When read with read_csv, you can get a Data frame as a return value.
# Read TSV file df_tsv = pd.read_csv('iris_out.tsv', delimiter='t')
It is also possible to read using read_table. In that case, a delimiter must be specified.
df_1 = pd.read_table('iris_out.csv', sep=',') df_1.head()
Below are some of the options commonly given when using pandas.read_csv and pandas.read_table.
|path||A string representing a filesystem location, URL, or other file-based object.|
|sep or delimiter||A string or regular expression used to split each line into fields.|
|header||Row number to use as column names. 0 (first row) by default. Specify None if there is no header row.|
|index_col||The number or name of the column used as the row index in the returned object. A single name/number, or a list of names/numbers for hierarchical indexes.|
|names||A list of column names for the returned object. use with header=None|
|skiprows||Number of lines to ignore at the beginning of the file, or line numbers to skip (first line is 0).|
|na_values||A set of values to replace with missing values.|
|comment||Separate the character or character string specified in this argument from each line as a comment.|
|parse_dates||Attempts to read the data as a datetime. False by default, if True try to read on all columns. Specify the number or name of the column to read if you don’t want it to apply to all columns. If the elements of the list are tuples or lists, combine their multiple columns and read them as dates (e.g. used when date and time are separated into two columns)|
|date_parser||The function used to read the date.|
|nrows||The number of lines to read at the beginning of the file.|
|iterator||Returns a TextFileReader object for partially reading the file.|
|encoding||Character code to use as Unicode (for example, specify ‘utf-8’ for UTF-8 encoded text).|
|squeeze||Returns a Series if the loaded data contains only one column.|
|thousands||Thousands separator (eg ‘,’ or ‘.’).|
Export data in text format
Data can also be exported in delimited format. Consider, for example, one of the CSV files we just read.
df = pd.read_csv('iris_out.csv') df.head()
You can use the dataframe’s to_csv method to write the data to a comma-separated file.
Importing Microsoft Excel files
Pandas also supports reading table data saved in Excel 2003 (and newer) files.
Reading is done using either the ExcelFile class or the pandas.read_excel function. These tools internally use add-on packages called xlrd for reading Excel files, and xlwt (for XLS files) and openpyxl (for XLSX files) for writing, so use pip or conda when using These add-on packages must be installed separately.
To use ExcelFile, create an instance by passing the path of the xls file or xlsx file as an argument as follows.
xlsx = pd.ExcelFile('ex1.xlsx')
This time, if the sheet name is specified as “Sheet1” as the second argument, the data in the sheet can be read as a data frame by the read_excel function.
# Case of the first row is index df_xlsx = pd.read_excel(xlsx, 'Sheet1', index_col=0) # Case of using the index df_xlsx = pd.read_excel(xlsx, 'Sheet1') # You can also read by directly specifying the file path df_xlsx = pd.read_excel('ex1.xlsx', 'Sheet1', index_col=0) df_xlsx.head()
To write out pandas data in Excel format, first create an ExcelWriter and then write the data using the to_excel method of the pandas object.
writer = pd.ExcelWriter('ex2.xlsx') df_xlsx.to_excel(writer, 'Sheet1') writer.save()
You can also write the file path to to_excel without using ExcelWriter.
This time, it was about reading CSV, TSV, Excel files with Python – pandas. Loading data is often the first step in the process of data analysis. In addition to the contents described this time, I think that there may be cases where detailed parameter settings are necessary, but I think that the basics are like this.
Please also refer to this article for steps such as filtering and aggregation after loading csv, tsv, and xlsx into dataframe with pandas.
Please also refer to here for joins between dataframes.