avro files

diary of a codelovingyogi
1 min read · Apr 2, 2020

Avro is a data serialization system and a row-based storage format commonly used with Hadoop. The schema that defines the data is written in JSON, while the data itself is stored in a binary format, which keeps files compact and fast to read and write.

If you are working with Avro files in Python, you can read them with the fastavro library:

pip install fastavro

You can read in each row as a record of data:

import fastavro

# avro_file must be a file object opened in binary mode, e.g. open(path, 'rb')
final_file = []
for line in fastavro.reader(avro_file):
    final_file.append(line)

You can further analyze and manipulate data as you like using Pandas:

import pandas as pd

df = (pd.DataFrame(final_file, dtype=str)
    .drop_duplicates()
    # drop columns in which every value is missing
    .dropna(axis='columns', how='all')
)
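As a rough illustration of what that cleanup chain does, the sketch below runs it on a few hand-written records (hypothetical data, not taken from a real Avro file). It leaves out `dtype=str` to keep the example small and passes `axis` as a keyword, which recent pandas versions require.

```python
import pandas as pd

# Hypothetical records shaped like the dicts fastavro yields.
sample_records = [
    {"id": 1, "name": "a", "note": None},
    {"id": 1, "name": "a", "note": None},  # exact duplicate of the first row
    {"id": 2, "name": "b", "note": None},
]

sample_df = (
    pd.DataFrame(sample_records)
    .drop_duplicates()                   # removes the repeated row
    .dropna(axis="columns", how="all")   # removes "note", which is entirely empty
)

print(sample_df)
```

Only the two distinct rows and the `id` and `name` columns remain.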

For example, if you then want to save the data as a Parquet file, you can write the DataFrame out directly (pandas needs a Parquet engine such as pyarrow or fastparquet installed for this):

temp_file = 'temp.snappy.parquet'
df.to_parquet(temp_file, compression='snappy')
