What is Apache Parquet?
Parquet is an open-source, column-oriented data file format designed for efficient data storage and retrieval. It provides efficient data compression and encoding schemes with enhanced performance for handling complex data in bulk. It is designed to be a common interchange format for both batch and interactive workloads, and is similar to other columnar storage file formats in the Hadoop ecosystem, namely RCFile and ORC.
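The column-oriented idea can be illustrated with a minimal sketch in plain Python. This is only a conceptual illustration of the storage layout, not the actual Parquet encoding:

```python
# Illustrative sketch: row-oriented vs column-oriented layout.
# This is NOT Parquet itself, just the layout concept behind it.

rows = [
    ("Jacob", "M", 3000),
    ("Linda", "F", 4000),
    ("Janet", "F", 4500),
]

# Row-oriented storage keeps whole records together.
row_store = list(rows)

# Column-oriented storage keeps each column's values together,
# so a reader can scan one column (e.g. salary) without touching
# the others -- the key idea behind Parquet's efficiency.
column_store = {
    "name":   [r[0] for r in rows],
    "gender": [r[1] for r in rows],
    "salary": [r[2] for r in rows],
}

total_salary = sum(column_store["salary"])  # reads only one column
print(total_salary)
```

Because values of the same type sit next to each other, columnar layouts also compress and encode far better than row layouts, which is why Parquet pairs this layout with its compression schemes.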
In this tutorial, we will show you how to read and write Parquet files.
Install parquet-tools
https://pypi.org/project/parquet-tools/
via pip
pip install parquet-tools
via Homebrew on macOS
brew install parquet-tools
CLI Command Examples
display usage
parquet-tools -h
display row count
parquet-tools rowcount [example_file].snappy.parquet
display the first row
parquet-tools head -n 1 [example_file].snappy.parquet
display with json output
parquet-tools cat --json hdfs://path/to/[example_file].parquet
display with csv output
parquet-tools csv input.gz.parquet | csvq -f json "select [column1], [column2]"
display file meta data
parquet-tools meta [example_file].snappy.parquet
Write Apache Parquet in Python using Spark
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("parquetFile").getOrCreate()
data = [
    ("Jacob", "", "Smith", "36636", "M", 3000),
    ("Alex", "Mozi", "", "40288", "M", 4000),
    ("William", "", "Brown", "42114", "M", 4000),
    ("Linda", "Ann", "Jones", "39192", "F", 4000),
    ("Janet", "Sarah", "Brown", "", "F", -1)
]
columns = ["firstname", "middlename", "lastname", "dob", "gender", "salary"]
df = spark.createDataFrame(data, columns)
Write to Parquet file
df.write.mode("overwrite").parquet("/tmp/output/people.parquet")
Write modes:
append: This mode appends the data from the DataFrame to the existing Parquet files if the destination already exists. If the destination does not exist, it creates a new Parquet file at the specified location.
Example:
df.write.mode("append").parquet("path/to/parquet/file")
overwrite: This mode overwrites the destination Parquet file with the data from the DataFrame. If the file does not exist, it creates a new Parquet file.
Example:
df.write.mode("overwrite").parquet("path/to/parquet/file")
ignore: If the destination Parquet file already exists, this mode leaves it untouched and does not write the DataFrame. If the file does not exist, it creates a new Parquet file.
Example:
df.write.mode("ignore").parquet("path/to/parquet/file")
error or errorIfExists (the default mode): This mode raises an error if the destination Parquet file already exists and does not write the DataFrame. If the file does not exist, it creates a new Parquet file.
Example:
df.write.mode("error").parquet("path/to/parquet/file")
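The semantics of the four modes above can be summarized with a small plain-Python sketch. This is a simulation of the behavior using a dict as the "destination", not Spark's implementation:

```python
# Hypothetical simulation of Spark save-mode semantics.
# `storage` stands in for the filesystem; keys are output paths.

def write_with_mode(storage, path, data, mode):
    exists = path in storage
    if mode == "append":
        # Add the new data to whatever is already there.
        storage.setdefault(path, []).extend(data)
    elif mode == "overwrite":
        # Replace the destination entirely.
        storage[path] = list(data)
    elif mode == "ignore":
        # Write only if nothing exists yet; otherwise do nothing.
        if not exists:
            storage[path] = list(data)
    elif mode in ("error", "errorifexists"):
        # Refuse to clobber an existing destination.
        if exists:
            raise FileExistsError(path)
        storage[path] = list(data)
    else:
        raise ValueError(f"unknown mode: {mode}")

storage = {}
write_with_mode(storage, "/tmp/p", [1, 2], "error")    # creates
write_with_mode(storage, "/tmp/p", [3], "append")      # now [1, 2, 3]
write_with_mode(storage, "/tmp/p", [9], "ignore")      # unchanged
write_with_mode(storage, "/tmp/p", [7], "overwrite")   # now [7]
print(storage["/tmp/p"])
```

Choosing the right mode matters in pipelines: "append" for incremental loads, "overwrite" for full refreshes, and "error"/"ignore" to guard against accidental clobbering.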
Read from Parquet file
parDF1=spark.read.parquet("/tmp/output/people.parquet")
Create table from Parquet file to run queries
parDF1.createOrReplaceTempView("parquetTable")
parDF1.printSchema()
parDF1.show(truncate=False)
Sample query 1:
parkSQL = spark.sql("select * from ParquetTable where salary >= 4000 ")
parkSQL.show(truncate=False)
Sample query 2:
spark.sql("CREATE TEMPORARY VIEW PERSON USING parquet OPTIONS (path \"/tmp/output/people.parquet\")")
spark.sql("SELECT * FROM PERSON").show()
Sample query 3:
df.write.partitionBy("gender","salary").mode("overwrite").parquet("/tmp/output/people2.parquet")
parDF2=spark.read.parquet("/tmp/output/people2.parquet/gender=M")
parDF2.show(truncate=False)
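partitionBy writes one subdirectory per distinct combination of partition values, encoded as key=value path segments, which is why the read above can target ".../gender=M" directly. A plain-Python sketch of how those paths are formed (an illustration of the layout, not Spark internals):

```python
# Illustrative sketch: how partitionBy("gender", "salary") maps rows
# to key=value directory paths under the output location.

rows = [
    {"firstname": "Jacob", "gender": "M", "salary": 3000},
    {"firstname": "Linda", "gender": "F", "salary": 4000},
    {"firstname": "Alex",  "gender": "M", "salary": 4000},
]

def partition_path(base, row, keys):
    # One path segment per partition column, in the order given.
    segments = [f"{k}={row[k]}" for k in keys]
    return "/".join([base] + segments)

paths = {partition_path("/tmp/output/people2.parquet", r, ["gender", "salary"])
         for r in rows}
for p in sorted(paths):
    print(p)
```

Reading a subpath such as ".../gender=M" then only scans the matching subdirectories, which is the basis of partition pruning.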
Sample query 4:
spark.sql("CREATE TEMPORARY VIEW PERSON2 USING parquet OPTIONS (path \"/tmp/output/people2.parquet/gender=F\")")
spark.sql("SELECT * FROM PERSON2" ).show()
Other useful links and tools:
parquet-cli is an alternative to parquet-tools
Install parquet-cli
via pip
pip install parquet-cli
via Homebrew on macOS
brew install parquet-cli
https://github.com/apache/parquet-mr/blob/master/parquet-cli/README.md
Do you like the tutorial "What is Apache Parquet"? If you want the latest updates and more tips and tricks for building your own business platform, please check out more articles on https://www.productdeploy.com and https://blog.productdeploy.com and subscribe to the newsletter.