What is Apache Parquet?

Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk. It is designed to be a common interchange format for both batch and interactive workloads, and it is similar to other columnar storage file formats available in Hadoop, namely RCFile and ORC.
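
To see what column orientation buys you in practice, here is a minimal Python sketch (assuming the pyarrow package is installed and a hypothetical example.parquet file exists): a reader can fetch only the columns it needs instead of scanning entire rows.

import pyarrow.parquet as pq

# Only the requested columns are read from disk; the rest are skipped entirely.
table = pq.read_table("example.parquet", columns=["firstname", "salary"])
print(table.to_pandas())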

In this tutorial, we will show you how to read and write Parquet files.

Install parquet-tools

https://pypi.org/project/parquet-tools/

Via pip

pip install parquet-tools

Via Homebrew on macOS

brew install parquet-tools
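
The CLI examples in the next section need a Parquet file to inspect. If you don't have one handy, here is a minimal Python sketch (assuming pandas and pyarrow are installed) that writes a small snappy-compressed file:

import pandas as pd

# Snappy is the default Parquet compression in pandas/pyarrow; it is spelled out here for clarity.
df = pd.DataFrame({"column1": [1, 2, 3], "column2": ["a", "b", "c"]})
df.to_parquet("example_file.snappy.parquet", compression="snappy")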

CLI Command Examples

Display usage

parquet-tools -h

Display row count

parquet-tools rowcount [example_file].snappy.parquet

Display the first row

parquet-tools head -n 1 [example_file].snappy.parquet

Display with JSON output

parquet-tools cat --json hdfs://path/to/[example_file].parquet

Display with CSV output

parquet-tools csv input.gz.parquet | csvq -f json "select [column1], [column2]"

Display file metadata

parquet-tools meta [example_file].snappy.parquet 

Write Apache Parquet in Python using Spark

from pyspark.sql import SparkSession

# Start (or reuse) a local Spark session
spark = SparkSession.builder.appName("parquetFile").getOrCreate()

# Sample data: (firstname, middlename, lastname, dob, gender, salary)
data = [
   ("Jacob", "", "Smith", "36636", "M", 3000),
   ("Alex", "Mozi", "", "40288", "M", 4000),
   ("William", "", "Brown", "42114", "M", 4000),
   ("Linda", "Ann", "Jones", "39192", "F", 4000),
   ("Janet", "Sarah", "Brown", "", "F", -1)
]
columns = ["firstname", "middlename", "lastname", "dob", "gender", "salary"]
df = spark.createDataFrame(data, columns)
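
Before writing the file, it can help to sanity-check the DataFrame:

df.printSchema()         # column names and inferred types
df.show(truncate=False)  # all five sample rows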

Write to Parquet file

df.write.mode("overwrite").parquet("/tmp/output/people.parquet")

Write modes:

append: Appends the data from the DataFrame to the destination if it already exists. If the destination does not exist, it creates a new Parquet file at the specified location.

Example:

df.write.mode("append").parquet("path/to/parquet/file")

overwrite: Overwrites the destination Parquet file with the data from the DataFrame. If the file does not exist, it creates a new Parquet file.

Example:

df.write.mode("overwrite").parquet("path/to/parquet/file")

ignore: If the destination Parquet file already exists, this mode does nothing and the DataFrame is not written. If the file does not exist, it creates a new Parquet file.

Example:

df.write.mode("ignore").parquet("path/to/parquet/file")

error or errorifexists (the default): Raises an error if the destination Parquet file already exists. If the file does not exist, it creates a new Parquet file.

Example:

df.write.mode("error").parquet("path/to/parquet/file")

Read from Parquet file

parDF1 = spark.read.parquet("/tmp/output/people.parquet")

Create a temporary view from the Parquet file to run queries

parDF1.createOrReplaceTempView("parquetTable")
parDF1.printSchema()
parDF1.show(truncate=False)

Sample query 1:

parquetSQL = spark.sql("SELECT * FROM parquetTable WHERE salary >= 4000")
parquetSQL.show(truncate=False)
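
With the sample data above, this returns the three rows whose salary is 4000 (Alex, William, and Linda).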

Sample query 2:

spark.sql("CREATE TEMPORARY VIEW PERSON USING parquet OPTIONS (path \"/tmp/output/people.parquet\")")
spark.sql("SELECT * FROM PERSON").show()

Sample query 3 (write and read partitioned data):

df.write.partitionBy("gender","salary").mode("overwrite").parquet("/tmp/output/people2.parquet")

parDF2 = spark.read.parquet("/tmp/output/people2.parquet/gender=M")
parDF2.show(truncate=False)
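
One caveat: when you point the reader directly at a partition directory such as gender=M, the partition column itself does not appear in the result, because it is encoded in the directory name rather than stored in the data files. If you want the gender column back while still reading a single partition, Spark's basePath option restores partition discovery (a minimal sketch):

parDF3 = spark.read.option("basePath", "/tmp/output/people2.parquet").parquet("/tmp/output/people2.parquet/gender=M")
parDF3.show(truncate=False)  # now includes the gender column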

Sample query 4 (read one partition through a temporary view):

spark.sql("CREATE TEMPORARY VIEW PERSON2 USING parquet OPTIONS (path \"/tmp/output/people2.parquet/gender=F\")")
spark.sql("SELECT * FROM PERSON2" ).show()

Other useful links and tools:

parquet-cli is an alternative to parquet-tools.

Install parquet-cli

Via pip

pip install parquet-cli

Via Homebrew on macOS

brew install parquet-cli

Documentation: https://github.com/apache/parquet-mr/blob/master/parquet-cli/README.md
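
Note that the two builds install different launchers: the PyPI package provides a parq command, while the Homebrew (Apache parquet-mr) build provides a parquet command. Roughly, with a placeholder file name (check --help for your build):

parq example.parquet --schema    # PyPI build: print the schema
parquet meta example.parquet     # Apache build: print file metadata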

Do you like the tutorial "What is Apache Parquet"? If you want the latest updates and more tips and tricks to build your own business platform, please check out more articles on https://www.productdeploy.com and https://blog.productdeploy.com and subscribe to the newsletter.
