Greenplum: How to find the skewness of a table (data skew)?


Greenplum is based on an MPP (Massively Parallel Processing) architecture.
Multiple segments run in shared-nothing mode, which means your data should be distributed evenly across all segments.

If table data is not evenly distributed, the system cannot deliver the performance that parallel processing is meant to provide.

Skewness of a table means that its data is not evenly distributed across the segments, so the workload is not divided evenly between them.

You can find data skew by checking gp_segment_id for each record and counting the records per segment.

The record counts of the segments should be very close to each other, for example with the smallest count at around 90% to 95% of the largest. If you find a big difference between counts, or a count of 0 for a few segments, your data is not properly distributed.

SELECT gp_segment_id, count(*)
FROM table_name
GROUP BY gp_segment_id;
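The per-segment counts can also be reduced to a single figure by wrapping the same query in a subquery. This is only a minimal sketch: table_name is the same placeholder as above, and the aliases (rows_on_segment, min_to_max_pct and so on) are made up for illustration. Note that a segment holding no rows does not appear in the GROUP BY output at all, so the second statement counts the primary segments in the gp_segment_configuration catalog for comparison.

```sql
-- Summarize the per-segment row counts of one table.
-- min_to_max_pct near 100 suggests an even distribution.
SELECT count(*)                               AS segments_with_rows,
       min(rows_on_segment)                   AS min_rows,
       max(rows_on_segment)                   AS max_rows,
       round(avg(rows_on_segment))            AS avg_rows,
       round(100.0 * min(rows_on_segment)
                   / max(rows_on_segment), 2) AS min_to_max_pct
FROM (
    SELECT gp_segment_id, count(*) AS rows_on_segment
    FROM table_name            -- placeholder table name
    GROUP BY gp_segment_id
) AS per_segment;

-- Number of primary segments in the cluster (content = -1 is the master).
-- If this is larger than segments_with_rows, some segments hold no rows.
SELECT count(*) AS primary_segments
FROM gp_segment_configuration
WHERE content >= 0
  AND role = 'p';
```

On an empty table the first statement returns NULL for the row figures, since there are no groups to aggregate.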

Two gp_toolkit views also report data skew:

gp_toolkit.gp_skew_coefficients: This view shows data distribution skew by calculating the coefficient of variation (CV) for the data stored on each segment.
gp_toolkit.gp_skew_idle_fractions: This view shows data distribution skew by calculating the percentage of the system that is idle during a table scan, which is an indicator of processing data skew.
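Both views can be queried like ordinary tables. The sketch below lists the ten most skewed tables according to each view. The column names (skcnamespace, skcrelname, skccoeff, sifnamespace, sifrelname, siffraction) are the ones I recall from the Greenplum documentation, so treat them as an assumption and confirm them with \d gp_toolkit.gp_skew_coefficients on your version. The LIMIT of 10 is an arbitrary choice.

```sql
-- Tables with the highest coefficient of variation across segments.
-- Lower skccoeff values indicate a more even distribution.
SELECT skcnamespace, skcrelname, skccoeff
FROM gp_toolkit.gp_skew_coefficients
ORDER BY skccoeff DESC
LIMIT 10;

-- Tables with the highest idle fraction during a table scan.
-- A siffraction of 0.1 corresponds to roughly 10% skew.
SELECT sifnamespace, sifrelname, siffraction
FROM gp_toolkit.gp_skew_idle_fractions
ORDER BY siffraction DESC
LIMIT 10;
```

These views compute their figures by scanning the tables, so on a large database they may take a while to return.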

Jun 30, 2017, Anvesh Patel
