Profiling

Profiling extract from source statistical information that help you understand the content of the tables, columns, etc in these sources.

This feature only applies to specific sources. The source page indicates whether it is supported.

Sample recipe

The following sample recipe configures profiling of a subset of the tables, including columns profiling.

Example 1. Profiling sample recipe.

source:
    type: postgres # A source that support Profiling.
    config:
        # Connection parameters for the source
        # ...
        # Profiling
        # Exclude tables starting with TEMP from profiling
        profile_pattern: "{'allow': ['.*'],
                           'deny': ['TEMP.*'],
                           'ignoreCase': True}"
        # Profiling Configuration
        profiling:
            # Enable profiling.
            enabled: true
            # Enable column profiling.
            profile_table_level_only: false
sink:
  # sink configuration

Selective profiling

The profile_pattern element defines using regular expression the tables and columns to include or exclude in the profiling process.

Parameter Description

Parameter	Description
`profile_pattern`	Lists of regular expressions patterns to define the tables and columns to include (`allow`) or exclude (`deny`) in the profiling process. Note that only tables included in the allowed by the `table_pattern` are profiled. Default value is `{'allow': ['.*'], 'deny': [], 'ignoreCase': True}`. The `ìgnoreCase` option ignores case sensitivity during pattern matching.

profile_pattern

Lists of regular expressions patterns to define the tables and columns to include (allow) or exclude (deny) in the profiling process. Note that only tables included in the allowed by the table_pattern are profiled.

Default value is {'allow': ['.*'], 'deny': [], 'ignoreCase': True}.

The ìgnoreCase option ignores case sensitivity during pattern matching.

Configure the profiling

In addition to the profile patterns, you can configure the profiling behavior for the source.

All these parameters must be defined under the profiling element.

Parameter

Description

enabled

Set to true to enable profiling. Default to False.

field_sample_values_limit

Maximum number of values to sample for all columns. Defaults to 20.

include_field_distinct_count

Set to true to profile the number of distinct values for each column. Defaults to True.

include_field_distinct_value_frequencies

Set to true to profile distinct value frequencies. Defaults to False.

include_field_histogram

Set to true to profile the histogram for numeric fields. Defaults to False.

include_field_max_value

Set to true to profile the max value of numeric columns. Defaults to True.

include_field_mean_value

Set to true to profile the mean value of numeric columns. Defaults to True.

include_field_median_value

Set to true to profile the median value of numeric columns. Defaults to True.

include_field_min_value

Set to true to profile the min value of numeric columns. Defaults to True.

include_field_null_count

Set to true to profile the number of nulls for each column. Defaults to True.

include_field_quantiles

Set to true to profile the quantiles of numeric columns. Defaults to False.

include_field_sample_values

Set to true to profile the sample values for all columns. Defaults to True.

include_field_stddev_value

Set to true to profile the standard deviation of numeric columns. Defaults to True.

limit

Maximum number of documents to profile. By default, profiles all documents.

max_number_of_fields_to_profile

Maximum number of columns to profile for any table. Set to None to profile all columns. Profiling cost grows with the number of columns to profile.

max_workers

Number of threads to use for profiling. Set to 1 to disable multi-threads. Defaults to 80.

offset

Offset in documents to profile. By default, uses no offset.

partition_datetime

If specified, profile only the partition matching this datetime. If not specified, profile the latest partition. Only Bigquery supports this.

partition_profiling_enabled

Set to true to profile partitioned tables. Only BigQuery supports this. If enabled, latest partition data is used for profiling. Defaults to True.

profile_if_updated_since_days

Profile only tables updated since this number of days. If set to null, profile table regardless of the last modified time. Supported only in Snowflake and BigQuery.

profile_table_level_only

Set to true to perform profiling at table-level only, or include column-level profiling as well. Defaults to False.

profile_table_row_count_estimate_only

Set to true to use an approximate (faster but less accurate) query for row count. Only supported for Postgres and MySQL. Defaults to False.

profile_table_row_limit

Profile tables only if their row count is less than this limit. If set to null, no limit on the row count of tables to profile. Supported only in Snowflake and BigQuery. Defaults to 5000000.

profile_table_size_limit

Profile tables only if their size is less than this limit in GBs. If set to null, no limit on the size of tables to profile. Supported only in Snowflake and BigQuery. Defaults to 5.

report_dropped_profiles

Set to true to report datasets or dataset columns that were not profiled. Set to True for debugging purposes. Defaults to False.

use_sampling

Set to true to profile column level stats on a sample of the tables. Supported by BigQuery and Snowflake. When enabled, profiling is done on rows sampled from table. Sampling is not done for smaller tables. Defaults to True.

sample_size

Number of rows to be sampled from table for column level profiling. Applicable only if use_sampling is set to True. Defaults to 10000.

turn_off_expensive_profiling_metrics

Set to true to turn off expensive profiling or not. This turns off profiling for quantiles, distinct_value_frequencies, histogram & sample_values. This also limits maximum number of fields being profiled to 10. Defaults to False.