Profiling
Profiling extracts statistical information from sources to help you understand the content of their tables, columns, and so on.
This feature applies only to specific sources. Each source's page indicates whether profiling is supported.
Sample recipe
The following sample recipe configures profiling for a subset of the tables, including column profiling.
source:
  type: postgres # A source that supports profiling.
  config:
    # Connection parameters for the source
    # ...
    # Profiling
    # Exclude tables starting with TEMP from profiling
    profile_pattern: "{'allow': ['.*'],
      'deny': ['TEMP.*'],
      'ignoreCase': True}"
    # Profiling configuration
    profiling:
      # Enable profiling.
      enabled: true
      # Enable column profiling.
      profile_table_level_only: false
sink:
  # sink configuration
Selective profiling
The profile_pattern element uses regular expressions to define the tables and columns to include in or exclude from the profiling process.
Parameter | Description |
---|---|
profile_pattern | Lists of regular expression patterns that define the tables and columns to include (allow) or exclude (deny). |
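To illustrate how such a pattern could select tables, the following Python sketch evaluates the allow and deny lists from the sample recipe. It assumes that a name is profiled when it matches at least one allow expression and no deny expression, and that expressions are matched from the start of the name. The function name is_profiled is hypothetical, and the actual matching logic of the product may differ.

```python
import re


def is_profiled(name, allow=(".*",), deny=(), ignore_case=True):
    """Return True if `name` matches an allow pattern and no deny pattern.

    Hypothetical helper: the matching semantics (match anchored at the start
    of the name, deny taking precedence over allow) are assumptions.
    """
    flags = re.IGNORECASE if ignore_case else 0
    allowed = any(re.match(p, name, flags) for p in allow)
    denied = any(re.match(p, name, flags) for p in deny)
    return allowed and not denied


# Values taken from the sample recipe.
pattern = {"allow": [".*"], "deny": ["TEMP.*"], "ignoreCase": True}

for table in ["customers", "TEMP_orders", "temp_load"]:
    selected = is_profiled(
        table, pattern["allow"], pattern["deny"], pattern["ignoreCase"]
    )
    print(f"{table}: {'profiled' if selected else 'skipped'}")
```

Under these assumptions, temp_load is skipped only because ignoreCase is enabled; with case-sensitive matching it would not match TEMP.* and would be profiled.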
Configure the profiling
In addition to the profile patterns, you can configure the profiling behavior for the source.
All these parameters must be defined under the profiling element.
Parameter | Description |
---|---|
|
Set to true to enable profiling. Defaults to |
|
Maximum number of values to sample for all columns. Defaults to 20. |
|
Set to true to profile the number of distinct values for each column. Defaults to |
|
Set to true to profile distinct value frequencies. Defaults to |
|
Set to true to profile the histogram for numeric fields. Defaults to |
|
Set to true to profile the max value of numeric columns. Defaults to |
|
Set to true to profile the mean value of numeric columns. Defaults to |
|
Set to true to profile the median value of numeric columns. Defaults to |
|
Set to true to profile the min value of numeric columns. Defaults to |
|
Set to true to profile the number of nulls for each column. Defaults to |
|
Set to true to profile the quantiles of numeric columns. Defaults to |
|
Set to true to profile the sample values for all columns. Defaults to |
|
Set to true to profile the standard deviation of numeric columns. Defaults to |
|
Maximum number of documents to profile. By default, profiles all documents. |
|
Maximum number of columns to profile for any table. Set to None to profile all columns. Profiling cost grows with the number of columns to profile. |
|
Number of threads to use for profiling. Set to 1 to disable multithreading. Defaults to 80. |
|
Offset in documents to profile. By default, uses no offset. |
|
If specified, profile only the partition matching this datetime. If not specified, profile the latest partition. Only BigQuery supports this. |
|
Set to true to profile partitioned tables. Only BigQuery supports this. If enabled, the data of the latest partition is used for profiling. Defaults to |
|
Profile only tables updated within this number of days. If set to null, profile tables regardless of their last modified time. Supported only in Snowflake and BigQuery. |
|
Set to true to profile at table level only. Set to false to also include column-level profiling. Defaults to |
|
Set to true to use an approximate (faster but less accurate) query for row count. Only supported for Postgres and MySQL. Defaults to |
|
Profile tables only if their row count is less than this limit. If set to null, no limit on the row count of tables to profile. Supported only in Snowflake and BigQuery. Defaults to 5000000. |
|
Profile tables only if their size is less than this limit in GBs. If set to null, no limit on the size of tables to profile. Supported only in Snowflake and BigQuery. Defaults to 5. |
|
Set to true to report datasets or dataset columns that were not profiled. Enable this for debugging purposes. Defaults to |
|
Set to true to profile column-level statistics on a sample of each table. Supported by BigQuery and Snowflake. When enabled, profiling is done on rows sampled from the table. Smaller tables are not sampled. Defaults to |
|
Number of rows to sample from the table for column-level profiling. Applicable only if |
|
Set to true to turn off expensive profiling. This turns off profiling for quantiles, distinct_value_frequencies, histogram, and sample_values. It also limits the maximum number of fields profiled to 10. Defaults to |
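
The row count, size, and last-modified limits described in the table amount to a per-table eligibility check. The following Python sketch shows one way to reason about how these limits could combine. The function and argument names are hypothetical and are not the configuration parameter names. The defaults of 5000000 rows and 5 GB come from the table above. Treating the limits as cumulative conditions is an assumption.

```python
from datetime import datetime, timedelta, timezone


def is_eligible(row_count, size_gb, last_modified, now,
                row_limit=5_000_000, size_limit_gb=5,
                updated_since_days=None):
    """Return True if a table passes every configured limit.

    Hypothetical helper. A limit set to None (null in the recipe) is ignored.
    """
    # Profile only tables whose row count is less than the limit.
    if row_limit is not None and row_count >= row_limit:
        return False
    # Profile only tables whose size in GB is less than the limit.
    if size_limit_gb is not None and size_gb >= size_limit_gb:
        return False
    # Profile only tables updated within the given number of days.
    if updated_since_days is not None:
        if last_modified < now - timedelta(days=updated_since_days):
            return False
    return True


now = datetime(2024, 1, 10, tzinfo=timezone.utc)
yesterday = now - timedelta(days=1)

print(is_eligible(1_000, 0.5, yesterday, now))       # small, recent table
print(is_eligible(20_000_000, 0.5, yesterday, now))  # exceeds the row limit
print(is_eligible(20_000_000, 0.5, yesterday, now, row_limit=None))
```

In this sketch, setting a limit to None mirrors setting the corresponding parameter to null in the recipe, which removes that condition entirely.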