SparkDFExecutionEngine
- class great_expectations.execution_engine.SparkDFExecutionEngine(*args, persist: bool = True, spark_config: Optional[dict] = None, spark: Optional[pyspark.sql.session.SparkSession] = None, force_reuse_spark_context: Optional[bool] = None, **kwargs)
- SparkDFExecutionEngine implements the ExecutionEngine API to support computations on the Spark platform.
- This class holds an attribute spark_df, which is a pyspark.sql.DataFrame.
- The constructor builds a SparkDFExecutionEngine using the provided configuration parameters.
- Parameters
- *args – Positional arguments for configuring SparkDFExecutionEngine 
- persist – If True (default), then creation of the Spark DataFrame is done outside this class 
- spark_config – Dictionary of Spark configuration options. If there is an existing Spark context, the spark_config will be used to update that context in environments that allow it. In local environments the Spark context will be stopped and restarted with the new spark_config. 
- spark – A PySpark session to use for the SparkDFExecutionEngine being configured. If provided, it overrides spark_config. 
- force_reuse_spark_context – If True, reuse the existing SparkSession if one exists and is active. 
- **kwargs – Keyword arguments for configuring SparkDFExecutionEngine 
 - Deprecated since version 1.0: The force_reuse_spark_context attribute is no longer part of any Spark Datasource classes. The existing Spark context will be reused if possible. If a spark_config is passed that doesn’t match the existing config, the context will be stopped and restarted in local environments only. 
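As an illustration of the constructor parameters above, here is a minimal sketch of how the keyword arguments might be assembled. The helper names build_engine_kwargs and make_engine are hypothetical, and the app name and partition count are arbitrary example values. The deprecated force_reuse_spark_context is deliberately left out. Calling make_engine assumes that pyspark and great_expectations are installed.

```python
from typing import Any, Dict


def build_engine_kwargs(shuffle_partitions: int = 8) -> Dict[str, Any]:
    """Assemble keyword arguments for SparkDFExecutionEngine (hypothetical helper)."""
    return {
        "persist": True,  # the documented default
        "spark_config": {
            # Standard Spark configuration keys; the values are example choices.
            "spark.app.name": "gx_example",
            "spark.sql.shuffle.partitions": str(shuffle_partitions),
        },
    }


def make_engine(**overrides: Any):
    """Create the engine. The import is deferred so that the helper above
    can be used without Spark installed (hypothetical helper)."""
    from great_expectations.execution_engine import SparkDFExecutionEngine

    kwargs = build_engine_kwargs()
    kwargs.update(overrides)
    return SparkDFExecutionEngine(**kwargs)
```

If an already configured SparkSession is available, it could instead be passed through the spark parameter, for example make_engine(spark=session), in which case the documentation above says it takes precedence over spark_config.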
 - get_compute_domain(domain_kwargs: dict, domain_type: Union[str, great_expectations.core.metric_domain_types.MetricDomainTypes], accessor_keys: Optional[Iterable[str]] = None) -> Tuple[pyspark.sql.dataframe.DataFrame, dict, dict]
- Uses a DataFrame and Domain kwargs (which include a row condition and a condition parser) to obtain and/or query a Batch of data.
- Returns a Spark DataFrame along with the Domain arguments required for computing. If the Domain is a single column, the column is added to the accessor Domain kwargs and used for later access.
- Parameters
- domain_kwargs (dict) – a dictionary consisting of the Domain kwargs specifying which data to obtain 
- domain_type (str or MetricDomainTypes) – an Enum value indicating which metric Domain the user would like to be using, or a corresponding string value representing it. String types include “identity”, “column”, “column_pair”, “table” and “other”. Enum types include capitalized versions of these from the class MetricDomainTypes. 
- accessor_keys (str iterable) – keys that are part of the compute Domain but should be ignored when describing the Domain and simply transferred with their associated values into accessor_domain_kwargs. 
 
- Returns
- A tuple including:
- a DataFrame (the data on which to compute) 
- a dictionary of compute_domain_kwargs, describing the DataFrame 
- a dictionary of accessor_domain_kwargs, describing any accessors needed to identify the Domain within the compute domain 
 
- Return type
- Tuple[pyspark.sql.dataframe.DataFrame, dict, dict] 
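The split between compute_domain_kwargs and accessor_domain_kwargs described above can be sketched in plain Python. This is a simplified model and not the library's implementation: the function name split_domain_kwargs is hypothetical, only the single-column case and accessor_keys are handled, and the row_condition and condition_parser values are illustrative.

```python
import copy
from typing import Any, Dict, Iterable, Optional, Tuple


def split_domain_kwargs(
    domain_kwargs: Dict[str, Any],
    domain_type: Any,
    accessor_keys: Optional[Iterable[str]] = None,
) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Return (compute_domain_kwargs, accessor_domain_kwargs). Hypothetical helper."""
    compute = copy.deepcopy(domain_kwargs)  # never mutate the caller's dict
    accessor: Dict[str, Any] = {}

    # Explicit accessor keys are moved over together with their values.
    for key in accessor_keys or []:
        if key in compute:
            accessor[key] = compute.pop(key)

    # Accept either a plain string or an Enum member with a string value.
    type_name = str(getattr(domain_type, "value", domain_type)).lower()
    if type_name == "column":
        if "column" in compute:
            accessor["column"] = compute.pop("column")
        elif "column" not in accessor:
            raise ValueError("a column Domain requires a 'column' key")

    return compute, accessor


compute_kwargs, accessor_kwargs = split_domain_kwargs(
    {"column": "age", "row_condition": "age IS NOT NULL", "condition_parser": "spark"},
    "column",
)
table_compute, table_accessor = split_domain_kwargs({"row_condition": "age > 0"}, "table")
```

With the real engine, the corresponding call would be engine.get_compute_domain(domain_kwargs, domain_type="column"), which additionally returns the DataFrame as the first element of the tuple.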
 
 - get_domain_records(domain_kwargs: dict) -> pyspark.sql.dataframe.DataFrame
- Uses the given Domain kwargs (which include row_condition, condition_parser, and ignore_row_if directives) to obtain and/or query a batch.
- Parameters
- domain_kwargs (dict) – a dictionary consisting of the Domain kwargs specifying which data to obtain 
- Returns
- A DataFrame (the data on which to compute, returned as a Spark DataFrame)
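To make the effect of the row filtering directives concrete, here is a small plain-Python model that uses a list of dictionaries in place of a Spark DataFrame. It is a sketch under several assumptions and not the engine's code: the function name domain_records is hypothetical, the row condition is modeled as a Python callable (the real engine takes a row_condition string interpreted by condition_parser), and the ignore_row_if values and the column_A and column_B keys shown are assumed for a column-pair Domain.

```python
from typing import Any, Callable, Dict, List, Optional

Row = Dict[str, Any]


def domain_records(
    rows: List[Row],
    domain_kwargs: Dict[str, Any],
    predicate: Optional[Callable[[Row], bool]] = None,
) -> List[Row]:
    """Filter rows the way a row condition plus ignore_row_if might (hypothetical helper)."""
    # Step 1: apply the row condition, if any.
    selected = [r for r in rows if predicate is None or predicate(r)]

    # Step 2: apply the ignore_row_if directive, if any.
    ignore = domain_kwargs.get("ignore_row_if")
    if ignore in (None, "neither"):
        return selected

    col_a, col_b = domain_kwargs["column_A"], domain_kwargs["column_B"]
    if ignore == "both_values_are_missing":
        return [r for r in selected if not (r.get(col_a) is None and r.get(col_b) is None)]
    if ignore == "either_value_is_missing":
        return [r for r in selected if r.get(col_a) is not None and r.get(col_b) is not None]
    raise ValueError(f"unsupported ignore_row_if value: {ignore!r}")


rows = [
    {"a": 1, "b": 2},
    {"a": None, "b": 3},
    {"a": None, "b": None},
    {"a": -1, "b": 5},
]


def non_negative(r: Row) -> bool:
    # Stand-in for a row condition: keep rows whose "a" is missing or >= 0.
    return r["a"] is None or r["a"] >= 0


pair = {"column_A": "a", "column_B": "b"}
both = domain_records(rows, {**pair, "ignore_row_if": "both_values_are_missing"}, non_negative)
either = domain_records(rows, {**pair, "ignore_row_if": "either_value_is_missing"}, non_negative)
```

In this model the row condition is applied first and the ignore_row_if directive second. That ordering is a choice made for the sketch, not something the documentation above specifies.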