
Understanding External Tables in Databricks

In Databricks, external tables are tables whose metadata is registered in Databricks but whose underlying data is stored outside of Databricks-managed storage. They are typically used to query data that already resides in external sources such as cloud object storage (e.g., Amazon S3, Azure Blob Storage), databases (e.g., Amazon Redshift, Azure SQL Database), or data lakes.
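
To make this concrete, here is a minimal sketch of defining an external table over Parquet files that already sit in cloud storage, using PySpark. The schema and table name (sales.orders_ext) and the S3 path are illustrative placeholders, not names from any real system.

```python
from pyspark.sql import SparkSession

# In a Databricks notebook, `spark` already exists; getOrCreate()
# simply reuses it there, and builds a local session elsewhere.
spark = SparkSession.builder.getOrCreate()

# Register an external table over Parquet files that already live in
# cloud storage: the LOCATION clause is what makes the table external.
# Databricks records only metadata; the files in S3 stay where they are.
# The schema, table name, and bucket path are placeholders.
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales.orders_ext
    USING PARQUET
    LOCATION 's3://my-bucket/raw/orders/'
""")
```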


Key points about external tables in Databricks

  1. Definition: External tables are defined with SQL CREATE TABLE statements that point at an external location (as in the sketch above) or by registering existing data sources as tables in Unity Catalog within Databricks.
  2. Data Location: Unlike managed tables, whose data lives in storage that Databricks controls, external tables reference data that is stored externally. The actual data stays in the external source; Databricks maintains only the table's metadata and schema, so dropping an external table leaves the underlying files intact.
  3. Data Formats: External tables support a range of data formats, including Parquet, ORC, CSV, JSON, Avro, and Delta Lake. The format is specified when the table is created or registered, and Databricks provides built-in connectors to read and write these formats in external sources.
  4. Querying: Once defined, external tables can be queried with SQL or the DataFrame API in Databricks, just like managed tables; Databricks handles reading the data from the external source (see the sketch after this list).
  5. Performance: Databricks optimizes queries against external tables by pushing work down toward the data whenever possible, for example partition pruning and predicate pushdown, so that only the relevant files and row groups are read. This minimizes data movement and keeps query performance close to that of managed tables.
  6. Integration: External tables enable seamless integration with external data sources, allowing users to analyze and process data stored in different systems without needing to move or copy the data into Databricks. This simplifies data access and reduces data duplication and storage costs.
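
Building on the sketch above, here is a hedged example of the two query paths mentioned in point 4, along with the kind of filter that benefits from the pushdown described in point 5. The column names (order_date, amount) are assumed for illustration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# SQL path: aggregate directly against the external table. Spark reads
# only the Parquet columns the query needs (column pruning).
daily_totals = spark.sql("""
    SELECT order_date, SUM(amount) AS total
    FROM sales.orders_ext
    GROUP BY order_date
""")
daily_totals.show()

# DataFrame API path: the same table, accessed as a DataFrame. The
# filter below can be pushed down to the Parquet scan, so irrelevant
# files and row groups are skipped rather than transferred.
orders = spark.table("sales.orders_ext")
orders.filter(orders.order_date >= "2024-01-01").show()
```

For equivalent queries, both paths compile to the same optimized plan, so choosing SQL or DataFrames is a matter of style rather than performance.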

In short, external tables in Databricks provide a flexible and efficient way to access and analyze data stored in external sources, letting users apply Databricks' processing and analytics to data wherever it lives.
