BigQuery Schema Design

Irvi Aini
Google Cloud - Community
3 min read · Aug 20, 2022


Designing a schema requires us to think about how queries will access the data. There are several possible access patterns, such as access by rows or access by columns. Different approaches generally behave differently, and we should keep in mind that a given approach may not be parallelizable.

Normalizing data means turning it into a relational system. This stores the data efficiently, makes query processing a clear and direct task, and increases the orderliness of the data; it is useful for saving space. Normalization usually happens when a schema is designed for a database. Denormalizing is the strategy of allowing duplicate field values in a table in order to gain processing performance: data is repeated rather than kept relational. Flattened data takes more storage, but it makes queries more efficient because they can be processed in parallel using columnar processing.
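As a sketch of the difference, consider a hypothetical shop dataset (the `my_project.shop` project, dataset, and column names below are illustrative, not from the article): a normalized design keeps `orders` and `customers` separate and pays for a JOIN on every query, while a denormalized design materializes the join once into a flat table.

```sql
-- Normalized: every query must JOIN orders back to customers.
SELECT o.order_id, o.amount, c.name, c.country
FROM `my_project.shop.orders` AS o
JOIN `my_project.shop.customers` AS c
  ON o.customer_id = c.customer_id;

-- Denormalized: materialize the join once, so later queries scan a
-- single flat table and parallelize over its columns. Customer fields
-- are repeated on every order row.
CREATE TABLE `my_project.shop.orders_flat` AS
SELECT o.order_id, o.amount, c.name AS customer_name, c.country
FROM `my_project.shop.orders` AS o
JOIN `my_project.shop.customers` AS c
  ON o.customer_id = c.customer_id;
```

The flat table trades storage for query speed, which is usually a good trade in BigQuery's columnar storage model.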

Denormalizing data enables BigQuery to distribute processing efficiently, resulting in better query performance, so we would usually denormalize data before loading it into BigQuery. However, there are cases where denormalization hurts performance, specifically when you have to group by a column with a one-to-many relationship. Whenever we group data, it must be shuffled, which often means transferring it over a network between servers or systems, and shuffling is slow. Fortunately, BigQuery supports columns with nested and repeated data. Nested and repeated fields are therefore useful for working with data that originates in relational databases. A nested column can be understood as a form of repeated field: it preserves the relational qualities of the original data and schema while enabling columnar and parallel processing of the repeated nested fields, and it is the best alternative for data that already has a relational pattern. Turning the relation into a nested or repeated field improves BigQuery performance and helps BigQuery work with data sources in relational databases.
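Continuing the hypothetical shop example, the one-to-many relation between customers and orders can be captured as a nested, repeated field instead of a flat join: one row per customer, with that customer's orders stored as an ARRAY of STRUCTs.

```sql
-- One row per customer; the one-to-many orders relation becomes a
-- repeated STRUCT column (ARRAY<STRUCT<...>>), so no shuffle is needed
-- later to group orders by customer.
CREATE TABLE `my_project.shop.customers_nested` AS
SELECT
  c.customer_id,
  c.name,
  ARRAY_AGG(STRUCT(o.order_id, o.order_date, o.amount)) AS orders
FROM `my_project.shop.customers` AS c
JOIN `my_project.shop.orders` AS o
  USING (customer_id)
GROUP BY c.customer_id, c.name;
```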

BigQuery has four non-orderable data types: ARRAY, STRUCT, GEOGRAPHY, and JSON. Here we would like to focus on ARRAY and STRUCT. While ARRAY values give us the flexibility to go deep into the granularity of our fields, STRUCT values allow us to go wide in our schema by grouping related fields together. Storing large reporting tables with STRUCT and ARRAY columns gives us the following benefits:

  • Gaining significant performance advantages by avoiding JOINs
  • Getting granular data from ARRAYs when needed
  • Having all the business context in one table as opposed to worrying about JOIN keys and which tables have the data
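To query such a table, the repeated field is flattened on demand with UNNEST; there are no JOIN keys to remember, because the detail rows travel inside the parent row. (The table and column names below continue the hypothetical example from above.)

```sql
-- UNNEST flattens the repeated STRUCT column back into rows at query
-- time, correlated with its parent customer row.
SELECT
  customer_id,
  name,
  o.order_id,
  o.amount
FROM `my_project.shop.customers_nested`,
  UNNEST(orders) AS o
WHERE o.amount > 100;
```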

Another way to optimize performance is to use partitioning and clustering. In a table partitioned by a date or timestamp column, each partition contains a single day of data. When the data is stored, BigQuery ensures that all the data in a block belongs to a single partition. A partitioned table maintains these properties across all operations that modify it: query jobs, data manipulation language (DML) statements, data definition language (DDL) statements, load jobs, and copy jobs. This requires BigQuery to maintain more metadata than for a non-partitioned table, and as the number of partitions increases, the metadata overhead grows. One way to optimize the tables in your data warehouse is to reduce the cost and amount of data read by partitioning them. A good practice is to require that queries always include the partition filter, and to make sure the partition field is isolated on the left side of the filter, because that is the only way BigQuery can quickly discard unnecessary partitions.

When data is written to a clustered table by a query or a load job, BigQuery sorts the data using the values in the clustering columns and uses those values to organize the data into multiple blocks in storage. When we submit a query containing a clause that filters on the clustering columns, BigQuery uses the sorted blocks to eliminate scans of unnecessary data. Similarly, when we submit a query that aggregates data based on the values in the clustering columns, performance improves because the sorted blocks co-locate rows with similar values. Partitioning provides a way to obtain accurate cost estimates for queries and guarantees improved cost and performance; clustering provides additional cost and performance benefits on top of the partitioning benefits.
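The pieces above can be combined in one table definition. The sketch below (with an assumed `events` table and columns, not taken from the article) partitions by day, clusters within each partition, and enforces the partition-filter practice via the table's `require_partition_filter` option:

```sql
-- Partition by day on the timestamp column, cluster within each
-- partition, and reject queries that omit the partition filter.
CREATE TABLE `my_project.shop.events`
(
  event_ts   TIMESTAMP,
  user_id    STRING,
  event_name STRING
)
PARTITION BY DATE(event_ts)
CLUSTER BY user_id, event_name
OPTIONS (require_partition_filter = TRUE);

-- The partition filter, isolated on the left side, lets BigQuery prune
-- partitions; the user_id filter lets it skip clustered blocks.
SELECT event_name, COUNT(*) AS n
FROM `my_project.shop.events`
WHERE DATE(event_ts) = '2022-08-01'
  AND user_id = 'abc123'
GROUP BY event_name;
```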
