The best way to learn something is to try it yourself. So we've turned a bunch of the demos our instructors do into Do-It-Now activities, where students can try out each concept in a quick, bite-size exercise.

If you are using these as part of an Instructor-Led Training (ILT), listen to your instructor and coordinate your efforts with their direction.

Overview

  1. We have 18 Do-It-Now activities for BigQuery concepts.
  2. These activities are all prefixed with BigQuery.
  3. Many of these activities will cost money.

Sample data

  1. Many of the activities use data in the roi-bq-demos.bq_demo dataset.
  2. Other activities rely on additional sample tables derived from the tables in the bq_demo dataset. You can use the directions found here to generate the required tables.
  3. There is a much smaller dataset that also works with the demos: roi-bq-demos.bq_demo_small. Queries against it run much faster, so they are less effective at illustrating query-speed benefits.

Setting up

  1. Log into Qwiklabs and start the Data to Insights Lab.
  2. Using the provided credentials, open the Google Cloud Console, and navigate to the BigQuery UI.
  3. In a separate browser tab, open https://github.com/roitraining/gcp-demos/blob/master/bq-demos/wiki_queries.sql.

Running a query

  1. Enter the query from the GitHub file into the BigQuery code editor (a sketch appears below), and change the term you are searching for to whatever word you like.
  2. Run the query. Check the Job Information pane and note the query execution time and the bytes processed.
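
For reference, here's a minimal sketch of the kind of query wiki_queries.sql contains (the exact query in the file may differ; 'bicycle' is just an example search term, and Wiki1M is the full name of the 1M table):

select title, sum(views) as total_views
from `bigquery-samples.wikipedia_benchmark.Wiki1M`
where title like '%bicycle%'
group by title
order by total_views desc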

Feel the power!

  1. Modify your query, changing the table name from 1M to 10M. Rerun the query and note the job info.
  2. Repeat the process with the 100M, 1B, 10B and 100B tables.

Setting up

  1. Open the console
  2. Go to BigQuery page
  3. Expand your project. If working in Qwiklabs, note that there are no resources yet.

Starring projects

  1. Star the bigquery-samples project by name
  2. Star the bigquery-public-data project by name
  3. Star the roi-bq-demos project by name
  4. Star any additional projects suggested by your instructor

Exploring datasets

  1. Expand bigquery-samples in the Explorer pane.
  2. Note the datasets; expand one or two of them. A dataset is a collection of assets; it can be secured as a unit and lives in a specific location.
  3. Click on wikipedia_benchmark in the Explorer pane; look at the metadata displayed in right pane.
  4. Click on the three-dot menu to the right of a dataset in Explorer and note the available actions

Creating a dataset

  1. In your project, create a new dataset called class
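
If you prefer DDL to the UI, here's a minimal sketch (assuming your Qwiklabs project is the active project):

create schema class
options (location = 'US')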

Exploring tables

  1. Review tables in the bigquery-samples.wikipedia_benchmark dataset
  2. Click on 100B table
  3. Review the table schema shown on the right
  4. Look at table details
  5. Look at preview data
  6. Check out the three-dot menu next to the table name on the left

Setting up

  1. Make sure you have the BigQuery console open.
  2. In a separate browser tab, open https://github.com/roitraining/gcp-demos/blob/master/bq-do-it-nows/views.sql

Making the case for views

  1. Enter the query with subquery query into the BigQuery editor
  2. Review the query - what does it do?
  3. Examine just the subquery
  4. Run just the subquery and note results (CMD-e)
  5. Next, run the whole query
    a. It runs the subquery first, and generates an in-memory table
    b. It then runs the outer query against the in-memory table
    c. Run the query and check the results
  6. Rather than using subqueries, you can use WITH clauses. These behave effectively the same. Try running the with clause query in BigQuery

What if you like that subquery a lot, and would like to use it in many other queries? Rather than copy/paste over and over, you can create a view. A view is just a saved, shared subquery.
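
For illustration, a minimal sketch of a view definition (the actual create view query is in views.sql and may differ; the column names here are hypothetical):

create or replace view class.order_counts as
select cust_id, count(*) as order_count  -- cust_id is a hypothetical column name
from `roi-bq-demos.bq_demo.orders`
group by cust_id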

  1. Run the create view query in BigQuery
  2. Check out the view in the UI
  3. Now try running the query view query, which queries the view

See the jobs you've run

  1. Click on the history links at the bottom of the page and review the queries that you've run

Setting up

  1. Make sure you have the BigQuery console open.
  2. In a separate browser tab, open https://github.com/roitraining/gcp-demos/blob/master/bq-do-it-nows/udfs.sql

Messy numbers

  1. Look at the messy_text table in the roi-bq-demos.bq_demo dataset; check out the data in the preview tab.
  2. Run the trim strings query. What does it do? How does it do it?
  3. Review and run the create udf query
  4. Select the tidy_string function in the routines section of the Explorer pane; review the details.
  5. Run the query with SQL UDF query
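
For reference, a minimal sketch of a SQL UDF along the lines of tidy_string (the actual definition is in udfs.sql and may differ):

create or replace function class.tidy_string_example(s string)
returns string
-- collapse runs of whitespace to single spaces, then trim the ends
as (trim(regexp_replace(s, r'\s+', ' ')))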

Getting numbers from text

  1. Review the number_string table in the roi-bq-demos.bq_demo dataset; check out the data in the preview tab.
  2. Review and run the create javascript udf query
  3. Review the get_numbers UDF in the routines section
  4. Run the query with javascript udf query
  5. Look at documentation for stored procedures: https://cloud.google.com/bigquery/docs/procedures
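
For comparison, a minimal sketch of a JavaScript UDF along the lines of get_numbers (the actual definition is in udfs.sql and may differ; note that JavaScript UDFs use FLOAT64 rather than INT64 for numbers):

create or replace function class.get_numbers_example(s string)
returns array<float64>
language js as r"""
  // match runs of digits; return an empty array when none are found
  var matches = s.match(/\d+/g) || [];
  return matches.map(Number);
"""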

Setting up

  1. Make sure you have the BigQuery console open.

Taking the tour

  1. Notice that you can have multiple tabs open
  2. Write a query to find all CA customers from roi-bq-demos.bq_demo. Use tab to autocomplete things like SQL keywords and table references.
  3. Add a comment at the top of the query using either # or -- at the beginning of the comment. Try toggling the comment using CMD-/
  4. Copy/paste the contents of https://github.com/roitraining/gcp-demos/blob/master/bq-do-it-nows/udfs.sql into an editor tab
  5. Select OR REPLACE
  6. Press CTRL/CMD-D several times to multi-select
  7. Hit backspace or delete to change all the lines
  8. Click the shortcut button to get a list of shortcuts
  9. Press F1 (or right-click in editor tab and choose show command palette) to see more options (including making the font bigger!)

Setting up

  1. Make sure you have the BigQuery console open.

Working with cache

  1. Run this query:
select * from `roi-bq-demos.bq_demo.customer` where cust_state = "AK"
  2. Look at JOB INFORMATION
  3. Note the query duration and bytes processed
  4. Click on the link for the Destination table. This temp table holds the results of your query and lasts for 24 hours
  5. Re-run the query
  6. Note the duration and bytes processed
  7. Disable the cache and re-run the query

Setting up

  1. Make sure you have the BigQuery console open.

Working with saved queries

  1. Write a query to find orders on one day in 2018 (from roi-bq-demos.bq_demo.orders)
  2. Save the query
  3. Close all the editor tabs
  4. Click on your saved query in the Explorer panel
  5. Run the query again

Setting up

  1. Make sure you have the BigQuery console open
  2. Open the Cloud Shell

Creating the time travel table

  1. In a new browser tab, open https://github.com/roitraining/gcp-demos/blob/master/bq-do-it-nows/time_travel.sh
  2. Review the contents of the file
  3. Copy the contents of the file and paste them into your Cloud Shell terminal window.

Running queries against the current, fully populated table

  1. In a new browser tab, open https://github.com/roitraining/gcp-demos/blob/master/bq-do-it-nows/time_travel.sql.
  2. Run the -- view up-to-date table query. This should show you the total number of orders per month, with values for Jan. and Feb.

Running a time travel query, finding results from specific time

  1. Copy the -- view table with only initial load query from the SQL file in Github and paste it into your BigQuery editor.
  2. Replace target with the time travel target value that was output in your Cloud Shell window. It should look like this: 1654541783
  3. Run the query. This should show you the total number of orders per month, but from the table as it looked after its initial load, when only January data was stored.
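
If you'd like to see the shape of such a query, here's a minimal sketch (the actual query is in time_travel.sql; the table and column names are hypothetical, and 1654541783 is just the example target from above):

select extract(month from order_date) as month, count(*) as order_count
from `your-project.time_travel_demo.orders`
  for system_time as of timestamp_seconds(1654541783)
group by month
order by month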

Restore a previous version of the table to a new table

  1. Copy the -- create restoration table query from the SQL file in Github and paste it into your BigQuery editor.
  2. Replace target with the time travel target value that was output in your Cloud Shell window. It should look like this: 1654541783
  3. Run the query. Once the table has been created, check out the Details and Preview data. Verify that only January orders exist.
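
A minimal sketch of the restoration pattern (again with hypothetical names; the actual query is in time_travel.sql):

create or replace table `your-project.time_travel_demo.orders_restored` as
select *
from `your-project.time_travel_demo.orders`
  for system_time as of timestamp_seconds(1654541783)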

Setting up

  1. Make sure you have the BigQuery console open.
  2. Make sure you have the bigquery-public-data project starred (you'll use its noaa_gsod dataset).
  3. In the Explorer pane, expand bigquery-public-data.noaa_gsod.

Working with sharded tables

  1. Write a query that finds all the entries for stn 038110 in 1929
  2. How would you write a query that finds all the 038110 entries for 1929, 1930, and 1931?

Working with wildcard table queries

  1. Review https://cloud.google.com/bigquery/docs/querying-wildcard-tables
  2. Write a query that finds all the 038110 entries for 1929-1939 using a wildcard table
  3. How would you write a query that finds the 038110 entries in ALL the tables in the dataset?
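
If you get stuck, here's a sketch of the 1929-1939 case using a wildcard table and _TABLE_SUFFIX (the tables in this dataset follow the gsodYYYY naming pattern):

select *
from `bigquery-public-data.noaa_gsod.gsod19*`
where stn = '038110'
  and _table_suffix between '29' and '39'

Using gsod* with no suffix filter would search every table in the dataset.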

Setting up

  1. Make sure you have the BigQuery console open.
  2. In a new browser tab, open https://github.com/roitraining/gcp-demos/blob/master/bq-do-it-nows/bq_hive.sql
  3. In another new browser tab, open https://console.cloud.google.com/storage/browser/jwd-gcp-demos/orders_partitioned

Create an external table

  1. In the BigQuery UI's Explorer pane, select your dataset
  2. From the three-dot menu, select Create table
  3. In Create table from, select Google Cloud Storage
  4. In Select file..., enter jwd-gcp-demos/orders_partitioned/*
  5. In File format, select Parquet
  6. Check Source data partitioning
  7. In Select Source... enter gs://jwd-gcp-demos/orders_partitioned/
  8. In Table, enter ext_part
  9. In Table type select External table
  10. In Schema, select Auto detect
  11. Click CREATE TABLE
  12. Check your table details
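
The UI steps above can also be expressed as DDL; here's a hedged sketch (assuming the class dataset you created earlier):

create external table class.ext_part
with partition columns
options (
  format = 'PARQUET',
  uris = ['gs://jwd-gcp-demos/orders_partitioned/*'],
  hive_partition_uri_prefix = 'gs://jwd-gcp-demos/orders_partitioned/'
)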

Query the external table

  1. Run the query the external table query
  2. Run the query external table with where clause query
  3. Run the query external table on partition query

Setting up

  1. Make sure you have the BigQuery console open.
  2. In a new browser tab, open https://github.com/roitraining/gcp-demos/blob/master/bq-schema-demo/norm_query.sql
  3. Copy the query into the BQ query editor
  4. Replace all instances of <project-id> with roi-bq-demos (or whatever project your instructor directs you to use)

Try out query with joins

  1. Run the query and review the Job Information and Execution Details

Setting up

  1. Make sure you have the BigQuery console open.
  2. View the denorm table in the project shared with you by your instructor
  3. In a new browser tab, open https://github.com/roitraining/gcp-demos/blob/master/bq-schema-demo/denorm_query.sql
  4. Copy the query into the BQ query editor
  5. Replace <project-id> with the instructor's project name

Try out query without joins

  1. Run the query and review the Job Information and Execution Details

Setting up

  1. Make sure you have the BigQuery console open.
  2. In a new browser window, open https://github.com/roitraining/gcp-demos/blob/master/bq-do-it-nows/arrays.sql

Populating Arrays

  1. Read and run the populate arrays explicitly query, and review the results
  2. Read and run the populate arrays using array_agg query, and review the results
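
The two queries are probably along these lines (sketches only; the actual queries are in arrays.sql):

-- populate an array explicitly
select [1, 2, 3, 4] as num_array;

-- populate an array with array_agg
select array_agg(distinct cust_state) as states
from `roi-bq-demos.bq_demo.customer`;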

Array lengths

  1. In the Explorer pane, drill down to bigquery-public-data.github_repos.commits.
  2. Read and run the report array length query, and review the results.
  3. Write a query to find all the rows where the length of the difference array is equal to five (hint: you can use the find by array length query from the git file).

UNNEST

  1. Review and run the select basic array from array query. Review the results.
  2. Click on the JSON tab in the Query results area, and note the structure of the results. There's an array with a single row object. That object has a single column, which is a four-element array.
  3. Review and run the select table from array query. Review the results.
  4. Click on the JSON tab in the Query results area, and note the structure of the results. There are four rows, each with one column which is a scalar value. Unnest flattens the array into a table.
  5. Review and run the calculate average of array query.
  6. Review and run the basic correlated cross join query. The CTE at the top just creates an initial arrays table with two rows and two columns. Review the standard results output and the JSON output.
  7. Review and run the comma correlated cross join query.
  8. It turns out that if you're doing a correlated cross join, you don't even need to write the UNNEST explicitly; it's done for you implicitly. Review and run the comma implicit unnest query. (A sketch of these patterns appears after this list.)
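
To summarize the patterns, here's a minimal sketch of UNNEST and a correlated cross join (hypothetical data, matching the shape of the CTE described above):

-- unnest flattens an array into a table: one row per element
select n
from unnest([1, 2, 3, 4]) as n;

-- correlated cross join: one output row per array element per input row
with arrays as (
  select 1 as id, [1, 2] as num_array
  union all
  select 2 as id, [3, 4] as num_array
)
select id, n
from arrays, unnest(num_array) as n;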

Querying on array contents

  1. Review and run the -- find row where num_array contains 2 - take 1 query. Review the results.
  2. Edit the query, modifying the CTE so that the first row in the arrays table is [2, 2, 3, 4]. Re-run the query and note the results.
  3. Edit the query to search for rows that have an 8 in the array. Run the query and review the results. Are they correct?
  4. Run the -- find row where num_array contains 2 - take 2 query.
  5. Run the -- find row where num_array contains 2 - take 3 query.

All three of these queries return the same results, but there are differences in performance on large datasets. Let's explore that.

  1. Run each of the find commits that... queries, noting the time taken for each to run.
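
For reference, the three takes likely differ along these lines (a sketch; the exact queries are in arrays.sql):

with arrays as (
  select 1 as id, [1, 2, 3, 4] as num_array
  union all
  select 2 as id, [5, 6, 7, 8] as num_array
)
select id, num_array
from arrays
where 2 in unnest(num_array)  -- variant 1: membership test with IN UNNEST
-- variant 2: where exists (select 1 from unnest(num_array) as n where n = 2)
-- variant 3: select distinct id from arrays, unnest(num_array) as n where n = 2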

Setting up

  1. Make sure you have the BigQuery console open.
  2. View the nested_once table in your instructor's project
  3. In a new browser window, open https://github.com/roitraining/gcp-demos/blob/master/bq-schema-demo/nested_queries.sql

Querying the nested table

  1. Read and run the find sales/zip for march from nested_once table query (replace with the instructor's project name)

Setting up

  1. Make sure you have the BigQuery console open.
  2. Review the table_nested_partitioned table in your instructor's project
  3. In a new browser window, open https://github.com/roitraining/gcp-demos/blob/master/bq-schema-demo/nested_queries.sql

Querying the partitioned table

  1. Read and run the find sales/zip for march from nested/partitioned table query (replace with the instructor's project name)

Setting up

  1. Make sure you have the BigQuery console open.
  2. Review the table_nested_partitioned_clustered table in your instructor's project
  3. In a new browser window, open https://github.com/roitraining/gcp-demos/blob/master/bq-schema-demo/nested_queries.sql

Querying the partitioned and clustered table

  1. Find the total sales for the 8754 zip code by running the find sales for 6 months in 8754 from nested query (replace with the instructor's project name). Note the duration and data processed.
  2. Run the find for 6 months in 8754 from nested/partitioned query (replace with the instructor's project name) and note the duration and data processed.
  3. Run the find for 6 months in 8754 from nested/partitioned/clustered query (replace with the instructor's project name) and note the duration and data processed.

Setting up

  1. Make sure you have the BigQuery console open.

Creating and using a derived table

  1. Write and run a query to find all the March 2018 orders from the instructor's nested/repeated table, projecting all the columns except the customer email and phone number, and storing the results in a derived table in your class dataset.
  2. Write and run a query to sum sales for AK customers' orders in March 2018, using the full nested/repeated table. Note the query duration and bytes processed.
  3. Write and run a query to sum sales for AK customers' orders in March 2018, using the derived table. Note the query duration and bytes processed.
  4. Schedule the query to run daily at 12:01 AM, overwriting the previous contents of the derived table.
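
A sketch of the derived-table step, with hypothetical table and column names (your instructor's table will differ):

create or replace table class.march_2018_orders as
select * except (cust_email, cust_phone)  -- hypothetical column names
from `instructor-project.dataset.orders_nested`  -- hypothetical table reference
where order_date between date '2018-03-01' and date '2018-03-31'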

Setting up

  1. Make sure you have the BigQuery console open.
  2. Review the roi-bq-demos.bq_demo.order_mv materialized view
  3. In a new browser window, open https://github.com/roitraining/gcp-demos/blob/master/bq-do-it-nows/mv.sql

Querying the materialized view

  1. Run the query in mv.sql that targets the materialized view.

Querying the original tables

  1. Run the -- query against original tables... query.
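
For reference, a sketch of what a materialized view definition looks like in BigQuery (this is not the actual definition of order_mv; the column names are hypothetical):

create materialized view class.order_mv_example as
select cust_id, sum(order_amount) as total_sales  -- hypothetical columns
from `roi-bq-demos.bq_demo.orders`
group by cust_id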

Setting up

  1. Make sure you have the BigQuery console open.
  2. In a new browser window, open https://github.com/roitraining/gcp-demos/blob/master/bq-do-it-nows/approx.sql

Run exact and approx queries side-by-side

  1. Run the -- First query - exact query in one editor tab. This query finds the number of unique article titles in the 106B-row wikipedia table.
  2. Open a second editor tab, and run the -- Second query - approx query. This approximates the number of unique article titles in the same table.
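
The pair of queries likely looks something like this (sketches; the exact queries are in approx.sql):

-- exact: counts every distinct title (slow; shuffles all distinct values)
select count(distinct title)
from `bigquery-samples.wikipedia_benchmark.Wiki100B`;

-- approximate: HyperLogLog++ estimate, typically within about 1% (much faster)
select approx_count_distinct(title)
from `bigquery-samples.wikipedia_benchmark.Wiki100B`;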

Compare results

  1. Calculate the % difference in query execution time between the two queries
  2. Calculate the % difference in reported number of unique titles


The following activities cover a range of relevant Data Engineering topics. Enjoy!

1. Overview

This activity is intended to illustrate a variety of techniques, including:

Managing and using Pub/Sub with Python

Various Apache Beam tricks for processing streams of data

2. Setting up

  1. Open the Google Cloud console
  2. Activate Cloud Shell
  3. Click the button to open Cloud Shell in a new browser window
  4. Clone the gcp-demos repository into Cloud Shell with the following commands:
git clone https://github.com/roitraining/gcp-demos.git
cd gcp-demos/dflow-bq-stream-python
  5. Run the setup.sh script, providing the name of the service account you want the demo code to use. For example (with demo-sa as the service account name):
. ./setup.sh demo-sa
  6. Take a few minutes to review and understand the setup script. The diagram below indicates what's happening in the script.

3. Starting the pipeline

  1. Make sure that you are in the dflow-bq-stream-python directory in your Cloud Shell window.
  2. Deploy the Dataflow job with the following command (you'll review the code in a minute):
python3 process_events.py \
--runner DataflowRunner \
--region us-central1 \
--project $PROJECT_ID \
--staging_location gs://$PROJECT_ID-dflow-demo/ \
--temp_location gs://$PROJECT_ID-dflow-demo/

4. Starting the event stream

  1. Run the following command to start sending messages:
python3 send_events.py \
    --project_id=$PROJECT_ID

5. Understanding the code

  1. In a new browser tab, open https://github.com/roitraining/gcp-demos/blob/master/bq-do-it-nows/dflow-bq-stream-python/send_events.py
  2. Take a few minutes to review and understand the code. The diagram below highlights some key features:

  3. Open https://github.com/roitraining/gcp-demos/blob/master/bq-do-it-nows/dflow-bq-stream-python/process_events.py
  4. Take a few minutes to review and understand the code. The diagrams below highlight some key features

6. Check out the results

  1. In the console, go to Dataflow and click on the running job.
  2. Click on the top node of the pipeline diagram. Wait until you see metrics showing the number of received messages on the right-hand side (it takes a while for the cluster to spin up and start processing messages).
  3. Explore the pipeline and the execution metrics in Dataflow.
  4. In the console, go to BigQuery and explore your new dflow_demo dataset. Check out the DETAILS and PREVIEW sections for both of the tables in that dataset. Note that the nested table has one row per window, with an array of structs, each struct representing one message for that window. Note also that the DETAILS section shows the rows in the streaming buffer rather than in the BigQuery storage service.

7. Clean up

  1. Stop the Dataflow job via the console
  2. Stop the sending of messages by closing the Cloud Shell window
  3. Delete the Pub/Sub topic and subscription
  4. Delete the BigQuery dataset