The best way to learn something is to try it yourself. So we've turned a bunch of the demos our instructors do into Do-It-Now activities, where students can try out each concept in a quick, bite-size exercise.
If you are using these as part of an Instructor Led Training (ILT), make sure to listen to your instructor and coordinate your efforts with their direction.
Overview
We have 18 Do-It-Now activities for BigQuery concepts.
These activities are all prefixed with BigQuery.
Many of these activities will cost money.
Sample data
Many of the activities use data in the roi-bq-demos.bq_demo dataset
Other activities rely on additional sample tables that need to be derived from the tables in the bq_demo dataset. You can use the directions found here to generate the required tables.
There is a much smaller dataset that will also work with the demos: roi-bq-demos.bq_demo_small. Queries against this dataset run much faster, so they aren't as effective at illustrating query speed benefits.
Setting up
Log into Qwiklabs and start the Data to Insights Lab.
Using the provided credentials, open the Google Cloud Console, and navigate to the BigQuery UI.
Enter the query with subquery query into the BigQuery editor
Review the query - what does it do?
Examine just the subquery
Run just the subquery and note the results (select the subquery text and press Cmd+E, or Ctrl+E on Windows, to run only the selection)
Next, run the whole query:
a. It runs the subquery first, generating an in-memory table
b. It then runs the outer query against the in-memory table
c. Run the query and check the results
Rather than using subqueries, you can use WITH clauses, which behave effectively the same way. Try running the with clause query in BigQuery
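For reference, the two forms follow this general pattern. This is a minimal sketch; the table and column names here are hypothetical, not the actual bq_demo schema.

-- Subquery form: the inner query runs first and produces an in-memory table
SELECT state, total_sales
FROM (
  SELECT state, SUM(amount) AS total_sales
  FROM `roi-bq-demos.bq_demo.orders`   -- hypothetical table and column names
  GROUP BY state
)
WHERE total_sales > 1000;

-- Equivalent WITH (CTE) form
WITH sales_by_state AS (
  SELECT state, SUM(amount) AS total_sales
  FROM `roi-bq-demos.bq_demo.orders`   -- hypothetical table and column names
  GROUP BY state
)
SELECT state, total_sales
FROM sales_by_state
WHERE total_sales > 1000;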
What if you like that subquery a lot, and would like to use it in many other queries? Rather than copy/paste over and over, you can create a view. A view is just a saved, shared subquery.
Run the create view query in BigQuery
Check out the view in the UI
Now try running the query view query, which queries the view
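The last three steps follow this general pattern. This is a minimal sketch; the dataset, view, table, and column names are hypothetical.

-- Save the subquery as a view
CREATE OR REPLACE VIEW `your-project.your_dataset.sales_by_state_v` AS
SELECT state, SUM(amount) AS total_sales
FROM `roi-bq-demos.bq_demo.orders`   -- hypothetical table and column names
GROUP BY state;

-- Any query can now use the view as if it were a table
SELECT *
FROM `your-project.your_dataset.sales_by_state_v`
WHERE total_sales > 1000;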
See the jobs you've run
Click on the history links at the bottom of the page and review the queries that you've run
Run the -- view up-to-date table query. This should show you the total number of orders per month, with values for Jan. and Feb.
Running a time travel query, finding results from a specific time
Copy the -- view table with only initial load query from the SQL file in Github and paste it into your BigQuery editor.
Replace target with the time travel target value that was output in your Cloud Shell window. It should look like this: 1654541783
Run the query. This should show you the total number of orders per month, but from the table as it looked after its initial load, when only January data was stored.
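The time travel clause itself is FOR SYSTEM_TIME AS OF. A minimal sketch of the pattern follows; the table and column names are hypothetical, and the epoch value should be the target printed in your Cloud Shell window.

SELECT EXTRACT(MONTH FROM order_date) AS order_month,
       COUNT(*) AS order_count
FROM `your-project.your_dataset.orders`                 -- hypothetical table
  FOR SYSTEM_TIME AS OF TIMESTAMP_SECONDS(1654541783)   -- your time travel target
GROUP BY order_month
ORDER BY order_month;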
Restore a previous version of the table to a new table
Copy the -- create restoration table query from the SQL file in Github and paste it into your BigQuery editor.
Replace target with the time travel target value that was output in your Cloud Shell window. It should look like this: 1654541783
Run the query. Once the table has been created, check out the Details and Preview tabs. Verify that only January orders exist.
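A restore is just a CREATE TABLE ... AS SELECT against the time-travelled table. A minimal sketch, with hypothetical table names:

CREATE OR REPLACE TABLE `your-project.your_dataset.orders_restored` AS
SELECT *
FROM `your-project.your_dataset.orders`                  -- hypothetical table
  FOR SYSTEM_TIME AS OF TIMESTAMP_SECONDS(1654541783);   -- your time travel target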
Setting up
Make sure you have the BigQuery console open.
Make sure you have the bigquery-public-data.noaa_gsod dataset pinned.
In the Explorer pane, expand bigquery-public-data.noaa_gsod.
Working with sharded tables
Write a query that finds all the entries for stn 038110 in 1929
How would you write a query that finds all the stn 038110 entries for 1929, 1930, and 1931?
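One possible approach uses a wildcard table with a _TABLE_SUFFIX filter. This is a sketch; confirm the column names against the actual noaa_gsod schema.

-- Single year: query the sharded table directly
SELECT *
FROM `bigquery-public-data.noaa_gsod.gsod1929`
WHERE stn = '038110';

-- Several years: a wildcard table plus a _TABLE_SUFFIX filter
SELECT *
FROM `bigquery-public-data.noaa_gsod.gsod19*`
WHERE stn = '038110'
  AND _TABLE_SUFFIX IN ('29', '30', '31');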
Read and run the populate arrays explicitly query, and review the results
Read and run the populate arrays using array_agg query, and review the results
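The queries in the SQL file will differ in detail, but the two techniques look roughly like this:

-- Populate an array explicitly with an array literal
SELECT [1, 2, 3, 4] AS num_array;

-- Build an array from rows with ARRAY_AGG
SELECT ARRAY_AGG(n) AS num_array
FROM UNNEST([1, 2, 3, 4]) AS n;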
Array lengths
In the Explorer pane, drill down to bigquery-public-data.github_repos.commits.
Read and run the report array length query, and review the results.
Write a query to find all the rows where the length of the difference array is equal to five (hint: you can use the find by array length query from the git file).
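A minimal sketch of the pattern, assuming the difference column of the commits table (the actual queries are in the git file):

-- Report the length of each commit's difference array
SELECT `commit`, ARRAY_LENGTH(difference) AS diff_len
FROM `bigquery-public-data.github_repos.commits`
LIMIT 10;

-- Find rows where the difference array has exactly five elements
SELECT `commit`
FROM `bigquery-public-data.github_repos.commits`
WHERE ARRAY_LENGTH(difference) = 5
LIMIT 10;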
UNNEST
Review and run the select basic array from array query. Review the results.
Click on the JSON tab in the Query results area, and note the structure of the results. There's an array with a single row object. That object has a single column, which is a four-element array.
Review and run the select table from array query. Review the results.
Click on the JSON tab in the Query results area, and note the structure of the results. There are four rows, each with one column which is a scalar value. Unnest flattens the array into a table.
Review and run the calculate average of array query.
Review and run the basic correlated cross join query. The CTE at the top is just creating an initial arrays table with two rows and two columns. Review the standard results output and the JSON output.
Review and run the comma correlated cross join query.
It turns out that if you're doing a correlated cross join, you don't even need to write the UNNEST explicitly; it's done for you implicitly. Review and run the comma implicit unnest query.
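The core patterns from this activity, in minimal, self-contained form (the queries in the SQL file will differ in detail):

-- UNNEST turns an array into a table of rows
SELECT n
FROM UNNEST([1, 2, 3, 4]) AS n;

-- Correlated cross join: each row's own array is unnested next to that row
WITH arrays AS (
  SELECT 'a' AS id, [1, 2] AS num_array UNION ALL
  SELECT 'b', [3, 4]
)
SELECT id, n
FROM arrays, UNNEST(num_array) AS n;   -- the comma is a correlated CROSS JOIN
-- With a correlated reference, the UNNEST can even be implicit:
-- FROM arrays, arrays.num_array AS n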
Querying on array contents
Review and run the -- find row where num_array contains 2 - take 1 query. Review the results.
Edit the query, modifying the CTE so that the first row in the arrays table is [2, 2, 3, 4]. Re-run the query and note the results.
Edit the query to search for rows that have an 8 in the array. Run the query and review the results. Are they correct?
Run the -- find row where num_array contains 2 - take 2 query.
Run the -- find row where num_array contains 2 - take 3 query.
All three of these queries return the same results, but there are differences in performance on large datasets. Let's explore that.
Run each of the find commits that... queries, noting the time taken for each to run.
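For reference, the three "takes" above illustrate common ways to test array membership. A minimal, self-contained sketch of the patterns (the actual queries in the file may differ):

WITH arrays AS (
  SELECT [1, 2, 3, 4] AS num_array UNION ALL
  SELECT [5, 6, 7]
)
SELECT *
FROM arrays
-- Variant 1: membership with IN UNNEST
WHERE 2 IN UNNEST(num_array);
-- Variant 2: EXISTS over the unnested array
-- WHERE EXISTS (SELECT 1 FROM UNNEST(num_array) AS n WHERE n = 2);
-- Variant 3: count matching elements in a scalar subquery
-- WHERE (SELECT COUNT(1) FROM UNNEST(num_array) AS n WHERE n = 2) > 0;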
Setting up
Make sure you have the BigQuery console open.
View the nested_once table in your instructor's project
Find the total sales for the 8754 zip code by running the find sales for 6 months in 8754 from nested query (replacing the project placeholder with the instructor's project name). Note the duration and data processed.
Run the find for 6 months in 8754 from nested/partitioned query (replacing the project placeholder with the instructor's project name) and note the duration and data processed.
Run the find for 6 months in 8754 from nested/partitioned/clustered query (replacing the project placeholder with the instructor's project name) and note the duration and data processed.
Setting up
Make sure you have the BigQuery console open.
Querying the partitioned table
Write and run a query to find all the March 2018 orders from the instructor's nested/repeated table, projecting all the columns except the customer email and phone number, and storing the results in a derived table in your class dataset (one possible shape is sketched after these steps).
Write and run a query to sum sales for AK customers' orders in March 2018, using the full nested/repeated table. Note the query duration and bytes processed.
Write and run a query to sum sales for AK customers' orders in March 2018, using the derived table. Note the query duration and bytes processed.
Schedule the query to run daily at 12:01 AM, overwriting the previous contents of the derived table.
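For the first query, one convenient pattern is SELECT * EXCEPT, which projects every column except the ones you name. This is a minimal sketch only; every table and column name below is hypothetical, so substitute the instructor's actual schema.

CREATE OR REPLACE TABLE `your-project.your_class_dataset.march_2018_orders` AS
SELECT * EXCEPT (cust_email, cust_phone)                     -- hypothetical column names
FROM `instructor-project.instructor_dataset.orders_nested`   -- hypothetical instructor table
WHERE order_date BETWEEN '2018-03-01' AND '2018-03-31';      -- assumes a DATE column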
Setting up
Make sure you have the BigQuery console open.
Review the roi-bq-demos.bq_demo.order_mv materialized view
Run the -- First query - exact query in one editor tab. This query will find the number of unique article titles in the 106B-row Wikipedia table.
Open a second editor tab, and run the -- Second query - approx query. This will approximate the number of unique article titles in the same table.
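The two queries follow this general shape. This is a sketch only; the actual table reference is in the SQL file, and the one shown here is hypothetical.

-- Exact count of distinct titles (expensive: every distinct value must be shuffled)
SELECT COUNT(DISTINCT title) AS unique_titles
FROM `bigquery-samples.wikipedia_benchmark.Wiki100B`;   -- hypothetical table reference

-- Approximate count using HyperLogLog++ (much cheaper, with a small error)
SELECT APPROX_COUNT_DISTINCT(title) AS approx_unique_titles
FROM `bigquery-samples.wikipedia_benchmark.Wiki100B`;   -- hypothetical table reference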
Compare results
Calculate the % difference in query execution time between the two queries
Calculate the % difference in reported number of unique titles
The following activities cover a range of relevant Data Engineering topics. Enjoy!
1. Overview
This activity is intended to illustrate a variety of techniques, including:
Managing and using Pub/Sub with Python
Creating topics and subscriptions
Creating bytestring message bodies
Sending messages
Various Beam tricks when processing streams of data
Decoding bytestring message bodies
Using one branch for writing each event into a BigQuery table
Using a second branch for aggregating rows into windows and writing windows to BigQuery
Writing nested/repeated data from Beam to BigQuery
Reading the end of window time
2. Setting up
Open the Google Cloud console
Activate Cloud Shell
Click the button to open Cloud Shell in a new browser window
Clone the gcp-demos repository into Cloud Shell with the following command:
git clone https://github.com/roitraining/gcp-demos.git
cd gcp-demos/dflow-bq-stream-python
Run the setup.sh script, providing the name of the service account you want the demo code to use. For example (with demo-sa as the service account name):
. ./setup.sh demo-sa
Take a few minutes to review and understand the setup script. The diagram below indicates what's happening in the script.
3. Starting the pipeline
Make sure that you are in the dflow-bq-stream-python directory in your Cloud Shell window.
Deploy the Dataflow job with the following command (you'll review the code in a minute):
Take a few minutes to review and understand the code. The diagrams below highlight some key features.
7. Check out the results
In the console, go to Dataflow and click on the running job.
Click on the top node of the pipeline diagram. Wait until you see metrics showing the number of received messages on the right-hand side (it takes a while for the cluster to spin up and start processing messages).
Explore the pipeline and the execution metrics in Dataflow.
In the console, go to BigQuery and explore your new dflow_demo dataset. Check out the DETAILS and PREVIEW sections for both of the tables in that dataset. Note that the nested table has one row per window, with an array of structs, each struct representing one message for that window. Note also that the DETAILS section shows that the rows are in the streaming buffer and not yet in the BigQuery storage service.
8. Clean up
Stop the Dataflow job via the console
Stop the sending of messages by closing the Cloud Shell window