Share on facebook
Share on linkedin
Share on twitter
Share on reddit

Version Control in BigQuery: How to sync your BigQuery work with GitHub

A best practice of software development, version control in BigQuery can be just as useful, allowing you to track changes made to queries and views across your team.

Have you ever written what you thought was a one-time query in BigQuery, only to have to re-create it from memory a few weeks later?

Or have you ever found yourself needing a previous version of a query you wrote?

In these situations and many others, version controlling your work in a query library is incredibly useful.

That’s why in this post we’re going to show you how to:

  1. Transform your Github repo into a query library, and
  2. Automatically back up & version control your BigQuery queries using superQuery

But first…

What is a query library?

A query library is essentially a repository for saving, updating, and tracking changes in your SQL queries.

There are many reasons to build a query library — if not for your team, at least for personal use. Caitlin Hudon summarizes the three main reasons in her great post on query libraries:

Three reasons to build a query library
Three main reasons to create a query library. (credit: Caitlin Hudon)

And if your organization’s data warehouse rests in BigQuery, the benefits to version controlling your work become more obvious. For instance, with multiple colleagues editing the same view, it becomes easier to identify a key change in the SQL (or who made it).

However, as your query (and view) count piles up, query libraries become a pain to maintain manually.

Let’s explore why.

Why query libraries are hard to maintain

There are a few ways BigQuery users build and maintain their query library today:

  1. Copy + paste queries into a text editor like Sublime or vim, and manually save with some naming convention,

  2. Save the query to Saved Queries in BigQuery, or

  3. Manually update queries in your GitHub repo.

However, these methods require you to manually update the queries and lack an automatic versioning capability. The first two options lack versioning entirely — you can’t view previous versions of queries or see what changes were made over time.

As mentioned earlier, once your query count gets into the double digits, this becomes a headache.

What happens if you forget to update a query?

Or, if you’re using option #1 or #2 from above, how do you retrieve a previous version of the query?

And when queries are used across your organization, versioning becomes even more important.


Version control BigQuery work with superQuery

By connecting your GitHub repository to superQuery, you’ll have effectively transformed your repo into a query library.

How does it work? 

Every time you execute a query in superQuery and there is a change in the SQL, your query will automatically get committed to your selected GitHub repository. 

This also applies to BigQuery views. Provided that your colleagues are connected to the same repository, all changes made to views will be committed to GitHub. From there, you’ll see the history of revisions made to that view and which colleague made the change.

How are queries organized?

If there are multiple users connected to the same repository, each user will have their own folder in the repo.

Your queries are catalogued under a hierarchy of Boards > Query Tabs, making your library easy to navigate (as long as you give your boards & tabs good names).

As a result, there’s no need to think up your own naming convention. Everything is done for you auto-magically so you can focus on data analysis.

Version History for a query with the superQuery integration for Github

Create your query library

To connect your GitHub repo to superQuery, follow these instructions. Of course, the two main prerequisites are that you:

  1. Have your data in BigQuery already.
  2. Create a superQuery account, or have one already.

We use 🍪 to ensure you get the best experience.