A best practice of software development, version control in BigQuery can be just as useful, allowing you to track changes made to queries and views across your team.
Have you ever written what you thought was a one-time query in BigQuery, only to have to re-create it from memory a few weeks later?
Or have you ever found yourself needing a previous version of a query you wrote?
In these situations and many others, version controlling your work in a query library is incredibly useful.
That’s why in this post we’re going to show you how to:
- Transform your Github repo into a query library, and
- Automatically back up & version control your BigQuery queries using superQuery
What is a query library?
And if your organization’s data warehouse rests in BigQuery, the benefits to version controlling your work become more obvious. For instance, with multiple colleagues editing the same view, it becomes easier to identify a key change in the SQL (or who made it).
However, as your query (and view) count piles up, query libraries become a pain to maintain manually.
Let’s explore why.
Why query libraries are hard to maintain
There are a few ways BigQuery users build and maintain their query library today:
- Copy + paste queries into a text editor like Sublime or vim, and manually save with some naming convention,
- Save the query to Saved Queries in BigQuery, or
- Manually update queries in your GitHub repo.
However, these methods require you to manually update the queries and lack an automatic versioning capability. The first two options lack versioning entirely — you can’t view previous versions of queries or see what changes were made over time.
As mentioned earlier, once your query count gets into the double digits, this becomes a headache.
What happens if you forget to update a query?
Or, if you’re using option #1 or #2 from above, how do you retrieve a previous version of the query?
And when queries are used across your organization, versioning becomes even more important.
Version control BigQuery work with superQuery
By connecting your GitHub repository to superQuery, you’ll have effectively transformed your repo into a query library.
How does it work?
Every time you execute a query in superQuery and there is a change in the SQL, your query will automatically get committed to your selected GitHub repository.
This also applies to BigQuery views. Provided that your colleagues are connected to the same repository, all changes made to views will be committed to GitHub. From there, you’ll see the history of revisions made to that view and which colleague made the change.
How are queries organized?
If there are multiple users connected to the same repository, each user will have their own folder in the repo.
As a result, there’s no need to think up your own naming convention. Everything is done for you auto-magically so you can focus on data analysis.
Create your query library
The Github integration actually saved my bacon last week when a client asked me to re-supply some data. I was able to dig out the original SQL in GitHub, modify the dates, and run it again.It meant I was confident that the data structure would be consistent with the previous extract I’d sent them, and didn’t have to spend time getting my head back in the nuances of how and why I had written the SQL in a certain way.The Data Tribe team is a huge fan of this integration!