
BigQuery and Python: How to optimize the way you use the two

Analyzing BigQuery data with Python isn’t as straightforward as it could be. This can be boiled down to three factors:

  1. Data loading takes a lot of time.
  2. You have no visibility into your query.
  3. You have to manage the interface yourself.

These three issues keep you and your fellow data scientists from delivering the key insights that push your organization forward.

We’re solving for this with the superPy library, which complements the superQuery IDE for BigQuery and simplifies the work of analysts using Jupyter Notebook to access BigQuery data. 

More on that later, but first let’s take a quick look at the three biggest issues Python developers face with BigQuery.

Sluggish transfer speeds

With Google BigQuery, you can run SQL queries on millions of rows and get the results in a matter of seconds.

Yet you can’t say the same when you’re getting data out of BigQuery for more complex processing. Results that BigQuery returned in seconds can take minutes, or even hours, to load into your Jupyter notebook.

There are shortcuts to speed up the process — like downloading the results to CSV and then importing this data into Jupyter. But after a while you end up juggling various CSV outputs in GCS buckets.
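
The friction of the CSV detour is easy to see in miniature. This is an illustrative sketch only; the file handling, column names, and values are made up, but the round trip mirrors the export/re-import dance:

```python
# A minimal sketch of the CSV detour: query results are exported to CSV,
# then re-imported in the notebook. Columns and values here are invented
# for illustration.
import csv
import io

rows = [{"user_id": 101, "revenue": 9.99}, {"user_id": 102, "revenue": 4.50}]

# Step 1: export the query result to CSV (in practice, into a GCS bucket).
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["user_id", "revenue"])
writer.writeheader()
writer.writerows(rows)

# Step 2: re-import the CSV in the notebook.
buf.seek(0)
reimported = list(csv.DictReader(buf))

# Step 3: the hidden cost -- every value came back as a string,
# so you still have to restore the schema by hand.
print(reimported[0])  # {'user_id': '101', 'revenue': '9.99'}
```

Every round trip like this silently drops type information, which is one more step you have to undo before the real analysis starts.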

Ironically, shortcuts like this only add extra steps to a process that should instead be more seamless.

Navigating your data blind

With zero statistics on your queries, you’re navigating your data in complete darkness. There is no way to know beforehand how much data a query will scan, what it will cost, or anything else about it.

Only after executing your query will you find out whether your work cost your company an arm and a leg.

You have to manage the interface yourself

When a data team wires a database connection into their notebook, there’s plenty of room for error. For example, each analyst downloads their own driver and subsequently ends up with a different experience.

Then you have to create methods that iterate over rows, set up a connection, and authenticate it. A lot can go wrong along the way.
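
The manual plumbing looks roughly like this. The sketch below uses Python's built-in sqlite3 as a stand-in, since the connect/execute/iterate/close shape is the same across DB-API drivers; in reality each analyst would be juggling their own BigQuery driver, credentials, and table names:

```python
# Illustrative only: sqlite3 stands in for whatever driver each analyst
# happens to install. The table and rows are invented for the example.
import sqlite3

conn = sqlite3.connect(":memory:")  # in a real setup: driver + credentials
try:
    cur = conn.cursor()
    cur.execute("CREATE TABLE events (user_id INTEGER, action TEXT)")
    cur.executemany("INSERT INTO events VALUES (?, ?)",
                    [(1, "click"), (2, "view")])

    # The hand-rolled "iterate over rows" step every notebook reinvents.
    cur.execute("SELECT user_id, action FROM events ORDER BY user_id")
    fetched = cur.fetchall()
    for user_id, action in fetched:
        print(user_id, action)
finally:
    conn.close()  # forget this and connections leak
```

None of this is analysis; it's scaffolding, and every analyst rebuilds it slightly differently.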

Only after that can you finally begin to manage results. 

As a result, the experience isn’t consistent from one analyst to the next.

Beat the BigQuery-Python bottleneck

With the superPy library, analysts don’t deal with these issues anymore thanks to three key capabilities.

First, analysts enjoy lazy loading of data. Now you can develop your logic over a subset of rows; once that’s finished, you can download the rest of your data.
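
The idea behind developing over a subset can be sketched in a few lines. The helper and its `dev_mode` flag below are hypothetical, not part of any real library's API; they just show the "small slice while iterating, full run when done" pattern:

```python
# Hedged sketch of the lazy-loading idea. with_dev_limit and dev_mode are
# hypothetical names invented for this example.
def with_dev_limit(sql: str, dev_mode: bool, limit: int = 1000) -> str:
    """Append a LIMIT clause while the logic is still being developed."""
    return f"{sql} LIMIT {limit}" if dev_mode else sql

query = "SELECT user_id, revenue FROM project.dataset.sales"
print(with_dev_limit(query, dev_mode=True))   # cheap slice for development
print(with_dev_limit(query, dev_mode=False))  # full query once logic is done
```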

Second, developers can view Google BigQuery’s dry run before executing a query. This means you’ll know how much data will be scanned AND what the query will cost beforehand. No more pricey accidents.
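
To see why a dry run matters, note that BigQuery’s dry run reports the bytes a query would process without running it, and on-demand billing is per byte scanned. The back-of-the-envelope math is simple; the $5-per-TiB rate below is an assumption for illustration, so check current pricing for your region:

```python
# Cost math a dry run enables. RATE_PER_TIB is an assumed on-demand rate;
# verify against current BigQuery pricing before relying on it.
RATE_PER_TIB = 5.00

def estimate_cost(total_bytes_processed: int) -> float:
    """Rough on-demand cost for a query scanning the given number of bytes."""
    tib = total_bytes_processed / 2**40
    return tib * RATE_PER_TIB

# e.g. a dry run reporting 512 GiB scanned:
print(estimate_cost(512 * 2**30))  # 2.5
```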

You can also explore other rich statistics on your queries: Was cache used? What percentage of your results came from cache?

Better yet: tell superQuery which statistics you’re interested in and we’ll seriously consider adding them for you.

Check out the query statistics you can see with superPy:

[Image: superPy result statistics]

Finally, we manage the interface for you. You just need to import your data and start coding.

How to start using superPy

To get started, read the short instructions in the superPy GitHub README, and make sure you have a superQuery account (it’s free to start!).
