Connecting a Jupyter Notebook to Snowflake through Python (Part 3)

April 25, 2018 Robert Fehrmann

In part two of this four-part series, we learned how to create a Sagemaker Notebook instance. In part three, we’ll learn how to connect that Sagemaker Notebook instance to Snowflake. If you’ve completed the steps outlined in part one and part two, the Jupyter Notebook instance is up and running and you have access to your Snowflake instance, including the demo dataset. Now, you’re ready to connect the two platforms.

You can review the entire blog series here: Part One > Part Two > Part Three > Part Four.  

The Sagemaker Console

The first step is to open the Jupyter service using the link on the Sagemaker console.

 

There are two options for creating a Jupyter Notebook. You can create the notebook from scratch by following the step-by-step instructions below, or you can download sample notebooks here. If you decide to build the notebook from scratch, select the conda_python3 kernel. Alternatively, if you decide to work with a pre-made sample, make sure to upload it to your Sagemaker notebook instance first.

There are several options for connecting Sagemaker to Snowflake. The simplest way to get connected is through the Snowflake Connector for Python. Note that the connector doesn’t come pre-installed with Sagemaker, so you will need to install it through the Python package manager, pip. (As of the writing of this post, the Snowflake Python connector depends on the C Foreign Function Interface package, cffi, and requires version 1.11 or later.) Sagemaker, however, comes with cffi 1.10 preinstalled, which unfortunately causes the Snowflake Python connector to fail.

The step outlined below automatically detects the failure and, if necessary, uninstalls the outdated cffi package and installs the most recent version. Note: If re-installation of the cffi package is necessary, you must also restart the kernel so the new version is recognized.

%%bash
# Check which cffi version is installed (older pip list format, e.g. "cffi (1.10.0)")
CFFI_VERSION=$(pip list 2>/dev/null | grep cffi )
echo $CFFI_VERSION
# Remove the preinstalled cffi 1.10, which is too old for the Snowflake connector
if [[ "$CFFI_VERSION" == "cffi (1.10.0)" ]]
then
   pip uninstall --yes cffi
fi
# Install build prerequisites, then install/upgrade the Snowflake Python connector
yum_log=$(sudo yum install -y libffi-devel openssl-devel)
pip_log=$(pip install --upgrade snowflake-connector-python)
# If cffi had to be replaced, the kernel must be restarted to pick up the new version
if [[ "$CFFI_VERSION" == "cffi (1.10.0)" ]]
then
   echo "configuration has changed; restart notebook"
fi

This problem recurs each time the Sagemaker server is shut down and restarted, so the step above needs to be re-run, and the Jupyter kernel needs to be restarted after each shutdown as well. The good news is that once the cffi package is eventually updated on the Sagemaker AMI, the step above will automatically recognize the newer version and skip the uninstall/reinstall.

 

The next step is to connect to the Snowflake instance with your credentials.

import snowflake.connector
# Connecting to Snowflake using the default authenticator
ctx = snowflake.connector.connect(
  user='<user>',
  password='<password>',
  account='<account>'
)

Here you have the option to hard code all credentials and other specific information, including the S3 bucket names. However, for security reasons, it’s advisable not to store credentials in the notebook. Another option is to enter your credentials every time you run the notebook.
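For example, a minimal sketch of the enter-credentials-each-time approach could look like the following; the prompts and variable names are just illustrative:

import getpass
import snowflake.connector

# Prompt for credentials at run time so nothing is persisted in the notebook
sf_account = input("Snowflake account: ")
sf_user = input("Snowflake user: ")
sf_password = getpass.getpass("Snowflake password: ")

ctx = snowflake.connector.connect(
    user=sf_user,
    password=sf_password,
    account=sf_account
)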

Rather than storing credentials directly in the notebook, I opted to store a reference to the credentials. The actual credentials are stored securely in a key/value management service called AWS Systems Manager Parameter Store (SSM).

As with most AWS services, the first step is to set up permissions for SSM through AWS IAM. Please ask your AWS security admin to create another policy with the following actions on KMS and SSM.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "ssm:PutParameter",
                "ssm:DeleteParameter",
                "kms:Decrypt",
                "ssm:GetParameterHistory",
                "ssm:GetParametersByPath",
                "ssm:GetParameters",
                "ssm:GetParameter",
                "ssm:DeleteParameters"
            ],
            "Resource": [
                "arn:aws:kms:<region>:<accountid>:key/SNOWFLAKE/*",
                "arn:aws:ssm:<region>:<accountid>:parameter/SNOWFLAKE/*"
            ]
        },
        {
            "Sid": "VisualEditor1",
            "Effect": "Allow",
            "Action": [
                "ssm:DescribeParameters",
                "kms:ListAliases"
            ],
            "Resource": "*"
        }
    ]
}

Adhering to the best-practice principle of least privilege, I recommend limiting the allowed actions to the specific resources shown above. Also, be sure to replace the <region> and <accountid> placeholders in the policy or, alternatively, grant access to all resources (i.e. “*”).

In the policy shown above, I created a root name of “SNOWFLAKE”. This is only an example; you’re free to create your own, unique naming convention.

Next, check the permissions for your login. Assuming the new policy has been called SagemakerCredentialsPolicy, it should now appear among the policies attached to your IAM user or role.

With the SagemakerCredentialsPolicy in place, you’re ready to begin configuring all your secrets (i.e. credentials) in SSM. Be sure to use the same namespace prefix for your secrets that you used in the credentials policy (in this example, /SNOWFLAKE/).
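The key/value pairs can be created through the AWS console, the AWS CLI, or programmatically. As an illustration, here is a minimal boto3 sketch that writes the same parameter names used later in this post; the placeholder values and the us-east-1 region are assumptions you should replace with your own:

import boto3

ssm = boto3.client('ssm', 'us-east-1')   # assumed region; use yours

# Secrets are stored as SecureString so their values are encrypted with KMS
ssm.put_parameter(Name='/SNOWFLAKE/ACCOUNT_ID', Value='<account>', Type='SecureString', Overwrite=True)
ssm.put_parameter(Name='/SNOWFLAKE/USER_ID', Value='<user>', Type='SecureString', Overwrite=True)
ssm.put_parameter(Name='/SNOWFLAKE/PASSWORD', Value='<password>', Type='SecureString', Overwrite=True)

# Non-sensitive settings can be stored as plain String parameters
ssm.put_parameter(Name='/SNOWFLAKE/WAREHOUSE', Value='<warehouse>', Type='String', Overwrite=True)
ssm.put_parameter(Name='/SNOWFLAKE/DATABASE', Value='<database>', Type='String', Overwrite=True)
ssm.put_parameter(Name='/SNOWFLAKE/SCHEMA', Value='<schema>', Type='String', Overwrite=True)
# The remaining parameters (/SNOWFLAKE/URL, /SNOWFLAKE/BUCKET, /SNOWFLAKE/PREFIX) follow the same pattern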

After setting up your key/value pairs in SSM, use the following step to read the key/value pairs into your Jupyter Notebook.

import boto3

params=['/SNOWFLAKE/URL','/SNOWFLAKE/ACCOUNT_ID'
        ,'/SNOWFLAKE/USER_ID','/SNOWFLAKE/PASSWORD'
        ,'/SNOWFLAKE/DATABASE','/SNOWFLAKE/SCHEMA'
        ,'/SNOWFLAKE/WAREHOUSE','/SNOWFLAKE/BUCKET'
        ,'/SNOWFLAKE/PREFIX']
region='us-east-1'

def get_credentials(params):
    ssm = boto3.client('ssm', region)
    # Read the parameters, decrypting any SecureString values
    response = ssm.get_parameters(
        Names=params,
        WithDecryption=True
    )
    # Build a dict of credentials keyed by parameter name
    param_values = {k['Name']: k['Value'] for k in response['Parameters']}
    return param_values

param_values = get_credentials(params)

Instead of hard coding the credentials, you can now reference the key/value pairs via the param_values dictionary. In addition to the credentials (account ID, user ID and password), I also stored the warehouse, database and schema.

import snowflake.connector
# Connecting to Snowflake using the default authenticator
ctx = snowflake.connector.connect(
  user=param_values['/SNOWFLAKE/USER_ID'],
  password=param_values['/SNOWFLAKE/PASSWORD'],
  account=param_values['/SNOWFLAKE/ACCOUNT_ID'],
  warehouse=param_values['/SNOWFLAKE/WAREHOUSE'],
  database=param_values['/SNOWFLAKE/DATABASE'],
  schema=param_values['/SNOWFLAKE/SCHEMA']
)
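To verify the connection before moving on, you can run a quick sanity check like the one below; the specific query is only an illustration, and any lightweight statement will do:

# Quick sanity check: ask Snowflake for its current version
cs = ctx.cursor()
try:
    version = cs.execute("select current_version()").fetchone()
    print(version[0])
finally:
    cs.close()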
Now, you’re ready to read data from Snowflake. To illustrate the benefits of keeping data in Snowflake, we will read semi-structured data from the SNOWFLAKE_SAMPLE_DATA demo database.

When data is stored in Snowflake, you can use the Snowflake JSON parser and the SQL engine to easily query, transform, cast and filter JSON data before it gets to the Jupyter Notebook.

From the JSON documents stored in WEATHER_14_TOTAL, the following step retrieves the minimum and maximum temperatures (converted from Kelvin to Fahrenheit), the observation timestamp, and the latitude/longitude coordinates for New York City.

cs = ctx.cursor()
allrows = cs.execute(
    "select (V:main.temp_max - 273.15) * 1.8000 + 32.00 as temp_max_far, " +
    "       (V:main.temp_min - 273.15) * 1.8000 + 32.00 as temp_min_far, " +
    "       cast(V:time as timestamp) time, " +
    "       V:city.coord.lat lat, " +
    "       V:city.coord.lon lon " +
    "from snowflake_sample_data.weather.weather_14_total " +
    "where v:city.name = 'New York' " +
    "and   v:city.country = 'US' ").fetchall()

The final step converts the result set into a Pandas DataFrame, which is suitable for machine learning algorithms.

import pandas as pd                               # For munging tabular data

data = pd.DataFrame(allrows)
data.columns = ['temp_max_far','temp_min_far','time','lat','lon']
pd.set_option('display.max_columns', 500)     # Make sure we can see all of the columns
pd.set_option('display.max_rows', 10)         # Keep the output on one page
data
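Optionally, assuming matplotlib is available in the conda_python3 environment, you can plot the two temperature columns as a quick visual check of the DataFrame; this snippet is purely illustrative:

import matplotlib.pyplot as plt

# Plot daily max/min temperatures over time as a quick sanity check
fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(data['time'], data['temp_max_far'], label='temp_max_far')
ax.plot(data['time'], data['temp_min_far'], label='temp_min_far')
ax.set_xlabel('time')
ax.set_ylabel('temperature (F)')
ax.legend()
plt.show()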

 

Conclusion

Now that we’ve connected a Jupyter Notebook in Sagemaker to the data in Snowflake using the Snowflake Connector for Python, we’re ready for the final stage: Connecting Sagemaker and a Jupyter Notebook to both a local Spark instance and a multi-node EMR Spark cluster. I’ll cover how to accomplish this connection in the fourth and final installment of this series — Connecting a Jupyter Notebook to Snowflake via Spark.

You can review the entire blog series here: Part One > Part Two > Part Three > Part Four.  

 
