AutoML Tables
Training and inference with Google Cloud AutoML Tables
- 1. Project set up
- 2. Initialize and authenticate
- 3. Import training data
- 4. Update dataset: assign a label column and enable nullable columns
- 5. Creating a model
- 6. Make a prediction
- 7. Batch prediction
Follow the AutoML Tables documentation to:
- Create a Google Cloud Platform (GCP) project.
- Enable billing.
- Apply to whitelist your project.
- Enable AutoML API.
- Enable the AutoML Tables API.
- Create a service account, grant required permissions, and download the service account private key.
You also need to upload your data into Google Cloud Storage (GCS) or BigQuery. For example, to use GCS as your data source (a gsutil sketch follows this list):
- Create a GCS bucket.
- Upload the training and batch prediction files.
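As a minimal sketch, you could create the bucket and upload the files with gsutil from this notebook (the bucket name, region, and local file names below are placeholders):
# Create a bucket and upload the data files (placeholder names).
!gsutil mb -l us-central1 gs://<BUCKET_NAME>
!gsutil cp <TRAINING_FILE>.csv gs://<BUCKET_NAME>/
!gsutil cp iris_batch_prediction_input.csv gs://<BUCKET_NAME>/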
Warning: Private keys must be kept secret. If you expose your private key, revoke it immediately from the Google Cloud Console.
#@title Install AutoML Tables client library { vertical-output: true }
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from google.colab import files
import tarfile
# Upload the client library
compressed_file_upload = files.upload()
compressed_file_name = list(compressed_file_upload.keys())[0]
# Decompress the client library
with tarfile.open(compressed_file_name) as tar:
    tar.extractall(path='.')
# Install the client library
!pip install ./python/automl-v1beta1
#@title Authenticate using service account key and create a client. { vertical-output: true }
from google.cloud import automl_v1beta1
# Upload service account key
keyfile_upload = files.upload()
keyfile_name = list(keyfile_upload.keys())[0]
# Authenticate and create an AutoML client.
client = automl_v1beta1.AutoMlClient.from_service_account_file(keyfile_name)
# Authenticate and create a prediction service client.
prediction_client = automl_v1beta1.PredictionServiceClient.from_service_account_file(keyfile_name)
Enter your GCP project ID.
#@title GCP project ID and location
project_id = '<PROJECT_ID>' #@param {type:'string'}
location = 'us-central1'
location_path = client.location_path(project_id, location)
location_path
To test whether your project set up and authentication steps were successful, run the following cell to list your datasets.
#@title List datasets. { vertical-output: true }
list_datasets_response = client.list_datasets(location_path)
datasets = {dataset.display_name: dataset.name for dataset in list_datasets_response}
datasets
You can also print the list of your models by running the following cell.
#@title List models. { vertical-output: true }
list_models_response = client.list_models(location_path)
models = {model.display_name: model.name for model in list_models_response}
models
Select a dataset display name and pass your table source information to create a new dataset.
#@title Create dataset { vertical-output: true, output-height: 200 }
dataset_display_name = 'iris_dataset' #@param {type: 'string'}
create_dataset_response = client.create_dataset(
    location_path,
    {'display_name': dataset_display_name, 'tables_dataset_metadata': {}})
dataset_name = create_dataset_response.name
create_dataset_response
You can import your data to AutoML Tables from GCS or BigQuery. For this tutorial, you can use the iris dataset as your training data. You can create a GCS bucket and upload the data into your bucket. The URI for your file is gs://BUCKET_NAME/FOLDER_NAME1/FOLDER_NAME2/.../FILE_NAME. Alternatively you can create a BigQuery table and upload the data into the table. The URI for your table is bq://PROJECT_ID.DATASET_ID.TABLE_ID.
Importing data may take a few minutes or hours depending on the size of your data. If your Colab times out, run the following command to retrieve your dataset. Replace dataset_name with its actual value obtained in the preceding cells.
dataset = client.get_dataset(dataset_name)
#@title ... if data source is GCS { vertical-output: true }
dataset_gcs_input_uris = ['gs://<BUCKET_NAME>/<FILE_PATH>',] #@param
# Define input configuration.
input_config = {
    'gcs_source': {
        'input_uris': dataset_gcs_input_uris
    }
}
#@title ... if data source is BigQuery { vertical-output: true }
dataset_bq_input_uri = 'bq://<PROJECT_ID>.<DATASET_NAME>.<TABLE_NAME>' #@param {type: 'string'}
# Define input configuration.
input_config = {
    'bigquery_source': {
        'input_uri': dataset_bq_input_uri
    }
}
#@title Import data { vertical-output: true }
import_data_response = client.import_data(dataset_name, input_config)
print('Dataset import operation: {}'.format(import_data_response.operation))
# Wait until import is done.
import_data_result = import_data_response.result()
import_data_result
Run the following cell to list the table specs, which include details such as the row count, and to chart the inferred column types.
#@title Table schema { vertical-output: true }
import google.cloud.automl_v1beta1.proto.data_types_pb2 as data_types
import matplotlib.pyplot as plt
# List table specs
list_table_specs_response = client.list_table_specs(dataset_name)
table_specs = [s for s in list_table_specs_response]
# List column specs
table_spec_name = table_specs[0].name
list_column_specs_response = client.list_column_specs(table_spec_name)
column_specs = {s.display_name: s for s in list_column_specs_response}
# Table schema pie chart.
type_counts = {}
for column_spec in column_specs.values():
    type_name = data_types.TypeCode.Name(column_spec.data_type.type_code)
    type_counts[type_name] = type_counts.get(type_name, 0) + 1
plt.pie(x=type_counts.values(), labels=type_counts.keys(), autopct='%1.1f%%')
plt.axis('equal')
plt.show()
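The table spec returned above also carries the total number of rows. As a quick check, a minimal sketch (assuming the row_count field on the returned table spec):
# Print the number of rows detected in the imported table (row_count is assumed here).
print('Row count: {}'.format(table_specs[0].row_count))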
The column specs retrieved above describe the inferred schema.
AutoML Tables automatically detects your data column type. For example, for the Iris dataset it detects species to be categorical and petal_length, petal_width, sepal_length, and sepal_width to be numerical. Depending on the type of your label column, AutoML Tables chooses to run a classification or regression model. If your label column contains only numerical values, but they represent categories, change your label column type to categorical by updating your schema.
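For instance, a minimal sketch that prints each column's detected type, reusing the column_specs dictionary and data_types import from the schema cell above:
# Print the inferred type of every column.
for name, spec in column_specs.items():
    print('{}: {}'.format(name, data_types.TypeCode.Name(spec.data_type.type_code)))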
#@title Update dataset { vertical-output: true }
update_column_spec_dict = {
    'name': column_specs['sepal_length'].name,
    'data_type': {
        'type_code': 'FLOAT64',
        'nullable': True
    }
}
update_column_response = client.update_column_spec(update_column_spec_dict)
update_column_response
Tip: You can use 'type_code': 'CATEGORY' in the preceding update_column_spec_dict to convert the column data type from FLOAT64 to CATEGORY.
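For illustration only, a minimal sketch of that variant (the rest of this tutorial keeps sepal_length as FLOAT64):
# Illustrative sketch: mark the column as categorical instead of numerical.
update_column_spec_dict = {
    'name': column_specs['sepal_length'].name,
    'data_type': {
        'type_code': 'CATEGORY',
        'nullable': True
    }
}
# client.update_column_spec(update_column_spec_dict)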
#@title Update dataset { vertical-output: true }
label_column_name = 'species' #@param {type: 'string'}
label_column_spec = column_specs[label_column_name]
label_column_id = label_column_spec.name.rsplit('/', 1)[-1]
print('Label column ID: {}'.format(label_column_id))
# Define the values of the fields to be updated.
update_dataset_dict = {
    'name': dataset_name,
    'tables_dataset_metadata': {
        'target_column_spec_id': label_column_id
    }
}
update_dataset_response = client.update_dataset(update_dataset_dict)
update_dataset_response
Train a model
Specify the duration of the training. For example, 'train_budget_milli_node_hours': 1000 runs the training for one hour. If your Colab times out, use client.list_models(location_path) to check whether your model has been created. Then use the model name to continue with the next steps. Run the following command to retrieve your model, replacing model_name with its actual value.
model = client.get_model(model_name)
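If you only kept the display name, a minimal sketch to recover the full model resource name (assuming the iris_model display name used in the next cell):
# Look up the model resource name by its display name.
models = {m.display_name: m.name for m in client.list_models(location_path)}
model_name = models['iris_model']
model = client.get_model(model_name)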
#@title Create model { vertical-output: true }
model_display_name = 'iris_model' #@param {type:'string'}
model_dict = {
    'display_name': model_display_name,
    'dataset_id': dataset_name.rsplit('/', 1)[-1],
    'tables_model_metadata': {'train_budget_milli_node_hours': 1000}
}
create_model_response = client.create_model(location_path, model_dict)
print('Create model operation: {}'.format(create_model_response.operation))
# Wait until model training is done.
create_model_result = create_model_response.result()
model_name = create_model_result.name
create_model_result
There are two different prediction modes: online and batch. The following cell shows you how to make an online prediction.
#@title Make an online prediction { vertical-output: true }
sepal_length = 5.8 #@param {type:'slider', min:4, max:8, step:0.1}
sepal_width = 3.1 #@param {type:'slider', min:2, max:5, step:0.1}
petal_length = 3.8 #@param {type:'slider', min:1, max:7, step:0.1}
petal_width = 1.2 #@param {type:'slider', min:0, max:3, step:0.1}
payload = {
    'row': {
        'values': [
            {'number_value': sepal_length},
            {'number_value': sepal_width},
            {'number_value': petal_length},
            {'number_value': petal_width}
        ]
    }
}
# Make a prediction.
prediction_client.predict(model_name, payload)
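The response payload contains one entry per predicted class. A minimal sketch for printing the class labels and confidence scores (the tables.value and tables.score fields are assumed here, following the v1beta1 Tables response shape):
# Print each predicted class label and its score (field names assumed).
response = prediction_client.predict(model_name, payload)
for result in response.payload:
    print('{}: {:.3f}'.format(result.tables.value.string_value, result.tables.score))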
Your data source for batch prediction can be GCS or BigQuery. For this tutorial, you can use iris_batch_prediction_input.csv as the input source. Create a GCS bucket and upload the file into your bucket. Some lines in the batch prediction input file are intentionally missing some values. AutoML Tables logs these errors in the errors.csv file.
NOTE: The client library has a bug. If the following cell returns a TypeError: Could not convert Any to BatchPredictResult error, ignore it. The batch prediction output file(s) will be written to the GCS bucket that you set in the preceding cells.
#@title Start batch prediction { vertical-output: true, output-height: 200 }
batch_predict_gcs_input_uris = ['gs://automl-tables-data/iris_batch_prediction_input.csv',] #@param
batch_predict_gcs_output_uri_prefix = 'gs://automl-tables-pred' #@param {type:'string'}
# Define input source.
batch_prediction_input_source = {
    'gcs_source': {
        'input_uris': batch_predict_gcs_input_uris
    }
}
# Define output target.
batch_prediction_output_target = {
    'gcs_destination': {
        'output_uri_prefix': batch_predict_gcs_output_uri_prefix
    }
}
batch_predict_response = prediction_client.batch_predict(
    model_name, batch_prediction_input_source, batch_prediction_output_target)
print('Batch prediction operation: {}'.format(batch_predict_response.operation))
# Wait until batch prediction is done.
batch_predict_result = batch_predict_response.result()
batch_predict_response.metadata
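Once the operation finishes, the result files are written under the output URI prefix. A minimal sketch for listing them with gsutil (assuming the gs://automl-tables-pred prefix configured above and that you have access to that bucket):
# List the batch prediction output files, including errors.csv if present.
!gsutil ls -r gs://automl-tables-pred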