AutoML Tables
Training and inference with Google Cloud AutoML Tables
- 1. Project set up
- 2. Initialize and authenticate
- 3. Import training data
- 4. Update dataset: assign a label column and enable nullable columns
- 5. Creating a model
- 6. Make a prediction
- 7. Batch prediction
Follow the AutoML Tables documentation to:
- Create a Google Cloud Platform (GCP) project.
- Enable billing.
- Apply to whitelist your project.
- Enable AutoML API.
- Enable the AutoML Tables API.
- Create a service account, grant required permissions, and download the service account private key.
You also need to upload your data into Google Cloud Storage (GCS) or BigQuery. For example, to use GCS as your data source (a gsutil sketch follows this list):
- Create a GCS bucket.
- Upload the training and batch prediction files.
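As a minimal sketch, you could create the bucket and upload the files with gsutil from this notebook (the bucket name, region, and local file names below are placeholders):
# Create a bucket and upload the data files (placeholder names).
!gsutil mb -l us-central1 gs://<BUCKET_NAME>
!gsutil cp <TRAINING_FILE>.csv gs://<BUCKET_NAME>/
!gsutil cp iris_batch_prediction_input.csv gs://<BUCKET_NAME>/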
Warning: Private keys must be kept secret. If you expose your private key, revoke it immediately from the Google Cloud Console.
#@title Install AutoML Tables client library { vertical-output: true }
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from google.colab import files
import tarfile
# Upload the client library
compressed_file_upload = files.upload()
compressed_file_name = list(compressed_file_upload.keys())[0]
# Decompress the client library
with tarfile.open(compressed_file_name) as tar:
    tar.extractall(path='.')
# Install the client library
!pip install ./python/automl-v1beta1
#@title Authenticate using service account key and create a client. { vertical-output: true }
from google.cloud import automl_v1beta1
# Upload service account key
keyfile_upload = files.upload()
keyfile_name = list(keyfile_upload.keys())[0]
# Authenticate and create an AutoML client.
client = automl_v1beta1.AutoMlClient.from_service_account_file(keyfile_name)
# Authenticate and create a prediction service client.
prediction_client = automl_v1beta1.PredictionServiceClient.from_service_account_file(keyfile_name)
Enter your GCP project ID.
#@title GCP project ID and location
project_id = '<PROJECT_ID>' #@param {type:'string'}
location = 'us-central1'
location_path = client.location_path(project_id, location)
location_path
To test whether your project set up and authentication steps were successful, run the following cell to list your datasets.
#@title List datasets. { vertical-output: true }
list_datasets_response = client.list_datasets(location_path)
datasets = {dataset.display_name: dataset.name for dataset in list_datasets_response}
datasets
You can also print the list of your models by running the following cell.
#@title List models. { vertical-output: true }
list_models_response = client.list_models(location_path)
models = {model.display_name: model.name for model in list_models_response}
models
Select a dataset display name and pass your table source information to create a new dataset.
#@title Create dataset { vertical-output: true, output-height: 200 }
dataset_display_name = 'iris_dataset' #@param {type: 'string'}
create_dataset_response = client.create_dataset(
    location_path,
    {'display_name': dataset_display_name, 'tables_dataset_metadata': {}})
dataset_name = create_dataset_response.name
create_dataset_response
You can import your data to AutoML Tables from GCS or BigQuery. For this tutorial, you can use the iris dataset as your training data. You can create a GCS bucket and upload the data into your bucket. The URI for your file is gs://BUCKET_NAME/FOLDER_NAME1/FOLDER_NAME2/.../FILE_NAME. Alternatively you can create a BigQuery table and upload the data into the table. The URI for your table is bq://PROJECT_ID.DATASET_ID.TABLE_ID.
Importing data may take a few minutes or hours depending on the size of your data. If your Colab times out, run the following command to retrieve your dataset. Replace dataset_name with its actual value obtained in the preceding cells.
dataset = client.get_dataset(dataset_name)
#@title ... if data source is GCS { vertical-output: true }
dataset_gcs_input_uris = ['gs://<BUCKET_NAME>/<FILE_PATH>',] #@param
# Define input configuration.
input_config = {
    'gcs_source': {
        'input_uris': dataset_gcs_input_uris
    }
}
#@title ... if data source is BigQuery { vertical-output: true }
dataset_bq_input_uri = 'bq://<PROJECT_ID>.<DATASET_NAME>.<TABLE_NAME>' #@param {type: 'string'}
# Define input configuration.
input_config = {
    'bigquery_source': {
        'input_uri': dataset_bq_input_uri
    }
}
#@title Import data { vertical-output: true }
import_data_response = client.import_data(dataset_name, input_config)
print('Dataset import operation: {}'.format(import_data_response.operation))
# Wait until import is done.
import_data_result = import_data_response.result()
import_data_result
Run the following cell to list the table specs, which include details such as the row count, and to chart the inferred column types.
#@title Table schema { vertical-output: true }
import google.cloud.automl_v1beta1.proto.data_types_pb2 as data_types
import matplotlib.pyplot as plt
# List table specs
list_table_specs_response = client.list_table_specs(dataset_name)
table_specs = [s for s in list_table_specs_response]
# List column specs
table_spec_name = table_specs[0].name
list_column_specs_response = client.list_column_specs(table_spec_name)
column_specs = {s.display_name: s for s in list_column_specs_response}
# Table schema pie chart.
type_counts = {}
for column_spec in column_specs.values():
    type_name = data_types.TypeCode.Name(column_spec.data_type.type_code)
    type_counts[type_name] = type_counts.get(type_name, 0) + 1
plt.pie(x=type_counts.values(), labels=type_counts.keys(), autopct='%1.1f%%')
plt.axis('equal')
plt.show()
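The table spec returned above also carries the total number of rows. As a quick check, a minimal sketch (assuming the row_count field on the returned table spec):
# Print the number of rows detected in the imported table (row_count is assumed here).
print('Row count: {}'.format(table_specs[0].row_count))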
The column specs retrieved above describe the inferred schema.
AutoML Tables automatically detects your data column type. For example, for the Iris dataset it detects species to be categorical and petal_length, petal_width, sepal_length, and sepal_width to be numerical. Depending on the type of your label column, AutoML Tables chooses to run a classification or regression model. If your label column contains only numerical values, but they represent categories, change your label column type to categorical by updating your schema.
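For instance, a minimal sketch that prints each column's detected type, reusing the column_specs dictionary and data_types import from the schema cell above:
# Print the inferred type of every column.
for name, spec in column_specs.items():
    print('{}: {}'.format(name, data_types.TypeCode.Name(spec.data_type.type_code)))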
#@title Update dataset { vertical-output: true }
update_column_spec_dict = {
    'name': column_specs['sepal_length'].name,
    'data_type': {
        'type_code': 'FLOAT64',
        'nullable': True
    }
}
update_column_response = client.update_column_spec(update_column_spec_dict)
update_column_response
Tip: You can use 'type_code': 'CATEGORY' in the preceding update_column_spec_dict to convert the column data type from FLOAT64 to CATEGORY.
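For illustration only, a minimal sketch of that variant (the rest of this tutorial keeps sepal_length as FLOAT64):
# Illustrative sketch: mark the column as categorical instead of numerical.
update_column_spec_dict = {
    'name': column_specs['sepal_length'].name,
    'data_type': {
        'type_code': 'CATEGORY',
        'nullable': True
    }
}
# client.update_column_spec(update_column_spec_dict)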
#@title Update dataset { vertical-output: true }
label_column_name = 'species' #@param {type: 'string'}
label_column_spec = column_specs[label_column_name]
label_column_id = label_column_spec.name.rsplit('/', 1)[-1]
print('Label column ID: {}'.format(label_column_id))
# Define the values of the fields to be updated.
update_dataset_dict = {
    'name': dataset_name,
    'tables_dataset_metadata': {
        'target_column_spec_id': label_column_id
    }
}
update_dataset_response = client.update_dataset(update_dataset_dict)
update_dataset_response
Train a model
Specify the duration of the training. For example, 'train_budget_milli_node_hours': 1000 runs the training for one hour. If your Colab times out, use client.list_models(location_path) to check whether your model has been created. Then use the model name to continue with the next steps. Run the following command to retrieve your model, replacing model_name with its actual value.
model = client.get_model(model_name)
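If you only kept the display name, a minimal sketch to recover the full model resource name (assuming the iris_model display name used in the next cell):
# Look up the model resource name by its display name.
models = {m.display_name: m.name for m in client.list_models(location_path)}
model_name = models['iris_model']
model = client.get_model(model_name)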
#@title Create model { vertical-output: true }
model_display_name = 'iris_model' #@param {type:'string'}
model_dict = {
    'display_name': model_display_name,
    'dataset_id': dataset_name.rsplit('/', 1)[-1],
    'tables_model_metadata': {'train_budget_milli_node_hours': 1000}
}
create_model_response = client.create_model(location_path, model_dict)
print('Create model operation: {}'.format(create_model_response.operation))
# Wait until model training is done.
create_model_result = create_model_response.result()
model_name = create_model_result.name
create_model_result
There are two different prediction modes: online and batch. The following cell shows you how to make an online prediction.
#@title Make an online prediction { vertical-output: true }
sepal_length = 5.8 #@param {type:'slider', min:4, max:8, step:0.1}
sepal_width = 3.1 #@param {type:'slider', min:2, max:5, step:0.1}
petal_length = 3.8 #@param {type:'slider', min:1, max:7, step:0.1}
petal_width = 1.2 #@param {type:'slider', min:0, max:3, step:0.1}
payload = {
    'row': {
        'values': [
            {'number_value': sepal_length},
            {'number_value': sepal_width},
            {'number_value': petal_length},
            {'number_value': petal_width}
        ]
    }
}
# Make a prediction.
prediction_client.predict(model_name, payload)
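The response payload contains one entry per predicted class. A minimal sketch for printing the class labels and confidence scores (the tables.value and tables.score fields are assumed here, following the v1beta1 Tables response shape):
# Print each predicted class label and its score (field names assumed).
response = prediction_client.predict(model_name, payload)
for result in response.payload:
    print('{}: {:.3f}'.format(result.tables.value.string_value, result.tables.score))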
Your data source for batch prediction can be GCS or BigQuery. For this tutorial, you can use iris_batch_prediction_input.csv as the input source. Create a GCS bucket and upload the file into your bucket. Some lines in the batch prediction input file are intentionally missing some values. AutoML Tables logs these errors in the errors.csv file.
NOTE: The client library has a bug. If the following cell returns a TypeError: Could not convert Any to BatchPredictResult error, ignore it. The batch prediction output file(s) will be written to the GCS bucket that you set in the preceding cells.
#@title Start batch prediction { vertical-output: true, output-height: 200 }
batch_predict_gcs_input_uris = ['gs://automl-tables-data/iris_batch_prediction_input.csv',] #@param
batch_predict_gcs_output_uri_prefix = 'gs://automl-tables-pred' #@param {type:'string'}
# Define input source.
batch_prediction_input_source = {
    'gcs_source': {
        'input_uris': batch_predict_gcs_input_uris
    }
}
# Define output target.
batch_prediction_output_target = {
    'gcs_destination': {
        'output_uri_prefix': batch_predict_gcs_output_uri_prefix
    }
}
batch_predict_response = prediction_client.batch_predict(
    model_name, batch_prediction_input_source, batch_prediction_output_target)
print('Batch prediction operation: {}'.format(batch_predict_response.operation))
# Wait until batch prediction is done.
batch_predict_result = batch_predict_response.result()
batch_predict_response.metadata
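Once the operation finishes, the result files are written under the output URI prefix. A minimal sketch for listing them with gsutil (assuming the gs://automl-tables-pred prefix configured above and that you have access to that bucket):
# List the batch prediction output files, including errors.csv if present.
!gsutil ls -r gs://automl-tables-pred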