Text classification using the new Keras vectorization layer
This notebook contains a walkthrough of text classification from scratch, starting from a directory of plain text files (a common scenario in practice). We demonstrate multiclass text classification using a dataset of Stack Overflow questions.
!pip3 install -q tf-nightly
import tensorflow as tf
import numpy as np
from tensorflow.keras import preprocessing
print(tf.__version__)
Multiclass text classification
This notebook shows a classifier that labels Stack Overflow posts with one of the most used languages today, namely Java, JavaScript, Python, or C#. This is an example of multiclass classification.
We will use a public dataset of Stack Overflow questions available in the Google Cloud Marketplace. You can explore the dataset in BigQuery by following the instructions at that link. In this notebook, you will build a model to predict the tags of Stack Overflow questions, using a pre-processed table built from the BigQuery dataset. To keep things simple, our pre-processed table includes only questions with one of 4 possible programming-related tags: Java, JavaScript, Python, or C#.
This notebook uses tf.keras to build and train models in TensorFlow, as well as some TensorFlow experimental features, like the TextVectorization layer for word splitting & indexing.
Download the BigQuery dataset
BigQuery has a public dataset that includes more than 17 million Stack Overflow questions. We are going to download posts labeled with one of the four most used languages today: Java, JavaScript, Python, and C#. To make the problem harder for our model, we have replaced every instance of those language names with the placeholder word blank. Otherwise, it would be very easy for the model to detect that a post is Java-related just by finding the word java in it.
You can access the pre-processed, blank-filled dataset as a tar file here in Google Cloud Storage. Each of the four labels has approximately 10k samples for training/eval and 10k samples for test.
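The substitution was done as part of the dataset preparation; a minimal sketch of how such masking could be implemented (the regular expression and example below are illustrative assumptions, not the actual preprocessing script):
import re
# Mask the target language names with the word "blank" so the model cannot
# rely on seeing the language name itself; pattern and casing are illustrative.
LANGUAGE_PATTERN = re.compile(r'\b(javascript|java|python|csharp)\b', re.IGNORECASE)
def mask_language_names(text):
    return LANGUAGE_PATTERN.sub('blank', text)
print(mask_language_names('How do I read a file in Python?'))
# How do I read a file in blank?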
!gsutil cp gs://tensorflow-blog-rnn/so_posts_4labels_blank_80k.tar.gz .
!tar -xf so_posts_4labels_blank_80k.tar.gz
batch_size = 32
raw_train_ds = tf.keras.preprocessing.text_dataset_from_directory(
    'train', batch_size=batch_size, validation_split=0.2, subset='training', seed=42)
raw_val_ds = tf.keras.preprocessing.text_dataset_from_directory(
    'train', batch_size=batch_size, validation_split=0.2, subset='validation', seed=42)
raw_test_ds = tf.keras.preprocessing.text_dataset_from_directory(
    'test', batch_size=batch_size)
Caching may reduce processing time. Let's verify this by timing two passes over the dataset before caching and two passes after.
import time
# First pass over the uncached dataset.
start = time.time()
for text_batch, label_batch in raw_train_ds:
    pass
end = time.time()
print(end - start)
import time
# Second pass, still uncached: the timing is similar to the first pass.
start = time.time()
for text_batch, label_batch in raw_train_ds:
    pass
end = time.time()
print(end - start)
# Cache the dataset in memory; the cache is filled during the next full pass.
raw_train_ds = raw_train_ds.cache()
import time
# First pass after calling cache(): this pass fills the cache.
start = time.time()
for text_batch, label_batch in raw_train_ds:
    pass
end = time.time()
print(end - start)
import time
# Second pass after caching: data is now read from memory, so this is faster.
start = time.time()
for text_batch, label_batch in raw_train_ds:
    pass
end = time.time()
print(end - start)
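For datasets too large to keep in memory, cache() also accepts a file path and will spill the cache to disk instead of RAM. A minimal sketch, with a hypothetical cache file name:
# Alternative to the in-memory cache() call above: write the cache to a
# local file; 'so_train.cache' is just an example path.
disk_cached_train_ds = raw_train_ds.cache('so_train.cache')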
# Print a few raw examples with their labels.
for text_batch, label_batch in raw_train_ds.take(1):
    for i in range(5):
        print(text_batch.numpy()[i])
        print(label_batch.numpy()[i])
Each label is an integer value between 0 and 3, corresponding to one of our four language labels.
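The integer-to-label mapping follows the alphanumeric order of the class subdirectory names under train/, which is how text_dataset_from_directory assigns label indices. A quick way to inspect the mapping (a small sketch using the extracted directory):
import os
# Label indices are assigned in sorted order of the class subdirectory names.
class_names = sorted(os.listdir('train'))
print(dict(enumerate(class_names)))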
Prepare data for training
Since the data is pre-processed, we do not need to perform any additional steps, such as removing HTML tags, as we did in Part 1 of this notebook.
We can go directly to instantiating our text vectorization layer (an experimental feature). We use this layer to normalize, split, and map strings to integers, so we set output_mode to 'int'. We also use the same model constants as in Part 1, such as an explicit maximum sequence_length.
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization
max_features = 5000
embedding_dim = 128
sequence_length = 500
vectorize_layer = TextVectorization(
    max_tokens=max_features,
    output_mode='int',
    output_sequence_length=sequence_length)
# Make a text-only dataset (no labels) and call adapt
text_ds = raw_train_ds.map(lambda x, y: x)
vectorize_layer.adapt(text_ds)
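To sanity-check the adapted layer, you can inspect the vocabulary it has learned; a small sketch (the exact tokens depend on the training data):
# The adapted vocabulary: index 0 is reserved for padding and index 1 for
# out-of-vocabulary ([UNK]) tokens; the rest are ordered by frequency.
vocab = vectorize_layer.get_vocabulary()
print(len(vocab))
print(vocab[:10])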
def vectorize_text(text, label):
    text = tf.expand_dims(text, -1)
    return vectorize_layer(text), label
# Vectorize the data.
train_ds = raw_train_ds.map(vectorize_text)
val_ds = raw_val_ds.map(vectorize_text)
test_ds = raw_test_ds.map(vectorize_text)
# Do async prefetching / buffering of the data for best performance on GPU.
train_ds = train_ds.cache().prefetch(buffer_size=10)
val_ds = val_ds.cache().prefetch(buffer_size=10)
test_ds = test_ds.cache().prefetch(buffer_size=10)
The vectorization layer transforms each input word of a sentence into a numerical representation: an index into a vocabulary whose size is defined by max_features (5000). Note that the output size is fixed, padded or truncated to sequence_length (500) regardless of how many tokens resulted from the previous step, and this is what will be fed to our model.
Let's take a moment to understand the output of the vectorization layer. The output for each sentence is fixed at 500 integers, as set by sequence_length. Note that many of the trailing values are zero: index 0 is the padding value used to fill sequences shorter than 500 tokens, while words that do not appear in our vocabulary are mapped to the out-of-vocabulary index instead.
for text_batch, label_batch in train_ds.take(1):
    for i in range(5):
        print(text_batch.numpy()[i])
        print(label_batch.numpy()[i])
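To make the padding and out-of-vocabulary behavior concrete, you can map the indices of one vectorized example back to tokens; a small sketch:
# Decode the first vectorized example back into tokens: [UNK] marks words
# outside the 5000-word vocabulary, and '' entries are padding.
vocab = vectorize_layer.get_vocabulary()
for text_batch, _ in train_ds.take(1):
    first_example = text_batch.numpy()[0]
    print(' '.join(vocab[idx] for idx in first_example[:20]))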
Build the model
The input data consists of arrays of integer-encoded tokens with a fixed size. The labels to predict are between 0 and 3, so instead of using a binary classifier, we will use a softmax classifier. We compile the model with the Adam optimizer and a different loss function from Part 1: sparse categorical crossentropy.
One of the parameters of the embedding layer is max_features + 1 rather than max_features; the extra slot accounts for the out-of-vocabulary token that the vectorization layer uses for input words that are not in the vocabulary.
from tensorflow.keras import layers
# An integer input for vocab indices.
inputs = tf.keras.Input(shape=(None,), dtype='int64')
x = layers.Embedding(max_features + 1, embedding_dim)(inputs)
x = layers.Bidirectional(layers.LSTM(128))(x)
predictions = layers.Dense(4, activation='softmax', name='predictions')(x)
model = tf.keras.Model(inputs, predictions)
model.compile(
    loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
epochs = 5
# Fit the model using the train and validation datasets.
history = model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=epochs)
model.summary()
loss, accuracy = model.evaluate(test_ds)
print("Loss: ", loss)
print("Accuracy: ", accuracy)
Learn more
This notebook uses tf.keras, a high-level API to build and train models in TensorFlow. For a more advanced text classification tutorial using tf.keras, see the MLCC Text Classification Guide. In this notebook, we also use some TensorFlow experimental features, like the TextVectorization layer for word splitting & indexing.