Hi!
Yesterday a new version of Machine Learning.Net was announced: version 0.4. There are several interesting new features, but the one that caught my attention was the mention of Word Embedding. I did some reading on the topic, and the ability to take existing text-processing models and build our own models on top of them is very welcome.
The release post covers the details. I decided to take the sample console app from the repository and compare classic text processing with what we can do with Word Embeddings (WE). With the same data sets for training and evaluation, the classic model reaches an accuracy of 66.60%, and with WE the accuracy rises to 72.30%.
The training code for both models is the following:
internal static class Program
{
    private static PredictionModel<SentimentData, SentimentPrediction> _model;
    private static PredictionModel<SentimentData, SentimentPrediction> _modelWordEmbeddings;

    private static string AppPath => Path.GetDirectoryName(Environment.GetCommandLineArgs()[0]);
    private static string TrainDataPath => Path.Combine(AppPath, "datasets", "sentiment-imdb-train.txt");
    private static string TestDataPath => Path.Combine(AppPath, "datasets", "sentiment-yelp-test.txt");
    private static string ModelPath => Path.Combine(AppPath, "SentimentModel.zip");

    private static void Main(string[] args)
    {
        TrainModel();
        TrainModelWordEmbeddings();
        Evaluate(_model, "normal");
        Evaluate(_modelWordEmbeddings, "using WordEmbeddings");
        Console.ReadLine();
    }

    public static void TrainModel()
    {
        var pipeline = new LearningPipeline();
        pipeline.Add(new TextLoader(TrainDataPath).CreateFrom<SentimentData>());
        pipeline.Add(new TextFeaturizer("Features", "SentimentText"));
        pipeline.Add(new FastTreeBinaryClassifier() { NumLeaves = 5, NumTrees = 5, MinDocumentsInLeafs = 2 });
        Console.WriteLine("=============== Training model ===============");
        var model = pipeline.Train<SentimentData, SentimentPrediction>();
        Console.WriteLine("=============== End training ===============");
        _model = model;
    }

    public static void TrainModelWordEmbeddings()
    {
        var pipeline = new LearningPipeline();
        pipeline.Add(new TextLoader(TrainDataPath).CreateFrom<SentimentData>());
        pipeline.Add(new TextFeaturizer("FeaturesA", "SentimentText") { OutputTokens = true });
        pipeline.Add(new WordEmbeddings(("FeaturesA_TransformedText", "FeaturesB")));
        pipeline.Add(new ColumnConcatenator("Features", "FeaturesA", "FeaturesB"));
        pipeline.Add(new FastTreeBinaryClassifier() { NumLeaves = 5, NumTrees = 5, MinDocumentsInLeafs = 2 });
        Console.WriteLine("=============== Training model with Word Embeddings ===============");
        var model = pipeline.Train<SentimentData, SentimentPrediction>();
        Console.WriteLine("=============== End training ===============");
        _modelWordEmbeddings = model;
    }

    private static void Evaluate(PredictionModel<SentimentData, SentimentPrediction> model, string name)
    {
        var testData = new TextLoader(TestDataPath).CreateFrom<SentimentData>();
        var evaluator = new BinaryClassificationEvaluator();
        Console.WriteLine("=============== Evaluating model {0} ===============", name);
        var metrics = evaluator.Evaluate(model, testData);
        Console.WriteLine($"Accuracy: {metrics.Accuracy:P2}");
        Console.WriteLine($"Auc: {metrics.Auc:P2}");
        Console.WriteLine($"F1Score: {metrics.F1Score:P2}");
        Console.WriteLine("=============== End evaluating ===============");
        Console.WriteLine();
    }
}
In the app above, both models are trained on the same training data set and evaluated on the same test data set. The only difference is in the functions TrainModel() and TrainModelWordEmbeddings(): the second one asks TextFeaturizer to output its tokens, runs them through the WordEmbeddings transform, and concatenates both feature columns before training.
As I mentioned at the beginning, the interesting part of this release is the ability to use several pretrained models. The MSDN post mentions GloVe, fastText and SSWE. The next step is to see how the sentiment model behaves with some of these pretrained embeddings.
There are already quite a few models available. I will only test some of them because the pretrained models are downloaded on demand during training, and downloading 6+ GB per test is, at the very least, interesting.
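The post does not show how the pretrained model is selected, so here is a minimal sketch of one way to do it. It assumes that the WordEmbeddings transform in ML.NET 0.4 exposes a ModelKind property of type WordEmbeddingsTransformPretrainedModelKind; check this against the version you have installed. The WordEmbeddingsTrainer class is a hypothetical helper that is not part of the original app, and SentimentData and SentimentPrediction are the data classes from the sample.

```csharp
using Microsoft.ML;
using Microsoft.ML.Data;
using Microsoft.ML.Trainers;
using Microsoft.ML.Transforms;

// Hypothetical helper, not part of the original app.
// Relies on the SentimentData and SentimentPrediction classes from the sample.
internal static class WordEmbeddingsTrainer
{
    public static PredictionModel<SentimentData, SentimentPrediction> Train(
        string trainDataPath,
        WordEmbeddingsTransformPretrainedModelKind modelKind)
    {
        var pipeline = new LearningPipeline();
        pipeline.Add(new TextLoader(trainDataPath).CreateFrom<SentimentData>());

        // OutputTokens = true makes the featurizer emit the tokenized text
        // column that the WordEmbeddings transform consumes.
        pipeline.Add(new TextFeaturizer("FeaturesA", "SentimentText") { OutputTokens = true });

        // Assumption: ModelKind selects the pretrained embedding to download and use.
        pipeline.Add(new WordEmbeddings(("FeaturesA_TransformedText", "FeaturesB"))
        {
            ModelKind = modelKind
        });

        // Same concatenation and trainer settings as TrainModelWordEmbeddings().
        pipeline.Add(new ColumnConcatenator("Features", "FeaturesA", "FeaturesB"));
        pipeline.Add(new FastTreeBinaryClassifier() { NumLeaves = 5, NumTrees = 5, MinDocumentsInLeafs = 2 });

        return pipeline.Train<SentimentData, SentimentPrediction>();
    }
}
```

Under those assumptions, a call such as Evaluate(WordEmbeddingsTrainer.Train(TrainDataPath, WordEmbeddingsTransformPretrainedModelKind.GloVe50D), "using WordEmbeddings GloVe50D") from inside Program would train and evaluate one variant per pretrained model.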
Well, the results are quite interesting:
=============== Evaluating model normal ===============
Accuracy: 66.60%
Auc: 73.97%
F1Score: 61.78%
=============== End evaluating ===============

=============== Evaluating model using WordEmbeddings ===============
Accuracy: 72.30%
Auc: 81.19%
F1Score: 70.50%
=============== End evaluating ===============

=============== Evaluating model using WordEmbeddings GloVe50D ===============
Accuracy: 66.10%
Auc: 69.32%
F1Score: 64.28%
=============== End evaluating ===============

=============== Evaluating model using WordEmbeddings GloVe300D ===============
Accuracy: 67.80%
Auc: 73.23%
F1Score: 66.60%
=============== End evaluating ===============

=============== Evaluating model using WordEmbeddings GloVeTwitter50D ===============
Accuracy: 65.30%
Auc: 70.06%
F1Score: 64.26%
=============== End evaluating ===============

=============== Evaluating model using WordEmbeddings GloVeTwitter200D ===============
Accuracy: 65.40%
Auc: 72.63%
F1Score: 64.69%
=============== End evaluating ===============

=============== Evaluating model using WordEmbeddings Sswe ===============
Accuracy: 72.30%
Auc: 81.19%
F1Score: 70.50%
=============== End evaluating ===============
The complete code of the app can be downloaded from https://github.com/elbruno/Blog/tree/master/20180808%20MLNET%200.4%20WordEmbeddings
Happy Coding!
Greetings @ Toronto
El Bruno