NLP

Our Sample code

Sentiment

Run nlphuggingfaceclassifier2.py, based on args of “data_type=huggingface data_name=imdb model_checkpoint=bert-base-cased task=sentiment outputdir=./output traintag=0805 training=True total_epochs=4 save_every=2 batch_size=8 learningrate=2e-05” The saved path is “outputpath=os.path.join(args.outputdir, task, args.data_name+’_’+args.traintag)”

All keys in raw datasets: dict_keys(['train', 'test', 'unsupervised'])
{'labels': torch.Size([8]), 'input_ids': torch.Size([8, 512]), 'token_type_ids': torch.Size([8, 512]), 'attention_mask': torch.Size([8, 512])}
tensor(0.7184, grad_fn=<NllLossBackward0>) torch.Size([8, 2])

sequence_classifier

Run nlphuggingfaceclassifier2.py, based on args of “data_type=huggingface data_name=glue dataconfig=mrpc subset=0.1 model_checkpoint=bert-base-cased task=sequence_classifier outputdir=./output traintag=0807 training=True total_epochs=4 save_every=2 batch_size=8 learningrate=2e-05”

oneitem all keys: dict_keys(['sentence1', 'sentence2', 'label', 'idx'])
ClassLabel(names=['not_equivalent', 'equivalent'], id=None)
{'labels': torch.Size([8]), 'input_ids': torch.Size([8, 80]), 'token_type_ids': torch.Size([8, 80]), 'attention_mask': torch.Size([8, 80])}
tensor(0.7476, grad_fn=<NllLossBackward0>) torch.Size([8, 2])
task sequence_classifier: {'accuracy': 0.8602941176470589, 'f1': 0.9028960817717206}

custom_classifier

data_type=huggingface data_name=imdb dataconfig=None subset=0 model_checkpoint=bert-base-cased task=custom_classifier outputdir=./output traintag=0807 training=True total_epochs=4 save_every=2 batch_size=8 learningrate=2e-05 oneitem all keys: dict_keys([‘text’, ‘label’]) {‘labels’: torch.Size([8]), ‘input_ids’: torch.Size([8, 512]), ‘attention_mask’: torch.Size([8, 512])} [‘labels’, ‘input_ids’, ‘attention_mask’] tensor(0.6916, grad_fn=<NllLossBackward0>) torch.Size([8, 2]) task custom_classifier: {‘accuracy’: 0.93576}

token_classifier

data_type=huggingface data_name=conll2003 dataconfig=None subset=0 model_checkpoint=bert-base-cased task=token_classifier outputdir=./output traintag=0807 training=False total_epochs=4 save_every=2 batch_size=8 learningrate=2e-05 oneitem all keys: dict_keys([‘id’, ‘tokens’, ‘pos_tags’, ‘chunk_tags’, ‘ner_tags’]) task token_classifier: {‘LOC’: {‘precision’: 0.0, ‘recall’: 0.0, ‘f1’: 0.0, ‘number’: 1837}, ‘MISC’: {‘precision’: 0.0, ‘recall’: 0.0, ‘f1’: 0.0, ‘number’: 922}, ‘ORG’: {‘precision’: 0.0, ‘recall’: 0.0, ‘f1’: 0.0, ‘number’: 1341}, ‘PER’: {‘precision’: 0.0, ‘recall’: 0.0, ‘f1’: 0.0, ‘number’: 1842}, ‘overall_precision’: 0.0, ‘overall_recall’: 0.0, ‘overall_f1’: 0.0, ‘overall_accuracy’: 0.764084299758639}

NLP dataset

GLUE, the General Language Understanding Evaluation benchmark (https://gluebenchmark.com/) is a collection of resources for training, evaluating, and analyzing natural language understanding systems. https://huggingface.co/datasets/glue

IMDb Reviews: http://ai.stanford.edu/~amaas/data/sentiment/ The IMDB dataset contains 25,000 movie reviews labeled by sentiment for training a model and 25,000 movie reviews for testing it.

E:\Dataset\NLPdataset> wget  http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz -o aclImdb_v1.tar.gz
E:\Dataset\NLPdataset> tar -xf .\aclImdb_v1.tar.gz
E:\Dataset\NLPdataset> ls .\aclImdb\

Mode                 LastWriteTime         Length Name
----                 -------------         ------ ----
d-----         4/12/2011  10:22 AM                test
d-----         6/25/2011   6:09 PM                train
-a----         4/12/2011  10:14 AM         845980 imdb.vocab
-a----         6/11/2011   3:54 PM         903029 imdbEr.txt
-a----         6/25/2011   5:18 PM           4037 README

E:\Dataset\NLPdataset> ls .\aclImdb\train\

Mode                 LastWriteTime         Length Name
----                 -------------         ------ ----
d-----         4/12/2011   2:47 AM                neg
d-----         4/12/2011   2:47 AM                pos
d-----         4/12/2011   2:47 AM                unsup
-a----         4/12/2011  10:17 AM       21021197 labeledBow.feat
-a----         4/12/2011  10:22 AM       41348699 unsupBow.feat
-a----         4/12/2011   2:48 AM         612500 urls_neg.txt
-a----         4/12/2011   2:48 AM         612500 urls_pos.txt
-a----         4/12/2011   2:47 AM        2450000 urls_unsup.txt

Sentiment Analysis tutorials: https://huggingface.co/blog/sentiment-analysis-python

SQuAD dataset

The dataset that is used the most as an academic benchmark for extractive question answering is SQuAD. There is also a harder SQuAD v2 benchmark, which includes questions that don’t have an answer. Your own dataset should contain a column for contexts, a column for questions, and a column for answers.

SQuAD: https://rajpurkar.github.io/SQuAD-explorer/ https://rajpurkar.github.io/SQuAD-explorer/explore/v2.0/dev/

E:\Dataset\NLPdataset\squad> wget https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json -O train-v2.0.json

QA tutorials: https://huggingface.co/docs/transformers/tasks/question_answering https://huggingface.co/learn/nlp-course/chapter7/7?fw=pt https://huggingface.co/transformers/v4.1.1/custom_datasets.html#question-answering-with-squad-2-0 A Model for Open Domain Long Form Question Answering: https://yjernite.github.io/lfqa.html

Reference

https://umap-learn.readthedocs.io/en/latest/index.html

Natural Language Processing with Transformers Book https://github.com/nlp-with-transformers/notebooks

CS224N: Natural Language Processing with Deep Learning https://web.stanford.edu/class/cs224n/

DistilBERT, a Distilled Version of BERT: Smaller, Faster, Cheaper and Lighter”, (2019)

CARER: Contextualized Affect Representations for Emotion Recognition Unlike most sentiment analysis datasets that involve just “positive” and “negative” polarities, this dataset contains six basic emotions: anger, disgust, fear, joy, sadness, and surprise. Given a tweet, our task will be to train a model that can classify it into one of these emotions.