Week 0: Sandbox
- (Sean) Played around with spaCy's pretrained dependency parser, using the small, medium, large, and transformer models on simple rules like subject-verb agreement.
- (Sean) Quickly noticed that the pretrained models were only trained on grammatical sentences; how to address this?
Week 1: Research
- (Sean) Found this research paper. Read through it a bunch of times.
- (Sean) Reached out to the author (Maxim Mozgovoy), who was kind enough to give some insight and linked another paper that inspired the CoNLL-U augmentor.
- (Sean, Pranshu) Looked for more ungrammatical sentences pretrained models had trouble with.
Week 2: More research, data gathering
- (Isaac) Found that UD treebanks could be used to train spaCy components. Made a Python script to convert a UD treebank to spaCy binary (see the sketch at the end of this week's notes).
- (Sean, Isaac) Learned that spaCy pretrained models were trained on OntoNotes 5.0, which was in PTB format. Isaac applied for and got access to OntoNotes 5.0, but we couldn't figure out how to convert constituency trees to dependency trees.
- (Pranshu) Came up with a ton of grammar rules using the pretrained models.
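For reference, spaCy's built-in `spacy convert` command can turn CoNLL-U files into the binary `.spacy` (DocBin) format that `spacy train` consumes. A minimal sketch of that kind of conversion script (the directory paths are hypothetical, and this is not Isaac's actual script):

```python
import subprocess
from pathlib import Path

# Hypothetical locations for the UD treebank and the converted output.
TREEBANK_DIR = Path("data/ud_treebank")
OUT_DIR = Path("corpus")
OUT_DIR.mkdir(parents=True, exist_ok=True)

for conllu_file in sorted(TREEBANK_DIR.glob("*.conllu")):
    # `spacy convert` with the conllu converter emits a .spacy (DocBin) file.
    subprocess.run(
        [
            "python", "-m", "spacy", "convert",
            str(conllu_file), str(OUT_DIR),
            "--converter", "conllu",
            "--n-sents", "10",  # group 10 sentences per Doc
        ],
        check=True,
    )
```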
Week 3: Hiatus for finals
- (Sean, Pranshu) Took a break to focus on finals.
- (Isaac) Had already finished his finals; found some old scripts to convert PTB to CoNLL-U format. Tried to get them working but couldn't.
Week 4: Hiatus for finals
- (Sean, Pranshu) Still studying!
- (Isaac) Kept trying to get the scripts working...
Week 5: Data preprocessing, learning spaCy architecture
- (Sean) Found that Stanford CoreNLP has an up-to-date PTB to CoNLL-U converter... quickly put a script together to convert the files (see the sketch at the end of this week's notes).
- (Sean) Set up a PC for GPU training on an RTX 4070 and did a ton of research on spaCy's NLP pipeline.
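The conversion script itself isn't reproduced here, but the rough shape is simple: call CoreNLP's dependency converter on each Penn Treebank file and capture the CoNLL-U output. A hedged sketch, assuming CoreNLP's `edu.stanford.nlp.trees.ud.UniversalDependenciesConverter` class accepts a `-treeFile` argument (verify against your CoreNLP version) and using hypothetical paths:

```python
import subprocess
from pathlib import Path

# Hypothetical paths to the CoreNLP jars and the PTB-style parse files.
CORENLP_CLASSPATH = "stanford-corenlp/*"
PTB_DIR = Path("data/ontonotes_ptb")
OUT_DIR = Path("data/ontonotes_conllu")
OUT_DIR.mkdir(parents=True, exist_ok=True)

for tree_file in sorted(PTB_DIR.glob("*.parse")):
    out_file = OUT_DIR / (tree_file.stem + ".conllu")
    # The converter writes CoNLL-U to stdout, so redirect it into a file.
    with out_file.open("w", encoding="utf-8") as fh:
        subprocess.run(
            [
                "java", "-cp", CORENLP_CLASSPATH,
                "edu.stanford.nlp.trees.ud.UniversalDependenciesConverter",
                "-treeFile", str(tree_file),
            ],
            stdout=fh,
            check=True,
        )
```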
Week 6: CoNLL-U augmentor
- (Sean) Found this library to generate different word forms.
- (Isaac) Made the CoNLL-U to spaCy script multi-threaded.
- (Sean) Made the PTB to CoNLL-U script multi-threaded.
- (Sean, Isaac) Initially developed a super barebones and pretty slow CoNLL-U augmentor.
- (Sean, Isaac) Added multiprocessing and multithreading to the augmentor.
- (Sean) Realized `word_forms.py` was not thread-safe. Did a quick workaround to make it thread-safe (here's the fork).
- (Sean) Developed a single-threaded version of the augmentor (to accommodate different systems).
- (Sean, Pranshu) Found a problem with the augmentor where it wouldn't change the POS for adjective/adverb switches. Pranshu came up with a great and simple solution (just use a dictionary).
- (Sean) Finished the augmentor by doing the following (see the sketch at the end of this week's notes)...
  - Use `[(dep_rel, child_pos_list, head_pos_list, old_tags_list, aug_tag, child_or_head, aug_probability)]` tuples to define augmentations.
  - Unpack CoNLL-U files as a 3D list (each word is a list of attributes, each sentence is a list of words, each file is a list of sentences). Dedicate `batch_size` threads to work on each file.
  - Perform augmentations by shuffling the rules array, picking the first rule that can be applied, and applying it with simple array indexing.
  - Batch all augmentations of `batch_size` files into a single CoNLL-U file by reformatting into CoNLL-U.
  - Dedicate multiple processes to work on each batch of files in parallel, so we can augment `batch_size * num_processes` files at once.
- (Sean) Did some benchmarking to find optimal batch size (120 was the best).
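To make the rule format above concrete, here is a stripped-down, single-threaded sketch of one augmentation pass. The tuple layout mirrors the list above, but the example rule, column indices, and helper logic are simplified stand-ins; the real augmentor also rewrites the word form itself and splits work across `batch_size` threads per file.

```python
import random

# (dep_rel, child_pos_list, head_pos_list, old_tags_list, aug_tag,
#  child_or_head, aug_probability) -- same layout as described above.
RULES = [
    # Example: make the verb of a nominal subject disagree in number.
    ("nsubj", ["NOUN", "PRON"], ["VERB"], ["VBZ"], "VBP", "head", 0.3),
]

# A word is a list of CoNLL-U columns; a sentence is a list of words.
UPOS, XPOS, HEAD, DEPREL = 3, 4, 6, 7

def augment_sentence(sentence):
    """Shuffle the rules, apply the first one that fits, via plain indexing."""
    rules = RULES[:]
    random.shuffle(rules)
    for dep_rel, child_pos, head_pos, old_tags, aug_tag, target, prob in rules:
        if random.random() > prob:
            continue
        for word in sentence:
            head_idx = int(word[HEAD]) - 1
            if head_idx < 0:  # skip the root
                continue
            head = sentence[head_idx]
            if (word[DEPREL] == dep_rel
                    and word[UPOS] in child_pos
                    and head[UPOS] in head_pos):
                victim = word if target == "child" else head
                if victim[XPOS] in old_tags:
                    victim[XPOS] = aug_tag   # the actual augmentation
                    return sentence          # one augmentation per sentence
    return sentence
```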
Week 7: Optimizing the augmentor, training the model
- (Sean) Ran augmentor script on OntoNotes 5.0, trained a model on it.
- (Sean) Kept getting models that were just worse versions of the pretrained models, and were also massive.
- (Sean) CoNLL-U augmentor optimizations...
  - Found this library, which was a much more accurate and faster way to generate different word forms. Used it in place of `word_forms.py`, but still used `adj_to_adv.txt` from `word_forms.py`.
  - Found a problem with the augmentor where the sentence ID was not synced between threads. Fixed this by using a 1-length list for the sentence ID (lists are mutable), passing it as a parameter, and using a lock to update it (see the sketch at the end of this week's notes).
  - In the end, we reduced augmentation times from >230s to 30-40s.
- (Sean) Trained a model without pretrained word embeddings. Tagger accuracy: 96%, parser LAS: 89%. However, this model was extremely overfit and performed horribly on short sentences... this overfitting and poor short-sentence performance would plague us for the next week.
- (Sean) Trained many models with a bunch of different hyperparameters and the GUM corpus.
- Tweaked learning rates, parser hidden layer width, dropout, Tok2Vec width and depth, batch size, and a whole lot more. None of these helped.
- Tried training on a mixture of pure/augmented OntoNotes 5.0 + pure/augmented GUM. None of these helped.
- Got desperate, tried training on CPU and Google Colab, for some reason thought this would help.
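The sentence-ID fix mentioned in the optimizations above boils down to sharing one mutable, lock-guarded counter across the worker threads. A minimal illustrative sketch (the names are placeholders, not the augmentor's actual ones):

```python
import threading

# A one-element list is mutable, so every thread sees the same value.
sentence_id = [0]
id_lock = threading.Lock()

def next_sentence_id(counter, lock):
    """Atomically hand out the next sentence ID to whichever thread asks."""
    with lock:
        counter[0] += 1
        return counter[0]

# Each worker receives the same (counter, lock) pair as parameters, so the
# IDs written to the output CoNLL-U file stay in sync across threads.
def worker(counter, lock):
    sid = next_sentence_id(counter, lock)
    print(f"# sent_id = {sid}")

threads = [threading.Thread(target=worker, args=(sentence_id, id_lock))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```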
Week 8: A production-ready model, building the website and English API
- (Sean) Finally got a production-ready model. Used a batch size of 32, trained only on OntoNotes 5.0, and tweaked ONLY Tok2Vec. For some reason this model does badly on short sentences, but only if they end with punctuation; we simply decided to strip punctuation from the input text, since we had trained over 20 models at that point.
- (Isaac) Started building this website (Svelte) and basically completed the English model API (FastAPI). The API at this point did not have a complete rule set, so it was not deployed.
- (Sean, Isaac) Worked on the website, ported the API to Flask and optimized latency.
- (Sean) Made rules for adjective/adverb confusion, verbs after modals, and verbs after prepositions.
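As an example of what these dependency-based rules look like, here is a hedged sketch of a verb-after-modal check; the pipeline name and messages are placeholders, and the production rules run on our own trained model rather than `en_core_web_sm`.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # placeholder model for illustration

def check_verb_after_modal(text):
    """Flag verbs governed by a modal that are not in base form (tag VB)."""
    issues = []
    for token in nlp(text):
        # Modals like "can"/"should" carry tag MD and attach to their verb
        # with dep "aux"; the governed verb should be a bare infinitive.
        if token.tag_ == "MD" and token.dep_ == "aux":
            verb = token.head
            if verb.pos_ == "VERB" and verb.tag_ != "VB":
                issues.append(
                    f"'{verb.text}' after modal '{token.text}' should be the base form"
                )
    return issues

print(check_verb_after_modal("She can plays the piano."))
```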
Week 9: GUM, morphological features, augmentor improvements
- (Sean) The previous model was bad at subject-verb agreement (it parsed singular verbs after plural subjects as part of the noun phrase). Fixed this by adding this error to the augmentor and kept the same config as before. Works great!
- (Sean) Scrapped verbs after prepositions rule.
- (Sean, Pranshu) Made rules for subject-verb agreement, and verb tense consistency.
- (Sean) Wrote documentation for the website.
- (Sean) Trained a new model on previous data appended with the unaugmented GUM corpus, with some tweaked hyperparameters.
This model is not used in production.
- (Isaac) Lots of styling fixes on the website.
- (Sean, Pranshu) Implemented passive voice detection (see the sketch at the end of this week's notes).
- (Sean) Gave the augmentor the ability to change morphological features. Augmented the GUM corpus with new rules
for copula usage (mainly 'to be') and subject-verb agreement. Trained english-v4.6 and english-v4.7 on this data.
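Passive voice detection (mentioned above) can be done straight off the dependency labels. A short sketch, assuming UD-style labels (`nsubj:pass`, `aux:pass`) like those in our CoNLL-U training data; spaCy's stock English models emit `nsubjpass`/`auxpass` instead, so both are checked:

```python
def is_passive(doc):
    """Return True if the parsed sentence contains a passive construction."""
    passive_deps = {"nsubj:pass", "aux:pass",    # UD-style labels
                    "nsubjpass", "auxpass"}      # labels in stock spaCy models
    return any(token.dep_ in passive_deps for token in doc)

# Usage, assuming `nlp` is an already-loaded pipeline:
#   is_passive(nlp("The ball was thrown by the boy."))  -> True
#   is_passive(nlp("The boy threw the ball."))          -> False
```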
Week 10: Massive improvements to rule system
- (Sean) Decided to stick with english-v4.6 for production since it doesn't have the punctuation issue. As a result,
removed the punctuation removal step from the API.
- (Sean, Pranshu) Adjusted the SVA rule to work for pronouns, and improved accuracy by considering the tense, plurality, AND person of the subject (see the sketch at the end of this week's notes).
- (Sean, Pranshu) Implemented copula rules with a special implementation for 'to be'.
- (Pranshu) Implemented complete sentence check.
- (Sean, Pranshu) Implemented rules for first-word capitalization, ending punctuation, a vs. an, and proper use of gerunds.
- (Sean) Added an augmentation to flip adjectives to adverbs; trained english-v4.9 (it was bad lol).
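The morphology-aware agreement check described above compares the subject's features against the verb's. A simplified sketch using spaCy's `token.morph` (the UD features `Number` and `Person` are standard; the decision logic here is intentionally cruder than the production rule):

```python
def subject_verb_agreement_issues(doc):
    """Compare Number/Person of each nominal subject with its governing verb."""
    issues = []
    for token in doc:
        if token.dep_ not in ("nsubj", "nsubj:pass"):
            continue
        verb = token.head
        if verb.pos_ not in ("VERB", "AUX"):
            continue
        subj_num, verb_num = token.morph.get("Number"), verb.morph.get("Number")
        subj_per, verb_per = token.morph.get("Person"), verb.morph.get("Person")
        # Only flag when both sides carry the feature and they disagree.
        if subj_num and verb_num and subj_num != verb_num:
            issues.append(f"'{token.text}' / '{verb.text}' disagree in number")
        elif subj_per and verb_per and subj_per != verb_per:
            issues.append(f"'{token.text}' / '{verb.text}' disagree in person")
    return issues
```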
Week 11: Spell checking
- (Sean, Isaac) Implemented spell checking using symspellpy as a separate model in the pipeline (see the sketch at the end of this week's notes).
- (Sean) Reformatted the API structure to accommodate the new spell checking model.
- (Sean, Pranshu) Extended copula rule to auxiliary verbs, implemented pronoun-antecedent agreement.
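For reference, symspellpy only needs a frequency dictionary and a lookup call; a minimal sketch of a spell-correction helper using the dictionary bundled with the package (the wrapper function is illustrative, not the API's actual code):

```python
import pkg_resources
from symspellpy import SymSpell, Verbosity

sym_spell = SymSpell(max_dictionary_edit_distance=2, prefix_length=7)
# symspellpy ships an English frequency dictionary we can load directly.
dictionary_path = pkg_resources.resource_filename(
    "symspellpy", "frequency_dictionary_en_82_765.txt"
)
sym_spell.load_dictionary(dictionary_path, term_index=0, count_index=1)

def correct_word(word: str) -> str:
    """Return the closest dictionary suggestion, or the word unchanged."""
    suggestions = sym_spell.lookup(
        word, Verbosity.CLOSEST, max_edit_distance=2, include_unknown=True
    )
    return suggestions[0].term if suggestions else word

print(correct_word("recieve"))  # -> "receive"
```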
Week 12: New augmentation system, final model training
- (Sean, Isaac) Did memory profiling on the API.
- (Sean) Removed Reddit data from the GUM corpus.
- (Sean) Made a new augmentor class for exact-word augmentations to implement homophone and subjective/objective pronoun rules (see the sketch at the end of this week's notes). Trained english-v4.10 on this data.
- (Sean, Pranshu) Implemented subjective vs objective pronouns and some common homophone rules.
- (Sean) Ensured contractions such as `'re` and `are` are seen as the same word.
- (Sean) Trained english-v4.11 on the same data as before, but with different hyperparameters. This is our final model; the metrics were on par with spaCy's transformer models... wow!
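The exact-word augmentor from this week boils down to swapping listed words for their confusables with some probability, so the model sees those error patterns during training. A toy sketch (the word lists are tiny samples, not the real rule set):

```python
import random

# Tiny illustrative samples of confusable pairs (homophones, pronoun case).
EXACT_SWAPS = {
    "their": ["there", "they're"],
    "your": ["you're"],
    "its": ["it's"],
    "me": ["I"],    # objective -> subjective pronoun error
    "him": ["he"],
}

def augment_tokens(tokens, prob=0.3):
    """Randomly replace listed words with a confusable alternative."""
    out = []
    for tok in tokens:
        swaps = EXACT_SWAPS.get(tok.lower())
        out.append(random.choice(swaps) if swaps and random.random() < prob else tok)
    return out

print(augment_tokens("their dog chased me".split()))
```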
Week 13: Cleaning up API structure, final rules, Docker, Gunicorn
- (Sean) Removed unnecessary model-last (we only use model-best).
- (Sean, Pranshu) Implemented preposition and determiner rules.
- (Sean) Pinned pydantic to v1.10.16 to fix containerization issues.
- (Sean, Isaac) Served the Flask app through Gunicorn in a Docker container.
Week 14: Deployment
- (Sean) Set up Nginx reverse proxy for SSL termination and rate limiting.
- (Sean, Isaac) Used multi-stage builds to reduce the Flask app image size.
- (Sean) Made website mobile friendly, wrote API usage guide.
- (Sean, Isaac) Deployed Gunicorn, Nginx, and Certbot to an EC2 instance using Docker Compose.
- (Sean) Set up an SSL certificate for api.grammacy.com, with a cron job on the EC2 instance for auto-renewal.
- (Sean) Deployed grammacy.com to Firebase Hosting.