Natural Language Processing is what it sounds like. Here are some miscellaneous bits and bobs about it.
NLP is built around a *pipeline* of stages, each of which processes the text in turn. These stages may include:
> [!example] default spaCy pipeline
> 1. Tokenizer
> 2. Tagger
> 3. Parser
> 4. Entity Recognizer
> 5. Lemmatizer
> 6. Text Categorizer
I really like this graphic from the source:
![[Pasted image 20250802111807.png]]
## Doing NLP in Python
While there are a bunch of other options, the [[Python]] library we used for NLP was [spaCy](https://spacy.io/usage/spacy-101). SpaCy[^1] includes a lot of features and is as simple to use as:
```python
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
for token in doc:
    print(token.text, token.pos_, token.dep_)
```
spaCy can be used for:
- **Tokenization** - segmenting text into words, punctuation - similar to, but not exactly the same as, how [[Large Language Models|LLMs]] use tokens
- **Lemmatization** - reducing word instances to their base form, sans tense & plurality. e.g. "was" becomes "be" and "shirts" becomes "shirt"
- **Training** - you can train or fine-tune pipeline components on your own labeled data
- **POS** - you can tag *parts of speech*
And lots more.
****
# More
## Source
- Grad School
- https://spacy.io/usage/spacy-101#pipelines
[^1]: I never know how to handle capitalization in situations like this.