Natural Language Processing is what it sounds like. Here are some miscellaneous bits and bobs about it.
NLP is built around a *pipeline* of stages, each of which processes the text in turn. These stages may include:
> [!example] default spaCy pipeline
> 1. Tokenizer
> 2. Tagger
> 3. Parser
> 4. Entity Recognizer
> 5. Lemmatizer
> 6. Text Categorizer
I really like this graphic from the source:
![[Pasted image 20250802111807.png]]
## Doing NLP in Python
While there are a bunch of other options, the [[Python]] library we used for NLP was [spaCy](https://spacy.io/usage/spacy-101). SpaCy[^1] includes a lot of features and is as simple to use as:
```python
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
for token in doc:
    print(token.text, token.pos_, token.dep_)
```
spaCy can be used for:
- **Tokenization** - segmenting text into words, punctuation - similar to, but not exactly the same as, how [[Large Language Models|LLMs]] use tokens
- **Lemmatization** - reducing word instances to their base form, sans tense & plurality. e.g. "was" becomes "be" and "shirts" becomes "shirt"
- **Training** - you can train or fine-tune pipeline components on your own labeled data
- **POS** - you can tag *parts of speech*
And lots more.
****
# More
## Source
- Grad School
- https://spacy.io/usage/spacy-101#pipelines
[^1]: I never know how to handle capitalization in situations like this.