E-book, English, 238 pages
Mesquita / Altinok, Mastering spaCy
2nd edition, 2025
ISBN: 978-1-83588-047-0
Publisher: De Gruyter
Format: EPUB
Copy protection: 0 (no protection)
Build structured NLP solutions with custom components and models powered by spacy-llm
Mastering spaCy, Second Edition is your comprehensive guide to building sophisticated NLP applications using the spaCy ecosystem. This revised edition builds on the expertise of Duygu Altinok, a seasoned NLP engineer and spaCy contributor, and introduces new chapters by Déborah Mesquita, a data science educator and consultant known for making complex concepts accessible.
This edition embraces the latest advancements in NLP, featuring chapters on large language models with spacy-llm, transformer integration, and end-to-end workflow management with Weasel.
You'll learn how to enhance NLP tasks with LLMs, streamline workflows with Weasel, and integrate spaCy with third-party libraries such as Streamlit, FastAPI, and DVC. Starting with the fundamentals (tokenization, NER, and dependency parsing), you'll progress to advanced topics such as creating custom components, training domain-specific models, text classification, coreference resolution, and building scalable NLP workflows, with hands-on projects ranging from training custom Named Entity Recognition (NER) pipelines to categorizing emotions in Reddit posts.
Through practical examples, clear explanations, tips, and tricks, this book will equip you to build robust NLP pipelines and seamlessly integrate them into web applications for end-to-end solutions.
1  Getting Started with spaCy
In this chapter, we will have a comprehensive introduction to natural language processing (NLP) application development with Python and spaCy. First, we will see how NLP development can go hand in hand with Python, along with an overview of what spaCy offers as a Python library.
After the warm-up, you will quickly get started with spaCy by downloading the library and loading its models. You will then meet spaCy’s popular visualizer, displaCy, and use it to visualize language data and examine its various features.
By the end of this chapter, you will know what you can achieve with spaCy and will have an overview of some of its key features. You will also have set up your development environment, which will be used throughout the chapters of this book.
We’re going to cover the following topics:
- Overview of spaCy
- Installing spaCy
- Installing spaCy’s language models
- Visualization with displaCy
Technical requirements
The code of this chapter can be found at https://github.com/PacktPublishing/Mastering-spaCy-Second-Edition.
Overview of spaCy
NLP is a subfield of AI that analyzes text, speech, and other forms of human-generated language data. Human language is complicated – even a short paragraph contains references to previous words, pointers to real-world objects, cultural references, and the writer’s or speaker’s personal experiences. Figure 1.1 shows such an example sentence, which includes a relative time expression (recently), phrases that can only be resolved from context (regarding the city that the speaker’s parents live in), and concepts that rely on world knowledge (a city is a place where human beings live together):
Figure 1.1 – An example of human language, containing many cognitive and cultural aspects
How do we process such a complicated structure using computers? With spaCy, we can easily model natural language with statistical models, and process linguistic features to turn the text into a well-structured representation. This book provides all the necessary background and tools for you to extract the meaning from text.
With the launch of ChatGPT in November 2022, the whole world was impressed by the ability of a model to understand instructions and generate text in a way very similar to how we humans do. However, an LLM is much like a food processor: it can chop, slice, and puree in seconds, but it is not always the best tool for every job. Sometimes, all you need is a simple kitchen knife to get the task done quickly and efficiently. In the same way, while large language models (LLMs) such as ChatGPT are powerful and versatile, they can be overkill for many real-world applications where focused, efficient, and interpretable solutions are more appropriate.
That’s why learning about libraries such as spaCy is so valuable. spaCy offers specialized tools for NLP that let you tackle specific tasks with speed, without the complexity and resource requirements of LLMs. And with spacy-llm, you can incorporate LLM components into your spaCy pipelines as well. Whether you’re building named entity recognizers, text classifiers, or tokenizers, spaCy provides the practical, well-optimized tooling that cuts through complex language tasks efficiently. Understanding when and how to use the right tools can make all the difference in building effective NLP systems.
A high-level overview of the spaCy library
spaCy is an open source Python library designed to help us do real work. It’s pretty fast because its performance-critical parts are implemented in Cython, allowing for optimized speed while still being easy to use with Python. spaCy is shipped with pretrained language models and word vectors for 75+ languages.
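To give a first taste of what working with the library looks like, here is a minimal sketch (assuming only that spaCy is installed in your environment); even a blank pipeline with no downloaded pretrained model provides tokenization out of the box:

```python
import spacy

# Create a blank English pipeline; no pretrained model download is required.
nlp = spacy.blank("en")

# Processing a string returns a Doc object holding the tokenized text.
doc = nlp("spaCy is an open source Python library.")
print([token.text for token in doc])
# → ['spaCy', 'is', 'an', 'open', 'source', 'Python', 'library', '.']
```

The same `nlp` object is the entry point for everything that follows in this book: loading pretrained models simply swaps the blank pipeline for one with trained components.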
Another famous and frequently used Python library is the Natural Language Toolkit (NLTK). NLTK was designed to introduce students and researchers to language-processing concepts, whereas spaCy has focused on providing production-ready code from day one. You can expect its models to perform on real-world data, the code to be efficient, and huge amounts of text data to be processed in a reasonable time.
spaCy’s focus on real-world NLP applications is due not just to its processing speed but also to the ease of maintaining code for applications built with it. In her PyCon India 2019 keynote, Let Them Write Code (slides available at https://speakerdeck.com/inesmontani/let-them-write-code-keynote-pycon-india-2019), Ines Montani (one of the core makers of spaCy) discusses the philosophy behind the creation of spaCy: as the title suggests, the library should let developers write code rather than fight their tools.
With spaCy, we can break down each NLP application into pipeline components, reusing pre-built library components or creating our own custom ones; we will dive deep into spaCy pipelines later in the book. The spaCy container objects (Doc, Token, and Span) make working with and processing text seamless, and we can train statistical models using spaCy’s config system (https://spacy.io/usage/training#config), which brings modularity, flexibility, and clear declarative configuration, enabling easy customization and reuse of NLP components. spaCy also makes it easy to incorporate components that use LLMs in our NLP processing pipelines, and it helps us manage and share end-to-end workflows for different use cases and domains with Weasel. Finally, spaCy integrates with other open source libraries such as DVC, Streamlit, and FastAPI. Together, these are the main building blocks of how the spaCy library is structured and how it can help us build maintainable NLP solutions.
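As a small illustration of this modularity (a sketch using a hypothetical component name, assuming only that spaCy is installed), a custom pipeline component is simply a function registered with the Language class that receives a Doc and returns it:

```python
import spacy
from spacy.language import Language

# "question_detector" is a hypothetical component name used for illustration.
@Language.component("question_detector")
def question_detector(doc):
    # Store a flag in the Doc's user_data dictionary.
    doc.user_data["is_question"] = any(token.text == "?" for token in doc)
    return doc

nlp = spacy.blank("en")
nlp.add_pipe("question_detector")

doc = nlp("Is spaCy fast?")
print(nlp.pipe_names)                 # → ['question_detector']
print(doc.user_data["is_question"])   # → True
```

Because each component is an isolated, named unit, it can be reused, reordered, or swapped out without touching the rest of the pipeline, which is exactly what makes spaCy applications easy to maintain.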
By now, I hope you’re excited to learn how to use all these cool features during our learning journey throughout the book. In the next section, let’s install spaCy so we can start coding.
Installing spaCy
Let’s get started by installing and setting up spaCy. spaCy is compatible with 64-bit CPython 3.7+ and runs on Unix/Linux, macOS/OS X, and Windows. CPython is the reference implementation of Python, written in C; if you already have Python running on your system, your CPython modules are most probably fine too, so you don’t need to worry about this detail. The newest spaCy releases are always downloadable via pip (https://pypi.org/) and conda (https://conda.io/en/latest/), two of the most popular package managers.
It’s always a good idea to create a virtual environment to isolate the independent set of Python packages for each project. On Windows, we can create a virtual environment and install spacy with pip using these commands:
python -m venv .env
.env\Scripts\activate
pip install -U pip setuptools wheel
pip install -U spacy

If your machine has a GPU available, you can install spaCy with GPU support with this command:

pip install -U 'spacy[cuda12x]'

You can see the installation instructions for each operating system at https://spacy.io/usage#quickstart. Figure 1.2 shows all the available installation options.
Figure 1.2 – spaCy installation options
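Once the install completes, a quick sanity check (a minimal sketch, assuming the installation succeeded) confirms that the library imports and reports its version:

```python
import spacy

# Print the installed spaCy version to confirm the installation worked.
print(spacy.__version__)
```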
After installing the library, we need to install the language models. Let’s do that in the next section.
Installing spaCy’s language models
The spaCy installation doesn’t come with the statistical language models needed for the spaCy pipeline tasks. spaCy language models contain knowledge about a specific language collected from a set of resources. Language models let us perform a variety of NLP tasks, including part-of-speech (POS) tagging and named entity recognition (NER).
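As a preview (shown here for the small English pipeline; adjust the model name for your language), models are fetched with spaCy’s own download command, and you can check that installed pipelines match your spaCy version with the validate command:

```shell
# Download the small English pipeline (requires network access)
python -m spacy download en_core_web_sm

# Check that installed pipelines are compatible with your spaCy version
python -m spacy validate
```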
Different languages have different models that are language-specific. There are also different models available for the same language. The naming convention of the models is [lang]_[name]. The...




