Mesquita / Altinok | Mastering spaCy | E-book

E-book, English, 238 pages

Mesquita / Altinok, Mastering spaCy

Build structured NLP solutions with custom components and models powered by spacy-llm
2nd edition, 2025
ISBN: 978-1-83588-047-0
Publisher: De Gruyter
Format: EPUB
Copy protection: none




Mastering spaCy, Second Edition is your comprehensive guide to building sophisticated NLP applications using the spaCy ecosystem. This revised edition builds on the expertise of Duygu Altinok, a seasoned NLP engineer and spaCy contributor, and introduces new chapters by Déborah Mesquita, a data science educator and consultant known for making complex concepts accessible.
This edition embraces the latest advancements in NLP, featuring chapters on large language models with spacy-llm, transformer integration, and end-to-end workflow management with Weasel.
You'll learn how to enhance NLP tasks using LLMs, streamline workflows using Weasel, and integrate spaCy with third-party libraries such as Streamlit, FastAPI, and DVC. Starting with the fundamentals (tokenization, NER, and dependency parsing), you'll move on to advanced topics such as creating custom components, training domain-specific models, text classification, coreference resolution, and building scalable NLP workflows, with hands-on projects ranging from training custom Named Entity Recognition (NER) pipelines to categorizing emotions in Reddit posts.
Through practical examples, clear explanations, tips, and tricks, this book will equip you to build robust NLP pipelines and seamlessly integrate them into web applications for end-to-end solutions.



1 Getting Started with spaCy


In this chapter, we will get a comprehensive introduction to natural language processing (NLP) application development with Python and spaCy. First, we will see how NLP development goes hand in hand with Python, along with an overview of what spaCy offers as a Python library.

After the warm-up, you will quickly get started with spaCy by downloading the library and loading the models. You will then try out spaCy's popular visualizer, displaCy, to visualize language data and explore its various features.

By the end of this chapter, you will know what you can achieve with spaCy and have an overview of some of its key features. You will also have set up your development environment, which we will use throughout the book.

We’re going to cover the following topics:

  • Overview of spaCy
  • Installing spaCy
  • Installing spaCy’s language models
  • Visualization with displaCy

Technical requirements


The code of this chapter can be found at https://github.com/PacktPublishing/Mastering-spaCy-Second-Edition.

Overview of spaCy


NLP is a subfield of AI that analyzes text, speech, and other forms of human-generated language data. Human language is complicated – even a short paragraph contains references to the previous words, pointers to real-world objects, cultural references, and the writer's or speaker's personal experiences. Figure 1.1 shows such an example sentence, which includes a time expression (recently), phrases that can only be resolved from context (regarding the city that the speaker's parents live in), and general world knowledge (a city is a place where human beings live together):

Figure 1.1 – An example of human language, containing many cognitive and cultural aspects

How do we process such a complicated structure using computers? With spaCy, we can easily model natural language with statistical models, and process linguistic features to turn the text into a well-structured representation. This book provides all the necessary background and tools for you to extract the meaning from text.

With the launch of ChatGPT in November 2022, the whole world was impressed by the ability of a model to understand instructions and generate text in a way very similar to how we humans do. However, much like how a food processor can chop, slice, and puree in seconds, it’s not always the best tool for every job. Sometimes, all you need is a simple kitchen knife to get the task done quickly and efficiently. In the same way, while large language models (LLMs) such as ChatGPT are powerful and versatile, they can be overkill for many real-world applications where focused, efficient, and interpretable solutions are more appropriate.

That's why learning about libraries such as spaCy is so valuable. spaCy offers specialized tools for NLP that allow you to tackle specific tasks with speed, without the complexity and resource requirements of LLMs. And with spacy-llm, you can also incorporate LLM components into your spaCy pipelines. Whether you're building named entity recognizers, text classifiers, or tokenizers, spaCy provides the practical, well-optimized tooling that cuts through complex language tasks efficiently. Understanding when and how to use the right tools can make all the difference in building effective NLP systems.
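As a quick taste, the fragment below sketches what an LLM-backed NER component looks like in a spacy-llm pipeline config. This is a hypothetical fragment based on spacy-llm's config conventions (the llm factory and the @llm_tasks/@llm_models registry entries); the exact registry names and versions may differ from what is shown here.

```ini
[nlp]
lang = "en"
pipeline = ["llm"]

[components]

[components.llm]
factory = "llm"

[components.llm.task]
@llm_tasks = "spacy.NER.v3"
labels = ["PERSON", "ORG", "LOCATION"]

[components.llm.model]
@llm_models = "spacy.GPT-4.v2"
```

The task block declares what the LLM should do (here, NER with a fixed label set), while the model block declares which LLM backs the component.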

A high-level overview of the spaCy library


spaCy is an open source Python library designed to help us do real work. It’s pretty fast because its performance-critical parts are implemented in Cython, allowing for optimized speed while still being easy to use with Python. spaCy is shipped with pretrained language models and word vectors for 75+ languages.

Another famous and frequently used Python NLP library is the Natural Language Toolkit (NLTK). NLTK's focus is on giving students and researchers a way to explore ideas in language processing, whereas spaCy has focused on providing production-ready code from day one: you can expect models to perform on real-world data, the code to be efficient, and huge amounts of text data to be processed in a reasonable time.

The fact that spaCy is focused on real-world NLP applications is due not just to its processing speed but also to the ease of maintaining code for applications built with it. In her PyCon India 2019 keynote, Let Them Write Code (slides available at https://speakerdeck.com/inesmontani/let-them-write-code-keynote-pycon-india-2019), Ines Montani (one of the core developers of spaCy) discusses the design philosophy behind spaCy and the kinds of developer experiences the library tries to avoid.

With spaCy, we can break down each NLP application into pipeline components, reusing pre-built library components or creating our own custom components; we will dive deep into spaCy pipelines later in the book. The spaCy container objects (Doc, Token, and Span) make working with and processing text seamless, and we can train statistical models using spaCy's config system (https://spacy.io/usage/training#config), which brings modularity, flexibility, and clear declarative configuration, enabling easy customization and reuse of NLP components. spaCy also makes it easy to incorporate components that use LLMs in our NLP processing pipelines, helps us manage and share end-to-end workflows for different use cases and domains with Weasel, and integrates with other cool open source libraries such as DVC, Streamlit, and FastAPI; each of these topics gets its own chapter later in the book. All of this covers the main building blocks of how the spaCy library is structured and how it can help us build maintainable NLP solutions.
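To make the container objects concrete, here is a minimal sketch using a blank English pipeline, so no pretrained model download is needed (the sample sentence is our own):

```python
import spacy

# A blank English pipeline: tokenizer only, no pretrained model required
nlp = spacy.blank("en")

# Processing text returns a Doc, a container of Token objects
doc = nlp("spaCy turns raw text into structured objects.")
tokens = [token.text for token in doc]
print(tokens)

# A Span is a slice of a Doc
span = doc[0:2]
print(span.text)
```

Doc, Token, and Span all share the same underlying data, so slicing a Doc into Spans is cheap and does not copy the text.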

By now, I hope you’re excited to learn how to use all these cool features during our learning journey throughout the book. In the next section, let’s install spaCy so we can start coding.

Installing spaCy


Let's get started by installing and setting up spaCy. spaCy is compatible with 64-bit CPython 3.7+ and runs on Unix/Linux, macOS/OS X, and Windows. CPython is the reference implementation of Python in C. If you already have Python running on your system, your CPython modules are most probably fine too, so you don't need to worry about this detail. The newest spaCy releases are always downloadable via pip (https://pypi.org/) and conda (https://conda.io/en/latest/), two of the most popular Python package managers.

It's always a good idea to create a virtual environment to isolate each project's set of Python packages. On Windows, we can create a virtual environment and install spaCy with pip using these commands (on macOS/Linux, activate with source .env/bin/activate instead):

python -m venv .env
.env\Scripts\activate
pip install -U pip setuptools wheel
pip install -U spacy
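Once the install finishes, a quick sanity check from Python confirms the setup (our own snippet; a blank pipeline works without downloading any model):

```python
import spacy

# Confirm the package imports and report its version
print(spacy.__version__)

# A blank pipeline requires no downloaded model, so it works
# immediately after installation
nlp = spacy.blank("en")
print(nlp.lang)
```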

If your machine has a GPU available, you can install spaCy with GPU support with this command:

pip install -U 'spacy[cuda12x]'

You can see the installation instructions for each operating system at https://spacy.io/usage#quickstart. Figure 1.2 shows all the available installation options.

Figure 1.2 – spaCy installation options

After installing the library, we need to install the language models. Let’s do that in the next section.

Installing spaCy’s language models


The spaCy installation doesn't come with the statistical language models needed for the spaCy pipeline tasks. spaCy language models contain knowledge about a specific language collected from a set of resources. Language models let us perform a variety of NLP tasks, including part-of-speech (POS) tagging and named entity recognition (NER).

Models are language-specific, and there can be several different models available for the same language. The naming convention of the models is [lang]_[name]. The...
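To illustrate the [lang]_[name] convention, the snippet below picks apart a typical pipeline package name. The type/genre/size breakdown in the comments is our reading of spaCy's published model names (such as en_core_web_sm), used here purely for illustration:

```python
# spaCy model names follow [lang]_[name]; for the pretrained pipelines,
# the name part is further split into [type]_[genre]_[size]:
#   en   -> language code (English)
#   core -> pipeline type (general-purpose)
#   web  -> genre of the training data (web text)
#   sm   -> size (small)
model_name = "en_core_web_sm"
lang, pipeline_type, genre, size = model_name.split("_")
print(lang, pipeline_type, genre, size)
```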


Déborah Mesquita:

Déborah is a data science consultant and writer. With a BSc in Computer Science from UFPE, one of Brazil's top computer science programs, she brings a diversified skill set refined through hands-on experience with various technologies. Déborah has thrived in different data science projects, including roles such as lead data scientist and technical contributor for respected publications. Her ability to translate complex concepts into simple language, coupled with her quick learning and broad vision, makes her an effective educator. Actively engaged in community initiatives, she works to ensure equitable access to knowledge, reflecting her belief that technology is not a panacea, but a powerful tool for societal improvement when used for that purpose.

Duygu Altinok:

Duygu Altinok is a senior NLP engineer with 12 years of experience in almost all areas of NLP, including search engine technology, speech recognition, text analytics, and conversational AI. She has authored several publications in the NLP area at conferences such as LREC and CLNLP. She also enjoys working on open source projects and is a contributor to the spaCy library. Duygu earned her undergraduate degree in Computer Engineering from METU, Ankara, in 2010 and her Master's degree in Mathematics from Bilkent University, Ankara, in 2012. She is currently a senior engineer at German Autolabs with a focus on conversational AI for voice assistants. Originally from Istanbul, Duygu currently resides in Berlin, Germany, with her cute dog, Adele.


