<p>Jeroen Janssens (jeroen@jeroenjanssens.com), <a href="https://jeroenjanssens.com/">https://jeroenjanssens.com/</a></p>
<h1>We Are Writing Python Polars: The Definitive Guide</h1>
<p>2023-06-06, <a href="https://jeroenjanssens.com/pp/">https://jeroenjanssens.com/pp/</a></p>
<p>I’m excited to announce, on my 40<sup>th</sup> birthday no less, that
I’ll be writing another book. But this time I won’t be alone. <a href="https://twitter.com/thijsnieuwdorp">Thijs
Nieuwdorp</a> is joining me in this
adventure that we’ve dubbed <em>Python Polars: The Definitive Guide</em>. We
expect our upcoming O’Reilly title to be about 400 pages and to hit the
shelves in Q3 2024. Fun fact: Thijs and I are colleagues at
<a href="https://www.xomnia.com/">Xomnia</a>, the very birthplace of Polars.</p>
<figure>
<a href="https://jeroenjanssens.com/img/social/an-impressionist-oil-painting-of-a-polar-bear-and-a-python-reading-a-book.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 1)" srcset="https://jeroenjanssens.com/img/yB3Ydy0deX-336.webp 336w, https://jeroenjanssens.com/img/yB3Ydy0deX-504.webp 504w, https://jeroenjanssens.com/img/yB3Ydy0deX-672.webp 672w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 1)" srcset="https://jeroenjanssens.com/img/yB3Ydy0deX-336.webp 336w, https://jeroenjanssens.com/img/yB3Ydy0deX-504.webp 504w, https://jeroenjanssens.com/img/yB3Ydy0deX-672.webp 672w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 1)" srcset="https://jeroenjanssens.com/img/yB3Ydy0deX-336.jpeg 336w, https://jeroenjanssens.com/img/yB3Ydy0deX-504.jpeg 504w, https://jeroenjanssens.com/img/yB3Ydy0deX-672.jpeg 672w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 1)" srcset="https://jeroenjanssens.com/img/yB3Ydy0deX-336.jpeg 336w, https://jeroenjanssens.com/img/yB3Ydy0deX-504.jpeg 504w, https://jeroenjanssens.com/img/yB3Ydy0deX-672.jpeg 672w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 100%;" src="https://jeroenjanssens.com/img/yB3Ydy0deX-336.jpeg" alt="An impressionist oil painting of a polar bear and a python reading a book. Any similarity to the authors is entirely coincidental." loading="lazy" />
</picture></a>
<figcaption>An impressionist oil painting of a polar bear and a python reading a book. Any similarity to the authors is entirely coincidental.</figcaption>
</figure>
<p>A big thank you to <a href="https://www.linkedin.com/in/aaronblackcapm/">Aaron
Black</a> for helping us to
seal this deal. We’re looking forward to working again with <a href="https://twitter.com/GreyEditing">Sarah
Grey</a>. Sarah was also the development
editor for the second edition of <em><a href="https://jeroenjanssens.com/dsatcl">Data Science at the Command
Line</a></em>.</p>
<h2>Stay up to date</h2>
<p>We’ll share regular updates via Twitter
(<a href="https://twitter.com/jeroenhjanssens">JJ</a>,
<a href="https://twitter.com/thijsnieuwdorp">TN</a>) and LinkedIn
(<a href="https://www.linkedin.com/in/jeroenjanssens/">JJ</a>,
<a href="https://www.linkedin.com/in/thijsnieuwdorp/">TN</a>). Sign up for my
newsletter if you want to receive an email when the book is out:</p>
<script src="https://f.convertkit.com/ckjs/ck.5.js"></script>
<form action="https://app.convertkit.com/forms/4791907/subscriptions" class="seva-form formkit-form font-sans" method="post" data-sv-form="4791907" data-uid="8a19ffd998" data-format="inline" data-version="5" data-options='{"settings":{"after_subscribe":{"action":"message","success_message":"Success! Now check your email to confirm your subscription.","redirect_url":""},"analytics":{"google":null,"fathom":null,"facebook":null,"segment":null,"pinterest":null,"sparkloop":null,"googletagmanager":null},"modal":{"trigger":"timer","scroll_percentage":null,"timer":5,"devices":"all","show_once_every":15},"powered_by":{"show":true,"url":"https://convertkit.com/features/forms?utm_campaign=poweredby&utm_content=form&utm_medium=referral&utm_source=dynamic"},"recaptcha":{"enabled":false},"return_visitor":{"action":"show","custom_content":""},"slide_in":{"display_in":"bottom_right","trigger":"timer","scroll_percentage":null,"timer":5,"devices":"all","show_once_every":15},"sticky_bar":{"display_in":"top","trigger":"timer","scroll_percentage":null,"timer":5,"devices":"all","show_once_every":15}},"version":"5"}' min-width="400 500 600 700 800">
<div data-style="clean">
<ul class="formkit-alert formkit-alert-error" data-element="errors" data-group="alert">
</ul>
<div class="seva-fields formkit-fields" data-element="fields" data-stacked="false">
<div class="grid sm:grid-cols-2 gap-4">
<input type="hidden" name="fields[signup_url]" value="/pp/" />
<div>
<input class="w-full border-green-800 border-2 font-serif ring-2 ring-yellow-300" aria-label="First Name" name="fields[first_name]" required="" placeholder="First Name" type="text" />
</div>
<div>
<input class="w-full border-green-800 border-2 font-serif ring-2 ring-yellow-300" name="email_address" aria-label="Email Address" placeholder="Email Address" required="" type="email" />
</div>
<div class="sm:col-span-2 place-self-end">
<button data-element="submit" class="sm:hover:bg-yellow-300 sm:hover:text-black bg-green-800 rounded-lg px-6 py-2 text-white font-sans text-base">
Sign Up
</button>
</div>
</div>
</div>
</div>
</form>
<p>If you want to help us spread the word, you can like or share this
announcement on
<a href="https://twitter.com/jeroenhjanssens/status/1666072496121032706">Twitter</a>
and
<a href="https://www.linkedin.com/feed/update/urn:li:activity:7071833117296062464/">LinkedIn</a>.
Your help is much appreciated.</p>
<h2>About Polars</h2>
<p>Polars is a highly performant DataFrame library for manipulating
structured data. The core is written in Rust, and the library is
officially available in Python, Rust, NodeJS, R, and SQL. Its three key
selling points are:</p>
<ul>
<li>Record-breaking speed on common DataFrame operations</li>
<li>Processing of larger-than-memory datasets</li>
<li>Explicit, concise, and flexible syntax</li>
</ul>
<figure>
<a href="https://jeroenjanssens.com/img/star-history-polars.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 1)" srcset="https://jeroenjanssens.com/img/l_iUp1DxQI-336.webp 336w, https://jeroenjanssens.com/img/l_iUp1DxQI-504.webp 504w, https://jeroenjanssens.com/img/l_iUp1DxQI-672.webp 672w, https://jeroenjanssens.com/img/l_iUp1DxQI-1008.webp 1008w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 1)" srcset="https://jeroenjanssens.com/img/l_iUp1DxQI-336.webp 336w, https://jeroenjanssens.com/img/l_iUp1DxQI-504.webp 504w, https://jeroenjanssens.com/img/l_iUp1DxQI-672.webp 672w, https://jeroenjanssens.com/img/l_iUp1DxQI-1008.webp 1008w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 1)" srcset="https://jeroenjanssens.com/img/l_iUp1DxQI-336.jpeg 336w, https://jeroenjanssens.com/img/l_iUp1DxQI-504.jpeg 504w, https://jeroenjanssens.com/img/l_iUp1DxQI-672.jpeg 672w, https://jeroenjanssens.com/img/l_iUp1DxQI-1008.jpeg 1008w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 1)" srcset="https://jeroenjanssens.com/img/l_iUp1DxQI-336.jpeg 336w, https://jeroenjanssens.com/img/l_iUp1DxQI-504.jpeg 504w, https://jeroenjanssens.com/img/l_iUp1DxQI-672.jpeg 672w, https://jeroenjanssens.com/img/l_iUp1DxQI-1008.jpeg 1008w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 100%;" src="https://jeroenjanssens.com/img/l_iUp1DxQI-336.jpeg" alt="Polars is still young compared to related technologies, but it's quickly gaining popularity." loading="lazy" />
</picture></a>
<figcaption>Polars is still young compared to related technologies, but it's quickly gaining popularity.</figcaption>
</figure>
<p>For more information see the <a href="https://www.pola.rs/">official Polars
website</a> and the <a href="https://github.com/pola-rs/polars">Polars GitHub
repository</a>.</p>
<h2>Foreword by Ritchie Vink</h2>
<p><a href="https://twitter.com/RitchieVink">Ritchie Vink</a>, the creator of Polars,
has kindly agreed to write the foreword. We couldn’t wish for a bigger
endorsement. Ritchie has no interest in writing a book himself as he
wants to focus all his time and attention on developing Polars. He’s
very excited that Thijs and I will write this book and he’s happy to
provide assistance throughout the writing process.</p>
<h2>Tentative description</h2>
<p>Get ready to speed up your data analysis and start working with
larger-than-memory datasets. Polars offers a blazingly fast,
multi-threaded, elegant API for data loading, manipulation, and
processing. Authors Jeroen Janssens and Thijs Nieuwdorp walk you through
every aspect of Python Polars as they tackle practical use cases using
real-world datasets. You’ll not only learn the syntax, but also
understand the underlying concepts. You don’t need to have any
experience with Pandas or Spark, but if you do, this book will help you
make a smooth transition.</p>
<p>With this definitive guide at your side, you’ll be able to:</p>
<ul>
<li>Process larger-than-memory datasets at record speed</li>
<li>Apply the eager, lazy, and streaming APIs of Polars and decide when to
use which</li>
<li>Transition smoothly from Pandas or Spark to Polars</li>
<li>Integrate Polars into your existing codebase</li>
<li>Work with Arrow and Parquet to efficiently read and write data</li>
<li>Translate complex ETL tasks into efficient and elegant queries</li>
</ul>
<h2>Tentative outline</h2>
<p>We’re quite happy with this outline, but it’s definitely not set in
stone. If you have any ideas don’t hesitate to <a href="mailto:jeroen@jeroenjanssens.com">reach
out</a>.</p>
<h3>Part I: Getting Started</h3>
<h4>Chapter 1: Introducing Polars</h4>
<p>The goal of this chapter is to get you excited about Polars as soon as
possible, by discussing where it comes from, covering its unique
features regarding speed and elegance, explaining how it fits into the
bigger picture, and walking you through a case study on a real-world
public dataset.</p>
<ul>
<li>Origin Story</li>
<li>Polars Philosophy and Features</li>
<li>Polars within the Bigger Ecosystem</li>
<li>Why Focus on Python Polars?</li>
<li>A Real-World Case Study</li>
</ul>
<h4>Chapter 2: First Steps</h4>
<p>Once you’re excited, it’s important to get you on board, so you can
follow along and run the code samples yourself. The goal of this
chapter is to help you get set up, whether you’re installing Polars
using <code>pip install</code>, using it via our accompanying Docker image, or
compiling it from scratch.</p>
<ul>
<li>Installing Polars</li>
<li>Using Polars in a Docker Container</li>
<li>Compiling Polars from Scratch</li>
<li>Importing Polars</li>
<li>Configuring Polars</li>
</ul>
<h4>Chapter 3: Transitioning from Pandas or Spark to Polars</h4>
<p>We expect many readers to have experience with Pandas or Spark. In this
chapter we ensure that their transition to Polars is as smooth as
possible by highlighting the similarities and, more importantly, the key
differences between these tools.</p>
<ul>
<li>Similarities</li>
<li>No Index and MultiIndex</li>
<li>NumPy Versus Arrow Arrays</li>
<li>Rows Versus Columns</li>
<li>Differences in Syntax</li>
<li>Common Pitfalls to Avoid</li>
</ul>
<h3>Part II: Concepts and Syntax</h3>
<p>This part forms the heart of the book. The goal is to explain all the
functionality needed to analyze data efficiently and effectively. The
chapters are meant to complement the online documentation. That means
they will not be just a list of methods. Instead, we will use real-world
public datasets, provide context, and explain the why and how behind an
approach. If there are multiple approaches to accomplish a task, we will
discuss the pros and cons of each.</p>
<h4>Chapter 4: Data Types and Data Structures</h4>
<p>The goal of this chapter is to introduce the fundamental data types and
data structures. All functionality interacts with these, so it’s
important to introduce them at the beginning.</p>
<ul>
<li>Arrow Data Types</li>
<li>Series</li>
<li>DataFrame</li>
<li>LazyFrame</li>
</ul>
<h4>Chapter 5: Eager, Lazy, and Streaming APIs</h4>
<p>In this chapter we explain the different types of APIs Polars has to
offer.</p>
<ul>
<li>Collecting</li>
<li>Caching</li>
<li>Performance Differences</li>
<li>Functionality Differences</li>
<li>When to Use Which API?</li>
</ul>
<h4>Chapter 6: Reading and Writing Data</h4>
<p>We want to encourage the reader to start working with their own data as
soon as possible. In this chapter we demonstrate the various ways to
read data into Polars and to write the result back.</p>
<ul>
<li>CSV</li>
<li>Excel</li>
<li>Parquet</li>
<li>JSON</li>
<li>Multiple Files</li>
<li>Databases</li>
<li>AWS</li>
<li>Google BigQuery</li>
</ul>
<h4>Chapter 7: Expressions</h4>
<p>The goal of this chapter is to introduce Expressions, which are what
makes the Polars API so powerful and elegant. They play an essential
role in the remaining chapters of Part II.</p>
<ul>
<li>Operators</li>
<li>Composing Expressions</li>
<li>Functions</li>
<li>Type Casting</li>
<li>Renaming</li>
</ul>
<h4>Chapter 8: Selecting and Creating Columns</h4>
<p>The goal of this chapter is to explain how existing columns in a
DataFrame can be rearranged or dropped and new columns can be created.
We’re going to apply the various functions on real-world datasets.</p>
<ul>
<li>Selection Context</li>
<li>Regular Expressions</li>
<li><code>.with_columns()</code> and Relevant Expressions</li>
<li>Adding Row Counts</li>
</ul>
<h4>Chapter 9: Filtering and Sorting Rows</h4>
<p>Whereas the previous chapter was about columns, this chapter is all
about the rows in a DataFrame. How can rows be sorted or discarded based
on some condition? Again, we’re going to demonstrate the various
functions by using real-world datasets.</p>
<ul>
<li>Filtering Context</li>
<li>Predicates</li>
<li>Compound Predicates</li>
<li>Sorting</li>
<li>Sorting in a Selection Context</li>
</ul>
<h4>Chapter 10: Working with Special Data Types</h4>
<p>There are certain data types that deserve special attention. This
chapter covers how to deal with strings, categories, time series,
columns that contain lists as values, and missing values.</p>
<ul>
<li>Strings</li>
<li>Categories</li>
<li>Temporal Data</li>
<li>Lists</li>
<li>Missing Values</li>
</ul>
<h4>Chapter 11: Summarizing and Aggregating</h4>
<p>This chapter discusses how the reader can summarize and aggregate their
data. There are various ways to do this, and it’s important to know when
to use which.</p>
<ul>
<li>Groupby Context</li>
<li><code>.over()</code> Expressions in Selection Context</li>
<li>Dynamic Grouping</li>
<li>Rolling Aggregations</li>
</ul>
<h4>Chapter 12: Joining and Concatenating</h4>
<p>Data often comes from multiple sources. In this chapter we explain
different ways in which these sources can be combined.</p>
<ul>
<li>Basic Joining</li>
<li>Semi and Anti Joining</li>
<li>Inexact Joining</li>
<li>Vertical Concatenation</li>
<li>Horizontal Concatenation</li>
</ul>
<h4>Chapter 13: Reshaping</h4>
<p>The same values can be represented in a long or wide format (or
something in between). This chapter covers different ways to reshape the
data.</p>
<ul>
<li>Wide Versus Long DataFrames</li>
<li>Pivot to Wider DataFrame</li>
<li>Melt to Longer DataFrame</li>
<li>Exploding</li>
<li>Correlating</li>
<li>Partition Into Multiple DataFrames</li>
</ul>
<h3>Part III: Advanced Topics</h3>
<h4>Chapter 14: Extending Polars</h4>
<p>Sometimes you just need additional functionality and business logic in
your data analysis. This chapter explains how to properly create User
Defined Functions and extend the Polars data structures with additional
expressions and methods so that the code remains fast and elegant.</p>
<ul>
<li>User Defined Functions</li>
<li>Custom Expressions</li>
<li>Custom Methods</li>
</ul>
<h4>Chapter 15: SQL with Polars</h4>
<p>Polars allows you to apply SQL queries directly on DataFrames. If you
already know SQL, then that can be very useful. This chapter explains
how to do that in Python and from the command line.</p>
<ul>
<li><code>SELECT</code> Queries</li>
<li><code>CREATE</code> Queries</li>
<li>Common Table Expressions</li>
<li>Command-Line Interface</li>
</ul>
<h4>Chapter 16: Debugging and Testing with Polars</h4>
<p>When a data analysis has to be put in production, it’s important to be
able to deal with exceptions and to add appropriate unit tests. This
chapter explains how to debug and test your Polars code.</p>
<ul>
<li>Explaining Query Plans</li>
<li>Using Polars in Unit Tests</li>
<li>Polars Exceptions and Asserts</li>
<li>Parametric Testing</li>
</ul>
<h4>Chapter 17: Polars Internals</h4>
<p>In this chapter we take a look under the hood of Polars. If the reader
understands what makes Polars fast, then they’ll be able to avoid
writing code that slows it down.</p>
<ul>
<li>What Makes Polars so Fast?</li>
<li>Query Optimization</li>
<li>Multi-Threaded Computations</li>
<li>SIMD Operations</li>
</ul>
<h4>Chapter 18: Integrating with Other Tools</h4>
<p>Polars is part of a larger PyData ecosystem. Thanks to Apache Arrow,
Polars is able to work together seamlessly with other tools. This
chapter explains how to integrate Polars with those tools.</p>
<ul>
<li>Pandas</li>
<li>PyArrow</li>
<li>DuckDB</li>
</ul>
<h1>Stem-and-Leaf Plot Playground</h1>
<p>2023-04-23, <a href="https://jeroenjanssens.com/stem/">https://jeroenjanssens.com/stem/</a></p>
<style>
ul.stems {
list-style-type: none;
}
ul.stems, ul.leaves, li.stem, li.leaf {
margin: 0 !important;
line-height: 1;
}
li.stem a, li.leaf a {
text-decoration-line: none;
color: inherit;
}
ul.stems > li > label {
display: inline-block;
padding: .1em .5em;
border-right: 2px solid #000;
text-align: right;
font-weight: bold;
width: 2em;
}
ul.leaves {
display: inline-block;
padding-left: .5em;
}
ul.leaves > li {
display: inline-block;
width: .7em;
text-align: center;
position: relative;
cursor: default;
padding: .1em 0;
}
ul.leaves > li:hover {
background: #000;
color: #fff;
}
ul.leaves > li:hover:after {
background: #ccc;
border: 1px solid #000;
color: #000;
content: attr(data-value);
position: absolute;
padding: .2em .5em;
top: -2em;
left: 1em;
pointer-events: none;
font-size: 80%;
}
textarea {
border: 1px solid #333;
width: 100%;
}
</style>
<p><em>Inspired by Julia Evans’ post <a href="https://jvns.ca/blog/2023/04/17/a-list-of-programming-playgrounds/">A list of programming
playgrounds</a>,
I decided to revive a playground I made about 10 years ago.</em></p>
<p>Back in the old days, when many data sets were still small,
<a href="http://en.wikipedia.org/wiki/Stem-and-leaf_display">stem-and-leaf
plots</a> were a common
method of representing quantitative data. John Tukey’s <a href="http://www.amazon.com/Exploratory-Data-Analysis-John-Tukey/dp/0201076160/">Exploratory Data
Analysis</a>,
which popularized stem-and-leaf plots, has a nice example on the cover.</p>
<figure>
<a href="https://jeroenjanssens.com/img/eda.jpg">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.6)" srcset="https://jeroenjanssens.com/img/Pr9E-VLpwv-201.webp 201w, https://jeroenjanssens.com/img/Pr9E-VLpwv-302.webp 302w, https://jeroenjanssens.com/img/Pr9E-VLpwv-403.webp 403w, https://jeroenjanssens.com/img/Pr9E-VLpwv-604.webp 604w, https://jeroenjanssens.com/img/Pr9E-VLpwv-806.webp 806w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.6)" srcset="https://jeroenjanssens.com/img/Pr9E-VLpwv-201.webp 201w, https://jeroenjanssens.com/img/Pr9E-VLpwv-302.webp 302w, https://jeroenjanssens.com/img/Pr9E-VLpwv-403.webp 403w, https://jeroenjanssens.com/img/Pr9E-VLpwv-604.webp 604w, https://jeroenjanssens.com/img/Pr9E-VLpwv-806.webp 806w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.6)" srcset="https://jeroenjanssens.com/img/Pr9E-VLpwv-201.jpeg 201w, https://jeroenjanssens.com/img/Pr9E-VLpwv-302.jpeg 302w, https://jeroenjanssens.com/img/Pr9E-VLpwv-403.jpeg 403w, https://jeroenjanssens.com/img/Pr9E-VLpwv-604.jpeg 604w, https://jeroenjanssens.com/img/Pr9E-VLpwv-806.jpeg 806w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.6)" srcset="https://jeroenjanssens.com/img/Pr9E-VLpwv-201.jpeg 201w, https://jeroenjanssens.com/img/Pr9E-VLpwv-302.jpeg 302w, https://jeroenjanssens.com/img/Pr9E-VLpwv-403.jpeg 403w, https://jeroenjanssens.com/img/Pr9E-VLpwv-604.jpeg 604w, https://jeroenjanssens.com/img/Pr9E-VLpwv-806.jpeg 806w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 60%;" src="https://jeroenjanssens.com/img/Pr9E-VLpwv-201.jpeg" alt="EDA’s cover contains a stem-and-leaf plot. Click to enlarge." loading="lazy" />
</picture></a>
<figcaption>EDA’s cover contains a stem-and-leaf plot. Click to enlarge.</figcaption>
</figure>
<h2>Playground</h2>
<p>In 2013 I needed an excuse to try out <a href="https://d3js.org/">D3</a> and
<a href="https://coffeescript.org/">CoffeeScript</a>, so I made an interactive
playground for stem-and-leaf plots. The original CoffeeScript code and
transpiled JavaScript are available <a href="https://gist.github.com/jeroenjanssens/6395842">on
GitHub</a>.</p>
<p>The plot updates as you change the values in the textarea. Try adding
negative values, very large values, or even fractions. If you hover over
the leaves you see the original values. The example data comes from
EDA’s cover.</p>
<p>
<textarea class="font-sans border-green-800 border-2" rows="7">17, 32, 47, 53, 60, 61, 64, 67, 70, 70, 71, 72, 73, 73, 74, 76, 77, 79, 81, 82, 83, 83, 83, 83, 84, 85, 86, 87, 87, 88, 89, 90, 91, 91, 92, 94, 94, 95, 96, 97, 98, 98, 99, 99, 99, 99, 99, 100, 100, 100, 101, 101, 102, 103, 103, 103, 103, 104, 106, 106, 106, 106, 107, 107, 107, 107, 108, 109, 109, 110, 111, 111, 111, 112, 112, 113, 114, 114, 114, 115, 116, 117, 117, 119, 120, 120, 120, 120, 121, 121, 122, 122, 122, 123, 124, 124, 125, 125, 126, 126, 127, 128, 129, 130, 131, 131, 131, 131, 132, 132, 132, 132, 133, 133, 134, 134, 134, 135, 135, 135, 136, 136, 136, 137, 138, 139, 140, 140, 142, 143, 144, 145, 145, 145, 145, 145, 147, 149, 152, 155, 157, 159</textarea>
</p>
<div class="playground"></div>
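<p>As a rough sketch of the idea behind the playground, the plain JavaScript below splits each value into a stem and a leaf and prints a text-only plot. It does not use D3, the names <code>stemAndLeaf</code> and <code>stemPlot</code> are made up for illustration, and unlike the playground it skips stems that have no leaves.</p>

```javascript
// Hypothetical helper: split a number into a stem (all digits but the last)
// and a leaf (the last digit). Values are rounded first, and the sign is
// kept on the stem so that "-0" and "0" remain separate stems.
function stemAndLeaf(value, base = 10) {
  const sign = value < 0 ? "-" : "";
  const rounded = Math.abs(Math.round(value));
  return { stem: sign + Math.floor(rounded / base), leaf: rounded % base };
}

// Hypothetical helper: group the sorted values by stem and return one
// line of text per stem. Stems without any leaves are not shown.
function stemPlot(values, base = 10) {
  const sorted = [...values].sort((a, b) => a - b);
  const rows = new Map();
  for (const value of sorted) {
    const { stem, leaf } = stemAndLeaf(value, base);
    if (!rows.has(stem)) rows.set(stem, []);
    rows.get(stem).push(leaf);
  }
  return [...rows].map(
    ([stem, leaves]) => `${stem.padStart(3)} | ${leaves.join("")}`
  );
}

// The first few values from the example data.
console.log(stemPlot([17, 32, 47, 53, 60, 61, 64, 67]).join("\n"));
//   1 | 7
//   3 | 2
//   4 | 7
//   5 | 3
//   6 | 0147
```

<p>Rounding before splitting is why fractions still end up as single-digit leaves, and keeping the original value next to each leaf is what would make a hover tooltip possible.</p>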
<script src="https://d3js.org/d3.v3.min.js"></script>
<script>
// Generated by CoffeeScript 1.6.1
(function() {
var div, plot, stem_and_leaf, stems_and_leaves, text, update;
stem_and_leaf = function(value, base) {
var leaf, sign, stem;
sign = value < 0 ? "-" : "";
value = Math.abs(Math.round(value));
stem = sign + Math.floor(value / base);
leaf = value % base;
return [stem, leaf];
};
stems_and_leaves = function(data, base) {
var end, leaf, leaves, max, min, start, stem, stemdata, stems, value, _i, _j, _len, _ref;
if (base == null) {
base = 10;
}
data.sort(function(a, b) {
return a - b;
});
min = data[0];
max = data[data.length - 1];
start = +stem_and_leaf(min, base)[0];
end = +stem_and_leaf(max, base)[0];
stems = {};
for (stem = _i = start; start <= end ? _i <= end : _i >= end; stem = start <= end ? ++_i : --_i) {
if (stem === 0 && min < 0) {
stems["-0"] = [];
}
stems["" + stem] = [];
}
for (_j = 0, _len = data.length; _j < _len; _j++) {
value = data[_j];
_ref = stem_and_leaf(value, base), stem = _ref[0], leaf = _ref[1];
stems[stem].push({
'leaf': leaf,
'value': value
});
}
stemdata = [];
for (stem in stems) {
leaves = stems[stem];
stemdata.push({
'stem': stem,
'leaves': leaves
});
}
return stemdata;
};
update = function() {
var data, stem_enter, stemdata, stems, x;
data = text.node().value.split(",").map(function(x) {
return parseFloat(x);
});
data = (function() {
var _i, _len, _results;
_results = [];
for (_i = 0, _len = data.length; _i < _len; _i++) {
x = data[_i];
if (!isNaN(x)) {
_results.push(x);
}
}
return _results;
})();
console.log(data);
stemdata = stems_and_leaves(data);
stems = plot.selectAll("li.stem").data(stemdata, function(d) {
return d.stem + ((function() {
var _i, _len, _ref, _results;
_ref = d.leaves;
_results = [];
for (_i = 0, _len = _ref.length; _i < _len; _i++) {
x = _ref[_i];
_results.push(x.value);
}
return _results;
})()).join(',');
});
stem_enter = stems.enter().append("li").attr("class", "stem");
stem_enter.append("label");
stem_enter.append("ul").attr("class", "leaves").selectAll("li.leaf").data(function(d) {
return d.leaves;
}).enter().append("li").attr("class", "leaf").attr('data-value', function(d) {
return d.value;
}).text(function(d) {
return d.leaf;
});
stems.sort(function(a, b) {
var y;
x = a.stem;
y = b.stem;
if ((x === "-0") && (y === "0")) {
return -1;
}
if ((y === "-0") && (x === "0")) {
return 1;
}
if (+x < +y) {
return -1;
} else {
return 1;
}
});
stems.select("label").html(function(d) {
return d.stem + "‍";
});
return stems.exit().remove();
};
div = d3.select("div.playground");
plot = div.append("ul").attr("class", "stems font-sans");
text = d3.select("textarea");
text.on("keyup", function() {
return update();
});
update();
}).call(this);
</script>
<h1>Archive of My Online Course Embrace the Command Line</h1>
<p>2023-03-23, <a href="https://jeroenjanssens.com/embrace/">https://jeroenjanssens.com/embrace/</a></p>
<p><em>Embrace the Command Line</em> was a three-week cohort-based course, created
to help developers and researchers get started with the command
line. It was roughly based on my book <em>Data Science at the Command
Line</em>.</p>
<p>I had the pleasure of running the course twice in 2022, with students
from all over the world. During the course, I really got to know the
students and their situation, allowing me to better help them. Both
times were rewarding yet time-consuming.</p>
<p>It was scheduled for a third time in April 2023, but because I joined
<a href="https://www.xomnia.com/">Xomnia</a> as a full-time employee at the
beginning of the year, I didn’t have enough time left and needed to
cancel it, unfortunately. I’m not sure whether I’ll be able to run
another cohort. If you and your colleagues are interested, we could talk
about organizing an in-company training.</p>
<p>The course was hosted on <a href="https://maven.com/">Maven</a>, which provides
many great tools to manage the course material, keep in touch with
(potential) students, and organize online sessions. I participated in
their rather excellent bootcamp, which really helped me to think about
the course structure and write the <a href="https://maven.com/data-science-workshops/embrace-the-command-line">landing
page</a>.</p>
<p>Should the landing page go down, below is most of the copy together with
a few screenshots for archival purposes. The video shown in the first
screenshot, <em>How Researchers and Developers Can Benefit from the Command
Line</em>, can be watched on YouTube:
<a href="https://www.youtube.com/watch?v=0XI8FPVnNzY">https://www.youtube.com/watch?v=0XI8FPVnNzY</a>.</p>
<p>– Jeroen</p>
<figure>
<a href="https://jeroenjanssens.com/img/embrace/embrace-screenshot-1.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 1)" srcset="https://jeroenjanssens.com/img/4E9cKAiaRo-336.webp 336w, https://jeroenjanssens.com/img/4E9cKAiaRo-504.webp 504w, https://jeroenjanssens.com/img/4E9cKAiaRo-672.webp 672w, https://jeroenjanssens.com/img/4E9cKAiaRo-1008.webp 1008w, https://jeroenjanssens.com/img/4E9cKAiaRo-1344.webp 1344w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 1)" srcset="https://jeroenjanssens.com/img/4E9cKAiaRo-336.webp 336w, https://jeroenjanssens.com/img/4E9cKAiaRo-504.webp 504w, https://jeroenjanssens.com/img/4E9cKAiaRo-672.webp 672w, https://jeroenjanssens.com/img/4E9cKAiaRo-1008.webp 1008w, https://jeroenjanssens.com/img/4E9cKAiaRo-1344.webp 1344w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 1)" srcset="https://jeroenjanssens.com/img/4E9cKAiaRo-336.jpeg 336w, https://jeroenjanssens.com/img/4E9cKAiaRo-504.jpeg 504w, https://jeroenjanssens.com/img/4E9cKAiaRo-672.jpeg 672w, https://jeroenjanssens.com/img/4E9cKAiaRo-1008.jpeg 1008w, https://jeroenjanssens.com/img/4E9cKAiaRo-1344.jpeg 1344w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 1)" srcset="https://jeroenjanssens.com/img/4E9cKAiaRo-336.jpeg 336w, https://jeroenjanssens.com/img/4E9cKAiaRo-504.jpeg 504w, https://jeroenjanssens.com/img/4E9cKAiaRo-672.jpeg 672w, https://jeroenjanssens.com/img/4E9cKAiaRo-1008.jpeg 1008w, https://jeroenjanssens.com/img/4E9cKAiaRo-1344.jpeg 1344w" media="(min-width: 0px)" />
<img class="mx-auto img-border" style="width: 100%;" src="https://jeroenjanssens.com/img/4E9cKAiaRo-336.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
<blockquote>
<p>In this hands-on course you’ll learn to use the command line to
automate tedious tasks, work with data quickly, and create your own
toolbox.</p>
</blockquote>
<blockquote>
<p>It’s amazing how fast so much data work can be performed at the
command line before ever pulling the data into R, Python, or a
database. Knowing it well makes it easy to take back control of your
computer and to translate questions you have of your data to real-time
insights.</p>
</blockquote>
<figure>
<a href="https://jeroenjanssens.com/img/embrace/embrace-screenshot-2.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 1)" srcset="https://jeroenjanssens.com/img/_wsZkyGEXs-336.webp 336w, https://jeroenjanssens.com/img/_wsZkyGEXs-504.webp 504w, https://jeroenjanssens.com/img/_wsZkyGEXs-672.webp 672w, https://jeroenjanssens.com/img/_wsZkyGEXs-1008.webp 1008w, https://jeroenjanssens.com/img/_wsZkyGEXs-1344.webp 1344w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 1)" srcset="https://jeroenjanssens.com/img/_wsZkyGEXs-336.webp 336w, https://jeroenjanssens.com/img/_wsZkyGEXs-504.webp 504w, https://jeroenjanssens.com/img/_wsZkyGEXs-672.webp 672w, https://jeroenjanssens.com/img/_wsZkyGEXs-1008.webp 1008w, https://jeroenjanssens.com/img/_wsZkyGEXs-1344.webp 1344w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 1)" srcset="https://jeroenjanssens.com/img/_wsZkyGEXs-336.jpeg 336w, https://jeroenjanssens.com/img/_wsZkyGEXs-504.jpeg 504w, https://jeroenjanssens.com/img/_wsZkyGEXs-672.jpeg 672w, https://jeroenjanssens.com/img/_wsZkyGEXs-1008.jpeg 1008w, https://jeroenjanssens.com/img/_wsZkyGEXs-1344.jpeg 1344w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 1)" srcset="https://jeroenjanssens.com/img/_wsZkyGEXs-336.jpeg 336w, https://jeroenjanssens.com/img/_wsZkyGEXs-504.jpeg 504w, https://jeroenjanssens.com/img/_wsZkyGEXs-672.jpeg 672w, https://jeroenjanssens.com/img/_wsZkyGEXs-1008.jpeg 1008w, https://jeroenjanssens.com/img/_wsZkyGEXs-1344.jpeg 1344w" media="(min-width: 0px)" />
<img class="mx-auto img-border" style="width: 100%;" src="https://jeroenjanssens.com/img/_wsZkyGEXs-336.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
<blockquote>
<p>This course is for you if… You’re a developer looking to streamline
your workflow and take back control over your computer. You’re a
researcher looking to become more efficient and productive at working
with data. You feel intimidated by the power of the command line but
understand the benefits it brings.</p>
</blockquote>
<blockquote>
<p>Key outcomes</p>
<p>A new way of working: Run and string together small but powerful tools
to accomplish and automate tedious tasks. Integrate seamlessly with
your existing workflow.</p>
<p>Be more efficient: Parallelize and distribute your data-intensive or
compute-heavy tasks to multiple cores and machines.</p>
<p>Data science skills: Easily obtain, inspect, transform, and visualize
data coming from various sources (including APIs, server logs,
spreadsheets, and databases).</p>
<p>Build your own toolbox: Turn ad-hoc commands into reusable
command-line tools and even convert your existing code (including
Python, R, and JavaScript) to create your own tools.</p>
<p>Hands-on experience: We’re actually going to get our hands dirty in
this course. Through workshops and exercises you’ll quickly become
comfortable working at the command line.</p>
<p>Be part of a great community: You’re not alone in this. You’ll
surround yourself with like-minded people who want to grow alongside
you.</p>
<p>Solid foundation: It’s impossible to cover everything the command line
has to offer. Instead, I’ll make sure you know how to keep on learning
after the course.</p>
</blockquote>
<figure>
<a href="https://jeroenjanssens.com/img/embrace/embrace-screenshot-3.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 1)" srcset="https://jeroenjanssens.com/img/4wsnajnIpk-336.webp 336w, https://jeroenjanssens.com/img/4wsnajnIpk-504.webp 504w, https://jeroenjanssens.com/img/4wsnajnIpk-672.webp 672w, https://jeroenjanssens.com/img/4wsnajnIpk-1008.webp 1008w, https://jeroenjanssens.com/img/4wsnajnIpk-1344.webp 1344w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 1)" srcset="https://jeroenjanssens.com/img/4wsnajnIpk-336.webp 336w, https://jeroenjanssens.com/img/4wsnajnIpk-504.webp 504w, https://jeroenjanssens.com/img/4wsnajnIpk-672.webp 672w, https://jeroenjanssens.com/img/4wsnajnIpk-1008.webp 1008w, https://jeroenjanssens.com/img/4wsnajnIpk-1344.webp 1344w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 1)" srcset="https://jeroenjanssens.com/img/4wsnajnIpk-336.jpeg 336w, https://jeroenjanssens.com/img/4wsnajnIpk-504.jpeg 504w, https://jeroenjanssens.com/img/4wsnajnIpk-672.jpeg 672w, https://jeroenjanssens.com/img/4wsnajnIpk-1008.jpeg 1008w, https://jeroenjanssens.com/img/4wsnajnIpk-1344.jpeg 1344w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 1)" srcset="https://jeroenjanssens.com/img/4wsnajnIpk-336.jpeg 336w, https://jeroenjanssens.com/img/4wsnajnIpk-504.jpeg 504w, https://jeroenjanssens.com/img/4wsnajnIpk-672.jpeg 672w, https://jeroenjanssens.com/img/4wsnajnIpk-1008.jpeg 1008w, https://jeroenjanssens.com/img/4wsnajnIpk-1344.jpeg 1344w" media="(min-width: 0px)" />
<img class="mx-auto img-border" style="width: 100%;" src="https://jeroenjanssens.com/img/4wsnajnIpk-336.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
<blockquote>
<p>The command line is a powerful piece of technology available on
Windows, macOS, and Linux.</p>
</blockquote>
<blockquote>
<p>The Unix or Linux command line, sometimes referred to as the shell or
the terminal, is as powerful as it is intimidating. By typing
commands, you can rename thousands of files, process large amounts of
data, and work on remote machines with ease. But make one mistake and
everything will explode!</p>
<p>At least, that’s what many think when they first encounter this stark
and unforgiving environment. I can’t blame them, the command line just
doesn’t look very inviting. Still, the fact remains that the command
line successfully enables thousands of developers and researchers to
be more efficient and productive at work. All they had to do is
embrace it.</p>
<p>In this three-week cohort-based course, I’ll help you embrace the
command line so you can also become more efficient and productive.</p>
</blockquote>
<figure>
<a href="https://jeroenjanssens.com/img/embrace/embrace-screenshot-4.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 1)" srcset="https://jeroenjanssens.com/img/0XkIUrJ7SF-336.webp 336w, https://jeroenjanssens.com/img/0XkIUrJ7SF-504.webp 504w, https://jeroenjanssens.com/img/0XkIUrJ7SF-672.webp 672w, https://jeroenjanssens.com/img/0XkIUrJ7SF-1008.webp 1008w, https://jeroenjanssens.com/img/0XkIUrJ7SF-1344.webp 1344w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 1)" srcset="https://jeroenjanssens.com/img/0XkIUrJ7SF-336.webp 336w, https://jeroenjanssens.com/img/0XkIUrJ7SF-504.webp 504w, https://jeroenjanssens.com/img/0XkIUrJ7SF-672.webp 672w, https://jeroenjanssens.com/img/0XkIUrJ7SF-1008.webp 1008w, https://jeroenjanssens.com/img/0XkIUrJ7SF-1344.webp 1344w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 1)" srcset="https://jeroenjanssens.com/img/0XkIUrJ7SF-336.jpeg 336w, https://jeroenjanssens.com/img/0XkIUrJ7SF-504.jpeg 504w, https://jeroenjanssens.com/img/0XkIUrJ7SF-672.jpeg 672w, https://jeroenjanssens.com/img/0XkIUrJ7SF-1008.jpeg 1008w, https://jeroenjanssens.com/img/0XkIUrJ7SF-1344.jpeg 1344w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 1)" srcset="https://jeroenjanssens.com/img/0XkIUrJ7SF-336.jpeg 336w, https://jeroenjanssens.com/img/0XkIUrJ7SF-504.jpeg 504w, https://jeroenjanssens.com/img/0XkIUrJ7SF-672.jpeg 672w, https://jeroenjanssens.com/img/0XkIUrJ7SF-1008.jpeg 1008w, https://jeroenjanssens.com/img/0XkIUrJ7SF-1344.jpeg 1344w" media="(min-width: 0px)" />
<img class="mx-auto img-border" style="width: 100%;" src="https://jeroenjanssens.com/img/0XkIUrJ7SF-336.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
<blockquote>
<p>Great workshop! Very well done and very useful information delivered
in an excellent and interactive manner. Jeroen anticipated very well
on the different knowledge levels within the group. I would highly
recommend this course to anyone that is interested in kickstarting their
command-line experiences.</p>
<p>As a seasoned UNIX command line adept, I didn’t expect to learn much.
I was wrong! Over the years, many new tools have become available that
I didn’t know about, and that can be combined with traditional tools
in new ways. I have been able to simplify and improve the efficiency
of many of the scripts I use on a daily basis.</p>
<p>Besides demonstrating a good knowledge and experience in command-line
tools for data science, Jeroen had very good training skills, clear
communication, and managed to adapt the level of the training to the
level of the audience, which is not always easy!</p>
<p>I found Jeroen to be a wonderfully welcoming, knowledgeable, and
patient instructor. He covered content at a very nice pace, and made
the workshop feel like a welcoming space where any question was fair
game. Thanks to our small class, I really appreciated how he took
interest in what each participant wanted to get out of the class.</p>
<p>Jeroen is a great coach. Because he is able to tailor the course to
the business challenges of the participants, the learning curve goes
straight up! Jeroen quickly switches to the knowledge level of the
participants, so that everyone is guided in a tailored manner.</p>
<p>This training was very enlightening. I discovered that most of our
tasks could be achieved using simple tools, without the need for
heavyweight & complex software. This training not only got me data
science skills with simple tools, but I also felt very confident as a
command-line power user.</p>
</blockquote>
<figure>
<a href="https://jeroenjanssens.com/img/embrace/embrace-screenshot-5.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 1)" srcset="https://jeroenjanssens.com/img/SmvFKNVSOV-336.webp 336w, https://jeroenjanssens.com/img/SmvFKNVSOV-504.webp 504w, https://jeroenjanssens.com/img/SmvFKNVSOV-672.webp 672w, https://jeroenjanssens.com/img/SmvFKNVSOV-1008.webp 1008w, https://jeroenjanssens.com/img/SmvFKNVSOV-1344.webp 1344w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 1)" srcset="https://jeroenjanssens.com/img/SmvFKNVSOV-336.webp 336w, https://jeroenjanssens.com/img/SmvFKNVSOV-504.webp 504w, https://jeroenjanssens.com/img/SmvFKNVSOV-672.webp 672w, https://jeroenjanssens.com/img/SmvFKNVSOV-1008.webp 1008w, https://jeroenjanssens.com/img/SmvFKNVSOV-1344.webp 1344w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 1)" srcset="https://jeroenjanssens.com/img/SmvFKNVSOV-336.jpeg 336w, https://jeroenjanssens.com/img/SmvFKNVSOV-504.jpeg 504w, https://jeroenjanssens.com/img/SmvFKNVSOV-672.jpeg 672w, https://jeroenjanssens.com/img/SmvFKNVSOV-1008.jpeg 1008w, https://jeroenjanssens.com/img/SmvFKNVSOV-1344.jpeg 1344w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 1)" srcset="https://jeroenjanssens.com/img/SmvFKNVSOV-336.jpeg 336w, https://jeroenjanssens.com/img/SmvFKNVSOV-504.jpeg 504w, https://jeroenjanssens.com/img/SmvFKNVSOV-672.jpeg 672w, https://jeroenjanssens.com/img/SmvFKNVSOV-1008.jpeg 1008w, https://jeroenjanssens.com/img/SmvFKNVSOV-1344.jpeg 1344w" media="(min-width: 0px)" />
<img class="mx-auto img-border" style="width: 100%;" src="https://jeroenjanssens.com/img/SmvFKNVSOV-336.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
<blockquote>
<p>Meet your instructor Jeroen Janssens, PhD</p>
<p>Hi there, I’m Jeroen. I’m a data science consultant and certified
instructor. My expertise lies in visualizing data, implementing
machine learning models, and building software using Python, R,
JavaScript, and Bash.</p>
<p>In 2014 I wrote the book Data Science at the Command Line (O’Reilly
Media). Since then I’ve helped hundreds of developers and researchers
embrace the command line. Recently I finished the second edition of
the book.</p>
<p>I run Data Science Workshops, a training and coaching firm that helps
organizations such as Amazon, eHealth Africa, Schiphol Airport, The
New York Times, and T-Mobile to upgrade their skills and knowledge. I
hold a PhD in machine learning from Tilburg University and an MSc in
artificial intelligence from Maastricht University.</p>
</blockquote>
<figure>
<a href="https://jeroenjanssens.com/img/embrace/embrace-screenshot-6.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 1)" srcset="https://jeroenjanssens.com/img/K037gWDtbQ-336.webp 336w, https://jeroenjanssens.com/img/K037gWDtbQ-504.webp 504w, https://jeroenjanssens.com/img/K037gWDtbQ-672.webp 672w, https://jeroenjanssens.com/img/K037gWDtbQ-1008.webp 1008w, https://jeroenjanssens.com/img/K037gWDtbQ-1344.webp 1344w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 1)" srcset="https://jeroenjanssens.com/img/K037gWDtbQ-336.webp 336w, https://jeroenjanssens.com/img/K037gWDtbQ-504.webp 504w, https://jeroenjanssens.com/img/K037gWDtbQ-672.webp 672w, https://jeroenjanssens.com/img/K037gWDtbQ-1008.webp 1008w, https://jeroenjanssens.com/img/K037gWDtbQ-1344.webp 1344w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 1)" srcset="https://jeroenjanssens.com/img/K037gWDtbQ-336.jpeg 336w, https://jeroenjanssens.com/img/K037gWDtbQ-504.jpeg 504w, https://jeroenjanssens.com/img/K037gWDtbQ-672.jpeg 672w, https://jeroenjanssens.com/img/K037gWDtbQ-1008.jpeg 1008w, https://jeroenjanssens.com/img/K037gWDtbQ-1344.jpeg 1344w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 1)" srcset="https://jeroenjanssens.com/img/K037gWDtbQ-336.jpeg 336w, https://jeroenjanssens.com/img/K037gWDtbQ-504.jpeg 504w, https://jeroenjanssens.com/img/K037gWDtbQ-672.jpeg 672w, https://jeroenjanssens.com/img/K037gWDtbQ-1008.jpeg 1008w, https://jeroenjanssens.com/img/K037gWDtbQ-1344.jpeg 1344w" media="(min-width: 0px)" />
<img class="mx-auto img-border" style="width: 100%;" src="https://jeroenjanssens.com/img/K037gWDtbQ-336.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
<blockquote>
<p>Designed for both researchers and developers. While my book is aimed
at data scientists only, this course will give you command-line
knowledge and skills that are useful for doing research and developing
software.</p>
<p>Absolutely hands-on. It’s one thing to read a book. To try it for
yourself is a different story. During the live sessions, you’ll get
hands-on experience in a safe environment, making you well prepared.</p>
<p>Taught by an experienced, certified instructor. I’ve trained and
coached hundreds of students in the past eight years. My approach is
practical and casual, but also sustainable. I’ll be able to give you
the personal attention you need.</p>
<p>More fun and effective. Because you’ll be embracing the command line
with other researchers and developers. You’ll be part of a welcoming
community of like-minded people.</p>
</blockquote>
<figure>
<a href="https://jeroenjanssens.com/img/embrace/embrace-screenshot-7.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 1)" srcset="https://jeroenjanssens.com/img/OGH3AcNCrd-336.webp 336w, https://jeroenjanssens.com/img/OGH3AcNCrd-504.webp 504w, https://jeroenjanssens.com/img/OGH3AcNCrd-672.webp 672w, https://jeroenjanssens.com/img/OGH3AcNCrd-1008.webp 1008w, https://jeroenjanssens.com/img/OGH3AcNCrd-1344.webp 1344w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 1)" srcset="https://jeroenjanssens.com/img/OGH3AcNCrd-336.webp 336w, https://jeroenjanssens.com/img/OGH3AcNCrd-504.webp 504w, https://jeroenjanssens.com/img/OGH3AcNCrd-672.webp 672w, https://jeroenjanssens.com/img/OGH3AcNCrd-1008.webp 1008w, https://jeroenjanssens.com/img/OGH3AcNCrd-1344.webp 1344w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 1)" srcset="https://jeroenjanssens.com/img/OGH3AcNCrd-336.jpeg 336w, https://jeroenjanssens.com/img/OGH3AcNCrd-504.jpeg 504w, https://jeroenjanssens.com/img/OGH3AcNCrd-672.jpeg 672w, https://jeroenjanssens.com/img/OGH3AcNCrd-1008.jpeg 1008w, https://jeroenjanssens.com/img/OGH3AcNCrd-1344.jpeg 1344w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 1)" srcset="https://jeroenjanssens.com/img/OGH3AcNCrd-336.jpeg 336w, https://jeroenjanssens.com/img/OGH3AcNCrd-504.jpeg 504w, https://jeroenjanssens.com/img/OGH3AcNCrd-672.jpeg 672w, https://jeroenjanssens.com/img/OGH3AcNCrd-1008.jpeg 1008w, https://jeroenjanssens.com/img/OGH3AcNCrd-1344.jpeg 1344w" media="(min-width: 0px)" />
<img class="mx-auto img-border" style="width: 100%;" src="https://jeroenjanssens.com/img/OGH3AcNCrd-336.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
<blockquote>
<p>Course syllabus</p>
<p>01 Essential concepts of the command line: Run command-line tools;
Combine command-line tools; Redirect input and output; Work with files
and directories; Get help.</p>
<p>02 Making the Command Line Less Scary: Customizing your prompt and
environment; Creating aliases for rm and mv; Setting up a “recycle
bin”.</p>
<p>03 Obtaining Data: Download files and data; Import spreadsheets; Query
databases; Call RESTful APIs.</p>
<p>04 Parallel processing: Introducing GNU parallel; Looping over files
and lines; Logging and output; Distributed processing.</p>
<p>05 Working with Text Data: Search through text; Extract values; Clean
up messy data.</p>
<p>06 Working with JSON Data: Introducing jq; Reformat; Extract values;
Convert to CSV.</p>
<p>07 Working with CSV Data: Introducing xsv; Select rows and columns;
Run SQL queries on CSV.</p>
<p>08 Editing Files: The basics (cat and echo); Introducing nano; What
about vim and emacs?</p>
<p>09 Creating Command-line tools: From Bash; From Python; From R.</p>
<p>10 Exploring Data: Inspect data quickly; Create visualizations;
Viewing images on the command line.</p>
<p>11 Automating Things: Set up build pipeline; Deploy software; Make
analyses reproducible.</p>
<p>12 Version Control: Introducing Git and GitHub; Staging and
committing; Branching and merging; Pulling and pushing.</p>
</blockquote>
<figure>
<a href="https://jeroenjanssens.com/img/embrace/embrace-screenshot-8.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 1)" srcset="https://jeroenjanssens.com/img/2G36w8psbC-336.webp 336w, https://jeroenjanssens.com/img/2G36w8psbC-504.webp 504w, https://jeroenjanssens.com/img/2G36w8psbC-672.webp 672w, https://jeroenjanssens.com/img/2G36w8psbC-1008.webp 1008w, https://jeroenjanssens.com/img/2G36w8psbC-1344.webp 1344w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 1)" srcset="https://jeroenjanssens.com/img/2G36w8psbC-336.webp 336w, https://jeroenjanssens.com/img/2G36w8psbC-504.webp 504w, https://jeroenjanssens.com/img/2G36w8psbC-672.webp 672w, https://jeroenjanssens.com/img/2G36w8psbC-1008.webp 1008w, https://jeroenjanssens.com/img/2G36w8psbC-1344.webp 1344w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 1)" srcset="https://jeroenjanssens.com/img/2G36w8psbC-336.jpeg 336w, https://jeroenjanssens.com/img/2G36w8psbC-504.jpeg 504w, https://jeroenjanssens.com/img/2G36w8psbC-672.jpeg 672w, https://jeroenjanssens.com/img/2G36w8psbC-1008.jpeg 1008w, https://jeroenjanssens.com/img/2G36w8psbC-1344.jpeg 1344w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 1)" srcset="https://jeroenjanssens.com/img/2G36w8psbC-336.jpeg 336w, https://jeroenjanssens.com/img/2G36w8psbC-504.jpeg 504w, https://jeroenjanssens.com/img/2G36w8psbC-672.jpeg 672w, https://jeroenjanssens.com/img/2G36w8psbC-1008.jpeg 1008w, https://jeroenjanssens.com/img/2G36w8psbC-1344.jpeg 1344w" media="(min-width: 0px)" />
<img class="mx-auto img-border" style="width: 100%;" src="https://jeroenjanssens.com/img/2G36w8psbC-336.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
<blockquote>
<p>Frequently Asked Questions</p>
<p>Do I need to have experience with the command line? You don’t need to
have any experience with the command line, Unix/Linux, or even
programming.</p>
<p>We’ll start at the beginning and steadily go into more advanced
topics. The only prerequisites are a computer and the willingness to
learn!</p>
<p>Will I be able to do this next to my regular job? Absolutely. I have
designed this course such that you can do this next to a full-time
job. In fact, nearly all of the people who have already signed up have
a full-time job.</p>
<p>When will live sessions be held? 6 two-hour live sessions will take
place over a period of three weeks on each Monday and Thursday at
11:00 PDT / 14:00 EDT / 18:00 UTC / 20:00 CEST. You’ll receive
calendar invites for all sessions once you’ve signed up.</p>
<p>Do I have to attend all of the live sessions? It’s okay if you have to
miss a session or two. Every live session is recorded and made
available for you to replay, at your convenience. With that said, I
strongly recommend you make time for them so you can ask questions
directly to me and join productive breakout discussions.</p>
<p>Is there a community to interact with others? Yes! We’ve created a
private community space for students of the Embrace the Command Line
course. There you can share your progress, get feedback, ask for help,
and more.</p>
<p>Is this course also available as a corporate training? If you have a
couple of colleagues who are interested in this topic, then an
in-company training might be worthwhile. Visit my company Data Science
Workshops for more information.</p>
</blockquote>
Scripting iTerm Key Mappings2023-01-19T00:00:00Zhttps://jeroenjanssens.com/itermkeymap/<p>To improve my iTerm+tmux experience, I’ve set up a whole bunch of key
mappings. Rather than defining these manually, I wrote <a href="https://gist.github.com/jeroenjanssens/4050c1a328db89d62c6b9459b4544f68">a Python
script</a>
to generate the corresponding JSON programmatically.</p>
<figure>
<a href="https://jeroenjanssens.com/img/social/an-expressive-oil-painting-of-a-desk-with-a-couple-of-keyboards-and-a-monitor-showing-the-unix-terminal.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 1)" srcset="https://jeroenjanssens.com/img/ZvQLZSZ7wg-336.webp 336w, https://jeroenjanssens.com/img/ZvQLZSZ7wg-504.webp 504w, https://jeroenjanssens.com/img/ZvQLZSZ7wg-672.webp 672w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 1)" srcset="https://jeroenjanssens.com/img/ZvQLZSZ7wg-336.webp 336w, https://jeroenjanssens.com/img/ZvQLZSZ7wg-504.webp 504w, https://jeroenjanssens.com/img/ZvQLZSZ7wg-672.webp 672w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 1)" srcset="https://jeroenjanssens.com/img/ZvQLZSZ7wg-336.jpeg 336w, https://jeroenjanssens.com/img/ZvQLZSZ7wg-504.jpeg 504w, https://jeroenjanssens.com/img/ZvQLZSZ7wg-672.jpeg 672w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 1)" srcset="https://jeroenjanssens.com/img/ZvQLZSZ7wg-336.jpeg 336w, https://jeroenjanssens.com/img/ZvQLZSZ7wg-504.jpeg 504w, https://jeroenjanssens.com/img/ZvQLZSZ7wg-672.jpeg 672w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 100%;" src="https://jeroenjanssens.com/img/ZvQLZSZ7wg-336.jpeg" alt="An expressive oil painting of a desk with a couple of keyboards and a monitor showing the unix terminal." loading="lazy" />
</picture></a>
<figcaption>An expressive oil painting of a desk with a couple of keyboards and a monitor showing the unix terminal.</figcaption>
</figure>
<h2>Fixing prefixing</h2>
<p>I’ve recently rediscovered my love for
<a href="https://en.wikipedia.org/wiki/Tmux">tmux</a>. Having multiple terminal
panes, windows, and even sessions helps me keep my command-line
shenanigans organized. By default, you interact with tmux by pressing a
prefix key (<kbd>Ctrl-B</kbd>) followed by another key. For instance,
<kbd>Ctrl-B</kbd> <kbd>C</kbd> creates a new window and
<kbd>Ctrl-B</kbd> <kbd>"</kbd> splits the current pane into two. That’s
incredibly powerful, but I don’t particularly enjoy pressing prefixes
all the time.</p>
<p>Fortunately, iTerm, which is the terminal emulator that I use on macOS,
has the ability to define key mappings. This allows you to simulate,
say, <kbd>Ctrl-B</kbd> <kbd>J</kbd> by pressing <kbd>Cmd-J</kbd>.<sup class="footnote-ref"><a href="https://jeroenjanssens.com/itermkeymap/#fn1" id="fnref1">[1]</a></sup>
Defining these key mappings manually, however, turns out to be a tedious
and error-prone task.</p>
<h2>Using Python to generate key mappings</h2>
<p>Like any <em>real</em> developer, I wrote a script to automate this task.
Here’s a screenshot showing a portion of the 116(!) key mappings the
script generates:</p>
<figure>
<a href="https://jeroenjanssens.com/img/itermkeymap.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 1)" srcset="https://jeroenjanssens.com/img/lt4LNDgMp7-336.webp 336w, https://jeroenjanssens.com/img/lt4LNDgMp7-504.webp 504w, https://jeroenjanssens.com/img/lt4LNDgMp7-672.webp 672w, https://jeroenjanssens.com/img/lt4LNDgMp7-1008.webp 1008w, https://jeroenjanssens.com/img/lt4LNDgMp7-1344.webp 1344w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 1)" srcset="https://jeroenjanssens.com/img/lt4LNDgMp7-336.webp 336w, https://jeroenjanssens.com/img/lt4LNDgMp7-504.webp 504w, https://jeroenjanssens.com/img/lt4LNDgMp7-672.webp 672w, https://jeroenjanssens.com/img/lt4LNDgMp7-1008.webp 1008w, https://jeroenjanssens.com/img/lt4LNDgMp7-1344.webp 1344w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 1)" srcset="https://jeroenjanssens.com/img/lt4LNDgMp7-336.jpeg 336w, https://jeroenjanssens.com/img/lt4LNDgMp7-504.jpeg 504w, https://jeroenjanssens.com/img/lt4LNDgMp7-672.jpeg 672w, https://jeroenjanssens.com/img/lt4LNDgMp7-1008.jpeg 1008w, https://jeroenjanssens.com/img/lt4LNDgMp7-1344.jpeg 1344w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 1)" srcset="https://jeroenjanssens.com/img/lt4LNDgMp7-336.jpeg 336w, https://jeroenjanssens.com/img/lt4LNDgMp7-504.jpeg 504w, https://jeroenjanssens.com/img/lt4LNDgMp7-672.jpeg 672w, https://jeroenjanssens.com/img/lt4LNDgMp7-1008.jpeg 1008w, https://jeroenjanssens.com/img/lt4LNDgMp7-1344.jpeg 1344w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 100%;" src="https://jeroenjanssens.com/img/lt4LNDgMp7-336.jpeg" alt="I currently have 116 key mappings defined in iTerm. Thanks Python!" loading="lazy" />
</picture></a>
<figcaption>I currently have 116 key mappings defined in iTerm. Thanks Python!</figcaption>
</figure>
<p>You can find the complete Python script below or you can <a href="https://gist.github.com/jeroenjanssens/4050c1a328db89d62c6b9459b4544f68">download it
from
GitHub</a>.
It generates three types of key mappings:</p>
<ol>
<li>If you press <kbd>Cmd-<em>key</em></kbd>, iTerm sends <kbd>Ctrl-B</kbd>
<kbd><em>key</em></kbd>, for <kbd>A</kbd> to <kbd>Z</kbd> and then some,
except <kbd>X</kbd>, <kbd>C</kbd>, <kbd>V</kbd>, and <kbd>Q</kbd> to
keep the default cut, copy, paste, and quit keyboard shortcuts.</li>
<li>If you press <kbd>Cmd-Ctrl-<em>key</em></kbd>, iTerm sends
<kbd>Ctrl-B</kbd> <kbd>Shift-<em>key</em></kbd> for <kbd>A</kbd> to
<kbd>Z</kbd> and then some.<sup class="footnote-ref"><a href="https://jeroenjanssens.com/itermkeymap/#fn2" id="fnref2">[2]</a></sup></li>
<li>If you press <kbd>Option-<em>key</em></kbd>, iTerm sends <kbd>Ctrl-B</kbd>
<kbd>Ctrl-<em>key</em></kbd>, for <kbd>A</kbd> to <kbd>Z</kbd>.</li>
</ol>
<p>The output of the script is JSON, which looks<sup class="footnote-ref"><a href="https://jeroenjanssens.com/itermkeymap/#fn3" id="fnref3">[3]</a></sup> like this:</p>
<pre class="language-shell-session"><code class="language-shell-session"><span class="token command"><span class="token shell-symbol important">$</span> <span class="token bash language-bash">./itermkeymap.py <span class="token operator">|</span> trim <span class="token number">20</span></span></span><br /><span class="token output">{<br /> "Key Mappings": {<br /> "0x31-0x140000": {<br /> "Version": 1,<br /> "Action": 11,<br /> "Text": "0x02 0x21",<br /> "Label": "C-b !"<br /> },<br /> "0x27-0x140000": {<br /> "Version": 1,<br /> "Action": 11,<br /> "Text": "0x02 0x22",<br /> "Label": "C-b \""<br /> },<br /> "0x33-0x140000": {<br /> "Version": 1,<br /> "Action": 11,<br /> "Text": "0x02 0x23",<br /> "Label": "C-b #"<br /> },<br />… with 680 more lines</span></code></pre>
<p>This JSON structure is discussed in more detail below.</p>
<p>I haven’t yet found a way to import these key mappings programmatically,
so you’ll first have to save the output to a file:</p>
<pre class="language-shell-session"><code class="language-shell-session"><span class="token command"><span class="token shell-symbol important">$</span> <span class="token bash language-bash">./itermkeymap.py <span class="token operator">></span> generated.itermkeymap </span></span></code></pre>
<p>And then import that file by clicking “Presets…” and “Import…” at
the bottom of the Key Mappings section of the iTerm preferences.</p>
<h2>Key mappings are personal</h2>
<p>You’re of course welcome to use this script as it is, but I can imagine
you’ll want to adapt it to your personal preferences. For instance, if
you use <kbd>Ctrl-A</kbd> as your tmux prefix, you’d need to change “b”
and “02” to “a” and “01” in the following two lines:</p>
<pre class="language-python"><code class="language-python">prefix_key <span class="token operator">=</span> <span class="token string">"b"</span> <span class="token comment"># My prefix key in tmux is Ctrl-B</span><br />prefix_hex <span class="token operator">=</span> <span class="token string">"02"</span> <span class="token comment"># The hex code that iTerm sends (corresponds to Ctrl-B)</span></code></pre>
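<p>If you switch to a different prefix, the hex code can be derived rather than looked up: in ASCII, Ctrl plus a letter is the control character 64 positions below that letter’s uppercase form. The following is a minimal sketch under that assumption; <code>ctrl_hex</code> is a hypothetical helper and not part of the script.</p>

```python
import string


def ctrl_hex(letter):
    """Return the two-digit hex code of Ctrl-letter, e.g. "02" for Ctrl-B."""
    if len(letter) != 1 or letter.lower() not in string.ascii_lowercase:
        raise ValueError("expected a single ASCII letter")
    # Control characters sit 64 positions below the uppercase letter in ASCII.
    return format(ord(letter.upper()) - 64, "02x")


prefix_key = "a"                    # hypothetical: Ctrl-A as the tmux prefix
prefix_hex = ctrl_hex(prefix_key)
print(prefix_key, prefix_hex)       # a 01
```

<p>For the default prefix, <code>ctrl_hex("b")</code> returns “02”, which matches the value used in the two lines above.</p>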
<p>Or perhaps you’d like to change which keys and modifiers are used:</p>
<pre class="language-python"><code class="language-python">keys_upper <span class="token operator">=</span> string<span class="token punctuation">.</span>ascii_uppercase <span class="token operator">+</span> <span class="token string">r'!@#$%^&*()_+{}:"|<>?~'</span><br />keys_lower <span class="token operator">=</span> string<span class="token punctuation">.</span>ascii_lowercase <span class="token operator">+</span> <span class="token string">r"1234567890-=[];'\,./`"</span><br />keys_exclude <span class="token operator">=</span> <span class="token string">"xcvq"</span> <span class="token comment"># Don't override cut, copy, paste, and quit shortcuts</span><br /><br />mod <span class="token operator">=</span> <span class="token punctuation">{</span><span class="token string">"cmd"</span><span class="token punctuation">:</span> <span class="token string">"100000"</span><span class="token punctuation">,</span><br /> <span class="token string">"cmd+ctrl"</span><span class="token punctuation">:</span> <span class="token string">"140000"</span><span class="token punctuation">,</span><br /> <span class="token string">"option"</span><span class="token punctuation">:</span> <span class="token string">"80000"</span><span class="token punctuation">,</span><br /><span class="token punctuation">}</span></code></pre>
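<p>As a sanity check, the three types of key mappings described earlier can be counted from these definitions. The sketch below assumes that “and then some” refers to the 21 extra keys in <code>keys_lower</code>, that the exclusions apply only to the <kbd>Cmd</kbd> mappings, and that the <kbd>Option</kbd> mappings cover the 26 letters only. The names <code>cmd_keys</code>, <code>cmd_ctrl_keys</code>, and <code>option_keys</code> are hypothetical.</p>

```python
import string

keys_lower = string.ascii_lowercase + r"1234567890-=[];'\,./`"
keys_exclude = "xcvq"  # cut, copy, paste, and quit stay untouched

# Type 1: Cmd-key, minus the excluded shortcuts
cmd_keys = [k for k in keys_lower if k not in keys_exclude]
# Type 2: Cmd-Ctrl-key, for every key
cmd_ctrl_keys = list(keys_lower)
# Type 3: Option-key, letters only
option_keys = list(string.ascii_lowercase)

total = len(cmd_keys) + len(cmd_ctrl_keys) + len(option_keys)
print(len(cmd_keys), len(cmd_ctrl_keys), len(option_keys), total)  # 43 47 26 116
```

<p>Under those assumptions the total comes to 116, the same number of key mappings mentioned above.</p>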
<p>If you want to use different modifiers or other special keys (e.g.,
arrow keys), my advice is to</p>
<ol>
<li>manually define a key mapping in iTerm,</li>
<li>export the key mapping to a file, and</li>
<li>inspect the JSON using, e.g., <code>jq . export.itermkeymap</code>.</li>
</ol>
<p>Two useful tools for finding hex codes are the ASCII manual page (see
<code>man ascii</code>) and hex dumps made with <code>xxd -p</code>.</p>
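<p>Hex codes can also be looked up from Python itself. This is a small sketch, with <code>hex_codes</code> being a hypothetical helper name; for ASCII input it returns the same codes you would read off <code>man ascii</code>.</p>

```python
def hex_codes(text):
    """Return the two-digit hex code of each ASCII character in text."""
    return [format(byte, "02x") for byte in text.encode("ascii")]


print(hex_codes("j"))    # ['6a']
print(hex_codes('!"#'))  # ['21', '22', '23']
```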
<h2>JSON structure explained</h2>
<p>As an example, consider the JSON corresponding to the key mapping
<kbd>Cmd-J</kbd> sending <kbd>Ctrl-B</kbd> <kbd>J</kbd>:</p>
<pre class="language-json"><code class="language-json"><span class="token punctuation">{</span><br /> <span class="token property">"Key Mappings"</span><span class="token operator">:</span> <span class="token punctuation">{</span> <br /> <span class="token property">"0x6a-0x100000-0x0"</span><span class="token operator">:</span> <span class="token punctuation">{</span><br /> <span class="token property">"Version"</span><span class="token operator">:</span> <span class="token number">1</span><span class="token punctuation">,</span><br /> <span class="token property">"Action"</span><span class="token operator">:</span> <span class="token number">11</span><span class="token punctuation">,</span><br /> <span class="token property">"Text"</span><span class="token operator">:</span> <span class="token string">"0x02 0x6a"</span><span class="token punctuation">,</span><br /> <span class="token property">"Label"</span><span class="token operator">:</span> <span class="token string">""</span><br /> <span class="token punctuation">}</span><br /> <span class="token punctuation">}</span><br /><span class="token punctuation">}</span></code></pre>
<p>I couldn’t find any documentation about the JSON structure, but here’s
what I’ve learned so far:</p>
<ul>
<li>“0x6a-0x100000-0x0”: Refers to the key combination you press to
trigger the key mapping, consisting of three parts:
<ol>
<li>“0x6a”: is the hex code for “j”. See <code>man ascii</code> for the complete
ASCII table in hex.</li>
<li>“0x100000”: refers to the <kbd>Cmd</kbd> key.</li>
<li>“0x0”: I’m not sure what this does. It can be safely left out when
importing a JSON file.</li>
</ol>
</li>
<li>“Version”: Is always 1; not interesting.</li>
<li>“Action”: “11” stands for “Send Hex Codes”.</li>
<li>“Text”: The hex codes being sent. In this case <kbd>Ctrl-B</kbd>
followed by <kbd>J</kbd>.</li>
<li>“Label”: Doesn’t seem to be used by iTerm. The script puts the key
mapping in plain text here for debugging purposes.</li>
</ul>
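<p>To tie the fields together, here is a minimal sketch that builds the three types of key mappings for a single letter. It is not the script from the appendix: the helper names <code>entry</code> and <code>mappings_for_letter</code> are hypothetical, the trailing “0x0” part is left out, and both the trigger used for the <kbd>Option</kbd> mapping and the “C-b C-h” label format are assumptions based on the structure described above.</p>

```python
import json

PREFIX_HEX = "02"  # Ctrl-B
MOD = {"cmd": "100000", "cmd+ctrl": "140000", "option": "80000"}


def entry(pressed, modifier, sent, label):
    """Build one (trigger, mapping) pair; sent is the character tmux should receive."""
    trigger = "0x{:x}-0x{}".format(ord(pressed), MOD[modifier])
    mapping = {
        "Version": 1,
        "Action": 11,  # Send Hex Codes
        "Text": "0x{} 0x{:02x}".format(PREFIX_HEX, ord(sent)),
        "Label": label,
    }
    return trigger, mapping


def mappings_for_letter(letter):
    """Return the three kinds of mappings for one lowercase ASCII letter."""
    upper = letter.upper()
    ctrl = chr(ord(upper) - 64)  # Ctrl-letter as a control character
    return dict([
        entry(letter, "cmd", letter, "C-b " + letter),
        entry(letter, "cmd+ctrl", upper, "C-b " + upper),
        entry(letter, "option", ctrl, "C-b C-" + letter),
    ])


keymap = {"Key Mappings": mappings_for_letter("h")}
print(json.dumps(keymap, indent=2))
```

<p>For the letter “h”, this produces the triggers “0x68-0x100000”, “0x68-0x140000”, and “0x68-0x80000”, sending “0x02 0x68”, “0x02 0x48”, and “0x02 0x08”, respectively.</p>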
<h2>Defining the corresponding tmux key bindings</h2>
<p>All these key mappings only make sense when you define the corresponding
key bindings in tmux<sup class="footnote-ref"><a href="https://jeroenjanssens.com/itermkeymap/#fn4" id="fnref4">[4]</a></sup>. Here are three key bindings to illustrate the
three types of mappings for the <kbd>H</kbd> key:</p>
<pre class="language-bash"><code class="language-bash"><span class="token builtin class-name">bind</span> h show-message <span class="token string">"Received Ctrl-B H"</span> <span class="token comment"># If you press Cmd-H</span><br /><span class="token builtin class-name">bind</span> H show-message <span class="token string">"Received Ctrl-B Shift-H"</span> <span class="token comment"># If you press Cmd-Ctrl-H</span><br /><span class="token builtin class-name">bind</span> ^h show-message <span class="token string">"Received Ctrl-B Ctrl-H"</span> <span class="token comment"># If you press Option-H</span></code></pre>
<p>My own tmux configuration defines, among many others, the following key
bindings:</p>
<pre class="language-bash"><code class="language-bash"><span class="token builtin class-name">bind</span> h select-pane <span class="token parameter variable">-L</span> <span class="token comment"># Select pane to the left</span><br /><span class="token builtin class-name">bind</span> j select-pane <span class="token parameter variable">-D</span> <span class="token comment"># Select pane below</span><br /><span class="token builtin class-name">bind</span> k select-pane <span class="token parameter variable">-U</span> <span class="token comment"># Select pane above</span><br /><span class="token builtin class-name">bind</span> l select-pane <span class="token parameter variable">-R</span> <span class="token comment"># Select pane to the right</span><br /><span class="token builtin class-name">bind</span> t new-window <span class="token parameter variable">-c</span> <span class="token string">"#{pane_current_path}"</span> <span class="token comment"># New window</span><br /><span class="token builtin class-name">bind</span> w kill-pane <span class="token comment"># Close current pane</span></code></pre>
<p>These key bindings enable me to navigate to other panes using
<kbd>Cmd-H</kbd>, <kbd>Cmd-J</kbd>, <kbd>Cmd-K</kbd>, and
<kbd>Cmd-L</kbd> (aka vim style). Moreover, common keyboard shortcuts
such as <kbd>Cmd-T</kbd> and <kbd>Cmd-W</kbd> are now intercepted; they
no longer create and close iTerm tabs, but create a new tmux window and
close the current tmux pane, respectively. After all, who needs tabs
when you’re running tmux?</p>
<p>That’s all for now. May your sessions live long and prosper.</p>
<p>– Jeroen</p>
<h2>Appendix: The complete script</h2>
<p>Below is the script I use to generate the JSON containing the key
mappings. Obviously this script can be improved in many ways, but it
gets the job done. You can also <a href="https://gist.github.com/jeroenjanssens/4050c1a328db89d62c6b9459b4544f68">download the script from
GitHub</a>.</p>
<pre class="language-python"><code class="language-python"><span class="token comment">#!/usr/bin/env python3</span><br /><br /><span class="token keyword">import</span> json<br /><span class="token keyword">import</span> string<br /><br />prefix_key <span class="token operator">=</span> <span class="token string">"b"</span> <span class="token comment"># My prefix key in tmux is Ctrl-B</span><br />prefix_hex <span class="token operator">=</span> <span class="token string">"02"</span> <span class="token comment"># The hex code that iTerm sends (corresponds to Ctrl-B)</span><br /><br />keys_upper <span class="token operator">=</span> string<span class="token punctuation">.</span>ascii_uppercase <span class="token operator">+</span> <span class="token string">r'!@#$%^&*()_+{}:"|<>?~'</span><br />keys_lower <span class="token operator">=</span> string<span class="token punctuation">.</span>ascii_lowercase <span class="token operator">+</span> <span class="token string">r"1234567890-=[];'\,./`"</span><br />keys_exclude <span class="token operator">=</span> <span class="token string">"xcvq"</span> <span class="token comment"># Don't override cut, copy, paste, and quit shortcuts</span><br /><br /><br />mod <span class="token operator">=</span> <span class="token punctuation">{</span><span class="token string">"cmd"</span><span class="token punctuation">:</span> <span class="token string">"100000"</span><span class="token punctuation">,</span><br /> <span class="token string">"cmd+ctrl"</span><span class="token punctuation">:</span> <span class="token string">"140000"</span><span class="token punctuation">,</span><br /> <span class="token string">"option"</span><span class="token punctuation">:</span> <span class="token string">"80000"</span><span class="token punctuation">,</span><br /><span class="token punctuation">}</span><br /><br />keys_all <span class="token operator">=</span> <span class="token string">""</span><span class="token punctuation">.</span>join<span 
class="token punctuation">(</span><span class="token builtin">sorted</span><span class="token punctuation">(</span><span class="token builtin">set</span><span class="token punctuation">(</span>keys_upper <span class="token operator">+</span><br /> keys_lower<span class="token punctuation">)</span><span class="token punctuation">.</span>difference<span class="token punctuation">(</span>keys_exclude<span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">)</span><br /><br />mappings <span class="token operator">=</span> <span class="token punctuation">{</span><span class="token punctuation">}</span><br /><br /><span class="token comment"># Cmd+<KEY> becomes Prefix <KEY></span><br /><span class="token comment"># Cmd+Ctrl+<KEY> becomes Prefix Shift+<KEY></span><br /><span class="token keyword">for</span> send_key <span class="token keyword">in</span> keys_all<span class="token punctuation">:</span><br /><br /> send_hex <span class="token operator">=</span> send_key<span class="token punctuation">.</span>encode<span class="token punctuation">(</span><span class="token string">"utf-8"</span><span class="token punctuation">)</span><span class="token punctuation">.</span><span class="token builtin">hex</span><span class="token punctuation">(</span><span class="token punctuation">)</span><br /><br /> <span class="token keyword">if</span> send_key <span class="token keyword">in</span> keys_upper<span class="token punctuation">:</span><br /> press_key <span class="token operator">=</span> keys_lower<span class="token punctuation">[</span>keys_upper<span class="token punctuation">.</span>index<span class="token punctuation">(</span>send_key<span class="token punctuation">)</span><span class="token punctuation">]</span><br /> press_mod <span class="token operator">=</span> mod<span class="token punctuation">[</span><span class="token string">"cmd+ctrl"</span><span class="token punctuation">]</span> <span class="token 
comment"># Command+Ctrl modifier</span><br /> <span class="token keyword">else</span><span class="token punctuation">:</span><br /> press_key <span class="token operator">=</span> send_key<br /> press_mod <span class="token operator">=</span> mod<span class="token punctuation">[</span><span class="token string">"cmd"</span><span class="token punctuation">]</span> <span class="token comment"># Command modifier</span><br /><br /> press_hex <span class="token operator">=</span> press_key<span class="token punctuation">.</span>encode<span class="token punctuation">(</span><span class="token string">"utf-8"</span><span class="token punctuation">)</span><span class="token punctuation">.</span><span class="token builtin">hex</span><span class="token punctuation">(</span><span class="token punctuation">)</span><br /><br /> mappings<span class="token punctuation">[</span><span class="token string-interpolation"><span class="token string">f"0x</span><span class="token interpolation"><span class="token punctuation">{</span>press_hex<span class="token punctuation">}</span></span><span class="token string">-0x</span><span class="token interpolation"><span class="token punctuation">{</span>press_mod<span class="token punctuation">}</span></span><span class="token string">"</span></span><span class="token punctuation">]</span> <span class="token operator">=</span> <span class="token punctuation">{</span><br /> <span class="token string">"Version"</span><span class="token punctuation">:</span> <span class="token number">1</span><span class="token punctuation">,</span><br /> <span class="token string">"Action"</span><span class="token punctuation">:</span> <span class="token number">11</span><span class="token punctuation">,</span> <span class="token comment"># Corresponds to "Send Hex Codes" action</span><br /> <span class="token string">"Text"</span><span class="token punctuation">:</span> <span class="token string-interpolation"><span class="token string">f"0x</span><span 
class="token interpolation"><span class="token punctuation">{</span>prefix_hex<span class="token punctuation">}</span></span><span class="token string"> 0x</span><span class="token interpolation"><span class="token punctuation">{</span>send_hex<span class="token punctuation">}</span></span><span class="token string">"</span></span><span class="token punctuation">,</span> <span class="token comment"># Hex codes to send</span><br /> <span class="token string">"Label"</span><span class="token punctuation">:</span> <span class="token string-interpolation"><span class="token string">f"C-</span><span class="token interpolation"><span class="token punctuation">{</span>prefix_key<span class="token punctuation">}</span></span><span class="token string"> </span><span class="token interpolation"><span class="token punctuation">{</span>send_key<span class="token punctuation">}</span></span><span class="token string">"</span></span> <span class="token comment"># Nice for debugging</span><br /> <span class="token punctuation">}</span><br /><br /><span class="token comment"># Option+<KEY> becomes Prefix Ctrl+<KEY></span><br /><span class="token comment"># Generate Option+A through Option+Z</span><br /><span class="token keyword">for</span> i<span class="token punctuation">,</span> send_key <span class="token keyword">in</span> <span class="token builtin">enumerate</span><span class="token punctuation">(</span>string<span class="token punctuation">.</span>ascii_lowercase<span class="token punctuation">,</span> start<span class="token operator">=</span><span class="token number">1</span><span class="token punctuation">)</span><span class="token punctuation">:</span><br /> send_hex <span class="token operator">=</span> <span class="token builtin">hex</span><span class="token punctuation">(</span>i<span class="token punctuation">)</span> <span class="token comment"># returns 0x1 for a, 0x2 for b, etc. (not zero-padded)</span><br /><br /> press_mod <span class="token operator">=</span> mod<span 
class="token punctuation">[</span><span class="token string">"option"</span><span class="token punctuation">]</span> <span class="token comment"># Option modifier</span><br /> press_hex <span class="token operator">=</span> send_key<span class="token punctuation">.</span>encode<span class="token punctuation">(</span><span class="token string">"utf-8"</span><span class="token punctuation">)</span><span class="token punctuation">.</span><span class="token builtin">hex</span><span class="token punctuation">(</span><span class="token punctuation">)</span><br /><br /> mappings<span class="token punctuation">[</span><span class="token string-interpolation"><span class="token string">f"0x</span><span class="token interpolation"><span class="token punctuation">{</span>press_hex<span class="token punctuation">}</span></span><span class="token string">-0x</span><span class="token interpolation"><span class="token punctuation">{</span>press_mod<span class="token punctuation">}</span></span><span class="token string">"</span></span><span class="token punctuation">]</span> <span class="token operator">=</span> <span class="token punctuation">{</span><br /> <span class="token string">"Version"</span><span class="token punctuation">:</span> <span class="token number">1</span><span class="token punctuation">,</span><br /> <span class="token string">"Action"</span><span class="token punctuation">:</span> <span class="token number">11</span><span class="token punctuation">,</span> <span class="token comment"># Corresponds to "Send Hex Codes" action</span><br /> <span class="token string">"Text"</span><span class="token punctuation">:</span> <span class="token string-interpolation"><span class="token string">f"0x</span><span class="token interpolation"><span class="token punctuation">{</span>prefix_hex<span class="token punctuation">}</span></span><span class="token string"> </span><span class="token interpolation"><span class="token punctuation">{</span>send_hex<span class="token 
punctuation">}</span></span><span class="token string">"</span></span><span class="token punctuation">,</span> <span class="token comment"># Hex codes to send</span><br /> <span class="token string">"Label"</span><span class="token punctuation">:</span> <span class="token string-interpolation"><span class="token string">f"C-</span><span class="token interpolation"><span class="token punctuation">{</span>prefix_key<span class="token punctuation">}</span></span><span class="token string"> C-</span><span class="token interpolation"><span class="token punctuation">{</span>send_key<span class="token punctuation">}</span></span><span class="token string">"</span></span> <span class="token comment"># Nice for debugging</span><br /> <span class="token punctuation">}</span><br /><br />result <span class="token operator">=</span> <span class="token punctuation">{</span><span class="token string">"Key Mappings"</span><span class="token punctuation">:</span> mappings<span class="token punctuation">}</span><br /><span class="token keyword">print</span><span class="token punctuation">(</span>json<span class="token punctuation">.</span>dumps<span class="token punctuation">(</span>result<span class="token punctuation">,</span> indent<span class="token operator">=</span><span class="token number">4</span><span class="token punctuation">)</span><span class="token punctuation">)</span></code></pre>
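<p>To check the generated output, it can help to decode a key mapping back
into something readable. The following is a rough sketch under the same
assumptions as the script (the three modifier values listed there); the
function name <code>describe</code> is hypothetical and the sample entries are
written by hand rather than taken from an actual iTerm export.</p>

```python
# Modifier values from the script above, mapped back to readable names.
MODIFIER_NAMES = {0x100000: "Cmd", 0x140000: "Cmd-Ctrl", 0x80000: "Option"}


def describe(mapping_key, entry):
    """Turn one key mapping into a readable line (hypothetical helper)."""
    parts = mapping_key.split("-")  # the optional third part is ignored
    char = chr(int(parts[0], 16))
    modifier = MODIFIER_NAMES.get(int(parts[1], 16), parts[1])
    # Normalize the hex codes so that 0x8 and 0x08 are displayed the same way.
    codes = " ".join(f"0x{int(code, 16):02x}" for code in entry["Text"].split())
    return f"{modifier}-{char} sends {codes}"


print(describe("0x6a-0x100000-0x0", {"Text": "0x02 0x6a"}))
# Cmd-j sends 0x02 0x6a
print(describe("0x68-0x80000", {"Text": "0x02 0x8"}))
# Option-h sends 0x02 0x08
```

<p>Applied to every item under <code>"Key Mappings"</code>, this gives a quick
overview of what each shortcut will send to tmux.</p>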
<hr class="footnotes-sep" />
<section class="footnotes">
<ol class="footnotes-list">
<li id="fn1" class="footnote-item"><p>Although you cannot bind to the <kbd>Cmd</kbd> key in tmux, you
can still use it in an iTerm key mapping to simulate pressing other
key combinations. <a href="https://jeroenjanssens.com/itermkeymap/#fnref1" class="footnote-backref">↩︎</a></p>
</li>
<li id="fn2" class="footnote-item"><p>For some reason, I’m having trouble mapping <kbd>Cmd-Shift</kbd>
so I’m currently using <kbd>Cmd-Ctrl</kbd>. <a href="https://jeroenjanssens.com/itermkeymap/#fnref2" class="footnote-backref">↩︎</a></p>
</li>
<li id="fn3" class="footnote-item"><p>In case you were wondering about <code>trim</code>, it can be found in my
<a href="https://github.com/jeroenjanssens/dsutils">dsutils repository</a>. <a href="https://jeroenjanssens.com/itermkeymap/#fnref3" class="footnote-backref">↩︎</a></p>
</li>
<li id="fn4" class="footnote-item"><p>The tmux configuration is usually located at <em>~/.tmux.conf</em> or
<em>~/.config/tmux/tmux.conf</em>. <a href="https://jeroenjanssens.com/itermkeymap/#fnref4" class="footnote-backref">↩︎</a></p>
</li>
</ol>
</section>
I'm Selling 16 Domains Including datascienceworkshops dot com2023-01-10T00:00:00Zhttps://jeroenjanssens.com/domains-for-sale/<p>Because <a href="https://jeroenjanssens.com/closed">I’ve closed my company Data Science Workshops</a>, I
don’t need its 16 domain names any more. They’ve served me well over
the years and I hope to find a new owner who can put them to good use.</p>
<p>They are as follows:</p>
<ol>
<li><a href="https://datascienceworkshops.com/">https://datascienceworkshops.com</a></li>
<li><a href="https://datascienceworkshops.nl/">https://datascienceworkshops.nl</a></li>
<li><a href="https://datascienceworkshops.be/">https://datascienceworkshops.be</a></li>
<li><a href="https://datascienceworkshops.uk/">https://datascienceworkshops.uk</a></li>
<li><a href="https://datascienceworkshops.de/">https://datascienceworkshops.de</a></li>
<li><a href="https://datascienceworkshops.eu/">https://datascienceworkshops.eu</a></li>
<li><a href="https://datascienceworkshops.net/">https://datascienceworkshops.net</a></li>
<li><a href="https://datascienceworkshops.org/">https://datascienceworkshops.org</a></li>
<li><a href="https://datascienceworkshops.info/">https://datascienceworkshops.info</a></li>
<li><a href="https://datascienceworkshops.co.uk/">https://datascienceworkshops.co.uk</a></li>
<li><a href="https://datascienceworkshop.com/">https://datascienceworkshop.com</a></li>
<li><a href="https://datascienceworkshop.nl/">https://datascienceworkshop.nl</a></li>
<li><a href="https://datasciencework.shop/">https://datasciencework.shop</a></li>
<li><a href="https://data-science-workshop.nl/">https://data-science-workshop.nl</a></li>
<li><a href="https://data-science-workshops.com/">https://data-science-workshops.com</a></li>
<li><a href="https://data-science-workshops.nl/">https://data-science-workshops.nl</a></li>
</ol>
<p>If you’re interested in purchasing these domains, please email me at
<a href="mailto:jeroen@jeroenjanssens.com">jeroen@jeroenjanssens.com</a>.</p>
<p>– Jeroen</p>
Closing Shop2023-01-01T00:00:00Zhttps://jeroenjanssens.com/closed/<p>I have decided to close my company Data Science Workshops B.V.</p>
<figure>
<a href="https://jeroenjanssens.com/img/data-science-workshops-jeroen-janssens.en.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.5)" srcset="https://jeroenjanssens.com/img/A5m5xSpD4G-168.webp 168w, https://jeroenjanssens.com/img/A5m5xSpD4G-252.webp 252w, https://jeroenjanssens.com/img/A5m5xSpD4G-336.webp 336w, https://jeroenjanssens.com/img/A5m5xSpD4G-504.webp 504w, https://jeroenjanssens.com/img/A5m5xSpD4G-672.webp 672w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.5)" srcset="https://jeroenjanssens.com/img/A5m5xSpD4G-168.webp 168w, https://jeroenjanssens.com/img/A5m5xSpD4G-252.webp 252w, https://jeroenjanssens.com/img/A5m5xSpD4G-336.webp 336w, https://jeroenjanssens.com/img/A5m5xSpD4G-504.webp 504w, https://jeroenjanssens.com/img/A5m5xSpD4G-672.webp 672w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.5)" srcset="https://jeroenjanssens.com/img/A5m5xSpD4G-168.jpeg 168w, https://jeroenjanssens.com/img/A5m5xSpD4G-252.jpeg 252w, https://jeroenjanssens.com/img/A5m5xSpD4G-336.jpeg 336w, https://jeroenjanssens.com/img/A5m5xSpD4G-504.jpeg 504w, https://jeroenjanssens.com/img/A5m5xSpD4G-672.jpeg 672w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.5)" srcset="https://jeroenjanssens.com/img/A5m5xSpD4G-168.jpeg 168w, https://jeroenjanssens.com/img/A5m5xSpD4G-252.jpeg 252w, https://jeroenjanssens.com/img/A5m5xSpD4G-336.jpeg 336w, https://jeroenjanssens.com/img/A5m5xSpD4G-504.jpeg 504w, https://jeroenjanssens.com/img/A5m5xSpD4G-672.jpeg 672w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 50%;" src="https://jeroenjanssens.com/img/A5m5xSpD4G-168.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
<p>For nearly seven years I’ve had the pleasure of teaching a variety of
data science topics to hundreds of students. If you’re curious which
workshops I gave and to which clients, you can <a href="https://jeroenjanssens.com/dsw">browse an archived copy
of the Data Science Workshops website</a>. Should you have any other
questions, please don’t hesitate to <a href="mailto:jeroen@jeroenjanssens.com">reach
out</a>.</p>
<figure>
<a href="https://jeroenjanssens.com/img/dsw/kpn-python-inspiration-session-01.jpg">
<picture>
<source type="image/webp" sizes="calc(42rem * 1)" srcset="https://jeroenjanssens.com/img/m9uZngGvyi-336.webp 336w, https://jeroenjanssens.com/img/m9uZngGvyi-504.webp 504w, https://jeroenjanssens.com/img/m9uZngGvyi-672.webp 672w, https://jeroenjanssens.com/img/m9uZngGvyi-1008.webp 1008w, https://jeroenjanssens.com/img/m9uZngGvyi-1344.webp 1344w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 1)" srcset="https://jeroenjanssens.com/img/m9uZngGvyi-336.webp 336w, https://jeroenjanssens.com/img/m9uZngGvyi-504.webp 504w, https://jeroenjanssens.com/img/m9uZngGvyi-672.webp 672w, https://jeroenjanssens.com/img/m9uZngGvyi-1008.webp 1008w, https://jeroenjanssens.com/img/m9uZngGvyi-1344.webp 1344w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 1)" srcset="https://jeroenjanssens.com/img/m9uZngGvyi-336.jpeg 336w, https://jeroenjanssens.com/img/m9uZngGvyi-504.jpeg 504w, https://jeroenjanssens.com/img/m9uZngGvyi-672.jpeg 672w, https://jeroenjanssens.com/img/m9uZngGvyi-1008.jpeg 1008w, https://jeroenjanssens.com/img/m9uZngGvyi-1344.jpeg 1344w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 1)" srcset="https://jeroenjanssens.com/img/m9uZngGvyi-336.jpeg 336w, https://jeroenjanssens.com/img/m9uZngGvyi-504.jpeg 504w, https://jeroenjanssens.com/img/m9uZngGvyi-672.jpeg 672w, https://jeroenjanssens.com/img/m9uZngGvyi-1008.jpeg 1008w, https://jeroenjanssens.com/img/m9uZngGvyi-1344.jpeg 1344w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 100%;" src="https://jeroenjanssens.com/img/m9uZngGvyi-336.jpeg" alt="Inspiration session about Python at KPN in Amsterdam" loading="lazy" />
</picture></a>
<figcaption>Inspiration session about Python at KPN in Amsterdam</figcaption>
</figure>
<p>Having my own training company has mostly been incredibly rewarding and
educational. That feeling when you see a student “get it” is priceless.
So is landing a 10-week training at a big client. Or traveling to
Nigeria, the US, or Kuwait, for that matter.</p>
<figure>
<a href="https://jeroenjanssens.com/img/dsw/eha-04.jpg">
<picture>
<source type="image/webp" sizes="calc(42rem * 1)" srcset="https://jeroenjanssens.com/img/Wq-rKoINl4-336.webp 336w, https://jeroenjanssens.com/img/Wq-rKoINl4-504.webp 504w, https://jeroenjanssens.com/img/Wq-rKoINl4-672.webp 672w, https://jeroenjanssens.com/img/Wq-rKoINl4-1008.webp 1008w, https://jeroenjanssens.com/img/Wq-rKoINl4-1344.webp 1344w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 1)" srcset="https://jeroenjanssens.com/img/Wq-rKoINl4-336.webp 336w, https://jeroenjanssens.com/img/Wq-rKoINl4-504.webp 504w, https://jeroenjanssens.com/img/Wq-rKoINl4-672.webp 672w, https://jeroenjanssens.com/img/Wq-rKoINl4-1008.webp 1008w, https://jeroenjanssens.com/img/Wq-rKoINl4-1344.webp 1344w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 1)" srcset="https://jeroenjanssens.com/img/Wq-rKoINl4-336.jpeg 336w, https://jeroenjanssens.com/img/Wq-rKoINl4-504.jpeg 504w, https://jeroenjanssens.com/img/Wq-rKoINl4-672.jpeg 672w, https://jeroenjanssens.com/img/Wq-rKoINl4-1008.jpeg 1008w, https://jeroenjanssens.com/img/Wq-rKoINl4-1344.jpeg 1344w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 1)" srcset="https://jeroenjanssens.com/img/Wq-rKoINl4-336.jpeg 336w, https://jeroenjanssens.com/img/Wq-rKoINl4-504.jpeg 504w, https://jeroenjanssens.com/img/Wq-rKoINl4-672.jpeg 672w, https://jeroenjanssens.com/img/Wq-rKoINl4-1008.jpeg 1008w, https://jeroenjanssens.com/img/Wq-rKoINl4-1344.jpeg 1344w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 100%;" src="https://jeroenjanssens.com/img/Wq-rKoINl4-336.jpeg" alt="Group picture at e-Health Africa in Kano, Nigeria" loading="lazy" />
</picture></a>
<figcaption>Group picture at e-Health Africa in Kano, Nigeria</figcaption>
</figure>
<p>I’ve not only been able to improve my own data science and teaching
skills, I’ve also learned valuable things about marketing, sales,
copywriting, advertising, not advertising, implementing static
multi-lingual websites, doing taxes, delegating taxes, hiring other
instructors, organizing meetups, editing videos, teaching online, live
streaming, et cetera. Mostly by making plenty of mistakes.</p>
<figure>
<a href="https://jeroenjanssens.com/img/dsw/transavia-hackathon-03.jpg">
<picture>
<source type="image/webp" sizes="calc(42rem * 1)" srcset="https://jeroenjanssens.com/img/2QiR9r0VUp-336.webp 336w, https://jeroenjanssens.com/img/2QiR9r0VUp-504.webp 504w, https://jeroenjanssens.com/img/2QiR9r0VUp-672.webp 672w, https://jeroenjanssens.com/img/2QiR9r0VUp-1008.webp 1008w, https://jeroenjanssens.com/img/2QiR9r0VUp-1344.webp 1344w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 1)" srcset="https://jeroenjanssens.com/img/2QiR9r0VUp-336.webp 336w, https://jeroenjanssens.com/img/2QiR9r0VUp-504.webp 504w, https://jeroenjanssens.com/img/2QiR9r0VUp-672.webp 672w, https://jeroenjanssens.com/img/2QiR9r0VUp-1008.webp 1008w, https://jeroenjanssens.com/img/2QiR9r0VUp-1344.webp 1344w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 1)" srcset="https://jeroenjanssens.com/img/2QiR9r0VUp-336.jpeg 336w, https://jeroenjanssens.com/img/2QiR9r0VUp-504.jpeg 504w, https://jeroenjanssens.com/img/2QiR9r0VUp-672.jpeg 672w, https://jeroenjanssens.com/img/2QiR9r0VUp-1008.jpeg 1008w, https://jeroenjanssens.com/img/2QiR9r0VUp-1344.jpeg 1344w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 1)" srcset="https://jeroenjanssens.com/img/2QiR9r0VUp-336.jpeg 336w, https://jeroenjanssens.com/img/2QiR9r0VUp-504.jpeg 504w, https://jeroenjanssens.com/img/2QiR9r0VUp-672.jpeg 672w, https://jeroenjanssens.com/img/2QiR9r0VUp-1008.jpeg 1008w, https://jeroenjanssens.com/img/2QiR9r0VUp-1344.jpeg 1344w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 100%;" src="https://jeroenjanssens.com/img/2QiR9r0VUp-336.jpeg" alt="Two-day data hackathon at Transavia in Schiphol" loading="lazy" />
</picture></a>
<figcaption>Two-day data hackathon at Transavia in Schiphol</figcaption>
</figure>
<p>However, in the past few months, partly due to the pandemic, having a
company got increasingly lonely and uncertain. Long story short,
teaching and acquisition are a lot easier and more enjoyable to do in
person!</p>
<figure>
<a href="https://jeroenjanssens.com/img/dsw/equate-data-science-r-03.jpg">
<picture>
<source type="image/webp" sizes="calc(42rem * 1)" srcset="https://jeroenjanssens.com/img/K_aXeRxTRf-336.webp 336w, https://jeroenjanssens.com/img/K_aXeRxTRf-504.webp 504w, https://jeroenjanssens.com/img/K_aXeRxTRf-672.webp 672w, https://jeroenjanssens.com/img/K_aXeRxTRf-1008.webp 1008w, https://jeroenjanssens.com/img/K_aXeRxTRf-1344.webp 1344w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 1)" srcset="https://jeroenjanssens.com/img/K_aXeRxTRf-336.webp 336w, https://jeroenjanssens.com/img/K_aXeRxTRf-504.webp 504w, https://jeroenjanssens.com/img/K_aXeRxTRf-672.webp 672w, https://jeroenjanssens.com/img/K_aXeRxTRf-1008.webp 1008w, https://jeroenjanssens.com/img/K_aXeRxTRf-1344.webp 1344w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 1)" srcset="https://jeroenjanssens.com/img/K_aXeRxTRf-336.jpeg 336w, https://jeroenjanssens.com/img/K_aXeRxTRf-504.jpeg 504w, https://jeroenjanssens.com/img/K_aXeRxTRf-672.jpeg 672w, https://jeroenjanssens.com/img/K_aXeRxTRf-1008.jpeg 1008w, https://jeroenjanssens.com/img/K_aXeRxTRf-1344.jpeg 1344w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 1)" srcset="https://jeroenjanssens.com/img/K_aXeRxTRf-336.jpeg 336w, https://jeroenjanssens.com/img/K_aXeRxTRf-504.jpeg 504w, https://jeroenjanssens.com/img/K_aXeRxTRf-672.jpeg 672w, https://jeroenjanssens.com/img/K_aXeRxTRf-1008.jpeg 1008w, https://jeroenjanssens.com/img/K_aXeRxTRf-1344.jpeg 1344w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 100%;" src="https://jeroenjanssens.com/img/K_aXeRxTRf-336.jpeg" alt="Group picture at EQUATE in Ahmadi, Kuwait" loading="lazy" />
</picture></a>
<figcaption>Group picture at EQUATE in Ahmadi, Kuwait</figcaption>
</figure>
<p>Besides this, an important reason why I decided to make the switch from
solo entrepreneur to employee is that I missed building things. Training
is inherently a short-term engagement where the focus is to teach. I’m
happy to share that I started working at
<a href="https://www.xomnia.com/">Xomnia</a> as a Senior Machine Learning Engineer.
I can now collaborate with colleagues on bigger projects.</p>
<figure>
<a href="https://jeroenjanssens.com/img/dsw/data-science-meetup-picnic-01.jpg">
<picture>
<source type="image/webp" sizes="calc(42rem * 1)" srcset="https://jeroenjanssens.com/img/iBV7P_CiXt-336.webp 336w, https://jeroenjanssens.com/img/iBV7P_CiXt-504.webp 504w, https://jeroenjanssens.com/img/iBV7P_CiXt-672.webp 672w, https://jeroenjanssens.com/img/iBV7P_CiXt-1008.webp 1008w, https://jeroenjanssens.com/img/iBV7P_CiXt-1344.webp 1344w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 1)" srcset="https://jeroenjanssens.com/img/iBV7P_CiXt-336.webp 336w, https://jeroenjanssens.com/img/iBV7P_CiXt-504.webp 504w, https://jeroenjanssens.com/img/iBV7P_CiXt-672.webp 672w, https://jeroenjanssens.com/img/iBV7P_CiXt-1008.webp 1008w, https://jeroenjanssens.com/img/iBV7P_CiXt-1344.webp 1344w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 1)" srcset="https://jeroenjanssens.com/img/iBV7P_CiXt-336.jpeg 336w, https://jeroenjanssens.com/img/iBV7P_CiXt-504.jpeg 504w, https://jeroenjanssens.com/img/iBV7P_CiXt-672.jpeg 672w, https://jeroenjanssens.com/img/iBV7P_CiXt-1008.jpeg 1008w, https://jeroenjanssens.com/img/iBV7P_CiXt-1344.jpeg 1344w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 1)" srcset="https://jeroenjanssens.com/img/iBV7P_CiXt-336.jpeg 336w, https://jeroenjanssens.com/img/iBV7P_CiXt-504.jpeg 504w, https://jeroenjanssens.com/img/iBV7P_CiXt-672.jpeg 672w, https://jeroenjanssens.com/img/iBV7P_CiXt-1008.jpeg 1008w, https://jeroenjanssens.com/img/iBV7P_CiXt-1344.jpeg 1344w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 100%;" src="https://jeroenjanssens.com/img/iBV7P_CiXt-336.jpeg" alt="Data Science NL meetup at Picnic in Amsterdam" loading="lazy" />
</picture></a>
<figcaption>Data Science NL meetup at Picnic in Amsterdam</figcaption>
</figure>
<p>That doesn’t mean I’ll stop teaching altogether. I’m still available
for workshops related to my book <a href="https://jeroenjanssens.com/dsatcl">Data Science at the Command
Line</a>. Other workshops, such as data visualization or machine
learning, are still possible, but these will be facilitated via Xomnia.
If you want to know more, best to <a href="mailto:jeroen@jeroenjanssens.com">send me an
email</a>.</p>
<figure>
<a href="https://jeroenjanssens.com/img/dsw/xomnia-web-scraping-01.jpg">
<picture>
<source type="image/webp" sizes="calc(42rem * 1)" srcset="https://jeroenjanssens.com/img/rkwvsBez3o-336.webp 336w, https://jeroenjanssens.com/img/rkwvsBez3o-504.webp 504w, https://jeroenjanssens.com/img/rkwvsBez3o-672.webp 672w, https://jeroenjanssens.com/img/rkwvsBez3o-1008.webp 1008w, https://jeroenjanssens.com/img/rkwvsBez3o-1344.webp 1344w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 1)" srcset="https://jeroenjanssens.com/img/rkwvsBez3o-336.webp 336w, https://jeroenjanssens.com/img/rkwvsBez3o-504.webp 504w, https://jeroenjanssens.com/img/rkwvsBez3o-672.webp 672w, https://jeroenjanssens.com/img/rkwvsBez3o-1008.webp 1008w, https://jeroenjanssens.com/img/rkwvsBez3o-1344.webp 1344w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 1)" srcset="https://jeroenjanssens.com/img/rkwvsBez3o-336.jpeg 336w, https://jeroenjanssens.com/img/rkwvsBez3o-504.jpeg 504w, https://jeroenjanssens.com/img/rkwvsBez3o-672.jpeg 672w, https://jeroenjanssens.com/img/rkwvsBez3o-1008.jpeg 1008w, https://jeroenjanssens.com/img/rkwvsBez3o-1344.jpeg 1344w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 1)" srcset="https://jeroenjanssens.com/img/rkwvsBez3o-336.jpeg 336w, https://jeroenjanssens.com/img/rkwvsBez3o-504.jpeg 504w, https://jeroenjanssens.com/img/rkwvsBez3o-672.jpeg 672w, https://jeroenjanssens.com/img/rkwvsBez3o-1008.jpeg 1008w, https://jeroenjanssens.com/img/rkwvsBez3o-1344.jpeg 1344w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 100%;" src="https://jeroenjanssens.com/img/rkwvsBez3o-336.jpeg" alt="Workshop Web Scraping with R at Xomnia in Amsterdam" loading="lazy" />
</picture></a>
<figcaption>Workshop Web Scraping with R at Xomnia in Amsterdam</figcaption>
</figure>
<p>I’d like to thank all the students, managers, clients, fellow
instructors, conference organizers, meetup members, meetup<sup class="footnote-ref"><a href="https://jeroenjanssens.com/closed/#fn1" id="fnref1">[1]</a></sup> speakers,
meetup hosts, newsletter subscribers<sup class="footnote-ref"><a href="https://jeroenjanssens.com/closed/#fn2" id="fnref2">[2]</a></sup>, friends, and family who have
supported me along the way. Your questions, answers,
feedback, ideas, follows, likes, dislikes, acceptances, and rejections
have made this endeavor worthwhile.</p>
<p>– Jeroen</p>
<hr class="footnotes-sep" />
<section class="footnotes">
<ol class="footnotes-list">
<li id="fn1" class="footnote-item"><p>I’m also no longer organizing the Data Science NL meetup. Luckily,
there are plenty of other data-related meetups being organized in
the Netherlands, such as the <a href="https://www.xomnia.com/event-series/data-and-drinks/">Data & Drinks
meetup</a> by
Xomnia. <a href="https://jeroenjanssens.com/closed/#fnref1" class="footnote-backref">↩︎</a></p>
</li>
<li id="fn2" class="footnote-item"><p>The Data Science Workshops newsletter has been replaced by a new,
personal one. You’ll have to <a href="https://jeroenjanssens.com/newsletter">sign up</a> again if you
want to continue receiving emails. <a href="https://jeroenjanssens.com/closed/#fnref2" class="footnote-backref">↩︎</a></p>
</li>
</ol>
</section>
I'm Building a Lego Table
2022-12-22T00:00:00Z
https://jeroenjanssens.com/legotable/
<p>I’m building a Lego table for my kids (and, I’ll admit, a little bit for
myself). It’s still a work in progress, but here are some highlights of
this hobby project so far.</p>
<figure>
<a href="https://jeroenjanssens.com/img/legotable/IMG_6647.jpg">
<picture>
<source type="image/webp" sizes="calc(42rem * 1)" srcset="https://jeroenjanssens.com/img/Jb8Rd5fBmH-336.webp 336w, https://jeroenjanssens.com/img/Jb8Rd5fBmH-504.webp 504w, https://jeroenjanssens.com/img/Jb8Rd5fBmH-672.webp 672w, https://jeroenjanssens.com/img/Jb8Rd5fBmH-1008.webp 1008w, https://jeroenjanssens.com/img/Jb8Rd5fBmH-1344.webp 1344w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 1)" srcset="https://jeroenjanssens.com/img/Jb8Rd5fBmH-336.webp 336w, https://jeroenjanssens.com/img/Jb8Rd5fBmH-504.webp 504w, https://jeroenjanssens.com/img/Jb8Rd5fBmH-672.webp 672w, https://jeroenjanssens.com/img/Jb8Rd5fBmH-1008.webp 1008w, https://jeroenjanssens.com/img/Jb8Rd5fBmH-1344.webp 1344w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 1)" srcset="https://jeroenjanssens.com/img/Jb8Rd5fBmH-336.jpeg 336w, https://jeroenjanssens.com/img/Jb8Rd5fBmH-504.jpeg 504w, https://jeroenjanssens.com/img/Jb8Rd5fBmH-672.jpeg 672w, https://jeroenjanssens.com/img/Jb8Rd5fBmH-1008.jpeg 1008w, https://jeroenjanssens.com/img/Jb8Rd5fBmH-1344.jpeg 1344w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 1)" srcset="https://jeroenjanssens.com/img/Jb8Rd5fBmH-336.jpeg 336w, https://jeroenjanssens.com/img/Jb8Rd5fBmH-504.jpeg 504w, https://jeroenjanssens.com/img/Jb8Rd5fBmH-672.jpeg 672w, https://jeroenjanssens.com/img/Jb8Rd5fBmH-1008.jpeg 1008w, https://jeroenjanssens.com/img/Jb8Rd5fBmH-1344.jpeg 1344w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 100%;" src="https://jeroenjanssens.com/img/Jb8Rd5fBmH-336.jpeg" alt="First things first: plywood" loading="lazy" />
</picture></a>
<figcaption>First things first: plywood</figcaption>
</figure>
<figure>
<a href="https://jeroenjanssens.com/img/legotable/IMG_6970.jpg">
<picture>
<source type="image/webp" sizes="calc(42rem * 1)" srcset="https://jeroenjanssens.com/img/Y3PjZ90edI-336.webp 336w, https://jeroenjanssens.com/img/Y3PjZ90edI-504.webp 504w, https://jeroenjanssens.com/img/Y3PjZ90edI-672.webp 672w, https://jeroenjanssens.com/img/Y3PjZ90edI-1008.webp 1008w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 1)" srcset="https://jeroenjanssens.com/img/Y3PjZ90edI-336.webp 336w, https://jeroenjanssens.com/img/Y3PjZ90edI-504.webp 504w, https://jeroenjanssens.com/img/Y3PjZ90edI-672.webp 672w, https://jeroenjanssens.com/img/Y3PjZ90edI-1008.webp 1008w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 1)" srcset="https://jeroenjanssens.com/img/Y3PjZ90edI-336.jpeg 336w, https://jeroenjanssens.com/img/Y3PjZ90edI-504.jpeg 504w, https://jeroenjanssens.com/img/Y3PjZ90edI-672.jpeg 672w, https://jeroenjanssens.com/img/Y3PjZ90edI-1008.jpeg 1008w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 1)" srcset="https://jeroenjanssens.com/img/Y3PjZ90edI-336.jpeg 336w, https://jeroenjanssens.com/img/Y3PjZ90edI-504.jpeg 504w, https://jeroenjanssens.com/img/Y3PjZ90edI-672.jpeg 672w, https://jeroenjanssens.com/img/Y3PjZ90edI-1008.jpeg 1008w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 100%;" src="https://jeroenjanssens.com/img/Y3PjZ90edI-336.jpeg" alt="I don't have a fancy ventilation system, so it's important to mask up" loading="lazy" />
</picture></a>
<figcaption>I don't have a fancy ventilation system, so it's important to mask up</figcaption>
</figure>
<figure>
<a href="https://jeroenjanssens.com/img/legotable/IMG_6722.jpg">
<picture>
<source type="image/webp" sizes="calc(42rem * 1)" srcset="https://jeroenjanssens.com/img/Cc6Trlk9Ft-336.webp 336w, https://jeroenjanssens.com/img/Cc6Trlk9Ft-504.webp 504w, https://jeroenjanssens.com/img/Cc6Trlk9Ft-672.webp 672w, https://jeroenjanssens.com/img/Cc6Trlk9Ft-1008.webp 1008w, https://jeroenjanssens.com/img/Cc6Trlk9Ft-1344.webp 1344w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 1)" srcset="https://jeroenjanssens.com/img/Cc6Trlk9Ft-336.webp 336w, https://jeroenjanssens.com/img/Cc6Trlk9Ft-504.webp 504w, https://jeroenjanssens.com/img/Cc6Trlk9Ft-672.webp 672w, https://jeroenjanssens.com/img/Cc6Trlk9Ft-1008.webp 1008w, https://jeroenjanssens.com/img/Cc6Trlk9Ft-1344.webp 1344w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 1)" srcset="https://jeroenjanssens.com/img/Cc6Trlk9Ft-336.jpeg 336w, https://jeroenjanssens.com/img/Cc6Trlk9Ft-504.jpeg 504w, https://jeroenjanssens.com/img/Cc6Trlk9Ft-672.jpeg 672w, https://jeroenjanssens.com/img/Cc6Trlk9Ft-1008.jpeg 1008w, https://jeroenjanssens.com/img/Cc6Trlk9Ft-1344.jpeg 1344w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 1)" srcset="https://jeroenjanssens.com/img/Cc6Trlk9Ft-336.jpeg 336w, https://jeroenjanssens.com/img/Cc6Trlk9Ft-504.jpeg 504w, https://jeroenjanssens.com/img/Cc6Trlk9Ft-672.jpeg 672w, https://jeroenjanssens.com/img/Cc6Trlk9Ft-1008.jpeg 1008w, https://jeroenjanssens.com/img/Cc6Trlk9Ft-1344.jpeg 1344w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 100%;" src="https://jeroenjanssens.com/img/Cc6Trlk9Ft-336.jpeg" alt="I made a router jig to cut some clean arcs" loading="lazy" />
</picture></a>
<figcaption>I made a router jig to cut some clean arcs</figcaption>
</figure>
<figure>
<a href="https://jeroenjanssens.com/img/legotable/IMG_6749.jpg">
<picture>
<source type="image/webp" sizes="calc(42rem * 1)" srcset="https://jeroenjanssens.com/img/M2NEPZWXIu-336.webp 336w, https://jeroenjanssens.com/img/M2NEPZWXIu-504.webp 504w, https://jeroenjanssens.com/img/M2NEPZWXIu-672.webp 672w, https://jeroenjanssens.com/img/M2NEPZWXIu-1008.webp 1008w, https://jeroenjanssens.com/img/M2NEPZWXIu-1344.webp 1344w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 1)" srcset="https://jeroenjanssens.com/img/M2NEPZWXIu-336.webp 336w, https://jeroenjanssens.com/img/M2NEPZWXIu-504.webp 504w, https://jeroenjanssens.com/img/M2NEPZWXIu-672.webp 672w, https://jeroenjanssens.com/img/M2NEPZWXIu-1008.webp 1008w, https://jeroenjanssens.com/img/M2NEPZWXIu-1344.webp 1344w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 1)" srcset="https://jeroenjanssens.com/img/M2NEPZWXIu-336.jpeg 336w, https://jeroenjanssens.com/img/M2NEPZWXIu-504.jpeg 504w, https://jeroenjanssens.com/img/M2NEPZWXIu-672.jpeg 672w, https://jeroenjanssens.com/img/M2NEPZWXIu-1008.jpeg 1008w, https://jeroenjanssens.com/img/M2NEPZWXIu-1344.jpeg 1344w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 1)" srcset="https://jeroenjanssens.com/img/M2NEPZWXIu-336.jpeg 336w, https://jeroenjanssens.com/img/M2NEPZWXIu-504.jpeg 504w, https://jeroenjanssens.com/img/M2NEPZWXIu-672.jpeg 672w, https://jeroenjanssens.com/img/M2NEPZWXIu-1008.jpeg 1008w, https://jeroenjanssens.com/img/M2NEPZWXIu-1344.jpeg 1344w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 100%;" src="https://jeroenjanssens.com/img/M2NEPZWXIu-336.jpeg" alt="For these two sides." loading="lazy" />
</picture></a>
<figcaption>For these two sides.</figcaption>
</figure>
<figure>
<a href="https://jeroenjanssens.com/img/legotable/IMG_6873.jpg">
<picture>
<source type="image/webp" sizes="calc(42rem * 1)" srcset="https://jeroenjanssens.com/img/_tSZD2DIFt-336.webp 336w, https://jeroenjanssens.com/img/_tSZD2DIFt-504.webp 504w, https://jeroenjanssens.com/img/_tSZD2DIFt-672.webp 672w, https://jeroenjanssens.com/img/_tSZD2DIFt-1008.webp 1008w, https://jeroenjanssens.com/img/_tSZD2DIFt-1344.webp 1344w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 1)" srcset="https://jeroenjanssens.com/img/_tSZD2DIFt-336.webp 336w, https://jeroenjanssens.com/img/_tSZD2DIFt-504.webp 504w, https://jeroenjanssens.com/img/_tSZD2DIFt-672.webp 672w, https://jeroenjanssens.com/img/_tSZD2DIFt-1008.webp 1008w, https://jeroenjanssens.com/img/_tSZD2DIFt-1344.webp 1344w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 1)" srcset="https://jeroenjanssens.com/img/_tSZD2DIFt-336.jpeg 336w, https://jeroenjanssens.com/img/_tSZD2DIFt-504.jpeg 504w, https://jeroenjanssens.com/img/_tSZD2DIFt-672.jpeg 672w, https://jeroenjanssens.com/img/_tSZD2DIFt-1008.jpeg 1008w, https://jeroenjanssens.com/img/_tSZD2DIFt-1344.jpeg 1344w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 1)" srcset="https://jeroenjanssens.com/img/_tSZD2DIFt-336.jpeg 336w, https://jeroenjanssens.com/img/_tSZD2DIFt-504.jpeg 504w, https://jeroenjanssens.com/img/_tSZD2DIFt-672.jpeg 672w, https://jeroenjanssens.com/img/_tSZD2DIFt-1008.jpeg 1008w, https://jeroenjanssens.com/img/_tSZD2DIFt-1344.jpeg 1344w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 100%;" src="https://jeroenjanssens.com/img/_tSZD2DIFt-336.jpeg" alt="I have used pocket hole joints (and glue). The holes will be filled with flush-cut dowels." loading="lazy" />
</picture></a>
<figcaption>I have used pocket hole joints (and glue). The holes will be filled with flush-cut dowels.</figcaption>
</figure>
<figure>
<a href="https://jeroenjanssens.com/img/legotable/IMG_6875.jpg">
<picture>
<source type="image/webp" sizes="calc(42rem * 1)" srcset="https://jeroenjanssens.com/img/MOAOTVgP1e-336.webp 336w, https://jeroenjanssens.com/img/MOAOTVgP1e-504.webp 504w, https://jeroenjanssens.com/img/MOAOTVgP1e-672.webp 672w, https://jeroenjanssens.com/img/MOAOTVgP1e-1008.webp 1008w, https://jeroenjanssens.com/img/MOAOTVgP1e-1344.webp 1344w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 1)" srcset="https://jeroenjanssens.com/img/MOAOTVgP1e-336.webp 336w, https://jeroenjanssens.com/img/MOAOTVgP1e-504.webp 504w, https://jeroenjanssens.com/img/MOAOTVgP1e-672.webp 672w, https://jeroenjanssens.com/img/MOAOTVgP1e-1008.webp 1008w, https://jeroenjanssens.com/img/MOAOTVgP1e-1344.webp 1344w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 1)" srcset="https://jeroenjanssens.com/img/MOAOTVgP1e-336.jpeg 336w, https://jeroenjanssens.com/img/MOAOTVgP1e-504.jpeg 504w, https://jeroenjanssens.com/img/MOAOTVgP1e-672.jpeg 672w, https://jeroenjanssens.com/img/MOAOTVgP1e-1008.jpeg 1008w, https://jeroenjanssens.com/img/MOAOTVgP1e-1344.jpeg 1344w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 1)" srcset="https://jeroenjanssens.com/img/MOAOTVgP1e-336.jpeg 336w, https://jeroenjanssens.com/img/MOAOTVgP1e-504.jpeg 504w, https://jeroenjanssens.com/img/MOAOTVgP1e-672.jpeg 672w, https://jeroenjanssens.com/img/MOAOTVgP1e-1008.jpeg 1008w, https://jeroenjanssens.com/img/MOAOTVgP1e-1344.jpeg 1344w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 100%;" src="https://jeroenjanssens.com/img/MOAOTVgP1e-336.jpeg" alt="This thing is huge!" loading="lazy" />
</picture></a>
<figcaption>This thing is huge!</figcaption>
</figure>
<figure>
<a href="https://jeroenjanssens.com/img/legotable/IMG_6926.jpg">
<picture>
<source type="image/webp" sizes="calc(42rem * 1)" srcset="https://jeroenjanssens.com/img/ZC94iutIJT-336.webp 336w, https://jeroenjanssens.com/img/ZC94iutIJT-504.webp 504w, https://jeroenjanssens.com/img/ZC94iutIJT-672.webp 672w, https://jeroenjanssens.com/img/ZC94iutIJT-1008.webp 1008w, https://jeroenjanssens.com/img/ZC94iutIJT-1344.webp 1344w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 1)" srcset="https://jeroenjanssens.com/img/ZC94iutIJT-336.webp 336w, https://jeroenjanssens.com/img/ZC94iutIJT-504.webp 504w, https://jeroenjanssens.com/img/ZC94iutIJT-672.webp 672w, https://jeroenjanssens.com/img/ZC94iutIJT-1008.webp 1008w, https://jeroenjanssens.com/img/ZC94iutIJT-1344.webp 1344w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 1)" srcset="https://jeroenjanssens.com/img/ZC94iutIJT-336.jpeg 336w, https://jeroenjanssens.com/img/ZC94iutIJT-504.jpeg 504w, https://jeroenjanssens.com/img/ZC94iutIJT-672.jpeg 672w, https://jeroenjanssens.com/img/ZC94iutIJT-1008.jpeg 1008w, https://jeroenjanssens.com/img/ZC94iutIJT-1344.jpeg 1344w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 1)" srcset="https://jeroenjanssens.com/img/ZC94iutIJT-336.jpeg 336w, https://jeroenjanssens.com/img/ZC94iutIJT-504.jpeg 504w, https://jeroenjanssens.com/img/ZC94iutIJT-672.jpeg 672w, https://jeroenjanssens.com/img/ZC94iutIJT-1008.jpeg 1008w, https://jeroenjanssens.com/img/ZC94iutIJT-1344.jpeg 1344w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 100%;" src="https://jeroenjanssens.com/img/ZC94iutIJT-336.jpeg" alt="Creating Lego insets" loading="lazy" />
</picture></a>
<figcaption>Creating Lego insets</figcaption>
</figure>
<figure>
<a href="https://jeroenjanssens.com/img/legotable/IMG_7292.jpg">
<picture>
<source type="image/webp" sizes="calc(42rem * 1)" srcset="https://jeroenjanssens.com/img/AFjGan74OY-336.webp 336w, https://jeroenjanssens.com/img/AFjGan74OY-504.webp 504w, https://jeroenjanssens.com/img/AFjGan74OY-672.webp 672w, https://jeroenjanssens.com/img/AFjGan74OY-1008.webp 1008w, https://jeroenjanssens.com/img/AFjGan74OY-1344.webp 1344w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 1)" srcset="https://jeroenjanssens.com/img/AFjGan74OY-336.webp 336w, https://jeroenjanssens.com/img/AFjGan74OY-504.webp 504w, https://jeroenjanssens.com/img/AFjGan74OY-672.webp 672w, https://jeroenjanssens.com/img/AFjGan74OY-1008.webp 1008w, https://jeroenjanssens.com/img/AFjGan74OY-1344.webp 1344w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 1)" srcset="https://jeroenjanssens.com/img/AFjGan74OY-336.jpeg 336w, https://jeroenjanssens.com/img/AFjGan74OY-504.jpeg 504w, https://jeroenjanssens.com/img/AFjGan74OY-672.jpeg 672w, https://jeroenjanssens.com/img/AFjGan74OY-1008.jpeg 1008w, https://jeroenjanssens.com/img/AFjGan74OY-1344.jpeg 1344w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 1)" srcset="https://jeroenjanssens.com/img/AFjGan74OY-336.jpeg 336w, https://jeroenjanssens.com/img/AFjGan74OY-504.jpeg 504w, https://jeroenjanssens.com/img/AFjGan74OY-672.jpeg 672w, https://jeroenjanssens.com/img/AFjGan74OY-1008.jpeg 1008w, https://jeroenjanssens.com/img/AFjGan74OY-1344.jpeg 1344w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 100%;" src="https://jeroenjanssens.com/img/AFjGan74OY-336.jpeg" alt="Can't have too many clamps" loading="lazy" />
</picture></a>
<figcaption>Can't have too many clamps</figcaption>
</figure>
<figure>
<a href="https://jeroenjanssens.com/img/legotable/IMG_7430.jpg">
<picture>
<source type="image/webp" sizes="calc(42rem * 1)" srcset="https://jeroenjanssens.com/img/HN3Essz61i-336.webp 336w, https://jeroenjanssens.com/img/HN3Essz61i-504.webp 504w, https://jeroenjanssens.com/img/HN3Essz61i-672.webp 672w, https://jeroenjanssens.com/img/HN3Essz61i-1008.webp 1008w, https://jeroenjanssens.com/img/HN3Essz61i-1344.webp 1344w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 1)" srcset="https://jeroenjanssens.com/img/HN3Essz61i-336.webp 336w, https://jeroenjanssens.com/img/HN3Essz61i-504.webp 504w, https://jeroenjanssens.com/img/HN3Essz61i-672.webp 672w, https://jeroenjanssens.com/img/HN3Essz61i-1008.webp 1008w, https://jeroenjanssens.com/img/HN3Essz61i-1344.webp 1344w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 1)" srcset="https://jeroenjanssens.com/img/HN3Essz61i-336.jpeg 336w, https://jeroenjanssens.com/img/HN3Essz61i-504.jpeg 504w, https://jeroenjanssens.com/img/HN3Essz61i-672.jpeg 672w, https://jeroenjanssens.com/img/HN3Essz61i-1008.jpeg 1008w, https://jeroenjanssens.com/img/HN3Essz61i-1344.jpeg 1344w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 1)" srcset="https://jeroenjanssens.com/img/HN3Essz61i-336.jpeg 336w, https://jeroenjanssens.com/img/HN3Essz61i-504.jpeg 504w, https://jeroenjanssens.com/img/HN3Essz61i-672.jpeg 672w, https://jeroenjanssens.com/img/HN3Essz61i-1008.jpeg 1008w, https://jeroenjanssens.com/img/HN3Essz61i-1344.jpeg 1344w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 100%;" src="https://jeroenjanssens.com/img/HN3Essz61i-336.jpeg" alt="Thunder!" loading="lazy" />
</picture></a>
<figcaption>Thunder!</figcaption>
</figure>
<figure>
<a href="https://jeroenjanssens.com/img/legotable/IMG_7473.jpg">
<picture>
<source type="image/webp" sizes="calc(42rem * 1)" srcset="https://jeroenjanssens.com/img/kvCdAt1UXz-336.webp 336w, https://jeroenjanssens.com/img/kvCdAt1UXz-504.webp 504w, https://jeroenjanssens.com/img/kvCdAt1UXz-672.webp 672w, https://jeroenjanssens.com/img/kvCdAt1UXz-1008.webp 1008w, https://jeroenjanssens.com/img/kvCdAt1UXz-1344.webp 1344w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 1)" srcset="https://jeroenjanssens.com/img/kvCdAt1UXz-336.webp 336w, https://jeroenjanssens.com/img/kvCdAt1UXz-504.webp 504w, https://jeroenjanssens.com/img/kvCdAt1UXz-672.webp 672w, https://jeroenjanssens.com/img/kvCdAt1UXz-1008.webp 1008w, https://jeroenjanssens.com/img/kvCdAt1UXz-1344.webp 1344w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 1)" srcset="https://jeroenjanssens.com/img/kvCdAt1UXz-336.jpeg 336w, https://jeroenjanssens.com/img/kvCdAt1UXz-504.jpeg 504w, https://jeroenjanssens.com/img/kvCdAt1UXz-672.jpeg 672w, https://jeroenjanssens.com/img/kvCdAt1UXz-1008.jpeg 1008w, https://jeroenjanssens.com/img/kvCdAt1UXz-1344.jpeg 1344w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 1)" srcset="https://jeroenjanssens.com/img/kvCdAt1UXz-336.jpeg 336w, https://jeroenjanssens.com/img/kvCdAt1UXz-504.jpeg 504w, https://jeroenjanssens.com/img/kvCdAt1UXz-672.jpeg 672w, https://jeroenjanssens.com/img/kvCdAt1UXz-1008.jpeg 1008w, https://jeroenjanssens.com/img/kvCdAt1UXz-1344.jpeg 1344w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 100%;" src="https://jeroenjanssens.com/img/kvCdAt1UXz-336.jpeg" alt="Creating a raised border around the six base plates" loading="lazy" />
</picture></a>
<figcaption>Creating a raised border around the six base plates</figcaption>
</figure>
<figure>
<a href="https://jeroenjanssens.com/img/legotable/IMG_9212.jpg">
<picture>
<source type="image/webp" sizes="calc(42rem * 1)" srcset="https://jeroenjanssens.com/img/1idELPQN9k-336.webp 336w, https://jeroenjanssens.com/img/1idELPQN9k-504.webp 504w, https://jeroenjanssens.com/img/1idELPQN9k-672.webp 672w, https://jeroenjanssens.com/img/1idELPQN9k-1008.webp 1008w, https://jeroenjanssens.com/img/1idELPQN9k-1344.webp 1344w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 1)" srcset="https://jeroenjanssens.com/img/1idELPQN9k-336.webp 336w, https://jeroenjanssens.com/img/1idELPQN9k-504.webp 504w, https://jeroenjanssens.com/img/1idELPQN9k-672.webp 672w, https://jeroenjanssens.com/img/1idELPQN9k-1008.webp 1008w, https://jeroenjanssens.com/img/1idELPQN9k-1344.webp 1344w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 1)" srcset="https://jeroenjanssens.com/img/1idELPQN9k-336.jpeg 336w, https://jeroenjanssens.com/img/1idELPQN9k-504.jpeg 504w, https://jeroenjanssens.com/img/1idELPQN9k-672.jpeg 672w, https://jeroenjanssens.com/img/1idELPQN9k-1008.jpeg 1008w, https://jeroenjanssens.com/img/1idELPQN9k-1344.jpeg 1344w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 1)" srcset="https://jeroenjanssens.com/img/1idELPQN9k-336.jpeg 336w, https://jeroenjanssens.com/img/1idELPQN9k-504.jpeg 504w, https://jeroenjanssens.com/img/1idELPQN9k-672.jpeg 672w, https://jeroenjanssens.com/img/1idELPQN9k-1008.jpeg 1008w, https://jeroenjanssens.com/img/1idELPQN9k-1344.jpeg 1344w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 100%;" src="https://jeroenjanssens.com/img/1idELPQN9k-336.jpeg" alt="I love rounded corners" loading="lazy" />
</picture></a>
<figcaption>I love rounded corners</figcaption>
</figure>
<figure>
<a href="https://jeroenjanssens.com/img/legotable/IMG_9740.jpg">
<picture>
<source type="image/webp" sizes="calc(42rem * 1)" srcset="https://jeroenjanssens.com/img/0TO21EkmwJ-336.webp 336w, https://jeroenjanssens.com/img/0TO21EkmwJ-504.webp 504w, https://jeroenjanssens.com/img/0TO21EkmwJ-672.webp 672w, https://jeroenjanssens.com/img/0TO21EkmwJ-1008.webp 1008w, https://jeroenjanssens.com/img/0TO21EkmwJ-1344.webp 1344w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 1)" srcset="https://jeroenjanssens.com/img/0TO21EkmwJ-336.webp 336w, https://jeroenjanssens.com/img/0TO21EkmwJ-504.webp 504w, https://jeroenjanssens.com/img/0TO21EkmwJ-672.webp 672w, https://jeroenjanssens.com/img/0TO21EkmwJ-1008.webp 1008w, https://jeroenjanssens.com/img/0TO21EkmwJ-1344.webp 1344w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 1)" srcset="https://jeroenjanssens.com/img/0TO21EkmwJ-336.jpeg 336w, https://jeroenjanssens.com/img/0TO21EkmwJ-504.jpeg 504w, https://jeroenjanssens.com/img/0TO21EkmwJ-672.jpeg 672w, https://jeroenjanssens.com/img/0TO21EkmwJ-1008.jpeg 1008w, https://jeroenjanssens.com/img/0TO21EkmwJ-1344.jpeg 1344w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 1)" srcset="https://jeroenjanssens.com/img/0TO21EkmwJ-336.jpeg 336w, https://jeroenjanssens.com/img/0TO21EkmwJ-504.jpeg 504w, https://jeroenjanssens.com/img/0TO21EkmwJ-672.jpeg 672w, https://jeroenjanssens.com/img/0TO21EkmwJ-1008.jpeg 1008w, https://jeroenjanssens.com/img/0TO21EkmwJ-1344.jpeg 1344w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 100%;" src="https://jeroenjanssens.com/img/0TO21EkmwJ-336.jpeg" alt="Glueing the top to the base" loading="lazy" />
</picture></a>
<figcaption>Glueing the top to the base</figcaption>
</figure>
<p>The next step is to spray paint the table. Stay tuned!</p>
<p>– Jeroen</p>
How to Scrape Multiple Pages in R and Rvest
2021-11-05T00:00:00Z
https://jeroenjanssens.com/scrape/
<p>There’s something exciting about scraping a website to build your own
dataset! For R, there’s the <a href="https://rvest.tidyverse.org/"><code>rvest</code>
package</a> to harvest (i.e., scrape) static
HTML.</p>
<p>When the HTML elements you’re interested in are spread across multiple
pages and you know the URLs of the pages up front (or you know how many
pages you need to visit and the URLs are predictable), you can most
likely use a for loop or one of the map functions from the <code>purrr</code>
package. For example, to get the <a href="https://stackoverflow.com/questions/tagged/r">Stack Overflow questions tagged
R</a> from the first three
pages, you could do:</p>
<pre class="language-r"><code class="language-r">library<span class="token punctuation">(</span>purrr<span class="token punctuation">)</span><br />library<span class="token punctuation">(</span>rvest<span class="token punctuation">)</span><br /><br /><span class="token punctuation">(</span>urls <span class="token operator"><-</span> stringr<span class="token operator">::</span>str_c<span class="token punctuation">(</span><span class="token string">"https://stackoverflow.com/questions/"</span><span class="token punctuation">,</span><br /> <span class="token string">"tagged/r?tab=votes&page="</span><span class="token punctuation">,</span> <span class="token number">1</span><span class="token operator">:</span><span class="token number">3</span><span class="token punctuation">)</span><span class="token punctuation">)</span><br /><span class="token comment">## [1] "https://stackoverflow.com/questions/tagged/r?tab=votes&page=1"</span><br /><span class="token comment">## [2] "https://stackoverflow.com/questions/tagged/r?tab=votes&page=2"</span><br /><span class="token comment">## [3] "https://stackoverflow.com/questions/tagged/r?tab=votes&page=3"</span><br /><br />map<span class="token punctuation">(</span>urls<span class="token punctuation">,</span><br /> <span class="token operator">~</span> read_html<span class="token punctuation">(</span>.<span class="token punctuation">)</span> <span class="token percent-operator operator">%>%</span><br /> html_elements<span class="token punctuation">(</span><span class="token string">"h3 > a.s-link"</span><span class="token punctuation">)</span> <span class="token percent-operator operator">%>%</span><br /> html_text<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token percent-operator operator">%>%</span><br /> flatten_chr<span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token percent-operator 
operator">%>%</span><br /> head<span class="token punctuation">(</span>n <span class="token operator">=</span> <span class="token number">10</span><span class="token punctuation">)</span><br /><span class="token comment">## [1] "How to make a great R reproducible example" </span><br /><span class="token comment">## [2] "How to join (merge) data frames (inner, outer, left, right)" </span><br /><span class="token comment">## [3] "Sort (order) data frame rows by multiple columns" </span><br /><span class="token comment">## [4] "Grouping functions (tapply, by, aggregate) and the *apply family" </span><br /><span class="token comment">## [5] "Remove rows with all or some NAs (missing values) in data.frame" </span><br /><span class="token comment">## [6] "Drop data frame columns by name" </span><br /><span class="token comment">## [7] "How do I replace NA values with zeros in an R dataframe?" </span><br /><span class="token comment">## [8] "What are the differences between \"=\" and \"<-\" assignment operators?" </span><br /><span class="token comment">## [9] "data.table vs dplyr: can one do something well the other can't or does poorly?"</span><br /><span class="token comment">## [10] "Rotating and spacing axis labels in ggplot2"</span></code></pre>
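<p>For comparison, the same idea written as a plain for loop might look
like the sketch below. The helper name <code>scrape_titles()</code> and its
<code>pause</code> argument are assumptions made for illustration; they are
not part of <code>rvest</code> or of the example above.</p>
<pre class="language-r"><code class="language-r">library(rvest)

# Hypothetical helper: visit each URL in turn and collect the text of
# the elements matched by `css`. The name and the `pause` argument are
# assumptions made for this sketch.
scrape_titles <- function(urls, css, pause = 1) {
  titles <- character()
  for (url in urls) {
    page_titles <- read_html(url) %>%
      html_elements(css) %>%
      html_text()
    titles <- c(titles, page_titles)
    Sys.sleep(pause)  # wait between requests to go easy on the server
  }
  titles
}

urls <- paste0("https://stackoverflow.com/questions/tagged/r?tab=votes&page=", 1:3)
# Not run here because it needs network access:
# titles <- scrape_titles(urls, "h3 > a.s-link")</code></pre>
<p>Calling <code>scrape_titles(urls, "h3 > a.s-link")</code> should give the
same titles as the <code>map()</code> pipeline, provided the site’s markup
still matches the selector.</p>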
<p>However, if you don’t necessarily know how many pages you need to visit
or the URLs are not easily generated up front, <em>but</em> there’s a link to
the next page, something like this function has served (or scraped) me
well:</p>
<pre class="language-r"><code class="language-r">html_more_elements <span class="token operator"><-</span> <span class="token keyword">function</span><span class="token punctuation">(</span>session<span class="token punctuation">,</span> css<span class="token punctuation">,</span> more_css<span class="token punctuation">)</span> <span class="token punctuation">{</span><br /> xml2<span class="token operator">::</span><span class="token operator">:</span>xml_nodeset<span class="token punctuation">(</span>c<span class="token punctuation">(</span><br /> html_elements<span class="token punctuation">(</span>session<span class="token punctuation">,</span> css<span class="token punctuation">)</span><span class="token punctuation">,</span><br /> tryCatch<span class="token punctuation">(</span><span class="token punctuation">{</span><br /> html_more_elements<span class="token punctuation">(</span>session_follow_link<span class="token punctuation">(</span>session<span class="token punctuation">,</span> css <span class="token operator">=</span> more_css<span class="token punctuation">)</span><span class="token punctuation">,</span><br /> css<span class="token punctuation">,</span> more_css<span class="token punctuation">)</span><br /> <span class="token punctuation">}</span><span class="token punctuation">,</span> error <span class="token operator">=</span> <span class="token keyword">function</span><span class="token punctuation">(</span>e<span class="token punctuation">)</span> <span class="token keyword">NULL</span><span class="token punctuation">)</span><br /> <span class="token punctuation">)</span><span class="token punctuation">)</span><br /><span class="token punctuation">}</span></code></pre>
<p>This R function uses several functions from the <code>rvest</code> package and
recursion to select HTML elements across multiple pages<sup class="footnote-ref"><a href="https://jeroenjanssens.com/scrape/#fn1" id="fnref1">[1]</a></sup>. It has
three arguments:</p>
<ol>
<li>A <code>session</code> object created by <code>rvest::session()</code></li>
<li>A CSS selector that identifies the elements you want to select from
each page</li>
<li>A CSS selector that identifies the link to the next page</li>
</ol>
<p>Note that this function stops only when there are no more links to
follow or when the server replies with an error.</p>
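<p>If an open-ended crawl is a concern, one option is to cap the number of
pages. The sketch below is a hypothetical variant of the function above:
the name <code>html_more_elements_max()</code> and the <code>max_pages</code>
argument are assumptions for illustration and not part of <code>rvest</code>.
Like the original, it relies on the unexported
<code>xml2:::xml_nodeset()</code>, which may change between versions of
<code>xml2</code>.</p>
<pre class="language-r"><code class="language-r">library(rvest)

# Hypothetical variant with a page cap. `max_pages` counts the pages
# visited in total, including the one the session is currently on.
html_more_elements_max <- function(session, css, more_css, max_pages = 5) {
  elements <- html_elements(session, css)

  # Stop recursing once the cap is reached
  if (max_pages <= 1) {
    return(elements)
  }

  # Follow the "next" link; a missing link or a server error ends the crawl
  more <- tryCatch(
    html_more_elements_max(
      session_follow_link(session, css = more_css),
      css, more_css, max_pages - 1
    ),
    error = function(e) NULL
  )

  xml2:::xml_nodeset(c(elements, more))
}</code></pre>
<p>With <code>max_pages = 1</code> only the current page is read, so the cap
also makes it easy to try out a selector before starting a longer crawl.</p>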
<p>Here’s an example that scrapes the names of all Lego Star Wars sets:</p>
<pre class="language-r"><code class="language-r">lego_sets <span class="token operator"><-</span><br /> session<span class="token punctuation">(</span><span class="token string">"https://www.lego.com/en-us/themes/star-wars"</span><span class="token punctuation">)</span> <span class="token percent-operator operator">%>%</span><br /> html_more_elements<span class="token punctuation">(</span><span class="token string">"li h2 > span"</span><span class="token punctuation">,</span> <span class="token string">"a[rel=next]"</span><span class="token punctuation">)</span> <span class="token percent-operator operator">%>%</span><br /> html_text<span class="token punctuation">(</span><span class="token punctuation">)</span><br /><span class="token comment">## Navigating to /en-us/themes/star-wars?page=2</span><br /><span class="token comment">## Navigating to /en-us/themes/star-wars?page=3</span><br /><span class="token comment">## Navigating to /en-us/themes/star-wars?page=4</span><br /><span class="token comment">## Navigating to /en-us/themes/star-wars?page=5</span><br /><span class="token comment">## Navigating to /en-us/themes/star-wars?page=6</span><br /><br />length<span class="token punctuation">(</span>lego_sets<span class="token punctuation">)</span><br /><span class="token comment">## [1] 93</span><br /><br />head<span class="token punctuation">(</span>lego_sets<span class="token punctuation">,</span> n <span class="token operator">=</span> <span class="token number">10</span><span class="token punctuation">)</span><br /><span class="token comment">## [1] "The Razor Crest™" "AT-TE™ Walker" </span><br /><span class="token comment">## [3] "AT-AT™" "LEGO® Star Wars™ Advent Calendar" </span><br /><span class="token comment">## [5] "Clone Trooper™ Command Station" "Millennium Falcon™" </span><br /><span class="token comment">## [7] "Republic Fighter Tank™" "R2-D2™" </span><br /><span class="token comment">## [9] "Republic Gunship™" "The Mandalorian's N-1 
Starfighter™"</span></code></pre>
<p>Here’s another example that selects all the titles from <a href="https://news.ycombinator.com/">Hacker
News</a> and shows the first 10:</p>
<pre class="language-r"><code class="language-r">session<span class="token punctuation">(</span><span class="token string">"https://news.ycombinator.com"</span><span class="token punctuation">)</span> <span class="token percent-operator operator">%>%</span><br /> html_more_elements<span class="token punctuation">(</span><span class="token string">".titleline"</span><span class="token punctuation">,</span> <span class="token string">".morelink"</span><span class="token punctuation">)</span> <span class="token percent-operator operator">%>%</span><br /> html_text<span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token percent-operator operator">%>%</span><br /> head<span class="token punctuation">(</span>n <span class="token operator">=</span> <span class="token number">10</span><span class="token punctuation">)</span><br /><span class="token comment">## Navigating to news?p=2</span><br /><span class="token comment">## Navigating to news?p=3</span><br /><span class="token comment">## Navigating to news?p=4</span><br /><span class="token comment">## Navigating to news?p=5</span><br /><span class="token comment">## Warning in session_set_response(x, resp): Service Unavailable (HTTP 503).</span><br /><span class="token comment">## [1] "WikiLeaks is struggling to stay online as millions of documents disappear (dailydot.com)" </span><br /><span class="token comment">## [2] "Japanese have been producing wood for 700 years without cutting down trees (dsfantiquejewelry.com)"</span><br /><span class="token comment">## [3] "Someone has to say it: Voice assistants are not doing it for big tech (theregister.com)" </span><br /><span class="token comment">## [4] "Help seed Z-Library on IPFS (annas-blog.org)" </span><br /><span class="token comment">## [5] "The Carcinization of Go Programs (xeiaso.net)" </span><br /><span class="token comment">## [6] "The miracle of Smalltalk’s become: (2009) (gbracha.blogspot.com)" </span><br /><span 
class="token comment">## [7] "Building the fastest Lua interpreter automatically (sillycross.github.io)" </span><br /><span class="token comment">## [8] "Heterogeneous-Memory Storage Engine (hse-project.github.io)" </span><br /><span class="token comment">## [9] "Safely writing code that isn't thread-safe (cliffle.com)" </span><br /><span class="token comment">## [10] "A history of ARM, part 2: Everything starts to come together (arstechnica.com)"</span></code></pre>
<p>Note that I’m getting a
<a href="https://en.wikipedia.org/wiki/List_of_HTTP_status_codes">503</a> after a
couple of pages. That’s probably because I’m making too many requests in
too little time. Adding some delay to the function (with, e.g.,
<code>Sys.sleep(1)</code>) would solve this. Remember, always Scrape
Responsibly™.</p>
<p>— Jeroen</p>
<hr class="footnotes-sep" />
<section class="footnotes">
<ol class="footnotes-list">
<li id="fn1" class="footnote-item"><p>There’s <a href="https://github.com/tidyverse/rvest/issues/193">a discussion on
GitHub</a> about whether
it makes sense to add this functionality to <code>rvest</code>. <a href="https://jeroenjanssens.com/scrape/#fnref1" class="footnote-backref">↩︎</a></p>
</li>
</ol>
</section>
Heuristics for Translating Ggplot2 Code to Plotnine Code2019-12-13T00:00:00Zhttps://jeroenjanssens.com/heuristics/<p>Because ggplot2 is the de-facto package for creating high-quality data
visualizations in R, and has been for a long time, there exist many
excellent resources for learning ggplot2, including:</p>
<ul>
<li>the <a href="https://ggplot2.tidyverse.org/">ggplot2 website</a>,</li>
<li>a two-page <a href="https://posit.co/wp-content/uploads/2022/10/data-visualization-1.pdf">cheat
sheet</a>
(PDF),</li>
<li><a href="https://stackoverflow.com/questions/tagged/ggplot2?sort=MostVotes">Stack
Overflow</a>,
and</li>
<li>books such as <a href="https://www.amazon.com/ggplot2-Elegant-Graphics-Data-Analysis/dp/331924275X/ref=as_li_ss_tl?ie=UTF8&linkCode=sl1&tag=ggplot2-20">ggplot2: Elegant Graphics for Data
Analysis</a>
and <a href="https://www.amazon.com/dp/1491978600/">R Graphics Cookbook: Practical Recipes for Visualizing
Data</a>.</li>
</ul>
<p><a href="https://posit.co/wp-content/uploads/2022/10/data-visualization-1.pdf"><figure>
<a href="https://jeroenjanssens.com/img/data-visualization-cheatsheet-thumbs.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/eZxE2X6r28-302.webp 302w, https://jeroenjanssens.com/img/eZxE2X6r28-453.webp 453w, https://jeroenjanssens.com/img/eZxE2X6r28-604.webp 604w, https://jeroenjanssens.com/img/eZxE2X6r28-907.webp 907w, https://jeroenjanssens.com/img/eZxE2X6r28-1209.webp 1209w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/eZxE2X6r28-302.webp 302w, https://jeroenjanssens.com/img/eZxE2X6r28-453.webp 453w, https://jeroenjanssens.com/img/eZxE2X6r28-604.webp 604w, https://jeroenjanssens.com/img/eZxE2X6r28-907.webp 907w, https://jeroenjanssens.com/img/eZxE2X6r28-1209.webp 1209w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/eZxE2X6r28-302.jpeg 302w, https://jeroenjanssens.com/img/eZxE2X6r28-453.jpeg 453w, https://jeroenjanssens.com/img/eZxE2X6r28-604.jpeg 604w, https://jeroenjanssens.com/img/eZxE2X6r28-907.jpeg 907w, https://jeroenjanssens.com/img/eZxE2X6r28-1209.jpeg 1209w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/eZxE2X6r28-302.jpeg 302w, https://jeroenjanssens.com/img/eZxE2X6r28-453.jpeg 453w, https://jeroenjanssens.com/img/eZxE2X6r28-604.jpeg 604w, https://jeroenjanssens.com/img/eZxE2X6r28-907.jpeg 907w, https://jeroenjanssens.com/img/eZxE2X6r28-1209.jpeg 1209w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 90%;" src="https://jeroenjanssens.com/img/eZxE2X6r28-302.jpeg" alt="A ggplot2 cheat sheet (PDF)" loading="lazy" />
</picture></a>
<figcaption>A ggplot2 cheat sheet (PDF)</figcaption>
</figure></a></p>
<p>Two days ago, I published the tutorial <a href="https://jeroenjanssens.com/plotnine/">Plotnine: Grammar of Graphics
for Python</a>, which is a translation of the visualization
chapters from “R for Data Science” to Python using plotnine and pandas.
plotnine code is bound to be different from ggplot2 code, due to Python
and R having different syntax and mechanics. Moreover, since plotnine is
still young (but actively being developed) some features are not yet
implemented.</p>
<p>Does that mean we cannot make use of the above-mentioned resources? Of
course not! First of all, the underlying grammar of graphics is still
the same. Secondly, when it comes to the syntax, you can easily
translate 95% of ggplot2 code to plotnine code if you take into account
the heuristics listed below. But first, an example.</p>
<h2>An example</h2>
<p>This R and <code>ggplot2</code> code:</p>
<pre class="language-r"><code class="language-r">library<span class="token punctuation">(</span>ggplot2<span class="token punctuation">)</span><br /><br />ggplot<span class="token punctuation">(</span>mpg<span class="token punctuation">,</span> aes<span class="token punctuation">(</span>displ<span class="token punctuation">,</span> hwy<span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token operator">+</span><br /> geom_point<span class="token punctuation">(</span>aes<span class="token punctuation">(</span>colour <span class="token operator">=</span> class<span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token operator">+</span><br /> geom_smooth<span class="token punctuation">(</span>se <span class="token operator">=</span> <span class="token boolean">FALSE</span><span class="token punctuation">,</span> method <span class="token operator">=</span> <span class="token string">"lm"</span><span class="token punctuation">)</span> <span class="token operator">+</span><br /> guides<span class="token punctuation">(</span>colour <span class="token operator">=</span> guide_legend<span class="token punctuation">(</span>override.aes <span class="token operator">=</span> list<span class="token punctuation">(</span>size <span class="token operator">=</span> <span class="token number">4</span><span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">)</span></code></pre>
<figure>
<a href="https://jeroenjanssens.com/img/heuristics_files/figure-commonmark/heuristics-ggplot2-1.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/heuristics_files/N32VRTTo-d-302.webp 302w, https://jeroenjanssens.com/img/heuristics_files/N32VRTTo-d-453.webp 453w, https://jeroenjanssens.com/img/heuristics_files/N32VRTTo-d-604.webp 604w, https://jeroenjanssens.com/img/heuristics_files/N32VRTTo-d-907.webp 907w, https://jeroenjanssens.com/img/heuristics_files/N32VRTTo-d-1209.webp 1209w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/heuristics_files/N32VRTTo-d-302.webp 302w, https://jeroenjanssens.com/img/heuristics_files/N32VRTTo-d-453.webp 453w, https://jeroenjanssens.com/img/heuristics_files/N32VRTTo-d-604.webp 604w, https://jeroenjanssens.com/img/heuristics_files/N32VRTTo-d-907.webp 907w, https://jeroenjanssens.com/img/heuristics_files/N32VRTTo-d-1209.webp 1209w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/heuristics_files/N32VRTTo-d-302.jpeg 302w, https://jeroenjanssens.com/img/heuristics_files/N32VRTTo-d-453.jpeg 453w, https://jeroenjanssens.com/img/heuristics_files/N32VRTTo-d-604.jpeg 604w, https://jeroenjanssens.com/img/heuristics_files/N32VRTTo-d-907.jpeg 907w, https://jeroenjanssens.com/img/heuristics_files/N32VRTTo-d-1209.jpeg 1209w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/heuristics_files/N32VRTTo-d-302.jpeg 302w, https://jeroenjanssens.com/img/heuristics_files/N32VRTTo-d-453.jpeg 453w, https://jeroenjanssens.com/img/heuristics_files/N32VRTTo-d-604.jpeg 604w, https://jeroenjanssens.com/img/heuristics_files/N32VRTTo-d-907.jpeg 907w, https://jeroenjanssens.com/img/heuristics_files/N32VRTTo-d-1209.jpeg 1209w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 90%;" src="https://jeroenjanssens.com/img/heuristics_files/N32VRTTo-d-302.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
<p>Can be translated into the following Python and <code>plotnine</code> code:</p>
<pre class="language-python"><code class="language-python"><span class="token keyword">from</span> plotnine <span class="token keyword">import</span> <span class="token operator">*</span><br /><span class="token keyword">from</span> plotnine<span class="token punctuation">.</span>data <span class="token keyword">import</span> mpg<br /><br />ggplot<span class="token punctuation">(</span>mpg<span class="token punctuation">,</span> aes<span class="token punctuation">(</span><span class="token string">"displ"</span><span class="token punctuation">,</span> <span class="token string">"hwy"</span><span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token operator">+</span>\<br />geom_point<span class="token punctuation">(</span>aes<span class="token punctuation">(</span>colour<span class="token operator">=</span><span class="token string">"class"</span><span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token operator">+</span>\<br />geom_smooth<span class="token punctuation">(</span>se<span class="token operator">=</span><span class="token boolean">False</span><span class="token punctuation">,</span> method<span class="token operator">=</span><span class="token string">"lm"</span><span class="token punctuation">)</span> <span class="token operator">+</span>\<br />guides<span class="token punctuation">(</span>colour<span class="token operator">=</span>guide_legend<span class="token punctuation">(</span>override_aes<span class="token operator">=</span><span class="token punctuation">{</span><span class="token string">"size"</span><span class="token punctuation">:</span> <span class="token number">4</span><span class="token punctuation">}</span><span class="token punctuation">)</span><span class="token punctuation">)</span></code></pre>
<figure>
<a href="https://jeroenjanssens.com/img/heuristics-plotnine.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/Ahldq2NXXf-302.webp 302w, https://jeroenjanssens.com/img/Ahldq2NXXf-453.webp 453w, https://jeroenjanssens.com/img/Ahldq2NXXf-604.webp 604w, https://jeroenjanssens.com/img/Ahldq2NXXf-907.webp 907w, https://jeroenjanssens.com/img/Ahldq2NXXf-1209.webp 1209w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/Ahldq2NXXf-302.webp 302w, https://jeroenjanssens.com/img/Ahldq2NXXf-453.webp 453w, https://jeroenjanssens.com/img/Ahldq2NXXf-604.webp 604w, https://jeroenjanssens.com/img/Ahldq2NXXf-907.webp 907w, https://jeroenjanssens.com/img/Ahldq2NXXf-1209.webp 1209w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/Ahldq2NXXf-302.jpeg 302w, https://jeroenjanssens.com/img/Ahldq2NXXf-453.jpeg 453w, https://jeroenjanssens.com/img/Ahldq2NXXf-604.jpeg 604w, https://jeroenjanssens.com/img/Ahldq2NXXf-907.jpeg 907w, https://jeroenjanssens.com/img/Ahldq2NXXf-1209.jpeg 1209w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/Ahldq2NXXf-302.jpeg 302w, https://jeroenjanssens.com/img/Ahldq2NXXf-453.jpeg 453w, https://jeroenjanssens.com/img/Ahldq2NXXf-604.jpeg 604w, https://jeroenjanssens.com/img/Ahldq2NXXf-907.jpeg 907w, https://jeroenjanssens.com/img/Ahldq2NXXf-1209.jpeg 1209w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 90%;" src="https://jeroenjanssens.com/img/Ahldq2NXXf-302.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
<h2>Simple replacements</h2>
<ul>
<li>Change boolean values, i.e., replace <code>TRUE</code> with <code>True</code> and <code>FALSE</code>
with <code>False</code>.</li>
<li>Replace <code>NULL</code> with <code>None</code>.</li>
<li>Quote all column names, e.g., replace <code>Species</code> with <code>"Species"</code>.
Python unfortunately doesn’t have this thing called non-standard
evaluation.</li>
<li>Remove spaces around equal signs, e.g., replace <code>mapping = aes(...)</code>
with <code>mapping=aes(...)</code>. Style is important.</li>
<li>Replace the assignment operator, i.e., <code><-</code> with <code>=</code>.</li>
<li>Replace dots with underscores, e.g., replace <code>show.legend</code> with
<code>show_legend</code>. In Python, names cannot contain dots.</li>
<li>Replace <code>hjust</code> and <code>vjust</code> with <code>ha</code> and <code>va</code>, respectively. This is
inherited from matplotlib, which is used under the hood by plotnine.</li>
<li>If the code consists of multiple lines, add a continuation character,
i.e., replace <code>+</code> with <code>+\</code>. Alternatively, wrap the entire expression
in parentheses.</li>
</ul>
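<p>The mechanical replacements above can be sketched as a small Python helper. This is only an illustration: the function name <code>translate_simple()</code> and its regular expressions are assumptions of mine, not part of plotnine. It does not protect string literals and it cannot quote column names for you, so the result still needs a manual pass.</p>

```python
import re

# Hypothetical helper (not part of plotnine): applies only the purely
# mechanical replacements from the list above to a string of ggplot2 code.
_RULES = [
    (r"\bTRUE\b", "True"),                  # boolean values
    (r"\bFALSE\b", "False"),
    (r"\bNULL\b", "None"),
    (r"\bhjust\b", "ha"),                   # names inherited from matplotlib
    (r"\bvjust\b", "va"),
    (r"(?<=[A-Za-z])\.(?=[A-Za-z])", "_"),  # show.legend -> show_legend
    (r"(?<![=<>!])\s*=\s*(?!=)", "="),      # no spaces around a single `=`
    (r"\s*<-\s*", " = "),                   # assignment operator
]


def translate_simple(r_code):
    """Return r_code with the simple ggplot2-to-plotnine replacements applied."""
    for pattern, replacement in _RULES:
        r_code = re.sub(pattern, replacement, r_code)
    # Add a continuation character after a `+` that ends a line.
    return re.sub(r"\+[ \t]*\n", lambda match: "+\\\n", r_code)


r_code = 'ggplot(mpg, aes(displ, hwy)) +\n  geom_smooth(se = FALSE, show.legend = TRUE)'
print(translate_simple(r_code))
```

<p>The column names <code>displ</code> and <code>hwy</code> are left unquoted here on purpose: telling a column name apart from any other name requires knowing the data, which a handful of regular expressions cannot do.</p>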
<h2>Miscellaneous</h2>
<ul>
<li>
<p>Quote inline expressions in their entirety, such as <code>"factor(col)"</code> and
<code>"col < 5"</code>.</p>
</li>
<li>
<p>Quote the facet specification in its entirety, such as
<code>facet_wrap("~ class")</code> and <code>facet_grid("drv ~ cyl")</code>.</p>
</li>
<li>
<p>To suppress labels you cannot use <code>labels=None</code> but you need to pass a
list with as many empty strings as there are values. A helper function
is useful here:</p>
<pre class="language-python"><code class="language-python"><span class="token keyword">def</span> <span class="token function">no_labels</span><span class="token punctuation">(</span>values<span class="token punctuation">)</span><span class="token punctuation">:</span><br /> <span class="token keyword">return</span> <span class="token punctuation">[</span><span class="token string">""</span><span class="token punctuation">]</span> <span class="token operator">*</span> <span class="token builtin">len</span><span class="token punctuation">(</span>values<span class="token punctuation">)</span></code></pre>
</li>
<li>
<p>To prevent text labels from overlapping in ggplot2, you would use
<code>geom_text_repel</code> or <code>geom_label_repel</code> functions from the ggrepel
package. In plotnine, you simply use <code>geom_text</code> or <code>geom_label</code> and
specify the <code>adjust_text</code> argument. For example:
<code>geom_label(adjust_text={'expand_points': (1.5, 1.5), 'arrowprops': {'arrowstyle': '-'}})</code>.</p>
</li>
</ul>
<h2>Features not yet implemented</h2>
<ul>
<li>Unlike with ggplot2, in plotnine you cannot assign literal values to
your aesthetics; all values need to refer to column names. For example,
<code>aes(color="blue")</code> results in an error if <code>blue</code> is not a column in
the <code>DataFrame</code>.</li>
<li>plotnine is currently missing the following functions:
<code>coord_quickmap()</code> and <code>coord_polar()</code>.</li>
<li>The function <code>labs()</code> does not support a subtitle or a caption.</li>
</ul>
<p>Let me know if you think anything can be added to (or removed from!)
this list of heuristics. Now go plot!</p>
<p>— Jeroen</p>
Plotnine: Grammar of Graphics for Python2019-12-11T00:00:00Zhttps://jeroenjanssens.com/plotnine/<p><a href="https://github.com/has2k1/plotnine">Plotnine</a> is a data visualisation
package for Python based on the grammar of graphics, created by Hassan
Kibirige. Its API is similar to
<a href="https://ggplot2.tidyverse.org/">ggplot2</a>, a widely successful R package
by <a href="https://ggplot2.tidyverse.org/authors.html">Hadley Wickham and
others</a>.<sup class="footnote-ref"><a href="https://jeroenjanssens.com/plotnine/#fn1" id="fnref1">[1]</a></sup></p>
<p>I’m a staunch proponent of ggplot2. The underlying grammar of graphics
is accompanied by a consistent API that allows you to quickly and
iteratively create different types of beautiful data visualisations
while rarely having to consult the documentation. A welcome set of
properties when doing exploratory data analysis.</p>
<p>I must admit that I haven’t tried every data visualisation package there
is for Python, but when it comes to the most popular ones, I personally
find them either convenient but limited
(<a href="https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html">pandas</a>),
flexible but complicated (<a href="https://matplotlib.org/">matplotlib</a>), or
beautiful but inconsistent (<a href="https://seaborn.pydata.org/">seaborn</a>).
Your mileage may vary. plotnine, on the other hand, shows a lot of
promise. I estimate it currently has a 95% coverage of ggplot2’s
functionality, and it’s still actively being developed. All in all, as
someone who uses both R and Python, I’m very pleased to be able to
transfer my ggplot2 knowledge to the Python ecosystem.</p>
<p>I figured that plotnine could use a good tutorial so that perhaps more
Pythonistas would give this package a shot. Instead of writing one from
scratch, I turned to the, in my opinion, best free tutorial for ggplot2:
<a href="https://r4ds.had.co.nz/">R for Data Science</a> by Hadley Wickham and
Garrett Grolemund, published by O’Reilly Media in 2016.</p>
<figure>
<a href="https://jeroenjanssens.com/img/plotnine-grammar-of-graphics-for-python/plotnine-cover.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.4)" srcset="https://jeroenjanssens.com/img/683EZlgjyT-134.webp 134w, https://jeroenjanssens.com/img/683EZlgjyT-201.webp 201w, https://jeroenjanssens.com/img/683EZlgjyT-268.webp 268w, https://jeroenjanssens.com/img/683EZlgjyT-403.webp 403w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.4)" srcset="https://jeroenjanssens.com/img/683EZlgjyT-134.webp 134w, https://jeroenjanssens.com/img/683EZlgjyT-201.webp 201w, https://jeroenjanssens.com/img/683EZlgjyT-268.webp 268w, https://jeroenjanssens.com/img/683EZlgjyT-403.webp 403w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.4)" srcset="https://jeroenjanssens.com/img/683EZlgjyT-134.jpeg 134w, https://jeroenjanssens.com/img/683EZlgjyT-201.jpeg 201w, https://jeroenjanssens.com/img/683EZlgjyT-268.jpeg 268w, https://jeroenjanssens.com/img/683EZlgjyT-403.jpeg 403w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.4)" srcset="https://jeroenjanssens.com/img/683EZlgjyT-134.jpeg 134w, https://jeroenjanssens.com/img/683EZlgjyT-201.jpeg 201w, https://jeroenjanssens.com/img/683EZlgjyT-268.jpeg 268w, https://jeroenjanssens.com/img/683EZlgjyT-403.jpeg 403w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 40%;" src="https://jeroenjanssens.com/img/683EZlgjyT-134.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
<p>All I had to do was translate<sup class="footnote-ref"><a href="https://jeroenjanssens.com/plotnine/#fn2" id="fnref2">[2]</a></sup> the visualization chapters (chapter 3
and 28) from R and ggplot2 to Python and plotnine. I would like to thank
Hadley, Garrett, and O’Reilly Media, for granting me permission to do
so. Translating an existing text is quicker than writing a new one, and
has the benefit that it becomes possible to compare both the syntax and
coverage of plotnine to ggplot2.</p>
<p>However, while quicker, translating is not always straightforward. I
have tried to change as little as possible to the original text while
making sure that the text and the code are still in sync. In case any
errors or falsehoods have been introduced due to translation, then I’m
the one to blame. For example, to the best of my knowledge, neither
author has made any claims about plotnine. If you find such an error
and think it is fixable, then it would be greatly appreciated if you’d
let me know by <a href="https://github.com/datascienceworkshops/r4ds-python-plotnine/issues">creating an issue on
GitHub</a>.
Thank you. The section numbers in this tutorial link back to the
corresponding section of the original text, in case you want to compare
them.<sup class="footnote-ref"><a href="https://jeroenjanssens.com/plotnine/#fn3" id="fnref3">[3]</a></sup> Only this preface and the few footnotes scattered among the
text are entirely mine.</p>
<p>This tutorial is also available as a <a href="https://github.com/datascienceworkshops/r4ds-python-plotnine/blob/master/output/r4ds-python-plotnine.ipynb">Jupyter
notebook</a>
and an <a href="https://github.com/datascienceworkshops/r4ds-python-plotnine/blob/master/output/r4ds-python-plotnine.Rmd">R
notebook</a>
in case you want to follow along. If you clone the <a href="https://github.com/datascienceworkshops/r4ds-python-plotnine">GitHub
repository</a>
then you can find the notebooks in the <code>output</code> directory. The
<a href="https://github.com/datascienceworkshops/r4ds-python-plotnine/blob/master/README.md">README</a>
contains instructions on how to run the notebooks. The Jupyter notebook
is also available on
<a href="https://mybinder.org/v2/gh/datascienceworkshops/r4ds-python-plotnine/master?filepath=output%2Fr4ds-python-plotnine.ipynb">Binder</a>,
but keep in mind that the interactive version may take a while to
launch.</p>
<p>Without further ado, let’s start learning about plotnine!</p>
<p>— Jeroen</p>
<h1><a href="https://r4ds.had.co.nz/data-visualisation.html">3</a> Data visualisation</h1>
<h2><a href="https://r4ds.had.co.nz/data-visualisation.html#introduction-1">3.1</a> Introduction</h2>
<blockquote>
<p>“The simple graph has brought more information to the data analyst’s
mind than any other device.” — John Tukey</p>
</blockquote>
<p>This tutorial will teach you how to visualise your data using plotnine.
Python has many packages for making graphs, but plotnine is one of the
most elegant and most versatile. plotnine implements the <strong>grammar of
graphics</strong>, a coherent system for describing and building graphs. With
plotnine, you can do more, faster, by learning one system and applying it
in many places.</p>
<p>If you’d like to learn more about the theoretical underpinnings of
plotnine before you start, I’d recommend reading <a href="http://vita.had.co.nz/papers/layered-grammar.pdf">The Layered Grammar of
Graphics</a>.</p>
<h3><a href="https://r4ds.had.co.nz/data-visualisation.html#prerequisites-1">3.1.1</a> Prerequisites</h3>
<p>This tutorial focusses on plotnine. We’ll also use a little numpy and
pandas for data manipulation. To access the datasets, help pages, and
functions that we will use in this tutorial, import<sup class="footnote-ref"><a href="https://jeroenjanssens.com/plotnine/#fn4" id="fnref4">[4]</a></sup> the necessary
packages by running this code:</p>
<pre class="language-python"><code class="language-python"><span class="token keyword">from</span> plotnine <span class="token keyword">import</span> <span class="token operator">*</span><br /><span class="token keyword">from</span> plotnine<span class="token punctuation">.</span>data <span class="token keyword">import</span> <span class="token operator">*</span><br /><br /><span class="token keyword">import</span> numpy <span class="token keyword">as</span> np<br /><span class="token keyword">import</span> pandas <span class="token keyword">as</span> pd</code></pre>
<p>If you run this code and get the error message
<code>ModuleNotFoundError: No module named 'plotnine'</code>, you’ll need to first
install it<sup class="footnote-ref"><a href="https://jeroenjanssens.com/plotnine/#fn5" id="fnref5">[5]</a></sup>, then run the code once again.</p>
<pre class="language-shell-session"><code class="language-shell-session"><span class="token output">! pip install plotnine[all]</span></code></pre>
<p>You only need to install a package once, but you need to import it every
time you run your script or (re)start the kernel.</p>
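<p>If you would rather check whether a package is installed before importing it, the standard library can look it up without raising a <code>ModuleNotFoundError</code>. A small sketch; the helper name <code>is_installed()</code> is hypothetical and not part of plotnine:</p>

```python
from importlib.util import find_spec


def is_installed(package_name):
    """Return True if a top-level package can be found in this environment."""
    return find_spec(package_name) is not None


for name in ["plotnine", "numpy", "pandas"]:
    status = "installed" if is_installed(name) else "missing, install it with pip"
    print(f"{name}: {status}")
```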
<h2><a href="https://r4ds.had.co.nz/data-visualisation.html#first-steps">3.2</a> First steps</h2>
<p>Let’s use our first graph to answer a question: Do cars with big engines
use more fuel than cars with small engines? You probably already have an
answer, but try to make your answer precise. What does the relationship
between engine size and fuel efficiency look like? Is it positive?
Negative? Linear? Nonlinear?</p>
<h3><a href="https://r4ds.had.co.nz/data-visualisation.html#the-mpg-data-frame">3.2.1</a> The <code>mpg</code> DataFrame</h3>
<p>You can test your answer with the <code>mpg</code> DataFrame found in
<code>plotnine.data</code>. A DataFrame is a rectangular collection of variables
(in the columns) and observations (in the rows). <code>mpg</code> contains
observations collected by the US Environmental Protection Agency on 38
models of car.</p>
<pre class="language-python"><code class="language-python">mpg</code></pre>
<pre class="language-text"><code class="language-text"> manufacturer model displ year cyl trans drv cty hwy fl class<br />0 audi a4 1.8 1999 4 auto(l5) f 18 29 p compact<br />1 audi a4 1.8 1999 4 manual(m5) f 21 29 p compact<br />2 audi a4 2.0 2008 4 manual(m6) f 20 31 p compact<br />3 audi a4 2.0 2008 4 auto(av) f 21 30 p compact<br />4 audi a4 2.8 1999 6 auto(l5) f 16 26 p compact<br />.. ... ... ... ... ... ... .. ... ... .. ...<br />229 volkswagen passat 2.0 2008 4 auto(s6) f 19 28 p midsize<br />230 volkswagen passat 2.0 2008 4 manual(m6) f 21 29 p midsize<br />231 volkswagen passat 2.8 1999 6 auto(l5) f 16 26 p midsize<br />232 volkswagen passat 2.8 1999 6 manual(m5) f 18 26 p midsize<br />233 volkswagen passat 3.6 2008 6 auto(s6) f 17 26 p midsize<br /><br />[234 rows x 11 columns]</code></pre>
<p>Among the variables in <code>mpg</code> are:</p>
<ol>
<li>
<p><code>displ</code>, a car’s engine size, in litres.</p>
</li>
<li>
<p><code>hwy</code>, a car’s fuel efficiency on the highway, in miles per gallon
(mpg). A car with a low fuel efficiency consumes more fuel than a
car with a high fuel efficiency when they travel the same distance.</p>
</li>
</ol>
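<p>To make the second point concrete: the fuel a car burns is the distance travelled divided by its fuel efficiency. A tiny sketch with made-up numbers (the function name and the values are illustrative and not taken from <code>mpg</code>):</p>

```python
def gallons_used(distance_miles, miles_per_gallon):
    """Fuel consumed over a trip: distance divided by fuel efficiency."""
    return distance_miles / miles_per_gallon


# Two hypothetical cars driving the same 120 miles.
print(gallons_used(120, 20))  # low fuel efficiency: 6.0 gallons
print(gallons_used(120, 30))  # high fuel efficiency: 4.0 gallons
```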
<p>To learn more about <code>mpg</code>, open its help page by running <code>?mpg</code>.</p>
<h3><a href="https://r4ds.had.co.nz/data-visualisation.html#creating-a-ggplot">3.2.2</a> Creating a ggplot</h3>
<p>To plot <code>mpg</code>, run this code<sup class="footnote-ref"><a href="https://jeroenjanssens.com/plotnine/#fn6" id="fnref6">[6]</a></sup> to put <code>displ</code> on the x-axis and <code>hwy</code>
on the y-axis:</p>
<pre class="language-python"><code class="language-python">ggplot<span class="token punctuation">(</span>data<span class="token operator">=</span>mpg<span class="token punctuation">)</span> <span class="token operator">+</span>\<br />geom_point<span class="token punctuation">(</span>mapping<span class="token operator">=</span>aes<span class="token punctuation">(</span>x<span class="token operator">=</span><span class="token string">"displ"</span><span class="token punctuation">,</span> y<span class="token operator">=</span><span class="token string">"hwy"</span><span class="token punctuation">)</span><span class="token punctuation">)</span></code></pre>
<figure>
<a href="https://jeroenjanssens.com/img/plotnine-grammar-of-graphics-for-python/plotnine-unnamed-chunk-3-1.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/EauI05WqZD-302.webp 302w, https://jeroenjanssens.com/img/EauI05WqZD-453.webp 453w, https://jeroenjanssens.com/img/EauI05WqZD-604.webp 604w, https://jeroenjanssens.com/img/EauI05WqZD-907.webp 907w, https://jeroenjanssens.com/img/EauI05WqZD-1209.webp 1209w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/EauI05WqZD-302.webp 302w, https://jeroenjanssens.com/img/EauI05WqZD-453.webp 453w, https://jeroenjanssens.com/img/EauI05WqZD-604.webp 604w, https://jeroenjanssens.com/img/EauI05WqZD-907.webp 907w, https://jeroenjanssens.com/img/EauI05WqZD-1209.webp 1209w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/EauI05WqZD-302.jpeg 302w, https://jeroenjanssens.com/img/EauI05WqZD-453.jpeg 453w, https://jeroenjanssens.com/img/EauI05WqZD-604.jpeg 604w, https://jeroenjanssens.com/img/EauI05WqZD-907.jpeg 907w, https://jeroenjanssens.com/img/EauI05WqZD-1209.jpeg 1209w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/EauI05WqZD-302.jpeg 302w, https://jeroenjanssens.com/img/EauI05WqZD-453.jpeg 453w, https://jeroenjanssens.com/img/EauI05WqZD-604.jpeg 604w, https://jeroenjanssens.com/img/EauI05WqZD-907.jpeg 907w, https://jeroenjanssens.com/img/EauI05WqZD-1209.jpeg 1209w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 90%;" src="https://jeroenjanssens.com/img/EauI05WqZD-302.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
<p>The plot shows a negative relationship between engine size (<code>displ</code>) and
fuel efficiency (<code>hwy</code>). In other words, cars with big engines use more
fuel. Does this confirm or refute your hypothesis about fuel efficiency
and engine size?</p>
<p>With plotnine, you begin a plot with the function <code>ggplot()</code>. <code>ggplot()</code>
creates a coordinate system that you can add layers to. The first
argument of <code>ggplot()</code> is the dataset to use in the graph. So
<code>ggplot(data=mpg)</code> creates an empty graph, but it’s not very interesting
so I’m not going to show it here.</p>
<p>You complete your graph by adding one or more layers to <code>ggplot()</code>. The
function <code>geom_point()</code> adds a layer of points to your plot, which
creates a scatterplot. plotnine comes with many geom functions that each
add a different type of layer to a plot. You’ll learn a whole bunch of
them throughout this tutorial.</p>
<p>Each geom function in plotnine takes a <code>mapping</code> argument. This defines
how variables in your dataset are mapped to visual properties. The
<code>mapping</code> argument is always paired with <code>aes()</code>, and the <code>x</code> and <code>y</code>
arguments of <code>aes()</code> specify which variables to map to the x and y axes.
plotnine looks for the mapped variables in the <code>data</code> argument, in this
case, <code>mpg</code>.</p>
<h3><a href="https://r4ds.had.co.nz/data-visualisation.html#a-graphing-template">3.2.3</a> A graphing template</h3>
<p>Let’s turn this code into a reusable template for making graphs with
plotnine. To make a graph, replace the bracketed sections in the code
below with a dataset, a geom function, or a collection of mappings.</p>
<pre class="language-text"><code class="language-text">ggplot(data=&lt;DATA&gt;) +\<br />&lt;GEOM_FUNCTION&gt;(mapping=aes(&lt;MAPPINGS&gt;))</code></pre>
<p>The rest of this tutorial will show you how to complete and extend this
template to make different types of graphs. We will begin with the
<code>&lt;MAPPINGS&gt;</code> component.</p>
<h3><a href="https://r4ds.had.co.nz/data-visualisation.html#exercises">3.2.4</a> Exercises</h3>
<ol>
<li>
<p>Run <code>ggplot(data=mpg)</code>. What do you see?</p>
</li>
<li>
<p>How many rows are in <code>mpg</code>? How many columns?</p>
</li>
<li>
<p>What does the <code>drv</code> variable describe? Read the help for <code>?mpg</code> to
find out.</p>
</li>
<li>
<p>Make a scatterplot of <code>hwy</code> vs <code>cyl</code>.</p>
</li>
<li>
<p>What happens if you make a scatterplot of <code>class</code> vs <code>drv</code>? Why is
the plot not useful?</p>
</li>
</ol>
<h2><a href="https://r4ds.had.co.nz/data-visualisation.html#aesthetic-mappings">3.3</a> Aesthetic mappings</h2>
<blockquote>
<p>“The greatest value of a picture is when it forces us to notice what
we never expected to see.” — John Tukey</p>
</blockquote>
<p>In the plot below, one group of points (highlighted in red) seems to
fall outside of the linear trend. These cars have a higher mileage than
you might expect. How can you explain these cars?</p>
<figure>
<a href="https://jeroenjanssens.com/img/plotnine-grammar-of-graphics-for-python/plotnine-unnamed-chunk-4-1.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/KrJdIS1uGw-302.webp 302w, https://jeroenjanssens.com/img/KrJdIS1uGw-453.webp 453w, https://jeroenjanssens.com/img/KrJdIS1uGw-604.webp 604w, https://jeroenjanssens.com/img/KrJdIS1uGw-907.webp 907w, https://jeroenjanssens.com/img/KrJdIS1uGw-1209.webp 1209w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/KrJdIS1uGw-302.webp 302w, https://jeroenjanssens.com/img/KrJdIS1uGw-453.webp 453w, https://jeroenjanssens.com/img/KrJdIS1uGw-604.webp 604w, https://jeroenjanssens.com/img/KrJdIS1uGw-907.webp 907w, https://jeroenjanssens.com/img/KrJdIS1uGw-1209.webp 1209w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/KrJdIS1uGw-302.jpeg 302w, https://jeroenjanssens.com/img/KrJdIS1uGw-453.jpeg 453w, https://jeroenjanssens.com/img/KrJdIS1uGw-604.jpeg 604w, https://jeroenjanssens.com/img/KrJdIS1uGw-907.jpeg 907w, https://jeroenjanssens.com/img/KrJdIS1uGw-1209.jpeg 1209w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/KrJdIS1uGw-302.jpeg 302w, https://jeroenjanssens.com/img/KrJdIS1uGw-453.jpeg 453w, https://jeroenjanssens.com/img/KrJdIS1uGw-604.jpeg 604w, https://jeroenjanssens.com/img/KrJdIS1uGw-907.jpeg 907w, https://jeroenjanssens.com/img/KrJdIS1uGw-1209.jpeg 1209w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 90%;" src="https://jeroenjanssens.com/img/KrJdIS1uGw-302.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
<p>Let’s hypothesize that the cars are hybrids. One way to test this
hypothesis is to look at the <code>class</code> value for each car. The <code>class</code>
variable of the <code>mpg</code> dataset classifies cars into groups such as
compact, midsize, and SUV. If the outlying points are hybrids, they
should be classified as compact cars or, perhaps, subcompact cars (keep
in mind that this data was collected before hybrid trucks and SUVs
became popular).</p>
<p>You can add a third variable, like <code>class</code>, to a two dimensional
scatterplot by mapping it to an <strong>aesthetic</strong>. An aesthetic is a visual
property of the objects in your plot. Aesthetics include things like the
size, the shape, or the color of your points. You can display a point
(like the one below) in different ways by changing the values of its
aesthetic properties. Since we already use the word “value” to describe
data, let’s use the word “level” to describe aesthetic properties. Here
we change the levels of a point’s size, shape, and color to make the
point small, triangular, or blue:</p>
<figure>
<a href="https://jeroenjanssens.com/img/plotnine-grammar-of-graphics-for-python/plotnine-unnamed-chunk-5-1.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/KZ0GrYJDBg-302.webp 302w, https://jeroenjanssens.com/img/KZ0GrYJDBg-453.webp 453w, https://jeroenjanssens.com/img/KZ0GrYJDBg-604.webp 604w, https://jeroenjanssens.com/img/KZ0GrYJDBg-907.webp 907w, https://jeroenjanssens.com/img/KZ0GrYJDBg-1209.webp 1209w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/KZ0GrYJDBg-302.webp 302w, https://jeroenjanssens.com/img/KZ0GrYJDBg-453.webp 453w, https://jeroenjanssens.com/img/KZ0GrYJDBg-604.webp 604w, https://jeroenjanssens.com/img/KZ0GrYJDBg-907.webp 907w, https://jeroenjanssens.com/img/KZ0GrYJDBg-1209.webp 1209w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/KZ0GrYJDBg-302.jpeg 302w, https://jeroenjanssens.com/img/KZ0GrYJDBg-453.jpeg 453w, https://jeroenjanssens.com/img/KZ0GrYJDBg-604.jpeg 604w, https://jeroenjanssens.com/img/KZ0GrYJDBg-907.jpeg 907w, https://jeroenjanssens.com/img/KZ0GrYJDBg-1209.jpeg 1209w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/KZ0GrYJDBg-302.jpeg 302w, https://jeroenjanssens.com/img/KZ0GrYJDBg-453.jpeg 453w, https://jeroenjanssens.com/img/KZ0GrYJDBg-604.jpeg 604w, https://jeroenjanssens.com/img/KZ0GrYJDBg-907.jpeg 907w, https://jeroenjanssens.com/img/KZ0GrYJDBg-1209.jpeg 1209w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 90%;" src="https://jeroenjanssens.com/img/KZ0GrYJDBg-302.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
<p>You can convey information about your data by mapping the aesthetics in
your plot to the variables in your dataset. For example, you can map the
colors of your points to the <code>class</code> variable to reveal the class of
each car.</p>
<pre class="language-python"><code class="language-python">ggplot<span class="token punctuation">(</span>data<span class="token operator">=</span>mpg<span class="token punctuation">)</span> <span class="token operator">+</span>\<br />geom_point<span class="token punctuation">(</span>mapping<span class="token operator">=</span>aes<span class="token punctuation">(</span>x<span class="token operator">=</span><span class="token string">"displ"</span><span class="token punctuation">,</span> y<span class="token operator">=</span><span class="token string">"hwy"</span><span class="token punctuation">,</span> color<span class="token operator">=</span><span class="token string">"class"</span><span class="token punctuation">)</span><span class="token punctuation">)</span></code></pre>
<figure>
<a href="https://jeroenjanssens.com/img/plotnine-grammar-of-graphics-for-python/plotnine-unnamed-chunk-6-1.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/8IJIL4Xx4Y-302.webp 302w, https://jeroenjanssens.com/img/8IJIL4Xx4Y-453.webp 453w, https://jeroenjanssens.com/img/8IJIL4Xx4Y-604.webp 604w, https://jeroenjanssens.com/img/8IJIL4Xx4Y-907.webp 907w, https://jeroenjanssens.com/img/8IJIL4Xx4Y-1209.webp 1209w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/8IJIL4Xx4Y-302.webp 302w, https://jeroenjanssens.com/img/8IJIL4Xx4Y-453.webp 453w, https://jeroenjanssens.com/img/8IJIL4Xx4Y-604.webp 604w, https://jeroenjanssens.com/img/8IJIL4Xx4Y-907.webp 907w, https://jeroenjanssens.com/img/8IJIL4Xx4Y-1209.webp 1209w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/8IJIL4Xx4Y-302.jpeg 302w, https://jeroenjanssens.com/img/8IJIL4Xx4Y-453.jpeg 453w, https://jeroenjanssens.com/img/8IJIL4Xx4Y-604.jpeg 604w, https://jeroenjanssens.com/img/8IJIL4Xx4Y-907.jpeg 907w, https://jeroenjanssens.com/img/8IJIL4Xx4Y-1209.jpeg 1209w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/8IJIL4Xx4Y-302.jpeg 302w, https://jeroenjanssens.com/img/8IJIL4Xx4Y-453.jpeg 453w, https://jeroenjanssens.com/img/8IJIL4Xx4Y-604.jpeg 604w, https://jeroenjanssens.com/img/8IJIL4Xx4Y-907.jpeg 907w, https://jeroenjanssens.com/img/8IJIL4Xx4Y-1209.jpeg 1209w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 90%;" src="https://jeroenjanssens.com/img/8IJIL4Xx4Y-302.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
<p>(If you prefer British English, like Hadley, you can use <code>colour</code>
instead of <code>color</code>.)</p>
<p>To map an aesthetic to a variable, associate the name of the aesthetic
to the name of the variable inside <code>aes()</code>. plotnine will automatically
assign a unique level of the aesthetic (here a unique color) to each
unique value of the variable, a process known as <strong>scaling</strong>. plotnine
will also add a legend that explains which levels correspond to which
values.</p>
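<p>The idea behind scaling can be sketched in a few lines of plain Python.
This is a simplified model and not how plotnine implements it: the function
<code>scale_discrete</code>, the palette, and the handful of class values are
all made up for illustration.</p>

```python
def scale_discrete(values, levels):
    """Assign one aesthetic level to each unique value, in order of first appearance."""
    mapping = {}
    for value in values:
        if value not in mapping:
            # The next unused level goes to the next unseen value.
            mapping[value] = levels[len(mapping)]
    return mapping


# A made-up sample of the class variable and a made-up color palette.
car_classes = ["compact", "compact", "midsize", "suv", "2seater", "suv"]
palette = ["red", "green", "blue", "purple"]

# The mapping doubles as the legend: which level stands for which value.
legend = scale_discrete(car_classes, palette)
point_colors = [legend[car_class] for car_class in car_classes]
```

<p>Here <code>legend</code> plays the role of the legend in the plot, and
<code>point_colors</code> holds the color that each point would get.</p>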
<p>The colors reveal that many of the unusual points are two-seater cars.
These cars don’t seem like hybrids, and are, in fact, sports cars!
Sports cars have large engines like SUVs and pickup trucks, but small
bodies like midsize and compact cars, which improves their gas mileage.
In hindsight, these cars were unlikely to be hybrids since they have
large engines.</p>
<p>In the above example, we mapped <code>class</code> to the color aesthetic, but we
could have mapped <code>class</code> to the size aesthetic in the same way. In this
case, the exact size of each point would reveal its class affiliation.
We get a <em>warning</em> here, because mapping an unordered variable (<code>class</code>)
to an ordered aesthetic (<code>size</code>) is not a good idea.</p>
<pre class="language-python"><code class="language-python">ggplot<span class="token punctuation">(</span>data<span class="token operator">=</span>mpg<span class="token punctuation">)</span> <span class="token operator">+</span>\<br />geom_point<span class="token punctuation">(</span>mapping<span class="token operator">=</span>aes<span class="token punctuation">(</span>x<span class="token operator">=</span><span class="token string">"displ"</span><span class="token punctuation">,</span> y<span class="token operator">=</span><span class="token string">"hwy"</span><span class="token punctuation">,</span> size<span class="token operator">=</span><span class="token string">"class"</span><span class="token punctuation">)</span><span class="token punctuation">)</span></code></pre>
<pre class="language-text"><code class="language-text">./venv/lib/python3.7/site-packages/plotnine/scales/scale_size.py:50: PlotnineWarning: Using size for a discrete variable is not advised.<br /> PlotnineWarning</code></pre>
<figure>
<a href="https://jeroenjanssens.com/img/plotnine-grammar-of-graphics-for-python/plotnine-unnamed-chunk-7-1.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/dy1DhP0G0e-302.webp 302w, https://jeroenjanssens.com/img/dy1DhP0G0e-453.webp 453w, https://jeroenjanssens.com/img/dy1DhP0G0e-604.webp 604w, https://jeroenjanssens.com/img/dy1DhP0G0e-907.webp 907w, https://jeroenjanssens.com/img/dy1DhP0G0e-1209.webp 1209w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/dy1DhP0G0e-302.webp 302w, https://jeroenjanssens.com/img/dy1DhP0G0e-453.webp 453w, https://jeroenjanssens.com/img/dy1DhP0G0e-604.webp 604w, https://jeroenjanssens.com/img/dy1DhP0G0e-907.webp 907w, https://jeroenjanssens.com/img/dy1DhP0G0e-1209.webp 1209w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/dy1DhP0G0e-302.jpeg 302w, https://jeroenjanssens.com/img/dy1DhP0G0e-453.jpeg 453w, https://jeroenjanssens.com/img/dy1DhP0G0e-604.jpeg 604w, https://jeroenjanssens.com/img/dy1DhP0G0e-907.jpeg 907w, https://jeroenjanssens.com/img/dy1DhP0G0e-1209.jpeg 1209w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/dy1DhP0G0e-302.jpeg 302w, https://jeroenjanssens.com/img/dy1DhP0G0e-453.jpeg 453w, https://jeroenjanssens.com/img/dy1DhP0G0e-604.jpeg 604w, https://jeroenjanssens.com/img/dy1DhP0G0e-907.jpeg 907w, https://jeroenjanssens.com/img/dy1DhP0G0e-1209.jpeg 1209w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 90%;" src="https://jeroenjanssens.com/img/dy1DhP0G0e-302.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
<p>Similarly, we could have mapped <code>manufacturer</code> to the <em>alpha</em> aesthetic,
which controls the transparency of the points, or to the <em>shape</em>
aesthetic, which controls the shape of the points.<sup class="footnote-ref"><a href="https://jeroenjanssens.com/plotnine/#fn7" id="fnref7">[7]</a></sup></p>
<pre class="language-python"><code class="language-python"><span class="token comment"># Left</span><br />ggplot<span class="token punctuation">(</span>data<span class="token operator">=</span>mpg<span class="token punctuation">)</span> <span class="token operator">+</span>\<br />geom_point<span class="token punctuation">(</span>mapping<span class="token operator">=</span>aes<span class="token punctuation">(</span>x<span class="token operator">=</span><span class="token string">"displ"</span><span class="token punctuation">,</span> y<span class="token operator">=</span><span class="token string">"hwy"</span><span class="token punctuation">,</span> alpha<span class="token operator">=</span><span class="token string">"manufacturer"</span><span class="token punctuation">)</span><span class="token punctuation">)</span><br /><br /><span class="token comment"># Right</span><br />ggplot<span class="token punctuation">(</span>data<span class="token operator">=</span>mpg<span class="token punctuation">)</span> <span class="token operator">+</span>\<br />geom_point<span class="token punctuation">(</span>mapping<span class="token operator">=</span>aes<span class="token punctuation">(</span>x<span class="token operator">=</span><span class="token string">"displ"</span><span class="token punctuation">,</span> y<span class="token operator">=</span><span class="token string">"hwy"</span><span class="token punctuation">,</span> shape<span class="token operator">=</span><span class="token string">"manufacturer"</span><span class="token punctuation">)</span><span class="token punctuation">)</span></code></pre>
<div class="flex flex-wrap md:flex-row md:flex-no-wrap">
<div class="md:w-1/2">
<figure>
<a href="https://jeroenjanssens.com/img/plotnine-grammar-of-graphics-for-python/plotnine-unnamed-chunk-10-1.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/usy6WKmMth-302.webp 302w, https://jeroenjanssens.com/img/usy6WKmMth-453.webp 453w, https://jeroenjanssens.com/img/usy6WKmMth-604.webp 604w, https://jeroenjanssens.com/img/usy6WKmMth-907.webp 907w, https://jeroenjanssens.com/img/usy6WKmMth-1209.webp 1209w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/usy6WKmMth-302.webp 302w, https://jeroenjanssens.com/img/usy6WKmMth-453.webp 453w, https://jeroenjanssens.com/img/usy6WKmMth-604.webp 604w, https://jeroenjanssens.com/img/usy6WKmMth-907.webp 907w, https://jeroenjanssens.com/img/usy6WKmMth-1209.webp 1209w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/usy6WKmMth-302.jpeg 302w, https://jeroenjanssens.com/img/usy6WKmMth-453.jpeg 453w, https://jeroenjanssens.com/img/usy6WKmMth-604.jpeg 604w, https://jeroenjanssens.com/img/usy6WKmMth-907.jpeg 907w, https://jeroenjanssens.com/img/usy6WKmMth-1209.jpeg 1209w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/usy6WKmMth-302.jpeg 302w, https://jeroenjanssens.com/img/usy6WKmMth-453.jpeg 453w, https://jeroenjanssens.com/img/usy6WKmMth-604.jpeg 604w, https://jeroenjanssens.com/img/usy6WKmMth-907.jpeg 907w, https://jeroenjanssens.com/img/usy6WKmMth-1209.jpeg 1209w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 90%;" src="https://jeroenjanssens.com/img/usy6WKmMth-302.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
</div>
<div class="md:w-1/2">
<figure>
<a href="https://jeroenjanssens.com/img/plotnine-grammar-of-graphics-for-python/plotnine-unnamed-chunk-11-1.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/7c3qY3MPwT-302.webp 302w, https://jeroenjanssens.com/img/7c3qY3MPwT-453.webp 453w, https://jeroenjanssens.com/img/7c3qY3MPwT-604.webp 604w, https://jeroenjanssens.com/img/7c3qY3MPwT-907.webp 907w, https://jeroenjanssens.com/img/7c3qY3MPwT-1209.webp 1209w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/7c3qY3MPwT-302.webp 302w, https://jeroenjanssens.com/img/7c3qY3MPwT-453.webp 453w, https://jeroenjanssens.com/img/7c3qY3MPwT-604.webp 604w, https://jeroenjanssens.com/img/7c3qY3MPwT-907.webp 907w, https://jeroenjanssens.com/img/7c3qY3MPwT-1209.webp 1209w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/7c3qY3MPwT-302.jpeg 302w, https://jeroenjanssens.com/img/7c3qY3MPwT-453.jpeg 453w, https://jeroenjanssens.com/img/7c3qY3MPwT-604.jpeg 604w, https://jeroenjanssens.com/img/7c3qY3MPwT-907.jpeg 907w, https://jeroenjanssens.com/img/7c3qY3MPwT-1209.jpeg 1209w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/7c3qY3MPwT-302.jpeg 302w, https://jeroenjanssens.com/img/7c3qY3MPwT-453.jpeg 453w, https://jeroenjanssens.com/img/7c3qY3MPwT-604.jpeg 604w, https://jeroenjanssens.com/img/7c3qY3MPwT-907.jpeg 907w, https://jeroenjanssens.com/img/7c3qY3MPwT-1209.jpeg 1209w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 90%;" src="https://jeroenjanssens.com/img/7c3qY3MPwT-302.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
</div>
</div>
<p>What happened to Toyota and Volkswagen? plotnine will only use 13 shapes
at a time. By default, additional groups will go unplotted when you use
the shape aesthetic.</p>
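<p>A small sketch shows why exactly these two manufacturers go missing. It
assumes that shapes are handed out in alphabetical order of the manufacturer
names, which matches the plot above; the shape names are placeholders and not
the markers that plotnine actually uses.</p>

```python
from itertools import zip_longest

# The 15 manufacturers in the mpg dataset, in alphabetical order.
manufacturers = [
    "audi", "chevrolet", "dodge", "ford", "honda", "hyundai", "jeep",
    "land rover", "lincoln", "mercury", "nissan", "pontiac", "subaru",
    "toyota", "volkswagen",
]

# Thirteen placeholder shapes, standing in for the 13 available shapes.
available_shapes = [f"shape_{i}" for i in range(1, 14)]

# Pair each manufacturer with a shape; those left over get None.
assignment = dict(zip_longest(manufacturers, available_shapes))
unplotted = [name for name, shape in assignment.items() if shape is None]
```

<p>With 15 manufacturers and 13 shapes, the last two names in the list are
left without a shape.</p>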
<p>For each aesthetic, you use <code>aes()</code> to associate the name of the
aesthetic with a variable to display. The <code>aes()</code> function gathers
together each of the aesthetic mappings used by a layer and passes them
to the layer’s mapping argument. The syntax highlights a useful insight
about <code>x</code> and <code>y</code>: the x and y locations of a point are themselves
aesthetics, visual properties that you can map to variables to display
information about the data.</p>
<p>Once you map an aesthetic, plotnine takes care of the rest. It selects a
reasonable scale to use with the aesthetic, and it constructs a legend
that explains the mapping between levels and values. For x and y
aesthetics, plotnine does not create a legend, but it creates an axis
line with tick marks and a label. The axis line acts as a legend; it
explains the mapping between locations and values.</p>
<p>You can also <em>set</em> the aesthetic properties of your geom manually. For
example, we can make all of the points in our plot blue:</p>
<pre class="language-python"><code class="language-python">ggplot<span class="token punctuation">(</span>data<span class="token operator">=</span>mpg<span class="token punctuation">)</span> <span class="token operator">+</span>\<br />geom_point<span class="token punctuation">(</span>mapping<span class="token operator">=</span>aes<span class="token punctuation">(</span>x<span class="token operator">=</span><span class="token string">"displ"</span><span class="token punctuation">,</span> y<span class="token operator">=</span><span class="token string">"hwy"</span><span class="token punctuation">)</span><span class="token punctuation">,</span> color<span class="token operator">=</span><span class="token string">"blue"</span><span class="token punctuation">)</span></code></pre>
<figure>
<a href="https://jeroenjanssens.com/img/plotnine-grammar-of-graphics-for-python/plotnine-unnamed-chunk-12-1.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/hRR4hCQmUP-302.webp 302w, https://jeroenjanssens.com/img/hRR4hCQmUP-453.webp 453w, https://jeroenjanssens.com/img/hRR4hCQmUP-604.webp 604w, https://jeroenjanssens.com/img/hRR4hCQmUP-907.webp 907w, https://jeroenjanssens.com/img/hRR4hCQmUP-1209.webp 1209w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/hRR4hCQmUP-302.webp 302w, https://jeroenjanssens.com/img/hRR4hCQmUP-453.webp 453w, https://jeroenjanssens.com/img/hRR4hCQmUP-604.webp 604w, https://jeroenjanssens.com/img/hRR4hCQmUP-907.webp 907w, https://jeroenjanssens.com/img/hRR4hCQmUP-1209.webp 1209w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/hRR4hCQmUP-302.jpeg 302w, https://jeroenjanssens.com/img/hRR4hCQmUP-453.jpeg 453w, https://jeroenjanssens.com/img/hRR4hCQmUP-604.jpeg 604w, https://jeroenjanssens.com/img/hRR4hCQmUP-907.jpeg 907w, https://jeroenjanssens.com/img/hRR4hCQmUP-1209.jpeg 1209w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/hRR4hCQmUP-302.jpeg 302w, https://jeroenjanssens.com/img/hRR4hCQmUP-453.jpeg 453w, https://jeroenjanssens.com/img/hRR4hCQmUP-604.jpeg 604w, https://jeroenjanssens.com/img/hRR4hCQmUP-907.jpeg 907w, https://jeroenjanssens.com/img/hRR4hCQmUP-1209.jpeg 1209w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 90%;" src="https://jeroenjanssens.com/img/hRR4hCQmUP-302.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
<p>Here, the color doesn’t convey information about a variable, but only
changes the appearance of the plot. To set an aesthetic manually, set
the aesthetic by name as an argument of your geom function; i.e. it goes
<em>outside</em> of <code>aes()</code>. You’ll need to pick a level that makes sense for
that aesthetic:</p>
<ul>
<li>The name of a color as a string.</li>
<li>The size of a point in mm.</li>
<li>The shape of a point as a character or number, as shown below.</li>
</ul>
<figure>
<a href="https://jeroenjanssens.com/img/plotnine-grammar-of-graphics-for-python/plotnine-shapes-1.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/EHomht5kpS-302.webp 302w, https://jeroenjanssens.com/img/EHomht5kpS-453.webp 453w, https://jeroenjanssens.com/img/EHomht5kpS-604.webp 604w, https://jeroenjanssens.com/img/EHomht5kpS-907.webp 907w, https://jeroenjanssens.com/img/EHomht5kpS-1209.webp 1209w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/EHomht5kpS-302.webp 302w, https://jeroenjanssens.com/img/EHomht5kpS-453.webp 453w, https://jeroenjanssens.com/img/EHomht5kpS-604.webp 604w, https://jeroenjanssens.com/img/EHomht5kpS-907.webp 907w, https://jeroenjanssens.com/img/EHomht5kpS-1209.webp 1209w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/EHomht5kpS-302.jpeg 302w, https://jeroenjanssens.com/img/EHomht5kpS-453.jpeg 453w, https://jeroenjanssens.com/img/EHomht5kpS-604.jpeg 604w, https://jeroenjanssens.com/img/EHomht5kpS-907.jpeg 907w, https://jeroenjanssens.com/img/EHomht5kpS-1209.jpeg 1209w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/EHomht5kpS-302.jpeg 302w, https://jeroenjanssens.com/img/EHomht5kpS-453.jpeg 453w, https://jeroenjanssens.com/img/EHomht5kpS-604.jpeg 604w, https://jeroenjanssens.com/img/EHomht5kpS-907.jpeg 907w, https://jeroenjanssens.com/img/EHomht5kpS-1209.jpeg 1209w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 90%;" src="https://jeroenjanssens.com/img/EHomht5kpS-302.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
<h3><a href="https://r4ds.had.co.nz/data-visualisation.html#exercises-1">3.3.1</a> Exercises<sup class="footnote-ref"><a href="https://jeroenjanssens.com/plotnine/#fn8" id="fnref8">[8]</a></sup></h3>
<ol>
<li>
<p>Which variables in <code>mpg</code> are categorical? Which variables are
continuous? (Hint: type <code>?mpg</code> to read the documentation for the
dataset). How can you see this information when you run <code>mpg</code>?</p>
</li>
<li>
<p>Map a continuous variable to <code>color</code>, <code>size</code>, and <code>shape</code>. How do
these aesthetics behave differently for categorical vs. continuous
variables?</p>
</li>
<li>
<p>What happens if you map the same variable to multiple aesthetics?</p>
</li>
<li>
<p>What does the <code>stroke</code> aesthetic do? What shapes does it work with?
(Hint: use <code>?geom_point</code>)</p>
</li>
<li>
<p>What happens if you map an aesthetic to something other than a
variable name, like <code>aes(colour="displ < 5")</code>? Note, you’ll also
need to specify x and y.</p>
</li>
</ol>
<h2><a href="https://r4ds.had.co.nz/data-visualisation.html#common-problems">3.4</a> Common problems</h2>
<p>As you start to run Python code, you’re likely to run into problems.
Don’t worry — it happens to everyone. I have been writing Python code
for years, and every day I still write code that doesn’t work!</p>
<p>Start by carefully comparing the code that you’re running to the code in
this tutorial. Python is extremely picky, and a misplaced character can make
all the difference. Make sure that every <code>(</code> is matched with a <code>)</code> and
every <code>"</code> is paired with another <code>"</code>.</p>
<p>One common problem when creating plotnine graphics is to forget the <code>\</code>:
it has to come at the end of the line. In other words, make sure you
haven’t accidentally written code like this:</p>
<pre class="language-r"><code class="language-r">ggplot<span class="token punctuation">(</span>data<span class="token operator">=</span>mpg<span class="token punctuation">)</span> <span class="token operator">+</span><br />geom_point<span class="token punctuation">(</span>mapping<span class="token operator">=</span>aes<span class="token punctuation">(</span>x<span class="token operator">=</span><span class="token string">"displ"</span><span class="token punctuation">,</span> y<span class="token operator">=</span><span class="token string">"hwy"</span><span class="token punctuation">)</span><span class="token punctuation">)</span></code></pre>
<p>Alternatively, if you wrap the entire expression in parentheses then you
can leave out the <code>\</code>:</p>
<pre class="language-r"><code class="language-r"><span class="token punctuation">(</span>ggplot<span class="token punctuation">(</span>data<span class="token operator">=</span>mpg<span class="token punctuation">)</span> <span class="token operator">+</span><br />geom_point<span class="token punctuation">(</span>mapping<span class="token operator">=</span>aes<span class="token punctuation">(</span>x<span class="token operator">=</span><span class="token string">"displ"</span><span class="token punctuation">,</span> y<span class="token operator">=</span><span class="token string">"hwy"</span><span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">)</span></code></pre>
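<p>You can check these three cases without plotnine at all, because this is
a rule of Python itself. The sketch below uses the built-in
<code>compile()</code> to ask whether Python can parse a piece of code; the
names <code>base</code> and <code>layer</code> are stand-ins for the
<code>ggplot()</code> and <code>geom_point()</code> calls, and
<code>parses</code> is a helper I made up.</p>

```python
def parses(source):
    """Return True if Python can parse the source, False on a SyntaxError."""
    try:
        compile(source, "<example>", "exec")
        return True
    except SyntaxError:
        return False


# The line ends with + and nothing tells Python that the statement continues.
no_backslash = "plot = base +\nlayer"

# A backslash at the end of the line joins it with the next line.
with_backslash = "plot = base +\\\nlayer"

# Inside parentheses, a statement may span several lines.
with_parentheses = "plot = (base +\nlayer)"

results = [parses(no_backslash), parses(with_backslash), parses(with_parentheses)]
```

<p>Only the first variant fails to parse, which is the mistake described
above.</p>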
<p>If you’re still stuck, try the help. In IPython or a Jupyter notebook, you
can get help about any Python function by running <code>?function_name</code>; in a
plain Python session, run <code>help(function_name)</code> instead. Don’t worry if the
help doesn’t seem that helpful; instead, skip down to the examples and look
for code that matches what you’re trying to do.</p>
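<p>If you want the help text as a string, for instance to search it, the
standard library can fetch it for you. This is a minimal sketch using the
built-in <code>sorted</code> as the example function;
<code>first_doc_line</code> is a made-up helper, and the same approach should
work for a plotnine function that has a docstring.</p>

```python
import inspect


def first_doc_line(obj):
    """Return the first line of an object's documentation, or "" if it has none."""
    doc = inspect.getdoc(obj)
    return doc.splitlines()[0] if doc else ""


# The one-line summary that help(sorted) would show near the top.
summary = first_doc_line(sorted)
```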
<p>If that doesn’t help, carefully read the error message. Sometimes the
answer will be buried there! When you’re new to Python, though, you may not
yet know how to interpret the message, even when it contains the answer.
Another great tool is Google: try googling the error message, as it’s likely
that someone else has had the same problem and has gotten help online.</p>
<h2><a href="https://r4ds.had.co.nz/data-visualisation.html#facets">3.5</a> Facets</h2>
<p>One way to add additional variables is with aesthetics. Another way,
particularly useful for categorical variables, is to split your plot
into <strong>facets</strong>, subplots that each display one subset of the data.</p>
<p>To facet your plot by a single variable, use <code>facet_wrap()</code>. The first
argument of <code>facet_wrap()</code> should be the name of the variable, given as a
string. (In R’s ggplot2 this argument is a formula created with <code>~</code>;
Python has no formula data structure, so plotnine takes a string instead.)
The variable that you pass to <code>facet_wrap()</code> should be discrete.</p>
<pre class="language-python"><code class="language-python">ggplot<span class="token punctuation">(</span>data<span class="token operator">=</span>mpg<span class="token punctuation">)</span> <span class="token operator">+</span>\<br />geom_point<span class="token punctuation">(</span>mapping<span class="token operator">=</span>aes<span class="token punctuation">(</span>x<span class="token operator">=</span><span class="token string">"displ"</span><span class="token punctuation">,</span> y<span class="token operator">=</span><span class="token string">"hwy"</span><span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token operator">+</span>\<br />facet_wrap<span class="token punctuation">(</span><span class="token string">"class"</span><span class="token punctuation">,</span> nrow<span class="token operator">=</span><span class="token number">2</span><span class="token punctuation">)</span></code></pre>
<figure>
<a href="https://jeroenjanssens.com/img/plotnine-grammar-of-graphics-for-python/plotnine-unnamed-chunk-13-1.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/4sI8bPv-OP-302.webp 302w, https://jeroenjanssens.com/img/4sI8bPv-OP-453.webp 453w, https://jeroenjanssens.com/img/4sI8bPv-OP-604.webp 604w, https://jeroenjanssens.com/img/4sI8bPv-OP-907.webp 907w, https://jeroenjanssens.com/img/4sI8bPv-OP-1209.webp 1209w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/4sI8bPv-OP-302.webp 302w, https://jeroenjanssens.com/img/4sI8bPv-OP-453.webp 453w, https://jeroenjanssens.com/img/4sI8bPv-OP-604.webp 604w, https://jeroenjanssens.com/img/4sI8bPv-OP-907.webp 907w, https://jeroenjanssens.com/img/4sI8bPv-OP-1209.webp 1209w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/4sI8bPv-OP-302.jpeg 302w, https://jeroenjanssens.com/img/4sI8bPv-OP-453.jpeg 453w, https://jeroenjanssens.com/img/4sI8bPv-OP-604.jpeg 604w, https://jeroenjanssens.com/img/4sI8bPv-OP-907.jpeg 907w, https://jeroenjanssens.com/img/4sI8bPv-OP-1209.jpeg 1209w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/4sI8bPv-OP-302.jpeg 302w, https://jeroenjanssens.com/img/4sI8bPv-OP-453.jpeg 453w, https://jeroenjanssens.com/img/4sI8bPv-OP-604.jpeg 604w, https://jeroenjanssens.com/img/4sI8bPv-OP-907.jpeg 907w, https://jeroenjanssens.com/img/4sI8bPv-OP-1209.jpeg 1209w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 90%;" src="https://jeroenjanssens.com/img/4sI8bPv-OP-302.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
<p>To facet your plot on the combination of two variables, add
<code>facet_grid()</code> to your plot call. The first argument of <code>facet_grid()</code>
is also a formula, written as a string. This time the formula should
contain two variable names separated by a <code>~</code>: the variable on the left
defines the rows and the variable on the right defines the columns.</p>
<pre class="language-python"><code class="language-python">ggplot<span class="token punctuation">(</span>data<span class="token operator">=</span>mpg<span class="token punctuation">)</span> <span class="token operator">+</span>\<br />geom_point<span class="token punctuation">(</span>mapping<span class="token operator">=</span>aes<span class="token punctuation">(</span>x<span class="token operator">=</span><span class="token string">"displ"</span><span class="token punctuation">,</span> y<span class="token operator">=</span><span class="token string">"hwy"</span><span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token operator">+</span>\<br />facet_grid<span class="token punctuation">(</span><span class="token string">"drv ~ cyl"</span><span class="token punctuation">)</span></code></pre>
<figure>
<a href="https://jeroenjanssens.com/img/plotnine-grammar-of-graphics-for-python/plotnine-unnamed-chunk-14-1.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/M697bkN9BM-302.webp 302w, https://jeroenjanssens.com/img/M697bkN9BM-453.webp 453w, https://jeroenjanssens.com/img/M697bkN9BM-604.webp 604w, https://jeroenjanssens.com/img/M697bkN9BM-907.webp 907w, https://jeroenjanssens.com/img/M697bkN9BM-1209.webp 1209w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/M697bkN9BM-302.webp 302w, https://jeroenjanssens.com/img/M697bkN9BM-453.webp 453w, https://jeroenjanssens.com/img/M697bkN9BM-604.webp 604w, https://jeroenjanssens.com/img/M697bkN9BM-907.webp 907w, https://jeroenjanssens.com/img/M697bkN9BM-1209.webp 1209w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/M697bkN9BM-302.jpeg 302w, https://jeroenjanssens.com/img/M697bkN9BM-453.jpeg 453w, https://jeroenjanssens.com/img/M697bkN9BM-604.jpeg 604w, https://jeroenjanssens.com/img/M697bkN9BM-907.jpeg 907w, https://jeroenjanssens.com/img/M697bkN9BM-1209.jpeg 1209w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/M697bkN9BM-302.jpeg 302w, https://jeroenjanssens.com/img/M697bkN9BM-453.jpeg 453w, https://jeroenjanssens.com/img/M697bkN9BM-604.jpeg 604w, https://jeroenjanssens.com/img/M697bkN9BM-907.jpeg 907w, https://jeroenjanssens.com/img/M697bkN9BM-1209.jpeg 1209w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 90%;" src="https://jeroenjanssens.com/img/M697bkN9BM-302.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
<p>If you prefer not to facet in the rows or columns dimension, use a <code>.</code>
instead of a variable name, e.g. <code>+ facet_grid(". ~ cyl")</code>.</p>
<h3><a href="https://r4ds.had.co.nz/data-visualisation.html#exercises-2">3.5.1</a> Exercises</h3>
<ol>
<li>
<p>What happens if you facet on a continuous variable?</p>
</li>
<li>
<p>What do the empty cells in the plot with <code>facet_grid("drv ~ cyl")</code> mean?
How do they relate to this plot?</p>
<pre class="language-python"><code class="language-python">ggplot<span class="token punctuation">(</span>data<span class="token operator">=</span>mpg<span class="token punctuation">)</span> <span class="token operator">+</span>\<br />geom_point<span class="token punctuation">(</span>mapping<span class="token operator">=</span>aes<span class="token punctuation">(</span>x<span class="token operator">=</span><span class="token string">"drv"</span><span class="token punctuation">,</span> y<span class="token operator">=</span><span class="token string">"cyl"</span><span class="token punctuation">)</span><span class="token punctuation">)</span></code></pre>
</li>
<li>
<p>What plots does the following code make? What does <code>.</code> do?</p>
<pre class="language-python"><code class="language-python">ggplot<span class="token punctuation">(</span>data<span class="token operator">=</span>mpg<span class="token punctuation">)</span> <span class="token operator">+</span>\<br />geom_point<span class="token punctuation">(</span>mapping<span class="token operator">=</span>aes<span class="token punctuation">(</span>x<span class="token operator">=</span><span class="token string">"displ"</span><span class="token punctuation">,</span> y<span class="token operator">=</span><span class="token string">"hwy"</span><span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token operator">+</span>\<br />facet_grid<span class="token punctuation">(</span><span class="token string">"drv ~ ."</span><span class="token punctuation">)</span><br /><br />ggplot<span class="token punctuation">(</span>data<span class="token operator">=</span>mpg<span class="token punctuation">)</span> <span class="token operator">+</span>\<br />geom_point<span class="token punctuation">(</span>mapping<span class="token operator">=</span>aes<span class="token punctuation">(</span>x<span class="token operator">=</span><span class="token string">"displ"</span><span class="token punctuation">,</span> y<span class="token operator">=</span><span class="token string">"hwy"</span><span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token operator">+</span>\<br />facet_grid<span class="token punctuation">(</span><span class="token string">". ~ cyl"</span><span class="token punctuation">)</span></code></pre>
</li>
<li>
<p>Take the first faceted plot in this section:</p>
<pre class="language-python"><code class="language-python">ggplot<span class="token punctuation">(</span>data<span class="token operator">=</span>mpg<span class="token punctuation">)</span> <span class="token operator">+</span>\<br />geom_point<span class="token punctuation">(</span>mapping<span class="token operator">=</span>aes<span class="token punctuation">(</span>x<span class="token operator">=</span><span class="token string">"displ"</span><span class="token punctuation">,</span> y<span class="token operator">=</span><span class="token string">"hwy"</span><span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token operator">+</span>\<br />facet_wrap<span class="token punctuation">(</span><span class="token string">"class"</span><span class="token punctuation">,</span> nrow<span class="token operator">=</span><span class="token number">2</span><span class="token punctuation">)</span></code></pre>
<p>What are the advantages to using faceting instead of the colour
aesthetic? What are the disadvantages? How might the balance change
if you had a larger dataset?</p>
</li>
<li>
<p>Read <code>?facet_wrap</code> (IPython and Jupyter syntax; in a plain Python
session, use <code>help(facet_wrap)</code>). What does <code>nrow</code> do? What does <code>ncol</code> do? What
other options control the layout of the individual panels? Why
doesn’t <code>facet_grid()</code> have <code>nrow</code> and <code>ncol</code> arguments?</p>
</li>
<li>
<p>When using <code>facet_grid()</code> you should usually put the variable with
more unique levels in the columns. Why?</p>
</li>
</ol>
<h2><a href="https://r4ds.had.co.nz/data-visualisation.html#geometric-objects">3.6</a> Geometric objects</h2>
<p>How are these two plots similar?</p>
<div class="flex flex-wrap md:flex-row md:flex-no-wrap mb-4">
<div class="md:w-1/2 ">
<figure>
<a href="https://jeroenjanssens.com/img/plotnine-grammar-of-graphics-for-python/plotnine-unnamed-chunk-18-1.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/EauI05WqZD-302.webp 302w, https://jeroenjanssens.com/img/EauI05WqZD-453.webp 453w, https://jeroenjanssens.com/img/EauI05WqZD-604.webp 604w, https://jeroenjanssens.com/img/EauI05WqZD-907.webp 907w, https://jeroenjanssens.com/img/EauI05WqZD-1209.webp 1209w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/EauI05WqZD-302.webp 302w, https://jeroenjanssens.com/img/EauI05WqZD-453.webp 453w, https://jeroenjanssens.com/img/EauI05WqZD-604.webp 604w, https://jeroenjanssens.com/img/EauI05WqZD-907.webp 907w, https://jeroenjanssens.com/img/EauI05WqZD-1209.webp 1209w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/EauI05WqZD-302.jpeg 302w, https://jeroenjanssens.com/img/EauI05WqZD-453.jpeg 453w, https://jeroenjanssens.com/img/EauI05WqZD-604.jpeg 604w, https://jeroenjanssens.com/img/EauI05WqZD-907.jpeg 907w, https://jeroenjanssens.com/img/EauI05WqZD-1209.jpeg 1209w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/EauI05WqZD-302.jpeg 302w, https://jeroenjanssens.com/img/EauI05WqZD-453.jpeg 453w, https://jeroenjanssens.com/img/EauI05WqZD-604.jpeg 604w, https://jeroenjanssens.com/img/EauI05WqZD-907.jpeg 907w, https://jeroenjanssens.com/img/EauI05WqZD-1209.jpeg 1209w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 90%;" src="https://jeroenjanssens.com/img/EauI05WqZD-302.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
</div>
<div class="md:w-1/2 ">
<figure>
<a href="https://jeroenjanssens.com/img/plotnine-grammar-of-graphics-for-python/plotnine-unnamed-chunk-19-1.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/fLRgiHBB0l-302.webp 302w, https://jeroenjanssens.com/img/fLRgiHBB0l-453.webp 453w, https://jeroenjanssens.com/img/fLRgiHBB0l-604.webp 604w, https://jeroenjanssens.com/img/fLRgiHBB0l-907.webp 907w, https://jeroenjanssens.com/img/fLRgiHBB0l-1209.webp 1209w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/fLRgiHBB0l-302.webp 302w, https://jeroenjanssens.com/img/fLRgiHBB0l-453.webp 453w, https://jeroenjanssens.com/img/fLRgiHBB0l-604.webp 604w, https://jeroenjanssens.com/img/fLRgiHBB0l-907.webp 907w, https://jeroenjanssens.com/img/fLRgiHBB0l-1209.webp 1209w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/fLRgiHBB0l-302.jpeg 302w, https://jeroenjanssens.com/img/fLRgiHBB0l-453.jpeg 453w, https://jeroenjanssens.com/img/fLRgiHBB0l-604.jpeg 604w, https://jeroenjanssens.com/img/fLRgiHBB0l-907.jpeg 907w, https://jeroenjanssens.com/img/fLRgiHBB0l-1209.jpeg 1209w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/fLRgiHBB0l-302.jpeg 302w, https://jeroenjanssens.com/img/fLRgiHBB0l-453.jpeg 453w, https://jeroenjanssens.com/img/fLRgiHBB0l-604.jpeg 604w, https://jeroenjanssens.com/img/fLRgiHBB0l-907.jpeg 907w, https://jeroenjanssens.com/img/fLRgiHBB0l-1209.jpeg 1209w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 90%;" src="https://jeroenjanssens.com/img/fLRgiHBB0l-302.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
</div>
</div>
<p>Both plots contain the same x variable, the same y variable, and both
describe the same data. But the plots are not identical. Each plot uses
a different visual object to represent the data. In plotnine syntax, we
say that they use different <strong>geoms</strong>.</p>
<p>A <strong>geom</strong> is the geometrical object that a plot uses to represent data.
People often describe plots by the type of geom that the plot uses. For
example, bar charts use bar geoms, line charts use line geoms, boxplots
use boxplot geoms, and so on. Scatterplots break the trend; they use the
point geom. As we see above, you can use different geoms to plot the
same data. The plot on the left uses the point geom, and the plot on the
right uses the smooth geom, a smooth line fitted to the data.</p>
<p>To change the geom in your plot, change the geom function that you add
to <code>ggplot()</code>. For instance, to make the plots above, you can use this
code:</p>
<pre class="language-python"><code class="language-python"><span class="token comment"># Left</span><br />ggplot<span class="token punctuation">(</span>data<span class="token operator">=</span>mpg<span class="token punctuation">)</span> <span class="token operator">+</span>\<br />geom_point<span class="token punctuation">(</span>mapping<span class="token operator">=</span>aes<span class="token punctuation">(</span>x<span class="token operator">=</span><span class="token string">"displ"</span><span class="token punctuation">,</span> y<span class="token operator">=</span><span class="token string">"hwy"</span><span class="token punctuation">)</span><span class="token punctuation">)</span><br /><br /><span class="token comment"># Right</span><br />ggplot<span class="token punctuation">(</span>data<span class="token operator">=</span>mpg<span class="token punctuation">)</span> <span class="token operator">+</span>\<br />geom_smooth<span class="token punctuation">(</span>mapping<span class="token operator">=</span>aes<span class="token punctuation">(</span>x<span class="token operator">=</span><span class="token string">"displ"</span><span class="token punctuation">,</span> y<span class="token operator">=</span><span class="token string">"hwy"</span><span class="token punctuation">)</span><span class="token punctuation">)</span></code></pre>
<p>Every geom function in plotnine takes a <code>mapping</code> argument. However, not
every aesthetic works with every geom. You could set the shape of a
point, but you couldn’t set the “shape” of a line. On the other hand,
you <em>could</em> set the linetype of a line. <code>geom_smooth()</code> will draw a
different line, with a different linetype, for each unique value of the
variable that you map to linetype.</p>
<pre class="language-python"><code class="language-python">ggplot<span class="token punctuation">(</span>data<span class="token operator">=</span>mpg<span class="token punctuation">)</span> <span class="token operator">+</span>\<br />geom_smooth<span class="token punctuation">(</span>mapping<span class="token operator">=</span>aes<span class="token punctuation">(</span>x<span class="token operator">=</span><span class="token string">"displ"</span><span class="token punctuation">,</span> y<span class="token operator">=</span><span class="token string">"hwy"</span><span class="token punctuation">,</span> linetype<span class="token operator">=</span><span class="token string">"drv"</span><span class="token punctuation">)</span><span class="token punctuation">)</span></code></pre>
<figure>
<a href="https://jeroenjanssens.com/img/plotnine-grammar-of-graphics-for-python/plotnine-unnamed-chunk-21-1.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/oT70T3xe2R-302.webp 302w, https://jeroenjanssens.com/img/oT70T3xe2R-453.webp 453w, https://jeroenjanssens.com/img/oT70T3xe2R-604.webp 604w, https://jeroenjanssens.com/img/oT70T3xe2R-907.webp 907w, https://jeroenjanssens.com/img/oT70T3xe2R-1209.webp 1209w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/oT70T3xe2R-302.webp 302w, https://jeroenjanssens.com/img/oT70T3xe2R-453.webp 453w, https://jeroenjanssens.com/img/oT70T3xe2R-604.webp 604w, https://jeroenjanssens.com/img/oT70T3xe2R-907.webp 907w, https://jeroenjanssens.com/img/oT70T3xe2R-1209.webp 1209w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/oT70T3xe2R-302.jpeg 302w, https://jeroenjanssens.com/img/oT70T3xe2R-453.jpeg 453w, https://jeroenjanssens.com/img/oT70T3xe2R-604.jpeg 604w, https://jeroenjanssens.com/img/oT70T3xe2R-907.jpeg 907w, https://jeroenjanssens.com/img/oT70T3xe2R-1209.jpeg 1209w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/oT70T3xe2R-302.jpeg 302w, https://jeroenjanssens.com/img/oT70T3xe2R-453.jpeg 453w, https://jeroenjanssens.com/img/oT70T3xe2R-604.jpeg 604w, https://jeroenjanssens.com/img/oT70T3xe2R-907.jpeg 907w, https://jeroenjanssens.com/img/oT70T3xe2R-1209.jpeg 1209w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 90%;" src="https://jeroenjanssens.com/img/oT70T3xe2R-302.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
<p>Here <code>geom_smooth()</code> separates the cars into three lines based on their
<code>drv</code> value, which describes a car’s drivetrain. One line describes all
of the points with a <code>4</code> value, one line describes all of the points
with an <code>f</code> value, and one line describes all of the points with an <code>r</code>
value. Here, <code>4</code> stands for four-wheel drive, <code>f</code> for front-wheel drive,
and <code>r</code> for rear-wheel drive.</p>
<p>If this sounds strange, we can make it clearer by overlaying the
lines on top of the raw data and then coloring everything according to
<code>drv</code>.</p>
<figure>
<a href="https://jeroenjanssens.com/img/plotnine-grammar-of-graphics-for-python/plotnine-unnamed-chunk-22-1.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/aUG32nqEWP-302.webp 302w, https://jeroenjanssens.com/img/aUG32nqEWP-453.webp 453w, https://jeroenjanssens.com/img/aUG32nqEWP-604.webp 604w, https://jeroenjanssens.com/img/aUG32nqEWP-907.webp 907w, https://jeroenjanssens.com/img/aUG32nqEWP-1209.webp 1209w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/aUG32nqEWP-302.webp 302w, https://jeroenjanssens.com/img/aUG32nqEWP-453.webp 453w, https://jeroenjanssens.com/img/aUG32nqEWP-604.webp 604w, https://jeroenjanssens.com/img/aUG32nqEWP-907.webp 907w, https://jeroenjanssens.com/img/aUG32nqEWP-1209.webp 1209w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/aUG32nqEWP-302.jpeg 302w, https://jeroenjanssens.com/img/aUG32nqEWP-453.jpeg 453w, https://jeroenjanssens.com/img/aUG32nqEWP-604.jpeg 604w, https://jeroenjanssens.com/img/aUG32nqEWP-907.jpeg 907w, https://jeroenjanssens.com/img/aUG32nqEWP-1209.jpeg 1209w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/aUG32nqEWP-302.jpeg 302w, https://jeroenjanssens.com/img/aUG32nqEWP-453.jpeg 453w, https://jeroenjanssens.com/img/aUG32nqEWP-604.jpeg 604w, https://jeroenjanssens.com/img/aUG32nqEWP-907.jpeg 907w, https://jeroenjanssens.com/img/aUG32nqEWP-1209.jpeg 1209w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 90%;" src="https://jeroenjanssens.com/img/aUG32nqEWP-302.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
<p>Notice that this plot contains two geoms in the same graph! If this
makes you excited, buckle up. We will learn how to place multiple geoms
in the same plot very soon.</p>
<p>plotnine provides over 30 geoms. The best way to get a comprehensive
overview is the ggplot2 cheatsheet, which you can find at
<a href="http://rstudio.com/cheatsheets">http://rstudio.com/cheatsheets</a>. To learn more about any single geom,
use help: <code>?geom_smooth</code>.</p>
<p>Many geoms, like <code>geom_smooth()</code>, use a single geometric object to
display multiple rows of data. For these geoms, you can set the <code>group</code>
aesthetic to a categorical variable to draw multiple objects. plotnine
will draw a separate object for each unique value of the grouping
variable. In practice, plotnine will automatically group the data for
these geoms whenever you map an aesthetic to a discrete variable (as in
the <code>linetype</code> example). It is convenient to rely on this feature
because the group aesthetic by itself does not add a legend or
distinguishing features to the geoms.</p>
<pre class="language-python"><code class="language-python">ggplot<span class="token punctuation">(</span>data<span class="token operator">=</span>mpg<span class="token punctuation">)</span> <span class="token operator">+</span>\<br />geom_smooth<span class="token punctuation">(</span>mapping<span class="token operator">=</span>aes<span class="token punctuation">(</span>x<span class="token operator">=</span><span class="token string">"displ"</span><span class="token punctuation">,</span> y<span class="token operator">=</span><span class="token string">"hwy"</span><span class="token punctuation">)</span><span class="token punctuation">)</span><br /><br />ggplot<span class="token punctuation">(</span>data<span class="token operator">=</span>mpg<span class="token punctuation">)</span> <span class="token operator">+</span>\<br />geom_smooth<span class="token punctuation">(</span>mapping<span class="token operator">=</span>aes<span class="token punctuation">(</span>x<span class="token operator">=</span><span class="token string">"displ"</span><span class="token punctuation">,</span> y<span class="token operator">=</span><span class="token string">"hwy"</span><span class="token punctuation">,</span> group<span class="token operator">=</span><span class="token string">"drv"</span><span class="token punctuation">)</span><span class="token punctuation">)</span><br /><br />ggplot<span class="token punctuation">(</span>data<span class="token operator">=</span>mpg<span class="token punctuation">)</span> <span class="token operator">+</span>\<br />geom_smooth<span class="token punctuation">(</span>mapping<span class="token operator">=</span>aes<span class="token punctuation">(</span>x<span class="token operator">=</span><span class="token string">"displ"</span><span class="token punctuation">,</span> y<span class="token operator">=</span><span class="token string">"hwy"</span><span class="token punctuation">,</span> color<span class="token operator">=</span><span class="token 
string">"drv"</span><span class="token punctuation">)</span><span class="token punctuation">,</span> show_legend<span class="token operator">=</span><span class="token boolean">False</span><span class="token punctuation">)</span></code></pre>
<div class="flex flex-wrap md:flex-row md:flex-no-wrap mb-4">
<div class="mx-auto w-3/4 md:w-1/3 ">
<figure>
<a href="https://jeroenjanssens.com/img/plotnine-grammar-of-graphics-for-python/plotnine-unnamed-chunk-24-1.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/fLRgiHBB0l-302.webp 302w, https://jeroenjanssens.com/img/fLRgiHBB0l-453.webp 453w, https://jeroenjanssens.com/img/fLRgiHBB0l-604.webp 604w, https://jeroenjanssens.com/img/fLRgiHBB0l-907.webp 907w, https://jeroenjanssens.com/img/fLRgiHBB0l-1209.webp 1209w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/fLRgiHBB0l-302.webp 302w, https://jeroenjanssens.com/img/fLRgiHBB0l-453.webp 453w, https://jeroenjanssens.com/img/fLRgiHBB0l-604.webp 604w, https://jeroenjanssens.com/img/fLRgiHBB0l-907.webp 907w, https://jeroenjanssens.com/img/fLRgiHBB0l-1209.webp 1209w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/fLRgiHBB0l-302.jpeg 302w, https://jeroenjanssens.com/img/fLRgiHBB0l-453.jpeg 453w, https://jeroenjanssens.com/img/fLRgiHBB0l-604.jpeg 604w, https://jeroenjanssens.com/img/fLRgiHBB0l-907.jpeg 907w, https://jeroenjanssens.com/img/fLRgiHBB0l-1209.jpeg 1209w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/fLRgiHBB0l-302.jpeg 302w, https://jeroenjanssens.com/img/fLRgiHBB0l-453.jpeg 453w, https://jeroenjanssens.com/img/fLRgiHBB0l-604.jpeg 604w, https://jeroenjanssens.com/img/fLRgiHBB0l-907.jpeg 907w, https://jeroenjanssens.com/img/fLRgiHBB0l-1209.jpeg 1209w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 90%;" src="https://jeroenjanssens.com/img/fLRgiHBB0l-302.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
</div>
<div class="mx-auto w-3/4 md:w-1/3 ">
<figure>
<a href="https://jeroenjanssens.com/img/plotnine-grammar-of-graphics-for-python/plotnine-unnamed-chunk-25-1.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/CZS6rFs5bW-302.webp 302w, https://jeroenjanssens.com/img/CZS6rFs5bW-453.webp 453w, https://jeroenjanssens.com/img/CZS6rFs5bW-604.webp 604w, https://jeroenjanssens.com/img/CZS6rFs5bW-907.webp 907w, https://jeroenjanssens.com/img/CZS6rFs5bW-1209.webp 1209w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/CZS6rFs5bW-302.webp 302w, https://jeroenjanssens.com/img/CZS6rFs5bW-453.webp 453w, https://jeroenjanssens.com/img/CZS6rFs5bW-604.webp 604w, https://jeroenjanssens.com/img/CZS6rFs5bW-907.webp 907w, https://jeroenjanssens.com/img/CZS6rFs5bW-1209.webp 1209w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/CZS6rFs5bW-302.jpeg 302w, https://jeroenjanssens.com/img/CZS6rFs5bW-453.jpeg 453w, https://jeroenjanssens.com/img/CZS6rFs5bW-604.jpeg 604w, https://jeroenjanssens.com/img/CZS6rFs5bW-907.jpeg 907w, https://jeroenjanssens.com/img/CZS6rFs5bW-1209.jpeg 1209w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/CZS6rFs5bW-302.jpeg 302w, https://jeroenjanssens.com/img/CZS6rFs5bW-453.jpeg 453w, https://jeroenjanssens.com/img/CZS6rFs5bW-604.jpeg 604w, https://jeroenjanssens.com/img/CZS6rFs5bW-907.jpeg 907w, https://jeroenjanssens.com/img/CZS6rFs5bW-1209.jpeg 1209w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 90%;" src="https://jeroenjanssens.com/img/CZS6rFs5bW-302.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
</div>
<div class="mx-auto w-3/4 md:w-1/3 ">
<figure>
<a href="https://jeroenjanssens.com/img/plotnine-grammar-of-graphics-for-python/plotnine-unnamed-chunk-26-1.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/PEpMT3vbBu-302.webp 302w, https://jeroenjanssens.com/img/PEpMT3vbBu-453.webp 453w, https://jeroenjanssens.com/img/PEpMT3vbBu-604.webp 604w, https://jeroenjanssens.com/img/PEpMT3vbBu-907.webp 907w, https://jeroenjanssens.com/img/PEpMT3vbBu-1209.webp 1209w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/PEpMT3vbBu-302.webp 302w, https://jeroenjanssens.com/img/PEpMT3vbBu-453.webp 453w, https://jeroenjanssens.com/img/PEpMT3vbBu-604.webp 604w, https://jeroenjanssens.com/img/PEpMT3vbBu-907.webp 907w, https://jeroenjanssens.com/img/PEpMT3vbBu-1209.webp 1209w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/PEpMT3vbBu-302.jpeg 302w, https://jeroenjanssens.com/img/PEpMT3vbBu-453.jpeg 453w, https://jeroenjanssens.com/img/PEpMT3vbBu-604.jpeg 604w, https://jeroenjanssens.com/img/PEpMT3vbBu-907.jpeg 907w, https://jeroenjanssens.com/img/PEpMT3vbBu-1209.jpeg 1209w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/PEpMT3vbBu-302.jpeg 302w, https://jeroenjanssens.com/img/PEpMT3vbBu-453.jpeg 453w, https://jeroenjanssens.com/img/PEpMT3vbBu-604.jpeg 604w, https://jeroenjanssens.com/img/PEpMT3vbBu-907.jpeg 907w, https://jeroenjanssens.com/img/PEpMT3vbBu-1209.jpeg 1209w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 90%;" src="https://jeroenjanssens.com/img/PEpMT3vbBu-302.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
</div>
</div>
<p>To display multiple geoms in the same plot, add multiple geom functions
to <code>ggplot()</code>:</p>
<pre class="language-python"><code class="language-python">ggplot<span class="token punctuation">(</span>data<span class="token operator">=</span>mpg<span class="token punctuation">)</span> <span class="token operator">+</span>\<br />geom_point<span class="token punctuation">(</span>mapping<span class="token operator">=</span>aes<span class="token punctuation">(</span>x<span class="token operator">=</span><span class="token string">"displ"</span><span class="token punctuation">,</span> y<span class="token operator">=</span><span class="token string">"hwy"</span><span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token operator">+</span>\<br />geom_smooth<span class="token punctuation">(</span>mapping<span class="token operator">=</span>aes<span class="token punctuation">(</span>x<span class="token operator">=</span><span class="token string">"displ"</span><span class="token punctuation">,</span> y<span class="token operator">=</span><span class="token string">"hwy"</span><span class="token punctuation">)</span><span class="token punctuation">)</span></code></pre>
<figure>
<a href="https://jeroenjanssens.com/img/plotnine-grammar-of-graphics-for-python/plotnine-unnamed-chunk-27-1.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/pZtw2LW8KD-302.webp 302w, https://jeroenjanssens.com/img/pZtw2LW8KD-453.webp 453w, https://jeroenjanssens.com/img/pZtw2LW8KD-604.webp 604w, https://jeroenjanssens.com/img/pZtw2LW8KD-907.webp 907w, https://jeroenjanssens.com/img/pZtw2LW8KD-1209.webp 1209w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/pZtw2LW8KD-302.webp 302w, https://jeroenjanssens.com/img/pZtw2LW8KD-453.webp 453w, https://jeroenjanssens.com/img/pZtw2LW8KD-604.webp 604w, https://jeroenjanssens.com/img/pZtw2LW8KD-907.webp 907w, https://jeroenjanssens.com/img/pZtw2LW8KD-1209.webp 1209w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/pZtw2LW8KD-302.jpeg 302w, https://jeroenjanssens.com/img/pZtw2LW8KD-453.jpeg 453w, https://jeroenjanssens.com/img/pZtw2LW8KD-604.jpeg 604w, https://jeroenjanssens.com/img/pZtw2LW8KD-907.jpeg 907w, https://jeroenjanssens.com/img/pZtw2LW8KD-1209.jpeg 1209w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/pZtw2LW8KD-302.jpeg 302w, https://jeroenjanssens.com/img/pZtw2LW8KD-453.jpeg 453w, https://jeroenjanssens.com/img/pZtw2LW8KD-604.jpeg 604w, https://jeroenjanssens.com/img/pZtw2LW8KD-907.jpeg 907w, https://jeroenjanssens.com/img/pZtw2LW8KD-1209.jpeg 1209w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 90%;" src="https://jeroenjanssens.com/img/pZtw2LW8KD-302.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
<p>This, however, introduces some duplication in our code. Imagine if you
wanted to change the y-axis to display <code>cty</code> instead of <code>hwy</code>. You’d
need to change the variable in two places, and you might forget to
update one. You can avoid this type of repetition by passing a set of
mappings to <code>ggplot()</code>. plotnine will treat these mappings as global
mappings that apply to each geom in the graph. In other words, this code
will produce the same plot as the previous code:</p>
<pre class="language-python"><code class="language-python">ggplot<span class="token punctuation">(</span>data<span class="token operator">=</span>mpg<span class="token punctuation">,</span> mapping<span class="token operator">=</span>aes<span class="token punctuation">(</span>x<span class="token operator">=</span><span class="token string">"displ"</span><span class="token punctuation">,</span> y<span class="token operator">=</span><span class="token string">"hwy"</span><span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token operator">+</span>\<br />geom_point<span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token operator">+</span>\<br />geom_smooth<span class="token punctuation">(</span><span class="token punctuation">)</span></code></pre>
<p>If you place mappings in a geom function, plotnine will treat them as
local mappings for the layer. It will use these mappings to extend or
overwrite the global mappings <em>for that layer only</em>. This makes it
possible to display different aesthetics in different layers.</p>
<pre class="language-python"><code class="language-python">ggplot<span class="token punctuation">(</span>data<span class="token operator">=</span>mpg<span class="token punctuation">,</span> mapping<span class="token operator">=</span>aes<span class="token punctuation">(</span>x<span class="token operator">=</span><span class="token string">"displ"</span><span class="token punctuation">,</span> y<span class="token operator">=</span><span class="token string">"hwy"</span><span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token operator">+</span>\<br />geom_point<span class="token punctuation">(</span>mapping<span class="token operator">=</span>aes<span class="token punctuation">(</span>color<span class="token operator">=</span><span class="token string">"class"</span><span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token operator">+</span>\<br />geom_smooth<span class="token punctuation">(</span><span class="token punctuation">)</span></code></pre>
<figure>
<a href="https://jeroenjanssens.com/img/plotnine-grammar-of-graphics-for-python/plotnine-unnamed-chunk-29-1.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/dJ_IDDeumn-302.webp 302w, https://jeroenjanssens.com/img/dJ_IDDeumn-453.webp 453w, https://jeroenjanssens.com/img/dJ_IDDeumn-604.webp 604w, https://jeroenjanssens.com/img/dJ_IDDeumn-907.webp 907w, https://jeroenjanssens.com/img/dJ_IDDeumn-1209.webp 1209w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/dJ_IDDeumn-302.webp 302w, https://jeroenjanssens.com/img/dJ_IDDeumn-453.webp 453w, https://jeroenjanssens.com/img/dJ_IDDeumn-604.webp 604w, https://jeroenjanssens.com/img/dJ_IDDeumn-907.webp 907w, https://jeroenjanssens.com/img/dJ_IDDeumn-1209.webp 1209w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/dJ_IDDeumn-302.jpeg 302w, https://jeroenjanssens.com/img/dJ_IDDeumn-453.jpeg 453w, https://jeroenjanssens.com/img/dJ_IDDeumn-604.jpeg 604w, https://jeroenjanssens.com/img/dJ_IDDeumn-907.jpeg 907w, https://jeroenjanssens.com/img/dJ_IDDeumn-1209.jpeg 1209w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/dJ_IDDeumn-302.jpeg 302w, https://jeroenjanssens.com/img/dJ_IDDeumn-453.jpeg 453w, https://jeroenjanssens.com/img/dJ_IDDeumn-604.jpeg 604w, https://jeroenjanssens.com/img/dJ_IDDeumn-907.jpeg 907w, https://jeroenjanssens.com/img/dJ_IDDeumn-1209.jpeg 1209w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 90%;" src="https://jeroenjanssens.com/img/dJ_IDDeumn-302.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
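<p>One way to picture how local mappings "extend or overwrite" the global
ones is as a dictionary merge. The sketch below is only a mental model:
plain dicts stand in for <code>aes()</code>, and the helper
<code>layer_mapping()</code> is a hypothetical name, not part of plotnine
or a description of its internals.</p>

```python
# Conceptual sketch only: plain dicts stand in for aes() mappings.
# layer_mapping() is a hypothetical helper, not a plotnine function.

def layer_mapping(global_mapping, local_mapping=None):
    """Return the mapping a single layer would effectively use."""
    merged = dict(global_mapping)       # start from a copy of the global mappings
    merged.update(local_mapping or {})  # local entries extend or overwrite them
    return merged

global_aes = {"x": "displ", "y": "hwy"}

point_aes = layer_mapping(global_aes, {"color": "class"})  # extends with color
smooth_aes = layer_mapping(global_aes)                     # inherits unchanged
cty_aes = layer_mapping(global_aes, {"y": "cty"})          # overwrites y

print(point_aes)
print(smooth_aes)
print(cty_aes)
```

<p>Because the merge works on a copy, the global mappings stay untouched,
which mirrors the "for that layer only" behaviour described above.</p>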
<p>You can use the same idea to specify different <code>data</code> for each layer.
Here, our smooth line displays just a subset of the <code>mpg</code> dataset, the
subcompact cars. The local data argument in <code>geom_smooth()</code> overrides
the global data argument in <code>ggplot()</code> for that layer only.</p>
<pre class="language-python"><code class="language-python">ggplot<span class="token punctuation">(</span>data<span class="token operator">=</span>mpg<span class="token punctuation">,</span> mapping<span class="token operator">=</span>aes<span class="token punctuation">(</span>x<span class="token operator">=</span><span class="token string">"displ"</span><span class="token punctuation">,</span> y<span class="token operator">=</span><span class="token string">"hwy"</span><span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token operator">+</span>\<br />geom_point<span class="token punctuation">(</span>mapping<span class="token operator">=</span>aes<span class="token punctuation">(</span>color<span class="token operator">=</span><span class="token string">"class"</span><span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token operator">+</span>\<br />geom_smooth<span class="token punctuation">(</span>data<span class="token operator">=</span>mpg<span class="token punctuation">.</span>loc<span class="token punctuation">[</span>mpg<span class="token punctuation">[</span><span class="token string">"class"</span><span class="token punctuation">]</span> <span class="token operator">==</span> <span class="token string">"subcompact"</span><span class="token punctuation">]</span><span class="token punctuation">,</span> se<span class="token operator">=</span><span class="token boolean">False</span><span class="token punctuation">)</span></code></pre>
<figure>
<a href="https://jeroenjanssens.com/img/plotnine-grammar-of-graphics-for-python/plotnine-unnamed-chunk-30-1.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/Z47jvQvYjj-302.webp 302w, https://jeroenjanssens.com/img/Z47jvQvYjj-453.webp 453w, https://jeroenjanssens.com/img/Z47jvQvYjj-604.webp 604w, https://jeroenjanssens.com/img/Z47jvQvYjj-907.webp 907w, https://jeroenjanssens.com/img/Z47jvQvYjj-1209.webp 1209w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/Z47jvQvYjj-302.webp 302w, https://jeroenjanssens.com/img/Z47jvQvYjj-453.webp 453w, https://jeroenjanssens.com/img/Z47jvQvYjj-604.webp 604w, https://jeroenjanssens.com/img/Z47jvQvYjj-907.webp 907w, https://jeroenjanssens.com/img/Z47jvQvYjj-1209.webp 1209w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/Z47jvQvYjj-302.jpeg 302w, https://jeroenjanssens.com/img/Z47jvQvYjj-453.jpeg 453w, https://jeroenjanssens.com/img/Z47jvQvYjj-604.jpeg 604w, https://jeroenjanssens.com/img/Z47jvQvYjj-907.jpeg 907w, https://jeroenjanssens.com/img/Z47jvQvYjj-1209.jpeg 1209w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/Z47jvQvYjj-302.jpeg 302w, https://jeroenjanssens.com/img/Z47jvQvYjj-453.jpeg 453w, https://jeroenjanssens.com/img/Z47jvQvYjj-604.jpeg 604w, https://jeroenjanssens.com/img/Z47jvQvYjj-907.jpeg 907w, https://jeroenjanssens.com/img/Z47jvQvYjj-1209.jpeg 1209w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 90%;" src="https://jeroenjanssens.com/img/Z47jvQvYjj-302.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
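<p>The same "local wins, for that layer only" rule can be sketched for
data. This is an illustration under stated assumptions: a list of dicts
stands in for the <code>mpg</code> DataFrame, the three rows are made up,
and <code>layer_data()</code> is a hypothetical helper rather than
anything plotnine exposes.</p>

```python
# Conceptual sketch only: a list of dicts stands in for the mpg DataFrame.
# The rows are made up for illustration and are not real mpg records.
cars = [
    {"displ": 1.8, "hwy": 29, "class": "compact"},
    {"displ": 2.0, "hwy": 33, "class": "subcompact"},
    {"displ": 5.7, "hwy": 17, "class": "suv"},
]

def layer_data(global_data, local_data=None):
    """Return the rows a layer would draw: local data wins when given."""
    return global_data if local_data is None else local_data

# Same filter as in the plot above, written for a list of dicts.
subcompact = [row for row in cars if row["class"] == "subcompact"]

point_rows = layer_data(cars)               # no local data: all rows
smooth_rows = layer_data(cars, subcompact)  # local data: the subset only

print(len(point_rows), len(smooth_rows))
```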
<h3><a href="https://r4ds.had.co.nz/data-visualisation.html#exercises-3">3.6.1</a> Exercises</h3>
<ol>
<li>
<p>What geom would you use to draw a line chart? A boxplot? A
histogram? An area chart?</p>
</li>
<li>
<p>Run this code in your head and predict what the output will look
like. Then, run the code in Python and check your predictions.</p>
<pre class="language-python"><code class="language-python">ggplot<span class="token punctuation">(</span>data<span class="token operator">=</span>mpg<span class="token punctuation">,</span> mapping<span class="token operator">=</span>aes<span class="token punctuation">(</span>x<span class="token operator">=</span><span class="token string">"displ"</span><span class="token punctuation">,</span> y<span class="token operator">=</span><span class="token string">"hwy"</span><span class="token punctuation">,</span> color<span class="token operator">=</span><span class="token string">"drv"</span><span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token operator">+</span>\<br />geom_point<span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token operator">+</span>\<br />geom_smooth<span class="token punctuation">(</span>se<span class="token operator">=</span><span class="token boolean">False</span><span class="token punctuation">)</span></code></pre>
</li>
<li>
<p>What does <code>show_legend=False</code> do? What happens if you remove it? Why
do you think I used it earlier in the chapter?</p>
</li>
<li>
<p>What does the <code>se</code> argument to <code>geom_smooth()</code> do?</p>
</li>
<li>
<p>Will these two graphs look different? Why/why not?</p>
<pre class="language-python"><code class="language-python">ggplot<span class="token punctuation">(</span>data<span class="token operator">=</span>mpg<span class="token punctuation">,</span> mapping<span class="token operator">=</span>aes<span class="token punctuation">(</span>x<span class="token operator">=</span><span class="token string">"displ"</span><span class="token punctuation">,</span> y<span class="token operator">=</span><span class="token string">"hwy"</span><span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token operator">+</span>\<br />geom_point<span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token operator">+</span>\<br />geom_smooth<span class="token punctuation">(</span><span class="token punctuation">)</span><br /><br />ggplot<span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token operator">+</span>\<br />geom_point<span class="token punctuation">(</span>data<span class="token operator">=</span>mpg<span class="token punctuation">,</span> mapping<span class="token operator">=</span>aes<span class="token punctuation">(</span>x<span class="token operator">=</span><span class="token string">"displ"</span><span class="token punctuation">,</span> y<span class="token operator">=</span><span class="token string">"hwy"</span><span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token operator">+</span>\<br />geom_smooth<span class="token punctuation">(</span>data<span class="token operator">=</span>mpg<span class="token punctuation">,</span> mapping<span class="token operator">=</span>aes<span class="token punctuation">(</span>x<span class="token operator">=</span><span class="token string">"displ"</span><span class="token punctuation">,</span> y<span class="token operator">=</span><span class="token string">"hwy"</span><span class="token punctuation">)</span><span class="token 
punctuation">)</span></code></pre>
</li>
<li>
<p>Recreate the Python code necessary to generate the following graphs.</p>
</li>
</ol>
<div class="flex flex-wrap md:flex-row mb-4 ml-8">
<div class="mx-auto w-5/6 md:w-1/2 ">
<figure>
<a href="https://jeroenjanssens.com/img/plotnine-grammar-of-graphics-for-python/plotnine-unnamed-chunk-33-1.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/2mAgA3pEg7-302.webp 302w, https://jeroenjanssens.com/img/2mAgA3pEg7-453.webp 453w, https://jeroenjanssens.com/img/2mAgA3pEg7-604.webp 604w, https://jeroenjanssens.com/img/2mAgA3pEg7-907.webp 907w, https://jeroenjanssens.com/img/2mAgA3pEg7-1209.webp 1209w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/2mAgA3pEg7-302.webp 302w, https://jeroenjanssens.com/img/2mAgA3pEg7-453.webp 453w, https://jeroenjanssens.com/img/2mAgA3pEg7-604.webp 604w, https://jeroenjanssens.com/img/2mAgA3pEg7-907.webp 907w, https://jeroenjanssens.com/img/2mAgA3pEg7-1209.webp 1209w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/2mAgA3pEg7-302.jpeg 302w, https://jeroenjanssens.com/img/2mAgA3pEg7-453.jpeg 453w, https://jeroenjanssens.com/img/2mAgA3pEg7-604.jpeg 604w, https://jeroenjanssens.com/img/2mAgA3pEg7-907.jpeg 907w, https://jeroenjanssens.com/img/2mAgA3pEg7-1209.jpeg 1209w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/2mAgA3pEg7-302.jpeg 302w, https://jeroenjanssens.com/img/2mAgA3pEg7-453.jpeg 453w, https://jeroenjanssens.com/img/2mAgA3pEg7-604.jpeg 604w, https://jeroenjanssens.com/img/2mAgA3pEg7-907.jpeg 907w, https://jeroenjanssens.com/img/2mAgA3pEg7-1209.jpeg 1209w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 90%;" src="https://jeroenjanssens.com/img/2mAgA3pEg7-302.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
</div>
<div class="mx-auto w-5/6 md:w-1/2 ">
<figure>
<a href="https://jeroenjanssens.com/img/plotnine-grammar-of-graphics-for-python/plotnine-unnamed-chunk-34-1.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/hGEP2l90IO-302.webp 302w, https://jeroenjanssens.com/img/hGEP2l90IO-453.webp 453w, https://jeroenjanssens.com/img/hGEP2l90IO-604.webp 604w, https://jeroenjanssens.com/img/hGEP2l90IO-907.webp 907w, https://jeroenjanssens.com/img/hGEP2l90IO-1209.webp 1209w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/hGEP2l90IO-302.webp 302w, https://jeroenjanssens.com/img/hGEP2l90IO-453.webp 453w, https://jeroenjanssens.com/img/hGEP2l90IO-604.webp 604w, https://jeroenjanssens.com/img/hGEP2l90IO-907.webp 907w, https://jeroenjanssens.com/img/hGEP2l90IO-1209.webp 1209w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/hGEP2l90IO-302.jpeg 302w, https://jeroenjanssens.com/img/hGEP2l90IO-453.jpeg 453w, https://jeroenjanssens.com/img/hGEP2l90IO-604.jpeg 604w, https://jeroenjanssens.com/img/hGEP2l90IO-907.jpeg 907w, https://jeroenjanssens.com/img/hGEP2l90IO-1209.jpeg 1209w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/hGEP2l90IO-302.jpeg 302w, https://jeroenjanssens.com/img/hGEP2l90IO-453.jpeg 453w, https://jeroenjanssens.com/img/hGEP2l90IO-604.jpeg 604w, https://jeroenjanssens.com/img/hGEP2l90IO-907.jpeg 907w, https://jeroenjanssens.com/img/hGEP2l90IO-1209.jpeg 1209w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 90%;" src="https://jeroenjanssens.com/img/hGEP2l90IO-302.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
</div>
<div class="mx-auto w-5/6 md:w-1/2 ">
<figure>
<a href="https://jeroenjanssens.com/img/plotnine-grammar-of-graphics-for-python/plotnine-unnamed-chunk-35-1.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/peEdl8D6FI-302.webp 302w, https://jeroenjanssens.com/img/peEdl8D6FI-453.webp 453w, https://jeroenjanssens.com/img/peEdl8D6FI-604.webp 604w, https://jeroenjanssens.com/img/peEdl8D6FI-907.webp 907w, https://jeroenjanssens.com/img/peEdl8D6FI-1209.webp 1209w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/peEdl8D6FI-302.webp 302w, https://jeroenjanssens.com/img/peEdl8D6FI-453.webp 453w, https://jeroenjanssens.com/img/peEdl8D6FI-604.webp 604w, https://jeroenjanssens.com/img/peEdl8D6FI-907.webp 907w, https://jeroenjanssens.com/img/peEdl8D6FI-1209.webp 1209w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/peEdl8D6FI-302.jpeg 302w, https://jeroenjanssens.com/img/peEdl8D6FI-453.jpeg 453w, https://jeroenjanssens.com/img/peEdl8D6FI-604.jpeg 604w, https://jeroenjanssens.com/img/peEdl8D6FI-907.jpeg 907w, https://jeroenjanssens.com/img/peEdl8D6FI-1209.jpeg 1209w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/peEdl8D6FI-302.jpeg 302w, https://jeroenjanssens.com/img/peEdl8D6FI-453.jpeg 453w, https://jeroenjanssens.com/img/peEdl8D6FI-604.jpeg 604w, https://jeroenjanssens.com/img/peEdl8D6FI-907.jpeg 907w, https://jeroenjanssens.com/img/peEdl8D6FI-1209.jpeg 1209w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 90%;" src="https://jeroenjanssens.com/img/peEdl8D6FI-302.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
</div>
<div class="mx-auto w-5/6 md:w-1/2 ">
<figure>
<a href="https://jeroenjanssens.com/img/plotnine-grammar-of-graphics-for-python/plotnine-unnamed-chunk-36-1.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/xsxZ1-zBOO-302.webp 302w, https://jeroenjanssens.com/img/xsxZ1-zBOO-453.webp 453w, https://jeroenjanssens.com/img/xsxZ1-zBOO-604.webp 604w, https://jeroenjanssens.com/img/xsxZ1-zBOO-907.webp 907w, https://jeroenjanssens.com/img/xsxZ1-zBOO-1209.webp 1209w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/xsxZ1-zBOO-302.webp 302w, https://jeroenjanssens.com/img/xsxZ1-zBOO-453.webp 453w, https://jeroenjanssens.com/img/xsxZ1-zBOO-604.webp 604w, https://jeroenjanssens.com/img/xsxZ1-zBOO-907.webp 907w, https://jeroenjanssens.com/img/xsxZ1-zBOO-1209.webp 1209w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/xsxZ1-zBOO-302.jpeg 302w, https://jeroenjanssens.com/img/xsxZ1-zBOO-453.jpeg 453w, https://jeroenjanssens.com/img/xsxZ1-zBOO-604.jpeg 604w, https://jeroenjanssens.com/img/xsxZ1-zBOO-907.jpeg 907w, https://jeroenjanssens.com/img/xsxZ1-zBOO-1209.jpeg 1209w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/xsxZ1-zBOO-302.jpeg 302w, https://jeroenjanssens.com/img/xsxZ1-zBOO-453.jpeg 453w, https://jeroenjanssens.com/img/xsxZ1-zBOO-604.jpeg 604w, https://jeroenjanssens.com/img/xsxZ1-zBOO-907.jpeg 907w, https://jeroenjanssens.com/img/xsxZ1-zBOO-1209.jpeg 1209w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 90%;" src="https://jeroenjanssens.com/img/xsxZ1-zBOO-302.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
</div>
<div class="mx-auto w-5/6 md:w-1/2 ">
<figure>
<a href="https://jeroenjanssens.com/img/plotnine-grammar-of-graphics-for-python/plotnine-unnamed-chunk-37-1.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/ICQ8JLLI6V-302.webp 302w, https://jeroenjanssens.com/img/ICQ8JLLI6V-453.webp 453w, https://jeroenjanssens.com/img/ICQ8JLLI6V-604.webp 604w, https://jeroenjanssens.com/img/ICQ8JLLI6V-907.webp 907w, https://jeroenjanssens.com/img/ICQ8JLLI6V-1209.webp 1209w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/ICQ8JLLI6V-302.webp 302w, https://jeroenjanssens.com/img/ICQ8JLLI6V-453.webp 453w, https://jeroenjanssens.com/img/ICQ8JLLI6V-604.webp 604w, https://jeroenjanssens.com/img/ICQ8JLLI6V-907.webp 907w, https://jeroenjanssens.com/img/ICQ8JLLI6V-1209.webp 1209w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/ICQ8JLLI6V-302.jpeg 302w, https://jeroenjanssens.com/img/ICQ8JLLI6V-453.jpeg 453w, https://jeroenjanssens.com/img/ICQ8JLLI6V-604.jpeg 604w, https://jeroenjanssens.com/img/ICQ8JLLI6V-907.jpeg 907w, https://jeroenjanssens.com/img/ICQ8JLLI6V-1209.jpeg 1209w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/ICQ8JLLI6V-302.jpeg 302w, https://jeroenjanssens.com/img/ICQ8JLLI6V-453.jpeg 453w, https://jeroenjanssens.com/img/ICQ8JLLI6V-604.jpeg 604w, https://jeroenjanssens.com/img/ICQ8JLLI6V-907.jpeg 907w, https://jeroenjanssens.com/img/ICQ8JLLI6V-1209.jpeg 1209w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 90%;" src="https://jeroenjanssens.com/img/ICQ8JLLI6V-302.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
</div>
<div class="mx-auto w-5/6 md:w-1/2 ">
<figure>
<a href="https://jeroenjanssens.com/img/plotnine-grammar-of-graphics-for-python/plotnine-unnamed-chunk-38-1.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/kGsQlmvtM5-302.webp 302w, https://jeroenjanssens.com/img/kGsQlmvtM5-453.webp 453w, https://jeroenjanssens.com/img/kGsQlmvtM5-604.webp 604w, https://jeroenjanssens.com/img/kGsQlmvtM5-907.webp 907w, https://jeroenjanssens.com/img/kGsQlmvtM5-1209.webp 1209w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/kGsQlmvtM5-302.webp 302w, https://jeroenjanssens.com/img/kGsQlmvtM5-453.webp 453w, https://jeroenjanssens.com/img/kGsQlmvtM5-604.webp 604w, https://jeroenjanssens.com/img/kGsQlmvtM5-907.webp 907w, https://jeroenjanssens.com/img/kGsQlmvtM5-1209.webp 1209w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/kGsQlmvtM5-302.jpeg 302w, https://jeroenjanssens.com/img/kGsQlmvtM5-453.jpeg 453w, https://jeroenjanssens.com/img/kGsQlmvtM5-604.jpeg 604w, https://jeroenjanssens.com/img/kGsQlmvtM5-907.jpeg 907w, https://jeroenjanssens.com/img/kGsQlmvtM5-1209.jpeg 1209w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/kGsQlmvtM5-302.jpeg 302w, https://jeroenjanssens.com/img/kGsQlmvtM5-453.jpeg 453w, https://jeroenjanssens.com/img/kGsQlmvtM5-604.jpeg 604w, https://jeroenjanssens.com/img/kGsQlmvtM5-907.jpeg 907w, https://jeroenjanssens.com/img/kGsQlmvtM5-1209.jpeg 1209w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 90%;" src="https://jeroenjanssens.com/img/kGsQlmvtM5-302.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
</div>
</div>
<p>You can learn which stat a geom uses by inspecting the default value for
the <code>stat</code> argument. For example, <code>help(geom_bar)</code> (or <code>?geom_bar</code> in IPython) shows that the default
value for <code>stat</code> is “count”, which means that <code>geom_bar()</code> uses
<code>stat_count()</code>. <code>stat_count()</code> is documented on the same page as
<code>geom_bar()</code>, and if you scroll down you can find a section called
“Computed variables”. That describes how it computes two new variables:
<code>count</code> and <code>prop</code>.</p>
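<p>To make the two computed variables concrete, here is a rough
pure-Python sketch of the kind of summary a count stat produces. It is
not plotnine's implementation: the list <code>cuts</code> is made-up
data, <code>count_stat()</code> is a hypothetical helper, and all rows
are treated as a single group.</p>

```python
from collections import Counter

# Hypothetical stand-in for a "cut" column; the values are made up.
cuts = ["Ideal", "Premium", "Ideal", "Good", "Ideal", "Premium", "Fair", "Ideal"]

def count_stat(values):
    """Return one row per unique value with its count and proportion."""
    counts = Counter(values)
    total = sum(counts.values())
    return [
        {"x": value, "count": n, "prop": n / total}  # prop is relative to all rows
        for value, n in sorted(counts.items())
    ]

for row in count_stat(cuts):
    print(row)
```

<p>Each output row corresponds to one bar: <code>count</code> would set
its height by default, while <code>prop</code> gives the share of all
rows that fall in that bar.</p>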
<p>You can generally use geoms and stats interchangeably. For example, you
can recreate the previous plot using <code>stat_count()</code> instead of
<code>geom_bar()</code>:</p>
<pre class="language-python"><code class="language-python">ggplot<span class="token punctuation">(</span>data<span class="token operator">=</span>diamonds<span class="token punctuation">)</span> <span class="token operator">+</span>\<br />stat_count<span class="token punctuation">(</span>mapping<span class="token operator">=</span>aes<span class="token punctuation">(</span>x<span class="token operator">=</span><span class="token string">"cut"</span><span class="token punctuation">)</span><span class="token punctuation">)</span></code></pre>
<figure>
<a href="https://jeroenjanssens.com/img/plotnine-grammar-of-graphics-for-python/plotnine-unnamed-chunk-39-1.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/2Et3_wa-Nb-302.webp 302w, https://jeroenjanssens.com/img/2Et3_wa-Nb-453.webp 453w, https://jeroenjanssens.com/img/2Et3_wa-Nb-604.webp 604w, https://jeroenjanssens.com/img/2Et3_wa-Nb-907.webp 907w, https://jeroenjanssens.com/img/2Et3_wa-Nb-1209.webp 1209w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/2Et3_wa-Nb-302.webp 302w, https://jeroenjanssens.com/img/2Et3_wa-Nb-453.webp 453w, https://jeroenjanssens.com/img/2Et3_wa-Nb-604.webp 604w, https://jeroenjanssens.com/img/2Et3_wa-Nb-907.webp 907w, https://jeroenjanssens.com/img/2Et3_wa-Nb-1209.webp 1209w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/2Et3_wa-Nb-302.jpeg 302w, https://jeroenjanssens.com/img/2Et3_wa-Nb-453.jpeg 453w, https://jeroenjanssens.com/img/2Et3_wa-Nb-604.jpeg 604w, https://jeroenjanssens.com/img/2Et3_wa-Nb-907.jpeg 907w, https://jeroenjanssens.com/img/2Et3_wa-Nb-1209.jpeg 1209w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/2Et3_wa-Nb-302.jpeg 302w, https://jeroenjanssens.com/img/2Et3_wa-Nb-453.jpeg 453w, https://jeroenjanssens.com/img/2Et3_wa-Nb-604.jpeg 604w, https://jeroenjanssens.com/img/2Et3_wa-Nb-907.jpeg 907w, https://jeroenjanssens.com/img/2Et3_wa-Nb-1209.jpeg 1209w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 90%;" src="https://jeroenjanssens.com/img/2Et3_wa-Nb-302.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
<p>This works because every geom has a default stat, and every stat has a
default geom. This means that you can typically use geoms without
worrying about the underlying statistical transformation. There are
three reasons you might need to use a stat explicitly:</p>
<ol>
<li>
<p>You might want to override the default stat. In the code below, I
change the stat of <code>geom_bar()</code> from count (the default) to
identity. This lets me map the height of the bars to the raw values
of a “y” variable. Unfortunately, when people talk about bar charts
casually, they might mean either this type of bar chart, where
the height of each bar is already present in the data, or the
previous type, where the height of each bar is generated by
counting rows.</p>
<pre class="language-python"><code class="language-python">demo <span class="token operator">=</span> pd<span class="token punctuation">.</span>DataFrame<span class="token punctuation">(</span><span class="token punctuation">{</span><span class="token string">"cut"</span><span class="token punctuation">:</span> <span class="token punctuation">[</span><span class="token string">"Fair"</span><span class="token punctuation">,</span> <span class="token string">"Good"</span><span class="token punctuation">,</span> <span class="token string">"Very Good"</span><span class="token punctuation">,</span> <span class="token string">"Premium"</span><span class="token punctuation">,</span> <span class="token string">"Ideal"</span><span class="token punctuation">]</span><span class="token punctuation">,</span><br /> <span class="token string">"freq"</span><span class="token punctuation">:</span> <span class="token punctuation">[</span><span class="token number">1610</span><span class="token punctuation">,</span> <span class="token number">4906</span><span class="token punctuation">,</span> <span class="token number">12082</span><span class="token punctuation">,</span> <span class="token number">13791</span><span class="token punctuation">,</span> <span class="token number">21551</span><span class="token punctuation">]</span><span class="token punctuation">}</span><span class="token punctuation">)</span><br /><br />ggplot<span class="token punctuation">(</span>data<span class="token operator">=</span>demo<span class="token punctuation">)</span> <span class="token operator">+</span>\<br />geom_bar<span class="token punctuation">(</span>mapping<span class="token operator">=</span>aes<span class="token punctuation">(</span>x<span class="token operator">=</span><span class="token string">"cut"</span><span class="token punctuation">,</span> y<span class="token operator">=</span><span class="token string">"freq"</span><span class="token punctuation">)</span><span class="token 
punctuation">,</span> stat<span class="token operator">=</span><span class="token string">"identity"</span><span class="token punctuation">)</span></code></pre>
<figure>
<a href="https://jeroenjanssens.com/img/plotnine-grammar-of-graphics-for-python/plotnine-unnamed-chunk-40-1.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/q0i3XkflnL-302.webp 302w, https://jeroenjanssens.com/img/q0i3XkflnL-453.webp 453w, https://jeroenjanssens.com/img/q0i3XkflnL-604.webp 604w, https://jeroenjanssens.com/img/q0i3XkflnL-907.webp 907w, https://jeroenjanssens.com/img/q0i3XkflnL-1209.webp 1209w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/q0i3XkflnL-302.webp 302w, https://jeroenjanssens.com/img/q0i3XkflnL-453.webp 453w, https://jeroenjanssens.com/img/q0i3XkflnL-604.webp 604w, https://jeroenjanssens.com/img/q0i3XkflnL-907.webp 907w, https://jeroenjanssens.com/img/q0i3XkflnL-1209.webp 1209w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/q0i3XkflnL-302.jpeg 302w, https://jeroenjanssens.com/img/q0i3XkflnL-453.jpeg 453w, https://jeroenjanssens.com/img/q0i3XkflnL-604.jpeg 604w, https://jeroenjanssens.com/img/q0i3XkflnL-907.jpeg 907w, https://jeroenjanssens.com/img/q0i3XkflnL-1209.jpeg 1209w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/q0i3XkflnL-302.jpeg 302w, https://jeroenjanssens.com/img/q0i3XkflnL-453.jpeg 453w, https://jeroenjanssens.com/img/q0i3XkflnL-604.jpeg 604w, https://jeroenjanssens.com/img/q0i3XkflnL-907.jpeg 907w, https://jeroenjanssens.com/img/q0i3XkflnL-1209.jpeg 1209w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 90%;" src="https://jeroenjanssens.com/img/q0i3XkflnL-302.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
</li>
<li>
<p>You might want to override the default mapping from transformed
variables to aesthetics. For example, you might want to display a
bar chart of proportion, rather than count:</p>
<pre class="language-python"><code class="language-python">ggplot<span class="token punctuation">(</span>data<span class="token operator">=</span>diamonds<span class="token punctuation">)</span> <span class="token operator">+</span>\<br />geom_bar<span class="token punctuation">(</span>mapping<span class="token operator">=</span>aes<span class="token punctuation">(</span>x<span class="token operator">=</span><span class="token string">"cut"</span><span class="token punctuation">,</span> y<span class="token operator">=</span><span class="token string">"..prop.."</span><span class="token punctuation">,</span> group<span class="token operator">=</span><span class="token number">1</span><span class="token punctuation">)</span><span class="token punctuation">)</span></code></pre>
<figure>
<a href="https://jeroenjanssens.com/img/plotnine-grammar-of-graphics-for-python/plotnine-unnamed-chunk-41-1.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/s0QbIK_hj2-302.webp 302w, https://jeroenjanssens.com/img/s0QbIK_hj2-453.webp 453w, https://jeroenjanssens.com/img/s0QbIK_hj2-604.webp 604w, https://jeroenjanssens.com/img/s0QbIK_hj2-907.webp 907w, https://jeroenjanssens.com/img/s0QbIK_hj2-1209.webp 1209w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/s0QbIK_hj2-302.webp 302w, https://jeroenjanssens.com/img/s0QbIK_hj2-453.webp 453w, https://jeroenjanssens.com/img/s0QbIK_hj2-604.webp 604w, https://jeroenjanssens.com/img/s0QbIK_hj2-907.webp 907w, https://jeroenjanssens.com/img/s0QbIK_hj2-1209.webp 1209w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/s0QbIK_hj2-302.jpeg 302w, https://jeroenjanssens.com/img/s0QbIK_hj2-453.jpeg 453w, https://jeroenjanssens.com/img/s0QbIK_hj2-604.jpeg 604w, https://jeroenjanssens.com/img/s0QbIK_hj2-907.jpeg 907w, https://jeroenjanssens.com/img/s0QbIK_hj2-1209.jpeg 1209w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/s0QbIK_hj2-302.jpeg 302w, https://jeroenjanssens.com/img/s0QbIK_hj2-453.jpeg 453w, https://jeroenjanssens.com/img/s0QbIK_hj2-604.jpeg 604w, https://jeroenjanssens.com/img/s0QbIK_hj2-907.jpeg 907w, https://jeroenjanssens.com/img/s0QbIK_hj2-1209.jpeg 1209w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 90%;" src="https://jeroenjanssens.com/img/s0QbIK_hj2-302.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
To find the variables computed by the stat, look for the help
section titled "computed variables".
</li>
<li>
<p>You might want to draw greater attention to the statistical
transformation in your code. For example, you might use
<code>stat_summary()</code>, which summarises the y values for each unique x
value, to draw attention to the summary that you’re computing:</p>
<pre class="language-python"><code class="language-python">ggplot<span class="token punctuation">(</span>data<span class="token operator">=</span>diamonds<span class="token punctuation">)</span> <span class="token operator">+</span>\<br />stat_summary<span class="token punctuation">(</span><br /> mapping<span class="token operator">=</span>aes<span class="token punctuation">(</span>x<span class="token operator">=</span><span class="token string">"cut"</span><span class="token punctuation">,</span> y<span class="token operator">=</span><span class="token string">"depth"</span><span class="token punctuation">)</span><span class="token punctuation">,</span><br /> fun_ymin<span class="token operator">=</span>np<span class="token punctuation">.</span><span class="token builtin">min</span><span class="token punctuation">,</span><br /> fun_ymax<span class="token operator">=</span>np<span class="token punctuation">.</span><span class="token builtin">max</span><span class="token punctuation">,</span><br /> fun_y<span class="token operator">=</span>np<span class="token punctuation">.</span>median<br /><span class="token punctuation">)</span></code></pre>
<figure>
<a href="https://jeroenjanssens.com/img/plotnine-grammar-of-graphics-for-python/plotnine-unnamed-chunk-42-1.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/pGg4x8kj06-302.webp 302w, https://jeroenjanssens.com/img/pGg4x8kj06-453.webp 453w, https://jeroenjanssens.com/img/pGg4x8kj06-604.webp 604w, https://jeroenjanssens.com/img/pGg4x8kj06-907.webp 907w, https://jeroenjanssens.com/img/pGg4x8kj06-1209.webp 1209w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/pGg4x8kj06-302.webp 302w, https://jeroenjanssens.com/img/pGg4x8kj06-453.webp 453w, https://jeroenjanssens.com/img/pGg4x8kj06-604.webp 604w, https://jeroenjanssens.com/img/pGg4x8kj06-907.webp 907w, https://jeroenjanssens.com/img/pGg4x8kj06-1209.webp 1209w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/pGg4x8kj06-302.jpeg 302w, https://jeroenjanssens.com/img/pGg4x8kj06-453.jpeg 453w, https://jeroenjanssens.com/img/pGg4x8kj06-604.jpeg 604w, https://jeroenjanssens.com/img/pGg4x8kj06-907.jpeg 907w, https://jeroenjanssens.com/img/pGg4x8kj06-1209.jpeg 1209w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/pGg4x8kj06-302.jpeg 302w, https://jeroenjanssens.com/img/pGg4x8kj06-453.jpeg 453w, https://jeroenjanssens.com/img/pGg4x8kj06-604.jpeg 604w, https://jeroenjanssens.com/img/pGg4x8kj06-907.jpeg 907w, https://jeroenjanssens.com/img/pGg4x8kj06-1209.jpeg 1209w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 90%;" src="https://jeroenjanssens.com/img/pGg4x8kj06-302.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
</li>
</ol>
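<p>To make the summary concrete, here is a minimal sketch of the per-group computation that the <code>stat_summary()</code> call above asks for: the minimum, median and maximum of the y values for each unique x value. It uses only the standard library and a handful of made-up depth values. The names <code>rows</code> and <code>summarise_by_x</code> are illustrative assumptions and not part of plotnine, and this is not how plotnine implements the stat internally.</p>

```python
from collections import defaultdict
from statistics import median

# Made-up (cut, depth) pairs standing in for the diamonds data.
rows = [
    ("Fair", 64.0), ("Fair", 58.5), ("Fair", 66.0),
    ("Good", 63.0), ("Good", 60.0),
    ("Ideal", 61.5), ("Ideal", 62.0), ("Ideal", 61.0), ("Ideal", 62.5),
]


def summarise_by_x(pairs):
    """Return {x: (ymin, y, ymax)} using min, median and max of the y values."""
    groups = defaultdict(list)
    for x, y in pairs:
        groups[x].append(y)
    return {x: (min(ys), median(ys), max(ys)) for x, ys in groups.items()}


summary = summarise_by_x(rows)
for cut, (ymin, y, ymax) in summary.items():
    print(f"{cut}: ymin={ymin}, y={y}, ymax={ymax}")
```

<p>Each entry of <code>summary</code> corresponds to one vertical range with a point at the median in the plot above.</p>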
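<p>plotnine provides over 20 stats for you to use. You can get help on
each of them in the usual way, e.g. <code>?stat_bin</code> in IPython or Jupyter, or
<code>help(stat_bin)</code> in any Python session. To see a complete list of stats,
consult the plotnine API reference; the ggplot2 cheatsheet also gives a good overview.</p>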
<h3><a href="https://r4ds.had.co.nz/data-visualisation.html#exercises-4">3.7.1</a> Exercises</h3>
<ol>
<li>
<p>What is the default geom associated with <code>stat_summary()</code>? How could
you rewrite the previous plot to use that geom function instead of
the stat function?</p>
</li>
<li>
<p>What does <code>geom_col()</code> do? How is it different to <code>geom_bar()</code>?</p>
</li>
<li>
<p>Most geoms and stats come in pairs that are almost always used in
concert. Read through the documentation and make a list of all the
pairs. What do they have in common?</p>
</li>
<li>
<p>What variables does <code>stat_smooth()</code> compute? What parameters control
its behaviour?</p>
</li>
<li>
<p>In our proportion bar chart, we need to set <code>group=1</code>. Why? In other
words, what is the problem with these two graphs?</p>
<pre class="language-python"><code class="language-python">ggplot<span class="token punctuation">(</span>data<span class="token operator">=</span>diamonds<span class="token punctuation">)</span> <span class="token operator">+</span>\<br />geom_bar<span class="token punctuation">(</span>mapping<span class="token operator">=</span>aes<span class="token punctuation">(</span>x<span class="token operator">=</span><span class="token string">"cut"</span><span class="token punctuation">,</span> y<span class="token operator">=</span><span class="token string">"..prop.."</span><span class="token punctuation">)</span><span class="token punctuation">)</span><br /><br />ggplot<span class="token punctuation">(</span>data<span class="token operator">=</span>diamonds<span class="token punctuation">)</span> <span class="token operator">+</span>\<br />geom_bar<span class="token punctuation">(</span>mapping<span class="token operator">=</span>aes<span class="token punctuation">(</span>x<span class="token operator">=</span><span class="token string">"cut"</span><span class="token punctuation">,</span> fill<span class="token operator">=</span><span class="token string">"color"</span><span class="token punctuation">,</span> y<span class="token operator">=</span><span class="token string">"..prop.."</span><span class="token punctuation">)</span><span class="token punctuation">)</span></code></pre>
</li>
</ol>
<h2><a href="https://r4ds.had.co.nz/data-visualisation.html#position-adjustments">3.8</a> Position adjustments</h2>
<p>There’s one more piece of magic associated with bar charts. You can
colour a bar chart using either the <code>colour</code> aesthetic, or, more
usefully, <code>fill</code>:</p>
<pre class="language-python"><code class="language-python"><span class="token comment"># Left</span><br />ggplot<span class="token punctuation">(</span>data<span class="token operator">=</span>diamonds<span class="token punctuation">)</span> <span class="token operator">+</span>\<br />geom_bar<span class="token punctuation">(</span>mapping<span class="token operator">=</span>aes<span class="token punctuation">(</span>x<span class="token operator">=</span><span class="token string">"cut"</span><span class="token punctuation">,</span> colour<span class="token operator">=</span><span class="token string">"cut"</span><span class="token punctuation">)</span><span class="token punctuation">)</span><br /><br /><span class="token comment"># Right</span><br />ggplot<span class="token punctuation">(</span>data<span class="token operator">=</span>diamonds<span class="token punctuation">)</span> <span class="token operator">+</span>\<br />geom_bar<span class="token punctuation">(</span>mapping<span class="token operator">=</span>aes<span class="token punctuation">(</span>x<span class="token operator">=</span><span class="token string">"cut"</span><span class="token punctuation">,</span> fill<span class="token operator">=</span><span class="token string">"cut"</span><span class="token punctuation">)</span><span class="token punctuation">)</span></code></pre>
<div class="flex flex-wrap md:flex-row mb-4">
<div class="mx-auto md:w-1/2 ">
<figure>
<a href="https://jeroenjanssens.com/img/plotnine-grammar-of-graphics-for-python/plotnine-unnamed-chunk-45-1.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/PH9OUjUttj-302.webp 302w, https://jeroenjanssens.com/img/PH9OUjUttj-453.webp 453w, https://jeroenjanssens.com/img/PH9OUjUttj-604.webp 604w, https://jeroenjanssens.com/img/PH9OUjUttj-907.webp 907w, https://jeroenjanssens.com/img/PH9OUjUttj-1209.webp 1209w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/PH9OUjUttj-302.webp 302w, https://jeroenjanssens.com/img/PH9OUjUttj-453.webp 453w, https://jeroenjanssens.com/img/PH9OUjUttj-604.webp 604w, https://jeroenjanssens.com/img/PH9OUjUttj-907.webp 907w, https://jeroenjanssens.com/img/PH9OUjUttj-1209.webp 1209w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/PH9OUjUttj-302.jpeg 302w, https://jeroenjanssens.com/img/PH9OUjUttj-453.jpeg 453w, https://jeroenjanssens.com/img/PH9OUjUttj-604.jpeg 604w, https://jeroenjanssens.com/img/PH9OUjUttj-907.jpeg 907w, https://jeroenjanssens.com/img/PH9OUjUttj-1209.jpeg 1209w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/PH9OUjUttj-302.jpeg 302w, https://jeroenjanssens.com/img/PH9OUjUttj-453.jpeg 453w, https://jeroenjanssens.com/img/PH9OUjUttj-604.jpeg 604w, https://jeroenjanssens.com/img/PH9OUjUttj-907.jpeg 907w, https://jeroenjanssens.com/img/PH9OUjUttj-1209.jpeg 1209w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 90%;" src="https://jeroenjanssens.com/img/PH9OUjUttj-302.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
</div>
<div class="mx-auto md:w-1/2 ">
<figure>
<a href="https://jeroenjanssens.com/img/plotnine-grammar-of-graphics-for-python/plotnine-unnamed-chunk-46-1.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/xAwNM60eVB-302.webp 302w, https://jeroenjanssens.com/img/xAwNM60eVB-453.webp 453w, https://jeroenjanssens.com/img/xAwNM60eVB-604.webp 604w, https://jeroenjanssens.com/img/xAwNM60eVB-907.webp 907w, https://jeroenjanssens.com/img/xAwNM60eVB-1209.webp 1209w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/xAwNM60eVB-302.webp 302w, https://jeroenjanssens.com/img/xAwNM60eVB-453.webp 453w, https://jeroenjanssens.com/img/xAwNM60eVB-604.webp 604w, https://jeroenjanssens.com/img/xAwNM60eVB-907.webp 907w, https://jeroenjanssens.com/img/xAwNM60eVB-1209.webp 1209w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/xAwNM60eVB-302.jpeg 302w, https://jeroenjanssens.com/img/xAwNM60eVB-453.jpeg 453w, https://jeroenjanssens.com/img/xAwNM60eVB-604.jpeg 604w, https://jeroenjanssens.com/img/xAwNM60eVB-907.jpeg 907w, https://jeroenjanssens.com/img/xAwNM60eVB-1209.jpeg 1209w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/xAwNM60eVB-302.jpeg 302w, https://jeroenjanssens.com/img/xAwNM60eVB-453.jpeg 453w, https://jeroenjanssens.com/img/xAwNM60eVB-604.jpeg 604w, https://jeroenjanssens.com/img/xAwNM60eVB-907.jpeg 907w, https://jeroenjanssens.com/img/xAwNM60eVB-1209.jpeg 1209w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 90%;" src="https://jeroenjanssens.com/img/xAwNM60eVB-302.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
</div>
</div>
<p>Note what happens if you map the fill aesthetic to another variable,
like <code>clarity</code>: the bars are automatically stacked. Each coloured
rectangle represents a combination of <code>cut</code> and <code>clarity</code>.</p>
<pre class="language-python"><code class="language-python">ggplot<span class="token punctuation">(</span>data<span class="token operator">=</span>diamonds<span class="token punctuation">)</span> <span class="token operator">+</span>\<br />geom_bar<span class="token punctuation">(</span>mapping<span class="token operator">=</span>aes<span class="token punctuation">(</span>x<span class="token operator">=</span><span class="token string">"cut"</span><span class="token punctuation">,</span> fill<span class="token operator">=</span><span class="token string">"clarity"</span><span class="token punctuation">)</span><span class="token punctuation">)</span></code></pre>
<figure>
<a href="https://jeroenjanssens.com/img/plotnine-grammar-of-graphics-for-python/plotnine-unnamed-chunk-47-1.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/MMrd-8APLq-302.webp 302w, https://jeroenjanssens.com/img/MMrd-8APLq-453.webp 453w, https://jeroenjanssens.com/img/MMrd-8APLq-604.webp 604w, https://jeroenjanssens.com/img/MMrd-8APLq-907.webp 907w, https://jeroenjanssens.com/img/MMrd-8APLq-1209.webp 1209w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/MMrd-8APLq-302.webp 302w, https://jeroenjanssens.com/img/MMrd-8APLq-453.webp 453w, https://jeroenjanssens.com/img/MMrd-8APLq-604.webp 604w, https://jeroenjanssens.com/img/MMrd-8APLq-907.webp 907w, https://jeroenjanssens.com/img/MMrd-8APLq-1209.webp 1209w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/MMrd-8APLq-302.jpeg 302w, https://jeroenjanssens.com/img/MMrd-8APLq-453.jpeg 453w, https://jeroenjanssens.com/img/MMrd-8APLq-604.jpeg 604w, https://jeroenjanssens.com/img/MMrd-8APLq-907.jpeg 907w, https://jeroenjanssens.com/img/MMrd-8APLq-1209.jpeg 1209w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/MMrd-8APLq-302.jpeg 302w, https://jeroenjanssens.com/img/MMrd-8APLq-453.jpeg 453w, https://jeroenjanssens.com/img/MMrd-8APLq-604.jpeg 604w, https://jeroenjanssens.com/img/MMrd-8APLq-907.jpeg 907w, https://jeroenjanssens.com/img/MMrd-8APLq-1209.jpeg 1209w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 90%;" src="https://jeroenjanssens.com/img/MMrd-8APLq-302.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
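<p>The height of each coloured rectangle is simply the number of observations with that combination of <code>cut</code> and <code>clarity</code>. The following sketch shows that count on a few made-up observations, using only the standard library. The variable names are illustrative assumptions, and the real diamonds data contains far more rows and levels.</p>

```python
from collections import Counter

# Made-up (cut, clarity) observations; the real diamonds data has many more.
observations = [
    ("Ideal", "VS1"), ("Ideal", "VS1"), ("Ideal", "SI1"),
    ("Good", "SI1"), ("Good", "SI1"), ("Good", "VS1"),
    ("Ideal", "SI1"), ("Ideal", "SI1"),
]

# One count per (cut, clarity) combination: the height of one rectangle.
rectangle_heights = Counter(observations)

# The total height of each bar is the sum of its rectangles.
bar_heights = Counter(cut for cut, _ in observations)

for (cut, clarity), n in sorted(rectangle_heights.items()):
    print(f"{cut:>5} / {clarity}: {n}")
```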
<p>The stacking is performed automatically by the <strong>position adjustment</strong>
specified by the <code>position</code> argument. If you don’t want a stacked bar
chart, you can use one of three other options: <code>"identity"</code>, <code>"dodge"</code>
or <code>"fill"</code>.</p>
<ul>
<li>
<p><code>position="identity"</code> will place each object exactly where it falls in
the context of the graph. This is not very useful for bars, because it
overlaps them. To see that overlapping, we need to make the bars either
slightly transparent by setting <code>alpha</code> to a small value, or
completely transparent by setting <code>fill=None</code>.</p>
<pre class="language-python"><code class="language-python"><span class="token comment"># Left</span><br />ggplot<span class="token punctuation">(</span>data<span class="token operator">=</span>diamonds<span class="token punctuation">,</span> mapping<span class="token operator">=</span>aes<span class="token punctuation">(</span>x<span class="token operator">=</span><span class="token string">"cut"</span><span class="token punctuation">,</span> fill<span class="token operator">=</span><span class="token string">"clarity"</span><span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token operator">+</span>\<br />geom_bar<span class="token punctuation">(</span>alpha<span class="token operator">=</span><span class="token number">1</span><span class="token operator">/</span><span class="token number">5</span><span class="token punctuation">,</span> position<span class="token operator">=</span><span class="token string">"identity"</span><span class="token punctuation">)</span><br /><br /><span class="token comment"># Right</span><br />ggplot<span class="token punctuation">(</span>data<span class="token operator">=</span>diamonds<span class="token punctuation">,</span> mapping<span class="token operator">=</span>aes<span class="token punctuation">(</span>x<span class="token operator">=</span><span class="token string">"cut"</span><span class="token punctuation">,</span> colour<span class="token operator">=</span><span class="token string">"clarity"</span><span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token operator">+</span>\<br />geom_bar<span class="token punctuation">(</span>fill<span class="token operator">=</span><span class="token boolean">None</span><span class="token punctuation">,</span> position<span class="token operator">=</span><span class="token string">"identity"</span><span class="token punctuation">)</span></code></pre>
<div class="flex flex-wrap md:flex-row mb-4">
<div class="mx-auto md:w-1/2 ">
<figure>
<a href="https://jeroenjanssens.com/img/plotnine-grammar-of-graphics-for-python/plotnine-unnamed-chunk-49-1.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/lAuWlA2wmT-302.webp 302w, https://jeroenjanssens.com/img/lAuWlA2wmT-453.webp 453w, https://jeroenjanssens.com/img/lAuWlA2wmT-604.webp 604w, https://jeroenjanssens.com/img/lAuWlA2wmT-907.webp 907w, https://jeroenjanssens.com/img/lAuWlA2wmT-1209.webp 1209w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/lAuWlA2wmT-302.webp 302w, https://jeroenjanssens.com/img/lAuWlA2wmT-453.webp 453w, https://jeroenjanssens.com/img/lAuWlA2wmT-604.webp 604w, https://jeroenjanssens.com/img/lAuWlA2wmT-907.webp 907w, https://jeroenjanssens.com/img/lAuWlA2wmT-1209.webp 1209w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/lAuWlA2wmT-302.jpeg 302w, https://jeroenjanssens.com/img/lAuWlA2wmT-453.jpeg 453w, https://jeroenjanssens.com/img/lAuWlA2wmT-604.jpeg 604w, https://jeroenjanssens.com/img/lAuWlA2wmT-907.jpeg 907w, https://jeroenjanssens.com/img/lAuWlA2wmT-1209.jpeg 1209w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/lAuWlA2wmT-302.jpeg 302w, https://jeroenjanssens.com/img/lAuWlA2wmT-453.jpeg 453w, https://jeroenjanssens.com/img/lAuWlA2wmT-604.jpeg 604w, https://jeroenjanssens.com/img/lAuWlA2wmT-907.jpeg 907w, https://jeroenjanssens.com/img/lAuWlA2wmT-1209.jpeg 1209w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 90%;" src="https://jeroenjanssens.com/img/lAuWlA2wmT-302.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
</div>
<div class="mx-auto md:w-1/2 ">
<figure>
<a href="https://jeroenjanssens.com/img/plotnine-grammar-of-graphics-for-python/plotnine-unnamed-chunk-50-1.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/dkyTPqk-R7-302.webp 302w, https://jeroenjanssens.com/img/dkyTPqk-R7-453.webp 453w, https://jeroenjanssens.com/img/dkyTPqk-R7-604.webp 604w, https://jeroenjanssens.com/img/dkyTPqk-R7-907.webp 907w, https://jeroenjanssens.com/img/dkyTPqk-R7-1209.webp 1209w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/dkyTPqk-R7-302.webp 302w, https://jeroenjanssens.com/img/dkyTPqk-R7-453.webp 453w, https://jeroenjanssens.com/img/dkyTPqk-R7-604.webp 604w, https://jeroenjanssens.com/img/dkyTPqk-R7-907.webp 907w, https://jeroenjanssens.com/img/dkyTPqk-R7-1209.webp 1209w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/dkyTPqk-R7-302.jpeg 302w, https://jeroenjanssens.com/img/dkyTPqk-R7-453.jpeg 453w, https://jeroenjanssens.com/img/dkyTPqk-R7-604.jpeg 604w, https://jeroenjanssens.com/img/dkyTPqk-R7-907.jpeg 907w, https://jeroenjanssens.com/img/dkyTPqk-R7-1209.jpeg 1209w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/dkyTPqk-R7-302.jpeg 302w, https://jeroenjanssens.com/img/dkyTPqk-R7-453.jpeg 453w, https://jeroenjanssens.com/img/dkyTPqk-R7-604.jpeg 604w, https://jeroenjanssens.com/img/dkyTPqk-R7-907.jpeg 907w, https://jeroenjanssens.com/img/dkyTPqk-R7-1209.jpeg 1209w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 90%;" src="https://jeroenjanssens.com/img/dkyTPqk-R7-302.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
</div>
</div>
The identity position adjustment is more useful for 2D geoms, like points,
where it is the default.
</li>
<li>
<p><code>position="fill"</code> works like stacking, but makes each set of stacked
bars the same height. This makes it easier to compare proportions
across groups.</p>
<pre class="language-python"><code class="language-python">ggplot<span class="token punctuation">(</span>data<span class="token operator">=</span>diamonds<span class="token punctuation">)</span> <span class="token operator">+</span>\<br />geom_bar<span class="token punctuation">(</span>mapping<span class="token operator">=</span>aes<span class="token punctuation">(</span>x<span class="token operator">=</span><span class="token string">"cut"</span><span class="token punctuation">,</span> fill<span class="token operator">=</span><span class="token string">"clarity"</span><span class="token punctuation">)</span><span class="token punctuation">,</span> position<span class="token operator">=</span><span class="token string">"fill"</span><span class="token punctuation">)</span></code></pre>
<figure>
<a href="https://jeroenjanssens.com/img/plotnine-grammar-of-graphics-for-python/plotnine-unnamed-chunk-51-1.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/Gdnw8A69yU-302.webp 302w, https://jeroenjanssens.com/img/Gdnw8A69yU-453.webp 453w, https://jeroenjanssens.com/img/Gdnw8A69yU-604.webp 604w, https://jeroenjanssens.com/img/Gdnw8A69yU-907.webp 907w, https://jeroenjanssens.com/img/Gdnw8A69yU-1209.webp 1209w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/Gdnw8A69yU-302.webp 302w, https://jeroenjanssens.com/img/Gdnw8A69yU-453.webp 453w, https://jeroenjanssens.com/img/Gdnw8A69yU-604.webp 604w, https://jeroenjanssens.com/img/Gdnw8A69yU-907.webp 907w, https://jeroenjanssens.com/img/Gdnw8A69yU-1209.webp 1209w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/Gdnw8A69yU-302.jpeg 302w, https://jeroenjanssens.com/img/Gdnw8A69yU-453.jpeg 453w, https://jeroenjanssens.com/img/Gdnw8A69yU-604.jpeg 604w, https://jeroenjanssens.com/img/Gdnw8A69yU-907.jpeg 907w, https://jeroenjanssens.com/img/Gdnw8A69yU-1209.jpeg 1209w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/Gdnw8A69yU-302.jpeg 302w, https://jeroenjanssens.com/img/Gdnw8A69yU-453.jpeg 453w, https://jeroenjanssens.com/img/Gdnw8A69yU-604.jpeg 604w, https://jeroenjanssens.com/img/Gdnw8A69yU-907.jpeg 907w, https://jeroenjanssens.com/img/Gdnw8A69yU-1209.jpeg 1209w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 90%;" src="https://jeroenjanssens.com/img/Gdnw8A69yU-302.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
</li>
<li>
<p><code>position="dodge"</code> places overlapping objects directly <em>beside</em> one
another. This makes it easier to compare individual values.</p>
<pre class="language-python"><code class="language-python">ggplot<span class="token punctuation">(</span>data<span class="token operator">=</span>diamonds<span class="token punctuation">)</span> <span class="token operator">+</span>\<br />geom_bar<span class="token punctuation">(</span>mapping<span class="token operator">=</span>aes<span class="token punctuation">(</span>x<span class="token operator">=</span><span class="token string">"cut"</span><span class="token punctuation">,</span> fill<span class="token operator">=</span><span class="token string">"clarity"</span><span class="token punctuation">)</span><span class="token punctuation">,</span> position<span class="token operator">=</span><span class="token string">"dodge"</span><span class="token punctuation">)</span></code></pre>
<figure>
<a href="https://jeroenjanssens.com/img/plotnine-grammar-of-graphics-for-python/plotnine-unnamed-chunk-52-1.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/-G15UISoPS-302.webp 302w, https://jeroenjanssens.com/img/-G15UISoPS-453.webp 453w, https://jeroenjanssens.com/img/-G15UISoPS-604.webp 604w, https://jeroenjanssens.com/img/-G15UISoPS-907.webp 907w, https://jeroenjanssens.com/img/-G15UISoPS-1209.webp 1209w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/-G15UISoPS-302.webp 302w, https://jeroenjanssens.com/img/-G15UISoPS-453.webp 453w, https://jeroenjanssens.com/img/-G15UISoPS-604.webp 604w, https://jeroenjanssens.com/img/-G15UISoPS-907.webp 907w, https://jeroenjanssens.com/img/-G15UISoPS-1209.webp 1209w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/-G15UISoPS-302.jpeg 302w, https://jeroenjanssens.com/img/-G15UISoPS-453.jpeg 453w, https://jeroenjanssens.com/img/-G15UISoPS-604.jpeg 604w, https://jeroenjanssens.com/img/-G15UISoPS-907.jpeg 907w, https://jeroenjanssens.com/img/-G15UISoPS-1209.jpeg 1209w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/-G15UISoPS-302.jpeg 302w, https://jeroenjanssens.com/img/-G15UISoPS-453.jpeg 453w, https://jeroenjanssens.com/img/-G15UISoPS-604.jpeg 604w, https://jeroenjanssens.com/img/-G15UISoPS-907.jpeg 907w, https://jeroenjanssens.com/img/-G15UISoPS-1209.jpeg 1209w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 90%;" src="https://jeroenjanssens.com/img/-G15UISoPS-302.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
</li>
</ul>
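<p>The differences between these adjustments are easier to see as arithmetic. The sketch below takes the counts for the groups within a single bar and computes where each rectangle would end up under stacking, filling, dodging and identity. It is a conceptual illustration under stated assumptions, not plotnine's implementation: the function names are made up, the total dodge width of 0.9 is an assumed value, and the order in which plotnine actually stacks or dodges groups may differ.</p>

```python
def stack(counts):
    """Place rectangles on top of each other: returns {group: (ymin, ymax)}."""
    placed, top = {}, 0
    for group, n in counts.items():
        placed[group] = (top, top + n)
        top += n
    return placed


def fill(counts):
    """Like stack, but rescaled so that every bar has a total height of 1."""
    total = sum(counts.values())
    return {g: (lo / total, hi / total) for g, (lo, hi) in stack(counts).items()}


def dodge(counts, x=0.0, width=0.9):
    """Place rectangles side by side: returns {group: (xmin, xmax, height)}."""
    slot = width / len(counts)
    left = x - width / 2
    return {
        g: (left + i * slot, left + (i + 1) * slot, n)
        for i, (g, n) in enumerate(counts.items())
    }


def identity(counts):
    """Leave every rectangle starting at zero, so that they overlap."""
    return {g: (0, n) for g, n in counts.items()}


# Made-up counts of three clarity levels within a single cut.
counts = {"SI1": 2, "VS1": 6, "IF": 2}

print("stack:   ", stack(counts))
print("fill:    ", fill(counts))
print("dodge:   ", dodge(counts))
print("identity:", identity(counts))
```

<p>Stacking and filling change only the vertical extent of each rectangle, dodging changes only the horizontal extent, and identity changes nothing, which is why the bars overlap.</p>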
<p>There’s one other type of adjustment that’s not useful for bar charts,
but it can be very useful for scatterplots. Recall our first
scatterplot. Did you notice that the plot displays only 126 points, even
though there are 234 observations in the dataset?</p>
<figure>
<a href="https://jeroenjanssens.com/img/plotnine-grammar-of-graphics-for-python/plotnine-unnamed-chunk-53-1.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/EauI05WqZD-302.webp 302w, https://jeroenjanssens.com/img/EauI05WqZD-453.webp 453w, https://jeroenjanssens.com/img/EauI05WqZD-604.webp 604w, https://jeroenjanssens.com/img/EauI05WqZD-907.webp 907w, https://jeroenjanssens.com/img/EauI05WqZD-1209.webp 1209w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/EauI05WqZD-302.webp 302w, https://jeroenjanssens.com/img/EauI05WqZD-453.webp 453w, https://jeroenjanssens.com/img/EauI05WqZD-604.webp 604w, https://jeroenjanssens.com/img/EauI05WqZD-907.webp 907w, https://jeroenjanssens.com/img/EauI05WqZD-1209.webp 1209w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/EauI05WqZD-302.jpeg 302w, https://jeroenjanssens.com/img/EauI05WqZD-453.jpeg 453w, https://jeroenjanssens.com/img/EauI05WqZD-604.jpeg 604w, https://jeroenjanssens.com/img/EauI05WqZD-907.jpeg 907w, https://jeroenjanssens.com/img/EauI05WqZD-1209.jpeg 1209w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/EauI05WqZD-302.jpeg 302w, https://jeroenjanssens.com/img/EauI05WqZD-453.jpeg 453w, https://jeroenjanssens.com/img/EauI05WqZD-604.jpeg 604w, https://jeroenjanssens.com/img/EauI05WqZD-907.jpeg 907w, https://jeroenjanssens.com/img/EauI05WqZD-1209.jpeg 1209w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 90%;" src="https://jeroenjanssens.com/img/EauI05WqZD-302.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
<p>The values of <code>hwy</code> and <code>displ</code> are rounded so the points appear on a
grid and many points overlap each other. This problem is known as
<strong>overplotting</strong>. This arrangement makes it hard to see where the mass
of the data is. Are the data points spread equally throughout the graph,
or is there one special combination of <code>hwy</code> and <code>displ</code> that contains
109 values?</p>
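<p>One way to check for overplotting without looking at the plot is to count how many observations share each exact position. The sketch below does this for a few made-up <code>(displ, hwy)</code> pairs using only the standard library. The values are illustrative assumptions and are not taken from the <code>mpg</code> dataset.</p>

```python
from collections import Counter

# Made-up (displ, hwy) pairs, rounded in the same spirit as the mpg columns.
points = [
    (1.8, 29), (1.8, 29), (2.0, 31), (2.0, 31), (2.0, 31),
    (2.8, 26), (3.1, 27), (2.8, 26), (1.8, 26),
]

# How many observations share each exact position?
overlaps = Counter(points)

print(f"{len(points)} observations, {len(overlaps)} distinct positions")
for position, n in overlaps.most_common(2):
    print(f"{position} is drawn {n} times")
```

<p>Whenever the number of distinct positions is smaller than the number of observations, some points are drawn on top of each other.</p>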
<p>You can avoid this gridding by setting the position adjustment to
“jitter”. <code>position="jitter"</code> adds a small amount of random noise to
each point. This spreads the points out because no two points are likely
to receive the same amount of random noise.</p>
<pre class="language-python"><code class="language-python">ggplot<span class="token punctuation">(</span>data<span class="token operator">=</span>mpg<span class="token punctuation">)</span> <span class="token operator">+</span>\<br />geom_point<span class="token punctuation">(</span>mapping<span class="token operator">=</span>aes<span class="token punctuation">(</span>x<span class="token operator">=</span><span class="token string">"displ"</span><span class="token punctuation">,</span> y<span class="token operator">=</span><span class="token string">"hwy"</span><span class="token punctuation">)</span><span class="token punctuation">,</span> position<span class="token operator">=</span><span class="token string">"jitter"</span><span class="token punctuation">)</span></code></pre>
<figure>
<a href="https://jeroenjanssens.com/img/plotnine-grammar-of-graphics-for-python/plotnine-unnamed-chunk-54-1.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/1G-yV_xZ50-302.webp 302w, https://jeroenjanssens.com/img/1G-yV_xZ50-453.webp 453w, https://jeroenjanssens.com/img/1G-yV_xZ50-604.webp 604w, https://jeroenjanssens.com/img/1G-yV_xZ50-907.webp 907w, https://jeroenjanssens.com/img/1G-yV_xZ50-1209.webp 1209w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/1G-yV_xZ50-302.webp 302w, https://jeroenjanssens.com/img/1G-yV_xZ50-453.webp 453w, https://jeroenjanssens.com/img/1G-yV_xZ50-604.webp 604w, https://jeroenjanssens.com/img/1G-yV_xZ50-907.webp 907w, https://jeroenjanssens.com/img/1G-yV_xZ50-1209.webp 1209w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/1G-yV_xZ50-302.jpeg 302w, https://jeroenjanssens.com/img/1G-yV_xZ50-453.jpeg 453w, https://jeroenjanssens.com/img/1G-yV_xZ50-604.jpeg 604w, https://jeroenjanssens.com/img/1G-yV_xZ50-907.jpeg 907w, https://jeroenjanssens.com/img/1G-yV_xZ50-1209.jpeg 1209w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/1G-yV_xZ50-302.jpeg 302w, https://jeroenjanssens.com/img/1G-yV_xZ50-453.jpeg 453w, https://jeroenjanssens.com/img/1G-yV_xZ50-604.jpeg 604w, https://jeroenjanssens.com/img/1G-yV_xZ50-907.jpeg 907w, https://jeroenjanssens.com/img/1G-yV_xZ50-1209.jpeg 1209w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 90%;" src="https://jeroenjanssens.com/img/1G-yV_xZ50-302.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
<p>Adding randomness seems like a strange way to improve your plot, but
while it makes your graph less accurate at small scales, it makes your
graph <em>more</em> revealing at large scales. Because this is such a useful
operation, plotnine comes with a shorthand for
<code>geom_point(position="jitter")</code>: <code>geom_jitter()</code>.</p>
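<p>Conceptually, jittering amounts to adding a small random offset to every coordinate. The following sketch illustrates the idea with the standard library's <code>random</code> module. The function <code>jitter</code>, its parameters and the noise amount of 0.4 are assumptions made for illustration; plotnine chooses its own defaults and this is not its implementation.</p>

```python
import random


def jitter(points, width=0.4, height=0.4, seed=None):
    """Shift every point by uniform noise of at most +/- width (x) and height (y)."""
    rng = random.Random(seed)
    return [
        (x + rng.uniform(-width, width), y + rng.uniform(-height, height))
        for x, y in points
    ]


# Three observations that share one position, plus one that stands alone.
points = [(2.0, 31), (2.0, 31), (2.0, 31), (2.8, 26)]
jittered = jitter(points, seed=42)

print(len(set(points)), "distinct positions before jittering")
print(len(set(jittered)), "distinct positions after jittering")
```

<p>Passing a seed makes the noise reproducible, which is convenient when you want the same plot every time you run your code.</p>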
<p>To learn more about a position adjustment, look up its help page. In
IPython or Jupyter you can type <code>?position_dodge</code>, <code>?position_fill</code>,
<code>?position_identity</code>, <code>?position_jitter</code>, or <code>?position_stack</code>;
in a plain Python session, use <code>help(position_dodge)</code> and so on.</p>
<h3><a href="https://r4ds.had.co.nz/data-visualisation.html#exercises-5">3.8.1</a> Exercises</h3>
<ol>
<li>
<p>What is the problem with this plot? How could you improve it?</p>
<pre class="language-python"><code class="language-python">ggplot<span class="token punctuation">(</span>data<span class="token operator">=</span>mpg<span class="token punctuation">,</span> mapping<span class="token operator">=</span>aes<span class="token punctuation">(</span>x<span class="token operator">=</span><span class="token string">"cty"</span><span class="token punctuation">,</span> y<span class="token operator">=</span><span class="token string">"hwy"</span><span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token operator">+</span>\<br />geom_point<span class="token punctuation">(</span><span class="token punctuation">)</span></code></pre>
<figure>
<a href="https://jeroenjanssens.com/img/plotnine-grammar-of-graphics-for-python/plotnine-unnamed-chunk-55-1.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/UjKbVEPc1T-302.webp 302w, https://jeroenjanssens.com/img/UjKbVEPc1T-453.webp 453w, https://jeroenjanssens.com/img/UjKbVEPc1T-604.webp 604w, https://jeroenjanssens.com/img/UjKbVEPc1T-907.webp 907w, https://jeroenjanssens.com/img/UjKbVEPc1T-1209.webp 1209w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/UjKbVEPc1T-302.webp 302w, https://jeroenjanssens.com/img/UjKbVEPc1T-453.webp 453w, https://jeroenjanssens.com/img/UjKbVEPc1T-604.webp 604w, https://jeroenjanssens.com/img/UjKbVEPc1T-907.webp 907w, https://jeroenjanssens.com/img/UjKbVEPc1T-1209.webp 1209w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/UjKbVEPc1T-302.jpeg 302w, https://jeroenjanssens.com/img/UjKbVEPc1T-453.jpeg 453w, https://jeroenjanssens.com/img/UjKbVEPc1T-604.jpeg 604w, https://jeroenjanssens.com/img/UjKbVEPc1T-907.jpeg 907w, https://jeroenjanssens.com/img/UjKbVEPc1T-1209.jpeg 1209w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/UjKbVEPc1T-302.jpeg 302w, https://jeroenjanssens.com/img/UjKbVEPc1T-453.jpeg 453w, https://jeroenjanssens.com/img/UjKbVEPc1T-604.jpeg 604w, https://jeroenjanssens.com/img/UjKbVEPc1T-907.jpeg 907w, https://jeroenjanssens.com/img/UjKbVEPc1T-1209.jpeg 1209w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 90%;" src="https://jeroenjanssens.com/img/UjKbVEPc1T-302.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
</li>
<li>
<p>What parameters to <code>geom_jitter()</code> control the amount of jittering?</p>
</li>
<li>
<p>Compare and contrast <code>geom_jitter()</code> with <code>geom_count()</code>.</p>
</li>
<li>
<p>What’s the default position adjustment for <code>geom_boxplot()</code>? Create
a visualisation of the <code>mpg</code> dataset that demonstrates it.</p>
</li>
</ol>
<h2><a href="https://r4ds.had.co.nz/data-visualisation.html#coordinate-systems">3.9</a> Coordinate systems</h2>
<p>Coordinate systems are probably the most complicated part of plotnine.
The default coordinate system is the Cartesian coordinate system where
the x and y positions act independently to determine the location of
each point. There is one other coordinate system that is occasionally
helpful.<sup class="footnote-ref"><a href="https://jeroenjanssens.com/plotnine/#fn9" id="fnref9">[9]</a></sup></p>
<ul>
<li>
<p><code>coord_flip()</code> switches the x and y axes. This is useful, for
example, if you want horizontal boxplots. It’s also useful for long
labels, which are hard to fit on the x-axis without overlapping.</p>
<pre class="language-python"><code class="language-python"><span class="token comment"># Left</span><br />ggplot<span class="token punctuation">(</span>data<span class="token operator">=</span>mpg<span class="token punctuation">,</span> mapping<span class="token operator">=</span>aes<span class="token punctuation">(</span>x<span class="token operator">=</span><span class="token string">"class"</span><span class="token punctuation">,</span> y<span class="token operator">=</span><span class="token string">"hwy"</span><span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token operator">+</span>\<br />geom_boxplot<span class="token punctuation">(</span><span class="token punctuation">)</span><br /><br /><span class="token comment"># Right</span><br />ggplot<span class="token punctuation">(</span>data<span class="token operator">=</span>mpg<span class="token punctuation">,</span> mapping<span class="token operator">=</span>aes<span class="token punctuation">(</span>x<span class="token operator">=</span><span class="token string">"class"</span><span class="token punctuation">,</span> y<span class="token operator">=</span><span class="token string">"hwy"</span><span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token operator">+</span>\<br />geom_boxplot<span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token operator">+</span>\<br />coord_flip<span class="token punctuation">(</span><span class="token punctuation">)</span></code></pre>
<div class="flex flex-wrap md:flex-row mb-4">
<div class="mx-auto md:w-1/2 md:pl-8 md:pr-16">
<figure>
<a href="https://jeroenjanssens.com/img/plotnine-grammar-of-graphics-for-python/plotnine-unnamed-chunk-57-1.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/UEwzgE-m3n-302.webp 302w, https://jeroenjanssens.com/img/UEwzgE-m3n-453.webp 453w, https://jeroenjanssens.com/img/UEwzgE-m3n-604.webp 604w, https://jeroenjanssens.com/img/UEwzgE-m3n-907.webp 907w, https://jeroenjanssens.com/img/UEwzgE-m3n-1209.webp 1209w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/UEwzgE-m3n-302.webp 302w, https://jeroenjanssens.com/img/UEwzgE-m3n-453.webp 453w, https://jeroenjanssens.com/img/UEwzgE-m3n-604.webp 604w, https://jeroenjanssens.com/img/UEwzgE-m3n-907.webp 907w, https://jeroenjanssens.com/img/UEwzgE-m3n-1209.webp 1209w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/UEwzgE-m3n-302.jpeg 302w, https://jeroenjanssens.com/img/UEwzgE-m3n-453.jpeg 453w, https://jeroenjanssens.com/img/UEwzgE-m3n-604.jpeg 604w, https://jeroenjanssens.com/img/UEwzgE-m3n-907.jpeg 907w, https://jeroenjanssens.com/img/UEwzgE-m3n-1209.jpeg 1209w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/UEwzgE-m3n-302.jpeg 302w, https://jeroenjanssens.com/img/UEwzgE-m3n-453.jpeg 453w, https://jeroenjanssens.com/img/UEwzgE-m3n-604.jpeg 604w, https://jeroenjanssens.com/img/UEwzgE-m3n-907.jpeg 907w, https://jeroenjanssens.com/img/UEwzgE-m3n-1209.jpeg 1209w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 90%;" src="https://jeroenjanssens.com/img/UEwzgE-m3n-302.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
</div>
<div class="mx-auto md:w-1/2 md:pr-8 md:pl-0">
<figure>
<a href="https://jeroenjanssens.com/img/plotnine-grammar-of-graphics-for-python/plotnine-unnamed-chunk-58-1.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/26aHdQAYi3-302.webp 302w, https://jeroenjanssens.com/img/26aHdQAYi3-453.webp 453w, https://jeroenjanssens.com/img/26aHdQAYi3-604.webp 604w, https://jeroenjanssens.com/img/26aHdQAYi3-907.webp 907w, https://jeroenjanssens.com/img/26aHdQAYi3-1209.webp 1209w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/26aHdQAYi3-302.webp 302w, https://jeroenjanssens.com/img/26aHdQAYi3-453.webp 453w, https://jeroenjanssens.com/img/26aHdQAYi3-604.webp 604w, https://jeroenjanssens.com/img/26aHdQAYi3-907.webp 907w, https://jeroenjanssens.com/img/26aHdQAYi3-1209.webp 1209w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/26aHdQAYi3-302.jpeg 302w, https://jeroenjanssens.com/img/26aHdQAYi3-453.jpeg 453w, https://jeroenjanssens.com/img/26aHdQAYi3-604.jpeg 604w, https://jeroenjanssens.com/img/26aHdQAYi3-907.jpeg 907w, https://jeroenjanssens.com/img/26aHdQAYi3-1209.jpeg 1209w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/26aHdQAYi3-302.jpeg 302w, https://jeroenjanssens.com/img/26aHdQAYi3-453.jpeg 453w, https://jeroenjanssens.com/img/26aHdQAYi3-604.jpeg 604w, https://jeroenjanssens.com/img/26aHdQAYi3-907.jpeg 907w, https://jeroenjanssens.com/img/26aHdQAYi3-1209.jpeg 1209w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 90%;" src="https://jeroenjanssens.com/img/26aHdQAYi3-302.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
</div>
</div>
</li>
</ul>
<h3><a href="https://r4ds.had.co.nz/data-visualisation.html#exercises-6">3.9.1</a> Exercises</h3>
<ol>
<li>
<p>What does <code>labs()</code> do? Read the documentation.</p>
</li>
<li>
<p>What does the plot below tell you about the relationship between
city and highway mpg? Why is <code>coord_fixed()</code> important? What does
<code>geom_abline()</code> do?</p>
<pre class="language-python"><code class="language-python">ggplot<span class="token punctuation">(</span>data<span class="token operator">=</span>mpg<span class="token punctuation">,</span> mapping<span class="token operator">=</span>aes<span class="token punctuation">(</span>x<span class="token operator">=</span><span class="token string">"cty"</span><span class="token punctuation">,</span> y<span class="token operator">=</span><span class="token string">"hwy"</span><span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token operator">+</span>\<br />geom_point<span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token operator">+</span>\<br />geom_abline<span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token operator">+</span>\<br />coord_fixed<span class="token punctuation">(</span><span class="token punctuation">)</span></code></pre>
<figure>
<a href="https://jeroenjanssens.com/img/plotnine-grammar-of-graphics-for-python/plotnine-unnamed-chunk-59-1.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/m4TJzR88lm-302.webp 302w, https://jeroenjanssens.com/img/m4TJzR88lm-453.webp 453w, https://jeroenjanssens.com/img/m4TJzR88lm-604.webp 604w, https://jeroenjanssens.com/img/m4TJzR88lm-907.webp 907w, https://jeroenjanssens.com/img/m4TJzR88lm-1209.webp 1209w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/m4TJzR88lm-302.webp 302w, https://jeroenjanssens.com/img/m4TJzR88lm-453.webp 453w, https://jeroenjanssens.com/img/m4TJzR88lm-604.webp 604w, https://jeroenjanssens.com/img/m4TJzR88lm-907.webp 907w, https://jeroenjanssens.com/img/m4TJzR88lm-1209.webp 1209w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/m4TJzR88lm-302.jpeg 302w, https://jeroenjanssens.com/img/m4TJzR88lm-453.jpeg 453w, https://jeroenjanssens.com/img/m4TJzR88lm-604.jpeg 604w, https://jeroenjanssens.com/img/m4TJzR88lm-907.jpeg 907w, https://jeroenjanssens.com/img/m4TJzR88lm-1209.jpeg 1209w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/m4TJzR88lm-302.jpeg 302w, https://jeroenjanssens.com/img/m4TJzR88lm-453.jpeg 453w, https://jeroenjanssens.com/img/m4TJzR88lm-604.jpeg 604w, https://jeroenjanssens.com/img/m4TJzR88lm-907.jpeg 907w, https://jeroenjanssens.com/img/m4TJzR88lm-1209.jpeg 1209w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 90%;" src="https://jeroenjanssens.com/img/m4TJzR88lm-302.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
</li>
</ol>
<h2><a href="https://r4ds.had.co.nz/data-visualisation.html#the-layered-grammar-of-graphics">3.10</a> The layered grammar of graphics</h2>
<p>In the previous sections, you learned much more than how to make
scatterplots, bar charts, and boxplots. You learned a foundation that
you can use to make <em>any</em> type of plot with plotnine. To see this, let’s
add position adjustments, stats, coordinate systems, and faceting to our
code template:</p>
<pre class="language-text"><code class="language-text">ggplot(data=&lt;DATA&gt;) +\<br />&lt;GEOM_FUNCTION&gt;(<br /> mapping=aes(&lt;MAPPINGS&gt;),<br /> stat=&lt;STAT&gt;,<br /> position=&lt;POSITION&gt;<br />) +\<br />&lt;COORDINATE_FUNCTION&gt; +\<br />&lt;FACET_FUNCTION&gt;</code></pre>
<p>Our new template takes seven parameters, the bracketed words that appear
in the template. In practice, you rarely need to supply all seven
parameters to make a graph because plotnine will provide useful defaults
for everything except the data, the mappings, and the geom function.</p>
<p>The seven parameters in the template compose the grammar of graphics, a
formal system for building plots. The grammar of graphics is based on
the insight that you can uniquely describe <em>any</em> plot as a combination
of a dataset, a geom, a set of mappings, a stat, a position adjustment,
a coordinate system, and a faceting scheme.</p>
<p>To see how this works, consider how you could build a basic plot from
scratch: you could start with a dataset and then transform it into the
information that you want to display (with a stat).</p>
<figure>
<a href="https://jeroenjanssens.com/img/plotnine-grammar-of-graphics-for-python/plotnine-visualization-grammar-1.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/_0WMzjC8hq-302.webp 302w, https://jeroenjanssens.com/img/_0WMzjC8hq-453.webp 453w, https://jeroenjanssens.com/img/_0WMzjC8hq-604.webp 604w, https://jeroenjanssens.com/img/_0WMzjC8hq-907.webp 907w, https://jeroenjanssens.com/img/_0WMzjC8hq-1209.webp 1209w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/_0WMzjC8hq-302.webp 302w, https://jeroenjanssens.com/img/_0WMzjC8hq-453.webp 453w, https://jeroenjanssens.com/img/_0WMzjC8hq-604.webp 604w, https://jeroenjanssens.com/img/_0WMzjC8hq-907.webp 907w, https://jeroenjanssens.com/img/_0WMzjC8hq-1209.webp 1209w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/_0WMzjC8hq-302.jpeg 302w, https://jeroenjanssens.com/img/_0WMzjC8hq-453.jpeg 453w, https://jeroenjanssens.com/img/_0WMzjC8hq-604.jpeg 604w, https://jeroenjanssens.com/img/_0WMzjC8hq-907.jpeg 907w, https://jeroenjanssens.com/img/_0WMzjC8hq-1209.jpeg 1209w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/_0WMzjC8hq-302.jpeg 302w, https://jeroenjanssens.com/img/_0WMzjC8hq-453.jpeg 453w, https://jeroenjanssens.com/img/_0WMzjC8hq-604.jpeg 604w, https://jeroenjanssens.com/img/_0WMzjC8hq-907.jpeg 907w, https://jeroenjanssens.com/img/_0WMzjC8hq-1209.jpeg 1209w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 90%;" src="https://jeroenjanssens.com/img/_0WMzjC8hq-302.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
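<p>As a rough illustration of this first step, the sketch below mimics a counting stat in plain Python: it reduces raw observations to one count per group, which is the information a bar chart displays. The toy <code>observations</code> list is made up for this example, and the code is not how plotnine computes its stats.</p>

```python
from collections import Counter

# Toy data (hypothetical): the class of each observed car.
observations = ["suv", "compact", "suv", "pickup", "compact", "suv"]

# The "stat" step: transform raw rows into the values to display.
counts = Counter(observations)

for car_class, n in sorted(counts.items()):
    print(car_class, n)
```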
<p>Next, you could choose a geometric object to represent each observation
in the transformed data. You could then use the aesthetic properties of
the geoms to represent variables in the data. You would map the values
of each variable to the levels of an aesthetic.</p>
<figure>
<a href="https://jeroenjanssens.com/img/plotnine-grammar-of-graphics-for-python/plotnine-visualization-grammar-2.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/2nJawudEYs-302.webp 302w, https://jeroenjanssens.com/img/2nJawudEYs-453.webp 453w, https://jeroenjanssens.com/img/2nJawudEYs-604.webp 604w, https://jeroenjanssens.com/img/2nJawudEYs-907.webp 907w, https://jeroenjanssens.com/img/2nJawudEYs-1209.webp 1209w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/2nJawudEYs-302.webp 302w, https://jeroenjanssens.com/img/2nJawudEYs-453.webp 453w, https://jeroenjanssens.com/img/2nJawudEYs-604.webp 604w, https://jeroenjanssens.com/img/2nJawudEYs-907.webp 907w, https://jeroenjanssens.com/img/2nJawudEYs-1209.webp 1209w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/2nJawudEYs-302.jpeg 302w, https://jeroenjanssens.com/img/2nJawudEYs-453.jpeg 453w, https://jeroenjanssens.com/img/2nJawudEYs-604.jpeg 604w, https://jeroenjanssens.com/img/2nJawudEYs-907.jpeg 907w, https://jeroenjanssens.com/img/2nJawudEYs-1209.jpeg 1209w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/2nJawudEYs-302.jpeg 302w, https://jeroenjanssens.com/img/2nJawudEYs-453.jpeg 453w, https://jeroenjanssens.com/img/2nJawudEYs-604.jpeg 604w, https://jeroenjanssens.com/img/2nJawudEYs-907.jpeg 907w, https://jeroenjanssens.com/img/2nJawudEYs-1209.jpeg 1209w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 90%;" src="https://jeroenjanssens.com/img/2nJawudEYs-302.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
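<p>The mapping step can be sketched in plain Python as well. Below, each distinct value of a variable is assigned one level of an aesthetic, here a fill colour. The <code>map_to_levels()</code> helper and the palette are hypothetical and only illustrate the idea; in practice plotnine's scales do this work for you.</p>

```python
def map_to_levels(values, levels):
    """Assign each distinct value one aesthetic level, in sorted order.

    Hypothetical helper for illustration only.
    """
    distinct = sorted(set(values))
    if len(distinct) > len(levels):
        raise ValueError("not enough aesthetic levels for the distinct values")
    return dict(zip(distinct, levels))


classes = ["suv", "compact", "suv", "pickup"]
palette = ["red", "green", "blue"]  # made-up palette

scale = map_to_levels(classes, palette)
fills = [scale[c] for c in classes]  # one colour per observation
print(fills)
```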
<p>You’d then select a coordinate system to place the geoms into. You’d use
the location of the objects (which is itself an aesthetic property) to
display the values of the x and y variables. At that point, you would
have a complete graph, but you could further adjust the positions of the
geoms within the coordinate system (a position adjustment) or split the
graph into subplots (faceting). You could also extend the plot by adding
one or more additional layers, where each additional layer uses a
dataset, a geom, a set of mappings, a stat, and a position adjustment.</p>
<figure>
<a href="https://jeroenjanssens.com/img/plotnine-grammar-of-graphics-for-python/plotnine-visualization-grammar-3.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/AcnrrrNSah-302.webp 302w, https://jeroenjanssens.com/img/AcnrrrNSah-453.webp 453w, https://jeroenjanssens.com/img/AcnrrrNSah-604.webp 604w, https://jeroenjanssens.com/img/AcnrrrNSah-907.webp 907w, https://jeroenjanssens.com/img/AcnrrrNSah-1209.webp 1209w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/AcnrrrNSah-302.webp 302w, https://jeroenjanssens.com/img/AcnrrrNSah-453.webp 453w, https://jeroenjanssens.com/img/AcnrrrNSah-604.webp 604w, https://jeroenjanssens.com/img/AcnrrrNSah-907.webp 907w, https://jeroenjanssens.com/img/AcnrrrNSah-1209.webp 1209w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/AcnrrrNSah-302.jpeg 302w, https://jeroenjanssens.com/img/AcnrrrNSah-453.jpeg 453w, https://jeroenjanssens.com/img/AcnrrrNSah-604.jpeg 604w, https://jeroenjanssens.com/img/AcnrrrNSah-907.jpeg 907w, https://jeroenjanssens.com/img/AcnrrrNSah-1209.jpeg 1209w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/AcnrrrNSah-302.jpeg 302w, https://jeroenjanssens.com/img/AcnrrrNSah-453.jpeg 453w, https://jeroenjanssens.com/img/AcnrrrNSah-604.jpeg 604w, https://jeroenjanssens.com/img/AcnrrrNSah-907.jpeg 907w, https://jeroenjanssens.com/img/AcnrrrNSah-1209.jpeg 1209w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 90%;" src="https://jeroenjanssens.com/img/AcnrrrNSah-302.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
<p>You could use this method to build <em>any</em> plot that you imagine. In other
words, you can use the code template that you’ve learned in this chapter
to build hundreds of thousands of unique plots.</p>
<h1><a href="https://r4ds.had.co.nz/graphics-for-communication.html">28</a> Graphics for communication</h1>
<h2><a href="https://r4ds.had.co.nz/graphics-for-communication.html#introduction-19">28.1</a> Introduction</h2>
<p>Now that you understand your data, you need to <em>communicate</em> your
understanding to others. Your audience will likely not share your
background knowledge and will not be deeply invested in the data. To
help others quickly build up a good mental model of the data, you will
need to invest considerable effort in making your plots as
self-explanatory as possible. In this chapter, you’ll learn some of the
tools that plotnine provides to do so.</p>
<p>The rest of this tutorial focuses on the tools you need to create good
graphics. I assume that you know what you want, and just need to know
how to do it. For that reason, I highly recommend pairing this chapter
with a good general visualisation book. I particularly like <a href="https://amzn.com/0321934075"><em>The
Truthful Art</em></a>, by Alberto Cairo. It doesn’t
teach the mechanics of creating visualisations, but instead focuses on
what you need to think about in order to create effective graphics.</p>
<h2><a href="https://r4ds.had.co.nz/graphics-for-communication.html#label">28.2</a> Label</h2>
<p>The easiest place to start when turning an exploratory graphic into an
expository graphic is with good labels. You add labels with the <code>labs()</code>
function. This example adds a plot title:</p>
<pre class="language-python"><code class="language-python">ggplot<span class="token punctuation">(</span>mpg<span class="token punctuation">,</span> aes<span class="token punctuation">(</span><span class="token string">"displ"</span><span class="token punctuation">,</span> <span class="token string">"hwy"</span><span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token operator">+</span>\<br />geom_point<span class="token punctuation">(</span>aes<span class="token punctuation">(</span>color<span class="token operator">=</span><span class="token string">"class"</span><span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token operator">+</span>\<br />geom_smooth<span class="token punctuation">(</span>se<span class="token operator">=</span><span class="token boolean">False</span><span class="token punctuation">)</span> <span class="token operator">+</span>\<br />labs<span class="token punctuation">(</span>title<span class="token operator">=</span><span class="token string">"Fuel efficiency generally decreases with engine size"</span><span class="token punctuation">)</span></code></pre>
<figure>
<a href="https://jeroenjanssens.com/img/plotnine-grammar-of-graphics-for-python/plotnine-unnamed-chunk-63-1.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/7_VuALnZX--302.webp 302w, https://jeroenjanssens.com/img/7_VuALnZX--453.webp 453w, https://jeroenjanssens.com/img/7_VuALnZX--604.webp 604w, https://jeroenjanssens.com/img/7_VuALnZX--907.webp 907w, https://jeroenjanssens.com/img/7_VuALnZX--1209.webp 1209w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/7_VuALnZX--302.webp 302w, https://jeroenjanssens.com/img/7_VuALnZX--453.webp 453w, https://jeroenjanssens.com/img/7_VuALnZX--604.webp 604w, https://jeroenjanssens.com/img/7_VuALnZX--907.webp 907w, https://jeroenjanssens.com/img/7_VuALnZX--1209.webp 1209w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/7_VuALnZX--302.jpeg 302w, https://jeroenjanssens.com/img/7_VuALnZX--453.jpeg 453w, https://jeroenjanssens.com/img/7_VuALnZX--604.jpeg 604w, https://jeroenjanssens.com/img/7_VuALnZX--907.jpeg 907w, https://jeroenjanssens.com/img/7_VuALnZX--1209.jpeg 1209w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/7_VuALnZX--302.jpeg 302w, https://jeroenjanssens.com/img/7_VuALnZX--453.jpeg 453w, https://jeroenjanssens.com/img/7_VuALnZX--604.jpeg 604w, https://jeroenjanssens.com/img/7_VuALnZX--907.jpeg 907w, https://jeroenjanssens.com/img/7_VuALnZX--1209.jpeg 1209w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 90%;" src="https://jeroenjanssens.com/img/7_VuALnZX--302.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
<p>The purpose of a plot title is to summarise the main finding. Avoid
titles that just describe what the plot is, e.g. “A scatterplot of
engine displacement vs. fuel economy”.</p>
<p>You can also use <code>labs()</code> to replace the axis and legend titles.<sup class="footnote-ref"><a href="https://jeroenjanssens.com/plotnine/#fn10" id="fnref10">[10]</a></sup>
It’s usually a good idea to replace short variable names with more
detailed descriptions, and to include the units.</p>
<pre class="language-python"><code class="language-python">ggplot<span class="token punctuation">(</span>mpg<span class="token punctuation">,</span> aes<span class="token punctuation">(</span><span class="token string">"displ"</span><span class="token punctuation">,</span> <span class="token string">"hwy"</span><span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token operator">+</span>\<br />geom_point<span class="token punctuation">(</span>aes<span class="token punctuation">(</span>colour<span class="token operator">=</span><span class="token string">"class"</span><span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token operator">+</span>\<br />geom_smooth<span class="token punctuation">(</span>se<span class="token operator">=</span><span class="token boolean">False</span><span class="token punctuation">)</span> <span class="token operator">+</span>\<br />labs<span class="token punctuation">(</span>x<span class="token operator">=</span><span class="token string">"Engine displacement (L)"</span><span class="token punctuation">,</span><br /> y<span class="token operator">=</span><span class="token string">"Highway fuel economy (mpg)"</span><span class="token punctuation">,</span><br /> colour<span class="token operator">=</span><span class="token string">"Car type"</span><span class="token punctuation">)</span></code></pre>
<figure>
<a href="https://jeroenjanssens.com/img/plotnine-grammar-of-graphics-for-python/plotnine-unnamed-chunk-64-1.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/Z_GsC7qEhh-302.webp 302w, https://jeroenjanssens.com/img/Z_GsC7qEhh-453.webp 453w, https://jeroenjanssens.com/img/Z_GsC7qEhh-604.webp 604w, https://jeroenjanssens.com/img/Z_GsC7qEhh-907.webp 907w, https://jeroenjanssens.com/img/Z_GsC7qEhh-1209.webp 1209w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/Z_GsC7qEhh-302.webp 302w, https://jeroenjanssens.com/img/Z_GsC7qEhh-453.webp 453w, https://jeroenjanssens.com/img/Z_GsC7qEhh-604.webp 604w, https://jeroenjanssens.com/img/Z_GsC7qEhh-907.webp 907w, https://jeroenjanssens.com/img/Z_GsC7qEhh-1209.webp 1209w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/Z_GsC7qEhh-302.jpeg 302w, https://jeroenjanssens.com/img/Z_GsC7qEhh-453.jpeg 453w, https://jeroenjanssens.com/img/Z_GsC7qEhh-604.jpeg 604w, https://jeroenjanssens.com/img/Z_GsC7qEhh-907.jpeg 907w, https://jeroenjanssens.com/img/Z_GsC7qEhh-1209.jpeg 1209w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/Z_GsC7qEhh-302.jpeg 302w, https://jeroenjanssens.com/img/Z_GsC7qEhh-453.jpeg 453w, https://jeroenjanssens.com/img/Z_GsC7qEhh-604.jpeg 604w, https://jeroenjanssens.com/img/Z_GsC7qEhh-907.jpeg 907w, https://jeroenjanssens.com/img/Z_GsC7qEhh-1209.jpeg 1209w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 90%;" src="https://jeroenjanssens.com/img/Z_GsC7qEhh-302.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
<p>It’s possible to use mathematical equations instead of text strings. You
have to tell matplotlib, which plotnine uses to do the actual
plotting, to use LaTeX for rendering text:</p>
<pre class="language-python"><code class="language-python"><span class="token keyword">from</span> matplotlib <span class="token keyword">import</span> rc<br />rc<span class="token punctuation">(</span><span class="token string">'text'</span><span class="token punctuation">,</span> usetex<span class="token operator">=</span><span class="token boolean">True</span><span class="token punctuation">)</span><br /><br />df <span class="token operator">=</span> pd<span class="token punctuation">.</span>DataFrame<span class="token punctuation">(</span><span class="token punctuation">{</span><span class="token string">"x"</span><span class="token punctuation">:</span> np<span class="token punctuation">.</span>random<span class="token punctuation">.</span>uniform<span class="token punctuation">(</span>size<span class="token operator">=</span><span class="token number">10</span><span class="token punctuation">)</span><span class="token punctuation">,</span><br /> <span class="token string">"y"</span><span class="token punctuation">:</span> np<span class="token punctuation">.</span>random<span class="token punctuation">.</span>uniform<span class="token punctuation">(</span>size<span class="token operator">=</span><span class="token number">10</span><span class="token punctuation">)</span><span class="token punctuation">}</span><span class="token punctuation">)</span><br /><br />ggplot<span class="token punctuation">(</span>df<span class="token punctuation">,</span> aes<span class="token punctuation">(</span><span class="token string">"x"</span><span class="token punctuation">,</span> <span class="token string">"y"</span><span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token operator">+</span>\<br />geom_point<span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token operator">+</span>\<br />labs<span class="token punctuation">(</span>x<span class="token operator">=</span><span class="token 
string">"$\\sum_{i = 1}^n{x_i^2}$"</span><span class="token punctuation">,</span><br /> y<span class="token operator">=</span><span class="token string">"$\\alpha + \\beta + \\frac{\\delta}{\\theta}$"</span><span class="token punctuation">)</span></code></pre>
<figure>
<a href="https://jeroenjanssens.com/img/plotnine-grammar-of-graphics-for-python/plotnine-unnamed-chunk-65-1.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/1smV2Y9XhP-302.webp 302w, https://jeroenjanssens.com/img/1smV2Y9XhP-453.webp 453w, https://jeroenjanssens.com/img/1smV2Y9XhP-604.webp 604w, https://jeroenjanssens.com/img/1smV2Y9XhP-907.webp 907w, https://jeroenjanssens.com/img/1smV2Y9XhP-1209.webp 1209w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/1smV2Y9XhP-302.webp 302w, https://jeroenjanssens.com/img/1smV2Y9XhP-453.webp 453w, https://jeroenjanssens.com/img/1smV2Y9XhP-604.webp 604w, https://jeroenjanssens.com/img/1smV2Y9XhP-907.webp 907w, https://jeroenjanssens.com/img/1smV2Y9XhP-1209.webp 1209w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/1smV2Y9XhP-302.jpeg 302w, https://jeroenjanssens.com/img/1smV2Y9XhP-453.jpeg 453w, https://jeroenjanssens.com/img/1smV2Y9XhP-604.jpeg 604w, https://jeroenjanssens.com/img/1smV2Y9XhP-907.jpeg 907w, https://jeroenjanssens.com/img/1smV2Y9XhP-1209.jpeg 1209w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/1smV2Y9XhP-302.jpeg 302w, https://jeroenjanssens.com/img/1smV2Y9XhP-453.jpeg 453w, https://jeroenjanssens.com/img/1smV2Y9XhP-604.jpeg 604w, https://jeroenjanssens.com/img/1smV2Y9XhP-907.jpeg 907w, https://jeroenjanssens.com/img/1smV2Y9XhP-1209.jpeg 1209w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 90%;" src="https://jeroenjanssens.com/img/1smV2Y9XhP-302.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
<p>See <a href="https://matplotlib.org/3.1.1/tutorials/text/mathtext.html">the matplotlib
documentation</a>
for more information about how to write mathematical equations using
LaTeX.</p>
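<p>The labels above double every backslash because they are ordinary Python
string literals. As a small sketch (the variable names are only
illustrative), a raw string with the <code>r</code> prefix spells the same text
with single backslashes, and an unescaped backslash can silently turn into
a control character:</p>

```python
# The same label written with escaped backslashes and as a raw string.
escaped = "$\\alpha + \\beta + \\frac{\\delta}{\\theta}$"
raw = r"$\alpha + \beta + \frac{\delta}{\theta}$"
print(escaped == raw)

# Without escaping, "\t" is read as a tab character, so the label is corrupted.
broken = "$\theta$"
print("\t" in broken)
```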
<h3><a href="https://r4ds.had.co.nz/graphics-for-communication.html#exercises-71">28.2.1</a> Exercises</h3>
<ol>
<li>
<p>Create one plot on the fuel economy data with customised <code>title</code>,
<code>x</code>, <code>y</code>, and <code>colour</code> labels.</p>
</li>
<li>
<p>The <code>geom_smooth()</code> is somewhat misleading because the <code>hwy</code> for
large engines is skewed upwards due to the inclusion of lightweight
sports cars with big engines. Use your modelling tools to fit and
display a better model.</p>
</li>
<li>
<p>Take an exploratory graphic that you’ve created in the last month,
and add an informative title to make it easier for others to
understand.</p>
</li>
</ol>
<h2><a href="https://r4ds.had.co.nz/graphics-for-communication.html#annotations">28.3</a> Annotations</h2>
<p>In addition to labelling major components of your plot, it’s often
useful to label individual observations or groups of observations. The
first tool you have at your disposal is <code>geom_text()</code>. <code>geom_text()</code> is
similar to <code>geom_point()</code>, but it has an additional aesthetic: <code>label</code>.
This makes it possible to add textual labels to your plots.</p>
<p>There are two possible sources of labels. First, you might have a
DataFrame that provides labels. The plot below isn’t terribly useful,
but it illustrates a useful approach: pull out the most efficient car in
each class with pandas, and then label it on the plot:</p>
<pre class="language-python"><code class="language-python">best_in_class <span class="token operator">=</span> mpg\<br /><span class="token punctuation">.</span>sort_values<span class="token punctuation">(</span>by<span class="token operator">=</span><span class="token string">"hwy"</span><span class="token punctuation">,</span> ascending<span class="token operator">=</span><span class="token boolean">False</span><span class="token punctuation">)</span>\<br /><span class="token punctuation">.</span>groupby<span class="token punctuation">(</span><span class="token string">"class"</span><span class="token punctuation">)</span>\<br /><span class="token punctuation">.</span>first<span class="token punctuation">(</span><span class="token punctuation">)</span><br /><br />ggplot<span class="token punctuation">(</span>mpg<span class="token punctuation">,</span> aes<span class="token punctuation">(</span><span class="token string">"displ"</span><span class="token punctuation">,</span> <span class="token string">"hwy"</span><span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token operator">+</span>\<br />geom_point<span class="token punctuation">(</span>aes<span class="token punctuation">(</span>colour<span class="token operator">=</span><span class="token string">"class"</span><span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token operator">+</span>\<br />geom_text<span class="token punctuation">(</span>aes<span class="token punctuation">(</span>label<span class="token operator">=</span><span class="token string">"model"</span><span class="token punctuation">)</span><span class="token punctuation">,</span> data<span class="token operator">=</span>best_in_class<span class="token punctuation">)</span></code></pre>
<figure>
<a href="https://jeroenjanssens.com/img/plotnine-grammar-of-graphics-for-python/plotnine-unnamed-chunk-67-1.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/0gKU-B8OJc-302.webp 302w, https://jeroenjanssens.com/img/0gKU-B8OJc-453.webp 453w, https://jeroenjanssens.com/img/0gKU-B8OJc-604.webp 604w, https://jeroenjanssens.com/img/0gKU-B8OJc-907.webp 907w, https://jeroenjanssens.com/img/0gKU-B8OJc-1209.webp 1209w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/0gKU-B8OJc-302.webp 302w, https://jeroenjanssens.com/img/0gKU-B8OJc-453.webp 453w, https://jeroenjanssens.com/img/0gKU-B8OJc-604.webp 604w, https://jeroenjanssens.com/img/0gKU-B8OJc-907.webp 907w, https://jeroenjanssens.com/img/0gKU-B8OJc-1209.webp 1209w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/0gKU-B8OJc-302.jpeg 302w, https://jeroenjanssens.com/img/0gKU-B8OJc-453.jpeg 453w, https://jeroenjanssens.com/img/0gKU-B8OJc-604.jpeg 604w, https://jeroenjanssens.com/img/0gKU-B8OJc-907.jpeg 907w, https://jeroenjanssens.com/img/0gKU-B8OJc-1209.jpeg 1209w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/0gKU-B8OJc-302.jpeg 302w, https://jeroenjanssens.com/img/0gKU-B8OJc-453.jpeg 453w, https://jeroenjanssens.com/img/0gKU-B8OJc-604.jpeg 604w, https://jeroenjanssens.com/img/0gKU-B8OJc-907.jpeg 907w, https://jeroenjanssens.com/img/0gKU-B8OJc-1209.jpeg 1209w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 90%;" src="https://jeroenjanssens.com/img/0gKU-B8OJc-302.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
<p>This is hard to read because the labels overlap with each other, and
with the points. We can make things a little better by switching to
<code>geom_label()</code> which draws a rectangle behind the text. We also use the
<code>nudge_y</code> parameter to move the labels slightly above the corresponding
points:</p>
<pre class="language-python"><code class="language-python">ggplot<span class="token punctuation">(</span>mpg<span class="token punctuation">,</span> aes<span class="token punctuation">(</span><span class="token string">"displ"</span><span class="token punctuation">,</span> <span class="token string">"hwy"</span><span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token operator">+</span>\<br />geom_point<span class="token punctuation">(</span>aes<span class="token punctuation">(</span>colour<span class="token operator">=</span><span class="token string">"class"</span><span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token operator">+</span>\<br />geom_label<span class="token punctuation">(</span>aes<span class="token punctuation">(</span>label<span class="token operator">=</span><span class="token string">"model"</span><span class="token punctuation">)</span><span class="token punctuation">,</span> data<span class="token operator">=</span>best_in_class<span class="token punctuation">,</span> nudge_y<span class="token operator">=</span><span class="token number">2</span><span class="token punctuation">,</span> alpha<span class="token operator">=</span><span class="token number">0.5</span><span class="token punctuation">)</span></code></pre>
<figure>
<a href="https://jeroenjanssens.com/img/plotnine-grammar-of-graphics-for-python/plotnine-unnamed-chunk-68-1.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/Ss3u1xvbXH-302.webp 302w, https://jeroenjanssens.com/img/Ss3u1xvbXH-453.webp 453w, https://jeroenjanssens.com/img/Ss3u1xvbXH-604.webp 604w, https://jeroenjanssens.com/img/Ss3u1xvbXH-907.webp 907w, https://jeroenjanssens.com/img/Ss3u1xvbXH-1209.webp 1209w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/Ss3u1xvbXH-302.webp 302w, https://jeroenjanssens.com/img/Ss3u1xvbXH-453.webp 453w, https://jeroenjanssens.com/img/Ss3u1xvbXH-604.webp 604w, https://jeroenjanssens.com/img/Ss3u1xvbXH-907.webp 907w, https://jeroenjanssens.com/img/Ss3u1xvbXH-1209.webp 1209w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/Ss3u1xvbXH-302.jpeg 302w, https://jeroenjanssens.com/img/Ss3u1xvbXH-453.jpeg 453w, https://jeroenjanssens.com/img/Ss3u1xvbXH-604.jpeg 604w, https://jeroenjanssens.com/img/Ss3u1xvbXH-907.jpeg 907w, https://jeroenjanssens.com/img/Ss3u1xvbXH-1209.jpeg 1209w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/Ss3u1xvbXH-302.jpeg 302w, https://jeroenjanssens.com/img/Ss3u1xvbXH-453.jpeg 453w, https://jeroenjanssens.com/img/Ss3u1xvbXH-604.jpeg 604w, https://jeroenjanssens.com/img/Ss3u1xvbXH-907.jpeg 907w, https://jeroenjanssens.com/img/Ss3u1xvbXH-1209.jpeg 1209w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 90%;" src="https://jeroenjanssens.com/img/Ss3u1xvbXH-302.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
<p>That helps a bit, but if you look closely at the top-left corner,
you’ll notice that there are two labels practically on top of each
other. This happens because the highway mileage and displacement for the
best cars in the compact and subcompact categories are exactly the same.
There’s no way that we can fix this by applying the same transformation
to every label. Instead, we can use the <code>adjust_text</code> argument. This
argument, which employs the <code>adjustText</code> package under the hood,
automatically adjusts labels so that they don’t overlap:</p>
<pre class="language-python"><code class="language-python">ggplot<span class="token punctuation">(</span>mpg<span class="token punctuation">,</span> aes<span class="token punctuation">(</span><span class="token string">"displ"</span><span class="token punctuation">,</span> <span class="token string">"hwy"</span><span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token operator">+</span>\<br />geom_point<span class="token punctuation">(</span>aes<span class="token punctuation">(</span>colour<span class="token operator">=</span><span class="token string">"class"</span><span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token operator">+</span>\<br />geom_point<span class="token punctuation">(</span>data<span class="token operator">=</span>best_in_class<span class="token punctuation">,</span> fill<span class="token operator">=</span><span class="token string">'none'</span><span class="token punctuation">)</span> <span class="token operator">+</span>\<br />geom_label<span class="token punctuation">(</span>aes<span class="token punctuation">(</span>label<span class="token operator">=</span><span class="token string">"model"</span><span class="token punctuation">)</span><span class="token punctuation">,</span> data<span class="token operator">=</span>best_in_class<span class="token punctuation">,</span> adjust_text<span class="token operator">=</span><span class="token punctuation">{</span><br /> <span class="token string">'expand_points'</span><span class="token punctuation">:</span> <span class="token punctuation">(</span><span class="token number">1.5</span><span class="token punctuation">,</span> <span class="token number">1.5</span><span class="token punctuation">)</span><span class="token punctuation">,</span><br /> <span class="token string">'arrowprops'</span><span class="token punctuation">:</span> <span class="token punctuation">{</span><br /> <span class="token 
string">'arrowstyle'</span><span class="token punctuation">:</span> <span class="token string">'-'</span><br /> <span class="token punctuation">}</span><span class="token punctuation">}</span><span class="token punctuation">)</span></code></pre>
<figure>
<a href="https://jeroenjanssens.com/img/plotnine-grammar-of-graphics-for-python/plotnine-unnamed-chunk-69-1.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/-PX14f2E1v-302.webp 302w, https://jeroenjanssens.com/img/-PX14f2E1v-453.webp 453w, https://jeroenjanssens.com/img/-PX14f2E1v-604.webp 604w, https://jeroenjanssens.com/img/-PX14f2E1v-907.webp 907w, https://jeroenjanssens.com/img/-PX14f2E1v-1209.webp 1209w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/-PX14f2E1v-302.webp 302w, https://jeroenjanssens.com/img/-PX14f2E1v-453.webp 453w, https://jeroenjanssens.com/img/-PX14f2E1v-604.webp 604w, https://jeroenjanssens.com/img/-PX14f2E1v-907.webp 907w, https://jeroenjanssens.com/img/-PX14f2E1v-1209.webp 1209w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/-PX14f2E1v-302.jpeg 302w, https://jeroenjanssens.com/img/-PX14f2E1v-453.jpeg 453w, https://jeroenjanssens.com/img/-PX14f2E1v-604.jpeg 604w, https://jeroenjanssens.com/img/-PX14f2E1v-907.jpeg 907w, https://jeroenjanssens.com/img/-PX14f2E1v-1209.jpeg 1209w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/-PX14f2E1v-302.jpeg 302w, https://jeroenjanssens.com/img/-PX14f2E1v-453.jpeg 453w, https://jeroenjanssens.com/img/-PX14f2E1v-604.jpeg 604w, https://jeroenjanssens.com/img/-PX14f2E1v-907.jpeg 907w, https://jeroenjanssens.com/img/-PX14f2E1v-1209.jpeg 1209w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 90%;" src="https://jeroenjanssens.com/img/-PX14f2E1v-302.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
<p>Note another handy technique used here: I added a second layer of large,
hollow points to highlight the points that I’ve labelled.</p>
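<p>The <code>best_in_class</code> table used in the last three plots, and the tie
that made two labels collide, can be reproduced on a toy table. This is a
minimal sketch: the <code>cars</code> DataFrame and its values are invented for
illustration and are not the real <code>mpg</code> data.</p>

```python
import pandas as pd

# Hypothetical miniature stand-in for mpg (values invented for illustration).
cars = pd.DataFrame({
    "model": ["a", "b", "c", "d", "e"],
    "class": ["compact", "compact", "subcompact", "subcompact", "suv"],
    "displ": [1.9, 2.0, 1.9, 1.6, 4.0],
    "hwy": [44, 30, 44, 36, 20],
})

# Sort so the most efficient car comes first, then keep the first row per class.
# The grouping column "class" becomes the index of the result.
best = (
    cars.sort_values(by="hwy", ascending=False)
    .groupby("class")
    .first()
)

# Rows sharing both coordinates with another row would be labelled at the same spot.
tied = best[best.duplicated(subset=["displ", "hwy"], keep=False)]
print(tied)
```

<p>Here the compact and subcompact winners share the same <code>displ</code> and
<code>hwy</code>, which is the situation a constant <code>nudge_y</code> cannot resolve.</p>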
<p>You can sometimes use the same idea to replace the legend with labels
placed directly on the plot. It’s not wonderful for this plot, but it
isn’t too bad.<sup class="footnote-ref"><a href="https://jeroenjanssens.com/plotnine/#fn11" id="fnref11">[11]</a></sup> (<code>theme(legend_position="none")</code> turns the legend
off; we’ll talk about it more shortly.)</p>
<pre class="language-python"><code class="language-python">class_avg <span class="token operator">=</span> mpg\<br /><span class="token punctuation">.</span>groupby<span class="token punctuation">(</span><span class="token string">"class"</span><span class="token punctuation">)</span><span class="token punctuation">[</span><span class="token punctuation">[</span><span class="token string">"displ"</span><span class="token punctuation">,</span> <span class="token string">"hwy"</span><span class="token punctuation">]</span><span class="token punctuation">]</span><span class="token punctuation">.</span>median<span class="token punctuation">(</span><span class="token punctuation">)</span>\<br /><span class="token punctuation">.</span>reset_index<span class="token punctuation">(</span><span class="token punctuation">)</span><br /><br />ggplot<span class="token punctuation">(</span>mpg<span class="token punctuation">,</span> aes<span class="token punctuation">(</span><span class="token string">"displ"</span><span class="token punctuation">,</span> <span class="token string">"hwy"</span><span class="token punctuation">,</span> colour<span class="token operator">=</span><span class="token string">"class"</span><span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token operator">+</span>\<br />geom_point<span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token operator">+</span>\<br />geom_label<span class="token punctuation">(</span>aes<span class="token punctuation">(</span>label<span class="token operator">=</span><span class="token string">"class"</span><span class="token punctuation">)</span><span class="token punctuation">,</span> data<span class="token operator">=</span>class_avg<span class="token punctuation">,</span> size<span class="token operator">=</span><span class="token number">16</span><span class="token punctuation">,</span> label_size<span class="token operator">=</span><span class="token number">0</span><span class="token punctuation">,</span> 
adjust_text<span class="token operator">=</span><span class="token punctuation">{</span><span class="token string">'expand_points'</span><span class="token punctuation">:</span> <span class="token punctuation">(</span><span class="token number">0</span><span class="token punctuation">,</span> <span class="token number">0</span><span class="token punctuation">)</span><span class="token punctuation">}</span><span class="token punctuation">)</span> <span class="token operator">+</span>\<br />geom_point<span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token operator">+</span>\<br />theme<span class="token punctuation">(</span>legend_position<span class="token operator">=</span><span class="token string">"none"</span><span class="token punctuation">)</span></code></pre>
<figure>
<a href="https://jeroenjanssens.com/img/plotnine-grammar-of-graphics-for-python/plotnine-unnamed-chunk-70-1.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/tG2b4uuKXq-302.webp 302w, https://jeroenjanssens.com/img/tG2b4uuKXq-453.webp 453w, https://jeroenjanssens.com/img/tG2b4uuKXq-604.webp 604w, https://jeroenjanssens.com/img/tG2b4uuKXq-907.webp 907w, https://jeroenjanssens.com/img/tG2b4uuKXq-1209.webp 1209w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/tG2b4uuKXq-302.webp 302w, https://jeroenjanssens.com/img/tG2b4uuKXq-453.webp 453w, https://jeroenjanssens.com/img/tG2b4uuKXq-604.webp 604w, https://jeroenjanssens.com/img/tG2b4uuKXq-907.webp 907w, https://jeroenjanssens.com/img/tG2b4uuKXq-1209.webp 1209w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/tG2b4uuKXq-302.jpeg 302w, https://jeroenjanssens.com/img/tG2b4uuKXq-453.jpeg 453w, https://jeroenjanssens.com/img/tG2b4uuKXq-604.jpeg 604w, https://jeroenjanssens.com/img/tG2b4uuKXq-907.jpeg 907w, https://jeroenjanssens.com/img/tG2b4uuKXq-1209.jpeg 1209w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/tG2b4uuKXq-302.jpeg 302w, https://jeroenjanssens.com/img/tG2b4uuKXq-453.jpeg 453w, https://jeroenjanssens.com/img/tG2b4uuKXq-604.jpeg 604w, https://jeroenjanssens.com/img/tG2b4uuKXq-907.jpeg 907w, https://jeroenjanssens.com/img/tG2b4uuKXq-1209.jpeg 1209w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 90%;" src="https://jeroenjanssens.com/img/tG2b4uuKXq-302.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
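<p>The per-class medians behind these labels can be checked on a small
table. The sketch below uses invented values (the <code>cars</code> DataFrame is
hypothetical) and selects the columns with a list inside the brackets,
which is the form recent pandas versions expect when picking several
columns from a groupby.</p>

```python
import pandas as pd

# Hypothetical data, invented for illustration.
cars = pd.DataFrame({
    "class": ["compact", "compact", "compact", "suv", "suv"],
    "displ": [1.8, 2.0, 2.8, 4.0, 5.0],
    "hwy": [29, 31, 26, 20, 18],
})

# The inner list selects two columns from each group; reset_index() turns
# the group key back into an ordinary "class" column for plotting.
class_avg = cars.groupby("class")[["displ", "hwy"]].median().reset_index()
print(class_avg)
```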
<p>Alternatively, you might just want to add a single label to the plot,
but you’ll still need to create a DataFrame. Often, you want the label
in the corner of the plot, so it’s convenient to create a new DataFrame
using <code>pd.DataFrame()</code> and <code>max()</code> to compute the maximum values of x
and y.</p>
<pre class="language-python"><code class="language-python">label <span class="token operator">=</span> pd<span class="token punctuation">.</span>DataFrame<span class="token punctuation">(</span><span class="token punctuation">{</span><span class="token string">"displ"</span><span class="token punctuation">:</span> <span class="token punctuation">[</span>mpg<span class="token punctuation">.</span>displ<span class="token punctuation">.</span><span class="token builtin">max</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">]</span><span class="token punctuation">,</span><br /> <span class="token string">"hwy"</span><span class="token punctuation">:</span> <span class="token punctuation">[</span>mpg<span class="token punctuation">.</span>hwy<span class="token punctuation">.</span><span class="token builtin">max</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">]</span><span class="token punctuation">,</span><br /> <span class="token string">"label"</span><span class="token punctuation">:</span> <span class="token string">"Increasing engine size is \nrelated to decreasing fuel economy."</span><span class="token punctuation">}</span><span class="token punctuation">)</span><br /><br />ggplot<span class="token punctuation">(</span>mpg<span class="token punctuation">,</span> aes<span class="token punctuation">(</span><span class="token string">"displ"</span><span class="token punctuation">,</span> <span class="token string">"hwy"</span><span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token operator">+</span>\<br />geom_point<span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token operator">+</span>\<br />geom_text<span class="token punctuation">(</span>aes<span class="token punctuation">(</span>label<span class="token operator">=</span><span class="token 
string">"label"</span><span class="token punctuation">)</span><span class="token punctuation">,</span> data<span class="token operator">=</span>label<span class="token punctuation">,</span> va<span class="token operator">=</span><span class="token string">"top"</span><span class="token punctuation">,</span> ha<span class="token operator">=</span><span class="token string">"right"</span><span class="token punctuation">)</span></code></pre>
<figure>
<a href="https://jeroenjanssens.com/img/plotnine-grammar-of-graphics-for-python/plotnine-unnamed-chunk-71-1.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/gLFGeu-c0u-302.webp 302w, https://jeroenjanssens.com/img/gLFGeu-c0u-453.webp 453w, https://jeroenjanssens.com/img/gLFGeu-c0u-604.webp 604w, https://jeroenjanssens.com/img/gLFGeu-c0u-907.webp 907w, https://jeroenjanssens.com/img/gLFGeu-c0u-1209.webp 1209w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/gLFGeu-c0u-302.webp 302w, https://jeroenjanssens.com/img/gLFGeu-c0u-453.webp 453w, https://jeroenjanssens.com/img/gLFGeu-c0u-604.webp 604w, https://jeroenjanssens.com/img/gLFGeu-c0u-907.webp 907w, https://jeroenjanssens.com/img/gLFGeu-c0u-1209.webp 1209w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/gLFGeu-c0u-302.jpeg 302w, https://jeroenjanssens.com/img/gLFGeu-c0u-453.jpeg 453w, https://jeroenjanssens.com/img/gLFGeu-c0u-604.jpeg 604w, https://jeroenjanssens.com/img/gLFGeu-c0u-907.jpeg 907w, https://jeroenjanssens.com/img/gLFGeu-c0u-1209.jpeg 1209w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/gLFGeu-c0u-302.jpeg 302w, https://jeroenjanssens.com/img/gLFGeu-c0u-453.jpeg 453w, https://jeroenjanssens.com/img/gLFGeu-c0u-604.jpeg 604w, https://jeroenjanssens.com/img/gLFGeu-c0u-907.jpeg 907w, https://jeroenjanssens.com/img/gLFGeu-c0u-1209.jpeg 1209w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 90%;" src="https://jeroenjanssens.com/img/gLFGeu-c0u-302.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
<p>If you want to place the text exactly on the borders of the plot, you
can use <code>+np.inf</code> and <code>-np.inf</code>:</p>
<pre class="language-python"><code class="language-python">label <span class="token operator">=</span> pd<span class="token punctuation">.</span>DataFrame<span class="token punctuation">(</span><span class="token punctuation">{</span><span class="token string">"displ"</span><span class="token punctuation">:</span> <span class="token punctuation">[</span>np<span class="token punctuation">.</span>inf<span class="token punctuation">]</span><span class="token punctuation">,</span><br /> <span class="token string">"hwy"</span><span class="token punctuation">:</span> <span class="token punctuation">[</span>np<span class="token punctuation">.</span>inf<span class="token punctuation">]</span><span class="token punctuation">,</span><br /> <span class="token string">"label"</span><span class="token punctuation">:</span> <span class="token string">"Increasing engine size is \nrelated to decreasing fuel economy."</span><span class="token punctuation">}</span><span class="token punctuation">)</span><br /><br />ggplot<span class="token punctuation">(</span>mpg<span class="token punctuation">,</span> aes<span class="token punctuation">(</span><span class="token string">"displ"</span><span class="token punctuation">,</span> <span class="token string">"hwy"</span><span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token operator">+</span>\<br />geom_point<span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token operator">+</span>\<br />geom_text<span class="token punctuation">(</span>aes<span class="token punctuation">(</span>label<span class="token operator">=</span><span class="token string">"label"</span><span class="token punctuation">)</span><span class="token punctuation">,</span> data<span class="token operator">=</span>label<span class="token punctuation">,</span> va<span class="token operator">=</span><span class="token string">"top"</span><span class="token punctuation">,</span> ha<span 
class="token operator">=</span><span class="token string">"right"</span><span class="token punctuation">)</span></code></pre>
<figure>
<a href="https://jeroenjanssens.com/img/plotnine-grammar-of-graphics-for-python/plotnine-unnamed-chunk-72-1.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/H1_ohZ6HbL-302.webp 302w, https://jeroenjanssens.com/img/H1_ohZ6HbL-453.webp 453w, https://jeroenjanssens.com/img/H1_ohZ6HbL-604.webp 604w, https://jeroenjanssens.com/img/H1_ohZ6HbL-907.webp 907w, https://jeroenjanssens.com/img/H1_ohZ6HbL-1209.webp 1209w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/H1_ohZ6HbL-302.webp 302w, https://jeroenjanssens.com/img/H1_ohZ6HbL-453.webp 453w, https://jeroenjanssens.com/img/H1_ohZ6HbL-604.webp 604w, https://jeroenjanssens.com/img/H1_ohZ6HbL-907.webp 907w, https://jeroenjanssens.com/img/H1_ohZ6HbL-1209.webp 1209w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/H1_ohZ6HbL-302.jpeg 302w, https://jeroenjanssens.com/img/H1_ohZ6HbL-453.jpeg 453w, https://jeroenjanssens.com/img/H1_ohZ6HbL-604.jpeg 604w, https://jeroenjanssens.com/img/H1_ohZ6HbL-907.jpeg 907w, https://jeroenjanssens.com/img/H1_ohZ6HbL-1209.jpeg 1209w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/H1_ohZ6HbL-302.jpeg 302w, https://jeroenjanssens.com/img/H1_ohZ6HbL-453.jpeg 453w, https://jeroenjanssens.com/img/H1_ohZ6HbL-604.jpeg 604w, https://jeroenjanssens.com/img/H1_ohZ6HbL-907.jpeg 907w, https://jeroenjanssens.com/img/H1_ohZ6HbL-1209.jpeg 1209w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 90%;" src="https://jeroenjanssens.com/img/H1_ohZ6HbL-302.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
<p>In these examples, I manually broke the label up into lines using
<code>"\n"</code>. Another approach is to use the <code>fill</code> function from the
<code>textwrap</code> module to automatically add line breaks, given the number of
characters you want per line:</p>
<pre class="language-python"><code class="language-python"><span class="token keyword">from</span> textwrap <span class="token keyword">import</span> fill<br /><br /><span class="token keyword">print</span><span class="token punctuation">(</span>fill<span class="token punctuation">(</span><span class="token string">"Increasing engine size is related to decreasing fuel economy."</span><span class="token punctuation">,</span> width<span class="token operator">=</span><span class="token number">40</span><span class="token punctuation">)</span><span class="token punctuation">)</span></code></pre>
<pre class="language-text"><code class="language-text">Increasing engine size is related to<br />decreasing fuel economy.</code></pre>
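<p>If you wrap several labels, it can be convenient to put the call in a
small helper. The function name <code>wrap_label</code> below is hypothetical; it
only forwards to <code>textwrap.fill</code> and shows how a narrower width produces
more lines.</p>

```python
from textwrap import fill


def wrap_label(text, width=40):
    """Insert line breaks so that no line is longer than `width` characters."""
    return fill(text, width=width)


label = "Increasing engine size is related to decreasing fuel economy."

# A narrower width gives more, shorter lines.
narrow = wrap_label(label, width=20)
print(narrow)
```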
<p>Note the use of <code>ha</code> and <code>va</code> to control the alignment of the label. The
figure below shows all nine possible combinations.</p>
<figure>
<a href="https://jeroenjanssens.com/img/plotnine-grammar-of-graphics-for-python/plotnine-unnamed-chunk-74-1.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/fa-qZP8yC7-302.webp 302w, https://jeroenjanssens.com/img/fa-qZP8yC7-453.webp 453w, https://jeroenjanssens.com/img/fa-qZP8yC7-604.webp 604w, https://jeroenjanssens.com/img/fa-qZP8yC7-907.webp 907w, https://jeroenjanssens.com/img/fa-qZP8yC7-1209.webp 1209w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/fa-qZP8yC7-302.webp 302w, https://jeroenjanssens.com/img/fa-qZP8yC7-453.webp 453w, https://jeroenjanssens.com/img/fa-qZP8yC7-604.webp 604w, https://jeroenjanssens.com/img/fa-qZP8yC7-907.webp 907w, https://jeroenjanssens.com/img/fa-qZP8yC7-1209.webp 1209w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/fa-qZP8yC7-302.jpeg 302w, https://jeroenjanssens.com/img/fa-qZP8yC7-453.jpeg 453w, https://jeroenjanssens.com/img/fa-qZP8yC7-604.jpeg 604w, https://jeroenjanssens.com/img/fa-qZP8yC7-907.jpeg 907w, https://jeroenjanssens.com/img/fa-qZP8yC7-1209.jpeg 1209w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/fa-qZP8yC7-302.jpeg 302w, https://jeroenjanssens.com/img/fa-qZP8yC7-453.jpeg 453w, https://jeroenjanssens.com/img/fa-qZP8yC7-604.jpeg 604w, https://jeroenjanssens.com/img/fa-qZP8yC7-907.jpeg 907w, https://jeroenjanssens.com/img/fa-qZP8yC7-1209.jpeg 1209w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 90%;" src="https://jeroenjanssens.com/img/fa-qZP8yC7-302.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
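<p>The nine combinations in the figure above come from pairing three
horizontal with three vertical alignments. As a sketch, they can be
enumerated with <code>itertools.product</code>; each pair could then be passed as the
<code>ha</code> and <code>va</code> keyword arguments, as in the <code>geom_text()</code> calls earlier. The
list names are only illustrative.</p>

```python
from itertools import product

horizontal = ["left", "center", "right"]  # values for ha
vertical = ["top", "center", "bottom"]    # values for va

# Every pairing of one horizontal with one vertical alignment.
combinations = [{"ha": ha, "va": va} for ha, va in product(horizontal, vertical)]

for combo in combinations:
    print(combo)
```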
<p>Remember, in addition to <code>geom_text()</code>, you have many other geoms in
plotnine available to help annotate your plot. A few ideas:</p>
<ul>
<li>
<p>Use <code>geom_hline()</code> and <code>geom_vline()</code> to add reference lines. I often
make them thick (<code>size=2</code>) and white (<code>colour="white"</code>), and draw them
underneath the primary data layer. That makes them easy to see,
without drawing attention away from the data.</p>
</li>
<li>
<p>Use <code>geom_rect()</code> to draw a rectangle around points of interest. The
boundaries of the rectangle are defined by aesthetics <code>xmin</code>, <code>xmax</code>,
<code>ymin</code>, <code>ymax</code>.</p>
</li>
<li>
<p>Use <code>geom_segment()</code> with the <code>arrow</code> argument to draw attention to a
point with an arrow. Use aesthetics <code>x</code> and <code>y</code> to define the starting
location, and <code>xend</code> and <code>yend</code> to define the end location.</p>
</li>
</ul>
<p>The only limit is your imagination (and your patience with positioning
annotations to be aesthetically pleasing)!</p>
<h3><a href="https://r4ds.had.co.nz/graphics-for-communication.html#exercises-72">28.3.1</a> Exercises</h3>
<ol>
<li>
<p>Use <code>geom_text()</code> with infinite positions to place text at the four
corners of the plot.</p>
</li>
<li>
<p>Read the documentation for <code>annotate()</code>. How can you use it to add a
text label to a plot without having to create a DataFrame?</p>
</li>
<li>
<p>How do labels with <code>geom_text()</code> interact with faceting? How can you
add a label to a single facet? How can you put a different label in
each facet? (Hint: think about the underlying data.)</p>
</li>
<li>
<p>What arguments to <code>geom_label()</code> control the appearance of the
background box?</p>
</li>
<li>
<p>What are the four arguments to <code>arrow()</code>? How do they work? Create a
series of plots that demonstrate the most important options.</p>
</li>
</ol>
<h2><a href="https://r4ds.had.co.nz/graphics-for-communication.html#scales">28.4</a> Scales</h2>
<p>The third way you can make your plot better for communication is to
adjust the scales. Scales control the mapping from data values to things
that you can perceive. Normally, plotnine automatically adds scales for
you. For example, when you type:</p>
<pre class="language-python"><code class="language-python">ggplot<span class="token punctuation">(</span>mpg<span class="token punctuation">,</span> aes<span class="token punctuation">(</span><span class="token string">"displ"</span><span class="token punctuation">,</span> <span class="token string">"hwy"</span><span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token operator">+</span>\<br />geom_point<span class="token punctuation">(</span>aes<span class="token punctuation">(</span>colour<span class="token operator">=</span><span class="token string">"class"</span><span class="token punctuation">)</span><span class="token punctuation">)</span></code></pre>
<p>plotnine automatically adds default scales behind the scenes:</p>
<pre class="language-python"><code class="language-python">ggplot<span class="token punctuation">(</span>mpg<span class="token punctuation">,</span> aes<span class="token punctuation">(</span><span class="token string">"displ"</span><span class="token punctuation">,</span> <span class="token string">"hwy"</span><span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token operator">+</span>\<br />geom_point<span class="token punctuation">(</span>aes<span class="token punctuation">(</span>colour<span class="token operator">=</span><span class="token string">"class"</span><span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token operator">+</span>\<br />scale_x_continuous<span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token operator">+</span>\<br />scale_y_continuous<span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token operator">+</span>\<br />scale_colour_discrete<span class="token punctuation">(</span><span class="token punctuation">)</span></code></pre>
<p>Note the naming scheme for scales: <code>scale_</code> followed by the name of the
aesthetic, then <code>_</code>, then the name of the scale. The default scales are
named according to the type of variable they align with: continuous,
discrete, datetime, or date. There are lots of non-default scales which
you’ll learn about below.</p>
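<p>As a small illustration of the naming scheme, the names of the three default scales shown above can be assembled from their parts. The helper <code>scale_name</code> is hypothetical and is not part of plotnine; it only composes strings.</p>

```python
def scale_name(aesthetic, kind):
    """Compose a scale name: "scale_", the aesthetic, "_", the scale."""
    return f"scale_{aesthetic}_{kind}"

# The aesthetics and scale types used in the example plot above.
defaults = [("x", "continuous"), ("y", "continuous"), ("colour", "discrete")]
names = [scale_name(aesthetic, kind) for aesthetic, kind in defaults]
print(names)
```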
<p>The default scales have been carefully chosen to do a good job for a
wide range of inputs. Nevertheless, you might want to override the
defaults for two reasons:</p>
<ul>
<li>
<p>You might want to tweak some of the parameters of the default scale.
This allows you to do things like change the breaks on the axes, or
the key labels on the legend.</p>
</li>
<li>
<p>You might want to replace the scale altogether, and use a completely
different algorithm. Often you can do better than the default because
you know more about the data.</p>
</li>
</ul>
<h3><a href="https://r4ds.had.co.nz/graphics-for-communication.html#axis-ticks-and-legend-keys">28.4.1</a> Axis ticks and legend keys</h3>
<p>There are two primary arguments that affect the appearance of the ticks
on the axes and the keys on the legend: <code>breaks</code> and <code>labels</code>. The
<code>breaks</code> argument controls the position of the ticks, or the values
associated with the keys. The <code>labels</code> argument controls the text label
associated with each tick or key. The most common use of <code>breaks</code> is to
override the default choice:</p>
<pre class="language-python"><code class="language-python">ggplot<span class="token punctuation">(</span>mpg<span class="token punctuation">,</span> aes<span class="token punctuation">(</span><span class="token string">"displ"</span><span class="token punctuation">,</span> <span class="token string">"hwy"</span><span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token operator">+</span>\<br />geom_point<span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token operator">+</span>\<br />scale_y_continuous<span class="token punctuation">(</span>breaks<span class="token operator">=</span><span class="token builtin">range</span><span class="token punctuation">(</span><span class="token number">15</span><span class="token punctuation">,</span> <span class="token number">45</span><span class="token punctuation">,</span> <span class="token number">5</span><span class="token punctuation">)</span><span class="token punctuation">)</span></code></pre>
<figure>
<a href="https://jeroenjanssens.com/img/plotnine-grammar-of-graphics-for-python/plotnine-unnamed-chunk-75-1.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/VlsP0JJJU1-302.webp 302w, https://jeroenjanssens.com/img/VlsP0JJJU1-453.webp 453w, https://jeroenjanssens.com/img/VlsP0JJJU1-604.webp 604w, https://jeroenjanssens.com/img/VlsP0JJJU1-907.webp 907w, https://jeroenjanssens.com/img/VlsP0JJJU1-1209.webp 1209w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/VlsP0JJJU1-302.webp 302w, https://jeroenjanssens.com/img/VlsP0JJJU1-453.webp 453w, https://jeroenjanssens.com/img/VlsP0JJJU1-604.webp 604w, https://jeroenjanssens.com/img/VlsP0JJJU1-907.webp 907w, https://jeroenjanssens.com/img/VlsP0JJJU1-1209.webp 1209w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/VlsP0JJJU1-302.jpeg 302w, https://jeroenjanssens.com/img/VlsP0JJJU1-453.jpeg 453w, https://jeroenjanssens.com/img/VlsP0JJJU1-604.jpeg 604w, https://jeroenjanssens.com/img/VlsP0JJJU1-907.jpeg 907w, https://jeroenjanssens.com/img/VlsP0JJJU1-1209.jpeg 1209w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/VlsP0JJJU1-302.jpeg 302w, https://jeroenjanssens.com/img/VlsP0JJJU1-453.jpeg 453w, https://jeroenjanssens.com/img/VlsP0JJJU1-604.jpeg 604w, https://jeroenjanssens.com/img/VlsP0JJJU1-907.jpeg 907w, https://jeroenjanssens.com/img/VlsP0JJJU1-1209.jpeg 1209w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 90%;" src="https://jeroenjanssens.com/img/VlsP0JJJU1-302.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
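<p>It helps to check which values <code>range(15, 45, 5)</code> actually produces, because Python’s <code>range()</code> excludes its stop value. A quick sketch:</p>

```python
# range() stops before its second argument, so 45 itself is not a break.
breaks = list(range(15, 45, 5))
print(breaks)  # [15, 20, 25, 30, 35, 40]

# To include a tick at 45, move the stop value past it.
breaks_with_45 = list(range(15, 45 + 1, 5))
print(breaks_with_45)  # [15, 20, 25, 30, 35, 40, 45]
```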
<p>You can use <code>labels</code> in the same way (a list of strings the same length
as <code>breaks</code>), but you can also suppress the labels altogether by passing
a list of empty strings. This is useful for maps, or for publishing
plots where you can’t share the absolute numbers. Because the list of
labels must be as long as the list of breaks, it is convenient to pass a
helper function such as <code>no_labels</code>, which receives the break values and
returns one empty string for each<sup class="footnote-ref"><a href="https://jeroenjanssens.com/plotnine/#fn12" id="fnref12">[12]</a></sup>:</p>
<pre class="language-python"><code class="language-python"><span class="token keyword">def</span> <span class="token function">no_labels</span><span class="token punctuation">(</span>values<span class="token punctuation">)</span><span class="token punctuation">:</span><br /> <span class="token keyword">return</span> <span class="token punctuation">[</span><span class="token string">""</span><span class="token punctuation">]</span> <span class="token operator">*</span> <span class="token builtin">len</span><span class="token punctuation">(</span>values<span class="token punctuation">)</span><br /><br />ggplot<span class="token punctuation">(</span>mpg<span class="token punctuation">,</span> aes<span class="token punctuation">(</span><span class="token string">"displ"</span><span class="token punctuation">,</span> <span class="token string">"hwy"</span><span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token operator">+</span>\<br />geom_point<span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token operator">+</span>\<br />scale_x_continuous<span class="token punctuation">(</span>labels<span class="token operator">=</span>no_labels<span class="token punctuation">)</span> <span class="token operator">+</span>\<br />scale_y_continuous<span class="token punctuation">(</span>labels<span class="token operator">=</span>no_labels<span class="token punctuation">)</span></code></pre>
<figure>
<a href="https://jeroenjanssens.com/img/plotnine-grammar-of-graphics-for-python/plotnine-unnamed-chunk-76-1.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/gVsw7uFgLH-302.webp 302w, https://jeroenjanssens.com/img/gVsw7uFgLH-453.webp 453w, https://jeroenjanssens.com/img/gVsw7uFgLH-604.webp 604w, https://jeroenjanssens.com/img/gVsw7uFgLH-907.webp 907w, https://jeroenjanssens.com/img/gVsw7uFgLH-1209.webp 1209w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/gVsw7uFgLH-302.webp 302w, https://jeroenjanssens.com/img/gVsw7uFgLH-453.webp 453w, https://jeroenjanssens.com/img/gVsw7uFgLH-604.webp 604w, https://jeroenjanssens.com/img/gVsw7uFgLH-907.webp 907w, https://jeroenjanssens.com/img/gVsw7uFgLH-1209.webp 1209w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/gVsw7uFgLH-302.jpeg 302w, https://jeroenjanssens.com/img/gVsw7uFgLH-453.jpeg 453w, https://jeroenjanssens.com/img/gVsw7uFgLH-604.jpeg 604w, https://jeroenjanssens.com/img/gVsw7uFgLH-907.jpeg 907w, https://jeroenjanssens.com/img/gVsw7uFgLH-1209.jpeg 1209w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/gVsw7uFgLH-302.jpeg 302w, https://jeroenjanssens.com/img/gVsw7uFgLH-453.jpeg 453w, https://jeroenjanssens.com/img/gVsw7uFgLH-604.jpeg 604w, https://jeroenjanssens.com/img/gVsw7uFgLH-907.jpeg 907w, https://jeroenjanssens.com/img/gVsw7uFgLH-1209.jpeg 1209w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 90%;" src="https://jeroenjanssens.com/img/gVsw7uFgLH-302.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
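<p>The same pattern works for any function that maps a list of break values to a list of strings of equal length. As a sketch, here is <code>no_labels</code> next to a hypothetical <code>mpg_labels</code> helper (not part of plotnine) that appends a unit, both called directly so you can see what they return:</p>

```python
def no_labels(values):
    """Return one empty string per break, which hides the tick labels."""
    return [""] * len(values)

def mpg_labels(values):
    """Hypothetical helper: format each break as a number followed by a unit."""
    return [f"{value:g} mpg" for value in values]

example_breaks = [15, 20, 25]
hidden = no_labels(example_breaks)
with_units = mpg_labels(example_breaks)
print(hidden)      # ['', '', '']
print(with_units)  # ['15 mpg', '20 mpg', '25 mpg']
```

<p>Either function could then be passed as <code>labels=</code> to a scale, just like <code>no_labels</code> in the plot above.</p>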
<p>You can also use <code>breaks</code> and <code>labels</code> to control the appearance of
legends. Collectively, axes and legends are called <strong>guides</strong>. Axes are
used for the x and y aesthetics; legends are used for everything else.</p>
<p>Another use of <code>breaks</code> is when you have relatively few data points and
want to highlight exactly where the observations occur. For example,
take this plot that shows when each US president started and ended their
term.</p>
<pre class="language-python"><code class="language-python">presidential<span class="token punctuation">[</span><span class="token string">"id"</span><span class="token punctuation">]</span> <span class="token operator">=</span> <span class="token number">34</span> <span class="token operator">+</span> presidential<span class="token punctuation">.</span>index<br /><br />ggplot<span class="token punctuation">(</span>presidential<span class="token punctuation">,</span> aes<span class="token punctuation">(</span><span class="token string">"start"</span><span class="token punctuation">,</span> <span class="token string">"id"</span><span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token operator">+</span>\<br />geom_point<span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token operator">+</span>\<br />geom_segment<span class="token punctuation">(</span>aes<span class="token punctuation">(</span>xend<span class="token operator">=</span><span class="token string">"end"</span><span class="token punctuation">,</span> yend<span class="token operator">=</span><span class="token string">"id"</span><span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token operator">+</span>\<br />scale_x_date<span class="token punctuation">(</span>name<span class="token operator">=</span><span class="token string">""</span><span class="token punctuation">,</span> breaks<span class="token operator">=</span>presidential<span class="token punctuation">.</span>start<span class="token punctuation">,</span> date_labels<span class="token operator">=</span><span class="token string">"'%y"</span><span class="token punctuation">)</span></code></pre>
<figure>
<a href="https://jeroenjanssens.com/img/plotnine-grammar-of-graphics-for-python/plotnine-unnamed-chunk-77-1.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/AzCdvCCUma-302.webp 302w, https://jeroenjanssens.com/img/AzCdvCCUma-453.webp 453w, https://jeroenjanssens.com/img/AzCdvCCUma-604.webp 604w, https://jeroenjanssens.com/img/AzCdvCCUma-907.webp 907w, https://jeroenjanssens.com/img/AzCdvCCUma-1209.webp 1209w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/AzCdvCCUma-302.webp 302w, https://jeroenjanssens.com/img/AzCdvCCUma-453.webp 453w, https://jeroenjanssens.com/img/AzCdvCCUma-604.webp 604w, https://jeroenjanssens.com/img/AzCdvCCUma-907.webp 907w, https://jeroenjanssens.com/img/AzCdvCCUma-1209.webp 1209w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/AzCdvCCUma-302.jpeg 302w, https://jeroenjanssens.com/img/AzCdvCCUma-453.jpeg 453w, https://jeroenjanssens.com/img/AzCdvCCUma-604.jpeg 604w, https://jeroenjanssens.com/img/AzCdvCCUma-907.jpeg 907w, https://jeroenjanssens.com/img/AzCdvCCUma-1209.jpeg 1209w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/AzCdvCCUma-302.jpeg 302w, https://jeroenjanssens.com/img/AzCdvCCUma-453.jpeg 453w, https://jeroenjanssens.com/img/AzCdvCCUma-604.jpeg 604w, https://jeroenjanssens.com/img/AzCdvCCUma-907.jpeg 907w, https://jeroenjanssens.com/img/AzCdvCCUma-1209.jpeg 1209w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 90%;" src="https://jeroenjanssens.com/img/AzCdvCCUma-302.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
<p>Note that the specification of breaks and labels for date and datetime
scales is a little different:</p>
<ul>
<li>
<p><code>date_labels</code> takes a format specification that uses the same
directives as <code>time.strftime()</code>.</p>
</li>
<li>
<p><code>date_breaks</code> (not shown here) takes a string like “2 days” or “1
month”.</p>
</li>
</ul>
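<p>To see what the format <code>"'%y"</code> used in the plot above produces, you can apply it to a few dates with the standard library. The dates below are three well-known inauguration days, typed in by hand for illustration rather than read from the <code>presidential</code> DataFrame:</p>

```python
from datetime import date

# Three inauguration dates, entered manually for illustration.
starts = [date(1953, 1, 20), date(1961, 1, 20), date(2009, 1, 20)]

# "%y" is the two-digit year; the leading apostrophe is copied literally.
labels = [start.strftime("'%y") for start in starts]
print(labels)  # ["'53", "'61", "'09"]
```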
<h3><a href="https://r4ds.had.co.nz/graphics-for-communication.html#legend-layout">28.4.2</a> Legend layout</h3>
<p>You will most often use <code>breaks</code> and <code>labels</code> to tweak the axes. While
they both also work for legends, there are a few other techniques you
are more likely to use.</p>
<p>To control the overall position of the legend, you need to use a
<code>theme()</code> setting. We’ll come back to themes at the end of the chapter,
but in brief, they control the non-data parts of the plot. The theme
setting <code>legend_position</code> controls where the legend is drawn.
Unfortunately, in order to position the legend correctly on the left or
the bottom, we have to be a bit more explicit. Just using “left” and
“bottom” may cause the legend to overlap the axis labels. Your mileage
may vary.</p>
<pre class="language-python"><code class="language-python">base <span class="token operator">=</span> ggplot<span class="token punctuation">(</span>mpg<span class="token punctuation">,</span> aes<span class="token punctuation">(</span><span class="token string">"displ"</span><span class="token punctuation">,</span> <span class="token string">"hwy"</span><span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token operator">+</span>\<br />geom_point<span class="token punctuation">(</span>aes<span class="token punctuation">(</span>colour<span class="token operator">=</span><span class="token string">"class"</span><span class="token punctuation">)</span><span class="token punctuation">)</span><br /><br />base <span class="token operator">+</span> theme<span class="token punctuation">(</span>legend_position<span class="token operator">=</span><span class="token string">"right"</span><span class="token punctuation">)</span> <span class="token comment"># the default</span><br />base <span class="token operator">+</span> theme<span class="token punctuation">(</span>subplots_adjust<span class="token operator">=</span><span class="token punctuation">{</span><span class="token string">'left'</span><span class="token punctuation">:</span> <span class="token number">0.3</span><span class="token punctuation">}</span><span class="token punctuation">)</span> <span class="token operator">+</span> theme<span class="token punctuation">(</span>legend_position<span class="token operator">=</span><span class="token punctuation">(</span><span class="token number">0</span><span class="token punctuation">,</span> <span class="token number">0.5</span><span class="token punctuation">)</span><span class="token punctuation">)</span><br />base <span class="token operator">+</span> theme<span class="token punctuation">(</span>legend_position<span class="token operator">=</span><span class="token string">"top"</span><span class="token 
punctuation">)</span><br />base <span class="token operator">+</span> theme<span class="token punctuation">(</span>subplots_adjust<span class="token operator">=</span><span class="token punctuation">{</span><span class="token string">'bottom'</span><span class="token punctuation">:</span> <span class="token number">0.3</span><span class="token punctuation">}</span><span class="token punctuation">,</span> legend_position<span class="token operator">=</span><span class="token punctuation">(</span><span class="token number">.5</span><span class="token punctuation">,</span> <span class="token number">0</span><span class="token punctuation">)</span><span class="token punctuation">,</span> legend_direction<span class="token operator">=</span><span class="token string">'horizontal'</span><span class="token punctuation">)</span></code></pre>
<div class="flex flex-wrap md:flex-row mb-4">
<div class="mx-auto md:w-1/2 ">
<figure>
<a href="https://jeroenjanssens.com/img/plotnine-grammar-of-graphics-for-python/plotnine-unnamed-chunk-79-1.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/8IJIL4Xx4Y-302.webp 302w, https://jeroenjanssens.com/img/8IJIL4Xx4Y-453.webp 453w, https://jeroenjanssens.com/img/8IJIL4Xx4Y-604.webp 604w, https://jeroenjanssens.com/img/8IJIL4Xx4Y-907.webp 907w, https://jeroenjanssens.com/img/8IJIL4Xx4Y-1209.webp 1209w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/8IJIL4Xx4Y-302.webp 302w, https://jeroenjanssens.com/img/8IJIL4Xx4Y-453.webp 453w, https://jeroenjanssens.com/img/8IJIL4Xx4Y-604.webp 604w, https://jeroenjanssens.com/img/8IJIL4Xx4Y-907.webp 907w, https://jeroenjanssens.com/img/8IJIL4Xx4Y-1209.webp 1209w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/8IJIL4Xx4Y-302.jpeg 302w, https://jeroenjanssens.com/img/8IJIL4Xx4Y-453.jpeg 453w, https://jeroenjanssens.com/img/8IJIL4Xx4Y-604.jpeg 604w, https://jeroenjanssens.com/img/8IJIL4Xx4Y-907.jpeg 907w, https://jeroenjanssens.com/img/8IJIL4Xx4Y-1209.jpeg 1209w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/8IJIL4Xx4Y-302.jpeg 302w, https://jeroenjanssens.com/img/8IJIL4Xx4Y-453.jpeg 453w, https://jeroenjanssens.com/img/8IJIL4Xx4Y-604.jpeg 604w, https://jeroenjanssens.com/img/8IJIL4Xx4Y-907.jpeg 907w, https://jeroenjanssens.com/img/8IJIL4Xx4Y-1209.jpeg 1209w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 90%;" src="https://jeroenjanssens.com/img/8IJIL4Xx4Y-302.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
</div>
<div class="mx-auto md:w-1/2 ">
<figure>
<a href="https://jeroenjanssens.com/img/plotnine-grammar-of-graphics-for-python/plotnine-unnamed-chunk-80-1.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/jr1cHowSra-302.webp 302w, https://jeroenjanssens.com/img/jr1cHowSra-453.webp 453w, https://jeroenjanssens.com/img/jr1cHowSra-604.webp 604w, https://jeroenjanssens.com/img/jr1cHowSra-907.webp 907w, https://jeroenjanssens.com/img/jr1cHowSra-1209.webp 1209w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/jr1cHowSra-302.webp 302w, https://jeroenjanssens.com/img/jr1cHowSra-453.webp 453w, https://jeroenjanssens.com/img/jr1cHowSra-604.webp 604w, https://jeroenjanssens.com/img/jr1cHowSra-907.webp 907w, https://jeroenjanssens.com/img/jr1cHowSra-1209.webp 1209w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/jr1cHowSra-302.jpeg 302w, https://jeroenjanssens.com/img/jr1cHowSra-453.jpeg 453w, https://jeroenjanssens.com/img/jr1cHowSra-604.jpeg 604w, https://jeroenjanssens.com/img/jr1cHowSra-907.jpeg 907w, https://jeroenjanssens.com/img/jr1cHowSra-1209.jpeg 1209w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/jr1cHowSra-302.jpeg 302w, https://jeroenjanssens.com/img/jr1cHowSra-453.jpeg 453w, https://jeroenjanssens.com/img/jr1cHowSra-604.jpeg 604w, https://jeroenjanssens.com/img/jr1cHowSra-907.jpeg 907w, https://jeroenjanssens.com/img/jr1cHowSra-1209.jpeg 1209w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 90%;" src="https://jeroenjanssens.com/img/jr1cHowSra-302.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
</div>
<div class="mx-auto md:w-1/2 ">
<figure>
<a href="https://jeroenjanssens.com/img/plotnine-grammar-of-graphics-for-python/plotnine-unnamed-chunk-81-1.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/r4huxiUkxb-302.webp 302w, https://jeroenjanssens.com/img/r4huxiUkxb-453.webp 453w, https://jeroenjanssens.com/img/r4huxiUkxb-604.webp 604w, https://jeroenjanssens.com/img/r4huxiUkxb-907.webp 907w, https://jeroenjanssens.com/img/r4huxiUkxb-1209.webp 1209w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/r4huxiUkxb-302.webp 302w, https://jeroenjanssens.com/img/r4huxiUkxb-453.webp 453w, https://jeroenjanssens.com/img/r4huxiUkxb-604.webp 604w, https://jeroenjanssens.com/img/r4huxiUkxb-907.webp 907w, https://jeroenjanssens.com/img/r4huxiUkxb-1209.webp 1209w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/r4huxiUkxb-302.jpeg 302w, https://jeroenjanssens.com/img/r4huxiUkxb-453.jpeg 453w, https://jeroenjanssens.com/img/r4huxiUkxb-604.jpeg 604w, https://jeroenjanssens.com/img/r4huxiUkxb-907.jpeg 907w, https://jeroenjanssens.com/img/r4huxiUkxb-1209.jpeg 1209w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/r4huxiUkxb-302.jpeg 302w, https://jeroenjanssens.com/img/r4huxiUkxb-453.jpeg 453w, https://jeroenjanssens.com/img/r4huxiUkxb-604.jpeg 604w, https://jeroenjanssens.com/img/r4huxiUkxb-907.jpeg 907w, https://jeroenjanssens.com/img/r4huxiUkxb-1209.jpeg 1209w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 90%;" src="https://jeroenjanssens.com/img/r4huxiUkxb-302.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
</div>
<div class="mx-auto md:w-1/2 ">
<figure>
<a href="https://jeroenjanssens.com/img/plotnine-grammar-of-graphics-for-python/plotnine-unnamed-chunk-82-1.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/O6p8xIyYnk-302.webp 302w, https://jeroenjanssens.com/img/O6p8xIyYnk-453.webp 453w, https://jeroenjanssens.com/img/O6p8xIyYnk-604.webp 604w, https://jeroenjanssens.com/img/O6p8xIyYnk-907.webp 907w, https://jeroenjanssens.com/img/O6p8xIyYnk-1209.webp 1209w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/O6p8xIyYnk-302.webp 302w, https://jeroenjanssens.com/img/O6p8xIyYnk-453.webp 453w, https://jeroenjanssens.com/img/O6p8xIyYnk-604.webp 604w, https://jeroenjanssens.com/img/O6p8xIyYnk-907.webp 907w, https://jeroenjanssens.com/img/O6p8xIyYnk-1209.webp 1209w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/O6p8xIyYnk-302.jpeg 302w, https://jeroenjanssens.com/img/O6p8xIyYnk-453.jpeg 453w, https://jeroenjanssens.com/img/O6p8xIyYnk-604.jpeg 604w, https://jeroenjanssens.com/img/O6p8xIyYnk-907.jpeg 907w, https://jeroenjanssens.com/img/O6p8xIyYnk-1209.jpeg 1209w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/O6p8xIyYnk-302.jpeg 302w, https://jeroenjanssens.com/img/O6p8xIyYnk-453.jpeg 453w, https://jeroenjanssens.com/img/O6p8xIyYnk-604.jpeg 604w, https://jeroenjanssens.com/img/O6p8xIyYnk-907.jpeg 907w, https://jeroenjanssens.com/img/O6p8xIyYnk-1209.jpeg 1209w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 90%;" src="https://jeroenjanssens.com/img/O6p8xIyYnk-302.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
</div>
</div>
<p>You can also use <code>legend_position="none"</code> to suppress the display of the
legend altogether.</p>
<p>To control the display of individual legends, use <code>guides()</code> along with
<code>guide_legend()</code> or <code>guide_colourbar()</code>. The following example shows two
important settings: controlling the number of rows the legend uses with
<code>nrow</code>, and overriding one of the aesthetics to make the points bigger.
This is particularly useful if you have used a low <code>alpha</code> to display
many points on a plot.</p>
<pre class="language-python"><code class="language-python">ggplot<span class="token punctuation">(</span>mpg<span class="token punctuation">,</span> aes<span class="token punctuation">(</span><span class="token string">"displ"</span><span class="token punctuation">,</span> <span class="token string">"hwy"</span><span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token operator">+</span>\<br />geom_point<span class="token punctuation">(</span>aes<span class="token punctuation">(</span>colour<span class="token operator">=</span><span class="token string">"class"</span><span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token operator">+</span>\<br />geom_smooth<span class="token punctuation">(</span>se<span class="token operator">=</span><span class="token boolean">False</span><span class="token punctuation">)</span> <span class="token operator">+</span>\<br />theme<span class="token punctuation">(</span>legend_position<span class="token operator">=</span><span class="token string">"bottom"</span><span class="token punctuation">)</span> <span class="token operator">+</span>\<br />guides<span class="token punctuation">(</span>colour<span class="token operator">=</span>guide_legend<span class="token punctuation">(</span>nrow<span class="token operator">=</span><span class="token number">1</span><span class="token punctuation">,</span> override_aes<span class="token operator">=</span><span class="token punctuation">{</span><span class="token string">"size"</span><span class="token punctuation">:</span> <span class="token number">4</span><span class="token punctuation">}</span><span class="token punctuation">)</span><span class="token punctuation">)</span></code></pre>
<figure>
<a href="https://jeroenjanssens.com/img/plotnine-grammar-of-graphics-for-python/plotnine-unnamed-chunk-83-1.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/LHl-kmtAo4-302.webp 302w, https://jeroenjanssens.com/img/LHl-kmtAo4-453.webp 453w, https://jeroenjanssens.com/img/LHl-kmtAo4-604.webp 604w, https://jeroenjanssens.com/img/LHl-kmtAo4-907.webp 907w, https://jeroenjanssens.com/img/LHl-kmtAo4-1209.webp 1209w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/LHl-kmtAo4-302.webp 302w, https://jeroenjanssens.com/img/LHl-kmtAo4-453.webp 453w, https://jeroenjanssens.com/img/LHl-kmtAo4-604.webp 604w, https://jeroenjanssens.com/img/LHl-kmtAo4-907.webp 907w, https://jeroenjanssens.com/img/LHl-kmtAo4-1209.webp 1209w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/LHl-kmtAo4-302.jpeg 302w, https://jeroenjanssens.com/img/LHl-kmtAo4-453.jpeg 453w, https://jeroenjanssens.com/img/LHl-kmtAo4-604.jpeg 604w, https://jeroenjanssens.com/img/LHl-kmtAo4-907.jpeg 907w, https://jeroenjanssens.com/img/LHl-kmtAo4-1209.jpeg 1209w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/LHl-kmtAo4-302.jpeg 302w, https://jeroenjanssens.com/img/LHl-kmtAo4-453.jpeg 453w, https://jeroenjanssens.com/img/LHl-kmtAo4-604.jpeg 604w, https://jeroenjanssens.com/img/LHl-kmtAo4-907.jpeg 907w, https://jeroenjanssens.com/img/LHl-kmtAo4-1209.jpeg 1209w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 90%;" src="https://jeroenjanssens.com/img/LHl-kmtAo4-302.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
<h3><a href="https://r4ds.had.co.nz/graphics-for-communication.html#replacing-a-scale">28.4.3</a> Replacing a scale</h3>
<p>Rather than just tweaking the details a little, you can replace
the scale altogether. There are two types of scales you’re most likely
to want to switch out: continuous position scales and colour scales.
Fortunately, the same principles apply to all the other aesthetics, so
once you’ve mastered position and colour, you’ll be able to quickly pick
up other scale replacements.</p>
<p>It’s very useful to plot transformations of your variable. For example,
with the <code>diamonds</code> DataFrame, it’s easier to see the precise
relationship between <code>carat</code> and <code>price</code> if we log transform them:</p>
<pre class="language-python"><code class="language-python">ggplot<span class="token punctuation">(</span>diamonds<span class="token punctuation">,</span> aes<span class="token punctuation">(</span><span class="token string">"carat"</span><span class="token punctuation">,</span> <span class="token string">"price"</span><span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token operator">+</span>\<br />geom_bin2d<span class="token punctuation">(</span><span class="token punctuation">)</span><br /><br />ggplot<span class="token punctuation">(</span>diamonds<span class="token punctuation">,</span> aes<span class="token punctuation">(</span><span class="token string">"np.log10(carat)"</span><span class="token punctuation">,</span> <span class="token string">"np.log10(price)"</span><span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token operator">+</span>\<br />geom_bin2d<span class="token punctuation">(</span><span class="token punctuation">)</span></code></pre>
<div class="flex flex-wrap md:flex-row mb-4">
<div class="mx-auto md:w-1/2 ">
<figure>
<a href="https://jeroenjanssens.com/img/plotnine-grammar-of-graphics-for-python/plotnine-unnamed-chunk-85-1.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/QZB8Gr4e6O-302.webp 302w, https://jeroenjanssens.com/img/QZB8Gr4e6O-453.webp 453w, https://jeroenjanssens.com/img/QZB8Gr4e6O-604.webp 604w, https://jeroenjanssens.com/img/QZB8Gr4e6O-907.webp 907w, https://jeroenjanssens.com/img/QZB8Gr4e6O-1209.webp 1209w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/QZB8Gr4e6O-302.webp 302w, https://jeroenjanssens.com/img/QZB8Gr4e6O-453.webp 453w, https://jeroenjanssens.com/img/QZB8Gr4e6O-604.webp 604w, https://jeroenjanssens.com/img/QZB8Gr4e6O-907.webp 907w, https://jeroenjanssens.com/img/QZB8Gr4e6O-1209.webp 1209w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/QZB8Gr4e6O-302.jpeg 302w, https://jeroenjanssens.com/img/QZB8Gr4e6O-453.jpeg 453w, https://jeroenjanssens.com/img/QZB8Gr4e6O-604.jpeg 604w, https://jeroenjanssens.com/img/QZB8Gr4e6O-907.jpeg 907w, https://jeroenjanssens.com/img/QZB8Gr4e6O-1209.jpeg 1209w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/QZB8Gr4e6O-302.jpeg 302w, https://jeroenjanssens.com/img/QZB8Gr4e6O-453.jpeg 453w, https://jeroenjanssens.com/img/QZB8Gr4e6O-604.jpeg 604w, https://jeroenjanssens.com/img/QZB8Gr4e6O-907.jpeg 907w, https://jeroenjanssens.com/img/QZB8Gr4e6O-1209.jpeg 1209w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 90%;" src="https://jeroenjanssens.com/img/QZB8Gr4e6O-302.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
</div>
<div class="mx-auto md:w-1/2 ">
<figure>
<a href="https://jeroenjanssens.com/img/plotnine-grammar-of-graphics-for-python/plotnine-unnamed-chunk-86-1.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/2Aizc9ZCOt-302.webp 302w, https://jeroenjanssens.com/img/2Aizc9ZCOt-453.webp 453w, https://jeroenjanssens.com/img/2Aizc9ZCOt-604.webp 604w, https://jeroenjanssens.com/img/2Aizc9ZCOt-907.webp 907w, https://jeroenjanssens.com/img/2Aizc9ZCOt-1209.webp 1209w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/2Aizc9ZCOt-302.webp 302w, https://jeroenjanssens.com/img/2Aizc9ZCOt-453.webp 453w, https://jeroenjanssens.com/img/2Aizc9ZCOt-604.webp 604w, https://jeroenjanssens.com/img/2Aizc9ZCOt-907.webp 907w, https://jeroenjanssens.com/img/2Aizc9ZCOt-1209.webp 1209w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/2Aizc9ZCOt-302.jpeg 302w, https://jeroenjanssens.com/img/2Aizc9ZCOt-453.jpeg 453w, https://jeroenjanssens.com/img/2Aizc9ZCOt-604.jpeg 604w, https://jeroenjanssens.com/img/2Aizc9ZCOt-907.jpeg 907w, https://jeroenjanssens.com/img/2Aizc9ZCOt-1209.jpeg 1209w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/2Aizc9ZCOt-302.jpeg 302w, https://jeroenjanssens.com/img/2Aizc9ZCOt-453.jpeg 453w, https://jeroenjanssens.com/img/2Aizc9ZCOt-604.jpeg 604w, https://jeroenjanssens.com/img/2Aizc9ZCOt-907.jpeg 907w, https://jeroenjanssens.com/img/2Aizc9ZCOt-1209.jpeg 1209w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 90%;" src="https://jeroenjanssens.com/img/2Aizc9ZCOt-302.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
</div>
</div>
<p>However, the disadvantage of this transformation is that the axes are
now labelled with the transformed values, making it hard to interpret
the plot. Rather than doing the transformation in the aesthetic mapping,
we can do it with the scale. The result is visually identical, except
that the axes are labelled on the original data scale.</p>
<pre class="language-python"><code class="language-python">ggplot<span class="token punctuation">(</span>diamonds<span class="token punctuation">,</span> aes<span class="token punctuation">(</span><span class="token string">"carat"</span><span class="token punctuation">,</span> <span class="token string">"price"</span><span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token operator">+</span>\<br />geom_bin2d<span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token operator">+</span>\<br />scale_x_log10<span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token operator">+</span>\<br />scale_y_log10<span class="token punctuation">(</span><span class="token punctuation">)</span></code></pre>
<figure>
<a href="https://jeroenjanssens.com/img/plotnine-grammar-of-graphics-for-python/plotnine-unnamed-chunk-87-1.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/PE4IaqQ6Sa-302.webp 302w, https://jeroenjanssens.com/img/PE4IaqQ6Sa-453.webp 453w, https://jeroenjanssens.com/img/PE4IaqQ6Sa-604.webp 604w, https://jeroenjanssens.com/img/PE4IaqQ6Sa-907.webp 907w, https://jeroenjanssens.com/img/PE4IaqQ6Sa-1209.webp 1209w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/PE4IaqQ6Sa-302.webp 302w, https://jeroenjanssens.com/img/PE4IaqQ6Sa-453.webp 453w, https://jeroenjanssens.com/img/PE4IaqQ6Sa-604.webp 604w, https://jeroenjanssens.com/img/PE4IaqQ6Sa-907.webp 907w, https://jeroenjanssens.com/img/PE4IaqQ6Sa-1209.webp 1209w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/PE4IaqQ6Sa-302.jpeg 302w, https://jeroenjanssens.com/img/PE4IaqQ6Sa-453.jpeg 453w, https://jeroenjanssens.com/img/PE4IaqQ6Sa-604.jpeg 604w, https://jeroenjanssens.com/img/PE4IaqQ6Sa-907.jpeg 907w, https://jeroenjanssens.com/img/PE4IaqQ6Sa-1209.jpeg 1209w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/PE4IaqQ6Sa-302.jpeg 302w, https://jeroenjanssens.com/img/PE4IaqQ6Sa-453.jpeg 453w, https://jeroenjanssens.com/img/PE4IaqQ6Sa-604.jpeg 604w, https://jeroenjanssens.com/img/PE4IaqQ6Sa-907.jpeg 907w, https://jeroenjanssens.com/img/PE4IaqQ6Sa-1209.jpeg 1209w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 90%;" src="https://jeroenjanssens.com/img/PE4IaqQ6Sa-302.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
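<p>The difference between the two approaches comes down to which numbers end
up on the axis. Here is a minimal sketch in plain Python, not plotnine’s
implementation, assuming a handful of made-up prices and hypothetical breaks
at powers of ten:</p>

```python
import math

# Hypothetical prices in dollars (illustrative values, not taken from diamonds).
prices = [326, 1000, 2401, 10000, 18823]

# Transforming the data: the axis would be labelled with these log10 values.
transformed = [round(math.log10(p), 2) for p in prices]
print("axis values after transforming the data:", transformed)

# Transforming the scale: points land at the same positions, but each break
# keeps its label in the original units.
breaks = [100, 1000, 10000]
positions = [math.log10(b) for b in breaks]

for label, position in zip(breaks, positions):
    print(f"break labelled {label} is drawn at position {position:g}")
```

<p>The positions are the same in both cases; only the labels differ, which is
why the two plots look identical apart from their axes.</p>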
<p>Another scale that is frequently customised is colour. The default
categorical scale picks colours that are evenly spaced around the colour
wheel. Useful alternatives are the ColorBrewer scales, which have been
hand-tuned to work better for people with common types of colour
blindness. The two plots below look similar, but there is enough
difference in the shades of red and green that the dots on the right can
be distinguished even by people with red-green colour blindness.</p>
<pre class="language-python"><code class="language-python">ggplot<span class="token punctuation">(</span>mpg<span class="token punctuation">,</span> aes<span class="token punctuation">(</span><span class="token string">"displ"</span><span class="token punctuation">,</span> <span class="token string">"hwy"</span><span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token operator">+</span>\<br />geom_point<span class="token punctuation">(</span>aes<span class="token punctuation">(</span>color<span class="token operator">=</span><span class="token string">"drv"</span><span class="token punctuation">)</span><span class="token punctuation">)</span><br /><br />ggplot<span class="token punctuation">(</span>mpg<span class="token punctuation">,</span> aes<span class="token punctuation">(</span><span class="token string">"displ"</span><span class="token punctuation">,</span> <span class="token string">"hwy"</span><span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token operator">+</span>\<br />geom_point<span class="token punctuation">(</span>aes<span class="token punctuation">(</span>color<span class="token operator">=</span><span class="token string">"drv"</span><span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token operator">+</span>\<br />scale_colour_brewer<span class="token punctuation">(</span><span class="token builtin">type</span><span class="token operator">=</span><span class="token string">"qual"</span><span class="token punctuation">,</span> palette<span class="token operator">=</span><span class="token string">"Set1"</span><span class="token punctuation">)</span></code></pre>
<div class="flex flex-wrap md:flex-row mb-4">
<div class="mx-auto md:w-1/2 ">
<figure>
<a href="https://jeroenjanssens.com/img/plotnine-grammar-of-graphics-for-python/plotnine-unnamed-chunk-89-1.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/sdamiVraUJ-302.webp 302w, https://jeroenjanssens.com/img/sdamiVraUJ-453.webp 453w, https://jeroenjanssens.com/img/sdamiVraUJ-604.webp 604w, https://jeroenjanssens.com/img/sdamiVraUJ-907.webp 907w, https://jeroenjanssens.com/img/sdamiVraUJ-1209.webp 1209w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/sdamiVraUJ-302.webp 302w, https://jeroenjanssens.com/img/sdamiVraUJ-453.webp 453w, https://jeroenjanssens.com/img/sdamiVraUJ-604.webp 604w, https://jeroenjanssens.com/img/sdamiVraUJ-907.webp 907w, https://jeroenjanssens.com/img/sdamiVraUJ-1209.webp 1209w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/sdamiVraUJ-302.jpeg 302w, https://jeroenjanssens.com/img/sdamiVraUJ-453.jpeg 453w, https://jeroenjanssens.com/img/sdamiVraUJ-604.jpeg 604w, https://jeroenjanssens.com/img/sdamiVraUJ-907.jpeg 907w, https://jeroenjanssens.com/img/sdamiVraUJ-1209.jpeg 1209w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/sdamiVraUJ-302.jpeg 302w, https://jeroenjanssens.com/img/sdamiVraUJ-453.jpeg 453w, https://jeroenjanssens.com/img/sdamiVraUJ-604.jpeg 604w, https://jeroenjanssens.com/img/sdamiVraUJ-907.jpeg 907w, https://jeroenjanssens.com/img/sdamiVraUJ-1209.jpeg 1209w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 90%;" src="https://jeroenjanssens.com/img/sdamiVraUJ-302.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
</div>
<div class="mx-auto md:w-1/2 ">
<figure>
<a href="https://jeroenjanssens.com/img/plotnine-grammar-of-graphics-for-python/plotnine-unnamed-chunk-90-1.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/RKL-OsgHAi-302.webp 302w, https://jeroenjanssens.com/img/RKL-OsgHAi-453.webp 453w, https://jeroenjanssens.com/img/RKL-OsgHAi-604.webp 604w, https://jeroenjanssens.com/img/RKL-OsgHAi-907.webp 907w, https://jeroenjanssens.com/img/RKL-OsgHAi-1209.webp 1209w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/RKL-OsgHAi-302.webp 302w, https://jeroenjanssens.com/img/RKL-OsgHAi-453.webp 453w, https://jeroenjanssens.com/img/RKL-OsgHAi-604.webp 604w, https://jeroenjanssens.com/img/RKL-OsgHAi-907.webp 907w, https://jeroenjanssens.com/img/RKL-OsgHAi-1209.webp 1209w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/RKL-OsgHAi-302.jpeg 302w, https://jeroenjanssens.com/img/RKL-OsgHAi-453.jpeg 453w, https://jeroenjanssens.com/img/RKL-OsgHAi-604.jpeg 604w, https://jeroenjanssens.com/img/RKL-OsgHAi-907.jpeg 907w, https://jeroenjanssens.com/img/RKL-OsgHAi-1209.jpeg 1209w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/RKL-OsgHAi-302.jpeg 302w, https://jeroenjanssens.com/img/RKL-OsgHAi-453.jpeg 453w, https://jeroenjanssens.com/img/RKL-OsgHAi-604.jpeg 604w, https://jeroenjanssens.com/img/RKL-OsgHAi-907.jpeg 907w, https://jeroenjanssens.com/img/RKL-OsgHAi-1209.jpeg 1209w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 90%;" src="https://jeroenjanssens.com/img/RKL-OsgHAi-302.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
</div>
</div>
<p>Don’t forget simpler techniques. If there are just a few colours, you
can add a redundant shape mapping. This will also help ensure your plot
is interpretable in black and white.</p>
<pre class="language-python"><code class="language-python">ggplot<span class="token punctuation">(</span>mpg<span class="token punctuation">,</span> aes<span class="token punctuation">(</span><span class="token string">"displ"</span><span class="token punctuation">,</span> <span class="token string">"hwy"</span><span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token operator">+</span>\<br />geom_point<span class="token punctuation">(</span>aes<span class="token punctuation">(</span>color<span class="token operator">=</span><span class="token string">"drv"</span><span class="token punctuation">,</span> shape<span class="token operator">=</span><span class="token string">"drv"</span><span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token operator">+</span>\<br />scale_colour_brewer<span class="token punctuation">(</span><span class="token builtin">type</span><span class="token operator">=</span><span class="token string">"qual"</span><span class="token punctuation">,</span> palette<span class="token operator">=</span><span class="token string">"Set1"</span><span class="token punctuation">)</span></code></pre>
<figure>
<a href="https://jeroenjanssens.com/img/plotnine-grammar-of-graphics-for-python/plotnine-unnamed-chunk-91-1.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/A0LJTeMQWp-302.webp 302w, https://jeroenjanssens.com/img/A0LJTeMQWp-453.webp 453w, https://jeroenjanssens.com/img/A0LJTeMQWp-604.webp 604w, https://jeroenjanssens.com/img/A0LJTeMQWp-907.webp 907w, https://jeroenjanssens.com/img/A0LJTeMQWp-1209.webp 1209w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/A0LJTeMQWp-302.webp 302w, https://jeroenjanssens.com/img/A0LJTeMQWp-453.webp 453w, https://jeroenjanssens.com/img/A0LJTeMQWp-604.webp 604w, https://jeroenjanssens.com/img/A0LJTeMQWp-907.webp 907w, https://jeroenjanssens.com/img/A0LJTeMQWp-1209.webp 1209w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/A0LJTeMQWp-302.jpeg 302w, https://jeroenjanssens.com/img/A0LJTeMQWp-453.jpeg 453w, https://jeroenjanssens.com/img/A0LJTeMQWp-604.jpeg 604w, https://jeroenjanssens.com/img/A0LJTeMQWp-907.jpeg 907w, https://jeroenjanssens.com/img/A0LJTeMQWp-1209.jpeg 1209w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/A0LJTeMQWp-302.jpeg 302w, https://jeroenjanssens.com/img/A0LJTeMQWp-453.jpeg 453w, https://jeroenjanssens.com/img/A0LJTeMQWp-604.jpeg 604w, https://jeroenjanssens.com/img/A0LJTeMQWp-907.jpeg 907w, https://jeroenjanssens.com/img/A0LJTeMQWp-1209.jpeg 1209w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 90%;" src="https://jeroenjanssens.com/img/A0LJTeMQWp-302.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
<p>The ColorBrewer scales are documented online at
<a href="http://colorbrewer2.org/">http://colorbrewer2.org/</a> and made available in Python via the
<strong>mizani</strong> package, by Hassan Kibirige. The figure below shows the
complete list of palettes. The sequential (top) and diverging
(bottom) palettes are particularly useful if your categorical values are
ordered, or have a “middle”. This often arises if you’ve used <code>pd.cut()</code>
to turn a continuous variable into a categorical variable.</p>
<figure>
<a href="https://jeroenjanssens.com/img/plotnine-grammar-of-graphics-for-python/plotnine-brewer-1.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.5)" srcset="https://jeroenjanssens.com/img/knZHpgBzru-168.webp 168w, https://jeroenjanssens.com/img/knZHpgBzru-252.webp 252w, https://jeroenjanssens.com/img/knZHpgBzru-336.webp 336w, https://jeroenjanssens.com/img/knZHpgBzru-504.webp 504w, https://jeroenjanssens.com/img/knZHpgBzru-672.webp 672w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.5)" srcset="https://jeroenjanssens.com/img/knZHpgBzru-168.webp 168w, https://jeroenjanssens.com/img/knZHpgBzru-252.webp 252w, https://jeroenjanssens.com/img/knZHpgBzru-336.webp 336w, https://jeroenjanssens.com/img/knZHpgBzru-504.webp 504w, https://jeroenjanssens.com/img/knZHpgBzru-672.webp 672w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.5)" srcset="https://jeroenjanssens.com/img/knZHpgBzru-168.jpeg 168w, https://jeroenjanssens.com/img/knZHpgBzru-252.jpeg 252w, https://jeroenjanssens.com/img/knZHpgBzru-336.jpeg 336w, https://jeroenjanssens.com/img/knZHpgBzru-504.jpeg 504w, https://jeroenjanssens.com/img/knZHpgBzru-672.jpeg 672w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.5)" srcset="https://jeroenjanssens.com/img/knZHpgBzru-168.jpeg 168w, https://jeroenjanssens.com/img/knZHpgBzru-252.jpeg 252w, https://jeroenjanssens.com/img/knZHpgBzru-336.jpeg 336w, https://jeroenjanssens.com/img/knZHpgBzru-504.jpeg 504w, https://jeroenjanssens.com/img/knZHpgBzru-672.jpeg 672w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 50%;" src="https://jeroenjanssens.com/img/knZHpgBzru-168.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
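<p>To see why binned values pair naturally with a sequential palette, here is
a small sketch in plain Python. It stands in for <code>pd.cut()</code> rather than
calling it; the bin edges, labels, displacement values, and the three
light-to-dark blues are all illustrative assumptions:</p>

```python
from bisect import bisect_left

# Hypothetical bin edges and ordered labels (illustrative, not from the post).
EDGES = [2.0, 4.0]
LABELS = ["small", "medium", "large"]

# Light-to-dark blues: a sequential palette preserves the order of the bins.
PALETTE = {"small": "#deebf7", "medium": "#9ecae1", "large": "#3182bd"}

def to_category(value):
    """Return the label of the right-closed bin that contains value."""
    return LABELS[bisect_left(EDGES, value)]

# Made-up engine displacements in litres.
displacements = [1.6, 2.0, 3.5, 4.0, 5.7]
categories = [to_category(d) for d in displacements]
colours = [PALETTE[c] for c in categories]

for row in zip(displacements, categories, colours):
    print(row)
```

<p>Because the categories have an order, a palette that runs from light to
dark carries that order into the plot, which an evenly spaced colour wheel
would not.</p>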
<p>When you have a predefined mapping between values and colours, use
<code>scale_colour_manual()</code>. For example, if we map presidential party to
colour, we want to use the standard mapping of red for Republicans and
blue for Democrats:</p>
<pre class="language-python"><code class="language-python">presidential<span class="token punctuation">[</span><span class="token string">"id"</span><span class="token punctuation">]</span> <span class="token operator">=</span> <span class="token number">34</span> <span class="token operator">+</span> presidential<span class="token punctuation">.</span>index<br /><br />ggplot<span class="token punctuation">(</span>presidential<span class="token punctuation">,</span> aes<span class="token punctuation">(</span><span class="token string">"start"</span><span class="token punctuation">,</span> <span class="token string">"id"</span><span class="token punctuation">,</span> colour<span class="token operator">=</span><span class="token string">"party"</span><span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token operator">+</span>\<br />geom_point<span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token operator">+</span>\<br />geom_segment<span class="token punctuation">(</span>aes<span class="token punctuation">(</span>xend<span class="token operator">=</span><span class="token string">"end"</span><span class="token punctuation">,</span> yend<span class="token operator">=</span><span class="token string">"id"</span><span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token operator">+</span>\<br />scale_colour_manual<span class="token punctuation">(</span>values<span class="token operator">=</span><span class="token punctuation">[</span><span class="token string">"red"</span><span class="token punctuation">,</span> <span class="token string">"blue"</span><span class="token punctuation">]</span><span class="token punctuation">,</span> limits<span class="token operator">=</span><span class="token punctuation">[</span><span class="token string">"Republican"</span><span class="token punctuation">,</span> <span class="token string">"Democratic"</span><span class="token punctuation">]</span><span class="token punctuation">)</span></code></pre>
<figure>
<a href="https://jeroenjanssens.com/img/plotnine-grammar-of-graphics-for-python/plotnine-unnamed-chunk-93-1.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/PSn6ujYQ2V-302.webp 302w, https://jeroenjanssens.com/img/PSn6ujYQ2V-453.webp 453w, https://jeroenjanssens.com/img/PSn6ujYQ2V-604.webp 604w, https://jeroenjanssens.com/img/PSn6ujYQ2V-907.webp 907w, https://jeroenjanssens.com/img/PSn6ujYQ2V-1209.webp 1209w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/PSn6ujYQ2V-302.webp 302w, https://jeroenjanssens.com/img/PSn6ujYQ2V-453.webp 453w, https://jeroenjanssens.com/img/PSn6ujYQ2V-604.webp 604w, https://jeroenjanssens.com/img/PSn6ujYQ2V-907.webp 907w, https://jeroenjanssens.com/img/PSn6ujYQ2V-1209.webp 1209w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/PSn6ujYQ2V-302.jpeg 302w, https://jeroenjanssens.com/img/PSn6ujYQ2V-453.jpeg 453w, https://jeroenjanssens.com/img/PSn6ujYQ2V-604.jpeg 604w, https://jeroenjanssens.com/img/PSn6ujYQ2V-907.jpeg 907w, https://jeroenjanssens.com/img/PSn6ujYQ2V-1209.jpeg 1209w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/PSn6ujYQ2V-302.jpeg 302w, https://jeroenjanssens.com/img/PSn6ujYQ2V-453.jpeg 453w, https://jeroenjanssens.com/img/PSn6ujYQ2V-604.jpeg 604w, https://jeroenjanssens.com/img/PSn6ujYQ2V-907.jpeg 907w, https://jeroenjanssens.com/img/PSn6ujYQ2V-1209.jpeg 1209w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 90%;" src="https://jeroenjanssens.com/img/PSn6ujYQ2V-302.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
<p>For continuous colour, you can use the built-in
<code>scale_colour_gradient()</code> or <code>scale_fill_gradient()</code>. If you have a
diverging scale, you can use <code>scale_colour_gradient2()</code>. That allows you
to give, for example, positive and negative values different colours.
That’s sometimes also useful if you want to distinguish points above or
below the mean.</p>
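<p>The idea behind a diverging scale can be sketched in plain Python. This is
a simplified illustration that interpolates linearly in RGB around an assumed
midpoint of zero; it is not how the library computes its colours, and the
function names and parameters are hypothetical:</p>

```python
def lerp(a, b, t):
    """Linearly interpolate between the numbers a and b for t in [0, 1]."""
    return a + (b - a) * t

def diverging_colour(value, low, mid, high, midpoint, vmin, vmax):
    """Map value to an RGB tuple, blending mid to high above the midpoint
    and mid to low below it."""
    if value >= midpoint:
        t = (value - midpoint) / (vmax - midpoint)
        start, end = mid, high
    else:
        t = (midpoint - value) / (midpoint - vmin)
        start, end = mid, low
    return tuple(round(lerp(s, e, t)) for s, e in zip(start, end))

BLUE, WHITE, RED = (0, 0, 255), (255, 255, 255), (255, 0, 0)

# Negative values shade towards blue, positive values towards red.
for v in (-2, -1, 0, 1, 2):
    print(v, diverging_colour(v, BLUE, WHITE, RED, midpoint=0, vmin=-2, vmax=2))
```

<p>Setting the midpoint to the mean of a variable instead of zero gives the
“above or below the mean” colouring described above.</p>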
<p>Note that all colour scales come in two varieties: <code>scale_colour_x()</code> and
<code>scale_fill_x()</code> for the <code>colour</code> and <code>fill</code> aesthetics respectively
(the colour scales are available in both UK and US spellings).</p>
<h3><a href="https://r4ds.had.co.nz/graphics-for-communication.html#exercises-73">28.4.4</a> Exercises</h3>
<ol>
<li>
<p>Why doesn’t the following code override the default scale?</p>
<pre class="language-python"><code class="language-python">ggplot<span class="token punctuation">(</span>df<span class="token punctuation">,</span> aes<span class="token punctuation">(</span><span class="token string">"x"</span><span class="token punctuation">,</span> <span class="token string">"y"</span><span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token operator">+</span>\<br />geom_hex<span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token operator">+</span>\<br />scale_colour_gradient<span class="token punctuation">(</span>low<span class="token operator">=</span><span class="token string">"white"</span><span class="token punctuation">,</span> high<span class="token operator">=</span><span class="token string">"red"</span><span class="token punctuation">)</span> <span class="token operator">+</span>\<br />coord_fixed<span class="token punctuation">(</span><span class="token punctuation">)</span></code></pre>
</li>
<li>
<p>What is the first argument to every scale? How does it compare to
<code>labs()</code>?</p>
</li>
<li>
<p>Change the display of the presidential terms by:</p>
<ol>
<li>Combining the two variants shown above.</li>
<li>Improving the display of the y axis.</li>
<li>Labelling each term with the name of the president.</li>
<li>Adding informative plot labels.</li>
<li>Placing breaks every 4 years (this is trickier than it seems!).</li>
</ol>
</li>
<li>
<p>Use <code>override_aes</code> to make the legend on the following plot easier
to see.</p>
<pre class="language-python"><code class="language-python">ggplot<span class="token punctuation">(</span>diamonds<span class="token punctuation">,</span> aes<span class="token punctuation">(</span><span class="token string">"carat"</span><span class="token punctuation">,</span> <span class="token string">"price"</span><span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token operator">+</span>\<br />geom_point<span class="token punctuation">(</span>aes<span class="token punctuation">(</span>colour<span class="token operator">=</span><span class="token string">"cut"</span><span class="token punctuation">)</span><span class="token punctuation">,</span> alpha<span class="token operator">=</span><span class="token number">1</span><span class="token operator">/</span><span class="token number">20</span><span class="token punctuation">)</span></code></pre>
<figure>
<a href="https://jeroenjanssens.com/img/plotnine-grammar-of-graphics-for-python/plotnine-unnamed-chunk-94-1.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/864z800n4n-302.webp 302w, https://jeroenjanssens.com/img/864z800n4n-453.webp 453w, https://jeroenjanssens.com/img/864z800n4n-604.webp 604w, https://jeroenjanssens.com/img/864z800n4n-907.webp 907w, https://jeroenjanssens.com/img/864z800n4n-1209.webp 1209w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/864z800n4n-302.webp 302w, https://jeroenjanssens.com/img/864z800n4n-453.webp 453w, https://jeroenjanssens.com/img/864z800n4n-604.webp 604w, https://jeroenjanssens.com/img/864z800n4n-907.webp 907w, https://jeroenjanssens.com/img/864z800n4n-1209.webp 1209w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/864z800n4n-302.jpeg 302w, https://jeroenjanssens.com/img/864z800n4n-453.jpeg 453w, https://jeroenjanssens.com/img/864z800n4n-604.jpeg 604w, https://jeroenjanssens.com/img/864z800n4n-907.jpeg 907w, https://jeroenjanssens.com/img/864z800n4n-1209.jpeg 1209w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/864z800n4n-302.jpeg 302w, https://jeroenjanssens.com/img/864z800n4n-453.jpeg 453w, https://jeroenjanssens.com/img/864z800n4n-604.jpeg 604w, https://jeroenjanssens.com/img/864z800n4n-907.jpeg 907w, https://jeroenjanssens.com/img/864z800n4n-1209.jpeg 1209w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 90%;" src="https://jeroenjanssens.com/img/864z800n4n-302.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
</li>
</ol>
<h2><a href="https://r4ds.had.co.nz/graphics-for-communication.html#zooming">28.5</a> Zooming</h2>
<p>There are three ways to control the plot limits:</p>
<ol>
<li>Adjusting what data are plotted</li>
<li>Setting the limits in each scale</li>
<li>Setting <code>xlim</code> and <code>ylim</code> in <code>coord_cartesian()</code></li>
</ol>
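<p>The first and third approaches can give different pictures whenever a layer
summarises the data. A rough sketch in plain Python, with a simple mean
standing in for a statistical layer such as a smoother and with made-up
values, shows why:</p>

```python
from statistics import mean

# Hypothetical (displ, hwy) pairs; illustrative values only.
cars = [
    (1.8, 29), (2.0, 31), (3.1, 26), (4.7, 17),
    (5.3, 20), (5.7, 26), (6.2, 25), (7.0, 24),
]

def in_window(x, lo, hi):
    """True when x lies inside the closed interval [lo, hi]."""
    return min(max(x, lo), hi) == x

# Adjusting the data: the summary only ever sees rows inside the window.
mean_filtered = mean(hwy for displ, hwy in cars if in_window(displ, 5, 7))

# Zooming: the summary is computed on every row; only the view is narrowed.
mean_all = mean(hwy for _, hwy in cars)

print("summary after filtering:", mean_filtered)
print("summary on all rows:", mean_all)
```

<p>The two summaries disagree because filtering changes what the summary is
computed from, whereas zooming only changes which part of the result is
shown.</p>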
<p>To zoom in on a region of the plot, it’s generally best to use
<code>coord_cartesian()</code>. Compare the following two plots:</p>
<pre class="language-python"><code class="language-python">ggplot<span class="token punctuation">(</span>mpg<span class="token punctuation">,</span> aes<span class="token punctuation">(</span><span class="token string">"displ"</span><span class="token punctuation">,</span> <span class="token string">"hwy"</span><span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token operator">+</span>\<br />geom_point<span class="token punctuation">(</span>aes<span class="token punctuation">(</span>color<span class="token operator">=</span><span class="token string">"class"</span><span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token operator">+</span>\<br />geom_smooth<span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token operator">+</span>\<br />coord_cartesian<span class="token punctuation">(</span>xlim<span class="token operator">=</span><span class="token punctuation">(</span><span class="token number">5</span><span class="token punctuation">,</span> <span class="token number">7</span><span class="token punctuation">)</span><span class="token punctuation">,</span> ylim<span class="token operator">=</span><span class="token punctuation">(</span><span class="token number">10</span><span class="token punctuation">,</span> <span class="token number">30</span><span class="token punctuation">)</span><span class="token punctuation">)</span><br /><br />ggplot<span class="token punctuation">(</span>mpg<span class="token punctuation">.</span>query<span class="token punctuation">(</span><span class="token string">"5 <= displ <= 7 and 10 <= hwy <= 30"</span><span class="token punctuation">)</span><span class="token punctuation">,</span> aes<span class="token punctuation">(</span><span class="token string">"displ"</span><span class="token punctuation">,</span> <span class="token string">"hwy"</span><span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token operator">+</span>\<br />geom_point<span class="token punctuation">(</span>aes<span class="token punctuation">(</span>color<span class="token operator">=</span><span class="token string">"class"</span><span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token operator">+</span>\<br />geom_smooth<span class="token punctuation">(</span><span class="token punctuation">)</span></code></pre>
<div class="flex flex-wrap md:flex-row mb-4">
<div class="mx-auto md:w-1/2 ">
<figure>
<a href="https://jeroenjanssens.com/img/plotnine-grammar-of-graphics-for-python/plotnine-unnamed-chunk-96-1.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/qKo1FxFadH-302.webp 302w, https://jeroenjanssens.com/img/qKo1FxFadH-453.webp 453w, https://jeroenjanssens.com/img/qKo1FxFadH-604.webp 604w, https://jeroenjanssens.com/img/qKo1FxFadH-907.webp 907w, https://jeroenjanssens.com/img/qKo1FxFadH-1209.webp 1209w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/qKo1FxFadH-302.webp 302w, https://jeroenjanssens.com/img/qKo1FxFadH-453.webp 453w, https://jeroenjanssens.com/img/qKo1FxFadH-604.webp 604w, https://jeroenjanssens.com/img/qKo1FxFadH-907.webp 907w, https://jeroenjanssens.com/img/qKo1FxFadH-1209.webp 1209w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/qKo1FxFadH-302.jpeg 302w, https://jeroenjanssens.com/img/qKo1FxFadH-453.jpeg 453w, https://jeroenjanssens.com/img/qKo1FxFadH-604.jpeg 604w, https://jeroenjanssens.com/img/qKo1FxFadH-907.jpeg 907w, https://jeroenjanssens.com/img/qKo1FxFadH-1209.jpeg 1209w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/qKo1FxFadH-302.jpeg 302w, https://jeroenjanssens.com/img/qKo1FxFadH-453.jpeg 453w, https://jeroenjanssens.com/img/qKo1FxFadH-604.jpeg 604w, https://jeroenjanssens.com/img/qKo1FxFadH-907.jpeg 907w, https://jeroenjanssens.com/img/qKo1FxFadH-1209.jpeg 1209w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 90%;" src="https://jeroenjanssens.com/img/qKo1FxFadH-302.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
</div>
<div class="mx-auto md:w-1/2 ">
<figure>
<a href="https://jeroenjanssens.com/img/plotnine-grammar-of-graphics-for-python/plotnine-unnamed-chunk-97-1.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/OlDVZkE7FR-302.webp 302w, https://jeroenjanssens.com/img/OlDVZkE7FR-453.webp 453w, https://jeroenjanssens.com/img/OlDVZkE7FR-604.webp 604w, https://jeroenjanssens.com/img/OlDVZkE7FR-907.webp 907w, https://jeroenjanssens.com/img/OlDVZkE7FR-1209.webp 1209w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/OlDVZkE7FR-302.webp 302w, https://jeroenjanssens.com/img/OlDVZkE7FR-453.webp 453w, https://jeroenjanssens.com/img/OlDVZkE7FR-604.webp 604w, https://jeroenjanssens.com/img/OlDVZkE7FR-907.webp 907w, https://jeroenjanssens.com/img/OlDVZkE7FR-1209.webp 1209w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/OlDVZkE7FR-302.jpeg 302w, https://jeroenjanssens.com/img/OlDVZkE7FR-453.jpeg 453w, https://jeroenjanssens.com/img/OlDVZkE7FR-604.jpeg 604w, https://jeroenjanssens.com/img/OlDVZkE7FR-907.jpeg 907w, https://jeroenjanssens.com/img/OlDVZkE7FR-1209.jpeg 1209w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/OlDVZkE7FR-302.jpeg 302w, https://jeroenjanssens.com/img/OlDVZkE7FR-453.jpeg 453w, https://jeroenjanssens.com/img/OlDVZkE7FR-604.jpeg 604w, https://jeroenjanssens.com/img/OlDVZkE7FR-907.jpeg 907w, https://jeroenjanssens.com/img/OlDVZkE7FR-1209.jpeg 1209w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 90%;" src="https://jeroenjanssens.com/img/OlDVZkE7FR-302.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
</div>
</div>
<p>You can also set the <code>limits</code> on individual scales. Reducing the limits
is basically equivalent to subsetting the data. It is generally more
useful if you want to <em>expand</em> the limits, for example, to match scales
across different plots. For example, if we extract two classes of cars
and plot them separately, it’s difficult to compare the plots because
all three scales (the x-axis, the y-axis, and the colour aesthetic) have
different ranges.</p>
<pre class="language-python"><code class="language-python">mpg<span class="token punctuation">[</span><span class="token string">"drv"</span><span class="token punctuation">]</span> <span class="token operator">=</span> mpg<span class="token punctuation">[</span><span class="token string">"drv"</span><span class="token punctuation">]</span><span class="token punctuation">.</span>astype<span class="token punctuation">(</span><span class="token builtin">str</span><span class="token punctuation">)</span><br />suv <span class="token operator">=</span> mpg<span class="token punctuation">[</span>mpg<span class="token punctuation">[</span><span class="token string">"class"</span><span class="token punctuation">]</span> <span class="token operator">==</span> <span class="token string">"suv"</span><span class="token punctuation">]</span><br />compact <span class="token operator">=</span> mpg<span class="token punctuation">[</span>mpg<span class="token punctuation">[</span><span class="token string">"class"</span><span class="token punctuation">]</span> <span class="token operator">==</span> <span class="token string">"compact"</span><span class="token punctuation">]</span><br /><br />ggplot<span class="token punctuation">(</span>suv<span class="token punctuation">,</span> aes<span class="token punctuation">(</span><span class="token string">"displ"</span><span class="token punctuation">,</span> <span class="token string">"hwy"</span><span class="token punctuation">,</span> colour<span class="token operator">=</span><span class="token string">"drv"</span><span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token operator">+</span>\<br />geom_point<span class="token punctuation">(</span><span class="token punctuation">)</span><br /><br />ggplot<span class="token punctuation">(</span>compact<span class="token punctuation">,</span> aes<span class="token punctuation">(</span><span class="token string">"displ"</span><span class="token 
punctuation">,</span> <span class="token string">"hwy"</span><span class="token punctuation">,</span> colour<span class="token operator">=</span><span class="token string">"drv"</span><span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token operator">+</span>\<br />geom_point<span class="token punctuation">(</span><span class="token punctuation">)</span></code></pre>
<div class="flex flex-wrap md:flex-row mb-4">
<div class="mx-auto md:w-1/2 ">
<figure>
<a href="https://jeroenjanssens.com/img/plotnine-grammar-of-graphics-for-python/plotnine-unnamed-chunk-99-1.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/ECYn0qvW-b-302.webp 302w, https://jeroenjanssens.com/img/ECYn0qvW-b-453.webp 453w, https://jeroenjanssens.com/img/ECYn0qvW-b-604.webp 604w, https://jeroenjanssens.com/img/ECYn0qvW-b-907.webp 907w, https://jeroenjanssens.com/img/ECYn0qvW-b-1209.webp 1209w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/ECYn0qvW-b-302.webp 302w, https://jeroenjanssens.com/img/ECYn0qvW-b-453.webp 453w, https://jeroenjanssens.com/img/ECYn0qvW-b-604.webp 604w, https://jeroenjanssens.com/img/ECYn0qvW-b-907.webp 907w, https://jeroenjanssens.com/img/ECYn0qvW-b-1209.webp 1209w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/ECYn0qvW-b-302.jpeg 302w, https://jeroenjanssens.com/img/ECYn0qvW-b-453.jpeg 453w, https://jeroenjanssens.com/img/ECYn0qvW-b-604.jpeg 604w, https://jeroenjanssens.com/img/ECYn0qvW-b-907.jpeg 907w, https://jeroenjanssens.com/img/ECYn0qvW-b-1209.jpeg 1209w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/ECYn0qvW-b-302.jpeg 302w, https://jeroenjanssens.com/img/ECYn0qvW-b-453.jpeg 453w, https://jeroenjanssens.com/img/ECYn0qvW-b-604.jpeg 604w, https://jeroenjanssens.com/img/ECYn0qvW-b-907.jpeg 907w, https://jeroenjanssens.com/img/ECYn0qvW-b-1209.jpeg 1209w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 90%;" src="https://jeroenjanssens.com/img/ECYn0qvW-b-302.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
</div>
<div class="mx-auto md:w-1/2 ">
<figure>
<a href="https://jeroenjanssens.com/img/plotnine-grammar-of-graphics-for-python/plotnine-unnamed-chunk-100-1.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/-8F3w81f_u-302.webp 302w, https://jeroenjanssens.com/img/-8F3w81f_u-453.webp 453w, https://jeroenjanssens.com/img/-8F3w81f_u-604.webp 604w, https://jeroenjanssens.com/img/-8F3w81f_u-907.webp 907w, https://jeroenjanssens.com/img/-8F3w81f_u-1209.webp 1209w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/-8F3w81f_u-302.webp 302w, https://jeroenjanssens.com/img/-8F3w81f_u-453.webp 453w, https://jeroenjanssens.com/img/-8F3w81f_u-604.webp 604w, https://jeroenjanssens.com/img/-8F3w81f_u-907.webp 907w, https://jeroenjanssens.com/img/-8F3w81f_u-1209.webp 1209w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/-8F3w81f_u-302.jpeg 302w, https://jeroenjanssens.com/img/-8F3w81f_u-453.jpeg 453w, https://jeroenjanssens.com/img/-8F3w81f_u-604.jpeg 604w, https://jeroenjanssens.com/img/-8F3w81f_u-907.jpeg 907w, https://jeroenjanssens.com/img/-8F3w81f_u-1209.jpeg 1209w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/-8F3w81f_u-302.jpeg 302w, https://jeroenjanssens.com/img/-8F3w81f_u-453.jpeg 453w, https://jeroenjanssens.com/img/-8F3w81f_u-604.jpeg 604w, https://jeroenjanssens.com/img/-8F3w81f_u-907.jpeg 907w, https://jeroenjanssens.com/img/-8F3w81f_u-1209.jpeg 1209w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 90%;" src="https://jeroenjanssens.com/img/-8F3w81f_u-302.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
</div>
</div>
<p>One way to overcome this problem is to share scales across multiple
plots, training the scales with the <code>limits</code> of the full data.</p>
<pre class="language-python"><code class="language-python">x_scale <span class="token operator">=</span> scale_x_continuous<span class="token punctuation">(</span>limits<span class="token operator">=</span><span class="token punctuation">(</span>mpg<span class="token punctuation">.</span>displ<span class="token punctuation">.</span><span class="token builtin">min</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">,</span> mpg<span class="token punctuation">.</span>displ<span class="token punctuation">.</span><span class="token builtin">max</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">)</span><br />y_scale <span class="token operator">=</span> scale_y_continuous<span class="token punctuation">(</span>limits<span class="token operator">=</span><span class="token punctuation">(</span>mpg<span class="token punctuation">.</span>hwy<span class="token punctuation">.</span><span class="token builtin">min</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">,</span> mpg<span class="token punctuation">.</span>hwy<span class="token punctuation">.</span><span class="token builtin">max</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">)</span><br />col_scale <span class="token operator">=</span> scale_colour_discrete<span class="token punctuation">(</span>limits<span class="token operator">=</span>mpg<span class="token punctuation">.</span>drv<span class="token punctuation">.</span>unique<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">)</span><br /><br />ggplot<span class="token punctuation">(</span>suv<span class="token punctuation">,</span> aes<span class="token 
punctuation">(</span><span class="token string">"displ"</span><span class="token punctuation">,</span> <span class="token string">"hwy"</span><span class="token punctuation">,</span> colour<span class="token operator">=</span><span class="token string">"drv"</span><span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token operator">+</span>\<br />geom_point<span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token operator">+</span>\<br />x_scale <span class="token operator">+</span>\<br />y_scale <span class="token operator">+</span>\<br />col_scale<br /><br />ggplot<span class="token punctuation">(</span>compact<span class="token punctuation">,</span> aes<span class="token punctuation">(</span><span class="token string">"displ"</span><span class="token punctuation">,</span> <span class="token string">"hwy"</span><span class="token punctuation">,</span> colour<span class="token operator">=</span><span class="token string">"drv"</span><span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token operator">+</span>\<br />geom_point<span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token operator">+</span>\<br />x_scale <span class="token operator">+</span>\<br />y_scale <span class="token operator">+</span>\<br />col_scale</code></pre>
<div class="flex flex-wrap md:flex-row mb-4">
<div class="mx-auto md:w-1/2 ">
<figure>
<a href="https://jeroenjanssens.com/img/plotnine-grammar-of-graphics-for-python/plotnine-unnamed-chunk-102-1.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/kil3N36vm3-302.webp 302w, https://jeroenjanssens.com/img/kil3N36vm3-453.webp 453w, https://jeroenjanssens.com/img/kil3N36vm3-604.webp 604w, https://jeroenjanssens.com/img/kil3N36vm3-907.webp 907w, https://jeroenjanssens.com/img/kil3N36vm3-1209.webp 1209w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/kil3N36vm3-302.webp 302w, https://jeroenjanssens.com/img/kil3N36vm3-453.webp 453w, https://jeroenjanssens.com/img/kil3N36vm3-604.webp 604w, https://jeroenjanssens.com/img/kil3N36vm3-907.webp 907w, https://jeroenjanssens.com/img/kil3N36vm3-1209.webp 1209w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/kil3N36vm3-302.jpeg 302w, https://jeroenjanssens.com/img/kil3N36vm3-453.jpeg 453w, https://jeroenjanssens.com/img/kil3N36vm3-604.jpeg 604w, https://jeroenjanssens.com/img/kil3N36vm3-907.jpeg 907w, https://jeroenjanssens.com/img/kil3N36vm3-1209.jpeg 1209w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/kil3N36vm3-302.jpeg 302w, https://jeroenjanssens.com/img/kil3N36vm3-453.jpeg 453w, https://jeroenjanssens.com/img/kil3N36vm3-604.jpeg 604w, https://jeroenjanssens.com/img/kil3N36vm3-907.jpeg 907w, https://jeroenjanssens.com/img/kil3N36vm3-1209.jpeg 1209w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 90%;" src="https://jeroenjanssens.com/img/kil3N36vm3-302.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
</div>
<div class="mx-auto md:w-1/2 ">
<figure>
<a href="https://jeroenjanssens.com/img/plotnine-grammar-of-graphics-for-python/plotnine-unnamed-chunk-103-1.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/O1w223b8L3-302.webp 302w, https://jeroenjanssens.com/img/O1w223b8L3-453.webp 453w, https://jeroenjanssens.com/img/O1w223b8L3-604.webp 604w, https://jeroenjanssens.com/img/O1w223b8L3-907.webp 907w, https://jeroenjanssens.com/img/O1w223b8L3-1209.webp 1209w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/O1w223b8L3-302.webp 302w, https://jeroenjanssens.com/img/O1w223b8L3-453.webp 453w, https://jeroenjanssens.com/img/O1w223b8L3-604.webp 604w, https://jeroenjanssens.com/img/O1w223b8L3-907.webp 907w, https://jeroenjanssens.com/img/O1w223b8L3-1209.webp 1209w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/O1w223b8L3-302.jpeg 302w, https://jeroenjanssens.com/img/O1w223b8L3-453.jpeg 453w, https://jeroenjanssens.com/img/O1w223b8L3-604.jpeg 604w, https://jeroenjanssens.com/img/O1w223b8L3-907.jpeg 907w, https://jeroenjanssens.com/img/O1w223b8L3-1209.jpeg 1209w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/O1w223b8L3-302.jpeg 302w, https://jeroenjanssens.com/img/O1w223b8L3-453.jpeg 453w, https://jeroenjanssens.com/img/O1w223b8L3-604.jpeg 604w, https://jeroenjanssens.com/img/O1w223b8L3-907.jpeg 907w, https://jeroenjanssens.com/img/O1w223b8L3-1209.jpeg 1209w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 90%;" src="https://jeroenjanssens.com/img/O1w223b8L3-302.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
</div>
</div>
<p>In this particular case, you could have simply used faceting, but this
technique is useful more generally, for instance if you want to spread
plots over multiple pages of a report.</p>
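<p>As a minimal sketch of that faceting alternative (assuming the <code>mpg</code> data that ships with <code>plotnine.data</code>; the name <code>two_classes</code> is only for illustration), a single faceted plot shares its x, y, and colour scales across panels by default:</p>
<pre class="language-python"><code class="language-python">from plotnine import aes, facet_wrap, geom_point, ggplot
from plotnine.data import mpg

# Keep only the two classes compared above.
two_classes = mpg[mpg["class"].isin(["suv", "compact"])]

# One panel per class present in the data; facets share all scales by default.
p = (
    ggplot(two_classes, aes("displ", "hwy", colour="drv"))
    + geom_point()
    + facet_wrap("class")
)
p</code></pre>
<p>Separate plots with shared scale objects remain the better fit when the plots cannot live in one figure.</p>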
<h2><a href="https://r4ds.had.co.nz/graphics-for-communication.html#themes">28.6</a> Themes</h2>
<p>Finally, you can customise the non-data elements of your plot with a
theme:</p>
<pre class="language-python"><code class="language-python">ggplot<span class="token punctuation">(</span>mpg<span class="token punctuation">,</span> aes<span class="token punctuation">(</span><span class="token string">"displ"</span><span class="token punctuation">,</span> <span class="token string">"hwy"</span><span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token operator">+</span>\<br />geom_point<span class="token punctuation">(</span>aes<span class="token punctuation">(</span>color<span class="token operator">=</span><span class="token string">"class"</span><span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token operator">+</span>\<br />geom_smooth<span class="token punctuation">(</span>se<span class="token operator">=</span><span class="token boolean">False</span><span class="token punctuation">)</span> <span class="token operator">+</span>\<br />theme_xkcd<span class="token punctuation">(</span><span class="token punctuation">)</span></code></pre>
<figure>
<a href="https://jeroenjanssens.com/img/plotnine-grammar-of-graphics-for-python/plotnine-unnamed-chunk-104-1.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/LPEnCfZAlm-302.webp 302w, https://jeroenjanssens.com/img/LPEnCfZAlm-453.webp 453w, https://jeroenjanssens.com/img/LPEnCfZAlm-604.webp 604w, https://jeroenjanssens.com/img/LPEnCfZAlm-907.webp 907w, https://jeroenjanssens.com/img/LPEnCfZAlm-1209.webp 1209w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/LPEnCfZAlm-302.webp 302w, https://jeroenjanssens.com/img/LPEnCfZAlm-453.webp 453w, https://jeroenjanssens.com/img/LPEnCfZAlm-604.webp 604w, https://jeroenjanssens.com/img/LPEnCfZAlm-907.webp 907w, https://jeroenjanssens.com/img/LPEnCfZAlm-1209.webp 1209w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/LPEnCfZAlm-302.jpeg 302w, https://jeroenjanssens.com/img/LPEnCfZAlm-453.jpeg 453w, https://jeroenjanssens.com/img/LPEnCfZAlm-604.jpeg 604w, https://jeroenjanssens.com/img/LPEnCfZAlm-907.jpeg 907w, https://jeroenjanssens.com/img/LPEnCfZAlm-1209.jpeg 1209w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/LPEnCfZAlm-302.jpeg 302w, https://jeroenjanssens.com/img/LPEnCfZAlm-453.jpeg 453w, https://jeroenjanssens.com/img/LPEnCfZAlm-604.jpeg 604w, https://jeroenjanssens.com/img/LPEnCfZAlm-907.jpeg 907w, https://jeroenjanssens.com/img/LPEnCfZAlm-1209.jpeg 1209w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 90%;" src="https://jeroenjanssens.com/img/LPEnCfZAlm-302.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
<p>plotnine includes twelve themes by default. The figure below shows eight
of those. The
<a href="https://plotnine.readthedocs.io/en/stable/api.html#themes">documentation</a>
lists all available themes.</p>
<figure>
<a href="https://jeroenjanssens.com/img/plotnine-grammar-of-graphics-for-python/plotnine-visualization-themes.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/9b0iHhgK0V-302.webp 302w, https://jeroenjanssens.com/img/9b0iHhgK0V-453.webp 453w, https://jeroenjanssens.com/img/9b0iHhgK0V-604.webp 604w, https://jeroenjanssens.com/img/9b0iHhgK0V-907.webp 907w, https://jeroenjanssens.com/img/9b0iHhgK0V-1209.webp 1209w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/9b0iHhgK0V-302.webp 302w, https://jeroenjanssens.com/img/9b0iHhgK0V-453.webp 453w, https://jeroenjanssens.com/img/9b0iHhgK0V-604.webp 604w, https://jeroenjanssens.com/img/9b0iHhgK0V-907.webp 907w, https://jeroenjanssens.com/img/9b0iHhgK0V-1209.webp 1209w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/9b0iHhgK0V-302.jpeg 302w, https://jeroenjanssens.com/img/9b0iHhgK0V-453.jpeg 453w, https://jeroenjanssens.com/img/9b0iHhgK0V-604.jpeg 604w, https://jeroenjanssens.com/img/9b0iHhgK0V-907.jpeg 907w, https://jeroenjanssens.com/img/9b0iHhgK0V-1209.jpeg 1209w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/9b0iHhgK0V-302.jpeg 302w, https://jeroenjanssens.com/img/9b0iHhgK0V-453.jpeg 453w, https://jeroenjanssens.com/img/9b0iHhgK0V-604.jpeg 604w, https://jeroenjanssens.com/img/9b0iHhgK0V-907.jpeg 907w, https://jeroenjanssens.com/img/9b0iHhgK0V-1209.jpeg 1209w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 90%;" src="https://jeroenjanssens.com/img/9b0iHhgK0V-302.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
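<p>A complete theme is added to a plot like any other component. The sketch below builds the same plot with a few of the built-in themes; the selection of themes and the <code>themed</code> dictionary are arbitrary choices for illustration.</p>
<pre class="language-python"><code class="language-python">from plotnine import (
    aes,
    geom_point,
    ggplot,
    theme_bw,
    theme_classic,
    theme_gray,
    theme_minimal,
)
from plotnine.data import mpg

base = ggplot(mpg, aes("displ", "hwy")) + geom_point(aes(color="class"))

# Map each theme's name to a copy of the base plot with that theme applied.
themed = {
    theme.__name__: base + theme()
    for theme in (theme_gray, theme_bw, theme_minimal, theme_classic)
}

themed["theme_minimal"]</code></pre>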
<p>Many people wonder why the default theme has a grey background. This was
a deliberate choice because it puts the data forward while still making
the grid lines visible. The white grid lines are visible (which is
important because they significantly aid position judgements), but they
have little visual impact and we can easily tune them out. The grey
background gives the plot a similar typographic colour to the text,
ensuring that the graphics fit in with the flow of a document without
jumping out with a bright white background. Finally, the grey background
creates a continuous field of colour which ensures that the plot is
perceived as a single visual entity.</p>
<p>It’s also possible to control individual components of each theme, like
the size and colour of the font used for the y axis. Unfortunately, this
level of detail is outside the scope of this book, so you’ll need to
read the <a href="https://amzn.com/331924275X">ggplot2 book</a> for the full
details. You can also create your own themes, if you are trying to match
a particular corporate or journal style.</p>
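<p>For a taste of what that looks like in plotnine, here is a small sketch that adjusts only the y-axis tick labels using <code>theme()</code> and <code>element_text()</code>. The size and colour values are arbitrary.</p>
<pre class="language-python"><code class="language-python">from plotnine import aes, element_text, geom_point, ggplot, theme, theme_gray
from plotnine.data import mpg

p = (
    ggplot(mpg, aes("displ", "hwy"))
    + geom_point()
    + theme_gray()
    # Override a single component of the complete theme.
    + theme(axis_text_y=element_text(size=14, color="darkred"))
)
p</code></pre>
<p>Because the <code>theme()</code> call comes after <code>theme_gray()</code>, it only overrides the component it names and leaves the rest of the theme untouched.</p>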
<h2><a href="https://r4ds.had.co.nz/graphics-for-communication.html#saving-your-plots">28.7</a> Saving your plots</h2>
<p>The best way to get your plots out of Python and into your final
write-up<sup class="footnote-ref"><a href="https://jeroenjanssens.com/plotnine/#fn13" id="fnref13">[13]</a></sup> is with the <code>.save()</code> method. There’s also the <code>ggsave()</code>
function, but the plotnine documentation doesn’t recommend using this.
The <code>.save()</code> method will save the plot to disk. In a Jupyter Notebook
you can refer to the last returned value using <code>_</code>. Alternatively, you can
first assign your plot to a variable.</p>
<pre class="language-python"><code class="language-python">ggplot<span class="token punctuation">(</span>mpg<span class="token punctuation">,</span> aes<span class="token punctuation">(</span><span class="token string">"displ"</span><span class="token punctuation">,</span> <span class="token string">"hwy"</span><span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token operator">+</span> geom_point<span class="token punctuation">(</span><span class="token punctuation">)</span></code></pre>
<figure>
<a href="https://jeroenjanssens.com/img/plotnine-grammar-of-graphics-for-python/plotnine-unnamed-chunk-106-1.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/EauI05WqZD-302.webp 302w, https://jeroenjanssens.com/img/EauI05WqZD-453.webp 453w, https://jeroenjanssens.com/img/EauI05WqZD-604.webp 604w, https://jeroenjanssens.com/img/EauI05WqZD-907.webp 907w, https://jeroenjanssens.com/img/EauI05WqZD-1209.webp 1209w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/EauI05WqZD-302.webp 302w, https://jeroenjanssens.com/img/EauI05WqZD-453.webp 453w, https://jeroenjanssens.com/img/EauI05WqZD-604.webp 604w, https://jeroenjanssens.com/img/EauI05WqZD-907.webp 907w, https://jeroenjanssens.com/img/EauI05WqZD-1209.webp 1209w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/EauI05WqZD-302.jpeg 302w, https://jeroenjanssens.com/img/EauI05WqZD-453.jpeg 453w, https://jeroenjanssens.com/img/EauI05WqZD-604.jpeg 604w, https://jeroenjanssens.com/img/EauI05WqZD-907.jpeg 907w, https://jeroenjanssens.com/img/EauI05WqZD-1209.jpeg 1209w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/EauI05WqZD-302.jpeg 302w, https://jeroenjanssens.com/img/EauI05WqZD-453.jpeg 453w, https://jeroenjanssens.com/img/EauI05WqZD-604.jpeg 604w, https://jeroenjanssens.com/img/EauI05WqZD-907.jpeg 907w, https://jeroenjanssens.com/img/EauI05WqZD-1209.jpeg 1209w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 90%;" src="https://jeroenjanssens.com/img/EauI05WqZD-302.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
<pre class="language-python"><code class="language-python">_<span class="token punctuation">.</span>save<span class="token punctuation">(</span><span class="token string">"my-plot.pdf"</span><span class="token punctuation">)</span></code></pre>
<p>If you don’t specify the <code>width</code> and <code>height</code> they will be set to 6.4
and 4.8 inches, respectively. If you don’t specify <code>filename</code>, plotnine
will generate one for you, e.g., “plotnine-save-297120101.pdf”. For
reproducible code, you’ll want to specify them. You can learn more about
the <code>.save()</code> method in the documentation.</p>
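<p>A reproducible variant might look like the sketch below, which assigns the plot to a variable and passes the file name and dimensions explicitly. The width and height are arbitrary example values.</p>
<pre class="language-python"><code class="language-python">from pathlib import Path

from plotnine import aes, geom_point, ggplot
from plotnine.data import mpg

p = ggplot(mpg, aes("displ", "hwy")) + geom_point()

out = Path("my-plot.pdf")

# Explicit size in inches instead of the defaults;
# verbose=False suppresses the messages about file name and size.
p.save(str(out), width=6, height=3.7, units="in", verbose=False)</code></pre>
<p>Unlike <code>_</code>, the variable <code>p</code> also works outside a notebook, for example in a script.</p>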
<h3><a href="https://r4ds.had.co.nz/graphics-for-communication.html#figure-sizing">28.7.1</a> Figure sizing</h3>
<p>It can be a challenge to get your figure in the right size and shape.
There are four options that control figure sizing: <code>width</code>, <code>height</code>,
<code>units</code>, and <code>dpi</code>.</p>
<p>If you find that you’re having to squint to read the text in your plot,
you need to tweak <code>width</code> and <code>height</code>. If the <code>width</code> is larger than
the size at which the figure is rendered in the final document, the text will be too
small; if <code>width</code> is smaller, the text will be too big. You’ll often
need to do a little experimentation to figure out the right ratio
between the <code>width</code> and the eventual width in your document. To
illustrate the principle, the following three plots have <code>width</code> of 4,
6, and 8 respectively (and a height which is 0.618 times the width,
i.e., the golden ratio):</p>
<figure>
<a href="https://jeroenjanssens.com/img/plotnine-grammar-of-graphics-for-python/plotnine-save-width-4.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.7)" srcset="https://jeroenjanssens.com/img/6j8yTb_sYV-235.webp 235w, https://jeroenjanssens.com/img/6j8yTb_sYV-352.webp 352w, https://jeroenjanssens.com/img/6j8yTb_sYV-470.webp 470w, https://jeroenjanssens.com/img/6j8yTb_sYV-705.webp 705w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.7)" srcset="https://jeroenjanssens.com/img/6j8yTb_sYV-235.webp 235w, https://jeroenjanssens.com/img/6j8yTb_sYV-352.webp 352w, https://jeroenjanssens.com/img/6j8yTb_sYV-470.webp 470w, https://jeroenjanssens.com/img/6j8yTb_sYV-705.webp 705w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.7)" srcset="https://jeroenjanssens.com/img/6j8yTb_sYV-235.jpeg 235w, https://jeroenjanssens.com/img/6j8yTb_sYV-352.jpeg 352w, https://jeroenjanssens.com/img/6j8yTb_sYV-470.jpeg 470w, https://jeroenjanssens.com/img/6j8yTb_sYV-705.jpeg 705w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.7)" srcset="https://jeroenjanssens.com/img/6j8yTb_sYV-235.jpeg 235w, https://jeroenjanssens.com/img/6j8yTb_sYV-352.jpeg 352w, https://jeroenjanssens.com/img/6j8yTb_sYV-470.jpeg 470w, https://jeroenjanssens.com/img/6j8yTb_sYV-705.jpeg 705w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 70%;" src="https://jeroenjanssens.com/img/6j8yTb_sYV-235.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
<figure>
<a href="https://jeroenjanssens.com/img/plotnine-grammar-of-graphics-for-python/plotnine-save-width-6.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.7)" srcset="https://jeroenjanssens.com/img/x8KPX7lAOT-235.webp 235w, https://jeroenjanssens.com/img/x8KPX7lAOT-352.webp 352w, https://jeroenjanssens.com/img/x8KPX7lAOT-470.webp 470w, https://jeroenjanssens.com/img/x8KPX7lAOT-705.webp 705w, https://jeroenjanssens.com/img/x8KPX7lAOT-940.webp 940w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.7)" srcset="https://jeroenjanssens.com/img/x8KPX7lAOT-235.webp 235w, https://jeroenjanssens.com/img/x8KPX7lAOT-352.webp 352w, https://jeroenjanssens.com/img/x8KPX7lAOT-470.webp 470w, https://jeroenjanssens.com/img/x8KPX7lAOT-705.webp 705w, https://jeroenjanssens.com/img/x8KPX7lAOT-940.webp 940w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.7)" srcset="https://jeroenjanssens.com/img/x8KPX7lAOT-235.jpeg 235w, https://jeroenjanssens.com/img/x8KPX7lAOT-352.jpeg 352w, https://jeroenjanssens.com/img/x8KPX7lAOT-470.jpeg 470w, https://jeroenjanssens.com/img/x8KPX7lAOT-705.jpeg 705w, https://jeroenjanssens.com/img/x8KPX7lAOT-940.jpeg 940w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.7)" srcset="https://jeroenjanssens.com/img/x8KPX7lAOT-235.jpeg 235w, https://jeroenjanssens.com/img/x8KPX7lAOT-352.jpeg 352w, https://jeroenjanssens.com/img/x8KPX7lAOT-470.jpeg 470w, https://jeroenjanssens.com/img/x8KPX7lAOT-705.jpeg 705w, https://jeroenjanssens.com/img/x8KPX7lAOT-940.jpeg 940w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 70%;" src="https://jeroenjanssens.com/img/x8KPX7lAOT-235.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
<figure>
<a href="https://jeroenjanssens.com/img/plotnine-grammar-of-graphics-for-python/plotnine-save-width-8.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.7)" srcset="https://jeroenjanssens.com/img/iNS-5E_qPn-235.webp 235w, https://jeroenjanssens.com/img/iNS-5E_qPn-352.webp 352w, https://jeroenjanssens.com/img/iNS-5E_qPn-470.webp 470w, https://jeroenjanssens.com/img/iNS-5E_qPn-705.webp 705w, https://jeroenjanssens.com/img/iNS-5E_qPn-940.webp 940w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.7)" srcset="https://jeroenjanssens.com/img/iNS-5E_qPn-235.webp 235w, https://jeroenjanssens.com/img/iNS-5E_qPn-352.webp 352w, https://jeroenjanssens.com/img/iNS-5E_qPn-470.webp 470w, https://jeroenjanssens.com/img/iNS-5E_qPn-705.webp 705w, https://jeroenjanssens.com/img/iNS-5E_qPn-940.webp 940w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.7)" srcset="https://jeroenjanssens.com/img/iNS-5E_qPn-235.jpeg 235w, https://jeroenjanssens.com/img/iNS-5E_qPn-352.jpeg 352w, https://jeroenjanssens.com/img/iNS-5E_qPn-470.jpeg 470w, https://jeroenjanssens.com/img/iNS-5E_qPn-705.jpeg 705w, https://jeroenjanssens.com/img/iNS-5E_qPn-940.jpeg 940w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.7)" srcset="https://jeroenjanssens.com/img/iNS-5E_qPn-235.jpeg 235w, https://jeroenjanssens.com/img/iNS-5E_qPn-352.jpeg 352w, https://jeroenjanssens.com/img/iNS-5E_qPn-470.jpeg 470w, https://jeroenjanssens.com/img/iNS-5E_qPn-705.jpeg 705w, https://jeroenjanssens.com/img/iNS-5E_qPn-940.jpeg 940w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 70%;" src="https://jeroenjanssens.com/img/iNS-5E_qPn-235.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
<h2><a href="https://r4ds.had.co.nz/graphics-for-communication.html#learning-more-4">28.8</a> Learning more</h2>
<p>The absolute best place to learn more is the ggplot2 book: <a href="https://amzn.com/331924275X"><em>ggplot2:
Elegant graphics for data analysis</em></a>. It
goes into much more depth about the underlying theory, and has many more
examples of how to combine the individual pieces to solve practical
problems. Unfortunately, the book is not available online for free,
although you can find the source code at
<a href="https://github.com/hadley/ggplot2-book">https://github.com/hadley/ggplot2-book</a>.</p>
<hr class="footnotes-sep" />
<section class="footnotes">
<ol class="footnotes-list">
<li id="fn1" class="footnote-item"><p>There have been other attempts at porting ggplot2 to Python, such
as <a href="https://github.com/yhat/ggpy">ggpy</a>, but as far as I know, these
are no longer maintained. <a href="https://jeroenjanssens.com/plotnine/#fnref1" class="footnote-backref">↩︎</a></p>
</li>
<li id="fn2" class="footnote-item"><p>If you ever need to translate ggplot2 to plotnine yourself, check
out my <a href="https://jeroenjanssens.com/blog/heuristics-for-translating-ggplot2-to-plotnine">follow-up post containing
heuristics</a>
for doing so. <a href="https://jeroenjanssens.com/plotnine/#fnref2" class="footnote-backref">↩︎</a></p>
</li>
<li id="fn3" class="footnote-item"><p>It’s important to note that this tutorial is not meant to compare
Python and R. The never-ending flame wars between these two
languages are boring and unproductive. <a href="https://jeroenjanssens.com/plotnine/#fnref3" class="footnote-backref">↩︎</a></p>
</li>
<li id="fn4" class="footnote-item"><p>While it’s generally considered to be bad practice to import
everything into the global namespace, I think it’s fine to do this
in an ad-hoc environment such as a notebook as it makes using the
many functions plotnine provides more convenient. An additional
advantage is that the resulting code more closely resembles the
original ggplot2 code. Alternatively, it’s quite common to
<code>import plotnine as p9</code> and prefix every function with <code>p9.</code>. <a href="https://jeroenjanssens.com/plotnine/#fnref4" class="footnote-backref">↩︎</a></p>
</li>
<li id="fn5" class="footnote-item"><p>This tutorial was compiled with a <a href="https://github.com/jeroenjanssens/plotnine">fork from version
0.6.0</a> that fixes an
<a href="https://github.com/has2k1/plotnine/pull/325">issue</a> related to
using <code>ha</code> and <code>va</code> in <code>aes()</code>. <a href="https://jeroenjanssens.com/plotnine/#fnref5" class="footnote-backref">↩︎</a></p>
</li>
<li id="fn6" class="footnote-item"><p>If you dislike the continuation character <code>\</code> then an alternative
syntax is to wrap the entire expression in parentheses so that it’s
not needed. <a href="https://jeroenjanssens.com/plotnine/#fnref6" class="footnote-backref">↩︎</a></p>
</li>
<li id="fn7" class="footnote-item"><p>The original text uses the <code>class</code> variable, but to demonstrate
the same effect we need to use a variable with more distinct values
because plotnine supports more shapes than ggplot2. <a href="https://jeroenjanssens.com/plotnine/#fnref7" class="footnote-backref">↩︎</a></p>
</li>
<li id="fn8" class="footnote-item"><p>The original text has an additional exercise that contains code
which is semantically wrong on purpose, but in plotnine, the
corresponding code is also syntactically wrong. The reason is that
in plotnine, you can only use column names in the aesthetic mapping
and not literal values, e.g., <code>aes(color="blue")</code>. <a href="https://jeroenjanssens.com/plotnine/#fnref8" class="footnote-backref">↩︎</a></p>
</li>
<li id="fn9" class="footnote-item"><p>ggplot2 also has <code>coord_quickmap()</code> for producing maps with the
correct aspect ratio and <code>coord_polar()</code> for using polar
coordinates. plotnine doesn’t yet have these two functions. <a href="https://jeroenjanssens.com/plotnine/#fnref9" class="footnote-backref">↩︎</a></p>
</li>
<li id="fn10" class="footnote-item"><p>In ggplot2, you can also use <code>labs()</code> to add a subtitle and a
caption. <a href="https://jeroenjanssens.com/plotnine/#fnref10" class="footnote-backref">↩︎</a></p>
</li>
<li id="fn11" class="footnote-item"><p>We have to use <code>geom_point()</code> twice here because of an
<a href="https://github.com/has2k1/plotnine/issues/324">issue</a> with the
adjustText package. <a href="https://jeroenjanssens.com/plotnine/#fnref11" class="footnote-backref">↩︎</a></p>
</li>
<li id="fn12" class="footnote-item"><p>In ggplot2 you can write <code>labels = NULL</code> so you don’t need a
helper function. <a href="https://jeroenjanssens.com/plotnine/#fnref12" class="footnote-backref">↩︎</a></p>
</li>
<li id="fn13" class="footnote-item"><p>The original text discusses how to include your plot in R
Markdown. While it’s possible to include Python code and graphics in
an R Markdown document through the <a href="https://rstudio.github.io/reticulate/"><code>reticulate</code>
package</a>, like this tutorial
demonstrates, it’s beyond the scope of this text. If you’re
interested, you can have a look at the <a href="https://github.com/datascienceworkshops/r4ds-python-plotnine">GitHub
repository</a>
related to this tutorial, which includes the .Rmd source. <a href="https://jeroenjanssens.com/plotnine/#fnref13" class="footnote-backref">↩︎</a></p>
</li>
</ol>
</section>
Convert CSV to Vowpal Wabbit's Input Format2016-03-29T00:00:00Zhttps://jeroenjanssens.com/csv2vw/<p>I’ve created a Python script called <code>csv2vw</code> which, as the name implies,
converts CSV data to <a href="https://github.com/VowpalWabbit/vowpal_wabbit/wiki/Input-format">Vowpal Wabbit’s input
format</a>.
<code>csv2vw</code> is available on GitHub in my <a href="https://github.com/jeroenjanssens/dsutils">dsutils
repository</a>.</p>
<figure>
<a href="https://jeroenjanssens.com/img/csv2vw.jpg">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/8EOyQ3XgVA-302.webp 302w, https://jeroenjanssens.com/img/8EOyQ3XgVA-453.webp 453w, https://jeroenjanssens.com/img/8EOyQ3XgVA-604.webp 604w, https://jeroenjanssens.com/img/8EOyQ3XgVA-907.webp 907w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/8EOyQ3XgVA-302.webp 302w, https://jeroenjanssens.com/img/8EOyQ3XgVA-453.webp 453w, https://jeroenjanssens.com/img/8EOyQ3XgVA-604.webp 604w, https://jeroenjanssens.com/img/8EOyQ3XgVA-907.webp 907w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/8EOyQ3XgVA-302.jpeg 302w, https://jeroenjanssens.com/img/8EOyQ3XgVA-453.jpeg 453w, https://jeroenjanssens.com/img/8EOyQ3XgVA-604.jpeg 604w, https://jeroenjanssens.com/img/8EOyQ3XgVA-907.jpeg 907w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/8EOyQ3XgVA-302.jpeg 302w, https://jeroenjanssens.com/img/8EOyQ3XgVA-453.jpeg 453w, https://jeroenjanssens.com/img/8EOyQ3XgVA-604.jpeg 604w, https://jeroenjanssens.com/img/8EOyQ3XgVA-907.jpeg 907w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 90%;" src="https://jeroenjanssens.com/img/8EOyQ3XgVA-302.jpeg" alt="A screenshot of csv2vw applied to the Iris dataset" loading="lazy" />
</picture></a>
<figcaption>A screenshot of csv2vw applied to the Iris dataset</figcaption>
</figure>
<p>Here are some examples to give you an idea of what it can do:</p>
<h4>Leave label values as is:</h4>
<pre class="language-shell-session"><code class="language-shell-session"><span class="token command"><span class="token shell-symbol important">$</span> <span class="token bash language-bash">csv2vw spam.csv <span class="token parameter variable">--label</span> target</span></span></code></pre>
<h4>Relabel values ‘ham’ to 0 and ‘spam’ to 1:</h4>
<pre class="language-shell-session"><code class="language-shell-session"><span class="token command"><span class="token shell-symbol important">$</span> <span class="token bash language-bash">csv2vw spam.csv <span class="token parameter variable">--label</span> target <span class="token parameter variable">--classes</span> ham,spam</span></span></code></pre>
<h4>Relabel values ‘ham’ to -1 and ‘spam’ to +1 (needed for logistic loss):</h4>
<pre class="language-shell-session"><code class="language-shell-session"><span class="token command"><span class="token shell-symbol important">$</span> <span class="token bash language-bash">csv2vw spam.csv <span class="token parameter variable">--label</span> target <span class="token parameter variable">--classes</span> ham,spam --minus-plus-one</span></span></code></pre>
<h4>Relabel first label value to 0, second to 1, and ignore the rest:</h4>
<pre class="language-shell-session"><code class="language-shell-session"><span class="token command"><span class="token shell-symbol important">$</span> <span class="token bash language-bash">csv2vw iris.csv <span class="token parameter variable">-lspecies</span> --auto-relabel --ignore-extra-classes</span></span></code></pre>
<h4>Relabel first label value to 1, second to 2, and so on:</h4>
<pre class="language-shell-session"><code class="language-shell-session"><span class="token command"><span class="token shell-symbol important">$</span> <span class="token bash language-bash"><span class="token operator"><</span> iris.csv csv2vw <span class="token parameter variable">-lspecies</span> <span class="token parameter variable">--multiclass</span> --auto-relabel</span></span></code></pre>
<h4>Relabel ‘versicolor’ to 1, ‘virginica’ to 2, and ‘setosa’ to 3:</h4>
<pre class="language-shell-session"><code class="language-shell-session"><span class="token command"><span class="token shell-symbol important">$</span> <span class="token bash language-bash"><span class="token operator"><</span> iris.csv csv2vw <span class="token parameter variable">-lspecies</span> <span class="token parameter variable">--multiclass</span> -cversicolor,virginica,setosa</span></span></code></pre>
<p>Note that <code>csv2vw</code> does not support namespaces.</p>
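<p>To make the relabelling options above more concrete, here is a minimal Python sketch of the kind of conversion involved. It is not the actual implementation of <code>csv2vw</code>: the names <code>row_to_vw</code> and <code>csv_to_vw</code> and their parameters are hypothetical, and the sketch assumes that all non-label columns are numeric and that column names contain no spaces, colons, or pipes. It follows the basic Vowpal Wabbit line layout of a label, a pipe, and <code>name:value</code> features.</p>

```python
import csv
import io


def row_to_vw(row, label, classes=None, minus_plus_one=False, multiclass=False):
    """Convert one CSV row (a dict) to a single Vowpal Wabbit input line.

    Hypothetical helper for illustration; not part of csv2vw.
    """
    value = row[label]
    if classes is not None:
        index = classes.index(value)  # raises ValueError for unknown classes
        if minus_plus_one:
            # Binary labels -1/+1 (first class is -1, any other class is +1)
            value = "-1" if index == 0 else "+1"
        elif multiclass:
            # Multiclass labels start at 1
            value = str(index + 1)
        else:
            # Default relabelling: 0, 1, ...
            value = str(index)
    # Every column except the label becomes a name:value feature
    features = " ".join(f"{k}:{v}" for k, v in row.items() if k != label)
    return f"{value} | {features}"


def csv_to_vw(text, label, **options):
    """Convert CSV text with a header row to a list of VW lines."""
    reader = csv.DictReader(io.StringIO(text))
    return [row_to_vw(row, label, **options) for row in reader]


# Usage: a two-row excerpt in the style of the Iris data set
data = "sepal_length,sepal_width,species\n5.1,3.5,setosa\n7.0,3.2,versicolor\n"
for line in csv_to_vw(
    data, "species", classes=["versicolor", "virginica", "setosa"], multiclass=True
):
    print(line)
```

<p>Keeping the relabelling separate from the feature formatting means each option only changes how the label is written, which mirrors how the command-line flags above are combined.</p>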
<p>— Jeroen</p>
Anomalies, Concerts, and the Command Line2015-05-18T00:00:00Zhttps://jeroenjanssens.com/datascienceweekly/<p><em>This interview first appeared on
<a href="http://www.datascienceweekly.org/data-scientist-interviews/data-science-at-the-command-line-jeroen-janssens-interview">Data
Science Weekly</a></em> in October 2014.</p>
<p>
We recently caught up with Jeroen Janssens, author of <a href="http://datascienceatthecommandline.com/">Data Science at the Command Line</a>. We were keen to learn more about his background, his recent work at YPlan and his work creating both the book and the (related) <a href="http://datasciencetoolbox.org/">Data Science Toolbox</a> project…
</p>
<p>
<strong>Hi Jeroen, firstly thank you for the interview. Let's start with your background and how you became interested in working with data...</strong>
</p>
<p>
<strong>Q</strong> - What is your 30 second bio?
<br />
<strong>A</strong> - Howdy! My name is <a href="https://twitter.com/jeroenhjanssens">Jeroen</a> and I'm a data scientist. At least I like to think that I am. As a Brooklynite who tries to turn dirty data into pretty plots and meaningful models using a MacBook, I do believe I match at least one of the many definitions of data scientist. Jokes aside, the first time I was given the title of data scientist was in January 2012, when I joined Visual Revenue in New York City. At the time, I was still finishing my Ph.D. in Machine Learning at Tilburg University in the Netherlands. In March 2013, Visual Revenue got acquired by Outbrain, where I stayed for eight months. The third and final startup in New York City where I was allowed to call myself data scientist was YPlan. And now, after a year of developing a recommendation engine for last-minute concerts, sporting events, and wine tastings, I'm excited to tell you that I'll soon be moving back to the Netherlands.
</p>
<p>
<strong>Q</strong> - How did you get interested in working with data?
<br />
<strong>A</strong> - During my undergraduate studies at University College Maastricht, which is a liberal arts college in the Netherlands, I took a course in Machine Learning. The idea of teaching computers by feeding them data fascinated me. Once I graduated, I wanted to learn more about this exciting field, so I continued with an M.Sc. in Artificial Intelligence at Maastricht University, which has a strong focus on Machine Learning.
<br />
</p>
<p>
<strong>Q</strong> - So, what was the first data set you remember working with? What did you do with it?
<br />
<strong>A</strong> - The very first data set was actually one that I created myself, albeit in quite a naughty way. In high school--I must have been fifteen--I managed to create a program in Visual Basic that imitated the lab computers' login screen. When a student tried to log in, an error message would pop up and the username and password would be saved to a file. So, by the end of the day, I had a "data set" of dozens of username/password combinations. Don't worry, I didn't use that data at all; this whole thing was really about the challenge of fooling fellow students. Of course I couldn't keep my mouth shut about this feat, which quickly led to the punishment I deserved: vacuum cleaning all the classrooms for a month. Yes, I'll never forget that data set.
<br />
</p>
<p>
<strong>Q</strong> - I can imagine! Maybe it was that moment, but was there a specific "aha" moment when you realised the power of data?
<br />
<strong>A</strong> - Towards the end of my Ph.D., which focused on anomaly detection, I was looking into meta learning for one-class classifiers. In other words, I wanted to know whether it was possible to predict which one-class classifier would perform best on a new, unseen data set. Besides that, I also wanted to know which characteristics of that data set would be most important.
<br />
<br />
To achieve this, I constructed a so-called meta data set, whose 36 features were characteristics of 255 "regular" data sets (for example, number of data points, dimensionality). I evaluated 19 different one-class classifiers on those 255 data sets. The challenge was then to train a meta classifier on that meta data set, with 19 AUC performance values as the labels.
<br />
<br />
Long story short, because I tried to do too many things at once, I ended up with way too much data to examine. For weeks, I was getting lost in my own data. Eventually I managed to succeed. The lesson I learned was that there's also such a thing as too much data; not in the sense of space, but in density, if that makes sense. And more importantly, I also learned to think harder before simply starting a huge computational experiment!
<br />
</p>
<p>
<br />
<strong>Makes sense! Thanks for sharing all that background. Let's switch gears and talk about this past year, where you've been the Senior Data Scientist at YPlan...</strong>
<br />
</p>
<p>
<strong>Q</strong> - Firstly, what is YPlan? How would you describe it to someone not familiar with it?
<br />
<strong>A</strong> - Here's the pitch I've been using for the past year. YPlan is for people who want to go out either tonight or tomorrow, but don't yet know what to do. It's an app for your iPhone or Android phone that shows you a curated list of last-minute events: anything ranging from Broadway shows to bottomless brunches in Brooklyn. If you see something you like you can book it in two taps. You don't need to go to a different website, fill out a form, and print out the tickets. Instead, you just show your phone at the door and have a great time!
<br />
</p>
<p>
<strong>Q</strong> - That's great! What do you find most exciting about working at the intersection of Data Science and entertainment?
<br />
<strong>A</strong> - YPlan is essentially a marketplace between people and events. It's interesting to tinker with our data because a lot of it comes from people (which events do they look at and which one do they eventually book?). Plus, it's motivating trying to solve a (luxury) problem you have yourself, and then to get feedback from your customers. Another reason why YPlan was so great to work at was that everybody had the same goal: making sure that our customers would find the perfect event and have a great time. You can improve on your recommendation system as much as you want (which I tried to do), but without great content and great customer support, you won't achieve this goal. I guess what I'm trying to say is that the best thing about YPlan was my colleagues, and that's what made it exciting.
<br />
</p>
<p>
<strong>Q</strong> - So what have you been working on this year? What has been the most surprising insight you've found?
<br />
<strong>A</strong> - At YPlan I've mostly been working on a content-based recommendation system, where the goal is essentially to predict the probability that a customer would book a certain event. The reason the recommendation system is a content-based one rather than a collaborative one is that our events have a very short shelf life, which is very different from, say, the movies available on Netflix.
<br />
<br />
We've also created a backtesting system, which allows us to quickly evaluate the performance of the recommendation system on historical data whenever we make a change. Of course, such an evaluation does not give a definitive answer, so we always A/B test a new version against the current version. Still, being able to quickly make changes and evaluate them has proved to be very useful.
<br />
<br />
The most surprising insight is, I think, how wrong our instincts and assumptions can be. A recommendation system, or any machine learning algorithm in production for that matter, is not just the math you would find in textbooks. As soon as you apply it to the real world, a lot of (hidden) assumptions will be made. For example, the initial feature weighting I came up with has recently been greatly improved using an Evolutionary Algorithm on top of the backtesting system.
<br />
</p>
<p>
<br />
<strong>Thanks for sharing all that detail - very interesting! Let's switch gears and talk about the book you've been working on that came out recently...</strong>
<br />
</p>
<p>
<strong>Q</strong> - You just finished writing a book titled <a href="http://datascienceatthecommandline.com/">Data Science at the Command Line</a>. What does the book cover?
<br />
<strong>A</strong> - Well, the main goal of the book is to teach why, how, and when the command line could be employed for data science. The book starts with explaining what the command line is and why it's such a powerful approach for working with data. At the end of the first chapter, we demonstrate the flexibility of the command line through an amusing example where we use The New York Times' API to infer when New York Fashion Week is happening. Then, after an introduction to the most important Unix concepts and tools, we demonstrate how to obtain data from sources such as relational databases, APIs, and Excel. Obtaining data is actually the first step of the OSEMN model, which is a very practical definition of data science by Hilary Mason and Chris Wiggins that forms the backbone of the book. The steps scrubbing, exploring, and modelling data are also covered in separate chapters. For the final step, interpreting data, a computer is of little use, let alone the command line. Besides those step chapters we also cover more general topics such as parallelising pipelines and managing data workflows.
</p>
<p>
<strong>Q</strong> - Who is the book best suited for?
<br />
<strong>A</strong> - I'd say everybody who has an affinity with data! The command line can be intimidating at first (it was for me, at least), so I made sure the book makes very few assumptions. I created a virtual machine that contains all the necessary software and data, so it doesn't matter whether readers are on Windows, OS X, or Linux. Some programming experience helps, because in Chapter 4 we look at how to create reusable command-line tools from existing Python and R code.
</p>
<p>
<strong>Q</strong> - What can readers hope to learn?
<br />
<strong>A</strong> - The goal of the book is to make the reader a more efficient and productive data scientist. It may surprise people that quite a few data science tasks, especially those related to obtaining and scrubbing, can be done much quicker on the command line than in a programming language. Of course, the command line has its limits, which means that you'd need to resort to a different approach. I don't use the command line for everything myself. It all depends on the task at hand whether I use the command line, IPython notebook, R, Go, D3 & CoffeeScript, or simply pen & paper. Knowing when to use which approach is important, and I'm convinced that there's a place for the command line.
<br />
<br />
One advantage of the command line is that it can easily be integrated with your existing data science workflow. On the one hand, you can often employ the command line from your own environment. IPython and R, for instance, allow you to run command-line tools and capture their output. On the other hand, you can turn your existing code into a reusable command-line tool. I'm convinced that being able to build up your own set of tools can make you a more efficient and productive data scientist.
</p>
<p>
<strong>Q</strong> - What has been your favorite part of writing the book?
<br />
<strong>A</strong> - Because the book discusses more than 80 command-line tools, many of which have very particular installation instructions, it would take the reader the better part of a day to get all set up. To prevent that, I wanted to create a virtual machine that would contain all the tools and data pre-installed, much like Matthew Russell had done for his book <a href="http://miningthesocialweb.com/">Mining the Social Web</a>. I figured that many authors would want to do something like that for their readers. The same holds for teachers and workshop instructors. They want their students up and running as quickly as possible. So, while I was writing my book, I started a project called the Data Science Toolbox, which was, and continues to be, a very interesting and educational experience.
</p>
<p>
<strong>Q</strong> - Got it! Let's talk more about the <a href="http://datasciencetoolbox.org/">Data Science Toolbox</a>. What is your objective for this project?
<br />
<strong>A</strong> - On the one hand, the goal of the Data Science Toolbox is to enable everybody to get started doing data science quickly. The base version contains both R and the Python scientific stack, currently the two most popular environments to do data science. (I still find it amazing that you can download a complete operating system with this software and have it up and running in a matter of minutes.) On the other hand, authors and teachers should be able to easily create custom software and data bundles for their readers and students. It's a shame to waste time on getting all the required software and data installed. When everybody's running the Data Science Toolbox, you know that you all have exactly the same environment and you can get straight to the good stuff: doing data science.
</p>
<p>
<strong>Q</strong> - What have you developed so far? And what is coming soon?
<br />
<strong>A</strong> - Because the Data Science Toolbox stands on the shoulders of many giants (Ubuntu, Vagrant, VirtualBox, Ansible, Packer, and Amazon Web Services), not too much needed to be developed, honestly. Most work went into combining these technologies, creating a command-line tool for installing bundles, and making sure the Vagrant box and AWS AMIs stay up-to-date.
The success of the Data Science Toolbox is going to depend much more on the quantity and quality of bundles. In that sense it's really a community effort. Currently, there are a handful of bundles available. The most recent bundle is by Rob Doherty for his <a href="https://generalassemb.ly/education/data-science">Introduction to Data Science class at General Assembly</a> in New York. There are a few interesting collaborations going on at the moment, which should result in more bundles soon.
</p>
<p>
<br />
<strong>Thanks for sharing all the projects you've been working on - super interesting! Good luck with all your ongoing endeavours! Finally, let's talk a bit about the future and share some advice...</strong>
</p>
<p>
<strong>Q</strong> - What does the future of Data Science look like?
<br />
<strong>A</strong> - For me, and I hope for many others, data science will have a dark background and bright fixed-width characters. Seriously, the command line has been around for four decades and isn't going anywhere soon. Two concepts that make the command line so powerful are: working with streams of data and chaining computational blocks. Because the amount of data, and the demand to quickly extract value from it, will only increase, so will the importance of these two concepts. For example, R has only recently gained support for piping functions, thanks to magrittr and dplyr. Also, streamtools, a very promising project from the New York Times R&D lab, embeds these two concepts.
</p>
<p>
<strong>Q</strong> - One last question, you said you're going back to the Netherlands? What are your plans?
<br />
<strong>A</strong> - That's right, back to the land of tulips, windmills, bikes, hagelslag, and hopefully, some data science! About three years ago, when I was convincing my wife to come with me to New York City, the role of data scientist practically didn't exist in the Netherlands. While it still doesn't come close to, say, London, San Francisco, or New York City, it's good to see that it's catching up. More and more startups are looking for data scientists. Also, as far as I'm aware, three data science research centres have been formed: one in Amsterdam, one in Leiden, and one in Eindhoven. These developments open up many possibilities. Joining a startup, forming a startup, teaching a class, consulting, training, research; I'm currently considering many things. Exciting times ahead!
<br />
</p>
<p>
<br />
<strong>Jeroen</strong> - Thank you ever so much for your time! Really enjoyed learning more about your background, your work at YPlan and both <a href="http://datascienceatthecommandline.com/">your book</a> and toolbox projects. Good luck with the move home!
<br />
</p>
<p>
Readers, thanks for joining us! If you want to read more from Jeroen, he can be found on Twitter <a href="https://twitter.com/jeroenhjanssens">@jeroenhjanssens</a>.
</p>
IBash Notebook2015-02-19T00:00:00Zhttps://jeroenjanssens.com/ibash/<p>Did you know that there’s a Bash kernel for Jupyter Notebook? It even
displays inline images. To give you a glimpse, the code cell below makes
an API call to <a href="http://memegenerator.net/">memegenerator.net</a>, which
generates images on demand. From the response, the URL of the generated
image is extracted using <code>jq</code> and subsequently downloaded using <code>curl</code>.
The output is then displayed as an inline image by piping it to a
function called <code>display</code>. Perhaps a bit contrived, but if not with a
meme, how else am I supposed to grab your attention these days?</p>
<figure>
<a href="https://jeroenjanssens.com/img/ibash-notebook.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 1)" srcset="https://jeroenjanssens.com/img/mgBGlW819D-336.webp 336w, https://jeroenjanssens.com/img/mgBGlW819D-504.webp 504w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 1)" srcset="https://jeroenjanssens.com/img/mgBGlW819D-336.webp 336w, https://jeroenjanssens.com/img/mgBGlW819D-504.webp 504w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 1)" srcset="https://jeroenjanssens.com/img/mgBGlW819D-336.jpeg 336w, https://jeroenjanssens.com/img/mgBGlW819D-504.jpeg 504w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 1)" srcset="https://jeroenjanssens.com/img/mgBGlW819D-336.jpeg 336w, https://jeroenjanssens.com/img/mgBGlW819D-504.jpeg 504w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 100%;" src="https://jeroenjanssens.com/img/mgBGlW819D-336.jpeg" alt="Professor Farnsworth is very excited about inline images" loading="lazy" />
</picture></a>
<figcaption>Professor Farnsworth is very excited about inline images</figcaption>
</figure>
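<p>For readers who think in Python rather than Bash, the step that <code>jq</code> performs in the cell above can be sketched as follows. This is only an illustration under assumptions: the response body is a made-up stand-in, and the field names <code>result</code> and <code>instanceImageUrl</code> are assumed rather than taken from the memegenerator.net documentation. Downloading and displaying the image are left out so that the example runs offline.</p>

```python
import json

# A made-up response body; the real API response may look different.
response_text = """
{
  "success": true,
  "result": {
    "instanceImageUrl": "http://example.com/instances/12345.jpg"
  }
}
"""


def extract_image_url(text):
    """Return the image URL from a JSON response.

    Comparable to a jq filter such as: jq -r '.result.instanceImageUrl'
    (the field names are assumptions, see above).
    """
    data = json.loads(text)
    return data["result"]["instanceImageUrl"]


print(extract_image_url(response_text))
```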
<p>In this post, I first give some background on notebooks and the IPython
Notebook/Jupyter project. Then, I explore whether this “IBash
Notebook” has the potential to become a convenient environment for doing
data science. Subsequently, I explain how I added support for displaying
inline images. As an aside, I wonder whether it would be feasible and
worthwhile to publish my book <a href="http://datascienceatthecommandline.com/">Data Science at the Command
Line</a> as a collection of
notebooks. Finally, I discuss which issues remain to be improved and how
you can try out IBash Notebook for yourself. I’m curious to hear what
you think.</p>
<h2>You get a notebook. And you get a notebook. Everybody gets a notebook!</h2>
<p>Let’s take a step back for a moment. Doing research is hard. Recalling
which steps you’ve taken, and why, is even harder. To be an effective
researcher, you may want to <a href="http://colinpurrington.com/tips/academic/labnotebooks">keep a laboratory
notebook</a>.
Besides having a record of your steps and results, this also allows you
to improve reproducibility, share your research with others, and, yes,
think more clearly. So, why wouldn’t you keep a notebook?</p>
<p>Well, if you perform your research or analysis on a computer, where most
steps boil down to running code, invoking commands, and clicking
buttons, keeping an analogue notebook is rather cumbersome. Fortunately,
digital counterparts have recently been gaining popularity quickly. For
the R community, for example, there’s <a href="http://rmarkdown.rstudio.com/">R
Markdown</a>. And for those who use the
Python scientific stack, there’s <a href="http://ipython.org/notebook.html">IPython
Notebook</a>. Both solutions are free and
allow you to combine code, text, equations, and visualisations into a
single document.</p>
<p>The people behind the IPython project saw the potential of having a
language-agnostic architecture. By creating a flexible messaging
protocol, writing <a href="http://ipython.org/ipython-doc/dev/development/kernels.html">good
documentation</a>
for it, and rebranding the project as the <a href="http://jupyter.org/">Jupyter
project</a>, they opened the door to other languages.
And now, languages like Julia, Ruby, and Haskell have their own kernel.
<a href="http://beakernotebook.com/">Beaker</a>, a completely different project,
even supports multiple languages in the same notebook.</p>
<h2>What about poor old Bash?</h2>
<p>To demonstrate how easy it is to create a new kernel for IPython
Notebook, <a href="https://twitter.com/takluyver/">Thomas Kluyver</a> created a
Python package called
<a href="https://github.com/takluyver/bash_kernel">bash_kernel</a>. This Bash
kernel basically works by using
<a href="https://pexpect.readthedocs.org/en/latest/">pexpect</a> to wrap around a
Bash command line. When I stumbled upon this package I immediately got
excited. This could be much more than just a demonstration. Call me
crazy, but I believe that with some additional effort, we might have an
IBash Notebook that would have some important advantages over a terminal
(which is the standard environment to interact with the command line;
see image below).</p>
<figure>
<a href="https://jeroenjanssens.com/img/terminal.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/0Ad51JygVe-302.webp 302w, https://jeroenjanssens.com/img/0Ad51JygVe-453.webp 453w, https://jeroenjanssens.com/img/0Ad51JygVe-604.webp 604w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/0Ad51JygVe-302.webp 302w, https://jeroenjanssens.com/img/0Ad51JygVe-453.webp 453w, https://jeroenjanssens.com/img/0Ad51JygVe-604.webp 604w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/0Ad51JygVe-302.jpeg 302w, https://jeroenjanssens.com/img/0Ad51JygVe-453.jpeg 453w, https://jeroenjanssens.com/img/0Ad51JygVe-604.jpeg 604w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/0Ad51JygVe-302.jpeg 302w, https://jeroenjanssens.com/img/0Ad51JygVe-453.jpeg 453w, https://jeroenjanssens.com/img/0Ad51JygVe-604.jpeg 604w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 90%;" src="https://jeroenjanssens.com/img/0Ad51JygVe-302.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
<p>First, and perhaps most importantly, the command line is ad-hoc in
nature, which makes it difficult to reproduce your steps or share them
with your peers. To improve reproducibility, you could put those steps
in a shell script, <a href="http://www.gnu.org/software/make/">Makefile</a>, or
<a href="https://github.com/Factual/drake">Drakefile</a>, but when you’re working
in a notebook, they would be stored automatically.</p>
<p>Second, if you’re running a server or virtual machine, there would be no
need to <code>ssh</code> into it. As a result, Microsoft Windows users wouldn’t
need to resort to a third-party tool like
<a href="http://www.chiark.greenend.org.uk/~sgtatham/putty/">PuTTY</a> anymore. I’m
particularly interested in this advantage, together with the next two,
because ever since I started writing <a href="http://datascienceatthecommandline.com/">Data Science at the Command
Line</a>, I’ve been looking for
ways to make the command line more accessible to newcomers.</p>
<p>Third, for users who are new to the command line, a notebook with code
cells could be less intimidating than a terminal with a prompt. Because
the browser (and perhaps also IPython Notebook) is a familiar
environment, the threshold to try out the command line will be lower.</p>
<p>Fourth, in order to view an image located on a server or virtual
machine, you normally have to jump through an extra hoop. The approaches that I
know of are to (1) copy the image to the host OS, (2) forward X11,
or (3) serve it using, say, <code>python -m SimpleHTTPServer</code> and then open
it in a browser. With a notebook, images can be shown inline. Which
brings us to…</p>
<h2>Adding support for displaying inline images</h2>
<p>For the Bash kernel to be a convenient environment for doing data
science, it could use a few additional features besides running
commands. Thanks to the architecture of IPython Notebook, inline
Markdown and LaTeX equations work out of the box. Having seen <a href="http://liftoffsoftware.com/Products/GateOne">Gate
One</a> (a browser-based
terminal that I had running on 200 EC2 instances for <a href="http://strataconf.com/stratany2014/public/schedule/detail/36204">my workshop at
Strata
NYC</a>)
and <a href="https://pigshell.com/">pigshell</a> (a shell-like website that lets you
interact with various APIs as Unix files), which are both able to
display inline images, I knew that’s what the Bash kernel needed next.</p>
<p>I initially thought this would be as easy as detecting the MIME type of
the output of a command. That way, when you would run <code>cat file.png</code>, an
image would be shown automatically. Unfortunately this approach didn’t
work because, as I later learned, <code>pexpect</code> isn’t meant to transfer
binary data. With some suggestions from <a href="https://twitter.com/takluyver/">Thomas
Kluyver</a>, I implemented the following
solution instead. (You may decide whether it’s a hack or not.)</p>
<p>The solution includes a Bash function called <code>display</code> that is
registered when the kernel starts. That way, images can now be displayed
by running something as simple as:</p>
<pre class="language-bash"><code class="language-bash">display <span class="token operator"><</span> file.png</code></pre>
<p>or something as involved as:</p>
<pre class="language-bash"><code class="language-bash"><span class="token function">cat</span> iris.csv <span class="token operator">|</span> <span class="token comment"># Read our beloved Iris data set</span><br />cols <span class="token parameter variable">-C</span> species body tapkee <span class="token parameter variable">-m</span> pca <span class="token operator">|</span> <span class="token comment"># Apply PCA using tapkee</span><br />header <span class="token parameter variable">-r</span> x,y,species <span class="token operator">|</span> <span class="token comment"># Replace header of CSV</span><br />Rio-scatter x y species <span class="token operator">|</span> display <span class="token comment"># Create scatter plot using ggplot2</span></code></pre>
<p>which produces:</p>
<figure>
<a href="https://jeroenjanssens.com/img/iris-pca.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.8)" srcset="https://jeroenjanssens.com/img/eRbqxF35AK-268.webp 268w, https://jeroenjanssens.com/img/eRbqxF35AK-403.webp 403w, https://jeroenjanssens.com/img/eRbqxF35AK-537.webp 537w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.8)" srcset="https://jeroenjanssens.com/img/eRbqxF35AK-268.webp 268w, https://jeroenjanssens.com/img/eRbqxF35AK-403.webp 403w, https://jeroenjanssens.com/img/eRbqxF35AK-537.webp 537w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.8)" srcset="https://jeroenjanssens.com/img/eRbqxF35AK-268.jpeg 268w, https://jeroenjanssens.com/img/eRbqxF35AK-403.jpeg 403w, https://jeroenjanssens.com/img/eRbqxF35AK-537.jpeg 537w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.8)" srcset="https://jeroenjanssens.com/img/eRbqxF35AK-268.jpeg 268w, https://jeroenjanssens.com/img/eRbqxF35AK-403.jpeg 403w, https://jeroenjanssens.com/img/eRbqxF35AK-537.jpeg 537w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 80%;" src="https://jeroenjanssens.com/img/eRbqxF35AK-268.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
<p>In case you’re interested, <code>cols</code> and <code>body</code> are used to only pass
numerical columns and no header to <a href="http://tapkee.lisitsyn.me/">tapkee</a>,
which is a fantastic library for dimensionality reduction by <a href="https://twitter.com/qdrgsm/">Sergey
Lisitsyn</a>. These two Bash scripts, together
with <code>header</code> and <code>Rio-scatter</code>, can be found in <a href="https://github.com/jeroenjanssens/data-science-at-the-command-line/tree/master/tools">this
repository</a>.
Speaking of command-line tools for plotting,
<a href="http://bokeh.pydata.org/">Bokeh</a>, which is a Python library for
interactive visualization in the browser, will soon have its own command-line
tool as well.</p>
<p>To see what the <code>display</code> function looks like, we can run <code>type display</code>
in a notebook:</p>
<pre class="language-bash"><code class="language-bash">display is a <span class="token keyword">function</span><br /><span class="token function-name function">display</span> <span class="token punctuation">(</span><span class="token punctuation">)</span><br /><span class="token punctuation">{</span><br /> <span class="token assign-left variable">TMPFILE</span><span class="token operator">=</span><span class="token variable"><span class="token variable">$(</span>mktemp $<span class="token punctuation">{</span>TMPDIR-/tmp<span class="token punctuation">}</span>/bash_kernel.XXXXXXXXXX<span class="token variable">)</span></span><span class="token punctuation">;</span><br /> <span class="token function">cat</span> <span class="token operator">></span> <span class="token variable">$TMPFILE</span><span class="token punctuation">;</span><br /> <span class="token builtin class-name">echo</span> <span class="token string">"bash_kernel: saved image data to: <span class="token variable">$TMPFILE</span>"</span> <span class="token operator"><span class="token file-descriptor important">1</span>></span><span class="token file-descriptor important">&2</span><br /><span class="token punctuation">}</span></code></pre>
<p>In words, <code>display</code> saves the standard input to a temporary file and
prints the filename to standard error. After a code cell has been
evaluated, the Bash kernel simply extracts the filename from the output,
detects its MIME type using the
<a href="https://docs.python.org/3.4/library/imghdr.html">imghdr</a> library, and
sends the image data (encoded with base64) to the front end. Easy peasy.</p>
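<p>The kernel side of this can be sketched roughly as follows. This is a
minimal illustration, not the actual code of the Bash kernel: the names
<code>extract_images</code> and <code>sniff_mime_type</code> are made up,
and a small table of file signatures stands in for <code>imghdr</code>
(which has been removed from recent Python versions).</p>

```python
import base64
import re

# Message format printed by the display function shown above.
PATTERN = re.compile(r"bash_kernel: saved image data to: (\S+)")

# File signatures for a few common image formats (a stand-in for imghdr).
SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "image/png",
    b"\xff\xd8\xff": "image/jpeg",
    b"GIF87a": "image/gif",
    b"GIF89a": "image/gif",
}


def sniff_mime_type(data):
    """Return a MIME type based on the first bytes, or None if unknown."""
    for signature, mime_type in SIGNATURES.items():
        if data.startswith(signature):
            return mime_type
    return None


def extract_images(output):
    """Split the output of a cell into plain text and image payloads."""
    images = []
    for path in PATTERN.findall(output):
        with open(path, "rb") as f:
            data = f.read()
        mime_type = sniff_mime_type(data)
        if mime_type is not None:
            # base64 turns the binary data into text that fits in a message.
            images.append({mime_type: base64.b64encode(data).decode("ascii")})
    # Remove the marker lines so that only the regular output remains.
    text = PATTERN.sub("", output)
    return text, images
```

<p>Each dictionary in <code>images</code> maps a MIME type to base64 text,
which is roughly the shape of data a notebook front end can render inline.</p>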
<p>I chose the name “display” because there’s also a <a href="http://www.imagemagick.org/script/display.php">command-line tool in
ImageMagick</a> called
“display” that accepts image data from standard input and shows it in a
new window. Because that tool works only when X is running, I figured
that a function called “display” could serve as a drop-in replacement
when using IPython Notebook.</p>
<h2>Aside: Publishing a book as a collection of notebooks</h2>
<p>IPython Notebook can also be used to write entire books. <a href="https://github.com/ptwobrussell/Mining-the-Social-Web-2nd-Edition">Mining the
Social
Web</a>,
<a href="https://github.com/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers">Probabilistic Programming and Bayesian Methods for
Hackers</a>,
and <a href="http://nbviewer.ipython.org/github/unpingco/Python-for-Signal-Processing/tree/master/">Python for Signal
Processing</a>
are but a few examples of books that have been published as a collection
of notebooks (usually one notebook per chapter). The main advantage of a
notebook as opposed to a book is that you can immediately run the code
yourself. Instead of passively reading about a certain package or tool,
you can actively try it out.</p>
<p>I wonder if I could (and should) do the same with my book <a href="http://datascienceatthecommandline.com/">Data Science
at the Command Line</a>. As an
initial test, I manually converted part of the first chapter to a
notebook, which you can <a href="http://nbviewer.ipython.org/github/jeroenjanssens/jeroenjanssens.github.io/blob/master/Data%20Science%20at%20the%20Command%20Line%20-%20When%20is%20Fashion%20Week%20in%20New%20York%3F.ipynb">view on
nbviewer</a>.
Converting the book’s source code wouldn’t be too difficult, especially
if we were to convert it to Markdown and use
<a href="https://github.com/rossant/ipymd">ipymd</a>. What would be more
challenging are packaging and distribution.</p>
<p>The book introduces over 80 command-line tools, and installing them
manually would take the better part of a day. I do offer a virtual
machine based on Vagrant and VirtualBox that has everything installed,
but I suspect there’s a better way to package this with IBash Notebook.
For example, recently, <a href="https://twitter.com/twiecki">Thomas Wiecki</a>
created a Docker container that launches an IPython notebook server with
the <a href="https://registry.hub.docker.com/u/twiecki/pydata-docker-jupyterhub/">PyData
stack</a>
installed. And the <a href="https://github.com/jupyter/tmpnb">tmpnb</a> project
seems very promising as well. I must admit that I haven’t had time to
look into Docker and these two projects at all.</p>
<p>Distribution is then something I would need to figure out with my
publisher O’Reilly Media. Considering the recent efforts for the book
<a href="https://github.com/ceteri/jem-docker">Just Enough Math</a> by <a href="https://twitter.com/odewahn">Andrew
Odewahn</a>, O’Reilly’s CTO, and their forward
thinking regarding publishing in general, I foresee many opportunities.</p>
<h2>What’s next?</h2>
<p>Having inline Markdown, equations, and images sure is nice. However, in
my opinion, the Bash kernel currently has two issues that hamper
usability. First, the output is only printed when the command is
finished; there are no real-time updates. This is especially
inconvenient if you want to keep an eye on some long-running process
using, say, <code>tail -f</code> or <code>htop</code>. Second, it’s not possible to interact
with a running process. This means that you cannot drop into some other
REPL like <code>julia</code> or <code>psql</code>. If there’s sufficient interest in IBash
Notebook, then I suspect that these issues can be solved. Regardless, I
believe that despite these two issues, IBash Notebook could very well
serve as a means to introduce people to the command line.</p>
<p>If you want to try out the Bash kernel for yourself, you should install
<a href="https://github.com/ipython/ipython">IPython 3</a> (which is currently in
development). Then, you can clone the <a href="https://github.com/takluyver/bash_kernel">Bash kernel GitHub
repository</a> and install the
package. (Best to do this all inside a virtual environment.) Next time
you start a new notebook, you should be able to select the Bash kernel
in the top-right corner.</p>
<p>So, what do you think? Do you agree that IBash Notebook has potential?
Am I crazy thinking that the command line can ever live outside the
terminal? Would you like to see my book published as a collection of
IBash notebooks? So many questions. Let me know on
<a href="https://twitter.com/jeroenhjanssens/">Twitter</a>.</p>
<p>— Jeroen</p>
<p><em>Thanks to Rob Doherty and Adam Johnson for reading drafts of this.</em></p>
Lean, Mean Data Science Machine2013-12-07T00:00:00Zhttps://jeroenjanssens.com/machine/<p>Data scientists love to create interesting models and exciting data
visualisations. However, before they get to that point, usually much
effort goes into obtaining, scrubbing, and exploring the required data.
I argue that the Unix command line, although invented decades ago,
remains a powerful environment for processing data. It provides a
read-eval-print loop (REPL) that is often much more convenient for
exploratory data analysis than the edit-compile-run-debug cycle
associated with large programs and even scripts.</p>
<p>Unfortunately, setting up a workable environment and installing the
latest command-line tools can be quite a pain. This post describes how
to alleviate that pain and how to get you started doing data science on
the command line in a matter of minutes.</p>
<h2>Data Science at the Command Line</h2>
<p>I <del>am currently authoring</del> authored a book titled “<a href="https://www.datascienceatthecommandline.com/">Data
Science at the Command
Line</a>”, which <del>will
be</del> was published by O’Reilly in October 2014. The main goal of the
book is to teach why, how, and when the command line could be employed
for data science. The tentative outline is as follows:</p>
<ol>
<li>Introduction</li>
<li>Getting Started</li>
<li>Step 1: Obtaining Data</li>
<li>Creating Reusable Command-Line Tools</li>
<li>Step 2: Scrubbing Data</li>
<li>Managing Your Data Workflow</li>
<li>Step 3: Exploring Data</li>
<li>Speeding Up Data-Intensive Commands</li>
<li>Step 4: Modelling Data</li>
<li>Poor Man’s MapReduce</li>
<li>Step 5: Interpreting Data</li>
<li>Conclusion</li>
</ol>
<p>Naturally, the book will be drenched with commands and source code. It
is important that the text, the code, and the output of the code are
consistent with each other. Manually running the code and copy-pasting
the output is a cumbersome and error-prone process. To automate this
process, I have created a script (a <a href="http://www.dexy.it/">dexy</a> filter
to be precise) that will (1) extract all the source code from the text,
(2) run these in an isolated environment, and (3) paste the output back
into the text. From here the O’Reilly toolchain takes over and converts
the text to a variety of digital formats. Very smooth.</p>
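<p>The three steps can be sketched in a few lines of Python. This is only an
illustration of the idea, not the dexy filter itself: the function name
<code>run_and_paste</code> and the convention that commands are lines
starting with <code>$ </code> are assumptions, and the commands run in an
ordinary shell rather than in an isolated environment.</p>

```python
import subprocess


def run_and_paste(text):
    """Run every line that starts with '$ ' and insert its output below it."""
    result = []
    for line in text.splitlines():
        result.append(line)
        if line.startswith("$ "):
            # Steps 1 and 2: extract the command and run it in a fresh shell.
            completed = subprocess.run(
                line[2:], shell=True, capture_output=True, text=True
            )
            # Step 3: paste the captured output back into the text.
            result.extend(completed.stdout.splitlines())
    return "\n".join(result)


print(run_and_paste("Some text.\n$ echo hello\nMore text."))
```

<p>Because the output is regenerated on every run, the text and the code
cannot silently drift apart.</p>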
<h2>Your own Data Science Toolbox environment with Vagrant</h2>
<p>The environment is created and configured using
<a href="http://www.vagrantup.com/">Vagrant</a>, which is basically a wrapper
around VirtualBox and other virtualisation software such as AWS EC2. With a
few commands, a fresh virtual machine is spun up and configured
according to a simple script. It was <a href="http://miningthesocialweb.com/2013/11/23/confessions-of-a-prolific-moonlighter-with-a-chronic-writing-disorder">Matthew Russell’s Ignite
talk</a>
that inspired me to use Vagrant; he provides one for his book <a href="http://miningthesocialweb.com/">Mining
the Social Web</a> that is focused more on
Python. If my Vagrant environment were provided with Data Science at
the Command Line, then the reader would be able to follow along with the
commands and source code. But since my mission is to enable everybody to
do data science at the command line as soon as possible, I have decided
to make it available right now.</p>
<p>Currently, the environment includes the <a href="https://jeroenjanssens.com/blog/seven-command-line-tools-for-data-science">seven command-line tools I
discussed</a> a while ago
and <a href="http://www.gnu.org/software/parallel/">GNU parallel</a>, which will be
discussed in Chapter 8. Just like the book itself, the environment is a
work in progress. In order to be able to run
<a href="https://github.com/jeroenjanssens/data-science-toolbox/blob/master/tools/Rio">Rio</a>
(one of the seven tools), I had to include the latest version of <code>R</code>,
together with the packages <code>ggplot2</code>, <code>sqldf</code>, and <code>plyr</code>. <del>I am
aware that many of you would prefer the Python scientific stack to be
included as well.</del> The Python scientific stack (<code>ipython</code>, <code>numpy</code>,
<code>scipy</code>, <code>matplotlib</code>, <code>pandas</code>, and <code>scikit-learn</code>) is also included.
However, because of disk-space and provision-time constraints, I doubt
whether it is desirable (or even possible) to create an environment that
includes everything. Perhaps we can devise a solution where you
select which tools, packages, and languages you would like to have
installed. As mentioned, it is a work in progress and my main goal is to
get you up and running on the command line.</p>
<h2>Installing the Data Science Toolbox environment</h2>
<p>The environment is currently configured to run on top of
<a href="https://www.virtualbox.org/">VirtualBox</a>. (I am looking into the option
to deploy it on an AWS EC2 instance.) So, first you will need to install
<a href="https://www.virtualbox.org/">VirtualBox</a>. Second, you need to install
<a href="http://www.vagrantup.com/">Vagrant</a>. Third, you need to download the
environment by cloning the data science toolbox. (If you do not want to
use <code>git</code> you can also <a href="https://github.com/jeroenjanssens/data-science-toolbox/archive/master.zip">download the zip
file</a>.)</p>
<pre class="language-shell-session"><code class="language-shell-session"><span class="token command"><span class="token shell-symbol important">$</span> <span class="token bash language-bash"><span class="token function">git</span> clone https://github.com/jeroenjanssens/data-science-toolbox.git</span></span><br /><span class="token command"><span class="token shell-symbol important">$</span> <span class="token bash language-bash"><span class="token builtin class-name">cd</span> data-science-toolbox/box</span></span></code></pre>
<p>Running <code>vagrant up</code> in the <code>box</code> directory will download the base box
(Ubuntu 12.04 LTS 64-bit), spin up a virtual machine, and provision it.
(Now would be the perfect time to think about any command-line scripts
you may have lying around and donate them to the <a href="http://datasciencetoolbox.org/">data science
toolbox</a>.) Once the provisioning is
complete, you will be able to log into your own lean, mean data science
machine:</p>
<pre class="language-shell-session"><code class="language-shell-session"><span class="token command"><span class="token shell-symbol important">$</span> <span class="token bash language-bash">vagrant <span class="token function">ssh</span></span></span></code></pre>
<p>Run the following command to test whether everything has been installed
correctly:</p>
<pre class="language-shell-session"><code class="language-shell-session"><span class="token command"><span class="token shell-symbol important">$</span> <span class="token bash language-bash"><span class="token function">curl</span> <span class="token parameter variable">-s</span> <span class="token string">'http://en.wikipedia.org/wiki/List_of_countries_and_territories_by_border/area_ratio'</span> <span class="token operator">|</span></span></span><br /><span class="token output">> scrape -be 'table.wikitable > tr:not(:first-child)' |<br />> xml2json |<br />> jq -c '.html.body.tr[] | {country: .td[1][], border: .td[2][], surface: .td[3][], ratio: .td[4][]}' |<br />> json2csv -p -k=country,ratio |<br />> Rio -se'sqldf("select * from df where ratio > 0.3 order by ratio desc")' | <br />> csvlook<br />|----------------+------------|<br />| country | ratio |<br />|----------------+------------|<br />| Vatican City | 7.2727273 |<br />| Monaco | 2.2 |<br />| San Marino | 0.6393443 |<br />| Liechtenstein | 0.475 |<br />|----------------+------------|</span></code></pre>
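<p>If you are curious what the <code>Rio</code> step does with its SQL
query, here is a rough Python equivalent using the standard
<code>sqlite3</code> module. It is only a sketch: the table name
<code>df</code> mirrors the query above, and the rows are the four lines of
the output shown above, deliberately entered out of order.</p>

```python
import sqlite3

# The four rows from the output above, deliberately out of order.
rows = [
    ("Liechtenstein", 0.475),
    ("Monaco", 2.2),
    ("San Marino", 0.6393443),
    ("Vatican City", 7.2727273),
]

connection = sqlite3.connect(":memory:")
connection.execute("create table df (country text, ratio real)")
connection.executemany("insert into df values (?, ?)", rows)

# The same query as the one passed to sqldf in the pipeline.
query = "select * from df where ratio > 0.3 order by ratio desc"
result = connection.execute(query).fetchall()

for country, ratio in result:
    print(f"{country:<15}{ratio}")
```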
<p>The virtual machine is not entirely isolated. Files that you put in the
<code>box</code> directory will be accessible from the <code>/vagrant</code> directory in the
virtual machine. This allows you to use both the tools you already have
installed and the command-line tools provided by the environment. If you
want to install any of these tools on your own machine, then you can run
the relevant commands from the <a href="https://github.com/jeroenjanssens/data-science-toolbox/blob/master/box/bootstrap.sh">provisioning
script</a>.</p>
<h2><a name="comparison-of-virtual-environments-for-data-science"></a>Comparison of virtual environments for data science</h2>
<p>Of course the Data Science Toolbox environment is not the only one
available for doing data science! So far, I have been able to perform a
rudimentary comparison with three other solutions. (Please let me know
if you know any others.)</p>
<p><strong>1. Data Science Toolbox (DST)</strong> <br /> Created by: <a href="https://twitter.com/jeroenhjanssens/">Jeroen
Janssens</a> <br /> GitHub:
<a href="https://github.com/jeroenjanssens/data-science-toolbox">jeroenjanssens/data-science-toolbox</a>
<br /> Installs R, the Python scientific stack, and of course many
command-line tools for processing data. Uses Vagrant and for now it can
only be deployed on VirtualBox.</p>
<p><strong>2. Mining the Social Web (MTSW)</strong> <br /> Created by: <a href="https://twitter.com/ptwobrussell">Matthew
Russell</a> <br /> Website:
<a href="http://miningthesocialweb.com/">miningthesocialweb.com/</a> <br /> GitHub:
<a href="https://github.com/ptwobrussell/Mining-the-Social-Web-2nd-Edition">ptwobrussell/Mining-the-Social-Web-2nd-Edition</a>
<br /> Uses Vagrant (with Chef as the provisioner, which is really nice)
and can be deployed on both VirtualBox and AWS. Installs IPython
Notebook, numpy, mongo, and NLTK, which allows you to follow along with
the examples provided in the book. An AWS AMI is available as well.</p>
<p><strong>3. Data Science Toolkit (DSTK)</strong> <br /> Created by: <a href="https://twitter.com/petewarden">Pete
Warden</a> <br /> Website:
<a href="http://www.datasciencetoolkit.org/">www.datasciencetoolkit.org</a> <br />
GitHub: <a href="https://github.com/petewarden/dstk">petewarden/dstk</a> <br /> The
website provides a sandbox from which you can try out many interesting
APIs. These APIs can also be accessed from the command line. An AWS AMI
is available.</p>
<p><strong>4. Data Science Box (DSB)</strong> <br /> Created by: <a href="https://twitter.com/drewconway">Drew
Conway</a> <br /> GitHub:
<a href="https://github.com/drewconway/data_science_box">drewconway/data_science_box</a>
<br /> This is a bash script for which you need to have an AWS EC2 instance
running. It installs R, Shiny, IPython Notebook, and the Python
scientific stack.</p>
<p>For your convenience I have summarised this information in the following
table.</p>
<div class="w-full rounded-lg py-2 px-4 mb-8 bg-gray-100">
<div class="overflow-x-scroll py-2">
<table class="w-full text-left m-0 !text-sm">
<thead>
<tr>
<th class="p-1">
</th><th class="p-1">Configuration</th>
<th class="p-1">VirtualBox</th>
<th class="p-1">AWS</th>
<th class="p-1">AMI</th>
<th class="p-1">Python</th>
<th class="p-1">R</th>
<th class="p-1">Shiny</th>
<th class="p-1">Comments</th>
</tr>
</thead>
<tbody>
<tr>
<th class="p-1">1. DST</th>
<td class="p-1">Vagrant</td>
<td class="p-1 bg-blue-200">Yes</td>
<td class="p-1 bg-orange-200">No</td>
<td class="p-1 bg-orange-200">No</td>
<td class="p-1 bg-blue-200">Yes</td>
<td class="p-1 bg-blue-200">Yes</td>
<td class="p-1 bg-orange-200">No</td>
<td class="p-1">Includes the Data Science Toolbox</td>
</tr>
<tr>
<th class="p-1">2. MTSW</th>
<td class="p-1">Vagrant</td>
<td class="p-1 bg-blue-200">Yes</td>
<td class="p-1 bg-blue-200">Yes</td>
<td class="p-1 bg-blue-200">Yes</td>
<td class="p-1 bg-blue-200">Yes</td>
<td class="p-1 bg-orange-200">No</td>
<td class="p-1 bg-orange-200">No</td>
<td class="p-1"></td>
</tr>
<tr>
<th class="p-1">3. DSTK</th>
<td class="p-1">Vagrant</td>
<td class="p-1 bg-blue-200">Yes</td>
<td class="p-1 bg-blue-200">Yes</td>
<td class="p-1 bg-blue-200">Yes</td>
<td class="p-1 bg-orange-200">No</td>
<td class="p-1 bg-orange-200">No</td>
<td class="p-1 bg-orange-200">No</td>
<td class="p-1">Includes various command-line tools</td>
</tr>
<tr>
<th class="p-1">4. DSB</th>
<td class="p-1">Bash</td>
<td class="p-1 bg-orange-200">No</td>
<td class="p-1 bg-blue-200">Yes</td>
<td class="p-1 bg-orange-200">No</td>
<td class="p-1 bg-blue-200">Yes</td>
<td class="p-1 bg-blue-200">Yes</td>
<td class="p-1 bg-blue-200">Yes</td>
<td class="p-1"></td>
</tr>
</tbody>
</table>
</div>
</div>
<p>In short, I think that they all have some strong aspects. Some of these
may be improved over time (I am currently looking into using Chef as the
provisioner), and new environments may arise; that is the way open source
works. In the end, it is up to you to decide which one works best for
you. And if you want to make some tweaks, you can always fork the
appropriate GitHub repository.</p>
<p>It is in general just amazing to be able to spin up a new virtual
machine with your own or somebody else’s environment, whether by running
<code>vagrant up</code> or by clicking a few buttons on AWS.</p>
<p>I realise that three out of four names look really alike, which can be
confusing, but it could also indicate that there is a need for having an
automated (and isolated) setup to start doing data science without any
additional hassle.</p>
<h2>Conclusion</h2>
<p>While the command line is a very powerful environment to process data,
manually installing the latest command-line tools is not
straightforward. Vagrant allows you to spin up a virtual machine and to
install all the tools automatically. In this post I have shared with you
the exact same Vagrant environment that I am using for my upcoming
book, in the hope that it will be useful to get you started with doing
data science at the command line. I have also compared my environment
with three other virtual environments for data science. Please let me
know if you have any questions, suggestions, or contributions.</p>
<p>— Jeroen</p>
Stochastic Outlier Selection2013-11-14T00:00:00Zhttps://jeroenjanssens.com/sos/<p>My Ph.D., which I completed earlier this year, was about <a href="https://github.com/jeroenjanssens/phd-thesis">outlier
selection and one-class
classification</a>. During
this time I learned about quite a few machine learning algorithms,
especially outlier-selection algorithms and one-class classifiers.
With some help from <a href="https://twitter.com/fhuszar">Ferenc Huszár</a> and
<a href="http://homepage.tudelft.nl/19j49/Home.html">Laurens van der Maaten</a>, I
also came up with a new outlier-selection algorithm called <a href="https://github.com/jeroenjanssens/scikit-sos">Stochastic
Outlier Selection</a> (SOS),
which I would like to briefly describe here.</p>
<figure>
<a href="https://jeroenjanssens.com/img/sos-densities.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/T7a0tgXYHz-302.webp 302w, https://jeroenjanssens.com/img/T7a0tgXYHz-453.webp 453w, https://jeroenjanssens.com/img/T7a0tgXYHz-604.webp 604w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/T7a0tgXYHz-302.webp 302w, https://jeroenjanssens.com/img/T7a0tgXYHz-453.webp 453w, https://jeroenjanssens.com/img/T7a0tgXYHz-604.webp 604w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/T7a0tgXYHz-302.jpeg 302w, https://jeroenjanssens.com/img/T7a0tgXYHz-453.jpeg 453w, https://jeroenjanssens.com/img/T7a0tgXYHz-604.jpeg 604w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/T7a0tgXYHz-302.jpeg 302w, https://jeroenjanssens.com/img/T7a0tgXYHz-453.jpeg 453w, https://jeroenjanssens.com/img/T7a0tgXYHz-604.jpeg 604w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 90%;" src="https://jeroenjanssens.com/img/T7a0tgXYHz-302.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
<p>If you prefer a more detailed discussion about the algorithm, the
experiments, and the results, you can read chapter 4 of <a href="https://github.com/jeroenjanssens/phd-thesis">my Ph.D.
thesis</a>. If you can’t
wait to see whether your own dataset contains any outliers, there’s
a <a href="https://github.com/jeroenjanssens/scikit-sos">Python implementation of
SOS</a>, which you can also
use from the command line.</p>
<h2>Affinity-based outlier selection</h2>
<p>SOS is an unsupervised outlier-selection algorithm that takes as input
either a feature matrix or a dissimilarity matrix and outputs for each
data point an outlier probability. Intuitively, a data point is
considered to be an outlier when the other data points have insufficient
affinity with it. Allow me to explain this using the following
two-dimensional toy dataset.</p>
<figure>
<a href="https://jeroenjanssens.com/img/sos-toydataset.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/EvaWFYPX6--302.webp 302w, https://jeroenjanssens.com/img/EvaWFYPX6--453.webp 453w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/EvaWFYPX6--302.webp 302w, https://jeroenjanssens.com/img/EvaWFYPX6--453.webp 453w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/EvaWFYPX6--302.jpeg 302w, https://jeroenjanssens.com/img/EvaWFYPX6--453.jpeg 453w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/EvaWFYPX6--302.jpeg 302w, https://jeroenjanssens.com/img/EvaWFYPX6--453.jpeg 453w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 90%;" src="https://jeroenjanssens.com/img/EvaWFYPX6--302.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
<p>The right part of the figure shows that the feature matrix <strong>X</strong> is
transformed into a dissimilarity matrix <strong>D</strong> using the Euclidean
distance. (Any dissimilarity measure could have been used here.) Using
the dissimilarity matrix <strong>D</strong>, SOS computes an affinity matrix <strong>A</strong>, a
binding probability matrix <strong>B</strong>, and finally, the outlier probability
vector <strong>Φ</strong>, because Greek letters are cool.</p>
<figure>
<a href="https://jeroenjanssens.com/img/sos-matrices.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/V3aJEkfqc3-302.webp 302w, https://jeroenjanssens.com/img/V3aJEkfqc3-453.webp 453w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/V3aJEkfqc3-302.webp 302w, https://jeroenjanssens.com/img/V3aJEkfqc3-453.webp 453w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/V3aJEkfqc3-302.jpeg 302w, https://jeroenjanssens.com/img/V3aJEkfqc3-453.jpeg 453w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/V3aJEkfqc3-302.jpeg 302w, https://jeroenjanssens.com/img/V3aJEkfqc3-453.jpeg 453w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 90%;" src="https://jeroenjanssens.com/img/V3aJEkfqc3-302.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
<p>The use of the concept of affinity is inspired by <a href="http://homepage.tudelft.nl/19j49/t-SNE.html">t-Distributed
Stochastic Neighbour
Embedding</a> (t-SNE), which
is a non-linear dimensionality reduction technique created by <a href="http://homepage.tudelft.nl/19j49/Home.html">Laurens
van der Maaten</a> and
<a href="http://www.cs.toronto.edu/~hinton/">Geoffrey Hinton</a>. Both algorithms
use the concept of affinity to quantify the relationship between data
points. t-SNE uses it to preserve the local structure of a
high-dimensional dataset and SOS uses it to select outliers. The
affinity that one data point has with another decays like a Gaussian
as their dissimilarity increases.</p>
<figure>
<a href="https://jeroenjanssens.com/img/sos-d2a.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/_nfedPH5-9-302.webp 302w, https://jeroenjanssens.com/img/_nfedPH5-9-453.webp 453w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/_nfedPH5-9-302.webp 302w, https://jeroenjanssens.com/img/_nfedPH5-9-453.webp 453w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/_nfedPH5-9-302.jpeg 302w, https://jeroenjanssens.com/img/_nfedPH5-9-453.jpeg 453w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/_nfedPH5-9-302.jpeg 302w, https://jeroenjanssens.com/img/_nfedPH5-9-453.jpeg 453w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 90%;" src="https://jeroenjanssens.com/img/_nfedPH5-9-302.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
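<p>To make the Gaussian-like decay concrete, here is a minimal Python sketch. It assumes the affinity has the form <code>exp(-d^2 / (2 * variance))</code>, where the variance belongs to the data point whose affinities are being computed. The function name <code>affinity</code> and the example dissimilarities are made up for illustration.</p>

```python
import math


def affinity(dissimilarity, variance):
    """Affinity that decays like a Gaussian as the dissimilarity grows.

    Assumed form: exp(-d^2 / (2 * variance)).
    """
    return math.exp(-dissimilarity ** 2 / (2.0 * variance))


# Hypothetical dissimilarities from one data point to four others.
dissimilarities = [0.5, 1.0, 2.0, 4.0]
affinities = [affinity(d, variance=1.0) for d in dissimilarities]

for d, a in zip(dissimilarities, affinities):
    print(f"dissimilarity {d:>4}: affinity {a:.4f}")
```

<p>A larger variance flattens the curve, so distant data points keep more of their affinity.</p>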
<p>Each data point has a variance associated with it. The variance depends
on the density of the neighbourhood. A higher density implies a lower
variance. In fact, the variance is set such that each data point has
effectively the same number of neighbours. This number is controlled via
the only parameter of SOS, called perplexity.</p>
<figure>
<a href="https://jeroenjanssens.com/img/sos-variances.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.6)" srcset="https://jeroenjanssens.com/img/oFcFXbIoqR-201.webp 201w, https://jeroenjanssens.com/img/oFcFXbIoqR-302.webp 302w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.6)" srcset="https://jeroenjanssens.com/img/oFcFXbIoqR-201.webp 201w, https://jeroenjanssens.com/img/oFcFXbIoqR-302.webp 302w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.6)" srcset="https://jeroenjanssens.com/img/oFcFXbIoqR-201.jpeg 201w, https://jeroenjanssens.com/img/oFcFXbIoqR-302.jpeg 302w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.6)" srcset="https://jeroenjanssens.com/img/oFcFXbIoqR-201.jpeg 201w, https://jeroenjanssens.com/img/oFcFXbIoqR-302.jpeg 302w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 60%;" src="https://jeroenjanssens.com/img/oFcFXbIoqR-201.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
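<p>One way to set the variance of a data point is to search for the value at which the perplexity of its binding probabilities equals the requested perplexity. The sketch below does this with a bisection on a log scale. It assumes the Gaussian-like affinity <code>exp(-d^2 / (2 * variance))</code> and defines perplexity as two to the power of the Shannon entropy. All function names, the search bounds, and the example dissimilarities are illustrative assumptions, not taken from the actual implementation.</p>

```python
import math


def binding_row(dissimilarities, variance):
    """Binding probabilities of one data point with all the others."""
    squared = [d * d for d in dissimilarities]
    nearest = min(squared)
    # Subtracting the smallest squared dissimilarity leaves the normalised
    # result unchanged, but avoids dividing by zero for tiny variances.
    affinities = [math.exp(-(s - nearest) / (2.0 * variance)) for s in squared]
    total = sum(affinities)
    return [a / total for a in affinities]


def perplexity(probabilities):
    """Two to the power of the Shannon entropy (in bits)."""
    entropy = -sum(p * math.log2(p) for p in probabilities if p > 0.0)
    return 2.0 ** entropy


def find_variance(dissimilarities, target, low=1e-10, high=1e10, steps=200):
    """Bisect on a log scale for the variance that yields the target perplexity."""
    for _ in range(steps):
        middle = math.sqrt(low * high)  # geometric midpoint
        if perplexity(binding_row(dissimilarities, middle)) > target:
            high = middle  # too many effective neighbours: shrink the variance
        else:
            low = middle  # too few effective neighbours: grow the variance
    return math.sqrt(low * high)


# Hypothetical dissimilarities for a point in a dense and in a sparse region.
dense = [0.1, 0.2, 0.3, 0.4, 0.5]
sparse = [1.0, 2.0, 3.0, 4.0, 5.0]

variance_dense = find_variance(dense, target=3.0)
variance_sparse = find_variance(sparse, target=3.0)
print(variance_dense, variance_sparse)
```

<p>With the same target perplexity, the point in the dense region ends up with the smaller variance, which matches the statement that a higher density implies a lower variance.</p>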
<p>Perplexity can be interpreted as the <em>k</em> in <em>k</em>-nearest neighbour
algorithms. The difference is that in SOS being a neighbour is not a
binary property, but a probabilistic one. The following figure
illustrates the binding probabilities data point <strong>x1</strong> (or vertex
<strong>v1</strong> because we have switched to a graph representation of the
dataset) has with the other five data points.</p>
<figure>
<a href="https://jeroenjanssens.com/img/sos-binding.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/TCJWBd5Dt5-302.webp 302w, https://jeroenjanssens.com/img/TCJWBd5Dt5-453.webp 453w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/TCJWBd5Dt5-302.webp 302w, https://jeroenjanssens.com/img/TCJWBd5Dt5-453.webp 453w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/TCJWBd5Dt5-302.jpeg 302w, https://jeroenjanssens.com/img/TCJWBd5Dt5-453.jpeg 453w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/TCJWBd5Dt5-302.jpeg 302w, https://jeroenjanssens.com/img/TCJWBd5Dt5-453.jpeg 453w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 90%;" src="https://jeroenjanssens.com/img/TCJWBd5Dt5-302.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
<p>The binding probability matrix is just the affinity matrix normalised such
that each row sums to 1. To obtain the outlier probability of a data point, we
compute the joint probability that the other data points will <em>not</em> bind
to it.</p>
<figure>
<a href="https://jeroenjanssens.com/img/sos-closedform.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/4ELpCLjHig-302.webp 302w, https://jeroenjanssens.com/img/4ELpCLjHig-453.webp 453w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/4ELpCLjHig-302.webp 302w, https://jeroenjanssens.com/img/4ELpCLjHig-453.webp 453w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/4ELpCLjHig-302.jpeg 302w, https://jeroenjanssens.com/img/4ELpCLjHig-453.jpeg 453w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/4ELpCLjHig-302.jpeg 302w, https://jeroenjanssens.com/img/4ELpCLjHig-453.jpeg 453w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 90%;" src="https://jeroenjanssens.com/img/4ELpCLjHig-302.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
<p>This simple equation corresponds to the intuition behind SOS mentioned
earlier: a data point is considered to be an outlier when the other data
points have insufficient affinity with it. The proof behind this
equation is unfortunately beyond the scope of this post.</p>
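<p>The chain from dissimilarities to outlier probabilities can be sketched in a few lines of plain Python. This is a simplified illustration, not the actual implementation: it assumes the Gaussian-like affinity <code>exp(-d^2 / (2 * variance))</code> and, to stay short, uses one shared variance for all data points instead of deriving a variance per point from the perplexity. The one-dimensional dataset is made up.</p>

```python
import math


def outlier_probabilities(dissimilarity_matrix, variance=1.0):
    """Compute A, then B, then the outlier probability of every data point."""
    n = len(dissimilarity_matrix)
    # Affinity matrix A: Gaussian-like decay, and no affinity with oneself.
    A = [
        [
            0.0 if i == j
            else math.exp(-dissimilarity_matrix[i][j] ** 2 / (2.0 * variance))
            for j in range(n)
        ]
        for i in range(n)
    ]
    # Binding probability matrix B: each row of A rescaled to sum to 1.
    B = []
    for row in A:
        total = sum(row)
        B.append([a / total for a in row])
    # Outlier probability of point j: the joint probability that none of the
    # other points binds to it.
    return [
        math.prod(1.0 - B[i][j] for i in range(n) if i != j)
        for j in range(n)
    ]


# Hypothetical dataset: four points close together and one far away.
points = [0.0, 1.0, 2.0, 3.0, 10.0]
D = [[abs(p - q) for q in points] for p in points]

phi = outlier_probabilities(D)
for point, probability in zip(points, phi):
    print(f"x = {point:>4}: outlier probability {probability:.4f}")
```

<p>The point at 10.0 receives hardly any binding probability from the others, so its outlier probability ends up close to 1.</p>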
<p>SOS has been evaluated on a variety of real-world and synthetic
datasets, and compared to four other outlier-selection algorithms. The
following figure shows the weighted AUC performance on 18 real-world
datasets.</p>
<figure>
<a href="https://jeroenjanssens.com/img/sos-results.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/tsgxxQS7R7-302.webp 302w, https://jeroenjanssens.com/img/tsgxxQS7R7-453.webp 453w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/tsgxxQS7R7-302.webp 302w, https://jeroenjanssens.com/img/tsgxxQS7R7-453.webp 453w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/tsgxxQS7R7-302.jpeg 302w, https://jeroenjanssens.com/img/tsgxxQS7R7-453.jpeg 453w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/tsgxxQS7R7-302.jpeg 302w, https://jeroenjanssens.com/img/tsgxxQS7R7-453.jpeg 453w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 90%;" src="https://jeroenjanssens.com/img/tsgxxQS7R7-302.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
<p>As you can see, SOS has a higher performance on most of these real-world
datasets. However, there’s still the no-free-lunch theorem, which
basically says that no algorithm uniformly outperforms all other
algorithms on all datasets. So, if you’d like to select some outliers on
your own dataset, check out SOS by all means, but keep in mind that you
may obtain a higher performance with a different outlier-selection
algorithm. The real questions are: which one and why?</p>
<p>As this was a very brief description of SOS, I had to skip over many
details. Again, in case you’re interested, you can read either the
<a href="https://github.com/jeroenjanssens/sos/blob/master/doc/sos-ticc-tr-2012-001.pdf?raw=true">technical report
(PDF)</a>
or chapter 4 of <a href="https://github.com/jeroenjanssens/phd-thesis">my Ph.D.
thesis</a>. In the next
section I apply SOS to roll call voting data.</p>
<h2><a name="detecting-anomalous-senators"></a>Detecting anomalous senators</h2>
<p>Last week, I had the pleasure of talking about outlier selection and
one-class classification at the <a href="http://www.meetup.com/NYC-Machine-Learning/events/149093182/">NYC Machine Learning
meetup</a>.
In order to not just show fancy graphs and boring equations I created a
<a href="http://bl.ocks.org/jeroenjanssens/7608890">demo in D3 and
CoffeeScript</a>, of which you
see a screenshot below. In the
<a href="http://bl.ocks.org/jeroenjanssens/7608890">demo</a>, I apply SOS to roll
call voting data, which is inspired by <a href="http://vikparuchuri.com/blog/how-divided-is-the-senate/">this post on visualising the
senate</a> by Vik
Paruchuri. The demo illustrates how the approximated outlier probability
of each senator evolves as more Stochastic Neighbour Graphs (SNGs) are
sampled. (Please note that SNGs are not discussed in this post.)</p>
<p><a href="http://bl.ocks.org/jeroenjanssens/7608890"><figure>
<a href="https://jeroenjanssens.com/img/sos-senators.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/HUYzIBYNly-302.webp 302w, https://jeroenjanssens.com/img/HUYzIBYNly-453.webp 453w, https://jeroenjanssens.com/img/HUYzIBYNly-604.webp 604w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/HUYzIBYNly-302.webp 302w, https://jeroenjanssens.com/img/HUYzIBYNly-453.webp 453w, https://jeroenjanssens.com/img/HUYzIBYNly-604.webp 604w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/HUYzIBYNly-302.jpeg 302w, https://jeroenjanssens.com/img/HUYzIBYNly-453.jpeg 453w, https://jeroenjanssens.com/img/HUYzIBYNly-604.jpeg 604w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/HUYzIBYNly-302.jpeg 302w, https://jeroenjanssens.com/img/HUYzIBYNly-453.jpeg 453w, https://jeroenjanssens.com/img/HUYzIBYNly-604.jpeg 604w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 90%;" src="https://jeroenjanssens.com/img/HUYzIBYNly-302.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure></a></p>
<p>Let’s see how the approximated outlier probabilities compare to the
outlier probabilities computed on the command-line. Recently, I started
using <a href="https://github.com/Factual/drake#drake">drake</a> to organise my
data workflow. (If you care about reproducibility, then I recommend you
try it out.) The following <code>Drakefile</code> shows how to fetch the roll call
voting data, extract its features and labels, and apply the <a href="https://github.com/jeroenjanssens/scikit-sos">Python
implementation of SOS</a>
with a perplexity of 50 to it.</p>
<pre class="language-shell-session"><code class="language-shell-session"><span class="token command"><span class="token shell-symbol important">$</span> <span class="token bash language-bash"><span class="token function">cat</span> Drakefile</span></span><br /><span class="token output">;# Get dataset<br />dataset.csv <- [-timecheck]<br /> curl -s "https://raw.github.com/VikParuchuri/political-positions/master/113_frame.csv" > $OUTPUT<br /><br />;# Extract features<br />features.csv <- dataset.csv<br /> csvcut $INPUT -C "1,name,party,state" | sed '1d;s/NA/4/g' > $OUTPUT<br /><br />;# Extract labels<br />labels.csv <- dataset.csv<br /> csvcut $INPUT -c "name,party,state" > $OUTPUT<br /><br />;# Compute outlier probabilities using SOS<br />outlier.csv <- features.csv<br /> echo 'outlier' > $OUTPUT<br /> < $INPUT sos -p 50 >> $OUTPUT<br /><br />;# Combine labels and outlier probabilities and sort<br />result.csv <- labels.csv, outlier.csv<br /> paste -d, $INPUT0 $INPUT1 | csvsort -rc outlier > $OUTPUT</span></code></pre>
<pre class="language-shell-session"><code class="language-shell-session"><span class="token command"><span class="token shell-symbol important">$</span> <span class="token bash language-bash">drake</span></span><br /><span class="token command"><span class="token shell-symbol important">$</span> <span class="token bash language-bash"><span class="token function">head</span> result.csv <span class="token operator">|</span> csvlook</span></span><br /><span class="token output">|-------------+-------+-------+-------------|<br />| name | party | state | outlier |<br />|-------------+-------+-------+-------------|<br />| Cowan | D | MA | 0.91758412 |<br />| Lautenberg | D | NJ | 0.89442425 |<br />| Chiesa | R | NJ | 0.8457114 |<br />| Markey | D | MA | 0.7813504 |<br />| Kerry | D | MA | 0.75302407 |<br />| Wyden | D | OR | 0.70110306 |<br />| Murkowski | R | AK | 0.68868458 |<br />| Alexander | R | TN | 0.626972 |<br />| Vitter | R | LA | 0.59739462 |<br />|-------------+-------+-------+-------------|</span></code></pre>
<p>The tools <code>csvcut</code>, <code>csvsort</code>, and <code>csvlook</code> are part of
<a href="http://csvkit.readthedocs.org/">csvkit</a>. You may notice that the
outlier probabilities shown in the screenshot do not match the exact
ones computed with <code>sos</code>. That’s because (1) the screenshot was taken
not long after the demo started and (2) the demo was running in Chrome,
which apparently has a different implementation of <code>Math.random</code>. In
Firefox, the approximated outlier probabilities will match the exact
ones, eventually.</p>
<p>— Jeroen</p>
7 Command-Line Tools for Data Science2013-09-19T00:00:00Zhttps://jeroenjanssens.com/seven/<p>Data science is
<a href="http://www.dataists.com/2010/09/a-taxonomy-of-data-science/">OSEMN</a>
(pronounced as awesome). That is, it involves Obtaining, Scrubbing,
Exploring, Modelling, and iNterpreting data. As a data scientist, I
spend quite a bit of time on the command-line, especially when there’s
data to be obtained, scrubbed, or explored. And I’m not alone in this.
Recently, <a href="http://www.gregreda.com/2013/07/15/unix-commands-for-data-science/">Greg Reda
discussed</a>
how the classics (e.g., head, cut, grep, sed, and awk) can be used for
data science. Prior to that, Seth Brown discussed how to perform basic
<a href="http://www.drbunsen.org/explorations-in-unix/">exploratory data analysis in
Unix</a>.</p>
<p>I would like to continue this discussion by sharing seven command-line
tools that I have found useful in my day-to-day work. The tools are:
<a href="http://stedolan.github.io/jq/">jq</a>,
<a href="https://github.com/jehiah/json2csv">json2csv</a>,
<a href="https://github.com/onyxfish/csvkit">csvkit</a>, scrape,
<a href="https://github.com/parmentf/xml2json">xml2json</a>, sample, and Rio. (The
home-made tools <code>scrape</code>, <code>sample</code>, and <code>Rio</code> can be found in this
<a href="https://github.com/jeroenjanssens/data-science-at-the-command-line/tree/master/tools">repository</a>.)
Any suggestions, questions, comments, and even pull requests are more
than welcome. (Tools suggested by others can be found towards the bottom
of the post.) OSEMN, let’s get started with our first tool: <code>jq</code>.</p>
<h2>1. jq - sed for JSON</h2>
<p>JSON is becoming an increasingly common data format, especially as APIs
are appearing everywhere. I remember cooking up the ugliest <code>grep</code> and
<code>sed</code> incantations in order to process JSON. Thanks to <code>jq</code>, those days
are now in the past.</p>
<p>Imagine we’re interested in the candidate totals of the 2008
presidential election. It so happens that the New York Times has a
<a href="http://developer.nytimes.com/docs/campaign_finance_api/">Campaign Finance
API</a>. (You can
<a href="http://developer.nytimes.com/apps/mykeys">get your own API keys</a> if you
want to access any of their APIs.) Let’s get some JSON using <code>curl</code>:</p>
<pre class="language-shell-session"><code class="language-shell-session"><span class="token command"><span class="token shell-symbol important">$</span> <span class="token bash language-bash"><span class="token function">curl</span> <span class="token parameter variable">-s</span> <span class="token string">'http://api.nytimes.com/svc/elections/us/v3/finances/2008/president/totals.json?api-key=super-secret'</span> <span class="token operator">></span> nyt.json</span></span></code></pre>
<p>where <code>-s</code> puts <code>curl</code> in silent mode. In its simplest form, i.e.,
<code>jq '.'</code>, the tool transforms the incomprehensible API response we got:</p>
<pre class="language-shell-session"><code class="language-shell-session"><span class="token command"><span class="token shell-symbol important">$</span> <span class="token bash language-bash"><span class="token function">cat</span> nyt.json</span></span><br /><span class="token output">{"status":"OK","base_uri":"http://api.nytimes.com/svc/elections/us/v3/finances/2008/","cycle":2008,"copyright":"Copyright (c) 2013 The New York Times Company. All Rights Reserved.","results":[{"candidate_name":"Obama, Barack","name":"Barack Obama","party":"D", ...</span></code></pre>
<p>into nicely indented output:</p>
<pre class="language-shell-session"><code class="language-shell-session"><span class="token command"><span class="token shell-symbol important">$</span> <span class="token bash language-bash"><span class="token operator"><</span> nyt.json jq <span class="token string">'.'</span> <span class="token operator">|</span> <span class="token function">head</span></span></span><br /><span class="token output">{<br /> "results": [<br /> {<br /> "candidate_id": "P80003338",<br /> "date_coverage_from": "2007-01-01",<br /> "date_coverage_to": "2008-11-24",<br /> "candidate_name": "Obama, Barack",<br /> "name": "Barack Obama",<br /> "party": "D", </span></code></pre>
<p>Note that the output isn’t necessarily in the same order as the input.
Besides pretty printing, <code>jq</code> can also select, filter, and format JSON
data, as illustrated by the following command, which returns the name,
party, and cash of each candidate that had more than $1,000,000 in cash:</p>
<pre class="language-shell-session"><code class="language-shell-session"><span class="token command"><span class="token shell-symbol important">$</span> <span class="token bash language-bash"><span class="token operator"><</span> nyt.json jq <span class="token parameter variable">-c</span> <span class="token string">'.results[] | {name, party, cash: .cash_on_hand} | select(.cash | tonumber > 1000000)'</span> </span></span><br /><span class="token output">{"cash":"29911984.0","party":"D","name":"Barack Obama"}<br />{"cash":"32812513.75","party":"R","name":"John McCain"}<br />{"cash":"4428347.5","party":"D","name":"John Edwards"}</span></code></pre>
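<p>For readers who think more easily in Python than in <code>jq</code> filters, the same select, rename, and filter steps can be sketched as follows. The embedded JSON is a trimmed-down stand-in for <code>nyt.json</code> that only keeps the fields used here, and its last candidate is made up so that the filter has something to drop. The function name <code>millionaires</code> is hypothetical.</p>

```python
import json

# A trimmed-down stand-in for nyt.json. The last candidate is made up so that
# the filter has something to drop.
raw = """
{"results": [
  {"name": "Barack Obama", "party": "D", "cash_on_hand": "29911984.0"},
  {"name": "John McCain", "party": "R", "cash_on_hand": "32812513.75"},
  {"name": "John Edwards", "party": "D", "cash_on_hand": "4428347.5"},
  {"name": "Hypothetical Candidate", "party": "I", "cash_on_hand": "1000.0"}
]}
"""


def millionaires(document, threshold=1_000_000):
    """Mimic: .results[] | {name, party, cash: .cash_on_hand} | select(...)."""
    for result in document["results"]:
        # Keep three fields and rename cash_on_hand to cash.
        row = {
            "name": result["name"],
            "party": result["party"],
            "cash": result["cash_on_hand"],
        }
        # The API returns cash as a string, hence the conversion (tonumber).
        if float(row["cash"]) > threshold:
            yield row


rows = list(millionaires(json.loads(raw)))
for row in rows:
    print(json.dumps(row, separators=(",", ":")))  # compact, like jq -c
```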
<p>Please refer to the <a href="http://stedolan.github.io/jq/manual/">jq manual</a> to
read about the many other things it can do, but don’t expect it to solve
all your data munging problems. Remember, the Unix philosophy favours
small programs that do one thing and do it well. And <code>jq</code>’s
functionality is more than sufficient I would say! Now that we have the
data we need, it’s time to move on to our second tool: <code>json2csv</code>.</p>
<h2>2. json2csv - convert JSON to CSV</h2>
<p>While JSON is a great format for interchanging data, it’s rather
unsuitable for most command-line tools. Not to worry, we can easily
convert JSON into CSV using
<a href="https://github.com/jehiah/json2csv">json2csv</a>. Assuming that we stored
the data from the last step in <code>million.json</code>, simply invoking
<code>json2csv</code> will convert it to some nicely comma-separated values:</p>
<pre class="language-shell-session"><code class="language-shell-session"><span class="token command"><span class="token shell-symbol important">$</span> <span class="token bash language-bash"><span class="token operator"><</span> million.json json2csv <span class="token parameter variable">-k</span> name,party,cash <span class="token operator">|</span> <span class="token function">head</span> <span class="token parameter variable">-n</span> <span class="token number">3</span></span></span><br /><span class="token output">Barack Obama,D,29911984.0<br />John McCain,R,32812513.75<br />John Edwards,D,4428347.5</span></code></pre>
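<p>Conceptually, the conversion boils down to reading one JSON object per line and writing the requested keys as one CSV row. The Python sketch below illustrates that idea with the standard library; it is not how <code>json2csv</code> itself is implemented, and the function name <code>json_lines_to_csv</code> is made up.</p>

```python
import csv
import io
import json

# The three lines produced by the jq command in the previous section.
json_lines = [
    '{"cash":"29911984.0","party":"D","name":"Barack Obama"}',
    '{"cash":"32812513.75","party":"R","name":"John McCain"}',
    '{"cash":"4428347.5","party":"D","name":"John Edwards"}',
]


def json_lines_to_csv(lines, keys):
    """Turn line-delimited JSON objects into CSV rows with the given keys."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        # A missing key becomes an empty field.
        writer.writerow([record.get(key, "") for key in keys])
    return buffer.getvalue()


csv_text = json_lines_to_csv(json_lines, keys=["name", "party", "cash"])
print(csv_text, end="")
```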
<p>Having the data in CSV format allows us to use the classic tools such as
<code>cut -d,</code> and <code>awk -F,</code>. Others like <code>grep</code> and <code>sed</code> don’t really have
a notion of fields. Since CSV is the king of tabular file formats,
according to the authors of <a href="http://csvkit.readthedocs.org/">csvkit</a>,
they created, well, <code>csvkit</code>.</p>
<h2>3. csvkit - suite of utilities for working with CSV</h2>
<p>Rather than being one tool, <a href="http://csvkit.readthedocs.org/">csvkit</a> is
a collection of tools that operate on CSV data. Most of these tools
expect the CSV data to have a header, so let’s add one. (Since the
publication of this post, <code>json2csv</code> has been updated to print the
header with the <code>-p</code> option.)</p>
<pre class="language-shell-session"><code class="language-shell-session"><span class="token command"><span class="token shell-symbol important">$</span> <span class="token bash language-bash"><span class="token builtin class-name">echo</span> name,party,cash <span class="token operator">|</span> <span class="token function">cat</span> - million.csv <span class="token operator">></span> million-header.csv</span></span></code></pre>
<p>We can, for example, sort the candidates by cash with <code>csvsort</code> and
display the data using <code>csvlook</code>:</p>
<pre class="language-shell-session"><code class="language-shell-session"><span class="token command"><span class="token shell-symbol important">$</span> <span class="token bash language-bash"><span class="token operator"><</span> million-header.csv csvsort <span class="token parameter variable">-rc</span> cash <span class="token operator">|</span> csvlook</span></span><br /><span class="token output">|---------------+-------+--------------|<br />| name | party | cash |<br />|---------------+-------+--------------|<br />| John McCain | R | 32812513.75 |<br />| Barack Obama | D | 29911984.0 |<br />| John Edwards | D | 4428347.5 |<br />|---------------+-------+--------------|</span></code></pre>
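<p>A detail worth spelling out is that the sort on <code>cash</code> has to be numeric. The following Python sketch, which uses only the standard library and the three rows shown above, converts the field to a float before sorting. Sorting the raw strings in reverse order would put John Edwards first, because <code>"4"</code> sorts after <code>"3"</code> and <code>"2"</code>.</p>

```python
import csv
import io

csv_text = """name,party,cash
Barack Obama,D,29911984.0
John McCain,R,32812513.75
John Edwards,D,4428347.5
"""

rows = list(csv.DictReader(io.StringIO(csv_text)))
# Convert to float so that the sort is numeric rather than alphabetical.
rows.sort(key=lambda row: float(row["cash"]), reverse=True)

for row in rows:
    print(f'{row["name"]:<14}{row["party"]:<3}{row["cash"]:>12}')
```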
<p>Looks like the MySQL console, doesn’t it? Speaking of databases, you can
insert the CSV data into an SQLite database as follows (many other
databases are supported as well):</p>
<pre class="language-shell-session"><code class="language-shell-session"><span class="token command"><span class="token shell-symbol important">$</span> <span class="token bash language-bash">csvsql <span class="token parameter variable">--db</span> sqlite:///myfirst.db <span class="token parameter variable">--insert</span> million-header.csv</span></span><br /><span class="token command"><span class="token shell-symbol important">$</span> <span class="token bash language-bash">sqlite3 myfirst.db</span></span><br /><span class="token output">sqlite> .schema million-header</span></code></pre>
<pre class="language-sql"><code class="language-sql"><span class="token keyword">CREATE</span> <span class="token keyword">TABLE</span> <span class="token string">"million-header"</span> <span class="token punctuation">(</span><br /> name <span class="token keyword">VARCHAR</span><span class="token punctuation">(</span><span class="token number">12</span><span class="token punctuation">)</span> <span class="token operator">NOT</span> <span class="token boolean">NULL</span><span class="token punctuation">,</span> <br /> party <span class="token keyword">VARCHAR</span><span class="token punctuation">(</span><span class="token number">1</span><span class="token punctuation">)</span> <span class="token operator">NOT</span> <span class="token boolean">NULL</span><span class="token punctuation">,</span> <br /> cash <span class="token keyword">FLOAT</span> <span class="token operator">NOT</span> <span class="token boolean">NULL</span><br /><span class="token punctuation">)</span><span class="token punctuation">;</span></code></pre>
<p>In this case, the database columns have the correct data types because
the type is inferred from the CSV data. Other tools within <code>csvkit</code> that
might be of interest are: <code>in2csv</code>, <code>csvgrep</code>, and <code>csvjoin</code>. And with
<code>csvjson</code>, the data can even be converted back to JSON. All in all,
<code>csvkit</code> is worth <a href="http://csvkit.readthedocs.org/">checking out</a>.</p>
<h2>4. scrape - HTML extraction using XPath or CSS selectors</h2>
<p>JSON APIs sure are nice, but they aren’t the only source of data; a lot
of it is <a href="http://www.ted.com/talks/tim_berners_lee_on_the_next_web.html">unfortunately
still</a>
embedded in HTML.
<a href="https://github.com/jeroenjanssens/data-science-toolbox">scrape</a> is a
Python script I put together that employs the <code>lxml</code> and <code>cssselect</code>
packages to select certain HTML elements by means of an XPath query or
<a href="http://net.tutsplus.com/tutorials/html-css-techniques/the-30-css-selectors-you-must-memorize/">CSS
selector</a>.
Let’s extract the table from <a href="http://en.wikipedia.org/wiki/List_of_countries_and_territories_by_border/area_ratio">this Wikipedia article that lists the
border and area ratio of each
country</a>.</p>
<pre class="language-shell-session"><code class="language-shell-session"><span class="token command"><span class="token shell-symbol important">$</span> <span class="token bash language-bash"><span class="token function">curl</span> <span class="token parameter variable">-s</span> <span class="token string">'http://en.wikipedia.org/wiki/List_of_countries_and_territories_by_border/area_ratio'</span> <span class="token operator">|</span></span></span><br /><span class="token output">> scrape -b -e 'table.wikitable > tr:not(:first-child)' | head<br /><!DOCTYPE html><br /><html><br /><body><br /><tr><br /><td>1</td><br /><td>Vatican City</td><br /><td>3.2</td><br /><td>0.44</td><br /><td>7.2727273</td><br /></tr></span></code></pre>
<p>The <code>-b</code> argument lets <code>scrape</code> enclose the output with <code><html></code> and
<code><body></code> tags, which is sometimes required by <code>xml2json</code> to correctly
convert the HTML to JSON.</p>
<h2>5. xml2json - convert XML to JSON</h2>
<p>As its name implies, <a href="https://github.com/parmentf/xml2json">xml2json</a>
takes XML (and HTML) as input and returns JSON as output. Therefore,
<code>xml2json</code> is a great liaison between <code>scrape</code> and <code>jq</code>.</p>
<pre class="language-shell-session"><code class="language-shell-session"><span class="token command"><span class="token shell-symbol important">$</span> <span class="token bash language-bash"><span class="token function">curl</span> <span class="token parameter variable">-s</span> <span class="token string">'http://en.wikipedia.org/wiki/List_of_countries_and_territories_by_border/area_ratio'</span> <span class="token operator">|</span></span></span><br /><span class="token output">> scrape -be 'table.wikitable > tr:not(:first-child)' |<br />> xml2json |<br />> jq -c '.html.body.tr[] | {country: .td[1][], border: .td[2][], surface: .td[3][], ratio: .td[4][]}' | head<br />{"ratio":"7.2727273","surface":"0.44","border":"3.2","country":"Vatican City"}<br />{"ratio":"2.2000000","surface":"2","border":"4.4","country":"Monaco"}<br />{"ratio":"0.6393443","surface":"61","border":"39","country":"San Marino"}<br />{"ratio":"0.4750000","surface":"160","border":"76","country":"Liechtenstein"}<br />{"ratio":"0.3000000","surface":"34","border":"10.2","country":"Sint Maarten (Netherlands)"}<br />{"ratio":"0.2570513","surface":"468","border":"120.3","country":"Andorra"}<br />{"ratio":"0.2000000","surface":"6","border":"1.2","country":"Gibraltar (United Kingdom)"}<br />{"ratio":"0.1888889","surface":"54","border":"10.2","country":"Saint Martin (France)"}<br />{"ratio":"0.1388244","surface":"2586","border":"359","country":"Luxembourg"}<br />{"ratio":"0.0749196","surface":"6220","border":"466","country":"Palestinian territories"}</span></code></pre>
<p>Of course this JSON data could then be piped into <code>json2csv</code> and so
forth.</p>
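<p>To see what the <code>xml2json</code> and <code>jq</code> steps accomplish together, here is a rough Python equivalent that turns table rows into records. It only handles well-formed markup, since it relies on the standard library XML parser, and it uses the single Vatican City row from the <code>scrape</code> output above as input. The function name <code>rows_to_records</code> is made up.</p>

```python
import json
import xml.etree.ElementTree as ET

# The first row of the scrape output shown earlier, wrapped in html and body.
html = """<html><body>
<tr><td>1</td><td>Vatican City</td><td>3.2</td><td>0.44</td><td>7.2727273</td></tr>
</body></html>"""


def rows_to_records(markup):
    """Map the cells of every table row to named fields."""
    root = ET.fromstring(markup)
    records = []
    for tr in root.iter("tr"):
        cells = [td.text for td in tr.findall("td")]
        # Cell 0 holds the rank, which the jq filter skips as well.
        records.append({
            "country": cells[1],
            "border": cells[2],
            "surface": cells[3],
            "ratio": cells[4],
        })
    return records


records = rows_to_records(html)
print(json.dumps(records[0]))
```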
<h2>6. sample - when you’re in debug mode</h2>
<p>The second tool I made is
<a href="https://github.com/jeroenjanssens/data-science-toolbox/blob/master/tools/sample">sample</a>.
(It’s based on two scripts in <a href="https://github.com/bitly/data_hacks">bitly’s
data_hacks</a>, which contains some
other tools worth checking out.) When you’re in the process of
formulating your data pipeline and you have a lot of data, then
debugging your pipeline can be cumbersome. In that case, <code>sample</code> might
be useful. The tool serves three purposes (which isn’t very Unix-minded,
but since it’s mostly useful when you’re in debug mode, that’s not such
a big deal).</p>
<p>The first purpose of <code>sample</code> is to get a subset of the data by
outputting only a certain percentage of the input on a line-by-line
basis. The second purpose is to add some delay to the output. This comes
in handy when the input is a constant stream (e.g., the Twitter
firehose), and the data comes in too fast to see what’s going on. The
third purpose is to run only for a certain time. The following
invocation illustrates all three purposes.</p>
<pre class="language-shell-session"><code class="language-shell-session"><span class="token command"><span class="token shell-symbol important">$</span> <span class="token bash language-bash"><span class="token function">seq</span> <span class="token number">10000</span> <span class="token operator">|</span> sample <span class="token parameter variable">-r</span> <span class="token number">20</span>% <span class="token parameter variable">-d</span> <span class="token number">1000</span> <span class="token parameter variable">-s</span> <span class="token number">5</span> <span class="token operator">|</span> jq <span class="token string">'{number: .}'</span></span></span></code></pre>
<p>This way, every input line has a 20% chance of being forwarded to <code>jq</code>.
Moreover, there is a 1000 millisecond delay between each line and after
five seconds <code>sample</code> will stop entirely. Please note that each argument
is optional. In order to prevent unnecessary computation, try to put
<code>sample</code> as early as possible in your pipeline (the same argument holds
for <code>head</code> and <code>tail</code>). Once you’re done debugging you can simply take
it out of the pipeline.</p>
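<p>To make the three purposes concrete, here is a minimal Python sketch of the same idea. It is not the actual implementation of <code>sample</code>; the function name <code>sample_lines</code> and its parameters are made up for illustration, and it assumes that the delay applies to emitted lines only.</p>

```python
import random
import time
from typing import Iterable, Iterator, Optional


def sample_lines(
    lines: Iterable[str],
    rate: float = 1.0,
    delay_ms: int = 0,
    max_seconds: Optional[float] = None,
    rng: Optional[random.Random] = None,
) -> Iterator[str]:
    """Yield each line with probability `rate`.

    Hypothetical stand-in for the three options described above:
    rate (-r), delay between emitted lines (-d), and time limit (-s).
    """
    rng = rng or random.Random()
    start = time.monotonic()
    for line in lines:
        # Purpose 3: stop entirely once the time budget is used up.
        if max_seconds is not None and time.monotonic() - start >= max_seconds:
            break
        # Purpose 1: forward only a random subset of the input.
        if rng.random() < rate:
            yield line
            # Purpose 2: slow the output down so a human can follow it.
            if delay_ms:
                time.sleep(delay_ms / 1000)
```

<p>Calling <code>sample_lines(sys.stdin, rate=0.2, delay_ms=1000, max_seconds=5)</code> would roughly mirror the invocation above. Because every argument has a default, each one is optional here as well.</p>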
<h2>7. Rio - making R part of the pipeline</h2>
<p>This post wouldn’t be complete without some R. It’s not straightforward
to make R/Rscript part of the pipeline since they don’t work with stdin
and stdout out of the box. Therefore, as a proof of concept, I put
together a bash script called
<a href="https://github.com/jeroenjanssens/data-science-toolbox/blob/master/tools/Rio">Rio</a>.</p>
<p><code>Rio</code> works as follows. First, it redirects the CSV provided on stdin
to a temporary file and lets R read that into a data frame <code>df</code>. Second,
the specified commands in the <code>-e</code> option are executed. Third, the
output of the last command is redirected to stdout. Allow me to
demonstrate three one-liners that use the Iris dataset (don’t mind the
url).</p>
<p>Display the five-number summary (plus the mean) of each field.</p>
<pre class="language-shell-session"><code class="language-shell-session"><span class="token command"><span class="token shell-symbol important">$</span> <span class="token bash language-bash"><span class="token function">curl</span> <span class="token parameter variable">-s</span> <span class="token string">'https://raw.github.com/pydata/pandas/master/pandas/tests/data/iris.csv'</span> <span class="token operator">></span> iris.csv</span></span><br /><span class="token command"><span class="token shell-symbol important">$</span> <span class="token bash language-bash"><span class="token operator"><</span> iris.csv Rio <span class="token parameter variable">-e</span> <span class="token string">'summary(df)'</span></span></span><br /><span class="token output"> SepalLength SepalWidth PetalLength PetalWidth <br /> Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100 <br /> 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300 <br /> Median :5.800 Median :3.000 Median :4.350 Median :1.300 <br /> Mean :5.843 Mean :3.054 Mean :3.759 Mean :1.199 <br /> 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800 <br /> Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500 <br /> Name <br /> Length:150 <br /> Class :character <br /> Mode :character </span></code></pre>
<p>If you specify the <code>-s</code> option, the <code>sqldf</code> package will be imported. In
case the output is a data frame, CSV will be written to stdout. This
enables you to further process that data using other tools.</p>
<pre class="language-shell-session"><code class="language-shell-session"><span class="token command"><span class="token shell-symbol important">$</span> <span class="token bash language-bash"><span class="token operator"><</span> iris.csv Rio <span class="token parameter variable">-se</span> <span class="token string">'sqldf("select * from df where df.SepalLength > 7.5")'</span> <span class="token operator">|</span> csvlook</span></span><br /><span class="token output">|--------------+------------+-------------+------------+-----------------|<br />| SepalLength | SepalWidth | PetalLength | PetalWidth | Name |<br />|--------------+------------+-------------+------------+-----------------|<br />| 7.6 | 3 | 6.6 | 2.1 | Iris-virginica |<br />| 7.7 | 3.8 | 6.7 | 2.2 | Iris-virginica |<br />| 7.7 | 2.6 | 6.9 | 2.3 | Iris-virginica |<br />| 7.7 | 2.8 | 6.7 | 2 | Iris-virginica |<br />| 7.9 | 3.8 | 6.4 | 2 | Iris-virginica |<br />| 7.7 | 3 | 6.1 | 2.3 | Iris-virginica |<br />|--------------+------------+-------------+------------+-----------------|</span></code></pre>
<p>If you specify the <code>-g</code> option, <code>ggplot2</code> gets imported and a ggplot
object called <code>g</code> with <code>df</code> as the data is initialised. If the final
output is a ggplot object, a PNG will be written to stdout.</p>
<pre class="language-shell-session"><code class="language-shell-session"><span class="token command"><span class="token shell-symbol important">$</span> <span class="token bash language-bash"><span class="token operator"><</span> iris.csv Rio <span class="token parameter variable">-ge</span> <span class="token string">'g + geom_point(aes(x = SepalLength, y = SepalWidth, colour = Name))'</span> <span class="token operator">></span> iris.png</span></span></code></pre>
<figure>
<a href="https://jeroenjanssens.com/img/iris.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.6)" srcset="https://jeroenjanssens.com/img/jw_pSrlYKV-201.webp 201w, https://jeroenjanssens.com/img/jw_pSrlYKV-302.webp 302w, https://jeroenjanssens.com/img/jw_pSrlYKV-403.webp 403w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.6)" srcset="https://jeroenjanssens.com/img/jw_pSrlYKV-201.webp 201w, https://jeroenjanssens.com/img/jw_pSrlYKV-302.webp 302w, https://jeroenjanssens.com/img/jw_pSrlYKV-403.webp 403w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.6)" srcset="https://jeroenjanssens.com/img/jw_pSrlYKV-201.jpeg 201w, https://jeroenjanssens.com/img/jw_pSrlYKV-302.jpeg 302w, https://jeroenjanssens.com/img/jw_pSrlYKV-403.jpeg 403w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.6)" srcset="https://jeroenjanssens.com/img/jw_pSrlYKV-201.jpeg 201w, https://jeroenjanssens.com/img/jw_pSrlYKV-302.jpeg 302w, https://jeroenjanssens.com/img/jw_pSrlYKV-403.jpeg 403w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 60%;" src="https://jeroenjanssens.com/img/jw_pSrlYKV-201.jpeg" alt="" loading="lazy" />
</picture></a>
<figcaption></figcaption>
</figure>
<p>I made this tool so that I could take advantage of the power of R on the
command-line. Of course it has its limits, but at least there’s no need
to learn <a href="http://www.gnuplot.info/">gnuplot</a> any more.</p>
<h2>Command-line tools suggested by others</h2>
<p>Below is an uncurated list of tools and repositories that others have
suggested via Twitter or <a href="https://news.ycombinator.com/item?id=6412190">Hacker
News</a> (last updated on
23-09-2013 07:15 EST). Thanks everybody.</p>
<ul>
<li><a href="http://bigmler.readthedocs.org/en/latest/">BigMLer</a> by
<a href="https://news.ycombinator.com/user?id=aficionado">aficionado</a></li>
<li><a href="https://code.google.com/p/crush-tools/">crush-tools</a> by
<a href="https://news.ycombinator.com/user?id=mjn">mjn</a></li>
<li><a href="https://github.com/dergachev/csv2sqlite">csv2sqlite</a> by
<a href="https://news.ycombinator.com/user?id=dergachev">dergachev</a></li>
<li><a href="https://github.com/dbro/csvquote">csvquote</a> by
<a href="https://news.ycombinator.com/user?id=susi22">susi22</a></li>
<li><a href="https://github.com/clarkgrubb/data-tools">data-tools repository</a> by
<a href="https://news.ycombinator.com/user?id=cgrubb">cgrubb</a></li>
<li><a href="https://github.com/dkogan/feedgnuplot">feedgnuplot</a> by
<a href="https://news.ycombinator.com/user?id=dima55">dima55</a></li>
<li><a href="https://github.com/cgutteridge/Grinder/tree/master/bin">Grinder
repository</a> by
<a href="https://twitter.com/cgutteridge">@cgutteridge</a></li>
<li><a href="http://www.hdfgroup.org/HDF5/doc/RM/Tools.html">HDF5 Tools</a> by
<a href="https://news.ycombinator.com/user?id=susi22">susi22</a></li>
<li><a href="http://code.google.com/p/littler/">littler</a> by
<a href="https://twitter.com/eddelbuettel">@eddelbuettel</a></li>
<li><a href="http://gibrown.wordpress.com/2013/01/26/unix-bi-grams-tri-grams-and-topic-modeling/">mallet</a>
by <a href="https://news.ycombinator.com/user?id=gibrown">gibrown</a></li>
<li><a href="https://github.com/benbernard/RecordStream">RecordStream</a> by
<a href="https://news.ycombinator.com/user?id=revertts">revertts</a></li>
<li><a href="https://github.com/paulgb/subsample">subsample</a> by
<a href="https://news.ycombinator.com/user?id=paulgb">paulgb</a></li>
<li><a href="http://search.cpan.org/~ken/xls2csv-1.07/script/xls2csv">xls2csv</a> by
<a href="https://twitter.com/sheeshee">@sheeshee</a></li>
<li><a href="http://xmlstar.sourceforge.net/">XMLStarlet</a> by
<a href="https://news.ycombinator.com/user?id=gav">gav</a></li>
</ul>
<h2>Conclusion</h2>
<p>I have shown you seven command-line tools that I use in my daily work as
a data scientist. While each tool is useful in its own way, I often find
myself combining them with, or just resorting to, the classics such as
<code>grep</code>, <code>sed</code>, and <code>awk</code>. Combining such small tools into a larger
pipeline is what makes them really powerful.</p>
<p>I’m curious to hear what you think about this list and what command-line
tools you like to use. Also, if you’ve made any tools yourself, you’re
more than welcome to add them to this <a href="https://github.com/jeroenjanssens/data-science-toolbox">data science
toolbox</a>.</p>
<p>Don’t worry if you don’t regard yourself as a toolmaker. The next time
you’re cooking up that exotic pipeline, consider putting it in a file,
add a <a href="http://en.wikipedia.org/wiki/Shebang_%28Unix%29">shebang</a>,
parametrise it with some <code>$1</code>s and <code>$2</code>s, and <code>chmod +x</code> it. That’s all
there is to it. Who knows, you might even become interested in applying
the <a href="http://www.faqs.org/docs/artu/ch01s06.html">Unix philosophy</a>.</p>
<p>While the power of the command-line should not be underestimated when it
comes to Obtaining, Scrubbing, and Exploring data, it can only get you
so far. When you’re ready to do some more serious Exploring, Modelling,
and iNterpretation of your data, you’re probably better off continuing
your work in a statistical computing environment, such as
<a href="http://www.r-project.org/">R</a> or <a href="https://jupyter.org/">Jupyter
Notebook</a>+<a href="https://pandas.pydata.org/">pandas</a>.</p>
<p>— Jeroen</p>
Bayesian A/B Testing Headlines2013-08-18T00:00:00Zhttps://jeroenjanssens.com/testing/<p>The Visual Revenue platform provides a number of great tools that
support editors to optimize their front page. Instant Headline Testing
is one of those tools. The quality of a story headline greatly
influences its click-through-rate (CTR). Front page editors therefore
spend a lot of thought coming up with the right wording to engage their
readers. But on digital media, headlines do not have to be set in stone.
Instant Headline Testing gives the editor the opportunity to improve the
quality of a headline after it has made the front page. Let me give you
an example.</p>
<h3>A sporty example</h3>
<p>With Super Bowl XLVII (and its power outage) still fresh in our minds,
one of our clients, <a href="http://www.usatoday.com/sports/">USA Today Sports</a>,
used our platform to conduct the following headline test:</p>
<p><strong>Headline A: “What Harbaugh regrets about Super Bowl” (3.06% CTR)</strong></p>
<p><strong>Headline B: “John Harbaugh explains Super Bowl tirade” (4.93% CTR)</strong></p>
<p>Headline A (the original headline) got 3.06% CTR and headline B got
4.93% CTR, which are both strong CTRs. After only seven minutes of
testing the two headlines, headline B had been declared winner with
99.93% certainty (explained later). Subsequently, the winning headline
was served to 100% of the audience for over one hour. Finally, the
editor even made the change permanent in their CMS. Note that by
changing a few words only, a 61% lift had been achieved, which
eventually resulted in tens of thousands more views for <a href="http://www.usatoday.com/story/sports/nfl/2013/02/04/ravens-john-harbaugh-super-bowl-jim-harbaugh-49ers/1890387/">that
article</a>!</p>
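<p>The 61% figure follows directly from the two CTRs. As a quick check, with a helper name (<code>relative_lift</code>) that I made up for this example:</p>

```python
def relative_lift(ctr_a: float, ctr_b: float) -> float:
    """Relative improvement of B over A, expressed as a fraction of A's CTR."""
    return (ctr_b - ctr_a) / ctr_a


# CTRs of headline A (original) and headline B from the test above.
lift = relative_lift(0.0306, 0.0493)
print(f"{lift:.0%}")  # 61%
```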
<h3>Four challenges for instant headline testing</h3>
<p>Instant Headline Testing is essentially <a href="http://www.alistapart.com/articles/a-primer-on-a-b-testing/">A/B
testing</a>
for story headlines. However, there are four challenges when it comes to
Instant Headline Testing that we need to take into account.</p>
<ol>
<li>Headlines may be on the front page for only a couple of hours, so a
headline test cannot take too long.</li>
<li>The number of readers varies greatly per front page.</li>
<li>The CTR of a headline depends on where it is positioned on the front
page. For example, a headline positioned at the hero spot has a much
higher CTR than one positioned at the bottom.</li>
<li>Front pages are dynamic, so headlines can change position.</li>
</ol>
<h3>Frequentist approach to instant headline testing</h3>
<p>Our improved implementation overcomes these four challenges by using a
Bayesian approach. Before I explain that, I’ll first discuss the
frequentist approach to Instant Headline Testing.</p>
<p>To conclude which headline is better than the other, we cannot just look
at the highest CTR. We need to apply statistics in order to make sure that
the difference in CTR did not happen by chance. It may be the case that
we cannot declare a headline as winner at all.</p>
<p>We apply statistics to the data that we have collected during the
headline test. This data includes front page impressions (i.e., views)
and clicks. The more data the better, right? Well, not quite. It’s
important to realize that the longer a headline test is running, the
longer we are serving one headline of possibly lesser quality to 50% of
the readers. This means that the corresponding article may lose out on
value. In decision theory, the difference between the actual and
potential article impressions is known as
<a href="http://en.wikipedia.org/wiki/Regret_(decision_theory)">regret</a>.</p>
<p>So, on the one hand, we want to collect as much data as possible in
order to make a reliable conclusion, while on the other hand, we want to
maximize article impressions, i.e., minimize regret. This raises two
questions: (1) When can we stop a headline test? and (2) How do we know
that one headline is better than the other?</p>
<p>Within statistics, the frequentist approach and the Bayesian approach
are two well-known approaches when it comes to A/B testing. The
frequentist approach provides a <a href="http://visualwebsiteoptimizer.com/ab-split-significance-calculator/">statistical
test</a>
of whether the CTRs of the two headlines are <a href="http://en.wikipedia.org/wiki/Statistical_significance">significantly
different</a>. That
would answer our second question.</p>
<p>The frequentist approach doesn’t provide an easy answer to the first
question because the statistical test assumes that the number of views
is fixed before we start a headline test. Furthermore, we cannot run a
headline test until we see a significant difference between CTRs as this
would falsely increase the probability of obtaining a significant
result, as <a href="http://www.evanmiller.org/how-not-to-run-an-ab-test.html">Evan Miller explains on his
blog</a>. We
would have to estimate how many views we would need in order to obtain a
significant difference.</p>
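<p>The "peeking" problem can be reproduced with a small simulation. The sketch below is not part of our platform. It assumes a pooled two-proportion z-test and two headlines with the same true CTR, so every "significant" result is a false positive. All function names and default values are hypothetical.</p>

```python
import math
import random
from statistics import NormalDist


def p_value(clicks_a, views_a, clicks_b, views_b):
    """Two-sided p-value of a pooled two-proportion z-test."""
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    if se == 0:
        return 1.0  # no clicks (or only clicks) so far: nothing to compare
    z = (clicks_a / views_a - clicks_b / views_b) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def false_positive_rates(trials=200, views=2000, peek_every=100,
                         ctr=0.05, alpha=0.05, seed=1):
    """Simulate A/A tests and compare two stopping rules.

    Returns (rate when stopping at the first significant peek,
             rate when testing once after a fixed number of views).
    """
    rng = random.Random(seed)
    peeking_hits = fixed_hits = 0
    for _ in range(trials):
        clicks_a = clicks_b = 0
        significant_at_some_peek = False
        for n in range(1, views + 1):
            clicks_a += rng.random() < ctr
            clicks_b += rng.random() < ctr
            if n % peek_every == 0 and p_value(clicks_a, n, clicks_b, n) < alpha:
                significant_at_some_peek = True  # a peeker would stop here
        peeking_hits += significant_at_some_peek
        fixed_hits += p_value(clicks_a, views, clicks_b, views) < alpha
    return peeking_hits / trials, fixed_hits / trials


peeking, fixed = false_positive_rates()
print(f"peeking: {peeking:.2f}, fixed sample size: {fixed:.2f}")
```

<p>Since the final look is one of the peeks, the peeking rate can never be lower than the fixed-sample rate, and in practice it tends to be noticeably higher.</p>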
<p>And this is where the four challenges come into play. Due to challenge
1, the headline test cannot take too long, say 20 minutes at most, which
limits the number of views we can get. Because of challenge 2, the views
per minute may be anything between 10 and 10,000, and we do need to have
a tool that’s usable by all our front page editors. Challenge 3
determines the CTRs of the headlines as well. When the CTRs are closer
together, we need more views in order to obtain a statistically
significant difference. Including these three challenges when estimating
the desired number of views is not straightforward. On top of that, when
a headline changes position during a test (which is the fourth
challenge), our estimate becomes completely invalid!</p>
<p>Let’s have a look at a Bayesian approach to Instant Headline Testing,
which is one that I find much more straightforward and elegant.</p>
<h3>Bayesian approach to instant headline testing</h3>
<p>Whereas the frequentist approach assumes that the “true” CTRs remain the
same, all that the Bayesian approach cares about is the data we have
actually observed. So, we don’t need to worry about estimating the
required number of views beforehand. Moreover, the Bayesian approach
doesn’t mind that headlines change position while testing, so that
overcomes challenge 4.</p>
<p>Below I’ll first explain how the Bayesian approach determines when to
stop a headline test (which was our first question). The second
question, determining whether one headline is better than the other, is
discussed in the next section.</p>
<p>During the headline test we keep track of the number of views and the number of
clicks for headlines A and B. What matters here is the absolute difference
between the number of clicks of both headlines. When the absolute
difference crosses the so-called Anscombe boundary, we stop the headline
test.</p>
<p>The Anscombe boundary is defined in terms of (1) the number of views so
far (denoted by $n$) and (2) the number of views for which we will be serving the
winning headline (denoted by $k$). Stated more formally, we stop when
the following inequality is true:</p>
<p>$$ |y| \gt -\Phi^{-1}\left(\frac{n}{k + 2n}\right) \sqrt{n} $$</p>
<p>where $y$ is the difference between the number of clicks for
both headlines, so that $|y|$ is the absolute difference. The shape of the Anscombe boundary is shown in the
figure below. Here, we assumed that the front page gets 1,000 views per
minute (VPM). So, after 20 minutes of testing, the front page has been
viewed 20,000 times. The figure also shows the absolute difference for
five simulated headline tests (CTR A=5%, CTR B=3%), which are denoted by
gray lines.</p>
<figure>
<a href="https://jeroenjanssens.com/img/vr-ht-anscombe.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/h9s9DzhJPO-302.webp 302w, https://jeroenjanssens.com/img/h9s9DzhJPO-453.webp 453w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/h9s9DzhJPO-302.webp 302w, https://jeroenjanssens.com/img/h9s9DzhJPO-453.webp 453w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/h9s9DzhJPO-302.jpeg 302w, https://jeroenjanssens.com/img/h9s9DzhJPO-453.jpeg 453w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/h9s9DzhJPO-302.jpeg 302w, https://jeroenjanssens.com/img/h9s9DzhJPO-453.jpeg 453w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 90%;" src="https://jeroenjanssens.com/img/h9s9DzhJPO-302.jpeg" alt="Anscombe Boundary" loading="lazy" />
</picture></a>
<figcaption>Anscombe Boundary</figcaption>
</figure>
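<p>The stopping rule can be sketched in a few lines of Python using only the standard library. This is an illustration of the inequality above and not our production code. The function names are hypothetical, and the example numbers (1,000 views so far, 19,000 still to come) are made up.</p>

```python
import math
from statistics import NormalDist


def anscombe_boundary(n: int, k: int) -> float:
    """Right-hand side of the inequality: -Phi^{-1}(n / (k + 2n)) * sqrt(n).

    n: number of views so far
    k: number of views for which the winning headline will be served
    """
    if n <= 0 or k <= 0:
        raise ValueError("n and k must be positive")
    # n / (k + 2n) is below 0.5, so the inverse CDF is negative
    # and the boundary itself is positive.
    return -NormalDist().inv_cdf(n / (k + 2 * n)) * math.sqrt(n)


def should_stop(clicks_a: int, clicks_b: int, n: int, k: int) -> bool:
    """True once the click difference |y| exceeds the Anscombe boundary."""
    return abs(clicks_a - clicks_b) > anscombe_boundary(n, k)


# Hypothetical numbers, for illustration only.
print(anscombe_boundary(n=1_000, k=19_000))
print(should_stop(clicks_a=100, clicks_b=20, n=1_000, k=19_000))
```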
<p>The figure illustrates that the Anscombe boundary takes the trade-off
between the number of views so far, and the number of views after
conclusion into account. In a recent blog post, Aaron Goodman performed
an <a href="http://blog.custora.com/2012/05/a-bayesian-approach-to-ab-testing/">interesting
comparison</a>
between the frequentist, multiple testing, and Bayesian approaches. He
demonstrated that the Bayesian approach is best at minimizing regret,
or, in other words, maximizing article views.</p>
<p>Again, once the absolute difference passes the Anscombe boundary, we are
ready to conclude which headline was better.</p>
<h3>Certainty of conclusion</h3>
<p>It would be nice to know with how much certainty we can conclude that one
headline is better than the other.</p>
<p>The certainty associated with the CTRs of Headlines A and B can be
modeled by a beta distribution. The beta distribution can be defined in
terms of the number of views and CTR as follows:</p>
<p>$$ \beta\left(\textrm{views} \times \textrm{CTR}, \textrm{views} \times
\left(1-\textrm{CTR}\right)\right) $$</p>
<p>The figure below shows how a beta distribution becomes more peaked as
the number of views increases (while keeping the CTR at 20%). It
illustrates that as we collect more evidence (i.e., views) our
uncertainty about the CTR decreases.</p>
<figure>
<a href="https://jeroenjanssens.com/img/vr-ht-beta-views-ctr.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/pnmPDtOMsy-302.webp 302w, https://jeroenjanssens.com/img/pnmPDtOMsy-453.webp 453w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/pnmPDtOMsy-302.webp 302w, https://jeroenjanssens.com/img/pnmPDtOMsy-453.webp 453w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/pnmPDtOMsy-302.jpeg 302w, https://jeroenjanssens.com/img/pnmPDtOMsy-453.jpeg 453w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/pnmPDtOMsy-302.jpeg 302w, https://jeroenjanssens.com/img/pnmPDtOMsy-453.jpeg 453w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 90%;" src="https://jeroenjanssens.com/img/pnmPDtOMsy-302.jpeg" alt="Parameterizing a beta distribution with views and CTR" loading="lazy" />
</picture></a>
<figcaption>Parameterizing a beta distribution with views and CTR</figcaption>
</figure>
<p>The gray lines may give the impression that the peakedness increases
linearly with the number of views, but please note that these lines
represent the number of views on a log scale. Allow me to clarify this
with the following image, which shows that the interval in which 95% of
the probability density is located, i.e., the 95% credibility interval
as it is called in Bayesian statistics, narrows quickly at first and
then ever more slowly (roughly in proportion to the inverse square root
of the number of views).</p>
<figure>
<a href="https://jeroenjanssens.com/img/vr-ht-credibility-interval.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/Tub0InSBS9-302.webp 302w, https://jeroenjanssens.com/img/Tub0InSBS9-453.webp 453w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/Tub0InSBS9-302.webp 302w, https://jeroenjanssens.com/img/Tub0InSBS9-453.webp 453w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/Tub0InSBS9-302.jpeg 302w, https://jeroenjanssens.com/img/Tub0InSBS9-453.jpeg 453w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/Tub0InSBS9-302.jpeg 302w, https://jeroenjanssens.com/img/Tub0InSBS9-453.jpeg 453w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 90%;" src="https://jeroenjanssens.com/img/Tub0InSBS9-302.jpeg" alt="Credibility Interval" loading="lazy" />
</picture></a>
<figcaption>Credibility Interval</figcaption>
</figure>
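<p>Both the parameterization and the credibility interval are easy to approximate with Python's standard library. The sketch below estimates the interval by sampling rather than computing it exactly. The function names, the number of draws, and the view counts are assumptions chosen for illustration.</p>

```python
import random


def beta_parameters(views, ctr):
    """Map views and CTR to the two shape parameters of the beta distribution."""
    return views * ctr, views * (1 - ctr)


def credible_interval(views, ctr, mass=0.95, draws=20_000, seed=0):
    """Approximate the central credibility interval by sampling."""
    alpha, beta = beta_parameters(views, ctr)
    rng = random.Random(seed)
    samples = sorted(rng.betavariate(alpha, beta) for _ in range(draws))
    tail = (1 - mass) / 2
    lower = samples[int(tail * draws)]
    upper = samples[int((1 - tail) * draws) - 1]
    return lower, upper


# Same CTR (20%) as in the figure, with a growing number of views.
for views in (100, 1_000, 10_000):
    lower, upper = credible_interval(views, ctr=0.20)
    print(f"{views:>6} views: [{lower:.3f}, {upper:.3f}] width={upper - lower:.3f}")
```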
<p>The amount of overlap between A’s beta distribution and B’s beta
distribution determines the certainty of our conclusion. We can estimate
this certainty by generating a random value (i.e., drawing a random
sample) from both beta distributions and noting which one is higher.
If we repeat this, say, a million times, we can accurately estimate the
probability that B is better than A. This probability is the certainty
with which we can declare headline B as the true winner.<sup class="footnote-ref"><a href="https://jeroenjanssens.com/testing/#fn1" id="fnref1">[1]</a></sup></p>
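<p>The sampling procedure described above can be sketched as follows. The function name is hypothetical, and the view counts in the example are invented, since only the CTRs of the Super Bowl test are given in this post.</p>

```python
import random


def probability_b_beats_a(views_a, ctr_a, views_b, ctr_b,
                          draws=100_000, seed=0):
    """Monte Carlo estimate of P(CTR of B > CTR of A)."""
    rng = random.Random(seed)
    # Beta parameters: (views * CTR, views * (1 - CTR)) for each headline.
    a_alpha, a_beta = views_a * ctr_a, views_a * (1 - ctr_a)
    b_alpha, b_beta = views_b * ctr_b, views_b * (1 - ctr_b)
    wins = sum(
        rng.betavariate(b_alpha, b_beta) > rng.betavariate(a_alpha, a_beta)
        for _ in range(draws)
    )
    return wins / draws


# CTRs from the example above; 3,500 views per headline is an assumption.
certainty = probability_b_beats_a(3_500, 0.0306, 3_500, 0.0493)
print(f"P(B better than A) is approximately {certainty:.4f}")
```

<p>With fewer draws the estimate becomes noisier, which is why a large number such as a million is suggested above.</p>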
<p>Below are two figures that show how long it takes to conclude and the
amount of certainty associated with that conclusion. In the first figure,
the CTRs of both headlines are kept constant (CTR A=5%, CTR B=3%), and
the views per minute (VPM) varies from 10 to 10,000. In the second
figure, the VPM (=1000) and the CTR of headline A (=5%) are kept
constant and the CTR of headline B varies from 0% to 10%.</p>
<figure>
<a href="https://jeroenjanssens.com/img/vr-ht-time-to-conclude.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/88tiJ7CgoP-302.webp 302w, https://jeroenjanssens.com/img/88tiJ7CgoP-453.webp 453w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/88tiJ7CgoP-302.webp 302w, https://jeroenjanssens.com/img/88tiJ7CgoP-453.webp 453w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/88tiJ7CgoP-302.jpeg 302w, https://jeroenjanssens.com/img/88tiJ7CgoP-453.jpeg 453w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/88tiJ7CgoP-302.jpeg 302w, https://jeroenjanssens.com/img/88tiJ7CgoP-453.jpeg 453w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 90%;" src="https://jeroenjanssens.com/img/88tiJ7CgoP-302.jpeg" alt="Time to conclude" loading="lazy" />
</picture></a>
<figcaption>Time to conclude</figcaption>
</figure>
<figure>
<a href="https://jeroenjanssens.com/img/vr-ht-certainty.png">
<picture>
<source type="image/webp" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/WXiwFZYBDg-302.webp 302w, https://jeroenjanssens.com/img/WXiwFZYBDg-453.webp 453w" media="(min-width: 640px)" /><source type="image/webp" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/WXiwFZYBDg-302.webp 302w, https://jeroenjanssens.com/img/WXiwFZYBDg-453.webp 453w" media="(min-width: 0px)" /><source type="image/jpeg" sizes="calc(42rem * 0.9)" srcset="https://jeroenjanssens.com/img/WXiwFZYBDg-302.jpeg 302w, https://jeroenjanssens.com/img/WXiwFZYBDg-453.jpeg 453w" media="(min-width: 640px)" /><source type="image/jpeg" sizes="calc((100vw - 4rem) * 0.9)" srcset="https://jeroenjanssens.com/img/WXiwFZYBDg-302.jpeg 302w, https://jeroenjanssens.com/img/WXiwFZYBDg-453.jpeg 453w" media="(min-width: 0px)" />
<img class="mx-auto " style="width: 90%;" src="https://jeroenjanssens.com/img/WXiwFZYBDg-302.jpeg" alt="Certainty of conclusion" loading="lazy" />
</picture></a>
<figcaption>Certainty of conclusion</figcaption>
</figure>
<p>In case of sufficient certainty, we declare the headline with the
highest CTR as the winner of the headline test. After that, the editor
may decide to continue to serve the winning headline to 100% of the
readers.</p>
<p>The serving aspect is actually taken care of by our platform and doesn’t
require any additional integration, but that’s material for a blog post
that one of our fine front-end engineers should write!</p>
<h3>Conclusions</h3>
<p>Visual Revenue’s Instant Headline Testing tool enables editors to easily A/B
test different headlines and to quickly see whether the quality
has improved. The frequentist approach, which is often used for A/B
tests, is unable to overcome the challenges that are associated with A/B
testing headlines. The Bayesian approach, however, offers the
flexibility that front pages require.</p>
<p>I have explained how the Bayesian approach, using the Anscombe boundary,
determines when to stop a headline test. I also discussed how we can
compute the certainty associated with concluding that one headline is
better than the other.</p>
<p>If you would like to play with beta distributions and compute the
associated certainty, Peak Conversion provides a nice <a href="http://www.peakconversion.com/2012/02/ab-split-test-graphical-calculator/">graphical
calculator</a>.</p>
<p>If I still haven’t convinced you that the Bayesian approach is the way
to go, then you may want to have a look at: <em>“So you want to run an
experiment, now what? Some Simple Rules of Thumb for Optimal
Experimental Design.”</em> by John List, Sally Sadoff, and Mathis Wagner.</p>
<p>Alternatively, the frequentist approach provides methods that allow for
setting up “checkpoints” where you may determine whether you want to
stop a headline test. In other words, these methods offer <a href="http://en.wikipedia.org/wiki/Multiple_comparisons">correction
for multiple
testing</a>.</p>
<p>– Jeroen</p>
<hr class="footnotes-sep" />
<section class="footnotes">
<ol class="footnotes-list">
<li id="fn1" class="footnote-item"><p>Instead of estimating the certainty, it can also be computed in
closed form <a href="http://www.johndcook.com/blog/2012/10/11/beta-inequalities-in-r/">(see this post from John Cook’s
blog)</a>,
but the <a href="http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.beta.html">stats.beta.pdf function from
scipy</a>
doesn’t like very peaked beta distributions. <a href="https://jeroenjanssens.com/testing/#fnref1" class="footnote-backref">↩︎</a></p>
</li>
</ol>
</section>
Quickly Navigate your Filesystem from the Command Line2013-08-16T00:00:00Zhttps://jeroenjanssens.com/navigate/<p><em>Update (6-9-2013): This code is now also available as the “jump” plugin
in <a href="https://github.com/robbyrussell/oh-my-zsh">oh-my-zsh</a>.</em></p>
<p><em>Update (18-8-2013): Thanks to the many useful suggestions in <a href="https://news.ycombinator.com/item?id=6229001">the
discussion on Hacker
News</a>, I have added (1)
quotes to the code, (2) a section about tab completion, and (3) a note
for macOS users.</em></p>
<p>Like many others, I spend most of my day behind a computer. In order
to make the most of it (and to keep my body from complaining too much), I
try to maintain an optimised setup. For example, I code in
<a href="http://en.wikipedia.org/wiki/Vim_(text_editor)">Vim</a>, browse with
<a href="http://www.vimperator.org/vimperator">Vimperator</a>, and move windows
around in <a href="http://i3wm.org/">i3</a>. Another common task is filesystem
navigation. I prefer to use the command-line for this, but typing
<code>cd ~/some/very/deep/often-used/directory</code> over and over again does
become cumbersome.</p>
<p>Automated tools like
<a href="https://github.com/joelthelion/autojump">autojump</a>,
<a href="https://github.com/rupa/z">z</a>, and <a href="https://github.com/clvv/fasd">fasd</a>
address this problem by offering shortcuts to the directories you often
go to. Personally, I prefer a more manual solution, which I would like
to share with you. I have noticed quite an increase in efficiency with
this, and perhaps you will too.</p>
<h2>Jumping with symbolic links</h2>
<p>Under the hood this manual solution comes down to storing symbolic links
in a hidden directory (e.g., <code>~/.marks</code>). There are four shell functions
<code>jump</code>, <code>mark</code>, <code>unmark</code>, and <code>marks</code>, and they look like this:</p>
<pre class="language-bash"><code class="language-bash"><span class="token builtin class-name">export</span> <span class="token assign-left variable">MARKPATH</span><span class="token operator">=</span><span class="token environment constant">$HOME</span>/.marks<br /><span class="token keyword">function</span> <span class="token function-name function">jump</span> <span class="token punctuation">{</span> <br /> <span class="token builtin class-name">cd</span> <span class="token parameter variable">-P</span> <span class="token string">"<span class="token variable">$MARKPATH</span>/<span class="token variable">$1</span>"</span> <span class="token operator"><span class="token file-descriptor important">2</span>></span>/dev/null <span class="token operator">||</span> <span class="token builtin class-name">echo</span> <span class="token string">"No such mark: <span class="token variable">$1</span>"</span><br /><span class="token punctuation">}</span><br /><span class="token keyword">function</span> <span class="token function-name function">mark</span> <span class="token punctuation">{</span> <br /> <span class="token function">mkdir</span> <span class="token parameter variable">-p</span> <span class="token string">"<span class="token variable">$MARKPATH</span>"</span><span class="token punctuation">;</span> <span class="token function">ln</span> <span class="token parameter variable">-s</span> <span class="token string">"<span class="token variable"><span class="token variable">$(</span><span class="token builtin class-name">pwd</span><span class="token variable">)</span></span>"</span> <span class="token string">"<span class="token variable">$MARKPATH</span>/<span class="token variable">$1</span>"</span><br /><span class="token punctuation">}</span><br /><span class="token keyword">function</span> <span class="token function-name function">unmark</span> <span class="token punctuation">{</span> <br /> <span class="token function">rm</span> <span class="token parameter 
variable">-i</span> <span class="token string">"<span class="token variable">$MARKPATH</span>/<span class="token variable">$1</span>"</span><br /><span class="token punctuation">}</span><br /><span class="token keyword">function</span> <span class="token function-name function">marks</span> <span class="token punctuation">{</span><br /> <span class="token function">ls</span> <span class="token parameter variable">-l</span> <span class="token string">"<span class="token variable">$MARKPATH</span>"</span> <span class="token operator">|</span> <span class="token function">sed</span> <span class="token string">'s/  / /g'</span> <span class="token operator">|</span> <span class="token function">cut</span> -d<span class="token string">' '</span> -f9- <span class="token operator">|</span> <span class="token function">sed</span> <span class="token string">'s/ -/\t-/g'</span> <span class="token operator">&&</span> <span class="token builtin class-name">echo</span><br /><span class="token punctuation">}</span></code></pre>
<p>Put this in your <code>.zshrc</code> or <code>.bashrc</code> and you’re ready to jump (macOS
users need a slightly different version of the <code>marks</code> function; see
below). I have also turned this into a plugin for
<a href="https://github.com/robbyrussell/oh-my-zsh">oh-my-zsh</a> called <code>jump</code>. To
add a new bookmark, <code>cd</code> into the directory and <code>mark</code> it with a name of
your choosing:</p>
<pre class="language-shell-session"><code class="language-shell-session"><span class="token command"><span class="token shell-symbol important">$</span> <span class="token bash language-bash"><span class="token builtin class-name">cd</span> ~/some/very/deep/often-used/directory</span></span><br /><span class="token command"><span class="token shell-symbol important">$</span> <span class="token bash language-bash">mark deep</span></span></code></pre>
<p>This creates a symbolic link named <code>deep</code> inside the directory <code>~/.marks</code>,
pointing at the current directory. To <code>jump</code> to this directory, type the
following from anywhere in the filesystem:</p>
<pre class="language-shell-session"><code class="language-shell-session"><span class="token command"><span class="token shell-symbol important">$</span> <span class="token bash language-bash">jump deep</span></span></code></pre>
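<p>To see why this works, here is a minimal sketch of what <code>mark</code> and <code>jump</code> do under the hood. It runs in a throwaway sandbox so that your real <code>~/.marks</code> stays untouched. The variables <code>sandbox</code>, <code>target</code>, and <code>landed</code> are made up for this illustration:</p>

```shell
# Sandbox so the real ~/.marks is untouched (all paths here are illustrative).
sandbox=$(mktemp -d)
MARKPATH="$sandbox/marks"
target="$sandbox/some/very/deep/directory"
mkdir -p "$MARKPATH" "$target"

# What `mark deep` boils down to: a symlink named after the mark.
ln -s "$target" "$MARKPATH/deep"

# What `jump deep` boils down to: cd through the symlink.
# -P resolves the link, so the shell ends up in the physical
# directory rather than somewhere under $MARKPATH.
cd -P "$MARKPATH/deep"
landed=$(pwd -P)
echo "$landed"
```

<p>Without <code>-P</code>, the shell would typically report the working directory as a path under <code>$MARKPATH</code>, which is probably not what you want after a jump.</p>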
<p>To remove the bookmark (i.e., the symbolic link), type:</p>
<pre class="language-shell-session"><code class="language-shell-session"><span class="token command"><span class="token shell-symbol important">$</span> <span class="token bash language-bash">unmark deep</span></span></code></pre>
<p>You can view all marks by typing:</p>
<pre class="language-shell-session"><code class="language-shell-session"><span class="token command"><span class="token shell-symbol important">$</span> <span class="token bash language-bash">marks</span></span><br /><span class="token output">deep -> /home/johndoe/some/very/deep/often-used/directory<br />foo -> /usr/bin/foo/bar</span></code></pre>
<p>That’s all there is to it!</p>
<h2>Adding tab completion</h2>
<p>To add tab completion for the <code>jump</code> and <code>unmark</code> functions,
add the following code to your <code>.zshrc</code> (thanks to
<a href="https://news.ycombinator.com/item?id=6229468">tiziano88</a>):</p>
<pre class="language-bash"><code class="language-bash"><span class="token keyword">function</span> <span class="token function-name function">_completemarks</span> <span class="token punctuation">{</span><br /> <span class="token assign-left variable">reply</span><span class="token operator">=</span><span class="token punctuation">(</span><span class="token variable"><span class="token variable">$(</span><span class="token function">ls</span> $MARKPATH<span class="token variable">)</span></span><span class="token punctuation">)</span><br /><span class="token punctuation">}</span><br /><br />compctl <span class="token parameter variable">-K</span> _completemarks jump<br />compctl <span class="token parameter variable">-K</span> _completemarks unmark</code></pre>
<p>or the following to your <code>.bashrc</code> (thanks to
<a href="https://news.ycombinator.com/item?id=6229768">microcolonel</a>):</p>
<pre class="language-bash"><code class="language-bash"><span class="token function-name function">_completemarks</span><span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token punctuation">{</span><br /> <span class="token builtin class-name">local</span> <span class="token assign-left variable">curw</span><span class="token operator">=</span><span class="token variable">${COMP_WORDS<span class="token punctuation">[</span>COMP_CWORD<span class="token punctuation">]</span>}</span><br /> <span class="token builtin class-name">local</span> <span class="token assign-left variable">wordlist</span><span class="token operator">=</span><span class="token variable"><span class="token variable">$(</span><span class="token function">find</span> $MARKPATH <span class="token parameter variable">-type</span> l <span class="token parameter variable">-printf</span> <span class="token string">"%f<span class="token entity" title="\n">\n</span>"</span><span class="token variable">)</span></span><br /> <span class="token assign-left variable">COMPREPLY</span><span class="token operator">=</span><span class="token punctuation">(</span><span class="token variable"><span class="token variable">$(</span>compgen <span class="token parameter variable">-W</span> <span class="token string">'${wordlist[@]}'</span> -- <span class="token string">"<span class="token variable">$curw</span>"</span><span class="token variable">)</span></span><span class="token punctuation">)</span><br /> <span class="token builtin class-name">return</span> <span class="token number">0</span><br /><span class="token punctuation">}</span><br /><br />complete <span class="token parameter variable">-F</span> _completemarks jump unmark</code></pre>
<p>If you now type <code>jump</code> or <code>unmark</code> and then press TAB, you’ll see a list of
available marks. Neat!</p>
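<p>One caveat with the Bash version: <code>-printf</code> is a GNU extension to <code>find</code>, so it is not available in the BSD <code>find</code> that ships with macOS. As a hedged alternative, the sketch below builds the word list with a plain glob instead. The function name <code>_completemarks_portable</code> is hypothetical, and the sketch assumes that mark names contain no whitespace:</p>

```shell
# Hypothetical variant of _completemarks that avoids GNU-only `find -printf`.
# Assumption: mark names contain no whitespace.
_completemarks_portable() {
    local curw=${COMP_WORDS[COMP_CWORD]}
    local names=()
    local link
    # Collect the basename of every symlink in $MARKPATH.
    for link in "$MARKPATH"/*; do
        [ -L "$link" ] && names+=("${link##*/}")
    done
    # Let compgen keep only the names that start with the current word.
    COMPREPLY=($(compgen -W "${names[*]}" -- "$curw"))
    return 0
}

complete -F _completemarks_portable jump unmark
```

<p>This should behave the same as the version above on Linux, while also working with the stock tools on macOS.</p>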
<h2>Note for macOS users</h2>
<p>As pointed out by
<a href="https://news.ycombinator.com/item?id=6229428">guygurari</a>, macOS users
need a slightly different version of the <code>marks</code> function:</p>
<pre class="language-bash"><code class="language-bash"><span class="token keyword">function</span> <span class="token function-name function">marks</span> <span class="token punctuation">{</span><br /> <span class="token punctuation">\</span>ls <span class="token parameter variable">-l</span> <span class="token string">"<span class="token variable">$MARKPATH</span>"</span> <span class="token operator">|</span> <span class="token function">tail</span> <span class="token parameter variable">-n</span> +2 <span class="token operator">|</span> <span class="token function">sed</span> <span class="token string">'s/  / /g'</span> <span class="token operator">|</span> <span class="token function">cut</span> -d<span class="token string">' '</span> -f9- <span class="token operator">|</span> <span class="token function">awk</span> <span class="token parameter variable">-F</span> <span class="token string">' -> '</span> <span class="token string">'{printf "%-10s -> %s\n", $1, $2}'</span><br /><span class="token punctuation">}</span></code></pre>
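<p>Both versions of <code>marks</code> parse the output of <code>ls -l</code>, whose column layout differs between systems. If that turns out to be fragile on your machine, one possible alternative (a sketch, not part of the original functions) is to skip <code>ls</code> altogether and ask <code>readlink</code> for each target. The name <code>marks_portable</code> is made up for this example:</p>

```shell
# Hypothetical alternative to `marks` that does not parse `ls -l` output.
marks_portable() {
    for link in "$MARKPATH"/*; do
        # Skip anything that is not a symlink (including an unmatched glob).
        [ -L "$link" ] || continue
        # Print "name -> target", using readlink to read the link itself.
        printf '%s -> %s\n' "${link##*/}" "$(readlink "$link")"
    done
}
```

<p>Calling <code>marks_portable</code> prints one <code>name -> target</code> line per mark, in the same spirit as the output shown earlier.</p>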
<p>— Jeroen</p>