<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
    <title>Dimitri Lozeve's Blog</title>
    <link href="https://www.lozeve.com/atom.xml" rel="self" />
    <link href="https://www.lozeve.com" />
    <id>https://www.lozeve.com/atom.xml</id>
    <author>
        <name>Dimitri Lozeve</name>
        
        <email>dimitri+web@lozeve.com</email>
        
    </author>
    <updated>2026-02-14T00:00:00Z</updated>
    <entry>
    <title>A Mastodon bot for interesting integer sequences</title>
    <link href="https://www.lozeve.com/posts/oeis-bot.html" />
    <id>https://www.lozeve.com/posts/oeis-bot.html</id>
    <published>2026-02-14T00:00:00Z</published>
    <updated>2026-02-14T00:00:00Z</updated>
    <summary type="html"><![CDATA[<section>
  <p>A few days ago I stumbled into the <a href="https://oeis.org/">OEIS</a> again via their <a href="https://oeis.org/OEIS_pics.html">pictures page</a>. Looking at some of the pages there gave me an idea for a new,
small project, so I gave it a shot.</p>
<h2 id="what-is-the-oeis">What is the OEIS?</h2>
<p>The Online Encyclopedia of Integer Sequences, or “OEIS” for short, is
a collaborative project whose objective is to collect all sequences of
integers into a comprehensive repository. It is useful to all
mathematicians (professional or amateur!) who stumble into a sequence
and wish to identify it: they can type a few terms into the search
bar, and the OEIS will return all sequences that match.</p>
<p>The OEIS was started by Neil J. A. Sloane, after publishing a few
books collecting many important sequences. Many people had interesting
sequences they wanted to contribute, and sent letters to Sloane with
suggestions. The collection began to be so large that Sloane started
an online version, which became the OEIS<span class="sidenote-wrapper"><label for="sn-0" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sn-0" class="margin-toggle" /><span class="sidenote">The full history can be
found <a href="https://oeisf.org/#HISTORY">here</a>.<br />
<br />
</span></span>.</p>
<p>Each sequence in the database has a unique identifier like <a href="https://oeis.org/A000108">A000108</a>. An
entry includes a description, the known terms of the sequence, and a
lot of additional data: explicit formulas when they exist, comments
and bibliographical references, programs to compute the sequence, and
other related sequences in the database. The <a href="https://oeis.org/demo1.html">demo</a> is a good place to
start for how to use the database.</p>
<p>Like most people, I have no real need of the OEIS on a day-to-day
basis. However I still find it fascinating. Many sequences are
well-known: the prime numbers (<a href="https://oeis.org/A000040">A000040</a>), the Fibonacci numbers
(<a href="https://oeis.org/A000045">A000045</a>), etc. Others I have encountered elsewhere in popular
mathematics, like the Busy Beaver problem (<a href="https://oeis.org/A060843">A060843</a>), whose 5th term
has been <a href="https://www.quantamagazine.org/amateur-mathematicians-find-fifth-busy-beaver-turing-machine-20240702/">computed recently</a>. And others I have discovered directly on
the OEIS, like the “peaceable coexisting armies of queens” (<a href="https://oeis.org/A250000">A250000</a>,
look at the ASCII chess boards in the examples!).</p>
<p>What I like about these sequences is that they are often at the
intersection of many domains in mathematics. Some of these sequences
have very deep connections to many areas, and let you discover new
interesting concepts and ideas. Moreover the relationships between
sequences is often interesting as well, and let you continue the
exploration.</p>
<h2 id="the-idea">The idea</h2>
<p>The only problem is that the OEIS is more a search engine, and
therefore most sequences are not very discoverable. Contributors and
editors add keywords to the sequences, some are <a href="https://oeis.org/search?q=keyword%3Anice&amp;language=english&amp;go=Search">nice</a>, some have a nice
<a href="https://oeis.org/search?q=keyword%3Alook&amp;language=english&amp;go=Search">graph</a> (click on “graph” under the sequence itself!), and the wiki
lists some <a href="https://oeis.org/wiki/Index_to_OEIS:_Section_Cor#core">important sequences</a>. But other than this it is not easy to
discover interesting sequences randomly.</p>
<p>So I got the idea of implementing a bot that would share a random
sequence regularly, so I could read the full OEIS page if the
description seemed interesting. Since I am mostly on <a href="https://mathstodon.xyz/@dlzv">Mastodon</a>, I
wanted to implement this as a Mastodon bot so that others could follow
it if they want.</p>
<h2 id="the-implementation">The implementation</h2>
<p>The general idea is as follows:</p>
<ol type="1">
<li>Pick a random sequence ID out of the nearly 400,000 sequences
currently in the database.</li>
<li>Retrieve the metadata for this sequence (its name, the first few
terms, keywords and comments).</li>
<li>Format this as a Mastodon post and post it.</li>
</ol>
<p>It is a very simple process (basically a GET request to the OEIS
followed by a POST request to a Mastodon instance). Mastodon servers
have a public API that is very easy to use and very well documented: I
only needed the <a href="https://docs.joinmastodon.org/methods/statuses/#create">statuses</a> endpoint. It requires a token that can be
generated easily from the account settings, by creating an application
with the <code class="verbatim">write:statuses</code> scope.</p>
<p>The OEIS API has a <a href="https://oeis.org/wiki/JSON_Format">JSON output format</a>. I wanted to parse this
correctly. Looking at the <a href="https://oeis.org/eishelp2.html">documentation</a> of the format, I realized that
many sequences should be excluded because they are duplicates,
unimportant, or too obscure<span class="sidenote-wrapper"><label for="sn-1" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sn-1" class="margin-toggle" /><span class="sidenote">Specifically, I excluded sequences
tagged with <code class="verbatim">dead</code>, <code class="verbatim">dumb</code>, <code class="verbatim">dupe</code>, <code class="verbatim">less</code>, <code class="verbatim">obsc</code>, <code class="verbatim">probation</code>, or
<code class="verbatim">uned</code>.<br />
<br />
</span></span>. So I added a step in the process above to make sure that the
sequence picked at random was sufficiently interesting. Given the
number of available sequences, there should be more than enough
interesting ones!</p>
<p>I implemented it in Rust<span class="sidenote-wrapper"><label for="sn-2" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sn-2" class="margin-toggle" /><span class="sidenote">Mostly because I am comfortable with it,
it has a very good ecosystem, and it can be deployed as a single
binary.<br />
<br />
</span></span>. Claude Code was fairly helpful to build all this, especially
for the serde types to deserialize the OEIS internal format. All in
all, I had a fully functional program in under an hour, for something
that would have taken me at least a few hours if I had done everything
“manually” (and which would not have been particularly interesting or
fun to do). I am reasonably happy with the overall quality of the
code, that you can find <a href="https://github.com/dlozeve/oeis_bot">on GitHub</a>.</p>
<p>The final program is deployed on my VPS, I just copied the executable
and set up a <a href="https://github.com/dlozeve/oeis_bot/blob/main/oeis-bot.timer">systemd timer</a> to call it every 4 hours. The token for
Mastodon is given as an environment variable in the <a href="https://github.com/dlozeve/oeis_bot/blob/main/oeis-bot.service">service file</a>.</p>
<h2 id="the-result">The result</h2>
<p>I created an account on <a href="https://mathstodon.xyz/">Mathstodon</a>, the instance I use for my <a href="https://mathstodon.xyz/@dlzv">personal account</a>. As this is my first bot, I wasn’t sure about the process, but
the administrators of the instance were happy to allow the bot as long
as it is clearly identified as such and does not spam too much. I also
made it clear that the bot is not operated officially by the
OEIS. Should they decide to join Mastodon, I would be happy to cede
them the account if they wish.</p>
<p>If you want to follow it, the account is <a href="https://mathstodon.xyz/@OEISbot">@OEISbot</a>!</p>
<blockquote lang="en" cite="https://mathstodon.xyz/@OEISbot/116069813960351345" data-source="fediverse">
  <p>OEIS sequence A177263<br />Triangle read by rows: T(n,k) is the number of permutations of {1,2,...,n} having k as the last entry in the first block (1&lt;=k&lt;=n).</p><p>1, 0, 2, 1, 1, 4, 4, 5, 5, 10, 18, 22, 23, 23, 34, 96, 114, 118, 119, 119, 154, 600, 696, 714, 718, 719, 719, 874, 4320, 4920, 5016, 5034, 5038, 5039, 5039, 5914, 35280, 39600, 40200, 40296, 40314, 40318, 40319, 40319, 46234, 322560, 357840, 362160, 362760, 362856, 362874, 362878, 362879, 362879, 409114</p><p><a href="https://oeis.org/A177263" target="_blank" rel="nofollow noopener" translate="no"><span class="invisible">https://</span><span class>oeis.org/A177263</span><span class="invisible"></span></a></p>
  <footer>
     — OEIS bot (@OEISbot) <a href="https://mathstodon.xyz/@OEISbot/116069813960351345"><time datetime="2026-02-14T16:00:07.743Z">2/14/2026, 5:00:07 PM</time></a>
  </footer>
</blockquote>

<blockquote lang="en" cite="https://mathstodon.xyz/@OEISbot/116067926081324392" data-source="fediverse">
  <p>OEIS sequence A071904<br />Odd composite numbers.</p><p>9, 15, 21, 25, 27, 33, 35, 39, 45, 49, 51, 55, 57, 63, 65, 69, 75, 77, 81, 85, 87, 91, 93, 95, 99, 105, 111, 115, 117, 119, 121, 123, 125, 129, 133, 135, 141, 143, 145, 147, 153, 155, 159, 161, 165, 169, 171, 175, 177, 183, 185, 187, 189, 195, 201, 203, 205</p><p><a href="https://oeis.org/A71904" target="_blank" rel="nofollow noopener" translate="no"><span class="invisible">https://</span><span class>oeis.org/A71904</span><span class="invisible"></span></a></p>
  <footer>
     — OEIS bot (@OEISbot) <a href="https://mathstodon.xyz/@OEISbot/116067926081324392"><time datetime="2026-02-14T08:00:00.996Z">2/14/2026, 9:00:00 AM</time></a>
  </footer>
</blockquote>

<p>For now only a few sequences have been posted, but I’m excited to see
what interesting sequences come up!</p>
</section>
]]></summary>
</entry>
<entry>
    <title>The Bulgarian Solitaire</title>
    <link href="https://www.lozeve.com/posts/bulgarian-solitaire.html" />
    <id>https://www.lozeve.com/posts/bulgarian-solitaire.html</id>
    <published>2026-02-08T00:00:00Z</published>
    <updated>2026-02-08T00:00:00Z</updated>
    <summary type="html"><![CDATA[<section>
  <p>Imagine you are the director of a unit in a large company. You have 5
teams, with various numbers of people in each team. An important new
project has come up, and you would like to set up a new team to tackle
it. The problem is: you don’t have any budget to hire new people!</p>
<p>What can you do? Well, you could take one person from each existing
team, and build a new team with these 5 people.</p>
<p>It works well for a while, but of course soon enough, a new project
comes up, and you still don’t have any additional budget! It worked
the first time, so you reshuffle the teams once more in the same
way. One of the existing team had only one person left, so they go to
the newly created team and their existing team just disappears.</p>
<p>Now the CEO is getting concerned with your unconventional management
practices. She summons you to her office and asks you: What would
happen if you do this repeatedly for every new project? How many teams
will you end up with, with how many people in each one of them?</p>
<h2 id="the-bulgarian-solitaire">The Bulgarian Solitaire</h2>
<p>Instead of this (admittedly far-fetched) management scenario, you can
do the same process with playing cards. You have <span class="math inline">\(n\)</span> cards divided
arbitrarily into several piles. At each step, you remove one card from
each pile and collect them in a new pile next to the existing ones. If
you repeat this operation, how many piles of cards will you get? With
how many cards in each of them? This problem is called the <em>Bulgarian
Solitaire</em>. It is described in <a href="https://arxiv.org/abs/1503.00885">this paper</a><span class="sidenote-wrapper"><label for="sn-0" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sn-0" class="margin-toggle" /><span class="sidenote">Which also describes the
history of the problem, including why it is called “Bulgarian”.<br />
<br />
</span></span>.</p>
<p>With <span class="math inline">\(n\)</span> cards, the Bulgarian Solitaire is an example of a discrete
dynamical system on partitions of <span class="math inline">\(n\)</span>. A <em>partition</em> is an unordered
sequence of numbers that sum to <span class="math inline">\(n\)</span>. For instance, <span class="math inline">\((4,3,2,1)\)</span> and
<span class="math inline">\((8,3,2,2)\)</span> are partitions of 15. At each step, the Bulgarian
Solitaire will transform a partition like <span class="math inline">\((8,3,2,2)\)</span> into a new one,
in this case <span class="math inline">\((7,2,1,1,4)\)</span> (the new pile is at the end, but we don’t
care about the order so we can also write it as <span class="math inline">\((7,4,2,1,1)\)</span>).</p>
<p>What we are trying to understand is in fact a standard question in
dynamical systems: where is the system going? Are there stable states,
cycles? What are they?</p>
<h2 id="visualizing-the-bulgarian-solitaire-process">Visualizing the Bulgarian Solitaire process</h2>
<p>As a first example, let’s take <span class="math inline">\(n=10\)</span> cards, in piles of 4, 3, and 3
cards. We take one card from the top of each pile (in orange), and we
make a new pile of size 3:</p>
<p><img src="../images/bulgarian-solitaire/cards.svg" width="300" /></p>
<p>Visualizing cards like this is not very practical, but as it turns
out, mathematicians have already devised a nice way to visualize
partitions: <a href="https://en.wikipedia.org/wiki/Young_tableau#Diagrams">Young diagrams</a>. They represent a partition as rows of
boxes. For example, the partition <span class="math inline">\((6,5,3,1)\)</span> of the number 15 can be
visualized as:</p>
<p><picture><source srcset="../images/bulgarian-solitaire/young-dark.png" media="(prefers-color-scheme: dark)"><img src="../images/bulgarian-solitaire/young.png" alt></picture></p>
<p>In our cases, we can rotate the Young diagram by 90° so that it looks
more like piles of cards:</p>
<p><picture><source srcset="../images/bulgarian-solitaire/young_rotated-dark.png" media="(prefers-color-scheme: dark)"><img src="../images/bulgarian-solitaire/young_rotated.png" alt></picture></p>
<p>A step in the Bulgarian Solitaire will look like this:</p>
<p><picture><source srcset="../images/bulgarian-solitaire/step-dark.png" media="(prefers-color-scheme: dark)"><img src="../images/bulgarian-solitaire/step.png" alt></picture></p>
<p>We take one card from each pile (highlighted in orange in the diagram
above), and create a new pile from them. Then we reorder the piles so
that they are sorted by decreasing size.</p>
<h2 id="convergence-of-the-bulgarian-solitaire">Convergence of the Bulgarian Solitaire</h2>
<p>Now, if you try the Bulgarian Solitaire with 15 cards, for every
possible starting position, you will end up with 4 piles of size 1, 2,
3, and 4. It turns out that this pattern can be observed consistently:</p>
<div class="theorem">
<p>When the number of cards <span class="math inline">\(n = k(k+1)/2\)</span> is triangular<span class="sidenote-wrapper"><label for="sn-1" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sn-1" class="margin-toggle" /><span class="sidenote"><a href="https://en.wikipedia.org/wiki/Triangular_number">Triangular numbers</a> are the sums of consecutive
integers: 1, 1+2, 1+2+3, etc. They get their names from the fact that
they can be arranged in a triangle. (OEIS: <a href="https://oeis.org/A000217">A000217</a>) (Image from
<a href="https://commons.wikimedia.org/wiki/File:First_six_triangular_numbers.svg">Wikimedia</a>, CC-BY-SA.)
<picture><source srcset="../images/bulgarian-solitaire/First_six_triangular_numbers-dark.svg" media="(prefers-color-scheme: dark)"><img src="../images/bulgarian-solitaire/First_six_triangular_numbers.svg" alt></picture><br />
<br />
</span></span>,
the Bulgarian Solitaire will converge to piles of size 1, 2, …, <span class="math inline">\(k\)</span>.</p>
</div>
<p>How can we prove this?</p>
<p>There is a nice trick: we rotate (again!) our Young diagram by 45°:</p>
<p><picture><source srcset="../images/bulgarian-solitaire/young_rotated_45-dark.png" media="(prefers-color-scheme: dark)"><img src="../images/bulgarian-solitaire/young_rotated_45.png" alt></picture></p>
<p>When we do a step in the Bulgarian Solitaire, we move around the boxes:</p>
<p><picture><source srcset="../images/bulgarian-solitaire/young_fall-dark.png" media="(prefers-color-scheme: dark)"><img src="../images/bulgarian-solitaire/young_fall.png" alt></picture></p>
<p>Sometimes we can see that a box (in grey here) is above an empty
space, and therefore it falls down. This corresponds to cases when
there is a need to reorder the piles so that they remain in order of
decreasing size.</p>
<p>We can see that at each step, the total potential energy of the boxes
either stays the same (when all boxes are still resting on other boxes
or on the edges) or decreases (when a box is above an empty space and
falls down). And the lowest possible potential energy is reached when
the configuration is fully triangular, when boxes are spread evenly
across all piles!</p>
<p>So now we just need to prove that all configurations will end up at
this minimum of potential energy, i.e. that there is no local minimum
that prevents the system from reaching this global minimum.</p>
<p>Imagine we are <em>not</em> at this triangular position. Since the total
number of boxes is triangular, there is necessarily one layer with an
empty space for a box, and one layer above with an additional box. At
each step, the boxes will “rotate”, and the empty space and the
additional box will move one space to the right. If the layer with the
empty space is of size <span class="math inline">\(k\)</span>, the layer above it will be of size <span class="math inline">\(k+1\)</span>,
and these numbers are relatively prime. After a certain number of
steps, the additional box will be directly above the empty space and
will fall down.</p>
<p>So no configuration that is not fully triangular is stable: after a
certain number of steps, the potential energy will decrease. The
detailed proof can be found in section 2 of <a href="https://arxiv.org/abs/1503.00885">the paper</a>.</p>
<h2 id="simulating-the-bulgarian-solitaire">Simulating the Bulgarian Solitaire</h2>
<p>Let’s simulate the process (in Python) to visualize the
convergence. The script will generate all partitions of a given
integer <span class="math inline">\(n\)</span>, and generate the transition graph of the Bulgarian
Solitaire. If a step of the Bulgarian Solitaire transforms partition
<span class="math inline">\(p_1\)</span> into partition <span class="math inline">\(p_2\)</span>, we will draw an edge between <span class="math inline">\(p_1\)</span> and
<span class="math inline">\(p_2\)</span>.</p>
<p>We’ll start by defining helpful type aliases: a partition is a tuple
of ints, and the graph we will generate is a dictionary from
partitions to partitions.</p>
<div class="sourceCode" id="cb1"><pre class="sourceCode python"><code class="sourceCode python"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a><span class="bu">type</span> Partition <span class="op">=</span> <span class="bu">tuple</span>[<span class="bu">int</span>, ...]</span>
<span id="cb1-2"><a href="#cb1-2" aria-hidden="true" tabindex="-1"></a><span class="bu">type</span> Graph <span class="op">=</span> <span class="bu">dict</span>[Partition, Partition]</span></code></pre></div>
<p>Next, we need a function that enumerates all possible partitions of a
given integer <span class="math inline">\(n\)</span>. It is built recursively: for all possible prefixes
<span class="math inline">\(0 &lt; k \leq n\)</span>, the partitions we are looking for consist of <span class="math inline">\(k\)</span>
followed by all partitions of <span class="math inline">\(n-k\)</span>.</p>
<div class="sourceCode" id="cb2"><pre class="sourceCode python"><code class="sourceCode python"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> collections.abc <span class="im">import</span> Generator</span>
<span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a><span class="kw">def</span> integer_partition(n: <span class="bu">int</span>) <span class="op">-&gt;</span> Generator[Partition]:</span>
<span id="cb2-4"><a href="#cb2-4" aria-hidden="true" tabindex="-1"></a>    <span class="cf">if</span> n <span class="op">&lt;=</span> <span class="dv">0</span>:</span>
<span id="cb2-5"><a href="#cb2-5" aria-hidden="true" tabindex="-1"></a>        <span class="cf">return</span></span>
<span id="cb2-6"><a href="#cb2-6" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb2-7"><a href="#cb2-7" aria-hidden="true" tabindex="-1"></a>    <span class="kw">def</span> find_partitions_recursive(target: <span class="bu">int</span>, current_partition: <span class="bu">list</span>[<span class="bu">int</span>]):</span>
<span id="cb2-8"><a href="#cb2-8" aria-hidden="true" tabindex="-1"></a>        <span class="cf">if</span> target <span class="op">==</span> <span class="dv">0</span>:</span>
<span id="cb2-9"><a href="#cb2-9" aria-hidden="true" tabindex="-1"></a>            <span class="cf">yield</span> current_partition</span>
<span id="cb2-10"><a href="#cb2-10" aria-hidden="true" tabindex="-1"></a>            <span class="cf">return</span></span>
<span id="cb2-11"><a href="#cb2-11" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb2-12"><a href="#cb2-12" aria-hidden="true" tabindex="-1"></a>        max_val <span class="op">=</span> current_partition[<span class="op">-</span><span class="dv">1</span>] <span class="cf">if</span> current_partition <span class="cf">else</span> target</span>
<span id="cb2-13"><a href="#cb2-13" aria-hidden="true" tabindex="-1"></a>        <span class="cf">for</span> i <span class="kw">in</span> <span class="bu">range</span>(<span class="bu">min</span>(target, max_val), <span class="dv">0</span>, <span class="op">-</span><span class="dv">1</span>):</span>
<span id="cb2-14"><a href="#cb2-14" aria-hidden="true" tabindex="-1"></a>            current_partition.append(i)</span>
<span id="cb2-15"><a href="#cb2-15" aria-hidden="true" tabindex="-1"></a>            <span class="cf">yield</span> <span class="cf">from</span> find_partitions_recursive(target <span class="op">-</span> i, current_partition)</span>
<span id="cb2-16"><a href="#cb2-16" aria-hidden="true" tabindex="-1"></a>            <span class="co"># Backtrack</span></span>
<span id="cb2-17"><a href="#cb2-17" aria-hidden="true" tabindex="-1"></a>            current_partition.pop()</span>
<span id="cb2-18"><a href="#cb2-18" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb2-19"><a href="#cb2-19" aria-hidden="true" tabindex="-1"></a>    <span class="cf">yield</span> <span class="cf">from</span> <span class="bu">map</span>(<span class="bu">tuple</span>, find_partitions_recursive(n, []))</span></code></pre></div>
<p>The <code>step</code> function is the Bulgarian Solitaire itself:</p>
<div class="sourceCode" id="cb3"><pre class="sourceCode python"><code class="sourceCode python"><span id="cb3-1"><a href="#cb3-1" aria-hidden="true" tabindex="-1"></a><span class="kw">def</span> step(partition: Partition) <span class="op">-&gt;</span> Partition:</span>
<span id="cb3-2"><a href="#cb3-2" aria-hidden="true" tabindex="-1"></a>    <span class="bu">next</span> <span class="op">=</span> [<span class="bu">len</span>(partition)] <span class="op">+</span> [k <span class="op">-</span> <span class="dv">1</span> <span class="cf">for</span> k <span class="kw">in</span> partition <span class="cf">if</span> k <span class="op">&gt;</span> <span class="dv">1</span>]</span>
<span id="cb3-3"><a href="#cb3-3" aria-hidden="true" tabindex="-1"></a>    <span class="cf">return</span> <span class="bu">tuple</span>(<span class="bu">sorted</span>(<span class="bu">next</span>, reverse<span class="op">=</span><span class="va">True</span>))</span></code></pre></div>
<p>We can now compute the transition graph:</p>
<div class="sourceCode" id="cb4"><pre class="sourceCode python"><code class="sourceCode python"><span id="cb4-1"><a href="#cb4-1" aria-hidden="true" tabindex="-1"></a><span class="kw">def</span> compute_graph(n: <span class="bu">int</span>) <span class="op">-&gt;</span> Graph:</span>
<span id="cb4-2"><a href="#cb4-2" aria-hidden="true" tabindex="-1"></a>    <span class="cf">return</span> {partition: step(partition) <span class="cf">for</span> partition <span class="kw">in</span> integer_partition(n)}</span></code></pre></div>
<p>Now for display, we generate some <a href="https://graphviz.org/">Graphviz</a> code. The colors are
adapted for viewing directly in the terminal.</p>
<div class="sourceCode" id="cb5"><pre class="sourceCode python"><code class="sourceCode python"><span id="cb5-1"><a href="#cb5-1" aria-hidden="true" tabindex="-1"></a><span class="kw">def</span> print_graphviz(graph: Graph) <span class="op">-&gt;</span> <span class="va">None</span>:</span>
<span id="cb5-2"><a href="#cb5-2" aria-hidden="true" tabindex="-1"></a>    <span class="bu">print</span>(<span class="st">&quot;&quot;&quot;digraph {</span></span>
<span id="cb5-3"><a href="#cb5-3" aria-hidden="true" tabindex="-1"></a><span class="st">    bgcolor=&quot;transparent&quot;;</span></span>
<span id="cb5-4"><a href="#cb5-4" aria-hidden="true" tabindex="-1"></a><span class="st">    node [shape=&quot;rect&quot;, fontcolor=&quot;white&quot;, color=&quot;white&quot;];</span></span>
<span id="cb5-5"><a href="#cb5-5" aria-hidden="true" tabindex="-1"></a><span class="st">    edge [fontcolor=&quot;white&quot;, color=&quot;white&quot;];</span></span>
<span id="cb5-6"><a href="#cb5-6" aria-hidden="true" tabindex="-1"></a><span class="st">    &quot;&quot;&quot;</span>)</span>
<span id="cb5-7"><a href="#cb5-7" aria-hidden="true" tabindex="-1"></a>    <span class="cf">for</span> k, v <span class="kw">in</span> graph.items():</span>
<span id="cb5-8"><a href="#cb5-8" aria-hidden="true" tabindex="-1"></a>        <span class="bu">print</span>(<span class="ss">f'    &quot;</span><span class="sc">{</span>k<span class="sc">}</span><span class="ss">&quot; -&gt; &quot;</span><span class="sc">{</span>v<span class="sc">}</span><span class="ss">&quot;;'</span>)</span>
<span id="cb5-9"><a href="#cb5-9" aria-hidden="true" tabindex="-1"></a>    <span class="bu">print</span>(<span class="st">&quot;}&quot;</span>)</span></code></pre></div>
<p>To finish, let’s add a command-line interface:</p>
<div class="sourceCode" id="cb6"><pre class="sourceCode python"><code class="sourceCode python"><span id="cb6-1"><a href="#cb6-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> argparse</span>
<span id="cb6-2"><a href="#cb6-2" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb6-3"><a href="#cb6-3" aria-hidden="true" tabindex="-1"></a><span class="kw">def</span> main():</span>
<span id="cb6-4"><a href="#cb6-4" aria-hidden="true" tabindex="-1"></a>    parser <span class="op">=</span> argparse.ArgumentParser()</span>
<span id="cb6-5"><a href="#cb6-5" aria-hidden="true" tabindex="-1"></a>    parser.add_argument(<span class="st">&quot;n&quot;</span>, <span class="bu">type</span><span class="op">=</span><span class="bu">int</span>)</span>
<span id="cb6-6"><a href="#cb6-6" aria-hidden="true" tabindex="-1"></a>    args <span class="op">=</span> parser.parse_args()</span>
<span id="cb6-7"><a href="#cb6-7" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb6-8"><a href="#cb6-8" aria-hidden="true" tabindex="-1"></a>    graph <span class="op">=</span> compute_graph(args.n)</span>
<span id="cb6-9"><a href="#cb6-9" aria-hidden="true" tabindex="-1"></a>    print_graphviz(graph)</span>
<span id="cb6-10"><a href="#cb6-10" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb6-11"><a href="#cb6-11" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb6-12"><a href="#cb6-12" aria-hidden="true" tabindex="-1"></a><span class="cf">if</span> <span class="va">__name__</span> <span class="op">==</span> <span class="st">&quot;__main__&quot;</span>:</span>
<span id="cb6-13"><a href="#cb6-13" aria-hidden="true" tabindex="-1"></a>    main()</span></code></pre></div>
<p>The script can then be called directly:</p>
<div class="sourceCode" id="cb7" data-org-language="sh"><pre class="sourceCode bash"><code class="sourceCode bash"><span id="cb7-1"><a href="#cb7-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Save the graphviz file</span></span>
<span id="cb7-2"><a href="#cb7-2" aria-hidden="true" tabindex="-1"></a><span class="ex">python</span> bulgarian_solitaire.py 10 <span class="op">&gt;</span> graph10.dot</span>
<span id="cb7-3"><a href="#cb7-3" aria-hidden="true" tabindex="-1"></a><span class="co"># Generate a PNG graph</span></span>
<span id="cb7-4"><a href="#cb7-4" aria-hidden="true" tabindex="-1"></a><span class="ex">python</span> bulgarian_solitaire.py 10 <span class="kw">|</span> <span class="ex">dot</span> <span class="at">-Tpng</span> <span class="op">&gt;</span> graph_10.png</span>
<span id="cb7-5"><a href="#cb7-5" aria-hidden="true" tabindex="-1"></a><span class="co"># Display the image directly in the terminal (if it supports image display)</span></span>
<span id="cb7-6"><a href="#cb7-6" aria-hidden="true" tabindex="-1"></a><span class="ex">python</span> bulgarian_solitaire.py 12 <span class="kw">|</span> <span class="ex">dot</span> <span class="at">-Tpng</span> <span class="kw">|</span> <span class="ex">viu</span> <span class="at">-</span></span></code></pre></div>
<p>Here is the output for <span class="math inline">\(n=6\)</span>:</p>
<p><img src="../images/bulgarian-solitaire/graph_6.svg" style="background-color: #151515;" /></p>
<p>And for <span class="math inline">\(n=10\)</span>:</p>
<p><img src="../images/bulgarian-solitaire/graph_10.svg" style="background-color: #151515;" /></p>
<p>As we can see, all partitions indeed converge to the triangular
configurations <span class="math inline">\((3,2,1)\)</span> and <span class="math inline">\((4,3,2,1)\)</span>.</p>
<p>Now what happens when the number <span class="math inline">\(n\)</span> is not triangular?</p>
<p>For <span class="math inline">\(n=7\)</span>:</p>
<p><img src="../images/bulgarian-solitaire/graph_7.svg" style="background-color: #151515;" /></p>
<p>The dynamical system converges to a cycle between <span class="math inline">\((4,2,1)\)</span>,
<span class="math inline">\((3,3,1)\)</span>, <span class="math inline">\((3,2,2)\)</span>, and <span class="math inline">\((3,2,1,1)\)</span>. When we look at this cycle, we
realize that it is the triangular configuration <span class="math inline">\((3,2,1)\)</span>, plus an
additional box cycling above.</p>
<p>For <span class="math inline">\(n=8\)</span>:</p>
<p><img src="../images/bulgarian-solitaire/graph_8.svg" style="background-color: #151515;" /></p>
<p>Here we get two connected components in the graph. And here again, we
get the triangular partition but with two additional boxes cycling
above.</p>
<p>This property can be proved (Theorem 2 in the paper), and the number
of connected components can be computed exactly for all <span class="math inline">\(n\)</span>.</p>
<h2 id="conclusion">Conclusion</h2>
<p>The Bulgarian Solitaire is a nice example of a discrete dynamical
system. The proof is particularly interesting because it relies on
visualization and physical intuition about the “potential energy” of
the system, which translates directly into the formal proof.</p>
<p><a href="https://arxiv.org/abs/1503.00885">The paper</a> also mentions generalizations. I am interested in particular
in the <em>stochastic</em> Bulgarian Solitaire. Instead of taking one card
from each pile at each step, we take one card from each pile
independently with probability <span class="math inline">\(0&lt;p&lt;1\)</span>. The dynamical system becomes a
Markov chain, and it would be interesting to study (even
experimentally) its convergence properties.</p>
<h2 id="bonus-bqn-implementation-of-the-simulator">Bonus: <a href="https://mlochbaum.github.io/BQN/">BQN</a> implementation of the simulator</h2>
<p>First, the <code>Step</code> function computes one step of the Bulgarian
Solitaire. It removes 1 from each pile (<code>¯1⊸+</code>), removes the piles
that became empty (<code>0⊸≠</code>), adds a pile of the original size (<code>≠∾</code>), and
sorts the piles for consistency (<code>∨</code>).</p>
<div class="sourceCode" id="cb8"><pre class="sourceCode txt"><code class="sourceCode default"><span id="cb8-1"><a href="#cb8-1" aria-hidden="true" tabindex="-1"></a>Step←∨≠∾(0⊸≠⊸/¯1⊸+)</span></code></pre></div>
<p>The <code>Partitions</code> function generates all partitions of a given
integer. Conveniently, this was already available in <a href="https://mlochbaum.github.io/bqncrate/?q=integer%20partitions">BQNcrate</a>!</p>
<div class="sourceCode" id="cb9"><pre class="sourceCode txt"><code class="sourceCode default"><span id="cb9-1"><a href="#cb9-1" aria-hidden="true" tabindex="-1"></a>Partitions←{∾⊢´∾⟜(&lt;-⟜↕∘≠{𝕨∾¨∾(-𝕨⌊≠𝕩)↑𝕩}¨⊢)⍟𝕩⋈⋈⋈↕0}</span></code></pre></div>
<p>We can now compute the graph, by first computing all partitions of the
input, then building the edges by calling <code>Step</code> on each partition.</p>
<div class="sourceCode" id="cb10"><pre class="sourceCode txt"><code class="sourceCode default"><span id="cb10-1"><a href="#cb10-1" aria-hidden="true" tabindex="-1"></a>ComputeGraph←{&gt;(⊢⋈Step)¨Partitions𝕩}</span></code></pre></div>
<p>We now generate and print the graphviz code:</p>
<div class="sourceCode" id="cb11"><pre class="sourceCode txt"><code class="sourceCode default"><span id="cb11-1"><a href="#cb11-1" aria-hidden="true" tabindex="-1"></a>EdgeToGraphviz←{∾⟜(&quot;&quot;&quot;;&quot;∾@+10)5↓∾&quot;&quot;&quot; -&gt; &quot;&quot;&quot;⊸∾¨•Fmt¨𝕩}</span>
<span id="cb11-2"><a href="#cb11-2" aria-hidden="true" tabindex="-1"></a>graphvizheader←&quot;bgcolor=&quot;&quot;transparent&quot;&quot;;</span>
<span id="cb11-3"><a href="#cb11-3" aria-hidden="true" tabindex="-1"></a>node [shape=&quot;&quot;rect&quot;&quot;,fontcolor=&quot;&quot;white&quot;&quot;,color=&quot;&quot;white&quot;&quot;];</span>
<span id="cb11-4"><a href="#cb11-4" aria-hidden="true" tabindex="-1"></a>edge [fontcolor=&quot;&quot;white&quot;&quot;,color=&quot;&quot;white&quot;&quot;];&quot;</span>
<span id="cb11-5"><a href="#cb11-5" aria-hidden="true" tabindex="-1"></a>ToGraphviz←{&quot;digraph {&quot;∾graphvizheader∾(∾EdgeToGraphviz¨&lt;˘𝕩)∾&quot;}&quot;}</span>
<span id="cb11-6"><a href="#cb11-6" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb11-7"><a href="#cb11-7" aria-hidden="true" tabindex="-1"></a>n←•ParseFloat⊑•args</span>
<span id="cb11-8"><a href="#cb11-8" aria-hidden="true" tabindex="-1"></a>•Out ToGraphviz ComputeGraph n</span></code></pre></div>
<p>Similarly to the Python version, we can run the script with:</p>
<div class="sourceCode" id="cb12" data-org-language="sh"><pre class="sourceCode bash"><code class="sourceCode bash"><span id="cb12-1"><a href="#cb12-1" aria-hidden="true" tabindex="-1"></a><span class="ex">bqn</span> bulgarian_solitaire.bqn 6 <span class="kw">|</span> <span class="ex">dot</span> <span class="at">-Tpng</span> <span class="op">&gt;</span> graph_6.png</span></code></pre></div>
</section>
]]></summary>
</entry>
<entry>
    <title>Language models are injective and hence invertible</title>
    <link href="https://www.lozeve.com/posts/language-models-are-injective.html" />
    <id>https://www.lozeve.com/posts/language-models-are-injective.html</id>
    <published>2025-11-25T00:00:00Z</published>
    <updated>2025-11-25T00:00:00Z</updated>
    <summary type="html"><![CDATA[<section>
  <p>This is the title of a recent preprint
<span class="citation" data-cites="nikolaou2025_languag_model">(Nikolaou et al. 2025)</span><span class="sidenote-wrapper"><label for="sn-0" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sn-0" class="margin-toggle" /><span class="sidenote">Quick link: <a href="https://arxiv.org/abs/2510.15511">ArXiv:2510.15511</a>.<br />
<br />
</span></span>, that has some
interesting insights on the mathematical properties of Transformers
when viewed as functions.</p>
<p>The immediate reaction is “this is nonsense”, and the title may be
viewed as either misleading or deliberately provocative in order to
gather interest. Indeed, most functions inside neural
networks<span class="sidenote-wrapper"><label for="sn-1" class="margin-toggle">⊕</label><input type="checkbox" id="sn-1" class="margin-toggle" /><span class="marginnote"><img src="../images/language-models-are-injective/swiglu.png" /> The
SwiGLU activation function. It is clearly not injective.<br />
<br />
</span></span> and particularly Transformers are not
injective. One thinks immediately of activation functions,
normalization layers, and so on, counter-examples abound. So how can
an entire language model be injective?</p>
<p>At a higher level, injectivity seems like an even taller order. From
the user’s point of view, plenty of input prompts will lead to the
same answers (one can think of all prompts ending with “answer with
yes or no”)<span class="sidenote-wrapper"><label for="sn-2" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sn-2" class="margin-toggle" /><span class="sidenote">Indeed, the public discourse around this paper was
dismal, with thousands of users on X/Twitter posting ChatGPT
screenshots “disproving” injectivity…<br />
<br />
</span></span>.</p>
<p>So, let’s actually go through the paper to understand the claim more
precisely!</p>
<h2 id="language-models-as-functions">Language models as functions</h2>
<p>There are many different ways of viewing language models as
mathematical functions. In the first intuition above, we considered
models as functions from vectors to vectors, or from text to
text. This is what makes the result seem counter-intuitive at first.</p>
<p>I think one of the main contribution of the paper is to slightly
change the context: they see language models as functions from a
<em>prompt</em> (i.e. a finite sequence of discrete tokens) to <em>activations</em>
on the output layers (vectors).<span class="sidenote-wrapper"><label for="sn-3" class="margin-toggle">⊕</label><input type="checkbox" id="sn-3" class="margin-toggle" /><span class="marginnote"><img src="../images/language-models-are-injective/map.png" /> The
function we are studying: from prompt space (sequence of tokens) to
latent space (vector of activations).<br />
<br />
</span></span></p>
<p>Formally, we have a finite vocabulary <span class="math inline">\(\mathcal{V}\)</span>, and context
length <span class="math inline">\(K\)</span>. From a sequence of tokens <span class="math inline">\(s \in \mathcal{V}^{\leq K}\)</span> and
parameters <span class="math inline">\(\mathbf{\theta}\)</span>, the language model gives us <em>last-token
representations</em> <span class="math inline">\(\mathbf{r}(s, \mathbf{\theta}) \in \mathbb{R}^d\)</span>.</p>
<p>The central claim of the paper is that <em>this</em> function is injective.</p>
<h2 id="why-is-injectivity-important">Why is injectivity important?</h2>
<p>Since the various components of the language models are not themselves
injective, people tend to assume that there is some “forgetting” and
privacy is built into all language models. Since the input prompt is
transformed into an intermediate representation that goes through many
transformations, information is lost and the input prompt cannot be
reconstructed exactly. This intuition is wrong.</p>
<p>This has practical consequences for systems that store intermediate
activations (e.g. embeddings), in the assumption that this is somehow
more secure, or more respectful of their users privacy, than storing
the user inputs directly. In practice, all the information is in the
activations, and the paper provides both a formal proof and a
practical reconstruction method.</p>
<h2 id="structure-of-the-paper">Structure of the paper</h2>
<p>The paper has two main parts.</p>
<ul>
<li>The <em>theoretical</em> part defines a Transformer-based language model
mathematically as a function from sequences of tokens to real-valued
vectors. In this context, they prove that language models based on
common blocks (feed-forward layers, embedding layers, self-attention
blocks, common activations, etc) are almost surely injective after
random initialization of the parameters, and that this property is
preserved during training.</li>
<li>The <em>practical</em> part builds on these insights and builds a fairly
straightforward algorithm to reconstruct user inputs from the
activations of any intermediate layer, including the output
layer. They show 100% reconstruction accuracy on widely-used
language models.</li>
</ul>
<h2 id="the-theoretical-argument">The theoretical argument</h2>
<h3 id="a-primer-on-analytic-functions">A primer on analytic functions</h3>
<p>To prove the main results, the paper uses the notion of
<em>analyticity</em><span class="sidenote-wrapper"><label for="sn-4" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sn-4" class="margin-toggle" /><span class="sidenote">See Appendix A of the paper for the full
background on analytic functions.<br />
<br />
</span></span>. Analytic functions are highly
regular, actually even “smoother” than infinitely differentiable
functions. It is this “smoothness” that will have important
consequences for proving injectivity.</p>
<p>Intuitively, analytic functions are “locally polynomial”:</p>
<div class="definition">
<p>Let <span class="math inline">\(\mathcal{U} \subseteq \mathbb{R}^m\)</span> be open.
A function <span class="math inline">\(f : \mathcal{U} \to \mathbb{R}\)</span> is <em>real-analytic</em> on <span class="math inline">\(\mathcal{U}\)</span> if,
for every <span class="math inline">\(\mathbf{y} \in \mathcal{U}\)</span>, there exists coefficients
<span class="math inline">\(\{c_\mathbf{\alpha} \in \mathbb{R}\}_{\alpha \in \mathbb{N}^m}\)</span>
and <span class="math inline">\(r&gt;0\)</span> such that</p>
<p><span class="math display">\[
f(\mathbf{x}) = \sum_{\mathbf{\alpha} \in \mathbb{N}^m} c_\mathbf{\alpha} (\mathbf{x} - \mathbf{y})^\mathbf{\alpha}
\]</span></p>
<p>for all <span class="math inline">\(\mathbf{x} \in \mathcal{U}\)</span> with <span class="math inline">\(\lVert \mathbf{x} - \mathbf{y} \rVert_2 &lt; r\)</span>.</p>
<p>The set of analytic functions on <span class="math inline">\(\mathcal{U}\)</span> is denoted by <span class="math inline">\(C^\omega(\mathcal{U})\)</span>.</p>
</div>
<p>Intuitively, not only the function is infinitely differentiable, and
therefore can be locally approximated by its Taylor series, but
actually the coefficients are zero after some point, so the function
is locally a polynomial. This gives enormous regularity guarantees,
and analytic functions are among the most well-behaved.</p>
<p>A vector-valued function is analytic if all its components are
analytic, and we can similarly extend the definition to matrix-valued
functions. This is very important for our discussions since the
component blocks of Transformers are often matrix-valued,
e.g. attention layers.</p>
<p>Importantly, the set of analytic functions is closed under common
operations: addition, product, quotient, and composition. This means
that we only have to prove the analyticity of common building blocks
of Transformers, and through addition and composition the analyticity
of the overall function will be immediately proven.</p>
<p>Moreover, analytic functions have a key property relating to their
zero sets:</p>
<div class="proposition">
<p>Let <span class="math inline">\(\mathcal{U} \subseteq \mathbb{R}^m\)</span> be connected and open,
and let <span class="math inline">\(f \in C^\omega(\mathcal{U}; \mathbb{R}^n)\)</span>.
If <span class="math inline">\(f \not\equiv \mathbf{0}_n\)</span>, then its zero set
<span class="math display">\[
Z(f) := f^{-1}(\{\mathbf{0}_n\}) = \{\mathbf{x} \in \mathcal{U} : f(\mathbf{x}) = \mathbf{0}_n\}
\]</span>
has Lebesgue measure zero in <span class="math inline">\(\mathbb{R}^m\)</span>.</p>
</div>
<div class="proof">
<p>The proof can be found in <span class="citation" data-cites="mityagin2020_zero_set">Mityagin (2020)</span><span class="sidenote-wrapper"><label for="sn-5" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sn-5" class="margin-toggle" /><span class="sidenote">The paper actually cites the (identical) <a href="https://arxiv.org/abs/1512.07276">preprint version</a>.<br />
<br />
</span></span>.</p>
</div>
<p>The gist is simply this: if <span class="math inline">\(f\)</span> is not zero everywhere, then it is
zero almost nowhere. If there is an entire continuous region where <span class="math inline">\(f\)</span>
is zero, it means that <span class="math inline">\(f\)</span> is zero everywhere! This is a very strong
property of analytic functions: fundamentally, they are so regular
that if they are zero on any significant portion of their domain,
their smoothness imposes that they become zero everywhere.</p>
<p>This proposition is key to the paper as it will be used to prove
injectivity. We can now go on to the actual results of the paper.</p>
<h3 id="language-models-are-analytic">Language models are analytic</h3>
<p>A language model is a composition of many building blocks. Transformer
blocks are themselves composed of self-attention layers and MLPs,
which are themselves compositions of stacking operations,
normalizations, softmax, exponentials, and polynomials. By going
bottom-up, we can start from basic blocks (polynomial functions, the
exponential function), show that they are analytic, and since analytic
functions are closed under composition, show that the entire language
model is analytic.</p>
<p>The proofs that these building blocks are analytics are detailed in
Appendix A of the paper. The full language model is defined
mathematically as a composition of these building blocks in
Appendix B, with Proposition B.3 putting everything together and
proving that Transformers are analytic.</p>
<h3 id="injectivity-at-initialization">Injectivity at initialization</h3>
<p>Let us draw a set of parameters <span class="math inline">\(\mathbf{\theta}\)</span> at random, from a
distribution with a density. For any two prompts <span class="math inline">\(s, s' \in
\mathcal{V}^{\leq K}\)</span>, the probability that the output activations
<span class="math inline">\(\mathbf{r}(s, \mathbf{\theta})\)</span> and <span class="math inline">\(\mathbf{r}(s', \mathbf{\theta})\)</span>
are equal is zero.</p>
<p>This is because the function <span class="math inline">\(\mathbf{\theta} \mapsto \mathbf{r}(s,
\mathbf{\theta}) - \mathbf{r}(s', \mathbf{\theta})\)</span> is a sum of
analytic functions, and is therefore analytic. By the key proposition
above, its zero set has measure zero. So by drawing random parameters
<span class="math inline">\(\mathbf{\theta}\)</span>, you have almost surely <span class="math inline">\(\mathbf{r}(s,
\mathbf{\theta}) \neq \mathbf{r}(s', \mathbf{\theta})\)</span>.</p>
<p>We have just proven that if <span class="math inline">\(s \neq s'\)</span>, then, with probability 1 over
<span class="math inline">\(\mathbf{\theta}\)</span>, we have that <span class="math inline">\(\mathbf{r}(s, \mathbf{\theta}) \neq
\mathbf{r}(s', \mathbf{\theta})\)</span> i.e., <span class="math inline">\(\mathbf{r}\)</span> is injective!</p>
<h3 id="injectivity-is-preserved-during-training">Injectivity is preserved during training</h3>
<p>However, we only draw <span class="math inline">\(\mathbf{\theta}\)</span> at random during
initialization. What if during training, as <span class="math inline">\(\mathbf{\theta}\)</span> changes,
the language model loses its injectivity?</p>
<p>It turns out that gradient descent itself is also analytic (and so are
its variants, stochastic gradient descent, minibatch GD, etc). It is
therefore once again very smooth, and can only stretch and bend the
parameter space, but not collapse regions of positive volume into
single points. So as the parameters change during training, the points
that are apart stay apart and do not collapse onto each
other<span class="sidenote-wrapper"><label for="sn-6" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sn-6" class="margin-toggle" /><span class="sidenote">The argument is made formally in the paper using
topological properties of the Jacobian of the gradient descent
function. I won’t go into the details here, see Section 2 of the paper
for an overview, and Appendix C for the full proof.<br />
<br />
</span></span>.</p>
<p>The injectivity property established at initialization is therefore
preserved at each step of the training. This concludes the main result
of the paper: a fully-trained language model based on Transformers is
injective. Two different input prompts will always lead to two
different output activations in the final layer (and in any
intermediate layer for that matter).</p>
<h3 id="remarks">Remarks</h3>
<p>The paper bases all the proofs on a definition of a “standard”
language model based on Transformers. However, there are architecture
choices that can break the fundamental hypotheses. In particular, some
activation functions like ReLU are not analytic (indeed, it is not
even differentiable), and things like quantization also break the
analytic property of the entire model. In these cases, the model will
not be injective.</p>
<p>However, all modern language model architectures use smooth
activations<span class="sidenote-wrapper"><label for="sn-7" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sn-7" class="margin-toggle" /><span class="sidenote">See <a href="https://jcarlosroldan.com/post/348/">this great article</a> on SwiGLU and family for an
overview of activation functions used in LLMs.<br />
<br />
</span></span>, and quantization is not the norm
everywhere. As we will see in the next section, the theoretical result
hold very well in practice for LLMs that are widely used today.</p>
<h2 id="practical-application">Practical application</h2>
<p>Proving a theorem might not be enough, as there are plenty of things
that are ignored in the mathematical formulation (e.g. floating-point
precision). Moreover, real-world LLMs are very complex systems with
thousands of building blocks. For instance, activation functions are
part of a very active research domain and change frequently, so there
is no guarantee that all activation functions found in the wild follow
the requirements of the theorem.</p>
<p>The first practical test is to check for collisions: two input prompts
that lead to the same activation vector. If we can exhibit two
sequences of tokens like this, we will have broken the injectivity.</p>
<p>For large vocabularies, the input space is too big to be checked
exhaustively, so they can’t check everything. Instead they start with
a dataset of real-world prompts and find the ones that are the closest
in the output. They append all the possible sequences of tokens to
these prompts. Therefore the test is “exhaustive” starting from
“prefix” prompts that were already very close.</p>
<p>What they find is that all input pairs are very far from a collision
threshold of <span class="math inline">\(10^{-6}\)</span> (probably chosen to account for floating-point
error). As this check is not truly exhaustive and the thresholds have
been fixed somewhat arbitrarily, it’s not very significant, but it
acts as a nice “sanity check” for the theory.</p>
<p><img src="../images/language-models-are-injective/collisions.png" />
<span class="sidenote-wrapper"><label for="sn-8" class="margin-toggle">⊕</label><input type="checkbox" id="sn-8" class="margin-toggle" /><span class="marginnote">Distance between final layer and intermediate layer outputs
for the “exhaustive” check on input prompts. All are significantly
above the collision threshold as defined in the paper.<br />
<br />
</span></span></p>
<p>The second practical experiments is to exploit the injectivity result
to build an algorithm to invert a language model. Starting from the
last-layer activations, they build a simple gradient-guided search
that reconstruct the input sequence of tokens. The results are
impressive: they get 100% token-level accuracy on all prompts, and in
a remarkably quick process. The following results are for the GPT-2
Small model:</p>
<table>
<thead>
<tr>
<th>Method</th>
<th>Mean Time (s)</th>
<th>Accuracy</th>
</tr>
</thead>
<tbody>
<tr>
<td>HardPrompts</td>
<td>6132.59 ± 104.61</td>
<td>0.00</td>
</tr>
<tr>
<td>BruteForce</td>
<td>3889.61 ± 691.17</td>
<td><strong>1.00</strong></td>
</tr>
<tr>
<td>SipIt</td>
<td><strong>28.01 ± 35.87</strong></td>
<td><strong>1.00</strong></td>
</tr>
</tbody>
</table>
<h2 id="conclusion-rebuilding-our-intuition">Conclusion: rebuilding our intuition</h2>
<h3 id="expressivity-and-regularization">Expressivity and regularization</h3>
<p>Transformers (and neural networks in general) are very useful because
they generalize very well. The class of learned functions needs to be
very large to be as expressive as possible, but still constrained to
avoid overfitting. The regularization built into our language models
can be understood as properties of the functions they
represent. Analytic functions are a class of very smooth, very regular
functions, that have a lot of nice property, so it is no accident that
they are very good targets of a machine learning algorithm.</p>
<h3 id="curse-of-dimensionality">Curse of dimensionality</h3>
<p>The main result of the paper does not seem so unintuitive if its
hypotheses are laid out clearly. What we have is a function with
inputs on a <em>discrete space</em> (finite sequences on a finite
vocabulary), and outputs in <em>continuous, high-dimension</em> vector
space. Since the output space is so much larger, it would actually be
very surprising that a non-pathological function makes several inputs
“collapse” into a single high-dimension vector!</p>
<p>Moreover, there is an instance of the <em>curse of dimensionality</em>: if
you draw two vectors with random directions in a high-dimension space,
they will very probably be orthogonal
<span class="citation" data-cites="vershynin2018_high_dimen_probab">(Vershynin 2018, sec. 3.3.3)</span>. High dimensions
tend to “push everything to the edges” and so collisions become
increasingly unlikely. So another way to see this result is that it is
unintuitive only because probability on high-dimension vector spaces
is in itself quite unintuitive!</p>
<h2 class="unnumbered" id="references">References</h2>
<div id="refs" class="references csl-bib-body hanging-indent" data-entry-spacing="0" role="list">
<div id="ref-mityagin2020_zero_set" class="csl-entry" role="listitem">
Mityagin, Boris S. 2020. <span>“The Zero Set of a Real Analytic Function.”</span> <em>Mathematical Notes</em> 107 (3–4): 529–30. <a href="https://doi.org/10.1134/s0001434620030189">https://doi.org/10.1134/s0001434620030189</a>.
</div>
<div id="ref-nikolaou2025_languag_model" class="csl-entry" role="listitem">
Nikolaou, Giorgos, Tommaso Mencattini, Donato Crisostomi, Andrea Santilli, Yannis Panagakis, and Emanuele Rodolà. 2025. <span>“Language Models Are Injective and Hence Invertible.”</span> <a href="https://arxiv.org/abs/2510.15511">https://arxiv.org/abs/2510.15511</a>.
</div>
<div id="ref-vershynin2018_high_dimen_probab" class="csl-entry" role="listitem">
Vershynin, Roman. 2018. <em>High-Dimensional Probability: An Introduction with Applications in Data Science</em>. Cambridge Series in Statistical and Probabilistic Mathematics 47. <span>Cambridge</span>: <span>Cambridge University Press</span>. <a href="https://www.math.uci.edu/~rvershyn/papers/HDP-book/HDP-book.html">https://www.math.uci.edu/~rvershyn/papers/HDP-book/HDP-book.html</a>.
</div>
</div>
</section>
]]></summary>
</entry>
<entry>
    <title>Reading notes: Kolmogorov-Arnold Networks</title>
    <link href="https://www.lozeve.com/posts/kolmogorov-arnold-networks.html" />
    <id>https://www.lozeve.com/posts/kolmogorov-arnold-networks.html</id>
    <published>2024-06-08T00:00:00Z</published>
    <updated>2024-06-08T00:00:00Z</updated>
    <summary type="html"><![CDATA[<section>
  <p>This paper <span class="citation" data-cites="liu2024_kan">(Liu et al. 2024)</span> proposes an alternative to multi-layer
perceptrons (MLPs) in machine learning.</p>
<p>The basic idea is that MLPs have parameters on the nodes of the
computation graph (the weights and biases on each cell), and that KANs
have the parameters on the edges. Each edge has a learnable activation
function parameterized as a spline.</p>
<p>The network is learned at two levels, which allows for “adjusting
locally”:</p>
<ul>
<li>the overall shape of the computation graph and its connexions
(external degrees of freedom, to learn the compositional structure),</li>
<li>the parameters of each activation function (internal degrees of
freedom).</li>
</ul>
<p>It is based on the <a href="https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Arnold_representation_theorem">Kolmogorov-Arnold representation theorem</a>, which
says that any continuous multivariate function can be represented as a
sum of continuous univariate functions. We recover the distinction
between the compositional structure of the sum and the structure of
each internal univariate function.</p>
<p>The theorem can be interpreted as two layers, and the paper then
generalizes it to multiple layer of arbitrary width. In the theorem,
the univariate functions are arbitrary and can be complex (even
fractal), so the hope is that allowing for arbitrary depth and width
will allow to only use splines. They derive an approximation theorem:
when replacing the arbitrary continuous functions of the
Kolmogorov-Arnold representation with splines, we can bound the error
independently of the dimension. (However there is a constant which
depends on the function and its representation, and therefore on the
dimension…) Theoretical scaling laws in the number of parameters are
much better than for MLPs, and moreover, experiments show that KANs
are much closer to their theoretical bounds than MLPs.</p>
<p>KANs have interesting properties:</p>
<ul>
<li>The splines are interpolated on grid points which can be iteratively
refined. The fact that there is a notion of “fine-grainedness” is
very interesting, it allows to add parameters without having to
retrain everything.</li>
<li>Larger is not always better: the quality of the reconstruction
depends on finding the optimal shape of the network, which should
match the structure of the function we want to approximate. Finding
this optimal shape is found via sparsification, pruning, and
regularization (non-trivial).</li>
<li>We can have a “human in the loop” during training, guiding pruning,
and “symbolifying” some activations (i.e. by recognizing that an
activation function is actually a cos function, replace it
directly). This symbolic discovery can be guided by a symbolic
system recognizing some functions. It’s therefore a mix of symbolic
regression and numerical regression.</li>
</ul>
<p>They test mostly with scientific applications in mind: reconstructing
equations from physics and pure maths. Conceptually, it has a lot of
overlap with Neural Differential Equations
<span class="citation" data-cites="chenNeuralOrdinaryDifferential2018 ruthotto2024_differ_equat">(Chen et al. 2018; Ruthotto 2024)</span>
and “scientific ML” in general.</p>
<p>There is an interesting discussion at the end about KANs as the model
of choice for the “language of science”. The idea is that LLMs are
important because they are useful for natural language, and KANs
could fill the same role for the language of functions. The
interpretability and adaptability (being able to be manipulated and
guided during training by a domain expert) is thus a core feature that
traditional deep learning models lack.</p>
<p>There are still challenges, mostly it’s unclear how it performs on
other types of data and other modalities, but it is very
encouraging. There is also a computational challenges, they are
obviously much slower to train, but there has been almost no
engineering work on them to optimize this, so it’s expected. The fact
that the operations are not easily batchable (compared to matrix
multiplication) is however worrying for scalability to large networks.</p>
<h2 class="unnumbered" id="references">References</h2>
<div id="refs" class="references csl-bib-body hanging-indent" data-entry-spacing="0" role="list">
<div id="ref-chenNeuralOrdinaryDifferential2018" class="csl-entry" role="listitem">
Chen, Ricky T. Q., Yulia Rubanova, Jesse Bettencourt, and David Duvenaud. 2018. <span>“Neural <span>Ordinary Differential Equations</span>,”</span> June. <a href="http://arxiv.org/abs/1806.07366">http://arxiv.org/abs/1806.07366</a>.
</div>
<div id="ref-liu2024_kan" class="csl-entry" role="listitem">
Liu, Ziming, Yixuan Wang, Sachin Vaidya, Fabian Ruehle, James Halverson, Marin Soljačić, Thomas Y. Hou, and Max Tegmark. 2024. <span>“<span>KAN</span>: <span>Kolmogorov</span>-<span>Arnold</span> <span>Networks</span>.”</span> arXiv. <a href="https://doi.org/10.48550/arXiv.2404.19756">https://doi.org/10.48550/arXiv.2404.19756</a>.
</div>
<div id="ref-ruthotto2024_differ_equat" class="csl-entry" role="listitem">
Ruthotto, Lars. 2024. <span>“Differential <span>Equations</span> for <span>Continuous</span>-<span>Time</span> <span>Deep</span> <span>Learning</span>.”</span> <em>Notices of the American Mathematical Society</em> 71 (05). <a href="https://doi.org/10.1090/noti2930">https://doi.org/10.1090/noti2930</a>.
</div>
</div>
</section>
]]></summary>
</entry>
<entry>
    <title>Randomness and Uncertainty: from random noise to predictable oscillations via differential equations</title>
    <link href="https://www.lozeve.com/posts/randomness-and-uncertainty.html" />
    <id>https://www.lozeve.com/posts/randomness-and-uncertainty.html</id>
    <published>2023-10-20T00:00:00Z</published>
    <updated>2023-10-20T00:00:00Z</updated>
    <summary type="html"><![CDATA[<section>
  <p>This <a href="https://www.lms.ac.uk/sites/default/files/inline-files/NLMS_505_for%20web.pdf#page=24">article (PDF)</a> by Nick Trefethen in the <a href="https://www.lms.ac.uk/publications/lms-newsletter"><em>London Mathematical Society Newsletter</em></a> demonstrates an interesting relationship between
what we perceive as “randomness” and what we perceive as “certainty”.</p>
<p>There are many ways to generate pseudo-random numbers that look
perfectly “random” but are actually the output of fully deterministic
processes. Trefethen gives an example of a chaotic system based on a
logistic equation.</p>
<p>But more interesting (to me) and more original may be the other way
around: how to get certainty from randomness. There are ordinary
differential equations that can take random noise as input, and whose
solution is very stable, oscillating between two possible
values. Given a function <span class="math inline">\(f\)</span> approximating random white noise, the
solution to the ODE
<span class="math display">\[ y' = y - y^3 + Cf(t) \]</span>
is “bistable” and remains always around -1 and 1. The parameter <span class="math inline">\(C\)</span>
allows to control the half-life of the transitions.</p>
<p>To explore this behaviour, I replicated Trefethen’s experiments in
Python with the <a href="https://docs.kidger.site/diffrax/">Diffrax</a> library (a differential equations solver based
on <a href="https://jax.readthedocs.io/">JAX</a>). The full code is <a href="https://gist.github.com/dlozeve/4924e71097e1d86933e8d5528cd2f6b4">in this Gist</a>.</p>
<p>It suffices to define a simple function for the vector field and to
give it to a solver:</p>
<div class="sourceCode" id="cb1"><pre class="sourceCode python"><code class="sourceCode python"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a><span class="kw">def</span> f(t, y, args):</span>
<span id="cb1-2"><a href="#cb1-2" aria-hidden="true" tabindex="-1"></a>    <span class="cf">return</span> y <span class="op">-</span> y<span class="op">**</span><span class="dv">3</span> <span class="op">+</span> <span class="fl">0.4</span> <span class="op">*</span> args[t.astype(<span class="bu">int</span>)]</span></code></pre></div>
<p>where <code>args</code> will be the input, as a simple array of
normally-distributed random values. <span class="math inline">\(C\)</span> is hardcoded as 0.4 as in the
article, but could be passed through <code>args</code> as well (it can be a
dictionary).</p>
<p><img src="../images/randomness_uncertainty.png" /></p>
</section>
]]></summary>
</entry>
<entry>
    <title>High reliability organizations</title>
    <link href="https://www.lozeve.com/posts/high-reliability-organizations.html" />
    <id>https://www.lozeve.com/posts/high-reliability-organizations.html</id>
    <published>2022-06-03T00:00:00Z</published>
    <updated>2022-06-03T00:00:00Z</updated>
    <summary type="html"><![CDATA[<section>
  <p><span class="citation" data-cites="dietterich2018_robus_artif_intel_robus_human_organ">Dietterich (2018)</span> is an
interesting article about how to make <em>robust</em> AI. High risk
situations require the combined AI and human system to operate as a
high reliability organization (HRO). Only such an organization can
have sufficiently strong safety and reliability properties to ensure
that powerful AI systems will not amplify human mistakes.</p>
<h2 id="reliability-and-high-reliability-organizations">Reliability and high-reliability organizations</h2>
<p>The concept of high reliability organization (HRO) comes from
<span class="citation" data-cites="weick1999_organ">Weick, Sutcliffe, and Obstfeld (1999)</span>. Examples of HROs include nuclear power
plants, aircraft carriers, air traffic control systems, and space
shuttles. They share several characteristics: an unforgiving
environment, vast potential for error, and dramatic scales in the case
of a failure.</p>
<p>The paper identifies five processes common to HROs, that they group
into the concept of <em>mindfulness</em> (a kind of “enriched
awareness”). Mindfulness is about allocating and conserving attention
of the group. It includes both being consciously aware of the
situation and <em>acting</em> on this understanding.</p>
<p>This mindfulness leads to the capacity to discover and manage
unexpected events, which in turn leads to reliability.</p>
<h2 id="characteristics-of-a-high-reliability-organization">Characteristics of a high reliability organization</h2>
<p>An HRO is an organization with the following five attributes.</p>
<h3 id="preoccupation-with-failure">Preoccupation with failure</h3>
<p>Failures in HROs are extremely rare. To make it easier to learn from
them, the organization has to broaden the data set by expanding the
definition of failure and studying all types of anomalies and near
misses. Additionally, the analysis is much richer, and always
considers the reliability of the entire system, even for localized
failures.</p>
<p>HROS also study the <em>absence</em> of failure: why it didn’t fail, and the
possibility that no flaws were identified because there wasn’t enough
attention to potential flaws.</p>
<p>To further increase the number of data point to study, HROs encourage
reporting all mistakes and anomalies by anyone. Contrary to most
organizations, members are rewarded for reporting potential failures,
even if their analysis is wrong or if they are responsible for
them. This creates an atmosphere of “psychological safety” essential
for transparency and honesty in anomaly reporting.</p>
<h3 id="reluctance-to-simplify-interpretations">Reluctance to simplify interpretations</h3>
<p>HROs avoid having a single interpretation for a given event. They
encourage generating multiple, complex, contradicting interpretations
for every phenomenon. These varied interpretations enlarge the number
of concurrent precautions. Redundancy is implemented not only via
duplication, but via skepticism of existing systems.</p>
<p>People are encouraged to have different views, different backgrounds,
and are re-trained often. To resolve the contradictions and the
oppositions of views, interpersonal and human skills are highly
valued, possibly more than technical skills.</p>
<h3 id="sensitivity-to-operations">Sensitivity to operations</h3>
<p>HROs rely a lot on “situational awareness”. They are ensuring that no
<a href="https://en.wikipedia.org/wiki/Emergence">emergent phenomena</a> emerge in the system: all outputs should always be
explained by the known inputs. Otherwise, there might be other forces
at work that need to be identified and dealt with. A small group of
people may be dedicated to this awareness at all times.</p>
<h3 id="commitments-to-resilience">Commitments to resilience</h3>
<p>HROs train people to be experts at combining all processes and events
to improve their reactions and their improvisation skills. Everyone
should be an expert at anticipating potential adverse events, and
managing surprise. When events get outside normal operational
boundaries, organizations members self-organize into small dedicated
teams to improvise solutions to novel problems.</p>
<h3 id="underspecification-of-structures">Underspecification of structures</h3>
<p>There is no fixed reporting path, anyone can raise an alarm and halt
operations. Everyone can take decisions related to their technical
expertise. Information is spread directly through the organization, so
that people with the right expertise are warned first. Power is
delegated to operation personal, but management is completely
available at all times.</p>
<h2 id="hros-vs-non-hros">HROs vs non-HROs</h2>
<p>Non-HROs increasingly exhibit some properties of HROs. This may be due
to the fact that highly competitive environments with short cycles
create unforgiving conditions (high performance standards, low
tolerance for errors). However, most everyday organizations do not put
failure at the heart of their thinking.</p>
<p>Failures in non-HROs come from the same sources: cultural assumptions
on the effectiveness or accuracy of previous precautions measures.</p>
<p>Preoccupation with failure also reveal the couplings and the complex
interactions in the manipulated systems. This in turn leads to
uncoupling and less emergent behaviour over time. People understand
better long-term, complex interactions.</p>
<h2 id="reliability-vs-performance-and-the-importance-of-learning">Reliability vs performance, and the importance of learning</h2>
<p>An interesting discussion is around the (alleged) trade-off between
reliability and performance. It is assumed that HROs put the focus on
reliability at the cost of throughput. As a consequence, it may not
make sense for ordinary organizations to put as much emphasis on
safety and reliability, as the cost to the business may be
prohibitive.</p>
<p>However, investments in safety can also be viewed as investments in
<em>learning</em>. HROs view safety and reliability as a process of search
and learning (constant search for anomalies, learning the interactions
between the parts of a complex system, ensuring we can link outputs to
known inputs). As such, investments in safety encourage collective
knowledge production and dissemination.</p>
<p>Mindfulness also stimulates intrinsic motivation and perceptions of
efficacy and control, which increase individual performance. (People
who strongly believe they are in control of their own output are more
motivated and more efficient.)</p>
<p>HROs may encourage mindfulness based on operational necessity in front
of the catastrophic consequences of any failure, but non-HROs can
adopt the same practice to boost efficiency and learning to gain
competitive advantage.</p>
<p>Additional lessons that can be learned from HROs (implicit in the
previous discussion):</p>
<ol type="1">
<li>The expectation of surprise is an organizational resource because
it promotes real-time attentiveness and discovery.</li>
<li>Anomalous events should be treated as outcomes rather than
accidents, to encourage search for sources and causes.</li>
<li>Errors should be made as conspicuous as possible to undermine
self-deception and concealment.</li>
<li>Reliability requires diversity, duplication, overlap, and a varied
response repertoire, whereas efficiency requires homogeneity,
specialization, non-redundancy, and standardization.</li>
<li>Interpersonal skills are just as important in HROs as are technical
skills.</li>
</ol>
<h2 class="unnumbered" id="references">References</h2>
<div id="refs" class="references csl-bib-body hanging-indent" data-entry-spacing="0" role="list">
<div id="ref-dietterich2018_robus_artif_intel_robus_human_organ" class="csl-entry" role="listitem">
Dietterich, Thomas G. 2018. <span>“Robust Artificial Intelligence and Robust Human Organizations.”</span> <em>CoRR</em>. <a href="http://arxiv.org/abs/1811.10840">http://arxiv.org/abs/1811.10840</a>.
</div>
<div id="ref-weick1999_organ" class="csl-entry" role="listitem">
Weick, Karl E., Kathleen M. Sutcliffe, and David Obstfeld. 1999. <span>“Organizing for High Reliability: Processes of Collective Mindfulness.”</span> In <em>Research in Organizational Behavior</em>, edited by R. S. Sutton and B. M. Staw, 21:81–123. Research in Organizational Behavior, Vol. 21. Stanford: Elsevier Science/JAI Press. <a href="https://archive.org/details/organizing-for-high-reliability">https://archive.org/details/organizing-for-high-reliability</a>.
</div>
</div>
</section>
]]></summary>
</entry>
<entry>
    <title>How to train your differentiable filter</title>
    <link href="https://www.lozeve.com/posts/how-to-train-your-differentiable-filter.html" />
    <id>https://www.lozeve.com/posts/how-to-train-your-differentiable-filter.html</id>
    <published>2022-05-20T00:00:00Z</published>
    <updated>2022-05-20T00:00:00Z</updated>
    <summary type="html"><![CDATA[<section>
  <p>This is a short overview of the following paper <span class="citation" data-cites="kloss2021_how">(Kloss, Martius, and Bohg 2021)</span>:</p>
<blockquote>
<p>Kloss, Alina, Georg Martius, and Jeannette Bohg. 2021. “How to Train
Your Differentiable Filter.” <em>Autonomous Robots</em> 45 (4):
561–78. <a href="https://doi.org/10.1007/s10514-021-09990-9">https://doi.org/10.1007/s10514-021-09990-9</a>.</p>
</blockquote>
<h2 id="bayesian-filtering-for-state-estimation">Bayesian filtering for state estimation</h2>
<p>Bayesian filters<span class="sidenote-wrapper"><label for="sn-0" class="margin-toggle">⊕</label><input type="checkbox" id="sn-0" class="margin-toggle" /><span class="marginnote"><span class="citation" data-cites="thrun2006_probab_robot">(Thrun 2006)</span> contains a
great explanation of Bayesian filters (including Kalman and particle
filters), in the context of robotics, which is relevant for this
paper. For a more complete overview of Kalman filters, see
<span class="citation" data-cites="anderson2005_optim_filter">(Anderson and Moore 2005)</span>.<br />
<br />
</span></span> are the standard method for
probabilistic state estimation. Common examples are (extended,
unscented) <a href="https://en.wikipedia.org/wiki/Kalman_filter">Kalman filters</a> and <a href="https://en.wikipedia.org/wiki/Particle_filter">particle filters</a>. These filters require
a <em>process model</em> predicting how the state evolves over time, and an
<em>observation model</em> relating an sensor value to the underlying state.</p>
<p>The objective of a filter for state estimation is to estimate a latent
state <span class="math inline">\(\mathbf{x}\)</span> of a dynamical system at any time step <span class="math inline">\(t\)</span> given an
initial belief <span class="math inline">\(\mathrm{bel}(\mathbf{x}_0) = p(\mathbf{x}_0)\)</span>, a
sequence of observations <span class="math inline">\(\mathbf{z}_{1\ldots t}\)</span>, and controls
<span class="math inline">\(\mathbf{u}_{0\ldots t}\)</span>.</p>
<p>We make the Markov assumption (i.e. states and observations are
conditionally independent from the history of past states).</p>
<p><span class="math display">\[
\begin{align*}
\mathrm{bel}(\mathbf{x}_t) &amp;= \eta p(\mathbf{z}_t | \mathbf{x}_t) \int p(\mathbf{x}_t | \mathbf{x}_{t-1}, \mathbf{u}_{t-1}) \mathrm{bel}(\mathbf{x}_{t-1}) d\mathbf{x}_{t-1}\\
&amp;= \eta p(\mathbf{z}_t | \mathbf{x}_t) \overline{\mathrm{bel}}(\mathbf{x}_t),
\end{align*}
\]</span></p>
<p>where <span class="math inline">\(\eta\)</span> is a normalization factor. Computing
<span class="math inline">\(\overline{\mathrm{bel}}(\mathbf{x}_t)\)</span> is the <em>prediction step</em>, and
applying <span class="math inline">\(p(\mathbf{z}_t | \mathbf{x}_t)\)</span> is the <em>update step</em> (or the
<em>observation step</em>).</p>
<p>We model the dynamics of the system through a process model <span class="math inline">\(f\)</span> and an
observation model <span class="math inline">\(h\)</span>:</p>
<p><span class="math display">\[
\begin{align*}
\mathbf{x}_t &amp;= f(\mathbf{x}_{t-1}, \mathbf{u}_{t-1}, \mathbf{q}_{t-1})\\
\mathbf{z}_t &amp;= h(\mathbf{x}_t, \mathbf{r}_t),
\end{align*}
\]</span>
where <span class="math inline">\(\mathbf{q}\)</span> and <span class="math inline">\(\mathbf{r}\)</span> are random variables representing
process and observation noise, respectively.</p>
<h2 id="differentiable-bayesian-filters">Differentiable Bayesian filters</h2>
<p>These models are often difficult to formulate and specify, especially
when the application has complex dynamics, with complicated noises,
nonlinearities, high-dimensional state or observations, etc.</p>
<p>To improve this situation, the key idea is to <em>learn</em> these complex
dynamics and noise models from data. Instead of spending hours in
front of a blackboard deriving the equations, we could give a simple
model a lot of data and learn the equations from them!</p>
<p>In the case of Bayesian filters, we have to define the process,
observation, and noise processes as parameterized functions
(e.g. neural networks), and learn their parameters end-to-end, through
the entire apparatus of the filter. To learn these parameters, we will
use the simplest method: gradient descent. Our filter have to become
<em>differentiable</em>.</p>
<p>The paper shows that such <em>differentiable filters</em> (trained
end-to-end) outperform unstructured <a href="https://en.wikipedia.org/wiki/Long_short-term_memory">LSTMs</a>, and outperform standard
filters where the process and observation models are fixed in advance
(i.e. analytically derived or even trained separately in isolation).</p>
<p>In most applications, the process and observation noises are often
assumed to be uncorrelated Gaussians, with zero mean and constant
covariance (which is a hyperparameter of the filter). With end-to-end
training, we can learn these parameters (mean and covariance of the
noise), but we can even go further, and use <a href="https://en.wikipedia.org/wiki/Heteroscedasticity">heteroscedastic</a> noise
models. In this model, the noise can depend on the state of the system
and the applied control.</p>
<h2 id="learnable-process-and-observation-models">Learnable process and observation models</h2>
<p>The observation model <span class="math inline">\(f\)</span> can be implemented as a simple feed-forward
neural network. Importantly, this NN is trained to output the
<em>difference</em> between the next and the current state (<span class="math inline">\(\mathbf{x}_{t+1} - \mathbf{x}_t\)</span>).
This ensure stable gradients and an easier initialization near the
identity.</p>
<p>For the observation model, we could do the same and model <span class="math inline">\(g\)</span> as a
generative neural network predicting the output of the
sensors. However, the observation space is often high-dimensional, and
the network is thus difficult to train. Consequently, the authors use
a <em>discriminative</em> neural network to reduce the dimensionality of the
raw sensory output.</p>
<h2 id="learnable-noise-models">Learnable noise models</h2>
<p>In the Gaussian case, we use neural networks to predict the covariance
matrix of the noise processes. To ensure positive-definiteness, the
network predicts an upper-triangular matrix <span class="math inline">\(\mathbf{L}_t\)</span> and the
noise covariance matrix is set to <span class="math inline">\(\mathbf{Q}_t = \mathbf{L}_t \mathbf{L}_t^T\)</span>.</p>
<p>In the heteroscedastic case, the noise covariance is predicted from
the state and the control input.</p>
<h2 id="loss-function">Loss function</h2>
<p>We assume that we have access to the ground-truth trajectory <span class="math inline">\(\mathbf{x}_{1\ldots T}\)</span>.</p>
<p>We can then use the mean squared error (MSE) between the ground truth
and the mean of the belief:</p>
<p><span class="math display">\[ L_\mathrm{MSE} = \frac{1}{T} \sum_{t=0}^T (\mathbf{x}_t - \mathbf{\mu}_t)^T (\mathbf{x}_t - \mathbf{\mu}_t). \]</span></p>
<p>Alternatively, we can compute the negative log-likelihood of the true
state under the belief distribution (represented by a Gaussian of mean
<span class="math inline">\(\mathbf{\mu}_t\)</span> and covariance <span class="math inline">\(\mathbf{\Sigma}_t\)</span>):</p>
<p><span class="math display">\[ L_\mathrm{NLL} = \frac{1}{2T} \sum_{t=0}^T \log(|\mathbf{\Sigma}_t|) + (\mathbf{x}_t - \mathbf{\mu}_t)^T \mathbf{\Sigma}^{-1} (\mathbf{x}_t - \mathbf{\mu}_t). \]</span></p>
<h2 id="implementation-issues">Implementation issues</h2>
<p>We need to implement the filters (<a href="https://en.wikipedia.org/wiki/Extended_Kalman_filter">EKF</a>, <a href="https://en.wikipedia.org/wiki/Kalman_filter#Unscented_Kalman_filter">UKF</a>, <a href="https://en.wikipedia.org/wiki/Particle_filter">PF</a>) in a <a href="https://en.wikipedia.org/wiki/Differentiable_programming">differentiable programming</a> framework. The authors use <a href="https://en.wikipedia.org/wiki/Differentiable_programming">TensorFlow</a>. Their code is
available <a href="https://github.com/akloss/differentiable_filters">on GitHub</a>.</p>
<p>Some are easy because they use only differentiable operations (mostly
simple linear algebra). For the EKF, we also need to compute
Jacobians. This can be done automatically via automatic
differentiation, but the authors have encountered technical
difficulties with this (memory consumption or slow computations), so
they recommend computing Jacobians manually.<span class="sidenote-wrapper"><label for="sn-1" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sn-1" class="margin-toggle" /><span class="sidenote">It is not clear
whether this is a limitation of automatic differentiation, or of their
specific implementation with TensorFlow. Some other projects have
successfully computed Jacobians for EKFs with autodiff libraries, like
<a href="https://github.com/sisl/GaussianFilters.jl">GaussianFilters.jl</a> in Julia.<br />
<br />
</span></span></p>
<p>The particle filter has a resampling step that is not differentiable:
the gradient cannot be propagated to particles that are not selected
by the sampling step. There are apparently specific resampling
algorithms that help mitigate this issue in practice when training.</p>
<h2 id="conclusions">Conclusions</h2>
<p>Differentiable filters achieve better results with fewer parameters
than unstructured models like LSTMs, especially on complex tasks. The
paper runs extensive experiments on various toy models of various
complexity, although unfortunately no real-world application is shown.</p>
<p>Noise models with full covariance improve the tracking
accuracy. Heteroscedastic noise models improve it even more.</p>
<p>The main issue is to keep the training stable. They recommend the
differentiable extended Kalman filter for getting started, as it is
the most simple filter, and is less sensitive to hyperparameter
choices. If the task is strongly non-linear, one should use a
differentiable unscented Kalman filter or a differentiable particle
filter.</p>
<h2 class="unnumbered" id="references">References</h2>
<div id="refs" class="references csl-bib-body hanging-indent" data-entry-spacing="0" role="list">
<div id="ref-anderson2005_optim_filter" class="csl-entry" role="listitem">
Anderson, Brian D. O., and John B. Moore. 2005. <em>Optimal Filtering</em>. Dover Books on Electrical Engineering. Dover Publications.
</div>
<div id="ref-kloss2021_how" class="csl-entry" role="listitem">
Kloss, Alina, Georg Martius, and Jeannette Bohg. 2021. <span>“How to Train Your Differentiable Filter.”</span> <em>Autonomous Robots</em> 45 (4): 561–78. <a href="https://doi.org/10.1007/s10514-021-09990-9">https://doi.org/10.1007/s10514-021-09990-9</a>.
</div>
<div id="ref-thrun2006_probab_robot" class="csl-entry" role="listitem">
Thrun, Sebastian. 2006. <em>Probabilistic Robotics</em>. Cambridge, Massachusetts: The MIT Press. <a href="https://mitpress.mit.edu/books/probabilistic-robotics">https://mitpress.mit.edu/books/probabilistic-robotics</a>.
</div>
</div>
</section>
]]></summary>
</entry>
<entry>
    <title>The Dawning of the Age of Stochasticity</title>
    <link href="https://www.lozeve.com/posts/dawning-of-the-age-of-stochasticity.html" />
    <id>https://www.lozeve.com/posts/dawning-of-the-age-of-stochasticity.html</id>
    <published>2022-03-24T00:00:00Z</published>
    <updated>2022-03-24T00:00:00Z</updated>
    <summary type="html"><![CDATA[<section>
  <blockquote>
<p>Mumford, David. 2000. “The Dawning of the Age of Stochasticity.”
<em>Atti Della Accademia Nazionale Dei Lincei. Classe Di Scienze Fisiche,
Matematiche E Naturali. Rendiconti Lincei. Matematica E Applicazioni</em>
11: 107–25. <a href="http://eudml.org/doc/289648">http://eudml.org/doc/289648</a>.</p>
</blockquote>
<p>This article <span class="citation" data-cites="mumford2000_dawnin_age_stoch">(Mumford 2000)</span> is an interesting call
for a new set of foundations of mathematics on probability and
statistics. It argues that logic has had its time, and now we should
make random variables a first-class concept, as they would make for
better foundations.</p>
<h2 id="the-taxonomy-of-mathematics">The taxonomy of mathematics</h2>
<p><span class="sidenote-wrapper"><label for="sn-0" class="margin-toggle">⊕</label><input type="checkbox" id="sn-0" class="margin-toggle" /><span class="marginnote">This is probably the best definition of mathematics I have
seen. Before that, the most satisfying definition was “mathematics is
what mathematicians do”. It also raises an interesting question: what
would the study of <em>non-reproducible</em> mental objects be?<br />
<br />
</span></span></p>
<blockquote>
<p>The study of mental objects with reproducible properties is called mathematics.
<span class="citation" data-cites="davis2012_mathem_exper_study_edition">(Davis, Hersh, and Marchisotto 2012)</span></p>
</blockquote>
<p>What are the categories of reproducible mental objects? Mumford
considers the principal sub-fields of mathematics (geometry, analysis,
algebra, logic) and argues that they are indeed rooted in common
mental phenomena.</p>
<p>Of these, logic, and the notion of proposition, with an absolute truth
value attached to it, was made the foundation of all the
others. Mumford’s argument is that instead, the random variable is (or
should be) the “paradigmatic mental object”, on which all others can
be based. People are constantly weighing likelihoods, evaluating
plausibility, and sampling from posterior distributions to refine
estimates.</p>
<p>As such, random variables are rooted in our inspection of our own
mental processes, in the self-conscious analysis of our minds. Compare
to areas of mathematics arising from our experience with the physical
world, through our perception of space (geometry), of forces and
accelerations (analysis), or through composition of actions (algebra).</p>
<p>The paper then proceeds to do a quick historical overview of the
principal notions of probability, which mostly mirror the detailed
historical perspective in <span class="citation" data-cites="hacking2006_emerg_probab">(Hacking 2006)</span>. There is
also a short summary of the work into the foundations of mathematics.</p>
<p>Mumford also claims that although there were many advances in the
foundation of probability (e.g. Galton, Gibbs for statistical physics,
Keynes in economics, Wiener for control theory, Shannon for
information theory), most important statisticians (R. A. Fisher)
insisted on keeping the scope of statistics fairly limited to
empirical data: the so-called “frequentist” school. (This is a vision
of the whole frequentist vs Bayesian debate that I hadn’t seen
before. The Bayesian school can be seen as the one who claims that
statistical inference can be applied more widely, even to real-life
complex situations and thought processes. In this point of view, the
emergence of the probabilistic method in various areas of science
would be the strongest argument in favour of bayesianism.)</p>
<h2 id="what-is-a-random-variable">What is a “random variable”?</h2>
<p>Random variables are difficult to define. They are the core concept of
any course in probability of statistics, but their full, rigorous
definition relies on advanced measure theory, often unapproachable to
beginners. Nevertheless, practitioners tend to be productive with
basic introductions to probability and statistics, even without
being able to formulate the explicit definition.</p>
<p>Here, Mumford discusses the various definitions we can apply to the
notion of random variable, from an intuitive and a formal point of
view. The conclusion is essentially that a random variable is a
complex entity that do not easily accept a satisfying definition,
except from a purely formal and axiomatic point of view.</p>
<p>This situation is very similar to the one for the notion of
“set”. Everybody can manipulate them on an intuitive level and grasp
the basic properties, but the specific axioms are hard to grasp, and
no definition is fully satisfying, as the debates on the foundations
of mathematics can attest.</p>
<h2 id="putting-random-variables-into-the-foundations">Putting random variables into the foundations</h2>
<p>The usual way of defining random variables is:</p>
<ol type="1">
<li>predicate logic,</li>
<li>sets,</li>
<li>natural numbers,</li>
<li>real numbers,</li>
<li>measures,</li>
<li>random variables.</li>
</ol>
<p>Instead, we could put random variables at the foundations, and define
everything else in terms of that.</p>
<p>There is no complete formulation of such a foundation, nor is it clear
that it is possible. However, to make his case, Mumford presents two
developments. One is from <a href="https://en.wikipedia.org/wiki/Edwin_Thompson_Jaynes">E. T. Jaynes</a>, who has a complete formalism
of Bayesian probability from a notion of “plausibility”. With a few
axioms, we can obtain an isomorphism between an intuitive notion of
plausibility and a true probability function.</p>
<p>The other example is a proof that the continuum hypothesis is false,
using a probabilistic argument, due to Christopher Freiling. This
proof starts from a notion of random variable that is incompatible
with the usual definition in terms of measure theory. However, this
leads Mumford to question whether a foundation of mathematics based on
such a notion could get us rid of “one of the meaningless conundrums
of set theory”.</p>
<h2 id="stochastic-methods-have-invaded-classical-mathematics">Stochastic methods have invaded classical mathematics</h2>
<p>This is probably the most convincing argument to give a greater
importance to probability and statistical methods in the foundations
of mathematics: there tend to be everywhere, and extremely
productive. A prime example is obviously graph theory, where the
“probabilistic method” has had a deep impact, thanks notably to
Erdős. (See <span class="citation" data-cites="alon2016_probab_method">(Alon and Spencer 2016)</span> and <a href="https://www.college-de-france.fr/site/timothy-gowers/index.htm">Timothy Gowers’ lessons at the Collège de France</a><span class="sidenote-wrapper"><label for="sn-1" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sn-1" class="margin-toggle" /><span class="sidenote">In French, but see also <a href="https://www.youtube.com/c/TimothyGowers0">his YouTube channel</a>.<br />
<br />
</span></span> on the probabilistic method for combinatorics and number
theory.) Probabilistic methods also have a huge importance in the
analysis of differential equations, chaos theory, and mathematical
physics in general.</p>
<h2 id="thinking-as-bayesian-inference">Thinking as Bayesian inference</h2>
<p>I think this is not very controversial in cognitive science: we do not
think by composing propositions into syllogisms, but rather by
inferring probabilities of certain statements being true. Mumford
illustrates this very well with an example from Judea Pearl, which
uses graphical models to represent thought processes. There is also a
link with formal definitions of induction, such as PAC learning, which
is very present in machine learning.</p>
<p>I’ll conclude this post by quoting directly the last paragraph of the
article:</p>
<blockquote>
<p>My overall conclusion is that I believe stochastic methods will
transform pure and applied mathematics in the beginning of the third
millennium. Probability and statistics will come to be viewed as the
natural tools to use in mathematical as well as scientific modeling.
The intellectual world as a whole will come to view logic as a
beautiful elegant idealization but to view statistics as the standard
way in which we reason and think.</p>
</blockquote>
<h2 class="unnumbered" id="references">References</h2>
<div id="refs" class="references csl-bib-body hanging-indent" data-entry-spacing="0" role="list">
<div id="ref-alon2016_probab_method" class="csl-entry" role="listitem">
Alon, Noga, and Joel H. Spencer. 2016. <em>The Probabilistic Method</em>. 4th ed. Wiley.
</div>
<div id="ref-davis2012_mathem_exper_study_edition" class="csl-entry" role="listitem">
Davis, Philip J., Reuben Hersh, and Elena Anne Marchisotto. 2012. <em>The Mathematical Experience, Study Edition</em>. Modern Birkh<span>ä</span>user Classics. Birkh<span>ä</span>user Boston. <a href="https://doi.org/10.1007/978-0-8176-8295-8">https://doi.org/10.1007/978-0-8176-8295-8</a>.
</div>
<div id="ref-hacking2006_emerg_probab" class="csl-entry" role="listitem">
Hacking, Ian. 2006. <em>The Emergence of Probability: A Philosophical Study of Early Ideas about Probability, Induction and Statistical Inference</em>. 2nd ed. Cambridge University Press. <a href="https://doi.org/10.1017/CBO9780511817557">https://doi.org/10.1017/CBO9780511817557</a>.
</div>
<div id="ref-mumford2000_dawnin_age_stoch" class="csl-entry" role="listitem">
Mumford, David. 2000. <span>“The Dawning of the Age of Stochasticity.”</span> <em>Atti Della Accademia Nazionale Dei Lincei. Classe Di Scienze Fisiche, Matematiche e Naturali. Rendiconti Lincei. Matematica e Applicazioni</em> 11 (December): 107–25. <a href="http://eudml.org/doc/289648">http://eudml.org/doc/289648</a>.
</div>
</div>
</section>
]]></summary>
</entry>
<entry>
    <title>Planning and scheduling for project management</title>
    <link href="https://www.lozeve.com/posts/planning-and-scheduling.html" />
    <id>https://www.lozeve.com/posts/planning-and-scheduling.html</id>
    <published>2021-04-13T00:00:00Z</published>
    <updated>2021-04-13T00:00:00Z</updated>
    <summary type="html"><![CDATA[<section>
  <div id="toc"><h2>Table of Contents</h2><ul>
<li><a href="#general-project-management-workflow" id="toc-general-project-management-workflow">General project management workflow</a></li>
<li><a href="#the-project-scheduling-problem" id="toc-the-project-scheduling-problem">The project scheduling problem</a></li>
<li><a href="#classification-of-scheduling-problems" id="toc-classification-of-scheduling-problems">Classification of scheduling problems</a></li>
<li><a href="#algorithms-for-project-scheduling" id="toc-algorithms-for-project-scheduling">Algorithms for project scheduling</a>
<ul>
<li><a href="#without-workforce-constraints" id="toc-without-workforce-constraints">Without workforce constraints</a></li>
<li><a href="#with-workforce-constraints" id="toc-with-workforce-constraints">With workforce constraints</a></li>
<li><a href="#further-reading" id="toc-further-reading">Further reading</a></li>
</ul></li>
<li><a href="#automating-project-management" id="toc-automating-project-management">Automating project management</a></li>
<li><a href="#references" id="toc-references">References</a></li>
</ul></div>
<p>Every project, no matter its size, requires some kind of organization
and planning. Whether you’re thinking about what you need to do when
you wake up (shower, make breakfast, brush your teeth) or planning a
new space programme, you will need to think about the tasks, in what
order to do them, and how long it will take. This is called
<em>scheduling</em>.</p>
<p>Planning projects requires balancing dependencies between tasks,
resource allocation, and complex constraints in order to find a
complete and feasible schedule. How much of this can be made rigorous?
What is the limit of automation in this scenario?</p>
<p>In this post, I want to explore the problem of planning and scheduling
in the specific context of project management. The goal is to set up
the problem of project planning rigorously, and investigate what
techniques we can apply to have a better understanding of our projects
and reach our objectives faster.</p>
<h2 id="general-project-management-workflow">General project management workflow</h2>
<p>When starting a new project<span class="sidenote-wrapper"><label for="sn-0" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sn-0" class="margin-toggle" /><span class="sidenote">The definition of a project here is highly subjective,
and has been strongly influenced by what I’ve read (see the
references) and how I actually do things at work. In particular, most
of the model and concepts can be found in Microsoft Project.<br />
<br />
</span></span>, I generally follow a rough
workflow that goes like this:</p>
<ol type="1">
<li>Define the global constraints of the project: functional
specification, deadlines, overall resources available, etc.</li>
<li>Subdivide the projects into tasks and subtasks. Low-level tasks
should be self-contained and doable by few people (ideally only
one). Tasks can then be grouped together for better visualising
what is happening at various scales. This gives us a global
hierarchy of tasks, culminating in the overall project.</li>
<li>Specify the dependencies between tasks, ideally with an explicit
deliverable for each dependency relationship.</li>
<li>Estimate the work required for each task.</li>
<li>Affect a resource to each task, deriving task durations
accordingly. For instance, if Bob will be working part-time on this
task (because he has other things to do at the same time), the task
will take longer to complete than the nominal amount of work that
it requires.</li>
<li>Find an order in which to execute all the tasks, respecting
workforce and time constraints (Bob cannot spend 50% of this time
on three tasks simultaneously). This is called a <em>schedule</em>.</li>
<li>Iterate on the order until a minimal completion date is
found. Generally, the objective is to complete the project as soon
as possible, but there may be additional requirements (overall
deadline, lateness penalties, maximal resource utilization).</li>
</ol>
<p>Given this process, a natural question would be to ask: how can we
simplify it? What can we automate? The obvious task is the scheduling
part (steps 6 and 7): this step does not require any human
decision-making, and for which it will be difficult and tiresome to
achieve optimality. Most <a href="https://en.wikipedia.org/wiki/Project_management_software">project management software</a> (e.g. <a href="https://en.wikipedia.org/wiki/Microsoft_Project">Microsoft Project</a>) focus on this part.</p>
<p>However, in practice, resource allocation is also extremely time
consuming. Most importantly, it will constrain the final schedule: a
bad allocation can push back the final completion date by a wide
margin. Therefore, it makes sense to want to take into account both
resource allocation and task ordering at the same time when looking
for an optimal schedule.</p>
<p>Going even further, we could look into subdividing tasks further:
maybe splitting a task in two, allowing a small lag between the
completion of the first half and the start of the second half, could
improve the overall objective. By allowing <a href="https://en.wikipedia.org/wiki/Preemption_(computing)">preemption</a>, we could
optimize further our schedule.</p>
<p>To understand all of this, we’ll need to formalize our problem a
little bit. This will allow us to position it in the overall schema of
problems studied in the operations research literature, and use their
conclusions to choose the best approach as a trade-off between manual
and automatic scheduling.</p>
<h2 id="the-project-scheduling-problem">The project scheduling problem</h2>
<p>A <em>project</em> is simply a set of tasks<span class="sidenote-wrapper"><label for="sn-1" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sn-1" class="margin-toggle" /><span class="sidenote">A task is often called a <em>job</em> or an <em>activity</em> in project
scheduling. I will use these terms interchangeably.<br />
<br />
</span></span>. Each
task is a specific action with a certain amount of work that needs to
be done. More importantly, a task can <em>depend</em> on other tasks: for
instance, I can’t send the satellite in the space if you haven’t built
it yet.</p>
<p>Other <em>constraints</em> may also be present: there are nearly always
deadlines (my satellite needs to be up and running on 2024-05-12), and
sometimes other kind of temporal constraints (for legal reasons, I
can’t start building my death ray before 2023-01-01). Most
importantly, there are constraints on resource usage (I need either
Alice or Bob to work on these tasks, so I will be able to work on at
most two of them at the same time).</p>
<p>Finally, the <em>objective</em> is to finish the project (i.e. complete all
the tasks) as early as possible. This is called the <em>makespan</em>.</p>
<p>You may have noticed a nice pattern here: objective, constraints? We
have a great optimization problem! As it turns out, <a href="https://en.wikipedia.org/wiki/Schedule#In_operations_research">scheduling</a> is an
entire branch of operations research<span class="sidenote-wrapper"><label for="sn-2" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sn-2" class="margin-toggle" /><span class="sidenote">See my previous blog post on <a href="./operations-research-references.html">operations research</a>.<br />
<br />
</span></span>. In the
literature, this kind of problem is referred to as
<em>resource-constrained project scheduling</em>, or as <em>project scheduling
with workforce constraints</em>.</p>
<h2 id="classification-of-scheduling-problems">Classification of scheduling problems</h2>
<p>There is a lot of room to modify the problem to other settings.
<span class="citation" data-cites="brucker1999_resour_const_projec_sched">Brucker et al. (1999)</span> propose an interesting
classification scheme for project scheduling. In this system, any
problem can be represented by a triplet <span class="math inline">\(\alpha|\beta|\gamma\)</span>, where
<span class="math inline">\(\alpha\)</span> is the resource environment, <span class="math inline">\(\beta\)</span> are the activity
characteristics, and <span class="math inline">\(\gamma\)</span> is the objective function.</p>
<p>The <em>resource environment</em> <span class="math inline">\(\alpha\)</span> describes the available quantity
of each type of resources. Resource can be renewable, like people, who
supply a fixed quantity of work in each time period, or non-renewable,
like raw materials.</p>
<p>The <em>activity characteristics</em> <span class="math inline">\(\beta\)</span> describe how tasks are
constrained: how the dependencies are specified (with a graph, or with
temporal constraints between the starts and ends of different tasks),
whether there are global constraints like deadlines, and whether
processing times are constant for all tasks, can vary, or even can be
stochastic.</p>
<p>Finally, the <em>objective</em> <span class="math inline">\(\gamma\)</span> can be one of several
possibilities. The most common are the makespan which seeks to
minimize the total duration of the project, and resource-levelling
which seeks to minimize some measure of variation of resource
utilization.</p>
<p>Some important problems (<span class="math inline">\(\mathrm{PS}\)</span> means “project scheduling”
without any restrictions on resources):</p>
<ul>
<li><span class="math inline">\(\mathrm{PS} \;|\; \mathrm{prec} \;|\; C_{\max}\)</span>: the “simple”
project scheduling setup, which corresponds to the practical
application that interests us here. Although this is the base
problem, it is still quite challenging. Removing the resource
constraints renders the problem much easier from a computational
point of view <span class="citation" data-cites="pinedo2009_plann_sched_manuf_servic">(Pinedo 2009, chap. 4)</span>.</li>
<li><span class="math inline">\(\mathrm{PS} \;|\; \mathrm{temp} \;|\; C_{\max}\)</span>: when you add time
lag constraints (e.g. two tasks that must start within two days of
each other), the problem becomes much more difficult.</li>
<li><span class="math inline">\(\mathrm{PS} \;|\; \mathrm{temp} \;| \sum c_{k} f\left(r_{k}(S,
t)\right)\)</span>: this is the resource-levelling problem: you want to
minimize the costs of using an amount <span class="math inline">\(r_k(S, t)\)</span> of each resource
<span class="math inline">\(k\)</span>, when each unit of resource costs <span class="math inline">\(c_k\)</span>.</li>
</ul>
<h2 id="algorithms-for-project-scheduling">Algorithms for project scheduling</h2>
<h3 id="without-workforce-constraints">Without workforce constraints</h3>
<p>First, we need a way to represent a project. We can use the so-called
<em>job-on-node</em> format<span class="sidenote-wrapper"><label for="sn-3" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sn-3" class="margin-toggle" /><span class="sidenote">There is also a <em>job-on-arc</em> format that is apparently
widely used, but less practical in most applications.<br />
<br />
</span></span>. The nodes represent the tasks in
the precedence graph, and arcs represent the dependency relationships
between tasks.</p>
<p>This representation leads to a natural algorithm for project
scheduling in the absence of any resource constraints. The <a href="https://en.wikipedia.org/wiki/Critical_path_method">critical path method</a> (CPM) consists in finding a chain of dependent tasks in
the job-on-node graph that are <em>critical</em>: their completion time is
fixed by their dependencies.</p>
<p>It consists of two procedures, one to determine the earliest possible
completion time of each task (forward procedure), and one to determine
the latest possible completion time of each task that does not
increase total project duration (backward procedure). The tasks for
which these two times are equal form the <em>critical
path</em><span class="sidenote-wrapper"><label for="sn-4" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sn-4" class="margin-toggle" /><span class="sidenote">Note that the critical path is not necessarily
unique, and several critical paths may be overlapping.<br />
<br />
</span></span>. Non-critical tasks have a certain amount of
<em>slack</em>: it is possible to schedule them freely between the two
extremities without affecting the makespan.</p>
<p>An extension of the critical path method is the <a href="https://en.wikipedia.org/wiki/Program_evaluation_and_review_technique">program evaluation and review technique</a> (PERT). We still consider we have unlimited
resources, but the processing time of each task is allowed to be a
random variable instead of a fixed quantity. The algorithm must be
amended correspondingly to take into account pessimistic and
optimistic estimates of each task duration.</p>
<p>These techniques have been widely employed in various
industries<span class="sidenote-wrapper"><label for="sn-5" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sn-5" class="margin-toggle" /><span class="sidenote">Wikipedia tells us that <a href="https://en.wikipedia.org/wiki/Critical_path_method#History">CPM</a> and <a href="https://en.wikipedia.org/wiki/Program_evaluation_and_review_technique#History">PERT</a> were partly
developed by the US Navy, and applied to several large-scale projects,
like skyscraper buildings, aerospace and military projects, the
Manhattan project, etc.<br />
<br />
</span></span>, and show that the project scheduling
problem without workforce constraints can be solved extremely
efficiently. See <span class="citation" data-cites="pinedo2009_plann_sched_manuf_servic">Pinedo (2009)</span> for more
details on these algorithms and some examples.</p>
<h3 id="with-workforce-constraints">With workforce constraints</h3>
<p>With resource constraints, the problem becomes much harder to
solve. It is not possible to formulate this problem as a linear
program: workforce constraints are intrinsically combinatorial in
nature, so the problem is formulated as an integer
program<span class="sidenote-wrapper"><label for="sn-6" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sn-6" class="margin-toggle" /><span class="sidenote">The full integer program can be found in
<span class="citation" data-cites="pinedo2009_plann_sched_manuf_servic">Pinedo (2009, sec. 4.6)</span>.<br />
<br />
</span></span>.</p>
<p>The problem is modelled with 0-1 variables <span class="math inline">\(x_{jt}\)</span> which take the
value 1 if job <span class="math inline">\(j\)</span> is completed exactly at time <span class="math inline">\(t\)</span>, and 0
otherwise. The objective is to minimize the makespan, i.e. the
completion time of a dummy job that depends on all other jobs. There
are three constraints:</p>
<ul>
<li>if job <span class="math inline">\(j\)</span> is a dependency of job <span class="math inline">\(k\)</span>, the completion time of job
<span class="math inline">\(k\)</span> is larger than the completion time of job <span class="math inline">\(j\)</span> plus the duration
of job <span class="math inline">\(k\)</span>,</li>
<li>at any given time, we do not exceed the total amount of resources
available for each type of resources,</li>
<li>all jobs are completed at the end of the project.</li>
</ul>
<p>This problem quickly becomes challenging from a computational point of
view when the number of tasks increase. Variations on the <a href="https://en.wikipedia.org/wiki/Branch_and_bound">branch and bound</a> method have been developed to solve the resource-constrained
project scheduling problem efficiently, and in practice most
applications rely on heuristics to approximate the full
problem. However, even special cases may be extremely challenging to
solve. The project scheduling problem is a generalization of the <a href="https://en.wikipedia.org/wiki/Job_shop_scheduling">job shop scheduling problem</a>, which is itself a generalization of the
<a href="https://en.wikipedia.org/wiki/Travelling_salesman_problem">travelling salesman problem</a>: all of these are therefore NP-hard.</p>
<p>See <span class="citation" data-cites="brucker1999_resour_const_projec_sched">Brucker et al. (1999)</span> for a short survey of
algorithms and heuristics, and extensions to the harder problems
(multi-mode case, time-cost trade-offs, other objective
functions). <span class="citation" data-cites="pinedo2016_sched">Pinedo (2016)</span> contains a much more extensive
discussion of all kinds of scheduling problems, algorithms, and
implementation considerations.</p>
<h3 id="further-reading">Further reading</h3>
<p><span class="citation" data-cites="brucker1999_resour_const_projec_sched">Brucker et al. (1999)</span> is a great survey of the
algorithms available for project scheduling. For longer books,
<span class="citation" data-cites="pinedo2016_sched">Pinedo (2016)</span>, <span class="citation" data-cites="brucker2007_sched_algor">Brucker (2007)</span>,
<span class="citation" data-cites="conway2003_theor_sched">Conway, Maxwell, and Miller (2003)</span>, and <span class="citation" data-cites="leung2004_handb_sched">Leung (2004)</span> are good
references for the theoretical aspects, and
<span class="citation" data-cites="pinedo2009_plann_sched_manuf_servic">Pinedo (2009)</span> and
<span class="citation" data-cites="błażewicz2001_sched_comput_manuf_proces">Błażewicz et al. (2001)</span> for applications.</p>
<p><span class="citation" data-cites="atabakhsh1991_survey_const_based_sched_system">Atabakhsh (1991)</span>,
<span class="citation" data-cites="noronha1991_knowl_based_approac_sched_probl">Noronha and Sarma (1991)</span>, and
<span class="citation" data-cites="smith1992_knowl_based_produc_manag_approac_resul_prosp">Smith (1992)</span> contain
algorithms that use methods from artificial intelligence to complement
the traditional operations research approach.</p>
<h2 id="automating-project-management">Automating project management</h2>
<p>Let us review our workflow from the beginning. Even for the general
case of project scheduling with workforce and temporal constraints,
algorithms exist that are able to automate the entire scheduling
problem (except maybe for the largest projects). Additional
manipulations can easily be encoded with these two types of
constraints.</p>
<p>Most tools today seem to rely on a variant of CPM or
PERT<span class="sidenote-wrapper"><label for="sn-7" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sn-7" class="margin-toggle" /><span class="sidenote">This seems to be the case for <a href="https://en.wikipedia.org/wiki/Microsoft_Project#Features">Microsoft Project</a> at
least. However, I should note that it is an <em>enormous</em> piece of
software, and I barely scratched the surface of its capabilities. In
particular, it can do much more that project scheduling: there are
options for <a href="https://en.wikipedia.org/wiki/Resource_management">resource levelling</a> and budgeting, along with a lot of
visualization and reporting features (<a href="https://en.wikipedia.org/wiki/Gantt_chart">Gantt charts</a>).<br />
<br />
</span></span>. As a result, you still have to manually allocate
resources, which can be really time-consuming on large projects:
ensuring that each resource is not over-allocated, and finding which
task to reschedule while minimizing the impact on the overall project
duration is not obvious at all.</p>
<p>As a result, a tool that would allow me to choose the level of control
I want in resource allocation would be ideal. I could explicitly set
the resources used by some tasks, and add some global limits on which
resources are available for the overall project, and the algorithm
would do the rest.</p>
<p>We could then focus on automating further, allowing preemption of
tasks, time-cost trade-offs, etc. Finding the right abstractions and
selecting the best algorithm for each case would be a challenging
project, but I think it would be extremely interesting!</p>
<h2 class="unnumbered" id="references">References</h2>
<div id="refs" class="references csl-bib-body hanging-indent" data-entry-spacing="0" role="list">
<div id="ref-atabakhsh1991_survey_const_based_sched_system" class="csl-entry" role="listitem">
Atabakhsh, H. 1991. <span>“A Survey of Constraint Based Scheduling Systems Using an Artificial Intelligence Approach.”</span> <em>Artificial Intelligence in Engineering</em> 6 (2): 58–73. <a href="https://doi.org/10.1016/0954-1810(91)90001-5">https://doi.org/10.1016/0954-1810(91)90001-5</a>.
</div>
<div id="ref-błażewicz2001_sched_comput_manuf_proces" class="csl-entry" role="listitem">
Błażewicz, Jacek, Klaus H. Ecker, Erwin Pesch, Günter Schmidt, and Jan Węglarz. 2001. <em>Scheduling Computer and Manufacturing Processes</em>. 2nd ed. Springer Berlin Heidelberg. <a href="https://doi.org/10.1007/978-3-662-04363-9">https://doi.org/10.1007/978-3-662-04363-9</a>.
</div>
<div id="ref-brucker2007_sched_algor" class="csl-entry" role="listitem">
Brucker, Peter. 2007. <em>Scheduling Algorithms</em>. 5th ed. Springer Berlin Heidelberg. <a href="https://doi.org/10.1007/978-3-540-69516-5">https://doi.org/10.1007/978-3-540-69516-5</a>.
</div>
<div id="ref-brucker1999_resour_const_projec_sched" class="csl-entry" role="listitem">
Brucker, Peter, Andreas Drexl, Rolf Möhring, Klaus Neumann, and Erwin Pesch. 1999. <span>“Resource-Constrained Project Scheduling: Notation, Classification, Models, and Methods.”</span> <em>European Journal of Operational Research</em> 112 (1): 3–41. <a href="https://doi.org/10.1016/s0377-2217(98)00204-5">https://doi.org/10.1016/s0377-2217(98)00204-5</a>.
</div>
<div id="ref-conway2003_theor_sched" class="csl-entry" role="listitem">
Conway, Richard, William L. Maxwell, and Louis W. Miller. 2003. <em>Theory of Scheduling</em>. Mineola, N.Y: Dover.
</div>
<div id="ref-leung2004_handb_sched" class="csl-entry" role="listitem">
Leung, Joseph. 2004. <em>Handbook of Scheduling : Algorithms, Models, and Performance Analysis</em>. Boca Raton: Chapman &amp; Hall/CRC.
</div>
<div id="ref-noronha1991_knowl_based_approac_sched_probl" class="csl-entry" role="listitem">
Noronha, S. J., and V. V. S. Sarma. 1991. <span>“Knowledge-Based Approaches for Scheduling Problems: A Survey.”</span> <em>IEEE Transactions on Knowledge and Data Engineering</em> 3 (2): 160–71. <a href="https://doi.org/10.1109/69.87996">https://doi.org/10.1109/69.87996</a>.
</div>
<div id="ref-pinedo2009_plann_sched_manuf_servic" class="csl-entry" role="listitem">
Pinedo, Michael L. 2009. <em>Planning and Scheduling in Manufacturing and Services</em>. 2nd ed. Springer Series in Operations Research and Financial Engineering. New York: Springer. <a href="https://doi.org/10.1007/978-1-4419-0910-7">https://doi.org/10.1007/978-1-4419-0910-7</a>.
</div>
<div id="ref-pinedo2016_sched" class="csl-entry" role="listitem">
———. 2016. <em>Scheduling: Theory, Algorithms, and Systems</em>. 5th ed. Springer International Publishing. <a href="https://doi.org/10.1007/978-3-319-26580-3">https://doi.org/10.1007/978-3-319-26580-3</a>.
</div>
<div id="ref-smith1992_knowl_based_produc_manag_approac_resul_prosp" class="csl-entry" role="listitem">
Smith, Stephen F. 1992. <span>“Knowledge-Based Production Management Approaches, Results and Prospects.”</span> <em>Production Planning &amp; Control</em> 3 (4): 350–80. <a href="https://doi.org/10.1080/09537289208919407">https://doi.org/10.1080/09537289208919407</a>.
</div>
</div>
</section>
]]></summary>
</entry>
<entry>
    <title>Solving a problem with mathematical programming</title>
    <link href="https://www.lozeve.com/posts/ponder-this-2021-03.html" />
    <id>https://www.lozeve.com/posts/ponder-this-2021-03.html</id>
    <published>2021-04-02T00:00:00Z</published>
    <updated>2021-04-02T00:00:00Z</updated>
    <summary type="html"><![CDATA[<section>
  <p>Every month, IBM Research publish an interesting puzzle on their
<a href="https://www.research.ibm.com/haifa/ponderthis/index.shtml">Ponder This</a> page. Last month <a href="https://www.research.ibm.com/haifa/ponderthis/challenges/March2021.html">puzzle</a> was a nice optimization problem
about a rover exploring the surface of Mars.</p>
<p>In this post, I will explore how to formulate the problem as a
mixed-integer linear program (MILP)<span class="sidenote-wrapper"><label for="sn-0" class="margin-toggle">⊕</label><input type="checkbox" id="sn-0" class="margin-toggle" /><span class="marginnote">See <a href="./operations-research-references.html ">my previous post</a> for additional
background and references on operations research and optimization.<br />
<br />
</span></span>, and how
to solve it with Julia’s <a href="https://jump.dev/">JuMP</a> package.</p>
<h2 id="the-problem">The problem</h2>
<p>The surface of Mars is represented as a <span class="math inline">\(N \times N\)</span> grid, where each
cell has a “score” (i.e. a reward for exploring the cell), and a
constant exploration cost of 128. The goal is to find the set of cell
which maximizes the total score. There is an additional constraint:
each cell can only be explored if all its upper neighbors were also
explored.<span class="sidenote-wrapper"><label for="sn-1" class="margin-toggle">⊕</label><input type="checkbox" id="sn-1" class="margin-toggle" /><span class="marginnote">The full problem statement is <a href="https://www.research.ibm.com/haifa/ponderthis/challenges/March2021.html">here</a>, along with an
example on a small grid.<br />
<br />
</span></span></p>
<p>This problem has a typical structure: we have to choose some variables
to maximize a specific quantity, subject to some constraints. Here,
JuMP will make it easy for us to formulate and solve this problem,
with minimal code.</p>
<h2 id="solution">Solution</h2>
<p>The grid scores are represented as a 20 × 20 array of hexadecimal
numbers:</p>
<div class="sourceCode" id="cb1"><pre class="sourceCode txt"><code class="sourceCode default"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a>BC E6 56 29 99 95 AE 27 9F 89 88 8F BC B4 2A 71 44 7F AF 96</span>
<span id="cb1-2"><a href="#cb1-2" aria-hidden="true" tabindex="-1"></a>72 57 13 DD 08 44 9E A0 13 09 3F D5 AA 06 5E DB E1 EF 14 0B</span>
<span id="cb1-3"><a href="#cb1-3" aria-hidden="true" tabindex="-1"></a>42 B8 F3 8E 58 F0 FA 7F 7C BD FF AF DB D9 13 3E 5D D4 30 FB</span>
<span id="cb1-4"><a href="#cb1-4" aria-hidden="true" tabindex="-1"></a>60 CA B4 A1 73 E4 31 B5 B3 0C 85 DD 27 42 4F D0 11 09 28 39</span>
<span id="cb1-5"><a href="#cb1-5" aria-hidden="true" tabindex="-1"></a>1B 40 7C B1 01 79 52 53 65 65 BE 0F 4A 43 CD D7 A6 FE 7F 51</span>
<span id="cb1-6"><a href="#cb1-6" aria-hidden="true" tabindex="-1"></a>25 AB CC 20 F9 CC 7F 3B 4F 22 9C 72 F5 FE F9 BF A5 58 1F C7</span>
<span id="cb1-7"><a href="#cb1-7" aria-hidden="true" tabindex="-1"></a>EA B2 E4 F8 72 7B 80 A2 D7 C1 4F 46 D1 5E FA AB 12 40 82 7E</span>
<span id="cb1-8"><a href="#cb1-8" aria-hidden="true" tabindex="-1"></a>52 BF 4D 37 C6 5F 3D EF 56 11 D2 69 A4 02 0D 58 11 A7 9E 06</span>
<span id="cb1-9"><a href="#cb1-9" aria-hidden="true" tabindex="-1"></a>F6 B2 60 AF 83 08 4E 11 71 27 60 6F 9E 0A D3 19 20 F6 A3 40</span>
<span id="cb1-10"><a href="#cb1-10" aria-hidden="true" tabindex="-1"></a>B7 26 1B 3A 18 FE E3 3C FB DA 7E 78 CA 49 F3 FE 14 86 53 E9</span>
<span id="cb1-11"><a href="#cb1-11" aria-hidden="true" tabindex="-1"></a>1A 19 54 BD 1A 55 20 3B 59 42 8C 07 BA C5 27 A6 31 87 2A E2</span>
<span id="cb1-12"><a href="#cb1-12" aria-hidden="true" tabindex="-1"></a>36 82 E0 14 B6 09 C9 F5 57 5B 16 1A FA 1C 8A B2 DB F2 41 52</span>
<span id="cb1-13"><a href="#cb1-13" aria-hidden="true" tabindex="-1"></a>87 AC 9F CC 65 0A 4C 6F 87 FD 30 7D B4 FA CB 6D 03 64 CD 19</span>
<span id="cb1-14"><a href="#cb1-14" aria-hidden="true" tabindex="-1"></a>DC 22 FB B1 32 98 75 62 EF 1A 14 DC 5E 0A A2 ED 12 B5 CA C0</span>
<span id="cb1-15"><a href="#cb1-15" aria-hidden="true" tabindex="-1"></a>05 BE F3 1F CB B7 8A 8F 62 BA 11 12 A0 F6 79 FC 4D 97 74 4A</span>
<span id="cb1-16"><a href="#cb1-16" aria-hidden="true" tabindex="-1"></a>3C B9 0A 92 5E 8A DD A6 09 FF 68 82 F2 EE 9F 17 D2 D5 5C 72</span>
<span id="cb1-17"><a href="#cb1-17" aria-hidden="true" tabindex="-1"></a>76 CD 8D 05 61 BB 41 94 F9 FD 5C 72 71 21 54 3F 3B 32 E6 8F</span>
<span id="cb1-18"><a href="#cb1-18" aria-hidden="true" tabindex="-1"></a>45 3F 00 43 BB 07 1D 85 FC E2 24 CE 76 2C 96 40 10 FB 64 88</span>
<span id="cb1-19"><a href="#cb1-19" aria-hidden="true" tabindex="-1"></a>FB 89 D1 E3 81 0C E1 4C 37 B2 1D 60 40 D1 A5 2D 3B E4 85 87</span>
<span id="cb1-20"><a href="#cb1-20" aria-hidden="true" tabindex="-1"></a>E5 D7 05 D7 7D 9C C9 F5 70 0B 17 7B EF 18 83 46 79 0D 49 59 </span></code></pre></div>
<p>We can parse it easily with the <a href="https://docs.julialang.org/en/v1/stdlib/DelimitedFiles/">DelimitedFiles</a> module from Julia’s
standard library.
<span class="sidenote-wrapper"><label for="sn-2" class="margin-toggle">⊕</label><input type="checkbox" id="sn-2" class="margin-toggle" /><span class="marginnote"><img src="../images/ponderthis_202103_grid.svg" /><br />
<br />
</span></span></p>
<div class="sourceCode" id="cb2"><pre class="sourceCode julia"><code class="sourceCode julia"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a><span class="im">using</span> <span class="bu">DelimitedFiles</span></span>
<span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a><span class="kw">function</span> <span class="fu">readgrid</span>(filename)</span>
<span id="cb2-4"><a href="#cb2-4" aria-hidden="true" tabindex="-1"></a>    <span class="fu">open</span>(filename) <span class="cf">do</span> f</span>
<span id="cb2-5"><a href="#cb2-5" aria-hidden="true" tabindex="-1"></a>        <span class="fu">parse</span>.(<span class="dt">Int</span>, <span class="fu">readdlm</span>(f, <span class="dt">String</span>); base<span class="op">=</span><span class="fl">16</span>) <span class="op">.-</span> <span class="fl">128</span></span>
<span id="cb2-6"><a href="#cb2-6" aria-hidden="true" tabindex="-1"></a>    <span class="cf">end</span></span>
<span id="cb2-7"><a href="#cb2-7" aria-hidden="true" tabindex="-1"></a><span class="kw">end</span></span>
<span id="cb2-8"><a href="#cb2-8" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb2-9"><a href="#cb2-9" aria-hidden="true" tabindex="-1"></a>grid <span class="op">=</span> <span class="fu">readgrid</span>(<span class="st">&quot;grid.txt&quot;</span>)</span></code></pre></div>
<p>We now need to define the actual optimization problem. First, we load
JuMP and a <a href="https://jump.dev/JuMP.jl/stable/installation/#Supported-solvers">solver</a> which supports MILP (for instance <a href="https://www.gnu.org/software/glpk/">GLPK</a>).</p>
<div class="sourceCode" id="cb3"><pre class="sourceCode julia"><code class="sourceCode julia"><span id="cb3-1"><a href="#cb3-1" aria-hidden="true" tabindex="-1"></a><span class="im">using</span> <span class="bu">JuMP</span>, <span class="bu">GLPK</span></span></code></pre></div>
<p>Defining a model consists of three stages<span class="sidenote-wrapper"><label for="sn-3" class="margin-toggle">⊕</label><input type="checkbox" id="sn-3" class="margin-toggle" /><span class="marginnote">Check out the <a href="https://jump.dev/JuMP.jl/stable/quickstart/">Quick Start Guide</a> for more info.<br />
<br />
</span></span>:</p>
<ul>
<li>declare some variables, their types, and their bounds,</li>
<li>add some constraints,</li>
<li>specify an objective.</li>
</ul>
<p>In our case, we have a single binary variable for each cell, which
will be 1 if the cell is explored by the rover and 0 otherwise. After
creating the model, we use the <code>@variable</code> macro to declare our
variable <code>x</code> of size <code>(n, n)</code>.</p>
<div class="sourceCode" id="cb4"><pre class="sourceCode julia"><code class="sourceCode julia"><span id="cb4-1"><a href="#cb4-1" aria-hidden="true" tabindex="-1"></a>n <span class="op">=</span> <span class="fu">size</span>(grid, <span class="fl">1</span>)</span>
<span id="cb4-2"><a href="#cb4-2" aria-hidden="true" tabindex="-1"></a>model <span class="op">=</span> <span class="fu">Model</span>(GLPK.Optimizer)</span>
<span id="cb4-3"><a href="#cb4-3" aria-hidden="true" tabindex="-1"></a><span class="pp">@variable</span>(model, x[<span class="fl">1</span><span class="op">:</span>n, <span class="fl">1</span><span class="op">:</span>n], Bin)</span></code></pre></div>
<p>The “upper neighbors” of a cell <code>(i, j)</code> are <code>[(i-1, j-1), (i-1, j), (i-1, j+1)]</code>. Ensuring that a cell is explored only if all of its
upper neighbors are also explored means ensuring that <code>x[i, j]</code> is 1
only if it is also 1 for all the upper neighbors. We also have to
check that these neighbors are not outside the grid.</p>
<div class="sourceCode" id="cb5"><pre class="sourceCode julia"><code class="sourceCode julia"><span id="cb5-1"><a href="#cb5-1" aria-hidden="true" tabindex="-1"></a><span class="cf">for</span> i <span class="op">=</span> <span class="fl">2</span><span class="op">:</span>n, j <span class="op">=</span> <span class="fl">1</span><span class="op">:</span>n</span>
<span id="cb5-2"><a href="#cb5-2" aria-hidden="true" tabindex="-1"></a>    <span class="cf">if</span> j <span class="op">&gt;</span> <span class="fl">1</span></span>
<span id="cb5-3"><a href="#cb5-3" aria-hidden="true" tabindex="-1"></a>        <span class="pp">@constraint</span>(model, x[i, j] <span class="op">&lt;=</span> x[i<span class="op">-</span><span class="fl">1</span>, j<span class="op">-</span><span class="fl">1</span>])</span>
<span id="cb5-4"><a href="#cb5-4" aria-hidden="true" tabindex="-1"></a>    <span class="cf">end</span></span>
<span id="cb5-5"><a href="#cb5-5" aria-hidden="true" tabindex="-1"></a>    <span class="pp">@constraint</span>(model, x[i, j] <span class="op">&lt;=</span> x[i<span class="op">-</span><span class="fl">1</span>, j])</span>
<span id="cb5-6"><a href="#cb5-6" aria-hidden="true" tabindex="-1"></a>    <span class="cf">if</span> j <span class="op">&lt;</span> n</span>
<span id="cb5-7"><a href="#cb5-7" aria-hidden="true" tabindex="-1"></a>        <span class="pp">@constraint</span>(model, x[i, j] <span class="op">&lt;=</span> x[i<span class="op">-</span><span class="fl">1</span>, j<span class="op">+</span><span class="fl">1</span>])</span>
<span id="cb5-8"><a href="#cb5-8" aria-hidden="true" tabindex="-1"></a>    <span class="cf">end</span></span>
<span id="cb5-9"><a href="#cb5-9" aria-hidden="true" tabindex="-1"></a><span class="cf">end</span></span></code></pre></div>
<p>Finally, the objective is to maximize the total of all rewards on explored cells:</p>
<div class="sourceCode" id="cb6"><pre class="sourceCode julia"><code class="sourceCode julia"><span id="cb6-1"><a href="#cb6-1" aria-hidden="true" tabindex="-1"></a><span class="pp">@objective</span>(model, Max, <span class="fu">sum</span>(grid[i, j] <span class="op">*</span> x[i, j] <span class="cf">for</span> i <span class="op">=</span> <span class="fl">1</span><span class="op">:</span>n, j <span class="op">=</span> <span class="fl">1</span><span class="op">:</span>n))</span></code></pre></div>
<p>We now can send our model to the solver to be optimized. We retrieve
the objective value and the values of our variable <code>x</code>, and do some
additional processing to get it in the expected format (0-based
indices while Julia uses 1-based indexing).<span class="sidenote-wrapper"><label for="sn-4" class="margin-toggle">⊕</label><input type="checkbox" id="sn-4" class="margin-toggle" /><span class="marginnote">In practice, you should also check that the solver
actually found an optimal solution, didn’t find that the model is
infeasible, and did not run into numerical issues, using
<code>termination_status(model)</code>.<br />
<br />
</span></span></p>
<div class="sourceCode" id="cb7"><pre class="sourceCode julia"><code class="sourceCode julia"><span id="cb7-1"><a href="#cb7-1" aria-hidden="true" tabindex="-1"></a><span class="fu">optimize!</span>(model)</span>
<span id="cb7-2"><a href="#cb7-2" aria-hidden="true" tabindex="-1"></a>obj <span class="op">=</span> <span class="fu">Int</span>(<span class="fu">objective_value</span>(model))</span>
<span id="cb7-3"><a href="#cb7-3" aria-hidden="true" tabindex="-1"></a>indices <span class="op">=</span> <span class="fu">Tuple</span>.(<span class="fu">findall</span>(<span class="fu">value</span>.(x) <span class="op">.&gt;</span> <span class="fl">0</span>))</span>
<span id="cb7-4"><a href="#cb7-4" aria-hidden="true" tabindex="-1"></a>indices <span class="op">=</span> <span class="fu">sort</span>([(a<span class="op">-</span><span class="fl">1</span>, b<span class="op">-</span><span class="fl">1</span>) for (a, b) <span class="op">=</span> indices])</span></code></pre></div>
<p><span class="sidenote-wrapper"><label for="sn-5" class="margin-toggle">⊕</label><input type="checkbox" id="sn-5" class="margin-toggle" /><span class="marginnote"><img src="../images/ponderthis_202103_explore.svg" /><br />
<br />
</span></span></p>
<p>The resulting objective value is 1424, and the explored indices are</p>
<div class="sourceCode" id="cb8"><pre class="sourceCode txt"><code class="sourceCode default"><span id="cb8-1"><a href="#cb8-1" aria-hidden="true" tabindex="-1"></a>[(0, 0), (0, 1), (0, 2), (0, 3), (0, 4), (0, 5), (0, 6), (0, 7), (0, 8), (0, 9),</span>
<span id="cb8-2"><a href="#cb8-2" aria-hidden="true" tabindex="-1"></a> (0, 10), (0, 11), (0, 12), (0, 13), (0, 14), (0, 15), (0, 16), (0, 17), (0, 18),</span>
<span id="cb8-3"><a href="#cb8-3" aria-hidden="true" tabindex="-1"></a> (0, 19), (1, 0), (1, 1), (1, 2), (1, 3), (1, 4), (1, 5), (1, 6), (1, 7), (1, 8),</span>
<span id="cb8-4"><a href="#cb8-4" aria-hidden="true" tabindex="-1"></a> (1, 9), (1, 10), (1, 11), (1, 12), (1, 13), (1, 14), (1, 15), (1, 16), (1, 17),</span>
<span id="cb8-5"><a href="#cb8-5" aria-hidden="true" tabindex="-1"></a> (1, 18), (1, 19), (2, 0), (2, 1), (2, 2), (2, 3), (2, 4), (2, 5), (2, 6), (2, 7),</span>
<span id="cb8-6"><a href="#cb8-6" aria-hidden="true" tabindex="-1"></a> (2, 8), (2, 9), (2, 10), (2, 11), (2, 12), (2, 13), (2, 14), (2, 15), (2, 16),</span>
<span id="cb8-7"><a href="#cb8-7" aria-hidden="true" tabindex="-1"></a> (2, 17), (2, 18), (2, 19), (3, 0), (3, 1), (3, 2), (3, 3), (3, 4), (3, 5), (3, 6),</span>
<span id="cb8-8"><a href="#cb8-8" aria-hidden="true" tabindex="-1"></a> (3, 7), (3, 8), (3, 9), (3, 10), (3, 11), (3, 12), (3, 13), (3, 14), (3, 15),</span>
<span id="cb8-9"><a href="#cb8-9" aria-hidden="true" tabindex="-1"></a> (3, 16), (3, 17), (3, 18), (4, 0), (4, 1), (4, 2), (4, 3), (4, 4), (4, 5), (4, 6),</span>
<span id="cb8-10"><a href="#cb8-10" aria-hidden="true" tabindex="-1"></a> (4, 7), (4, 8), (4, 9), (4, 10), (4, 11), (4, 12), (4, 13), (4, 14), (4, 15),</span>
<span id="cb8-11"><a href="#cb8-11" aria-hidden="true" tabindex="-1"></a> (4, 16), (4, 17), (5, 0), (5, 1), (5, 2), (5, 3), (5, 4), (5, 5), (5, 6), (5, 7),</span>
<span id="cb8-12"><a href="#cb8-12" aria-hidden="true" tabindex="-1"></a> (5, 8), (5, 9), (5, 10), (5, 11), (5, 12), (5, 13), (5, 14), (5, 15), (5, 16),</span>
<span id="cb8-13"><a href="#cb8-13" aria-hidden="true" tabindex="-1"></a> (6, 0), (6, 1), (6, 2), (6, 3), (6, 4), (6, 5), (6, 6), (6, 7), (6, 8), (6, 9),</span>
<span id="cb8-14"><a href="#cb8-14" aria-hidden="true" tabindex="-1"></a> (6, 12), (6, 14), (6, 15), (7, 0), (7, 1), (7, 2), (7, 4), (7, 7), (8, 0), (8, 1),</span>
<span id="cb8-15"><a href="#cb8-15" aria-hidden="true" tabindex="-1"></a> (9, 0)]</span></code></pre></div>
<h2 id="exporting-the-model-for-external-solvers">Exporting the model for external solvers</h2>
<p>JuMP supports a wide variety of solvers, and this model is quite small
so open-source solvers are more than sufficient. However, let’s see
how to use the <a href="https://neos-server.org/neos/">NEOS Server</a> to give this problem to state-of-the-art
solvers!</p>
<p>Depending on the solver you plan to use, you will have to submit the
problem in a specific format. Looking at the <a href="https://neos-server.org/neos/solvers/index.html">solvers page</a>, we can use
<a href="https://www.gurobi.com/documentation/9.1/refman/mps_format.html">MPS</a> or <a href="https://www.gurobi.com/documentation/9.1/refman/lp_format.html">LP</a> format to use CPLEX or Gurobi for instance. Luckily, JuMP
(or more accurately <a href="https://github.com/jump-dev/MathOptInterface.jl">MathOptInterface</a>) supports these formats (among
<a href="https://jump.dev/MathOptInterface.jl/stable/apireference/#File-Formats">others</a>).</p>
<div class="sourceCode" id="cb9"><pre class="sourceCode julia"><code class="sourceCode julia"><span id="cb9-1"><a href="#cb9-1" aria-hidden="true" tabindex="-1"></a><span class="fu">write_to_file</span>(model, <span class="st">&quot;rover.lp&quot;</span>)  <span class="co"># or &quot;rover.mps&quot;</span></span></code></pre></div>
<p>We can now upload this file to the NEOS Server, and sure enough, a few
seconds later, we get Gurobi’s output:</p>
<div class="sourceCode" id="cb10"><pre class="sourceCode txt"><code class="sourceCode default"><span id="cb10-1"><a href="#cb10-1" aria-hidden="true" tabindex="-1"></a>Gurobi Optimizer version 9.1.1 build v9.1.1rc0 (linux64)</span>
<span id="cb10-2"><a href="#cb10-2" aria-hidden="true" tabindex="-1"></a>Thread count: 32 physical cores, 64 logical processors, using up to 4 threads</span>
<span id="cb10-3"><a href="#cb10-3" aria-hidden="true" tabindex="-1"></a>Optimize a model with 1102 rows, 400 columns and 2204 nonzeros</span>
<span id="cb10-4"><a href="#cb10-4" aria-hidden="true" tabindex="-1"></a>Model fingerprint: 0x69169161</span>
<span id="cb10-5"><a href="#cb10-5" aria-hidden="true" tabindex="-1"></a>Variable types: 0 continuous, 400 integer (400 binary)</span>
<span id="cb10-6"><a href="#cb10-6" aria-hidden="true" tabindex="-1"></a>Coefficient statistics:</span>
<span id="cb10-7"><a href="#cb10-7" aria-hidden="true" tabindex="-1"></a>  Matrix range     [1e+00, 1e+00]</span>
<span id="cb10-8"><a href="#cb10-8" aria-hidden="true" tabindex="-1"></a>  Objective range  [1e+00, 1e+02]</span>
<span id="cb10-9"><a href="#cb10-9" aria-hidden="true" tabindex="-1"></a>  Bounds range     [1e+00, 1e+00]</span>
<span id="cb10-10"><a href="#cb10-10" aria-hidden="true" tabindex="-1"></a>  RHS range        [0e+00, 0e+00]</span>
<span id="cb10-11"><a href="#cb10-11" aria-hidden="true" tabindex="-1"></a>Found heuristic solution: objective 625.0000000</span>
<span id="cb10-12"><a href="#cb10-12" aria-hidden="true" tabindex="-1"></a>Presolve removed 116 rows and 45 columns</span>
<span id="cb10-13"><a href="#cb10-13" aria-hidden="true" tabindex="-1"></a>Presolve time: 0.01s</span>
<span id="cb10-14"><a href="#cb10-14" aria-hidden="true" tabindex="-1"></a>Presolved: 986 rows, 355 columns, 1972 nonzeros</span>
<span id="cb10-15"><a href="#cb10-15" aria-hidden="true" tabindex="-1"></a>Variable types: 0 continuous, 355 integer (355 binary)</span>
<span id="cb10-16"><a href="#cb10-16" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb10-17"><a href="#cb10-17" aria-hidden="true" tabindex="-1"></a>Root relaxation: objective 1.424000e+03, 123 iterations, 0.00 seconds</span>
<span id="cb10-18"><a href="#cb10-18" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb10-19"><a href="#cb10-19" aria-hidden="true" tabindex="-1"></a>    Nodes    |    Current Node    |     Objective Bounds      |     Work</span>
<span id="cb10-20"><a href="#cb10-20" aria-hidden="true" tabindex="-1"></a> Expl Unexpl |  Obj  Depth IntInf | Incumbent    BestBd   Gap | It/Node Time</span>
<span id="cb10-21"><a href="#cb10-21" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb10-22"><a href="#cb10-22" aria-hidden="true" tabindex="-1"></a>*    0     0               0    1424.0000000 1424.00000  0.00%     -    0s</span>
<span id="cb10-23"><a href="#cb10-23" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb10-24"><a href="#cb10-24" aria-hidden="true" tabindex="-1"></a>Explored 0 nodes (123 simplex iterations) in 0.01 seconds</span>
<span id="cb10-25"><a href="#cb10-25" aria-hidden="true" tabindex="-1"></a>Thread count was 4 (of 64 available processors)</span>
<span id="cb10-26"><a href="#cb10-26" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb10-27"><a href="#cb10-27" aria-hidden="true" tabindex="-1"></a>Solution count 2: 1424 625</span>
<span id="cb10-28"><a href="#cb10-28" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb10-29"><a href="#cb10-29" aria-hidden="true" tabindex="-1"></a>Optimal solution found (tolerance 1.00e-04)</span>
<span id="cb10-30"><a href="#cb10-30" aria-hidden="true" tabindex="-1"></a>Best objective 1.424000000000e+03, best bound 1.424000000000e+03, gap 0.0000%</span>
<span id="cb10-31"><a href="#cb10-31" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb10-32"><a href="#cb10-32" aria-hidden="true" tabindex="-1"></a>********** Begin .sol file *************</span>
<span id="cb10-33"><a href="#cb10-33" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb10-34"><a href="#cb10-34" aria-hidden="true" tabindex="-1"></a># Solution for model obj</span>
<span id="cb10-35"><a href="#cb10-35" aria-hidden="true" tabindex="-1"></a># Objective value = 1424</span>
<span id="cb10-36"><a href="#cb10-36" aria-hidden="true" tabindex="-1"></a>[...]</span></code></pre></div>
<p>We get the same solution!</p>
<h2 id="code">Code</h2>
<p>My complete solution is available <a href="https://github.com/dlozeve/ponder-this/blob/master/202103/rover.jl">on GitHub</a>.</p>
</section>
]]></summary>
</entry>

</feed>
