<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
    <channel>
        <title>Dimitri Lozeve's Blog</title>
        <link>https://www.lozeve.com</link>
        <description><![CDATA[Recent posts]]></description>
        <atom:link href="https://www.lozeve.com/rss.xml" rel="self"
                   type="application/rss+xml" />
        <lastBuildDate>Wed, 13 May 2026 00:00:00 UT</lastBuildDate>
        <item>
    <title>Overengineering Dice Rolling</title>
    <link>https://www.lozeve.com/posts/diceroll.html</link>
    <description><![CDATA[<section>
  <p>We have a weekly D&amp;D game with a few of my colleagues. We play at
lunchtime. A few weeks ago, something terrible happened: our DM forgot
his dice.</p>
<p><img src="../images/diceroll/slack_screenshot_1.png" />
<span class="sidenote-wrapper"><label for="sn-0" class="margin-toggle">⊕</label><input type="checkbox" id="sn-0" class="margin-toggle" /><span class="marginnote">Translation from the French: “I didn’t realize, but I didn’t
bring any of my D&amp;D stuff at all. I have 0 dice, 0 screen”<br />
<br />
</span></span></p>
<p>There are already thousands of websites and mobile apps for rolling
dice in role-playing games, but that would be boring. What if I made
one myself? <span class="sidenote-wrapper"><label for="sn-1" class="margin-toggle">⊕</label><input type="checkbox" id="sn-1" class="margin-toggle" /><span class="marginnote">Another tragic event happened in game: my character
died during combat. And this particular warlock believes in a higher
power, and the coming of the Great Old One in an apocalypse that will
consume the world. The warlock’s interpretation of his death is that
fate is rigged, and the dice are probably wrong.<br />
<br />
</span></span></p>
<p>The initial idea was very simple. Character sheets and player
handbooks use “dice expressions” like <code class="verbatim">6d4 + 4</code>. This means “roll six
four-sided dice, add the values, and add 4”. This gives you the damage
you deal to a monster while hitting it with your amazing sword or
badass magical spell. So we could simply type a dice expression, and
the tool would simulate the dice rolls, compute the sum, and give you
the result. Quick and easy.</p>
<p>Dear reader: it ended up doing way more than that. Way too much.</p>
<h2 id="the-first-version">The first version</h2>
<p>Famously, all problems can be solved with regular expressions. So my
first version was basically this regex:
<code>(?&lt;sign&gt;[+-]?)(?:(?&lt;count&gt;\d*)[dD](?&lt;sides&gt;\d+)|(?&lt;num&gt;\d+))</code>, along
with some code that went through each capture group and called
<code>random_range(1..=6)</code> if it encountered <code class="verbatim">d6</code>.</p>
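<p>To make the idea concrete, here is a minimal sketch of what such a first version could look like. It is not the original code: it swaps the regex for a hand-written split on signs, and it uses a tiny xorshift generator instead of a real random number library, so that it only needs the standard library. The names <code>Term</code>, <code>XorShift64</code>, <code>parse</code> and <code>evaluate</code> are made up for this illustration.</p>

```rust
/// One term of a dice expression (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
enum Term {
    /// `count` dice with `sides` faces, added (sign = 1) or subtracted (sign = -1).
    Dice { sign: i64, count: u32, sides: u32 },
    /// A signed constant such as `+ 4`.
    Constant(i64),
}

/// Tiny xorshift64 generator, a stand-in for a real RNG crate.
/// The seed must be non-zero.
struct XorShift64(u64);

impl XorShift64 {
    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }

    /// Roll one die: a value in 1..=sides (modulo bias is ignored here).
    fn roll(&mut self, sides: u32) -> u32 {
        (self.next_u64() % sides as u64) as u32 + 1
    }
}

/// Parse simple expressions such as "6d4 + 4" or "d20-1".
/// Returns `None` on malformed input.
fn parse(expr: &str) -> Option<Vec<Term>> {
    let compact: String = expr.chars().filter(|c| !c.is_whitespace()).collect();
    let mut terms = Vec::new();
    let mut rest = compact.as_str();
    while !rest.is_empty() {
        // Optional leading sign.
        let sign = if let Some(r) = rest.strip_prefix('-') {
            rest = r;
            -1
        } else {
            rest = rest.strip_prefix('+').unwrap_or(rest);
            1
        };
        // The term runs until the next sign or the end of the input.
        let end = rest.find(|c: char| c == '+' || c == '-').unwrap_or(rest.len());
        let (term, tail) = rest.split_at(end);
        rest = tail;
        match term.to_ascii_lowercase().split_once('d') {
            Some((count, sides)) => {
                // A missing count means a single die ("d20").
                let count: u32 = if count.is_empty() { 1 } else { count.parse().ok()? };
                let sides: u32 = sides.parse().ok()?;
                if sides == 0 {
                    return None;
                }
                terms.push(Term::Dice { sign, count, sides });
            }
            None => terms.push(Term::Constant(sign * term.parse::<i64>().ok()?)),
        }
    }
    Some(terms)
}

/// Roll every die and add everything up.
fn evaluate(terms: &[Term], rng: &mut XorShift64) -> i64 {
    terms
        .iter()
        .map(|term| match term {
            Term::Dice { sign, count, sides } => {
                sign * (0..*count).map(|_| rng.roll(*sides) as i64).sum::<i64>()
            }
            Term::Constant(value) => *value,
        })
        .sum()
}
```

<p>With these definitions, parsing <code>6d4 + 4</code> and calling <code>evaluate</code> with a generator seeded with any non-zero value gives a total between 10 and 28.</p>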
<p>Then it’s just a matter of reading the command-line arguments, parsing
them with the regex, and printing the result. I did it in Rust because
I was already familiar with it, and I could quickly whip up a small
command-line utility like this. And it gave me a single binary that I
could share with the team.</p>
<p><img src="../images/diceroll/slack_screenshot_2.png" /></p>
<p>Great.</p>
<p>But not enough.</p>
<h2 id="the-rabbit-hole">The rabbit hole</h2>
<p>Even a cursory glance at the player’s handbook will let you know that
dice expressions can become <em>much</em> more complex than that. And my tool
would not be a <em>serious</em> tool if it didn’t support everything, right?
So I rewrote the parser to use a proper parser combinator library
(<a href="https://crates.io/crates/nom">nom</a>) instead of a regex, and I happily started to implement necessary
and not-so-necessary features:</p>
<ul>
<li>adding and subtracting dice expressions: <code class="verbatim">4d10 - 2d3 + 4</code>,</li>
<li>parenthesized groups and multipliers: <code class="verbatim">d20 + (2d6+3)*2 + 5</code>,</li>
<li>keep/drop highest/lowest rolls: <code class="verbatim">2d20kh1</code> rolls a <code class="verbatim">d20</code> with
advantage, <code class="verbatim">4d6dl2</code> drops the two lowest values,</li>
<li>re-rolls: <code class="verbatim">6d10r</code> re-rolls 1s until the die stops showing 1,</li>
<li>exploding dice, minimum and maximum results, count matching, and so
on. All of these modifiers can obviously be combined!</li>
</ul>
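<p>Once the dice have been rolled, most of these modifiers boil down to small operations on the list of results. The following sketch shows three of them under simplifying assumptions; the function names are hypothetical and do not come from the actual library.</p>

```rust
/// Keep the `n` highest rolls and sum them (what `2d20kh1` asks for).
fn keep_highest(rolls: &[u32], n: usize) -> u32 {
    let mut sorted = rolls.to_vec();
    sorted.sort_unstable_by(|a, b| b.cmp(a)); // descending order
    sorted.iter().take(n).sum()
}

/// Drop the `n` lowest rolls and sum the rest (what `4d6dl2` asks for).
fn drop_lowest(rolls: &[u32], n: usize) -> u32 {
    let kept = rolls.len().saturating_sub(n);
    keep_highest(rolls, kept)
}

/// Re-roll for as long as the die shows 1.
/// `roll` is any source of fresh rolls, for instance a closure around an RNG.
fn reroll_ones(mut roll: impl FnMut() -> u32) -> u32 {
    loop {
        let value = roll();
        if value != 1 {
            return value;
        }
    }
}
```

<p>For instance, <code>keep_highest(&amp;[7, 15], 1)</code> returns 15, which is a <code class="verbatim">d20</code> rolled with advantage.</p>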
<p>But what about the user experience? Surely our dear users<span class="sidenote-wrapper"><label for="sn-2" class="margin-toggle">⊕</label><input type="checkbox" id="sn-2" class="margin-toggle" /><span class="marginnote">At
this stage, “users” means “myself”.<br />
<br />
</span></span> will grow tired of the simple CLI?
So I added a REPL, and JSON output for programmatic use,
and a web server because why not? Someone might want to use it through
an HTTP API!</p>
<p>At that point it also stopped being just a dice rolling utility. I’m
quite happy with the <code class="verbatim">stats</code> command, which rolls the dice expression
many times and computes statistics, which is useful to see what you
can expect from a given combination. What is the standard deviation of
the damage dealt by your weapon? What chance do you have of hitting
that goblin with your spell?</p>
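<p>For plain sums of dice, these questions can also be answered exactly, without simulation. The sketch below is a different technique from the repeated rolls described above: it builds the full probability distribution of the sum by convolution, then reads the mean, the standard deviation and the chance of reaching a target from it. All function names are hypothetical.</p>

```rust
/// Distribution of the sum of `count` fair dice with `sides` faces:
/// `dist[s]` is the probability that the sum equals `s`.
fn sum_distribution(count: usize, sides: usize) -> Vec<f64> {
    let mut dist = vec![1.0]; // with zero dice, the sum is 0 with probability 1
    for _ in 0..count {
        let mut next = vec![0.0; dist.len() + sides];
        for (total, p) in dist.iter().enumerate() {
            for face in 1..=sides {
                next[total + face] += p / sides as f64;
            }
        }
        dist = next;
    }
    dist
}

/// Mean and standard deviation of the dice sum plus a constant `bonus`.
/// The bonus shifts the mean but leaves the standard deviation unchanged.
fn mean_and_std_dev(dist: &[f64], bonus: f64) -> (f64, f64) {
    let mean: f64 = dist.iter().enumerate().map(|(s, p)| s as f64 * p).sum();
    let variance: f64 = dist
        .iter()
        .enumerate()
        .map(|(s, p)| (s as f64 - mean).powi(2) * p)
        .sum();
    (mean + bonus, variance.sqrt())
}

/// Probability that the dice sum plus `bonus` reaches at least `target`.
fn chance_at_least(dist: &[f64], bonus: i64, target: i64) -> f64 {
    dist.iter()
        .enumerate()
        .filter(|(s, _)| *s as i64 + bonus >= target)
        .map(|(_, p)| *p)
        .sum()
}
```

<p>For <code class="verbatim">6d4 + 4</code> this gives a mean of 19, and a <code class="verbatim">d20 + 5</code> reaches 15 or more with probability 0.55. Simulation remains the more general approach once modifiers such as re-rolls or exploding dice are involved.</p>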
<h2 id="the-website">The website</h2>
<p>At this point I was actively looking for opportunities to overengineer
this even more. What useless feature could I possibly add?</p>
<p>Well, I had a nice command-line tool in Rust. Being a serious person
working on a serious project, I had a clean architecture, with a
library and binary that called it. So where could I use that library?
Rust can compile to WebAssembly, why not try that? I could call the
library from a small website and roll <em>blazing fast</em> dice in WASM
directly in the browser!</p>
<p>I found a cheap domain name, and shortly after, the result was
accessible on <a href="https://diceroll.run/">diceroll.run</a>.</p>
<p><img src="../images/diceroll/diceroll-run.png" /></p>
<p>With the website, I could experiment with a lot of stuff that I wanted
to explore:</p>
<ul>
<li>the WASM module itself of course, how to build it and use it from JS
in the browser,</li>
<li>generating a random seed at the start of the session for
reproducibility,</li>
<li>saving all the state (random seed and input history) in the URL
query parameters, for easily sharing a session with other people!</li>
</ul>
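<p>The last point can be sketched as follows: if the seed and the list of inputs are enough to replay a session, then they are also all that needs to go into the URL. The parameter names <code>seed</code> and <code>q</code> below are assumptions made for this sketch, and the site's actual encoding is not described here.</p>

```rust
/// Percent-encode everything except the RFC 3986 "unreserved" characters.
fn percent_encode(input: &str) -> String {
    let mut out = String::new();
    for byte in input.bytes() {
        match byte {
            b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' | b'-' | b'_' | b'.' | b'~' => {
                out.push(byte as char)
            }
            _ => out.push_str(&format!("%{:02X}", byte)),
        }
    }
    out
}

/// Build a shareable query string from the session seed and the
/// expressions typed so far (hypothetical parameter names).
fn session_query(seed: u64, history: &[&str]) -> String {
    let mut query = format!("?seed={}", seed);
    for expr in history {
        query.push_str("&q=");
        query.push_str(&percent_encode(expr));
    }
    query
}
```

<p>Because the generator is seeded, replaying the same inputs in the same order from the same seed reproduces the same rolls, which is what makes a shared link meaningful.</p>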
<h2 id="coding-agents-and-learning-opportunities">Coding agents and learning opportunities</h2>
<p>Among these vast amounts of overengineering and unnecessary features,
there was at least a nice outcome: I learned a lot.</p>
<p>The initial version was basically a single
script generated with an AI agent. After that, as I became more
invested in the features, I continued iterating with AI. But this was
mostly for fun and a learning opportunity: I was under no pressure to
actually deliver any of those features! So I didn’t “vibe code”, I
used the agent to iterate on the code, reviewing it line-by-line as I
would have done if I had written it myself.</p>
<p>This allowed me to iterate rapidly on the features and on the code
structure. Rust was really helpful as well. I decided on the data
structures and the libraries, and from there the LLM could generate
very good code within the constraints of the data models and the
compiler’s checks.</p>
<p>The parser in particular is code that can quickly become a little bit
boring, even with a parser combinator library. Here the LLM could
generate a battery of unit tests, and I could very easily check that
it covered the most important cases, both expected successes and
expected failures. Without AI assistance I don’t think I ever would
have had the patience to write these tests for a side project! It’s in
this sense that I believe AI agents can lead us to write much better
code, especially for one-off, unimportant side projects like this.</p>
<p>The only part that was fully AI-generated with (almost) no supervision
is the website. The HTML, CSS, and the bit of JS wrapping the WASM
module were fully “vibe coded” in the sense that I didn’t read the code
in detail, I just looked at the output, tried it, and iterated from
there. Given that most of the logic is in the Rust library, I didn’t
want to spend too much time changing CSS by hand…</p>
<p>With or without AI, small projects like these are a great learning
opportunity. I could experiment with new (to me) libraries for parser
combinators, minimal HTTP servers, CLI argument parsing, etc. The AI
agent is also a great source of learning: I could study the code of the
simple examples it produced!</p>
<h2 id="use-it-if-you-want">Use it (if you want)</h2>
<p>The code is <a href="https://github.com/dlozeve/diceroll">on GitHub</a> (you can download a pre-built binary in the
releases), and the website is on <a href="https://diceroll.run/">diceroll.run</a>! As you can imagine, I
would be <em>very</em> happy to hear about your ideas for new features and
improvements 😛</p>
</section>
]]></description>
    <pubDate>Wed, 13 May 2026 00:00:00 UT</pubDate>
    <guid>https://www.lozeve.com/posts/diceroll.html</guid>
    <dc:creator>Dimitri Lozeve</dc:creator>
</item>
<item>
    <title>Why I built Leibniz: in defense of Gantt charts and project scheduling</title>
    <link>https://www.lozeve.com/posts/leibniz-product-vision.html</link>
    <description><![CDATA[<section>
  <p>In the past few months, I have been working on a new project:
<a href="https://getleibniz.com">Leibniz</a>. I designed it to improve the way we plan projects. Like most
side projects, it started because of my own dissatisfaction with the
existing tools and processes during my work. Here, I want to explain
the story, the frustrations I encountered, and the vision I
have.<span class="sidenote-wrapper"><label for="sn-0" class="margin-toggle">⊕</label><input type="checkbox" id="sn-0" class="margin-toggle" /><span class="marginnote">This is the origin story and the product
vision. I hope to have more to say about the technical details and the
juicy algorithms later!<br />
<br />
</span></span></p>
<h2 id="project-management-the-old-school-way">Project management, the old-school way</h2>
<p>In my previous job, I was in charge of project management. The company
was very old-school in its approach, and although quite small, copied
most of its processes from large industrial companies like Thales and
Safran.</p>
<p>This meant that most projects were managed in the following
manner:</p>
<ul>
<li>a project manager designed the entire project from scratch, dividing
the project into many <em>tasks</em>,</li>
<li>she mapped the <em>dependencies</em> between each task,</li>
<li>she assigned a <em>resource</em> (usually a job title, like “backend
developer” or “research engineer”) to each task,</li>
<li>she set a <em>duration</em> for each task<span class="sidenote-wrapper"><label for="sn-1" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sn-1" class="margin-toggle" /><span class="sidenote">This only covers the initial part of project
planning. But the main difficulty actually arises during project
execution: adapting activities, dependencies, and schedule to
external, unforeseen events.<br />
<br />
</span></span>.</li>
</ul>
<p>After that, the project manager set up a meeting with the company’s
top management to present the plan. It got challenged, often in very
minute details, including time estimates. The goal of the project
manager at this point was to obtain top management’s approval for
running the project, and above all to negotiate the deadline (she
wanted to set it as late as possible, because she would be held
accountable if she missed it), and the resources made available to her
(i.e. people).</p>
<p>And then the dance started: there was a big meeting with all the other
project managers (each with their own projects and requirements), and
they would all haggle over the “resources” (once again, actual people,
fellow colleagues working right next to them). This often degenerated
into apothecary-style counting: a typical sentence in these meetings
was “ok, you can have Bob on week 3 of April, but then I will need 0.5
Alice on week 4 to compensate”. Not only were these arrangements
nearly meaningless (how can a developer work for 2.5 days on a given
project then switch to another?), but the vocabulary encouraged
project managers to think of people as “resources”, machines that
could be subdivided, reallocated, and moved as easily as equipment can
be repurposed.</p>
<p>But that was not the worst part. The worst part was the tooling we had
to use.</p>
<p>I managed small projects, mainly software. Think a few months, at most
a year, and a small team of between 5 and 10 people. Something that
most startups handle very easily. We could have done it with an agile
team iterating rapidly, but to fit the process, I had to build a full
project plan, including the list of tasks with all the metadata, a
Gantt chart, and so on. It could have been done very easily. But the
company used <a href="https://en.wikipedia.org/wiki/Microsoft_Project">Microsoft Project</a>.</p>
<p>Microsoft Project has… a few issues. The software was basically
built in the 90s and never touched since. The developers had clearly
never spoken to an actual user in their entire lives. If it was made
for anyone, it was for project managers on extremely large projects
(perhaps for designing and building an entire airplane).<span class="sidenote-wrapper"><label for="sn-2" class="margin-toggle">⊕</label><input type="checkbox" id="sn-2" class="margin-toggle" /><span class="marginnote">I should note that this is a personal impression based
on very superficial use. We are talking about a very large piece of
software in which Microsoft has probably invested <em>decades</em> of work. I
am sure many people have mastered it and find it invaluable. But it
was not what I wanted for my simple use cases. And it certainly didn’t
spark joy.<br />
<br />
</span></span>
There were thousands of features I would never need, and the basic
ones were very hard to use.</p>
<ul>
<li>The entire interface is built like an Excel spreadsheet. Each
activity on its own line, with ID, name, duration, resources, and
dependencies. A project is inherently a graph, and project planning
(at least for me) is an inherently visual activity. Yet the only
place where you can actually edit the project is a very dense table.</li>
<li>Dependencies are represented in the most obscure way possible: each
activity has a “dependencies” column where you fill the IDs of the
activities on which it depends. Good luck figuring out what it means
when you see that your activities depend on 127, 134, and 34. It
basically feels like you are looking at the underlying database
schema, not at something built for actual users.</li>
<li>It is very hard to collaborate on projects. Even though the essence
of a project is to organize an entire team, you cannot share it
easily with other people. If one person has it open, you have to wait
until they close it before being able to open it yourself. And the
interface is so dense and unclear that only people who are used to
working with it know what to look for, so it’s useless to show to
individual developers for instance.</li>
<li>The scheduling algorithm feels limited: you have hundreds of options
to configure different dependency types, but in many cases the
program just gives up and settles for highlighting conflicts between
activities (generally resource conflicts). Then you have to fix them
manually, moving tasks and hardcoding their start date, defeating
the entire purpose of scheduling software.</li>
</ul>
<h2 id="project-scheduling-is-a-good-abstraction">Project scheduling is a good abstraction</h2>
<p>None of this was great, in terms of either process or tooling. It’s
easy to conclude that we should throw out the whole thing and do
something else entirely (agile, scrum, whatever you want to call
it). But underneath the unpleasantness, I still thought that the
underlying <em>model</em> was interesting. The problem is not the Gantt
chart; the problem is sticking rigidly to it. The problem is not
decomposing the project into small tasks and assigning them to people;
it’s treating people as consumable, impersonal resources. The model
was still useful for helping me structure my thoughts on large and
ambitious projects, and to <em>communicate</em> the tradeoffs and decisions
to other people (both the members of my team and my managers).</p>
<p>And to be honest, the maths nerd in me was also quickly fascinated by
the underlying algorithms for project scheduling<span class="sidenote-wrapper"><label for="sn-3" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sn-3" class="margin-toggle" /><span class="sidenote">The mathematical
model of a project, as used in most algorithms, is very clean and
elegant, with very few concepts. This convinced me that there exists a
good UX that will help people manipulate it easily.<br />
<br />
</span></span>.</p>
<p>So I went looking for a way to leverage that model while challenging
both the tooling and the process. As a small-scale manager, there
wasn’t much I could do about the process directly, but I could propose
better tooling and use that as a means to challenge the process
itself.</p>
<h2 id="looking-for-alternative-tooling">Looking for alternative tooling</h2>
<p>Given the 1990s look and feel of Microsoft Project, I naturally
assumed there would be something better that I could propose to my
boss. I had already managed to convince him to let me migrate my team
to Linear, while the rest of the company was still on Jira. Where was
the Linear to Microsoft Project’s Jira?</p>
<p>It turns out it didn’t really exist. Most competitors were either as
bad as Microsoft Project (with fewer features), or were focused on
some other area I didn’t need (mostly CRMs, ERPs, and things like
that), with project scheduling bolted on the side<span class="sidenote-wrapper"><label for="sn-4" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sn-4" class="margin-toggle" /><span class="sidenote">Of course I
can’t pretend I looked at <a href="https://en.wikipedia.org/wiki/Comparison_of_project_management_software">each and every one of them</a>…<br />
<br />
</span></span>.</p>
<p>Looking into it, the reason everyone gave was that project scheduling
was a “hard problem” requiring dedicated research teams and
state-of-the-art algorithms. This piqued my curiosity, and without
much ambition, I looked into the literature on
project scheduling. I quickly learned that simple algorithms (like the
<a href="https://en.wikipedia.org/wiki/Critical_path_method">critical path method</a>) cover the most basic use cases very simply. But
the theory was much richer than I expected: there are hundreds of
variants of the scheduling problem, and there’s a very active research
community with close ties to the industry. (I wrote about it a bit in
a <a href="./planning-and-scheduling.html">previous post</a>.) I eventually bought a book on project scheduling
with time windows and resource constraints, and had a very basic
implementation running within a few days in my spare time.</p>
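<p>To give an idea of how simple the basic case is, here is a minimal sketch of the forward pass of the critical path method. It is not the algorithm used in Leibniz: it ignores resources and time windows entirely and only computes the earliest possible start of each task from durations and dependencies. The <code>Task</code> type and the <code>earliest_starts</code> function are made up for this illustration.</p>

```rust
/// A task with a duration and the indices of the tasks it depends on
/// (hypothetical type for this sketch; indices must be valid).
struct Task {
    duration: u32,
    deps: Vec<usize>,
}

/// Forward pass of the critical path method.
/// Returns the earliest start of every task and the total project duration,
/// or `None` if the dependencies contain a cycle.
fn earliest_starts(tasks: &[Task]) -> Option<(Vec<u32>, u32)> {
    let n = tasks.len();
    // Number of unfinished dependencies for each task.
    let mut remaining: Vec<usize> = tasks.iter().map(|t| t.deps.len()).collect();
    // Reverse edges: which tasks are waiting on task `d`.
    let mut successors: Vec<Vec<usize>> = vec![Vec::new(); n];
    for (i, task) in tasks.iter().enumerate() {
        for &d in &task.deps {
            successors[d].push(i);
        }
    }

    // Tasks without dependencies can start at time 0.
    let mut ready: Vec<usize> = (0..n).filter(|&i| remaining[i] == 0).collect();
    let mut start = vec![0u32; n];
    let mut done = 0usize;
    let mut makespan = 0u32;

    // Process tasks in topological order.
    while let Some(i) = ready.pop() {
        done += 1;
        let finish = start[i] + tasks[i].duration;
        makespan = makespan.max(finish);
        for &j in &successors[i] {
            // A task starts once its latest dependency has finished.
            start[j] = start[j].max(finish);
            remaining[j] -= 1;
            if remaining[j] == 0 {
                ready.push(j);
            }
        }
    }

    if done == n {
        Some((start, makespan))
    } else {
        None // some tasks were never released: there is a cycle
    }
}
```

<p>Resource constraints are what make the problem hard: as soon as two tasks compete for the same person, a single pass like this one is no longer enough.</p>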
<h2 id="the-algorithm-is-not-the-product">The algorithm is not the product</h2>
<p>As is often the case, the algorithm is not the difficult part. The
difficult part is the product. What did I want from this software? As
my own first user, my needs were clear:</p>
<ul>
<li>An intuitive user experience. This is table stakes: nobody wants to
spend months mastering a project scheduling tool.</li>
<li>Easy visualization and manipulation of dependencies. Dependencies
are the key part of the model. Everything, from the resource
allocations to the final schedule, hinges on them.</li>
<li>A collaborative experience from the start. If you can’t share it
with anyone, the project plan loses 90% of its use.</li>
<li>A state-of-the-art scheduling algorithm that never forces you to
adjust manually.</li>
<li>A fast, responsive experience. When I change something, I want the
changes to be reflected immediately in the schedule.</li>
</ul>
<p>Above all, I wanted it to be useful to everyone: not only to people
working with traditional project management processes, but also to
people who work in more agile, flexible environments. And from small
projects of a few tasks to enormous undertakings involving thousands.</p>
<h2 id="going-forward">Going forward</h2>
<p>Over the past months, I have built a tool that answers most of my
needs. I am my own first user (and apart from a few colleagues, the
only one). Despite no longer working at the same company, and
following vastly different processes, I still use it every day.</p>
<p>Project scheduling algorithms are so general that they can be applied
to many different contexts. In the next few posts I will probably talk
more about some design choices and some technical decisions I
made. For spoilers, the key features are <a href="https://getleibniz.com/">on the website</a>! And maybe I
will also explain the name…</p>
</section>
]]></description>
    <pubDate>Sun, 03 May 2026 00:00:00 UT</pubDate>
    <guid>https://www.lozeve.com/posts/leibniz-product-vision.html</guid>
    <dc:creator>Dimitri Lozeve</dc:creator>
</item>
<item>
    <title>A Mastodon bot for interesting integer sequences</title>
    <link>https://www.lozeve.com/posts/oeis-bot.html</link>
    <description><![CDATA[<section>
  <p>A few days ago I stumbled into the <a href="https://oeis.org/">OEIS</a> again via their <a href="https://oeis.org/OEIS_pics.html">pictures page</a>. Looking at some of the pages there gave me an idea for a new,
small project, so I gave it a shot.</p>
<h2 id="what-is-the-oeis">What is the OEIS?</h2>
<p>The On-Line Encyclopedia of Integer Sequences, or “OEIS” for short, is
a collaborative project whose objective is to collect all sequences of
integers into a comprehensive repository. It is useful to all
mathematicians (professional or amateur!) who stumble into a sequence
and wish to identify it: they can type a few terms into the search
bar, and the OEIS will return all sequences that match.</p>
<p>The OEIS was started by Neil J. A. Sloane, after he had published a few
books collecting many important sequences. Many people had interesting
sequences they wanted to contribute, and sent letters to Sloane with
suggestions. The collection began to be so large that Sloane started
an online version, which became the OEIS<span class="sidenote-wrapper"><label for="sn-0" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sn-0" class="margin-toggle" /><span class="sidenote">The full history can be
found <a href="https://oeisf.org/#HISTORY">here</a>.<br />
<br />
</span></span>.</p>
<p>Each sequence in the database has a unique identifier like <a href="https://oeis.org/A000108">A000108</a>. An
entry includes a description, the known terms of the sequence, and a
lot of additional data: explicit formulas when they exist, comments
and bibliographical references, programs to compute the sequence, and
other related sequences in the database. The <a href="https://oeis.org/demo1.html">demo</a> is a good place to
start for how to use the database.</p>
<p>Like most people, I have no real need of the OEIS on a day-to-day
basis. However I still find it fascinating. Many sequences are
well-known: the prime numbers (<a href="https://oeis.org/A000040">A000040</a>), the Fibonacci numbers
(<a href="https://oeis.org/A000045">A000045</a>), etc. Others I have encountered elsewhere in popular
mathematics, like the Busy Beaver problem (<a href="https://oeis.org/A060843">A060843</a>), whose 5th term
has been <a href="https://www.quantamagazine.org/amateur-mathematicians-find-fifth-busy-beaver-turing-machine-20240702/">computed recently</a>. And others I have discovered directly on
the OEIS, like the “peaceable coexisting armies of queens” (<a href="https://oeis.org/A250000">A250000</a>,
look at the ASCII chess boards in the examples!).</p>
<p>What I like about these sequences is that they are often at the
intersection of many domains in mathematics. Some of these sequences
have very deep connections to many areas, and let you discover new
interesting concepts and ideas. Moreover the relationships between
sequences are often interesting as well, and let you continue the
exploration.</p>
<h2 id="the-idea">The idea</h2>
<p>The only problem is that the OEIS is more of a search engine, and
therefore most sequences are not very discoverable. Contributors and
editors add keywords to the sequences, some are <a href="https://oeis.org/search?q=keyword%3Anice&amp;language=english&amp;go=Search">nice</a>, some have a nice
<a href="https://oeis.org/search?q=keyword%3Alook&amp;language=english&amp;go=Search">graph</a> (click on “graph” under the sequence itself!), and the wiki
lists some <a href="https://oeis.org/wiki/Index_to_OEIS:_Section_Cor#core">important sequences</a>. But other than this it is not easy to
discover interesting sequences randomly.</p>
<p>So I got the idea of implementing a bot that would share a random
sequence regularly, so I could read the full OEIS page if the
description seemed interesting. Since I am mostly on <a href="https://mathstodon.xyz/@dlzv">Mastodon</a>, I
wanted to implement this as a Mastodon bot so that others could follow
it if they want.</p>
<h2 id="the-implementation">The implementation</h2>
<p>The general idea is as follows:</p>
<ol type="1">
<li>Pick a random sequence ID out of the nearly 400,000 sequences
currently in the database.</li>
<li>Retrieve the metadata for this sequence (its name, the first few
terms, keywords and comments).</li>
<li>Format this as a Mastodon post and post it.</li>
</ol>
<p>It is a very simple process (basically a GET request to the OEIS
followed by a POST request to a Mastodon instance). Mastodon servers
have a public API that is very easy to use and very well documented: I
only needed the <a href="https://docs.joinmastodon.org/methods/statuses/#create">statuses</a> endpoint. It requires a token that can be
generated easily from the account settings, by creating an application
with the <code class="verbatim">write:statuses</code> scope.</p>
<p>The OEIS API has a <a href="https://oeis.org/wiki/JSON_Format">JSON output format</a>. I wanted to parse this
correctly. Looking at the <a href="https://oeis.org/eishelp2.html">documentation</a> of the format, I realized that
many sequences should be excluded because they are duplicates,
unimportant, or too obscure<span class="sidenote-wrapper"><label for="sn-1" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sn-1" class="margin-toggle" /><span class="sidenote">Specifically, I excluded sequences
tagged with <code class="verbatim">dead</code>, <code class="verbatim">dumb</code>, <code class="verbatim">dupe</code>, <code class="verbatim">less</code>, <code class="verbatim">obsc</code>, <code class="verbatim">probation</code>, or
<code class="verbatim">uned</code>.<br />
<br />
</span></span>. So I added a step in the process above to make sure that the
sequence picked at random was sufficiently interesting. Given the
number of available sequences, there should be more than enough
interesting ones!</p>
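<p>The filtering step can be sketched in a few lines. This is not the bot's actual code: it assumes that the keywords of an entry arrive as a single comma-separated string such as <code>nonn,easy,nice</code>, and the names <code>EXCLUDED</code>, <code>is_interesting</code> and <code>a_number</code> are made up for this illustration.</p>

```rust
/// Keywords that disqualify a sequence, as listed in the sidenote above.
const EXCLUDED: [&str; 7] = ["dead", "dumb", "dupe", "less", "obsc", "probation", "uned"];

/// Returns true when none of the entry's keywords is in the exclusion list.
/// `keywords` is assumed to be a comma-separated string.
fn is_interesting(keywords: &str) -> bool {
    !keywords
        .split(',')
        .map(str::trim)
        .any(|keyword| EXCLUDED.contains(&keyword))
}

/// Format a sequence number as an identifier like A000108:
/// the letter A followed by the number zero-padded to six digits.
fn a_number(n: u32) -> String {
    format!("A{:06}", n)
}
```

<p>A sequence that fails the check is simply skipped, and another random ID is drawn until an acceptable one comes up.</p>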
<p>I implemented it in Rust<span class="sidenote-wrapper"><label for="sn-2" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sn-2" class="margin-toggle" /><span class="sidenote">Mostly because I am comfortable with it,
it has a very good ecosystem, and it can be deployed as a single
binary.<br />
<br />
</span></span>. Claude Code was fairly helpful to build all this, especially
for the serde types to deserialize the OEIS internal format. All in
all, I had a fully functional program in under an hour, for something
that would have taken me at least a few hours if I had done everything
“manually” (and which would not have been particularly interesting or
fun to do). I am reasonably happy with the overall quality of the
code, which you can find <a href="https://github.com/dlozeve/oeis_bot">on GitHub</a>.</p>
<p>The final program is deployed on my VPS: I just copied the executable
and set up a <a href="https://github.com/dlozeve/oeis_bot/blob/main/oeis-bot.timer">systemd timer</a> to call it every 4 hours. The token for
Mastodon is given as an environment variable in the <a href="https://github.com/dlozeve/oeis_bot/blob/main/oeis-bot.service">service file</a>.</p>
<h2 id="the-result">The result</h2>
<p>I created an account on <a href="https://mathstodon.xyz/">Mathstodon</a>, the instance I use for my <a href="https://mathstodon.xyz/@dlzv">personal account</a>. As this is my first bot, I wasn’t sure about the process, but
the administrators of the instance were happy to allow the bot as long
as it is clearly identified as such and does not spam too much. I also
made it clear that the bot is not operated officially by the
OEIS. Should they decide to join Mastodon, I would be happy to cede
them the account if they wish.</p>
<p>If you want to follow it, the account is <a href="https://mathstodon.xyz/@OEISbot">@OEISbot</a>!</p>
<blockquote lang="en" cite="https://mathstodon.xyz/@OEISbot/116069813960351345" data-source="fediverse">
  <p>OEIS sequence A177263<br />Triangle read by rows: T(n,k) is the number of permutations of {1,2,...,n} having k as the last entry in the first block (1&lt;=k&lt;=n).</p><p>1, 0, 2, 1, 1, 4, 4, 5, 5, 10, 18, 22, 23, 23, 34, 96, 114, 118, 119, 119, 154, 600, 696, 714, 718, 719, 719, 874, 4320, 4920, 5016, 5034, 5038, 5039, 5039, 5914, 35280, 39600, 40200, 40296, 40314, 40318, 40319, 40319, 46234, 322560, 357840, 362160, 362760, 362856, 362874, 362878, 362879, 362879, 409114</p><p><a href="https://oeis.org/A177263" target="_blank" rel="nofollow noopener" translate="no"><span class="invisible">https://</span><span class>oeis.org/A177263</span><span class="invisible"></span></a></p>
  <footer>
     — OEIS bot (@OEISbot) <a href="https://mathstodon.xyz/@OEISbot/116069813960351345"><time datetime="2026-02-14T16:00:07.743Z">2/14/2026, 5:00:07 PM</time></a>
  </footer>
</blockquote>

<blockquote lang="en" cite="https://mathstodon.xyz/@OEISbot/116067926081324392" data-source="fediverse">
  <p>OEIS sequence A071904<br />Odd composite numbers.</p><p>9, 15, 21, 25, 27, 33, 35, 39, 45, 49, 51, 55, 57, 63, 65, 69, 75, 77, 81, 85, 87, 91, 93, 95, 99, 105, 111, 115, 117, 119, 121, 123, 125, 129, 133, 135, 141, 143, 145, 147, 153, 155, 159, 161, 165, 169, 171, 175, 177, 183, 185, 187, 189, 195, 201, 203, 205</p><p><a href="https://oeis.org/A71904" target="_blank" rel="nofollow noopener" translate="no"><span class="invisible">https://</span><span class>oeis.org/A71904</span><span class="invisible"></span></a></p>
  <footer>
     — OEIS bot (@OEISbot) <a href="https://mathstodon.xyz/@OEISbot/116067926081324392"><time datetime="2026-02-14T08:00:00.996Z">2/14/2026, 9:00:00 AM</time></a>
  </footer>
</blockquote>

<p>For now only a few sequences have been posted, but I’m excited to see
what interesting sequences come up!</p>
</section>
]]></description>
    <pubDate>Sat, 14 Feb 2026 00:00:00 UT</pubDate>
    <guid>https://www.lozeve.com/posts/oeis-bot.html</guid>
    <dc:creator>Dimitri Lozeve</dc:creator>
</item>
<item>
    <title>The Bulgarian Solitaire</title>
    <link>https://www.lozeve.com/posts/bulgarian-solitaire.html</link>
    <description><![CDATA[<section>
  <p>Imagine you are the director of a unit in a large company. You have 5
teams, with various numbers of people in each team. An important new
project has come up, and you would like to set up a new team to tackle
it. The problem is: you don’t have any budget to hire new people!</p>
<p>What can you do? Well, you could take one person from each existing
team, and build a new team with these 5 people.</p>
<p>It works well for a while, but of course soon enough, a new project
comes up, and you still don’t have any additional budget! It worked
the first time, so you reshuffle the teams once more in the same
way. One of the existing teams had only one person left, so they go to
the newly created team and their existing team just disappears.</p>
<p>Now the CEO is getting concerned with your unconventional management
practices. She summons you to her office and asks you: What would
happen if you do this repeatedly for every new project? How many teams
will you end up with, with how many people in each one of them?</p>
<h2 id="the-bulgarian-solitaire">The Bulgarian Solitaire</h2>
<p>Instead of this (admittedly far-fetched) management scenario, you can
do the same process with playing cards. You have <span class="math inline">\(n\)</span> cards divided
arbitrarily into several piles. At each step, you remove one card from
each pile and collect them in a new pile next to the existing ones. If
you repeat this operation, how many piles of cards will you get? With
how many cards in each of them? This problem is called the <em>Bulgarian
Solitaire</em>. It is described in <a href="https://arxiv.org/abs/1503.00885">this paper</a><span class="sidenote-wrapper"><label for="sn-0" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sn-0" class="margin-toggle" /><span class="sidenote">Which also describes the
history of the problem, including why it is called “Bulgarian”.<br />
<br />
</span></span>.</p>
<p>With <span class="math inline">\(n\)</span> cards, the Bulgarian Solitaire is an example of a discrete
dynamical system on partitions of <span class="math inline">\(n\)</span>. A <em>partition</em> is an unordered
sequence of positive integers that sum to <span class="math inline">\(n\)</span>. For instance, <span class="math inline">\((4,3,2,1)\)</span> is
a partition of 10, and <span class="math inline">\((8,3,2,2)\)</span> is a partition of 15. At each step, the Bulgarian
Solitaire will transform a partition like <span class="math inline">\((8,3,2,2)\)</span> into a new one,
in this case <span class="math inline">\((7,2,1,1,4)\)</span> (the new pile is at the end, but we don’t
care about the order so we can also write it as <span class="math inline">\((7,4,2,1,1)\)</span>).</p>
<p>What we are trying to understand is in fact a standard question in
dynamical systems: where is the system going? Are there stable states,
cycles? What are they?</p>
<h2 id="visualizing-the-bulgarian-solitaire-process">Visualizing the Bulgarian Solitaire process</h2>
<p>As a first example, let’s take <span class="math inline">\(n=10\)</span> cards, in piles of 4, 3, and 3
cards. We take one card from the top of each pile (in orange), and we
make a new pile of size 3:</p>
<p><img src="../images/bulgarian-solitaire/cards.svg" width="300" /></p>
<p>Visualizing cards like this is not very practical, but as it turns
out, mathematicians have already devised a nice way to visualize
partitions: <a href="https://en.wikipedia.org/wiki/Young_tableau#Diagrams">Young diagrams</a>. They represent a partition as rows of
boxes. For example, the partition <span class="math inline">\((6,5,3,1)\)</span> of the number 15 can be
visualized as:</p>
<p><picture><source srcset="../images/bulgarian-solitaire/young-dark.png" media="(prefers-color-scheme: dark)"><img src="../images/bulgarian-solitaire/young.png" alt></picture></p>
<p>In our case, we can rotate the Young diagram by 90° so that it looks
more like piles of cards:</p>
<p><picture><source srcset="../images/bulgarian-solitaire/young_rotated-dark.png" media="(prefers-color-scheme: dark)"><img src="../images/bulgarian-solitaire/young_rotated.png" alt></picture></p>
<p>A step in the Bulgarian Solitaire will look like this:</p>
<p><picture><source srcset="../images/bulgarian-solitaire/step-dark.png" media="(prefers-color-scheme: dark)"><img src="../images/bulgarian-solitaire/step.png" alt></picture></p>
<p>We take one card from each pile (highlighted in orange in the diagram
above), and create a new pile from them. Then we reorder the piles so
that they are sorted by decreasing size.</p>
<h2 id="convergence-of-the-bulgarian-solitaire">Convergence of the Bulgarian Solitaire</h2>
<p>Now, if you try the Bulgarian Solitaire with 15 cards, for every
possible starting position, you will end up with 5 piles of size 1, 2,
3, 4, and 5. It turns out that this pattern can be observed consistently:</p>
<div class="theorem">
<p>When the number of cards <span class="math inline">\(n = k(k+1)/2\)</span> is triangular<span class="sidenote-wrapper"><label for="sn-1" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sn-1" class="margin-toggle" /><span class="sidenote"><a href="https://en.wikipedia.org/wiki/Triangular_number">Triangular numbers</a> are the sums of consecutive
integers: 1, 1+2, 1+2+3, etc. They get their names from the fact that
they can be arranged in a triangle. (OEIS: <a href="https://oeis.org/A000217">A000217</a>) (Image from
<a href="https://commons.wikimedia.org/wiki/File:First_six_triangular_numbers.svg">Wikimedia</a>, CC-BY-SA.)
<picture><source srcset="../images/bulgarian-solitaire/First_six_triangular_numbers-dark.svg" media="(prefers-color-scheme: dark)"><img src="../images/bulgarian-solitaire/First_six_triangular_numbers.svg" alt></picture><br />
<br />
</span></span>,
the Bulgarian Solitaire will converge to piles of size 1, 2, …, <span class="math inline">\(k\)</span>.</p>
</div>
<p>How can we prove this?</p>
<p>There is a nice trick: we rotate (again!) our Young diagram by 45°:</p>
<p><picture><source srcset="../images/bulgarian-solitaire/young_rotated_45-dark.png" media="(prefers-color-scheme: dark)"><img src="../images/bulgarian-solitaire/young_rotated_45.png" alt></picture></p>
<p>When we do a step in the Bulgarian Solitaire, we move around the boxes:</p>
<p><picture><source srcset="../images/bulgarian-solitaire/young_fall-dark.png" media="(prefers-color-scheme: dark)"><img src="../images/bulgarian-solitaire/young_fall.png" alt></picture></p>
<p>Sometimes we can see that a box (in grey here) is above an empty
space, and therefore it falls down. This corresponds to cases when
there is a need to reorder the piles so that they remain in order of
decreasing size.</p>
<p>We can see that at each step, the total potential energy of the boxes
either stays the same (when all boxes are still resting on other boxes
or on the edges) or decreases (when a box is above an empty space and
falls down). And the lowest possible potential energy is reached when
the configuration is fully triangular, when boxes are spread evenly
across all piles!</p>
<p>So now we just need to prove that all configurations will end up at
this minimum of potential energy, i.e. that there is no local minimum
that prevents the system from reaching this global minimum.</p>
<p>Imagine we are <em>not</em> at this triangular position. Since the total
number of boxes is triangular, there is necessarily one layer with an
empty space for a box, and one layer above with an additional box. At
each step, the boxes will “rotate”, and the empty space and the
additional box will move one space to the right. If the layer with the
empty space is of size <span class="math inline">\(k\)</span>, the layer above it will be of size <span class="math inline">\(k+1\)</span>,
and these numbers are relatively prime. After a certain number of
steps, the additional box will be directly above the empty space and
will fall down.</p>
<p>So no configuration that is not fully triangular is stable: after a
certain number of steps, the potential energy will decrease. The
detailed proof can be found in section 2 of <a href="https://arxiv.org/abs/1503.00885">the paper</a>.</p>
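<p>The energy argument can also be observed numerically. The sketch below assumes one particular reading of the rotated diagram: the height of a box is its pile index plus its position in the pile (both counted from 0), and the potential energy is the sum of these heights, with piles sorted by decreasing size. The function <code>energy</code> is a name introduced here for illustration.</p>

```python
def step(piles: tuple[int, ...]) -> tuple[int, ...]:
    """Take one card from each pile and gather them into a new pile."""
    shrunk = [size - 1 for size in piles if size > 1]
    return tuple(sorted([len(piles), *shrunk], reverse=True))


def energy(piles: tuple[int, ...]) -> int:
    """Sum of (pile index + height in the pile) over all boxes, counted from 0.

    Assumes the piles are sorted by decreasing size.
    """
    return sum(i + j for i, size in enumerate(piles) for j in range(size))


# Follow one trajectory for n = 15 until it stops moving.
piles = (8, 3, 2, 2)
trajectory = [piles]
while step(piles) != piles:
    piles = step(piles)
    trajectory.append(piles)

energies = [energy(p) for p in trajectory]
print(trajectory[-1])
print(energies)
```

<p>Along the trajectory, the printed energies never increase, and the last configuration is the staircase <span class="math inline">\((5,4,3,2,1)\)</span>.</p>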
<h2 id="simulating-the-bulgarian-solitaire">Simulating the Bulgarian Solitaire</h2>
<p>Let’s simulate the process (in Python) to visualize the
convergence. The script will generate all partitions of a given
integer <span class="math inline">\(n\)</span>, and generate the transition graph of the Bulgarian
Solitaire. If a step of the Bulgarian Solitaire transforms partition
<span class="math inline">\(p_1\)</span> into partition <span class="math inline">\(p_2\)</span>, we will draw an edge between <span class="math inline">\(p_1\)</span> and
<span class="math inline">\(p_2\)</span>.</p>
<p>We’ll start by defining helpful type aliases: a partition is a tuple
of ints, and the graph we will generate is a dictionary from
partitions to partitions.</p>
<div class="sourceCode" id="cb1"><pre class="sourceCode python"><code class="sourceCode python"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a><span class="bu">type</span> Partition <span class="op">=</span> <span class="bu">tuple</span>[<span class="bu">int</span>, ...]</span>
<span id="cb1-2"><a href="#cb1-2" aria-hidden="true" tabindex="-1"></a><span class="bu">type</span> Graph <span class="op">=</span> <span class="bu">dict</span>[Partition, Partition]</span></code></pre></div>
<p>Next, we need a function that enumerates all possible partitions of a
given integer <span class="math inline">\(n\)</span>. It is built recursively: for all possible first parts
<span class="math inline">\(0 &lt; k \leq n\)</span>, the partitions we are looking for consist of <span class="math inline">\(k\)</span>
followed by all partitions of <span class="math inline">\(n-k\)</span> whose parts are at most <span class="math inline">\(k\)</span>, so that
each partition is generated exactly once, in non-increasing order.</p>
<div class="sourceCode" id="cb2"><pre class="sourceCode python"><code class="sourceCode python"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> collections.abc <span class="im">import</span> Generator</span>
<span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a><span class="kw">def</span> integer_partition(n: <span class="bu">int</span>) <span class="op">-&gt;</span> Generator[Partition]:</span>
<span id="cb2-4"><a href="#cb2-4" aria-hidden="true" tabindex="-1"></a>    <span class="cf">if</span> n <span class="op">&lt;=</span> <span class="dv">0</span>:</span>
<span id="cb2-5"><a href="#cb2-5" aria-hidden="true" tabindex="-1"></a>        <span class="cf">return</span></span>
<span id="cb2-6"><a href="#cb2-6" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb2-7"><a href="#cb2-7" aria-hidden="true" tabindex="-1"></a>    <span class="kw">def</span> find_partitions_recursive(target: <span class="bu">int</span>, current_partition: <span class="bu">list</span>[<span class="bu">int</span>]):</span>
<span id="cb2-8"><a href="#cb2-8" aria-hidden="true" tabindex="-1"></a>        <span class="cf">if</span> target <span class="op">==</span> <span class="dv">0</span>:</span>
<span id="cb2-9"><a href="#cb2-9" aria-hidden="true" tabindex="-1"></a>            <span class="cf">yield</span> current_partition</span>
<span id="cb2-10"><a href="#cb2-10" aria-hidden="true" tabindex="-1"></a>            <span class="cf">return</span></span>
<span id="cb2-11"><a href="#cb2-11" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb2-12"><a href="#cb2-12" aria-hidden="true" tabindex="-1"></a>        max_val <span class="op">=</span> current_partition[<span class="op">-</span><span class="dv">1</span>] <span class="cf">if</span> current_partition <span class="cf">else</span> target</span>
<span id="cb2-13"><a href="#cb2-13" aria-hidden="true" tabindex="-1"></a>        <span class="cf">for</span> i <span class="kw">in</span> <span class="bu">range</span>(<span class="bu">min</span>(target, max_val), <span class="dv">0</span>, <span class="op">-</span><span class="dv">1</span>):</span>
<span id="cb2-14"><a href="#cb2-14" aria-hidden="true" tabindex="-1"></a>            current_partition.append(i)</span>
<span id="cb2-15"><a href="#cb2-15" aria-hidden="true" tabindex="-1"></a>            <span class="cf">yield</span> <span class="cf">from</span> find_partitions_recursive(target <span class="op">-</span> i, current_partition)</span>
<span id="cb2-16"><a href="#cb2-16" aria-hidden="true" tabindex="-1"></a>            <span class="co"># Backtrack</span></span>
<span id="cb2-17"><a href="#cb2-17" aria-hidden="true" tabindex="-1"></a>            current_partition.pop()</span>
<span id="cb2-18"><a href="#cb2-18" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb2-19"><a href="#cb2-19" aria-hidden="true" tabindex="-1"></a>    <span class="cf">yield</span> <span class="cf">from</span> <span class="bu">map</span>(<span class="bu">tuple</span>, find_partitions_recursive(n, []))</span></code></pre></div>
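<p>Since the transition graph has one node per partition, it is useful to know how many partitions to expect. As a minimal sketch, independent of the generator above, the hypothetical helper <code>count_partitions</code> counts them with the classic dynamic program (the same one used for counting coin change), without listing them. The resulting values are the partition numbers (OEIS: <a href="https://oeis.org/A000041">A000041</a>).</p>

```python
def count_partitions(n: int) -> int:
    """Count the partitions of n without enumerating them."""
    # ways[t] = number of partitions of t using only the parts seen so far.
    ways = [1] + [0] * n
    for part in range(1, n + 1):
        for total in range(part, n + 1):
            ways[total] += ways[total - part]
    return ways[n]


for n in (6, 7, 8, 10, 15, 30):
    print(n, count_partitions(n))
```

<p>The count grows quickly, which is worth keeping in mind before rendering the graph for large <span class="math inline">\(n\)</span>.</p>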
<p>The <code>step</code> function is the Bulgarian Solitaire itself:</p>
<div class="sourceCode" id="cb3"><pre class="sourceCode python"><code class="sourceCode python"><span id="cb3-1"><a href="#cb3-1" aria-hidden="true" tabindex="-1"></a><span class="kw">def</span> step(partition: Partition) <span class="op">-&gt;</span> Partition:</span>
<span id="cb3-2"><a href="#cb3-2" aria-hidden="true" tabindex="-1"></a>    new_piles <span class="op">=</span> [<span class="bu">len</span>(partition)] <span class="op">+</span> [k <span class="op">-</span> <span class="dv">1</span> <span class="cf">for</span> k <span class="kw">in</span> partition <span class="cf">if</span> k <span class="op">&gt;</span> <span class="dv">1</span>]</span>
<span id="cb3-3"><a href="#cb3-3" aria-hidden="true" tabindex="-1"></a>    <span class="cf">return</span> <span class="bu">tuple</span>(<span class="bu">sorted</span>(new_piles, reverse<span class="op">=</span><span class="va">True</span>))</span></code></pre></div>
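<p>As a quick sanity check, here is a copy of this function (repeated so that the snippet runs on its own) applied to the examples used earlier in this post.</p>

```python
def step(partition: tuple[int, ...]) -> tuple[int, ...]:
    """One step of the Bulgarian Solitaire on a tuple of pile sizes."""
    new_piles = [len(partition)] + [k - 1 for k in partition if k > 1]
    return tuple(sorted(new_piles, reverse=True))


print(step((8, 3, 2, 2)))  # (7, 4, 2, 1, 1)
print(step((4, 3, 3)))     # (3, 3, 2, 2)
print(step((4, 3, 2, 1)))  # (4, 3, 2, 1), a fixed point
```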
<p>We can now compute the transition graph:</p>
<div class="sourceCode" id="cb4"><pre class="sourceCode python"><code class="sourceCode python"><span id="cb4-1"><a href="#cb4-1" aria-hidden="true" tabindex="-1"></a><span class="kw">def</span> compute_graph(n: <span class="bu">int</span>) <span class="op">-&gt;</span> Graph:</span>
<span id="cb4-2"><a href="#cb4-2" aria-hidden="true" tabindex="-1"></a>    <span class="cf">return</span> {partition: step(partition) <span class="cf">for</span> partition <span class="kw">in</span> integer_partition(n)}</span></code></pre></div>
<p>Now for display, we generate some <a href="https://graphviz.org/">Graphviz</a> code. The colors are
adapted for viewing directly in the terminal.</p>
<div class="sourceCode" id="cb5"><pre class="sourceCode python"><code class="sourceCode python"><span id="cb5-1"><a href="#cb5-1" aria-hidden="true" tabindex="-1"></a><span class="kw">def</span> print_graphviz(graph: Graph) <span class="op">-&gt;</span> <span class="va">None</span>:</span>
<span id="cb5-2"><a href="#cb5-2" aria-hidden="true" tabindex="-1"></a>    <span class="bu">print</span>(<span class="st">&quot;&quot;&quot;digraph {</span></span>
<span id="cb5-3"><a href="#cb5-3" aria-hidden="true" tabindex="-1"></a><span class="st">    bgcolor=&quot;transparent&quot;;</span></span>
<span id="cb5-4"><a href="#cb5-4" aria-hidden="true" tabindex="-1"></a><span class="st">    node [shape=&quot;rect&quot;, fontcolor=&quot;white&quot;, color=&quot;white&quot;];</span></span>
<span id="cb5-5"><a href="#cb5-5" aria-hidden="true" tabindex="-1"></a><span class="st">    edge [fontcolor=&quot;white&quot;, color=&quot;white&quot;];</span></span>
<span id="cb5-6"><a href="#cb5-6" aria-hidden="true" tabindex="-1"></a><span class="st">    &quot;&quot;&quot;</span>)</span>
<span id="cb5-7"><a href="#cb5-7" aria-hidden="true" tabindex="-1"></a>    <span class="cf">for</span> k, v <span class="kw">in</span> graph.items():</span>
<span id="cb5-8"><a href="#cb5-8" aria-hidden="true" tabindex="-1"></a>        <span class="bu">print</span>(<span class="ss">f'    &quot;</span><span class="sc">{</span>k<span class="sc">}</span><span class="ss">&quot; -&gt; &quot;</span><span class="sc">{</span>v<span class="sc">}</span><span class="ss">&quot;;'</span>)</span>
<span id="cb5-9"><a href="#cb5-9" aria-hidden="true" tabindex="-1"></a>    <span class="bu">print</span>(<span class="st">&quot;}&quot;</span>)</span></code></pre></div>
<p>To finish, let’s add a command-line interface:</p>
<div class="sourceCode" id="cb6"><pre class="sourceCode python"><code class="sourceCode python"><span id="cb6-1"><a href="#cb6-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> argparse</span>
<span id="cb6-2"><a href="#cb6-2" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb6-3"><a href="#cb6-3" aria-hidden="true" tabindex="-1"></a><span class="kw">def</span> main():</span>
<span id="cb6-4"><a href="#cb6-4" aria-hidden="true" tabindex="-1"></a>    parser <span class="op">=</span> argparse.ArgumentParser()</span>
<span id="cb6-5"><a href="#cb6-5" aria-hidden="true" tabindex="-1"></a>    parser.add_argument(<span class="st">&quot;n&quot;</span>, <span class="bu">type</span><span class="op">=</span><span class="bu">int</span>)</span>
<span id="cb6-6"><a href="#cb6-6" aria-hidden="true" tabindex="-1"></a>    args <span class="op">=</span> parser.parse_args()</span>
<span id="cb6-7"><a href="#cb6-7" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb6-8"><a href="#cb6-8" aria-hidden="true" tabindex="-1"></a>    graph <span class="op">=</span> compute_graph(args.n)</span>
<span id="cb6-9"><a href="#cb6-9" aria-hidden="true" tabindex="-1"></a>    print_graphviz(graph)</span>
<span id="cb6-10"><a href="#cb6-10" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb6-11"><a href="#cb6-11" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb6-12"><a href="#cb6-12" aria-hidden="true" tabindex="-1"></a><span class="cf">if</span> <span class="va">__name__</span> <span class="op">==</span> <span class="st">&quot;__main__&quot;</span>:</span>
<span id="cb6-13"><a href="#cb6-13" aria-hidden="true" tabindex="-1"></a>    main()</span></code></pre></div>
<p>The script can then be called directly:</p>
<div class="sourceCode" id="cb7" data-org-language="sh"><pre class="sourceCode bash"><code class="sourceCode bash"><span id="cb7-1"><a href="#cb7-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Save the graphviz file</span></span>
<span id="cb7-2"><a href="#cb7-2" aria-hidden="true" tabindex="-1"></a><span class="ex">python</span> bulgarian_solitaire.py 10 <span class="op">&gt;</span> graph_10.dot</span>
<span id="cb7-3"><a href="#cb7-3" aria-hidden="true" tabindex="-1"></a><span class="co"># Generate a PNG graph</span></span>
<span id="cb7-4"><a href="#cb7-4" aria-hidden="true" tabindex="-1"></a><span class="ex">python</span> bulgarian_solitaire.py 10 <span class="kw">|</span> <span class="ex">dot</span> <span class="at">-Tpng</span> <span class="op">&gt;</span> graph_10.png</span>
<span id="cb7-5"><a href="#cb7-5" aria-hidden="true" tabindex="-1"></a><span class="co"># Display the image directly in the terminal (if it supports image display)</span></span>
<span id="cb7-6"><a href="#cb7-6" aria-hidden="true" tabindex="-1"></a><span class="ex">python</span> bulgarian_solitaire.py 12 <span class="kw">|</span> <span class="ex">dot</span> <span class="at">-Tpng</span> <span class="kw">|</span> <span class="ex">viu</span> <span class="at">-</span></span></code></pre></div>
<p>Here is the output for <span class="math inline">\(n=6\)</span>:</p>
<p><img src="../images/bulgarian-solitaire/graph_6.svg" style="background-color: #151515;" /></p>
<p>And for <span class="math inline">\(n=10\)</span>:</p>
<p><img src="../images/bulgarian-solitaire/graph_10.svg" style="background-color: #151515;" /></p>
<p>As we can see, all partitions indeed converge to the triangular
configurations <span class="math inline">\((3,2,1)\)</span> and <span class="math inline">\((4,3,2,1)\)</span>, respectively.</p>
<p>Now what happens when the number <span class="math inline">\(n\)</span> is not triangular?</p>
<p>For <span class="math inline">\(n=7\)</span>:</p>
<p><img src="../images/bulgarian-solitaire/graph_7.svg" style="background-color: #151515;" /></p>
<p>The dynamical system converges to a cycle between <span class="math inline">\((4,2,1)\)</span>,
<span class="math inline">\((3,3,1)\)</span>, <span class="math inline">\((3,2,2)\)</span>, and <span class="math inline">\((3,2,1,1)\)</span>. When we look at this cycle, we
realize that it is the triangular configuration <span class="math inline">\((3,2,1)\)</span>, plus an
additional box cycling above.</p>
<p>For <span class="math inline">\(n=8\)</span>:</p>
<p><img src="../images/bulgarian-solitaire/graph_8.svg" style="background-color: #151515;" /></p>
<p>Here we get two connected components in the graph. And here again, we
get the triangular partition but with two additional boxes cycling
above.</p>
<p>This property can be proved (Theorem 2 in the paper), and the number
of connected components can be computed exactly for all <span class="math inline">\(n\)</span>.</p>
<h2 id="conclusion">Conclusion</h2>
<p>The Bulgarian Solitaire is a nice example of a discrete dynamical
system. The proof is particularly interesting because it relies on
visualization and physical intuition about the “potential energy” of
the system, which translates directly into the formal proof.</p>
<p><a href="https://arxiv.org/abs/1503.00885">The paper</a> also mentions generalizations. I am interested in particular
in the <em>stochastic</em> Bulgarian Solitaire. Instead of taking one card
from each pile at each step, we take one card from each pile
independently with probability <span class="math inline">\(0&lt;p&lt;1\)</span>. The dynamical system becomes a
Markov chain, and it would be interesting to study (even
experimentally) its convergence properties.</p>
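<p>As a starting point for such experiments, here is a minimal sketch of one stochastic step, under my reading of the rule above: each pile gives up a card independently with probability <span class="math inline">\(p\)</span>, and the collected cards (if any) form the new pile. The function name <code>stochastic_step</code> and the choice of seed are assumptions made for illustration, not something taken from the paper.</p>

```python
import random


def stochastic_step(
    piles: tuple[int, ...], p: float, rng: random.Random
) -> tuple[int, ...]:
    """One step where each pile loses a card independently with probability p."""
    remaining = []
    taken = 0
    for size in piles:
        if p > rng.random():  # true with probability p
            size -= 1
            taken += 1
        if size > 0:
            remaining.append(size)
    if taken > 0:
        remaining.append(taken)  # the collected cards form a new pile
    return tuple(sorted(remaining, reverse=True))


rng = random.Random(0)  # fixed seed so that the run is reproducible
piles = (8, 3, 2, 2)
for _ in range(1000):
    piles = stochastic_step(piles, 0.5, rng)
print(piles, "total cards:", sum(piles))
```

<p>With <span class="math inline">\(p=1\)</span> this reduces to the deterministic step, and with <span class="math inline">\(p=0\)</span> nothing moves, which gives two easy checks for the implementation.</p>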
<h2 id="bonus-bqn-implementation-of-the-simulator">Bonus: <a href="https://mlochbaum.github.io/BQN/">BQN</a> implementation of the simulator</h2>
<p>First, the <code>Step</code> function computes one step of the Bulgarian
Solitaire. It removes 1 from each pile (<code>¯1⊸+</code>), removes the piles
that became empty (<code>0⊸≠</code>), adds a new pile whose size is the original number of piles (<code>≠∾</code>), and
sorts the piles in decreasing order for consistency (<code>∨</code>).</p>
<div class="sourceCode" id="cb8"><pre class="sourceCode txt"><code class="sourceCode default"><span id="cb8-1"><a href="#cb8-1" aria-hidden="true" tabindex="-1"></a>Step←∨≠∾(0⊸≠⊸/¯1⊸+)</span></code></pre></div>
<p>The <code>Partitions</code> function generates all partitions of a given
integer. Conveniently, this was already available in <a href="https://mlochbaum.github.io/bqncrate/?q=integer%20partitions">BQNcrate</a>!</p>
<div class="sourceCode" id="cb9"><pre class="sourceCode txt"><code class="sourceCode default"><span id="cb9-1"><a href="#cb9-1" aria-hidden="true" tabindex="-1"></a>Partitions←{∾⊢´∾⟜(&lt;-⟜↕∘≠{𝕨∾¨∾(-𝕨⌊≠𝕩)↑𝕩}¨⊢)⍟𝕩⋈⋈⋈↕0}</span></code></pre></div>
<p>We can now compute the graph, by first computing all partitions of the
input, then building the edges by calling <code>Step</code> on each partition.</p>
<div class="sourceCode" id="cb10"><pre class="sourceCode txt"><code class="sourceCode default"><span id="cb10-1"><a href="#cb10-1" aria-hidden="true" tabindex="-1"></a>ComputeGraph←{&gt;(⊢⋈Step)¨Partitions𝕩}</span></code></pre></div>
<p>We now generate and print the graphviz code:</p>
<div class="sourceCode" id="cb11"><pre class="sourceCode txt"><code class="sourceCode default"><span id="cb11-1"><a href="#cb11-1" aria-hidden="true" tabindex="-1"></a>EdgeToGraphviz←{∾⟜(&quot;&quot;&quot;;&quot;∾@+10)5↓∾&quot;&quot;&quot; -&gt; &quot;&quot;&quot;⊸∾¨•Fmt¨𝕩}</span>
<span id="cb11-2"><a href="#cb11-2" aria-hidden="true" tabindex="-1"></a>graphvizheader←&quot;bgcolor=&quot;&quot;transparent&quot;&quot;;</span>
<span id="cb11-3"><a href="#cb11-3" aria-hidden="true" tabindex="-1"></a>node [shape=&quot;&quot;rect&quot;&quot;,fontcolor=&quot;&quot;white&quot;&quot;,color=&quot;&quot;white&quot;&quot;];</span>
<span id="cb11-4"><a href="#cb11-4" aria-hidden="true" tabindex="-1"></a>edge [fontcolor=&quot;&quot;white&quot;&quot;,color=&quot;&quot;white&quot;&quot;];&quot;</span>
<span id="cb11-5"><a href="#cb11-5" aria-hidden="true" tabindex="-1"></a>ToGraphviz←{&quot;digraph {&quot;∾graphvizheader∾(∾EdgeToGraphviz¨&lt;˘𝕩)∾&quot;}&quot;}</span>
<span id="cb11-6"><a href="#cb11-6" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb11-7"><a href="#cb11-7" aria-hidden="true" tabindex="-1"></a>n←•ParseFloat⊑•args</span>
<span id="cb11-8"><a href="#cb11-8" aria-hidden="true" tabindex="-1"></a>•Out ToGraphviz ComputeGraph n</span></code></pre></div>
<p>Similarly to the Python version, we can run the script with:</p>
<div class="sourceCode" id="cb12" data-org-language="sh"><pre class="sourceCode bash"><code class="sourceCode bash"><span id="cb12-1"><a href="#cb12-1" aria-hidden="true" tabindex="-1"></a><span class="ex">bqn</span> bulgarian_solitaire.bqn 6 <span class="kw">|</span> <span class="ex">dot</span> <span class="at">-Tpng</span> <span class="op">&gt;</span> graph_6.png</span></code></pre></div>
</section>
]]></description>
    <pubDate>Sun, 08 Feb 2026 00:00:00 UT</pubDate>
    <guid>https://www.lozeve.com/posts/bulgarian-solitaire.html</guid>
    <dc:creator>Dimitri Lozeve</dc:creator>
</item>
<item>
    <title>Language models are injective and hence invertible</title>
    <link>https://www.lozeve.com/posts/language-models-are-injective.html</link>
    <description><![CDATA[<section>
  <p>This is the title of a recent preprint
<span class="citation" data-cites="nikolaou2025_languag_model">(Nikolaou et al. 2025)</span><span class="sidenote-wrapper"><label for="sn-0" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sn-0" class="margin-toggle" /><span class="sidenote">Quick link: <a href="https://arxiv.org/abs/2510.15511">ArXiv:2510.15511</a>.<br />
<br />
</span></span>, which offers some
interesting insights into the mathematical properties of Transformers
when viewed as functions.</p>
<p>The immediate reaction is “this is nonsense”, and the title may be
viewed as either misleading or deliberately provocative in order to
gather interest. Indeed, most functions inside neural
networks<span class="sidenote-wrapper"><label for="sn-1" class="margin-toggle">⊕</label><input type="checkbox" id="sn-1" class="margin-toggle" /><span class="marginnote"><img src="../images/language-models-are-injective/swiglu.png" /> The
SwiGLU activation function. It is clearly not injective.<br />
<br />
</span></span> and particularly Transformers are not
injective. One thinks immediately of activation functions,
normalization layers, and so on: counter-examples abound. So how can
an entire language model be injective?</p>
<p>At a higher level, injectivity seems like an even taller order. From
the user’s point of view, plenty of input prompts will lead to the
same answers (one can think of all prompts ending with “answer with
yes or no”)<span class="sidenote-wrapper"><label for="sn-2" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sn-2" class="margin-toggle" /><span class="sidenote">Indeed, the public discourse around this paper was
dismal, with thousands of users on X/Twitter posting ChatGPT
screenshots “disproving” injectivity…<br />
<br />
</span></span>.</p>
<p>So, let’s actually go through the paper to understand the claim more
precisely!</p>
<h2 id="language-models-as-functions">Language models as functions</h2>
<p>There are many different ways of viewing language models as
mathematical functions. In the first intuition above, we considered
models as functions from vectors to vectors, or from text to
text. This is what makes the result seem counter-intuitive at first.</p>
<p>I think one of the main contributions of the paper is to slightly
change the context: they see language models as functions from a
<em>prompt</em> (i.e. a finite sequence of discrete tokens) to <em>activations</em>
on the output layers (vectors).<span class="sidenote-wrapper"><label for="sn-3" class="margin-toggle">⊕</label><input type="checkbox" id="sn-3" class="margin-toggle" /><span class="marginnote"><img src="../images/language-models-are-injective/map.png" /> The
function we are studying: from prompt space (sequence of tokens) to
latent space (vector of activations).<br />
<br />
</span></span></p>
<p>Formally, we have a finite vocabulary <span class="math inline">\(\mathcal{V}\)</span>, and context
length <span class="math inline">\(K\)</span>. From a sequence of tokens <span class="math inline">\(s \in \mathcal{V}^{\leq K}\)</span> and
parameters <span class="math inline">\(\mathbf{\theta}\)</span>, the language model gives us <em>last-token
representations</em> <span class="math inline">\(\mathbf{r}(s, \mathbf{\theta}) \in \mathbb{R}^d\)</span>.</p>
<p>The central claim of the paper is that <em>this</em> function is injective.</p>
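<p>To make the object of study concrete, here is a toy illustration. It is emphatically not the architecture or the method of the paper: it is a single attention-like layer with random token and positional embeddings, and all sizes (<code>VOCAB</code>, <code>K</code>, <code>D</code>) and names are assumptions chosen so that every prompt can be enumerated. The sketch maps each prompt to a last-token vector and counts how many distinct vectors come out.</p>

```python
import itertools
import math
import random

VOCAB = 3  # toy vocabulary size (assumption)
K = 3      # toy context length (assumption)
D = 4      # toy representation dimension (assumption)

rng = random.Random(0)
token_emb = [[rng.gauss(0.0, 0.5) for _ in range(D)] for _ in range(VOCAB)]
pos_emb = [[rng.gauss(0.0, 0.5) for _ in range(D)] for _ in range(K)]


def last_token_representation(prompt: tuple[int, ...]) -> tuple[float, ...]:
    """Toy map from a prompt (tuple of token ids) to a vector in R^D."""
    # Input vectors: token embedding plus positional embedding.
    xs = [
        [t + q for t, q in zip(token_emb[tok], pos_emb[pos])]
        for pos, tok in enumerate(prompt)
    ]
    query = xs[-1]
    # Scaled dot-product scores of the last token against every position.
    scores = [sum(a * b for a, b in zip(query, x)) / math.sqrt(D) for x in xs]
    top = max(scores)  # subtracted for numerical stability only
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    mixed = [sum(w * x[i] for w, x in zip(weights, xs)) / total for i in range(D)]
    # Residual connection: last-token input plus attention output.
    return tuple(q + m for q, m in zip(query, mixed))


prompts = [
    p for n in range(1, K + 1) for p in itertools.product(range(VOCAB), repeat=n)
]
reps = {p: last_token_representation(p) for p in prompts}
print(len(prompts), "prompts,", len(set(reps.values())), "distinct representations")
```

<p>Finding no collision for one random seed on a toy model proves nothing, but it shows what kind of function the claim is about: discrete prompts in, real vectors out.</p>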
<h2 id="why-is-injectivity-important">Why is injectivity important?</h2>
<p>Since the various components of the language models are not themselves
injective, people tend to assume that there is some “forgetting” and
privacy is built into all language models. Since the input prompt is
transformed into an intermediate representation that goes through many
transformations, information is lost and the input prompt cannot be
reconstructed exactly. This intuition is wrong.</p>
<p>This has practical consequences for systems that store intermediate
activations (e.g. embeddings), on the assumption that this is somehow
more secure, or more respectful of their users’ privacy, than storing
the user inputs directly. In practice, all the information is in the
activations, and the paper provides both a formal proof and a
practical reconstruction method.</p>
<h2 id="structure-of-the-paper">Structure of the paper</h2>
<p>The paper has two main parts.</p>
<ul>
<li>The <em>theoretical</em> part defines a Transformer-based language model
mathematically as a function from sequences of tokens to real-valued
vectors. In this context, they prove that language models based on
common blocks (feed-forward layers, embedding layers, self-attention
blocks, common activations, etc.) are almost surely injective after
random initialization of the parameters, and that this property is
preserved during training.</li>
<li>The <em>practical</em> part builds on these insights to derive a fairly
straightforward algorithm to reconstruct user inputs from the
activations of any intermediate layer, including the output
layer. They show 100% reconstruction accuracy on widely-used
language models.</li>
</ul>
<h2 id="the-theoretical-argument">The theoretical argument</h2>
<h3 id="a-primer-on-analytic-functions">A primer on analytic functions</h3>
<p>To prove the main results, the paper uses the notion of
<em>analyticity</em><span class="sidenote-wrapper"><label for="sn-4" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sn-4" class="margin-toggle" /><span class="sidenote">See Appendix A of the paper for the full
background on analytic functions.<br />
<br />
</span></span>. Analytic functions are highly
regular, actually even “smoother” than infinitely differentiable
functions. It is this “smoothness” that will have important
consequences for proving injectivity.</p>
<p>Intuitively, analytic functions are locally given by a convergent power series:</p>
<div class="definition">
<p>Let <span class="math inline">\(\mathcal{U} \subseteq \mathbb{R}^m\)</span> be open.
A function <span class="math inline">\(f : \mathcal{U} \to \mathbb{R}\)</span> is <em>real-analytic</em> on <span class="math inline">\(\mathcal{U}\)</span> if,
for every <span class="math inline">\(\mathbf{y} \in \mathcal{U}\)</span>, there exist coefficients
<span class="math inline">\(\{c_\mathbf{\alpha} \in \mathbb{R}\}_{\alpha \in \mathbb{N}^m}\)</span>
and <span class="math inline">\(r&gt;0\)</span> such that</p>
<p><span class="math display">\[
f(\mathbf{x}) = \sum_{\mathbf{\alpha} \in \mathbb{N}^m} c_\mathbf{\alpha} (\mathbf{x} - \mathbf{y})^\mathbf{\alpha}
\]</span></p>
<p>for all <span class="math inline">\(\mathbf{x} \in \mathcal{U}\)</span> with <span class="math inline">\(\lVert \mathbf{x} - \mathbf{y} \rVert_2 &lt; r\)</span>.</p>
<p>The set of analytic functions on <span class="math inline">\(\mathcal{U}\)</span> is denoted by <span class="math inline">\(C^\omega(\mathcal{U})\)</span>.</p>
</div>
<p>Intuitively, the function is not only infinitely differentiable: its
Taylor series also converges to the function itself in a neighbourhood
of every point. Locally, the function behaves like a polynomial of
infinite degree, whose truncations approximate it arbitrarily
well. This gives enormous regularity guarantees,
and analytic functions are among the most well-behaved.</p>
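<p>As a small numerical illustration (my own sketch, not taken from the
paper), we can check the definition on the exponential function, whose
coefficients around a point <span class="math inline">\(y\)</span> are <span class="math inline">\(c_k = e^y / k!\)</span>. The helper name
<code>taylor_exp</code> and the evaluation points are arbitrary choices; the
partial sums should approach <span class="math inline">\(e^x\)</span> as the order grows.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">import math

def taylor_exp(x, y, order):
    """Partial sum of the power series of exp centred at y, evaluated at x."""
    # Every derivative of exp at y equals exp(y), so c_k = exp(y) / k!.
    return sum(math.exp(y) * (x - y) ** k / math.factorial(k)
               for k in range(order + 1))

y, x = 1.0, 1.5
# Approximation error for increasing truncation orders.
errors = [abs(taylor_exp(x, y, n) - math.exp(x)) for n in (1, 2, 4, 8, 16)]
print(errors)</code></pre></div>
<p>None of the coefficients vanish here: the function is equal to the
full series near <span class="math inline">\(y\)</span>, and every finite truncation is only an
approximation.</p>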
<p>A vector-valued function is analytic if all its components are
analytic, and we can similarly extend the definition to matrix-valued
functions. This is very important for our discussions since the
component blocks of Transformers are often matrix-valued,
e.g. attention layers.</p>
<p>Importantly, the set of analytic functions is closed under common
operations: addition, product, quotient (wherever the denominator does
not vanish), and composition. This means that we only have to prove the
analyticity of the common building blocks of Transformers, and the
analyticity of the overall function then follows immediately through
addition and composition.</p>
<p>Moreover, analytic functions have a key property relating to their
zero sets:</p>
<div class="proposition">
<p>Let <span class="math inline">\(\mathcal{U} \subseteq \mathbb{R}^m\)</span> be connected and open,
and let <span class="math inline">\(f \in C^\omega(\mathcal{U}; \mathbb{R}^n)\)</span>.
If <span class="math inline">\(f \not\equiv \mathbf{0}_n\)</span>, then its zero set
<span class="math display">\[
Z(f) := f^{-1}(\{\mathbf{0}_n\}) = \{\mathbf{x} \in \mathcal{U} : f(\mathbf{x}) = \mathbf{0}_n\}
\]</span>
has Lebesgue measure zero in <span class="math inline">\(\mathbb{R}^m\)</span>.</p>
</div>
<div class="proof">
<p>The proof can be found in <span class="citation" data-cites="mityagin2020_zero_set">Mityagin (2020)</span><span class="sidenote-wrapper"><label for="sn-5" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sn-5" class="margin-toggle" /><span class="sidenote">The paper actually cites the (identical) <a href="https://arxiv.org/abs/1512.07276">preprint version</a>.<br />
<br />
</span></span>.</p>
</div>
<p>The gist is simply this: if <span class="math inline">\(f\)</span> is not zero everywhere, then it is
zero almost nowhere. If there is an entire continuous region where <span class="math inline">\(f\)</span>
is zero, it means that <span class="math inline">\(f\)</span> is zero everywhere! This is a very strong
property of analytic functions: fundamentally, they are so regular
that if they are zero on any significant portion of their domain,
their smoothness imposes that they become zero everywhere.</p>
<p>This proposition is key to the paper as it will be used to prove
injectivity. We can now go on to the actual results of the paper.</p>
<h3 id="language-models-are-analytic">Language models are analytic</h3>
<p>A language model is a composition of many building blocks. Transformer
blocks are themselves composed of self-attention layers and MLPs,
which are themselves compositions of stacking operations,
normalizations, softmax, exponentials, and polynomials. By going
bottom-up, we can start from basic blocks (polynomial functions, the
exponential function), show that they are analytic, and since analytic
functions are closed under composition, show that the entire language
model is analytic.</p>
<p>The proofs that these building blocks are analytic are detailed in
Appendix A of the paper. The full language model is defined
mathematically as a composition of these building blocks in
Appendix B, with Proposition B.3 putting everything together and
proving that Transformers are analytic.</p>
<h3 id="injectivity-at-initialization">Injectivity at initialization</h3>
<p>Let us draw a set of parameters <span class="math inline">\(\mathbf{\theta}\)</span> at random, from a
distribution with a density. For any two distinct prompts <span class="math inline">\(s, s' \in
\mathcal{V}^{\leq K}\)</span>, the probability that the output activations
<span class="math inline">\(\mathbf{r}(s, \mathbf{\theta})\)</span> and <span class="math inline">\(\mathbf{r}(s', \mathbf{\theta})\)</span>
are equal is zero.</p>
<p>This is because the function <span class="math inline">\(\mathbf{\theta} \mapsto \mathbf{r}(s,
\mathbf{\theta}) - \mathbf{r}(s', \mathbf{\theta})\)</span> is a difference of
analytic functions, and is therefore analytic. As long as it is not
identically zero (it suffices to exhibit a single parameter setting
that separates the two prompts), the key proposition above tells us
that its zero set has measure zero. So by drawing random parameters
<span class="math inline">\(\mathbf{\theta}\)</span>, you have almost surely <span class="math inline">\(\mathbf{r}(s,
\mathbf{\theta}) \neq \mathbf{r}(s', \mathbf{\theta})\)</span>.</p>
<p>We have just proven that if <span class="math inline">\(s \neq s'\)</span>, then, with probability 1 over
<span class="math inline">\(\mathbf{\theta}\)</span>, we have that <span class="math inline">\(\mathbf{r}(s, \mathbf{\theta}) \neq
\mathbf{r}(s', \mathbf{\theta})\)</span>. Since there are only finitely many pairs of prompts in
<span class="math inline">\(\mathcal{V}^{\leq K}\)</span>, this holds for all pairs simultaneously with
probability 1, i.e., <span class="math inline">\(\mathbf{r}\)</span> is injective!</p>
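<p>To make the argument concrete, here is a toy experiment (my own
sketch, not the paper’s setup). The “model” is a tiny tanh recurrence,
which is analytic in its parameters but is <em>not</em> a Transformer; the
sizes <code>VOCAB</code>, <code>MAX_LEN</code>, <code>DIM</code> and the Gaussian initialization are
assumptions chosen purely for illustration. The script enumerates every
prompt up to the maximum length and reports the smallest distance
between two outputs.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">import itertools
import math
import random

VOCAB, MAX_LEN, DIM = 4, 3, 4  # toy sizes, chosen only for illustration

def init_params(seed):
    """Draw parameters from a distribution with a density (i.i.d. Gaussians)."""
    rng = random.Random(seed)
    embed = [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(VOCAB)]
    mix = [[rng.gauss(0.0, 0.5) for _ in range(DIM)] for _ in range(DIM)]
    return embed, mix

def toy_model(prompt, params):
    """Analytic toy 'model': h = tanh(mix @ h + embed[token]) for each token."""
    embed, mix = params
    h = [0.0] * DIM
    for token in prompt:
        h = [math.tanh(sum(w * x for w, x in zip(row, h)) + e)
             for row, e in zip(mix, embed[token])]
    return h

params = init_params(seed=0)
# All prompts of length 1 to MAX_LEN over the toy vocabulary.
prompts = [p for n in range(1, MAX_LEN + 1)
           for p in itertools.product(range(VOCAB), repeat=n)]
outputs = {p: toy_model(p, params) for p in prompts}
# Smallest Euclidean distance between the outputs of two distinct prompts.
min_dist = min(math.dist(outputs[a], outputs[b])
               for a, b in itertools.combinations(prompts, 2))
print(len(prompts), "prompts, smallest pairwise distance:", min_dist)</code></pre></div>
<p>If the argument above carries over to this toy model, the smallest
distance should be strictly positive for almost every seed.</p>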
<h3 id="injectivity-is-preserved-during-training">Injectivity is preserved during training</h3>
<p>However, we only draw <span class="math inline">\(\mathbf{\theta}\)</span> at random during
initialization. What if during training, as <span class="math inline">\(\mathbf{\theta}\)</span> changes,
the language model loses its injectivity?</p>
<p>It turns out that a gradient descent step is itself an analytic map
of the parameters (and so are its variants: stochastic gradient descent,
minibatch GD, etc.). It is
therefore once again very smooth, and can only stretch and bend the
parameter space, but not collapse regions of positive volume into
single points. So as the parameters change during training, the points
that are apart stay apart and do not collapse onto each
other<span class="sidenote-wrapper"><label for="sn-6" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sn-6" class="margin-toggle" /><span class="sidenote">The argument is made formally in the paper using
topological properties of the Jacobian of the gradient descent
function. I won’t go into the details here, see Section 2 of the paper
for an overview, and Appendix C for the full proof.<br />
<br />
</span></span>.</p>
<p>The injectivity property established at initialization is therefore
preserved at each step of the training. This concludes the main result
of the paper: a fully-trained language model based on Transformers is
injective. Two different input prompts will always lead to two
different output activations in the final layer (and in any
intermediate layer for that matter).</p>
<h3 id="remarks">Remarks</h3>
<p>The paper bases all the proofs on a definition of a “standard”
language model based on Transformers. However, there are architecture
choices that can break the fundamental hypotheses. In particular, some
activation functions like ReLU are not analytic (ReLU is not even
differentiable at zero), and things like quantization also break the
analytic property of the entire model. In these cases, the proof no
longer applies and injectivity is not guaranteed.</p>
<p>However, most modern language model architectures use smooth
activations<span class="sidenote-wrapper"><label for="sn-7" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sn-7" class="margin-toggle" /><span class="sidenote">See <a href="https://jcarlosroldan.com/post/348/">this great article</a> on SwiGLU and family for an
overview of activation functions used in LLMs.<br />
<br />
</span></span>, and quantization is not the norm
everywhere. As we will see in the next section, the theoretical result
holds very well in practice for LLMs that are widely used today.</p>
<h2 id="practical-application">Practical application</h2>
<p>Proving a theorem might not be enough, as there are plenty of things
that are ignored in the mathematical formulation (e.g. floating-point
precision). Moreover, real-world LLMs are very complex systems with
thousands of building blocks. For instance, activation functions are
part of a very active research domain and change frequently, so there
is no guarantee that all activation functions found in the wild follow
the requirements of the theorem.</p>
<p>The first practical test is to check for collisions: two input prompts
that lead to the same activation vector. If we can exhibit two
sequences of tokens like this, we will have broken the injectivity.</p>
<p>For large vocabularies, the input space is too big to be checked
exhaustively. Instead, the authors start with a dataset of real-world
prompts and find the ones whose outputs are the closest. They then
append all the possible sequences of tokens to these prompts. The test
is therefore “exhaustive” only relative to “prefix” prompts that were
already very close.</p>
<p>What they find is that the distances between all pairs of outputs
stay far above a collision threshold of <span class="math inline">\(10^{-6}\)</span> (probably chosen to
account for floating-point error). As this check is not truly
exhaustive and the thresholds have been fixed somewhat arbitrarily, it
is not conclusive, but it acts as a nice “sanity check” for the
theory.</p>
<p><img src="../images/language-models-are-injective/collisions.png" />
<span class="sidenote-wrapper"><label for="sn-8" class="margin-toggle">⊕</label><input type="checkbox" id="sn-8" class="margin-toggle" /><span class="marginnote">Distance between final layer and intermediate layer outputs
for the “exhaustive” check on input prompts. All are significantly
above the collision threshold as defined in the paper.<br />
<br />
</span></span></p>
<p>The second practical experiment is to exploit the injectivity result
to build an algorithm to invert a language model. Starting from the
last-layer activations, they build a simple gradient-guided search
that reconstructs the input sequence of tokens. The results are
impressive: they get 100% token-level accuracy on all prompts,
remarkably quickly. The following results are for the GPT-2
Small model:</p>
<table>
<thead>
<tr>
<th>Method</th>
<th>Mean Time (s)</th>
<th>Accuracy</th>
</tr>
</thead>
<tbody>
<tr>
<td>HardPrompts</td>
<td>6132.59 ± 104.61</td>
<td>0.00</td>
</tr>
<tr>
<td>BruteForce</td>
<td>3889.61 ± 691.17</td>
<td><strong>1.00</strong></td>
</tr>
<tr>
<td>SipIt</td>
<td><strong>28.01 ± 35.87</strong></td>
<td><strong>1.00</strong></td>
</tr>
</tbody>
</table>
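<p>To get a feel for why inversion is even possible, here is a
deliberately simplified sketch. It is <em>not</em> SipIt: I replace the
gradient-guided search with an exhaustive search over the vocabulary at
each position, and the “model” is a hypothetical tanh recurrence rather
than a Transformer. All names and sizes (<code>step</code>, <code>invert</code>, <code>VOCAB</code>, <code>DIM</code>)
are assumptions made for the example. What it keeps is the causal
structure: the state at a given position depends only on the tokens up
to that position, so tokens can be recovered one at a time.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">import math
import random

VOCAB, DIM = 6, 4  # toy sizes (assumptions for the sketch)

rng = random.Random(0)
EMBED = [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(VOCAB)]
MIX = [[rng.gauss(0.0, 0.5) for _ in range(DIM)] for _ in range(DIM)]

def step(h, token):
    """One causal update of the toy model: tanh(MIX @ h + EMBED[token])."""
    return [math.tanh(sum(w * x for w, x in zip(row, h)) + e)
            for row, e in zip(MIX, EMBED[token])]

def hidden_states(prompt):
    """State after each token, i.e. what an observer of the layer sees."""
    states, h = [], [0.0] * DIM
    for token in prompt:
        h = step(h, token)
        states.append(h)
    return states

def invert(states):
    """Recover the prompt position by position from the observed states."""
    prompt, h = [], [0.0] * DIM
    for target in states:
        # Try every token and keep the one that best explains the observation.
        best = min(range(VOCAB), key=lambda tok: math.dist(step(h, tok), target))
        prompt.append(best)
        h = step(h, best)
    return prompt

secret = [3, 1, 4, 1, 5, 2]
recovered = invert(hidden_states(secret))
print(recovered == secret)</code></pre></div>
<p>The cost of this naive version grows with the vocabulary size times
the prompt length, which is where a smarter candidate ordering (such as
one guided by gradients) would help.</p>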
<h2 id="conclusion-rebuilding-our-intuition">Conclusion: rebuilding our intuition</h2>
<h3 id="expressivity-and-regularization">Expressivity and regularization</h3>
<p>Transformers (and neural networks in general) are very useful because
they generalize very well. The class of learned functions needs to be
very large to be as expressive as possible, but still constrained to
avoid overfitting. The regularization built into our language models
can be understood as properties of the functions they
represent. Analytic functions are a class of very smooth, very regular
functions that have a lot of nice properties, so it is no accident that
they are very good targets for a machine learning algorithm.</p>
<h3 id="curse-of-dimensionality">Curse of dimensionality</h3>
<p>The main result of the paper does not seem so unintuitive if its
hypotheses are laid out clearly. What we have is a function with
inputs in a <em>discrete space</em> (finite sequences over a finite
vocabulary), and outputs in a <em>continuous, high-dimensional</em> vector
space. Since the output space is so much larger, it would actually be
very surprising if a non-pathological function made several inputs
“collapse” into a single high-dimensional vector!</p>
<p>Moreover, there is an instance of the <em>curse of dimensionality</em>: if
you draw two vectors with random directions in a high-dimensional space,
they will be nearly orthogonal with high probability
<span class="citation" data-cites="vershynin2018_high_dimen_probab">(Vershynin 2018, sec. 3.3.3)</span>. High dimensions
tend to “push everything to the edges” and so collisions become
increasingly unlikely. So another way to see this result is that it is
unintuitive only because probability on high-dimensional vector spaces
is in itself quite unintuitive!</p>
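<p>The near-orthogonality claim is easy to check numerically. The
following sketch (the function name, the number of trials, and the
dimensions are arbitrary choices of mine) draws pairs of Gaussian
vectors and averages the absolute cosine of the angle between them.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">import math
import random

def mean_abs_cosine(dim, trials=200, seed=0):
    """Average |cos(angle)| between pairs of random Gaussian directions."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        u = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        dot = sum(a * b for a, b in zip(u, v))
        # math.hypot(*u) is the Euclidean norm of u.
        total += abs(dot) / (math.hypot(*u) * math.hypot(*v))
    return total / trials

results = {dim: mean_abs_cosine(dim) for dim in (2, 10, 100, 1000)}
for dim, value in results.items():
    print(dim, round(value, 3))</code></pre></div>
<p>The average should shrink steadily as the dimension grows: in high
dimensions, two random directions have almost no overlap.</p>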
<h2 class="unnumbered" id="references">References</h2>
<div id="refs" class="references csl-bib-body hanging-indent" data-entry-spacing="0" role="list">
<div id="ref-mityagin2020_zero_set" class="csl-entry" role="listitem">
Mityagin, Boris S. 2020. <span>“The Zero Set of a Real Analytic Function.”</span> <em>Mathematical Notes</em> 107 (3–4): 529–30. <a href="https://doi.org/10.1134/s0001434620030189">https://doi.org/10.1134/s0001434620030189</a>.
</div>
<div id="ref-nikolaou2025_languag_model" class="csl-entry" role="listitem">
Nikolaou, Giorgos, Tommaso Mencattini, Donato Crisostomi, Andrea Santilli, Yannis Panagakis, and Emanuele Rodolà. 2025. <span>“Language Models Are Injective and Hence Invertible.”</span> <a href="https://arxiv.org/abs/2510.15511">https://arxiv.org/abs/2510.15511</a>.
</div>
<div id="ref-vershynin2018_high_dimen_probab" class="csl-entry" role="listitem">
Vershynin, Roman. 2018. <em>High-Dimensional Probability: An Introduction with Applications in Data Science</em>. Cambridge Series in Statistical and Probabilistic Mathematics 47. <span>Cambridge</span>: <span>Cambridge University Press</span>. <a href="https://www.math.uci.edu/~rvershyn/papers/HDP-book/HDP-book.html">https://www.math.uci.edu/~rvershyn/papers/HDP-book/HDP-book.html</a>.
</div>
</div>
</section>
]]></description>
    <pubDate>Tue, 25 Nov 2025 00:00:00 UT</pubDate>
    <guid>https://www.lozeve.com/posts/language-models-are-injective.html</guid>
    <dc:creator>Dimitri Lozeve</dc:creator>
</item>
<item>
    <title>Reading notes: Kolmogorov-Arnold Networks</title>
    <link>https://www.lozeve.com/posts/kolmogorov-arnold-networks.html</link>
    <description><![CDATA[<section>
  <p>This paper <span class="citation" data-cites="liu2024_kan">(Liu et al. 2024)</span> proposes an alternative to multi-layer
perceptrons (MLPs) in machine learning.</p>
<p>The basic idea is that MLPs have parameters on the nodes of the
computation graph (the weights and biases on each cell), and that KANs
have the parameters on the edges. Each edge has a learnable activation
function parameterized as a spline.</p>
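<p>A minimal sketch of this idea, under simplifying assumptions of my
own: each edge carries a piecewise-linear function (a degree-1 spline,
simpler than the splines of the paper) defined by its values on a fixed
grid, and each output node simply sums its incoming edges. The names
<code>make_edge</code> and <code>kan_layer</code> are hypothetical and not taken from the
paper’s code.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">import bisect

def make_edge(grid, values):
    """One learnable edge function: linear interpolation of `values` on `grid`."""
    def phi(x):
        # Clamp x to the grid, then find the interval that contains it.
        x = min(max(x, grid[0]), grid[-1])
        i = min(bisect.bisect_right(grid, x), len(grid) - 1)
        t = (x - grid[i - 1]) / (grid[i] - grid[i - 1])
        return (1.0 - t) * values[i - 1] + t * values[i]
    return phi

def kan_layer(edges, inputs):
    """edges[j][i] is the function on the edge from input i to output j."""
    return [sum(phi(x) for phi, x in zip(row, inputs)) for row in edges]

grid = [-1.0, 0.0, 1.0]
abs_edge = make_edge(grid, [1.0, 0.0, 1.0])  # interpolates |x| at the knots
id_edge = make_edge(grid, [-1.0, 0.0, 1.0])  # interpolates x at the knots
out = kan_layer([[abs_edge, id_edge]], [-0.5, 0.25])
print(out)</code></pre></div>
<p>Here the trainable parameters would be the <code>values</code> lists attached to
the edges; the nodes themselves have no parameters and only add things
up.</p>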
<p>The network is learned at two levels, which allows for “adjusting
locally”:</p>
<ul>
<li>the overall shape of the computation graph and its connexions
(external degrees of freedom, to learn the compositional structure),</li>
<li>the parameters of each activation function (internal degrees of
freedom).</li>
</ul>
<p>It is based on the <a href="https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Arnold_representation_theorem">Kolmogorov-Arnold representation theorem</a>, which
says that any continuous multivariate function on a bounded domain can
be represented using only sums and compositions of continuous
univariate functions. We recover the distinction
between the compositional structure of the sum and the structure of
each internal univariate function.</p>
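<p>For reference, the usual statement of the theorem (in my own
notation, which may differ from the paper’s) is that any continuous
<span class="math inline">\(f : [0,1]^n \to \mathbb{R}\)</span> can be written as</p>
<p><span class="math display">\[
f(x_1, \ldots, x_n) = \sum_{q=0}^{2n} \Phi_q \left( \sum_{p=1}^{n} \varphi_{q,p}(x_p) \right),
\]</span></p>
<p>where all the <span class="math inline">\(\Phi_q : \mathbb{R} \to \mathbb{R}\)</span> and <span class="math inline">\(\varphi_{q,p} : [0,1] \to \mathbb{R}\)</span>
are continuous univariate functions. The only genuinely multivariate
operation involved is addition.</p>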
<p>The theorem can be interpreted as two layers, and the paper then
generalizes it to multiple layers of arbitrary width. In the theorem,
the univariate functions are arbitrary and can be complex (even
fractal), so the hope is that allowing for arbitrary depth and width
will make it possible to use only splines. They derive an approximation theorem:
when replacing the arbitrary continuous functions of the
Kolmogorov-Arnold representation with splines, we can bound the error
independently of the dimension. (However there is a constant which
depends on the function and its representation, and therefore on the
dimension…) Theoretical scaling laws in the number of parameters are
much better than for MLPs, and moreover, experiments show that KANs
are much closer to their theoretical bounds than MLPs.</p>
<p>KANs have interesting properties:</p>
<ul>
<li>The splines are interpolated on grid points which can be iteratively
refined. The fact that there is a notion of “fine-grainedness” is
very interesting: it allows adding parameters without having to
retrain everything.</li>
<li>Larger is not always better: the quality of the reconstruction
depends on finding the optimal shape of the network, which should
match the structure of the function we want to approximate. This
optimal shape is found via sparsification, pruning, and
regularization (non-trivial).</li>
<li>We can have a “human in the loop” during training, guiding pruning,
and “symbolifying” some activations (e.g. by recognizing that an
activation function is actually a cosine and replacing it
directly). This symbolic discovery can be guided by a symbolic
system recognizing some functions. It’s therefore a mix of symbolic
regression and numerical regression.</li>
</ul>
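<p>The grid refinement idea from the first point is easiest to see on
piecewise-linear functions. In the sketch below (my own simplification,
with a hypothetical <code>refine</code> helper), a midpoint is inserted in every
interval and initialized by interpolation, so the refined function
starts out identical to the coarse one while having more parameters to
train. For higher-order splines, transferring the coarse function to
the fine grid is less trivial and is not covered here.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">def refine(grid, values):
    """Insert a midpoint in every interval, keeping the same piecewise-linear function."""
    new_grid, new_values = [grid[0]], [values[0]]
    intervals = zip(zip(grid, grid[1:]), zip(values, values[1:]))
    for (x0, x1), (v0, v1) in intervals:
        # The value at the midpoint of a linear piece is the average of its ends.
        new_grid += [(x0 + x1) / 2.0, x1]
        new_values += [(v0 + v1) / 2.0, v1]
    return new_grid, new_values

grid, values = [0.0, 1.0, 2.0], [0.0, 1.0, 0.0]
fine_grid, fine_values = refine(grid, values)
print(fine_grid, fine_values)</code></pre></div>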
<p>They test mostly with scientific applications in mind: reconstructing
equations from physics and pure maths. Conceptually, it has a lot of
overlap with Neural Differential Equations
<span class="citation" data-cites="chenNeuralOrdinaryDifferential2018 ruthotto2024_differ_equat">(Chen et al. 2018; Ruthotto 2024)</span>
and “scientific ML” in general.</p>
<p>There is an interesting discussion at the end about KANs as the model
of choice for the “language of science”. The idea is that LLMs are
important because they are useful for natural language, and KANs
could fill the same role for the language of functions. The
interpretability and adaptability (being able to be manipulated and
guided during training by a domain expert) is thus a core feature that
traditional deep learning models lack.</p>
<p>There are still challenges: mostly, it’s unclear how KANs perform on
other types of data and other modalities, but the results are very
encouraging. There are also computational challenges: KANs are
obviously much slower to train, but there has been almost no
engineering work to optimize them, so this is expected. The fact
that the operations are not easily batchable (compared to matrix
multiplication) is however worrying for scalability to large networks.</p>
<h2 class="unnumbered" id="references">References</h2>
<div id="refs" class="references csl-bib-body hanging-indent" data-entry-spacing="0" role="list">
<div id="ref-chenNeuralOrdinaryDifferential2018" class="csl-entry" role="listitem">
Chen, Ricky T. Q., Yulia Rubanova, Jesse Bettencourt, and David Duvenaud. 2018. <span>“Neural <span>Ordinary Differential Equations</span>,”</span> June. <a href="http://arxiv.org/abs/1806.07366">http://arxiv.org/abs/1806.07366</a>.
</div>
<div id="ref-liu2024_kan" class="csl-entry" role="listitem">
Liu, Ziming, Yixuan Wang, Sachin Vaidya, Fabian Ruehle, James Halverson, Marin Soljačić, Thomas Y. Hou, and Max Tegmark. 2024. <span>“<span>KAN</span>: <span>Kolmogorov</span>-<span>Arnold</span> <span>Networks</span>.”</span> arXiv. <a href="https://doi.org/10.48550/arXiv.2404.19756">https://doi.org/10.48550/arXiv.2404.19756</a>.
</div>
<div id="ref-ruthotto2024_differ_equat" class="csl-entry" role="listitem">
Ruthotto, Lars. 2024. <span>“Differential <span>Equations</span> for <span>Continuous</span>-<span>Time</span> <span>Deep</span> <span>Learning</span>.”</span> <em>Notices of the American Mathematical Society</em> 71 (05). <a href="https://doi.org/10.1090/noti2930">https://doi.org/10.1090/noti2930</a>.
</div>
</div>
</section>
]]></description>
    <pubDate>Sat, 08 Jun 2024 00:00:00 UT</pubDate>
    <guid>https://www.lozeve.com/posts/kolmogorov-arnold-networks.html</guid>
    <dc:creator>Dimitri Lozeve</dc:creator>
</item>
<item>
    <title>Randomness and Uncertainty: from random noise to predictable oscillations via differential equations</title>
    <link>https://www.lozeve.com/posts/randomness-and-uncertainty.html</link>
    <description><![CDATA[<section>
  <p>This <a href="https://www.lms.ac.uk/sites/default/files/inline-files/NLMS_505_for%20web.pdf#page=24">article (PDF)</a> by Nick Trefethen in the <a href="https://www.lms.ac.uk/publications/lms-newsletter"><em>London Mathematical Society Newsletter</em></a> demonstrates an interesting relationship between
what we perceive as “randomness” and what we perceive as “certainty”.</p>
<p>There are many ways to generate pseudo-random numbers that look
perfectly “random” but are actually the output of fully deterministic
processes. Trefethen gives an example of a chaotic system based on a
logistic equation.</p>
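<p>As an illustration of this kind of deterministic “randomness”, here
is a sketch using the classic logistic map with parameter
<span class="math inline">\(r = 4\)</span>. This is my own choice of example and parameters, not
necessarily the exact system used in the article. Two orbits that start
extremely close to each other end up completely decorrelated after a
few dozen iterations.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">def logistic_orbit(x0, steps, r=4.0):
    """Iterate the deterministic rule x_next = r * x * (1 - x)."""
    orbit = [x0]
    for _ in range(steps):
        orbit.append(r * orbit[-1] * (1.0 - orbit[-1]))
    return orbit

a = logistic_orbit(0.2, 60)
b = logistic_orbit(0.2 + 1e-12, 60)  # tiny perturbation of the initial value
gap = [abs(x - y) for x, y in zip(a, b)]
print(gap[0], max(gap))</code></pre></div>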
<p>But more interesting (to me) and more original may be the other way
around: how to get certainty from randomness. There are ordinary
differential equations that can take random noise as input, and whose
solution is very stable, oscillating between two possible
values. Given a function <span class="math inline">\(f\)</span> approximating random white noise, the
solution to the ODE
<span class="math display">\[ y' = y - y^3 + Cf(t) \]</span>
is “bistable” and always remains close to either -1 or 1. The parameter <span class="math inline">\(C\)</span>
controls the half-life of the transitions.</p>
<p>To explore this behaviour, I replicated Trefethen’s experiments in
Python with the <a href="https://docs.kidger.site/diffrax/">Diffrax</a> library (a differential equations solver based
on <a href="https://jax.readthedocs.io/">JAX</a>). The full code is <a href="https://gist.github.com/dlozeve/4924e71097e1d86933e8d5528cd2f6b4">in this Gist</a>.</p>
<p>It suffices to define a simple function for the vector field and to
give it to a solver:</p>
<div class="sourceCode" id="cb1"><pre class="sourceCode python"><code class="sourceCode python"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a><span class="kw">def</span> f(t, y, args):</span>
<span id="cb1-2"><a href="#cb1-2" aria-hidden="true" tabindex="-1"></a>    <span class="cf">return</span> y <span class="op">-</span> y<span class="op">**</span><span class="dv">3</span> <span class="op">+</span> <span class="fl">0.4</span> <span class="op">*</span> args[t.astype(<span class="bu">int</span>)]</span></code></pre></div>
<p>where <code>args</code> will be the input, as a simple array of
normally-distributed random values. <span class="math inline">\(C\)</span> is hardcoded as 0.4 as in the
article, but could be passed through <code>args</code> as well (it can be a
dictionary).</p>
<p><img src="../images/randomness_uncertainty.png" /></p>
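<p>For readers without JAX at hand, the same equation can be explored
with a plain forward-Euler loop in pure Python. This is a rough sketch
that is independent of the Diffrax code in the Gist: the function name,
step size, and duration are assumptions of mine, and a fixed-step Euler
scheme is much cruder than a proper solver. As in the vector field
above, the noise is held constant over each unit interval of time.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">import random

def simulate(c=0.4, units=200, steps_per_unit=20, seed=0):
    """Forward-Euler integration of y' = y - y**3 + c * f(t), starting at 0."""
    rng = random.Random(seed)
    dt = 1.0 / steps_per_unit
    y, path = 0.0, [0.0]
    for unit in range(units):
        noise = rng.gauss(0.0, 1.0)  # f(t) is constant on [unit, unit + 1)
        for step in range(steps_per_unit):
            y += dt * (y - y**3 + c * noise)
            path.append(y)
    return path

path = simulate()
print(min(path), max(path))</code></pre></div>
<p>Plotting <code>path</code> against time should show the solution hovering
around one of the two stable values, with transitions whose frequency
depends on <code>c</code>.</p>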
</section>
]]></description>
    <pubDate>Fri, 20 Oct 2023 00:00:00 UT</pubDate>
    <guid>https://www.lozeve.com/posts/randomness-and-uncertainty.html</guid>
    <dc:creator>Dimitri Lozeve</dc:creator>
</item>
<item>
    <title>High reliability organizations</title>
    <link>https://www.lozeve.com/posts/high-reliability-organizations.html</link>
    <description><![CDATA[<section>
  <p><span class="citation" data-cites="dietterich2018_robus_artif_intel_robus_human_organ">Dietterich (2018)</span> is an
interesting article about how to make <em>robust</em> AI. High risk
situations require the combined AI and human system to operate as a
high reliability organization (HRO). Only such an organization can
have sufficiently strong safety and reliability properties to ensure
that powerful AI systems will not amplify human mistakes.</p>
<h2 id="reliability-and-high-reliability-organizations">Reliability and high-reliability organizations</h2>
<p>The concept of high reliability organization (HRO) comes from
<span class="citation" data-cites="weick1999_organ">Weick, Sutcliffe, and Obstfeld (1999)</span>. Examples of HROs include nuclear power
plants, aircraft carriers, air traffic control systems, and space
shuttles. They share several characteristics: an unforgiving
environment, vast potential for error, and dramatic scales in the case
of a failure.</p>
<p>The paper identifies five processes common to HROs, which the authors
group into the concept of <em>mindfulness</em> (a kind of “enriched
awareness”). Mindfulness is about allocating and conserving the attention
of the group. It includes both being consciously aware of the
situation and <em>acting</em> on this understanding.</p>
<p>This mindfulness leads to the capacity to discover and manage
unexpected events, which in turn leads to reliability.</p>
<h2 id="characteristics-of-a-high-reliability-organization">Characteristics of a high reliability organization</h2>
<p>An HRO is an organization with the following five attributes.</p>
<h3 id="preoccupation-with-failure">Preoccupation with failure</h3>
<p>Failures in HROs are extremely rare. To make it easier to learn from
them, the organization has to broaden the data set by expanding the
definition of failure and studying all types of anomalies and near
misses. Additionally, the analysis is much richer, and always
considers the reliability of the entire system, even for localized
failures.</p>
<p>HROs also study the <em>absence</em> of failure: why it didn’t fail, and the
possibility that no flaws were identified because there wasn’t enough
attention to potential flaws.</p>
<p>To further increase the number of data points to study, HROs encourage
reporting all mistakes and anomalies by anyone. Contrary to most
organizations, members are rewarded for reporting potential failures,
even if their analysis is wrong or if they are responsible for
them. This creates an atmosphere of “psychological safety” essential
for transparency and honesty in anomaly reporting.</p>
<h3 id="reluctance-to-simplify-interpretations">Reluctance to simplify interpretations</h3>
<p>HROs avoid having a single interpretation for a given event. They
encourage generating multiple, complex, contradicting interpretations
for every phenomenon. These varied interpretations enlarge the number
of concurrent precautions. Redundancy is implemented not only via
duplication, but via skepticism of existing systems.</p>
<p>People are encouraged to have different views, different backgrounds,
and are re-trained often. To resolve the contradictions and the
oppositions of views, interpersonal and human skills are highly
valued, possibly more than technical skills.</p>
<h3 id="sensitivity-to-operations">Sensitivity to operations</h3>
<p>HROs rely a lot on “situational awareness”. They ensure that no
<a href="https://en.wikipedia.org/wiki/Emergence">emergent phenomena</a> arise in the system: all outputs should always be
explained by the known inputs. Otherwise, there might be other forces
at work that need to be identified and dealt with. A small group of
people may be dedicated to this awareness at all times.</p>
<h3 id="commitments-to-resilience">Commitments to resilience</h3>
<p>HROs train people to be experts at combining all processes and events
to improve their reactions and their improvisation skills. Everyone
should be an expert at anticipating potential adverse events, and
managing surprise. When events get outside normal operational
boundaries, organization members self-organize into small dedicated
teams to improvise solutions to novel problems.</p>
<h3 id="underspecification-of-structures">Underspecification of structures</h3>
<p>There is no fixed reporting path: anyone can raise an alarm and halt
operations. Everyone can take decisions related to their technical
expertise. Information is spread directly through the organization, so
that people with the right expertise are warned first. Power is
delegated to operational personnel, but management is completely
available at all times.</p>
<h2 id="hros-vs-non-hros">HROs vs non-HROs</h2>
<p>Non-HROs increasingly exhibit some properties of HROs. This may be due
to the fact that highly competitive environments with short cycles
create unforgiving conditions (high performance standards, low
tolerance for errors). However, most everyday organizations do not put
failure at the heart of their thinking.</p>
<p>Failures in non-HROs come from the same sources: cultural assumptions
on the effectiveness or accuracy of previous precautionary measures.</p>
<p>Preoccupation with failure also reveals the couplings and the complex
interactions in the manipulated systems. This in turn leads to
uncoupling and less emergent behaviour over time. People understand
better long-term, complex interactions.</p>
<h2 id="reliability-vs-performance-and-the-importance-of-learning">Reliability vs performance, and the importance of learning</h2>
<p>An interesting discussion is around the (alleged) trade-off between
reliability and performance. It is assumed that HROs put the focus on
reliability at the cost of throughput. As a consequence, it may not
make sense for ordinary organizations to put as much emphasis on
safety and reliability, as the cost to the business may be
prohibitive.</p>
<p>However, investments in safety can also be viewed as investments in
<em>learning</em>. HROs view safety and reliability as a process of search
and learning (constant search for anomalies, learning the interactions
between the parts of a complex system, ensuring we can link outputs to
known inputs). As such, investments in safety encourage collective
knowledge production and dissemination.</p>
<p>Mindfulness also stimulates intrinsic motivation and perceptions of
efficacy and control, which increase individual performance. (People
who strongly believe they are in control of their own output are more
motivated and more efficient.)</p>
<p>HROs may encourage mindfulness out of operational necessity, in the
face of the catastrophic consequences of any failure, but non-HROs can
adopt the same practice to boost efficiency and learning to gain
competitive advantage.</p>
<p>Additional lessons that can be learned from HROs (implicit in the
previous discussion):</p>
<ol type="1">
<li>The expectation of surprise is an organizational resource because
it promotes real-time attentiveness and discovery.</li>
<li>Anomalous events should be treated as outcomes rather than
accidents, to encourage search for sources and causes.</li>
<li>Errors should be made as conspicuous as possible to undermine
self-deception and concealment.</li>
<li>Reliability requires diversity, duplication, overlap, and a varied
response repertoire, whereas efficiency requires homogeneity,
specialization, non-redundancy, and standardization.</li>
<li>Interpersonal skills are just as important in HROs as are technical
skills.</li>
</ol>
<h2 class="unnumbered" id="references">References</h2>
<div id="refs" class="references csl-bib-body hanging-indent" data-entry-spacing="0" role="list">
<div id="ref-dietterich2018_robus_artif_intel_robus_human_organ" class="csl-entry" role="listitem">
Dietterich, Thomas G. 2018. <span>“Robust Artificial Intelligence and Robust Human Organizations.”</span> <em>CoRR</em>. <a href="http://arxiv.org/abs/1811.10840">http://arxiv.org/abs/1811.10840</a>.
</div>
<div id="ref-weick1999_organ" class="csl-entry" role="listitem">
Weick, Karl E., Kathleen M. Sutcliffe, and David Obstfeld. 1999. <span>“Organizing for High Reliability: Processes of Collective Mindfulness.”</span> In <em>Research in Organizational Behavior</em>, edited by R. S. Sutton and B. M. Staw, 21:81–123. Research in Organizational Behavior, Vol. 21. Stanford: Elsevier Science/JAI Press. <a href="https://archive.org/details/organizing-for-high-reliability">https://archive.org/details/organizing-for-high-reliability</a>.
</div>
</div>
</section>
]]></description>
    <pubDate>Fri, 03 Jun 2022 00:00:00 UT</pubDate>
    <guid>https://www.lozeve.com/posts/high-reliability-organizations.html</guid>
    <dc:creator>Dimitri Lozeve</dc:creator>
</item>
<item>
    <title>How to train your differentiable filter</title>
    <link>https://www.lozeve.com/posts/how-to-train-your-differentiable-filter.html</link>
    <description><![CDATA[<section>
  <p>This is a short overview of the following paper <span class="citation" data-cites="kloss2021_how">(Kloss, Martius, and Bohg 2021)</span>:</p>
<blockquote>
<p>Kloss, Alina, Georg Martius, and Jeannette Bohg. 2021. “How to Train
Your Differentiable Filter.” <em>Autonomous Robots</em> 45 (4):
561–78. <a href="https://doi.org/10.1007/s10514-021-09990-9">https://doi.org/10.1007/s10514-021-09990-9</a>.</p>
</blockquote>
<h2 id="bayesian-filtering-for-state-estimation">Bayesian filtering for state estimation</h2>
<p>Bayesian filters<span class="sidenote-wrapper"><label for="sn-0" class="margin-toggle">⊕</label><input type="checkbox" id="sn-0" class="margin-toggle" /><span class="marginnote"><span class="citation" data-cites="thrun2006_probab_robot">(Thrun 2006)</span> contains a
great explanation of Bayesian filters (including Kalman and particle
filters), in the context of robotics, which is relevant for this
paper. For a more complete overview of Kalman filters, see
<span class="citation" data-cites="anderson2005_optim_filter">(Anderson and Moore 2005)</span>.<br />
<br />
</span></span> are the standard method for
probabilistic state estimation. Common examples are (extended,
unscented) <a href="https://en.wikipedia.org/wiki/Kalman_filter">Kalman filters</a> and <a href="https://en.wikipedia.org/wiki/Particle_filter">particle filters</a>. These filters require
a <em>process model</em> predicting how the state evolves over time, and an
<em>observation model</em> relating a sensor value to the underlying state.</p>
<p>The objective of a filter for state estimation is to estimate a latent
state <span class="math inline">\(\mathbf{x}\)</span> of a dynamical system at any time step <span class="math inline">\(t\)</span> given an
initial belief <span class="math inline">\(\mathrm{bel}(\mathbf{x}_0) = p(\mathbf{x}_0)\)</span>, a
sequence of observations <span class="math inline">\(\mathbf{z}_{1\ldots t}\)</span>, and controls
<span class="math inline">\(\mathbf{u}_{0\ldots t}\)</span>.</p>
<p>We make the Markov assumption (i.e. given the most recent state,
states and observations are conditionally independent of the earlier
history). The belief can then be computed recursively:</p>
<p><span class="math display">\[
\begin{align*}
\mathrm{bel}(\mathbf{x}_t) &amp;= \eta p(\mathbf{z}_t | \mathbf{x}_t) \int p(\mathbf{x}_t | \mathbf{x}_{t-1}, \mathbf{u}_{t-1}) \mathrm{bel}(\mathbf{x}_{t-1}) d\mathbf{x}_{t-1}\\
&amp;= \eta p(\mathbf{z}_t | \mathbf{x}_t) \overline{\mathrm{bel}}(\mathbf{x}_t),
\end{align*}
\]</span></p>
<p>where <span class="math inline">\(\eta\)</span> is a normalization factor. Computing
<span class="math inline">\(\overline{\mathrm{bel}}(\mathbf{x}_t)\)</span> is the <em>prediction step</em>, and
applying <span class="math inline">\(p(\mathbf{z}_t | \mathbf{x}_t)\)</span> is the <em>update step</em> (or the
<em>observation step</em>).</p>
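<p>As a minimal sketch of these two steps, consider a discrete state
space, where the integral becomes a matrix-vector product. The function
name, the three-state world, and all numerical values below are
illustrative assumptions, not taken from the paper.</p>

```python
import numpy as np

def bayes_filter_step(belief, transition, likelihood):
    """One step of a discrete Bayes filter.

    belief:     shape (n,), bel(x_{t-1})
    transition: shape (n, n), transition[i, j] = p(x_t = i | x_{t-1} = j, u_{t-1})
    likelihood: shape (n,), p(z_t | x_t = i) for the observed z_t
    """
    predicted = transition @ belief            # prediction step
    unnormalized = likelihood * predicted      # update step
    return unnormalized / unnormalized.sum()   # eta is the normalization factor

# Hypothetical world with three states; each column of the transition
# matrix sums to one.
belief = np.array([1.0, 0.0, 0.0])
transition = np.array([
    [0.2, 0.0, 0.8],
    [0.8, 0.2, 0.0],
    [0.0, 0.8, 0.2],
])
likelihood = np.array([0.1, 0.7, 0.2])

posterior = bayes_filter_step(belief, transition, likelihood)
```

<p>In this toy example, the prediction step moves most of the
probability mass to the second state, and the observation reinforces
it.</p>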
<p>We model the dynamics of the system through a process model <span class="math inline">\(f\)</span> and an
observation model <span class="math inline">\(h\)</span>:</p>
<p><span class="math display">\[
\begin{align*}
\mathbf{x}_t &amp;= f(\mathbf{x}_{t-1}, \mathbf{u}_{t-1}, \mathbf{q}_{t-1})\\
\mathbf{z}_t &amp;= h(\mathbf{x}_t, \mathbf{r}_t),
\end{align*}
\]</span>
where <span class="math inline">\(\mathbf{q}\)</span> and <span class="math inline">\(\mathbf{r}\)</span> are random variables representing
process and observation noise, respectively.</p>
<h2 id="differentiable-bayesian-filters">Differentiable Bayesian filters</h2>
<p>These models are often difficult to formulate and specify, especially
when the application has complex dynamics, with complicated noises,
nonlinearities, high-dimensional state or observations, etc.</p>
<p>To improve this situation, the key idea is to <em>learn</em> these complex
dynamics and noise models from data. Instead of spending hours in
front of a blackboard deriving the equations, we could give a simple
model a lot of data and learn the equations from them!</p>
<p>In the case of Bayesian filters, we have to define the process,
observation, and noise models as parameterized functions
(e.g. neural networks), and learn their parameters end-to-end, through
the entire apparatus of the filter. To learn these parameters, we will
use the simplest method: gradient descent. For this, our filter has to
become <em>differentiable</em>.</p>
<p>The paper shows that such <em>differentiable filters</em> (trained
end-to-end) outperform unstructured <a href="https://en.wikipedia.org/wiki/Long_short-term_memory">LSTMs</a>, and outperform standard
filters where the process and observation models are fixed in advance
(i.e. analytically derived or even trained separately in isolation).</p>
<p>In most applications, the process and observation noises are
assumed to be uncorrelated Gaussians, with zero mean and constant
covariance (which is a hyperparameter of the filter). With end-to-end
training, we can learn these parameters (mean and covariance of the
noise), but we can even go further, and use <a href="https://en.wikipedia.org/wiki/Heteroscedasticity">heteroscedastic</a> noise
models. In such models, the noise can depend on the state of the system
and the applied control.</p>
<h2 id="learnable-process-and-observation-models">Learnable process and observation models</h2>
<p>The process model <span class="math inline">\(f\)</span> can be implemented as a simple feed-forward
neural network. Importantly, this NN is trained to output the
<em>difference</em> between the next and the current state (<span class="math inline">\(\mathbf{x}_{t+1} - \mathbf{x}_t\)</span>).
This ensures stable gradients and an easier initialization near the
identity.</p>
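<p>A residual model of this kind could look like the following sketch,
written in plain NumPy rather than in a deep learning framework. The
layer sizes and the zero initialization of the output layer are
assumptions made here to illustrate the initialization near the
identity; they are not the architecture used in the paper.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def init_params(state_dim, control_dim, hidden_dim):
    """Parameters of a one-hidden-layer network (sizes are illustrative)."""
    in_dim = state_dim + control_dim
    return {
        "W1": rng.normal(scale=0.1, size=(hidden_dim, in_dim)),
        "b1": np.zeros(hidden_dim),
        # Assumption: a zero output layer makes the predicted difference 0
        # at initialization, so the model starts as the identity on the state.
        "W2": np.zeros((state_dim, hidden_dim)),
        "b2": np.zeros(state_dim),
    }

def predicted_delta(params, x, u):
    """Network output: an estimate of x_{t+1} - x_t."""
    hidden = np.tanh(params["W1"] @ np.concatenate([x, u]) + params["b1"])
    return params["W2"] @ hidden + params["b2"]

def process_model(params, x, u):
    """Residual model: next state = current state + predicted difference."""
    return x + predicted_delta(params, x, u)

params = init_params(state_dim=2, control_dim=1, hidden_dim=8)
x = np.array([0.5, -1.0])
u = np.array([0.3])
x_next = process_model(params, x, u)  # equal to x before any training
```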
<p>For the observation model, we could do the same and model <span class="math inline">\(h\)</span> as a
generative neural network predicting the output of the
sensors. However, the observation space is often high-dimensional, and
the network is thus difficult to train. Consequently, the authors use
a <em>discriminative</em> neural network to reduce the dimensionality of the
raw sensory output.</p>
<h2 id="learnable-noise-models">Learnable noise models</h2>
<p>In the Gaussian case, we use neural networks to predict the covariance
matrix of the noise processes. To ensure positive-definiteness, the
network predicts an upper-triangular matrix <span class="math inline">\(\mathbf{L}_t\)</span> and the
noise covariance matrix is set to <span class="math inline">\(\mathbf{Q}_t = \mathbf{L}_t \mathbf{L}_t^T\)</span>.</p>
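<p>The construction can be sketched as follows, with a fixed vector
standing in for the output of the network. Exponentiating the diagonal
is an assumption of this sketch (it makes the triangular factor
invertible, hence the product strictly positive definite); the function
name is hypothetical.</p>

```python
import numpy as np

def covariance_from_raw(raw, dim):
    """Build Q = L L^T from an unconstrained vector of dim * (dim + 1) / 2 values."""
    L = np.zeros((dim, dim))
    L[np.triu_indices(dim)] = raw   # fill the upper triangle, row by row
    # Assumption of this sketch: a strictly positive diagonal keeps L
    # invertible, so that Q is positive definite and not only semi-definite.
    diag = np.diag_indices(dim)
    L[diag] = np.exp(L[diag])
    return L @ L.T

raw = np.array([0.1, -0.4, 0.7, -1.2, 0.3, 0.0])  # stand-in for a network output
Q = covariance_from_raw(raw, dim=3)
eigenvalues = np.linalg.eigvalsh(Q)  # all strictly positive
```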
<p>In the heteroscedastic case, the noise covariance is predicted from
the state and the control input.</p>
<h2 id="loss-function">Loss function</h2>
<p>We assume that we have access to the ground-truth trajectory <span class="math inline">\(\mathbf{x}_{1\ldots T}\)</span>.</p>
<p>We can then use the mean squared error (MSE) between the ground truth
and the mean of the belief:</p>
<p><span class="math display">\[ L_\mathrm{MSE} = \frac{1}{T} \sum_{t=1}^T (\mathbf{x}_t - \mathbf{\mu}_t)^T (\mathbf{x}_t - \mathbf{\mu}_t). \]</span></p>
<p>Alternatively, we can compute the negative log-likelihood of the true
state under the belief distribution (represented by a Gaussian of mean
<span class="math inline">\(\mathbf{\mu}_t\)</span> and covariance <span class="math inline">\(\mathbf{\Sigma}_t\)</span>):</p>
<p><span class="math display">\[ L_\mathrm{NLL} = \frac{1}{2T} \sum_{t=1}^T \left[ \log(|\mathbf{\Sigma}_t|) + (\mathbf{x}_t - \mathbf{\mu}_t)^T \mathbf{\Sigma}_t^{-1} (\mathbf{x}_t - \mathbf{\mu}_t) \right]. \]</span></p>
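<p>Both losses are straightforward to compute on a batch of time steps.
The sketch below uses NumPy for readability (an actual training loop
would use a framework with automatic differentiation); the function
names and the toy trajectory are assumptions.</p>

```python
import numpy as np

def mse_loss(x_true, mu):
    """Mean over time steps of the squared Euclidean error."""
    err = x_true - mu                                  # shape (T, n)
    return np.mean(np.sum(err * err, axis=1))

def nll_loss(x_true, mu, sigma):
    """Gaussian negative log-likelihood, up to an additive constant."""
    err = x_true - mu                                  # shape (T, n)
    _, logdet = np.linalg.slogdet(sigma)               # shape (T,)
    # Solve Sigma_t y = err_t instead of inverting Sigma_t explicitly.
    solved = np.linalg.solve(sigma, err[..., None])[..., 0]
    mahalanobis = np.sum(err * solved, axis=1)
    return 0.5 * np.mean(logdet + mahalanobis)

# Toy trajectory with T = 2 time steps and a 2-dimensional state.
x_true = np.array([[1.0, 0.0], [0.0, 1.0]])
mu = np.zeros((2, 2))
sigma = np.stack([np.eye(2), np.eye(2)])

mse = mse_loss(x_true, mu)          # 1.0
nll = nll_loss(x_true, mu, sigma)   # 0.5
```

<p>Unlike the MSE, the likelihood-based loss also depends on the
predicted covariance, so it penalizes a filter that is confidently
wrong.</p>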
<h2 id="implementation-issues">Implementation issues</h2>
<p>We need to implement the filters (<a href="https://en.wikipedia.org/wiki/Extended_Kalman_filter">EKF</a>, <a href="https://en.wikipedia.org/wiki/Kalman_filter#Unscented_Kalman_filter">UKF</a>, <a href="https://en.wikipedia.org/wiki/Particle_filter">PF</a>) in a <a href="https://en.wikipedia.org/wiki/Differentiable_programming">differentiable programming</a> framework. The authors use <a href="https://www.tensorflow.org/">TensorFlow</a>. Their code is
available <a href="https://github.com/akloss/differentiable_filters">on GitHub</a>.</p>
<p>Some filters are easy to implement because they use only
differentiable operations (mostly simple linear algebra). For the EKF,
we also need to compute
Jacobians. This can be done automatically via automatic
differentiation, but the authors have encountered technical
difficulties with this (memory consumption or slow computations), so
they recommend computing Jacobians manually.<span class="sidenote-wrapper"><label for="sn-1" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sn-1" class="margin-toggle" /><span class="sidenote">It is not clear
whether this is a limitation of automatic differentiation, or of their
specific implementation with TensorFlow. Some other projects have
successfully computed Jacobians for EKFs with autodiff libraries, like
<a href="https://github.com/sisl/GaussianFilters.jl">GaussianFilters.jl</a> in Julia.<br />
<br />
</span></span></p>
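<p>To illustrate what computing Jacobians manually involves, here is a
sketch of an EKF prediction step with a hand-derived Jacobian, checked
against finite differences. The unicycle model, the time step, and the
additive noise covariance are all assumptions chosen for the example;
none of them come from the paper.</p>

```python
import numpy as np

DT = 0.1  # time step, illustrative value

def f(x, u):
    """Hypothetical unicycle model: state (px, py, theta), control (v, omega)."""
    px, py, theta = x
    v, omega = u
    return np.array([px + v * np.cos(theta) * DT,
                     py + v * np.sin(theta) * DT,
                     theta + omega * DT])

def jacobian_f(x, u):
    """Hand-derived Jacobian of f with respect to the state x."""
    theta = x[2]
    v = u[0]
    return np.array([[1.0, 0.0, -v * np.sin(theta) * DT],
                     [0.0, 1.0, v * np.cos(theta) * DT],
                     [0.0, 0.0, 1.0]])

def numerical_jacobian(x, u, eps=1e-6):
    """Central finite differences, used only to check the manual derivation."""
    columns = []
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        columns.append((f(x + step, u) - f(x - step, u)) / (2 * eps))
    return np.stack(columns, axis=1)

def ekf_predict(mu, sigma, u, Q):
    """EKF prediction: mean through f, covariance through the Jacobian F.

    Assumes additive process noise with covariance Q.
    """
    F = jacobian_f(mu, u)
    return f(mu, u), F @ sigma @ F.T + Q

mu = np.array([0.0, 0.0, np.pi / 4])
u = np.array([1.0, 0.5])
mu_pred, sigma_pred = ekf_predict(mu, 0.01 * np.eye(3), u, 0.001 * np.eye(3))
```

<p>A numerical check like <code class="verbatim">numerical_jacobian</code> is a cheap
safeguard when Jacobians are written by hand, since a sign error would
otherwise silently degrade the filter.</p>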
<p>The particle filter has a resampling step that is not differentiable:
the gradient cannot be propagated to particles that are not selected
by the sampling step. There are apparently specific resampling
algorithms that help mitigate this issue in practice when training.</p>
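<p>The problem is visible in a sketch of standard multinomial
resampling (shown here as a common baseline, not as the scheme used in
the paper): particles are selected through integer indices, and the
weights are reset to a constant afterwards. The function name and the
toy particles are assumptions.</p>

```python
import numpy as np

def multinomial_resample(particles, weights, rng):
    """Draw N particles with replacement, with probabilities given by the weights."""
    n = len(weights)
    indices = rng.choice(n, size=n, p=weights)  # discrete, non-differentiable choice
    uniform = np.full(n, 1.0 / n)               # weights are reset after resampling
    return particles[indices], uniform, indices

rng = np.random.default_rng(0)
particles = np.array([[0.0], [1.0], [2.0], [3.0]])
weights = np.array([0.7, 0.3, 0.0, 0.0])
resampled, new_weights, indices = multinomial_resample(particles, weights, rng)
```

<p>The last two particles have zero weight and are never selected, and
the new weights no longer depend on the old ones, so no gradient can
flow back through this step.</p>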
<h2 id="conclusions">Conclusions</h2>
<p>Differentiable filters achieve better results with fewer parameters
than unstructured models like LSTMs, especially on complex tasks. The
paper runs extensive experiments on toy models of varying
complexity, although unfortunately no real-world application is shown.</p>
<p>Noise models with full covariance improve the tracking
accuracy. Heteroscedastic noise models improve it even more.</p>
<p>The main issue is to keep the training stable. The authors recommend
the differentiable extended Kalman filter for getting started, as it is
the simplest filter, and is less sensitive to hyperparameter
choices. If the task is strongly non-linear, one should use a
differentiable unscented Kalman filter or a differentiable particle
filter.</p>
<h2 class="unnumbered" id="references">References</h2>
<div id="refs" class="references csl-bib-body hanging-indent" data-entry-spacing="0" role="list">
<div id="ref-anderson2005_optim_filter" class="csl-entry" role="listitem">
Anderson, Brian D. O., and John B. Moore. 2005. <em>Optimal Filtering</em>. Dover Books on Electrical Engineering. Dover Publications.
</div>
<div id="ref-kloss2021_how" class="csl-entry" role="listitem">
Kloss, Alina, Georg Martius, and Jeannette Bohg. 2021. <span>“How to Train Your Differentiable Filter.”</span> <em>Autonomous Robots</em> 45 (4): 561–78. <a href="https://doi.org/10.1007/s10514-021-09990-9">https://doi.org/10.1007/s10514-021-09990-9</a>.
</div>
<div id="ref-thrun2006_probab_robot" class="csl-entry" role="listitem">
Thrun, Sebastian. 2006. <em>Probabilistic Robotics</em>. Cambridge, Massachusetts: The MIT Press. <a href="https://mitpress.mit.edu/books/probabilistic-robotics">https://mitpress.mit.edu/books/probabilistic-robotics</a>.
</div>
</div>
</section>
]]></description>
    <pubDate>Fri, 20 May 2022 00:00:00 UT</pubDate>
    <guid>https://www.lozeve.com/posts/how-to-train-your-differentiable-filter.html</guid>
    <dc:creator>Dimitri Lozeve</dc:creator>
</item>
<item>
    <title>The Dawning of the Age of Stochasticity</title>
    <link>https://www.lozeve.com/posts/dawning-of-the-age-of-stochasticity.html</link>
    <description><![CDATA[<section>
  <blockquote>
<p>Mumford, David. 2000. “The Dawning of the Age of Stochasticity.”
<em>Atti Della Accademia Nazionale Dei Lincei. Classe Di Scienze Fisiche,
Matematiche E Naturali. Rendiconti Lincei. Matematica E Applicazioni</em>
11: 107–25. <a href="http://eudml.org/doc/289648">http://eudml.org/doc/289648</a>.</p>
</blockquote>
<p>This article <span class="citation" data-cites="mumford2000_dawnin_age_stoch">(Mumford 2000)</span> is an interesting call
for new foundations of mathematics based on probability and
statistics. It argues that logic has had its time, and now we should
make random variables a first-class concept, as they would make for
better foundations.</p>
<h2 id="the-taxonomy-of-mathematics">The taxonomy of mathematics</h2>
<p><span class="sidenote-wrapper"><label for="sn-0" class="margin-toggle">⊕</label><input type="checkbox" id="sn-0" class="margin-toggle" /><span class="marginnote">This is probably the best definition of mathematics I have
seen. Before that, the most satisfying definition was “mathematics is
what mathematicians do”. It also raises an interesting question: what
would the study of <em>non-reproducible</em> mental objects be?<br />
<br />
</span></span></p>
<blockquote>
<p>The study of mental objects with reproducible properties is called mathematics.
<span class="citation" data-cites="davis2012_mathem_exper_study_edition">(Davis, Hersh, and Marchisotto 2012)</span></p>
</blockquote>
<p>What are the categories of reproducible mental objects? Mumford
considers the principal sub-fields of mathematics (geometry, analysis,
algebra, logic) and argues that they are indeed rooted in common
mental phenomena.</p>
<p>Of these, logic, and the notion of proposition, with an absolute truth
value attached to it, was made the foundation of all the
others. Mumford’s argument is that instead, the random variable is (or
should be) the “paradigmatic mental object”, on which all others can
be based. People are constantly weighing likelihoods, evaluating
plausibility, and sampling from posterior distributions to refine
estimates.</p>
<p>As such, random variables are rooted in our inspection of our own
mental processes, in the self-conscious analysis of our minds. Compare
to areas of mathematics arising from our experience with the physical
world, through our perception of space (geometry), of forces and
accelerations (analysis), or through composition of actions (algebra).</p>
<p>The paper then gives a quick historical overview of the
principal notions of probability, which mostly mirrors the detailed
historical perspective in <span class="citation" data-cites="hacking2006_emerg_probab">(Hacking 2006)</span>. There is
also a short summary of the work on the foundations of mathematics.</p>
<p>Mumford also claims that although there were many advances in the
foundation of probability (e.g. Galton, Gibbs for statistical physics,
Keynes in economics, Wiener for control theory, Shannon for
information theory), most important statisticians (e.g. R. A. Fisher)
insisted on keeping the scope of statistics fairly limited to
empirical data: the so-called “frequentist” school. (This is a vision
of the whole frequentist vs Bayesian debate that I hadn’t seen
before. The Bayesian school can be seen as the one that claims that
statistical inference can be applied more widely, even to real-life
complex situations and thought processes. From this point of view, the
emergence of the probabilistic method in various areas of science
would be the strongest argument in favour of Bayesianism.)</p>
<h2 id="what-is-a-random-variable">What is a “random variable”?</h2>
<p>Random variables are difficult to define. They are the core concept of
any course in probability or statistics, but their full, rigorous
definition relies on advanced measure theory, often unapproachable to
beginners. Nevertheless, practitioners tend to be productive with
basic introductions to probability and statistics, even without
being able to formulate the explicit definition.</p>
<p>Here, Mumford discusses the various definitions we can apply to the
notion of random variable, from an intuitive and a formal point of
view. The conclusion is essentially that a random variable is a
complex entity that does not easily admit a satisfying definition,
except from a purely formal and axiomatic point of view.</p>
<p>This situation is very similar to the one for the notion of
“set”. Everybody can manipulate them on an intuitive level and grasp
the basic properties, but the specific axioms are hard to grasp, and
no definition is fully satisfying, as the debates on the foundations
of mathematics can attest.</p>
<h2 id="putting-random-variables-into-the-foundations">Putting random variables into the foundations</h2>
<p>The usual way of defining random variables builds them up through the following layers:</p>
<ol type="1">
<li>predicate logic,</li>
<li>sets,</li>
<li>natural numbers,</li>
<li>real numbers,</li>
<li>measures,</li>
<li>random variables.</li>
</ol>
<p>Instead, we could put random variables at the foundations, and define
everything else in terms of that.</p>
<p>There is no complete formulation of such a foundation, nor is it clear
that it is possible. However, to make his case, Mumford presents two
developments. One is from <a href="https://en.wikipedia.org/wiki/Edwin_Thompson_Jaynes">E. T. Jaynes</a>, who has a complete formalism
of Bayesian probability from a notion of “plausibility”. With a few
axioms, we can obtain an isomorphism between an intuitive notion of
plausibility and a true probability function.</p>
<p>The other example is a proof that the continuum hypothesis is false,
using a probabilistic argument, due to Christopher Freiling. This
proof starts from a notion of random variable that is incompatible
with the usual definition in terms of measure theory. However, this
leads Mumford to ask whether a foundation of mathematics based on
such a notion could rid us of “one of the meaningless conundrums
of set theory”.</p>
<h2 id="stochastic-methods-have-invaded-classical-mathematics">Stochastic methods have invaded classical mathematics</h2>
<p>This is probably the most convincing argument to give a greater
importance to probability and statistical methods in the foundations
of mathematics: they tend to be everywhere, and extremely
productive. A prime example is obviously graph theory, where the
“probabilistic method” has had a deep impact, thanks notably to
Erdős. (See <span class="citation" data-cites="alon2016_probab_method">(Alon and Spencer 2016)</span> and <a href="https://www.college-de-france.fr/site/timothy-gowers/index.htm">Timothy Gowers’ lessons at the Collège de France</a><span class="sidenote-wrapper"><label for="sn-1" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sn-1" class="margin-toggle" /><span class="sidenote">In French, but see also <a href="https://www.youtube.com/c/TimothyGowers0">his YouTube channel</a>.<br />
<br />
</span></span> on the probabilistic method for combinatorics and number
theory.) Probabilistic methods also have a huge importance in the
analysis of differential equations, chaos theory, and mathematical
physics in general.</p>
<h2 id="thinking-as-bayesian-inference">Thinking as Bayesian inference</h2>
<p>I think this is not very controversial in cognitive science: we do not
think by composing propositions into syllogisms, but rather by
inferring probabilities of certain statements being true. Mumford
illustrates this very well with an example from Judea Pearl, which
uses graphical models to represent thought processes. There is also a
link with formal definitions of induction, such as PAC learning, which
is very present in machine learning.</p>
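<p>As a toy illustration of this kind of reasoning (this is not Pearl’s
example from the article, and the hypotheses and numbers are made up),
here is how a belief is revised after one piece of evidence:</p>

```python
def bayes_update(prior, likelihoods):
    """Posterior over hypotheses after observing one piece of evidence."""
    unnormalized = {h: prior[h] * likelihoods[h] for h in prior}
    total = sum(unnormalized.values())
    return {h: p / total for h, p in unnormalized.items()}

# Hypothetical numbers: prior belief that it rained, and probability of
# seeing a wet street under each hypothesis.
prior = {"rain": 0.2, "no rain": 0.8}
wet_street = {"rain": 0.9, "no rain": 0.1}

posterior = bayes_update(prior, wet_street)
```

<p>Seeing the wet street raises the plausibility of rain from 0.2 to
about 0.69, without ever assigning an absolute truth value to either
statement.</p>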
<p>I’ll conclude this post by quoting directly the last paragraph of the
article:</p>
<blockquote>
<p>My overall conclusion is that I believe stochastic methods will
transform pure and applied mathematics in the beginning of the third
millennium. Probability and statistics will come to be viewed as the
natural tools to use in mathematical as well as scientific modeling.
The intellectual world as a whole will come to view logic as a
beautiful elegant idealization but to view statistics as the standard
way in which we reason and think.</p>
</blockquote>
<h2 class="unnumbered" id="references">References</h2>
<div id="refs" class="references csl-bib-body hanging-indent" data-entry-spacing="0" role="list">
<div id="ref-alon2016_probab_method" class="csl-entry" role="listitem">
Alon, Noga, and Joel H. Spencer. 2016. <em>The Probabilistic Method</em>. 4th ed. Wiley.
</div>
<div id="ref-davis2012_mathem_exper_study_edition" class="csl-entry" role="listitem">
Davis, Philip J., Reuben Hersh, and Elena Anne Marchisotto. 2012. <em>The Mathematical Experience, Study Edition</em>. Modern Birkh<span>ä</span>user Classics. Birkh<span>ä</span>user Boston. <a href="https://doi.org/10.1007/978-0-8176-8295-8">https://doi.org/10.1007/978-0-8176-8295-8</a>.
</div>
<div id="ref-hacking2006_emerg_probab" class="csl-entry" role="listitem">
Hacking, Ian. 2006. <em>The Emergence of Probability: A Philosophical Study of Early Ideas about Probability, Induction and Statistical Inference</em>. 2nd ed. Cambridge University Press. <a href="https://doi.org/10.1017/CBO9780511817557">https://doi.org/10.1017/CBO9780511817557</a>.
</div>
<div id="ref-mumford2000_dawnin_age_stoch" class="csl-entry" role="listitem">
Mumford, David. 2000. <span>“The Dawning of the Age of Stochasticity.”</span> <em>Atti Della Accademia Nazionale Dei Lincei. Classe Di Scienze Fisiche, Matematiche e Naturali. Rendiconti Lincei. Matematica e Applicazioni</em> 11 (December): 107–25. <a href="http://eudml.org/doc/289648">http://eudml.org/doc/289648</a>.
</div>
</div>
</section>
]]></description>
    <pubDate>Thu, 24 Mar 2022 00:00:00 UT</pubDate>
    <guid>https://www.lozeve.com/posts/dawning-of-the-age-of-stochasticity.html</guid>
    <dc:creator>Dimitri Lozeve</dc:creator>
</item>

    </channel>
</rss>
