Every project, no matter its size, requires some kind of organization and planning. Whether you’re thinking about what to do when you wake up (shower, make breakfast, brush your teeth) or planning a new space programme, you will need to think about the tasks, the order in which to do them, and how long each will take. This is called *scheduling*.

Planning projects requires balancing dependencies between tasks, resource allocation, and complex constraints in order to find a complete and feasible schedule. How much of this can be made rigorous? What is the limit of automation in this scenario?

In this post, I want to explore the problem of planning and scheduling in the specific context of project management. The goal is to set up the problem of project planning rigorously, and investigate what techniques we can apply to have a better understanding of our projects and reach our objectives faster.

The definition of a project here is highly subjective: it has been strongly influenced by what I’ve read (see the references) and by how I actually do things at work. In particular, most of the model and concepts can be found in Microsoft Project.

When starting a new project, I generally follow a rough workflow that goes like this:

- Define the global constraints of the project: functional specification, deadlines, overall resources available, etc.
- Subdivide the project into tasks and subtasks. Low-level tasks should be self-contained and doable by few people (ideally only one). Tasks can then be grouped together to better visualise what is happening at various scales. This gives us a global hierarchy of tasks, culminating in the overall project.
- Specify the dependencies between tasks, ideally with an explicit deliverable for each dependency relationship.
- Estimate the work required for each task.
- Assign a resource to each task, deriving task durations accordingly. For instance, if Bob will be working part-time on this task (because he has other things to do at the same time), the task will take longer to complete than the nominal amount of work that it requires.
- Find an order in which to execute all the tasks, respecting workforce and time constraints (Bob cannot spend 50% of his time on three tasks simultaneously). This is called a *schedule*.
- Iterate on the order until a minimal completion date is found. Generally, the objective is to complete the project as soon as possible, but there may be additional requirements (overall deadline, lateness penalties, maximal resource utilization).
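As a tiny illustration of deriving durations from work and availability, here is a minimal sketch in Python. The helper `task_duration` is hypothetical, not part of any tool mentioned here; it assumes integer calendar days and a fractional allocation.

```python
import math

def task_duration(work_days, allocation):
    """Calendar duration of a task requiring `work_days` of full-time work,
    when the assigned resource spends a fraction `allocation` of their
    time on it (e.g. 0.5 for part-time)."""
    if not 0 < allocation <= 1:
        raise ValueError("allocation must be in (0, 1]")
    # Rounding up: a fraction of a remaining day still occupies that day.
    return math.ceil(work_days / allocation)
```

For instance, 5 days of work assigned to Bob at 50% stretches to 10 calendar days.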

Given this process, a natural question to ask is: how can we simplify it? What can we automate? The obvious candidate is the scheduling part (steps 6 and 7): this step does not require any human decision-making, and achieving optimality by hand would be difficult and tiresome. Most project management software (e.g. Microsoft Project) focuses on this part.

However, in practice, resource allocation is also extremely time-consuming. Most importantly, it constrains the final schedule: a bad allocation can push back the final completion date by a wide margin. Therefore, it makes sense to take both resource allocation and task ordering into account at the same time when looking for an optimal schedule.

Going even further, we could look into subdividing tasks: maybe splitting a task in two, and allowing a small lag between the completion of the first half and the start of the second, could improve the overall objective. By allowing such preemption, we could optimize our schedule further.

To understand all of this, we’ll need to formalize our problem a little bit. This will allow us to position it in the overall schema of problems studied in the operations research literature, and use their conclusions to choose the best approach as a trade-off between manual and automatic scheduling.

A *project* is simply a set of tasks. (A task is often called a *job* or an *activity* in project scheduling; I will use these terms interchangeably.) Each task is a specific action with a certain amount of work that needs to be done. More importantly, a task can *depend* on other tasks: for instance, I can’t send the satellite into space if you haven’t built it yet.

Other *constraints* may also be present: there are nearly always deadlines (my satellite needs to be up and running on 2024-05-12), and sometimes other kinds of temporal constraints (for legal reasons, I can’t start building my death ray before 2023-01-01). Most importantly, there are constraints on resource usage (I need either Alice or Bob to work on each of these tasks, so at most two of them can be in progress at the same time).

Finally, the *objective* is to finish the project (i.e. complete all the tasks) as early as possible. The completion time of the whole project, \(C_{\max} = \max_j C_j\), is called the *makespan*.

You may have noticed a nice pattern here: an objective and constraints make a great optimization problem! As it turns out, scheduling is an entire branch of operations research (see my previous blog post on operations research). In the literature, this kind of problem is referred to as *resource-constrained project scheduling*, or as *project scheduling with workforce constraints*.

There is a lot of room to modify the problem to other settings. Brucker et al. (1999) propose an interesting classification scheme for project scheduling. In this system, any problem can be represented by a triplet \(\alpha|\beta|\gamma\), where \(\alpha\) is the resource environment, \(\beta\) are the activity characteristics, and \(\gamma\) is the objective function.

The *resource environment* \(\alpha\) describes the available quantity of each type of resource. Resources can be renewable, like people, who supply a fixed quantity of work in each time period, or non-renewable, like raw materials.

The *activity characteristics* \(\beta\) describe how tasks are constrained: how the dependencies are specified (with a graph, or with temporal constraints between the starts and ends of different tasks), whether there are global constraints like deadlines, and whether processing times are constant for all tasks, can vary, or even can be stochastic.

Finally, the *objective* \(\gamma\) can be one of several possibilities. The most common are the makespan, which seeks to minimize the total duration of the project, and resource-levelling, which seeks to minimize some measure of variation of resource utilization.

Some important problems (\(\mathrm{PS}\) means “project scheduling” without any restrictions on resources):

- \(\mathrm{PS} \;|\; \mathrm{prec} \;|\; C_{\max}\): the “simple” project scheduling setup, which corresponds to the practical application that interests us here. Although this is the base problem, it is still quite challenging. Removing the resource constraints renders the problem much easier from a computational point of view (Pinedo 2009, chap. 4).
- \(\mathrm{PS} \;|\; \mathrm{temp} \;|\; C_{\max}\): when you add time lag constraints (e.g. two tasks that must start within two days of each other), the problem becomes much more difficult.
- \(\mathrm{PS} \;|\; \mathrm{temp} \;|\; \sum c_{k} f\left(r_{k}(S, t)\right)\): this is the resource-levelling problem: you want to minimize the costs of using an amount \(r_k(S, t)\) of each resource \(k\), when each unit of resource \(k\) costs \(c_k\).

First, we need a way to represent a project. We can use the so-called *job-on-node* format: the nodes represent the tasks in the precedence graph, and arcs represent the dependency relationships between tasks. (There is also a *job-on-arc* format that is apparently widely used, but less practical in most applications.)

This representation leads to a natural algorithm for project scheduling in the absence of any resource constraints. The critical path method (CPM) consists in finding a chain of dependent tasks in the job-on-node graph that are *critical*: their completion time is fixed by their dependencies.

It consists of two procedures: one to determine the earliest possible completion time of each task (the forward procedure), and one to determine the latest possible completion time of each task that does not increase the total project duration (the backward procedure). The tasks for which these two times are equal form the *critical path*. (Note that the critical path is not necessarily unique, and several critical paths may overlap.) Non-critical tasks have a certain amount of *slack*: it is possible to schedule them freely between the two extremities without affecting the makespan.
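The two procedures can be sketched in a few lines of Python (an illustrative implementation, with `durations` mapping each task to its processing time and `predecessors` listing each task’s dependencies; the names are mine, not from the references):

```python
from collections import defaultdict

def critical_path(durations, predecessors):
    """CPM: earliest/latest completion times, the makespan, and the
    critical tasks (those with zero slack)."""
    succ = defaultdict(list)
    for task, preds in predecessors.items():
        for p in preds:
            succ[p].append(task)

    # Forward procedure: earliest completion = own duration plus the
    # latest-finishing predecessor.
    earliest = {}
    def ef(t):
        if t not in earliest:
            earliest[t] = durations[t] + max(
                (ef(p) for p in predecessors.get(t, [])), default=0)
        return earliest[t]
    for t in durations:
        ef(t)
    makespan = max(earliest.values())

    # Backward procedure: latest completion that delays no successor.
    latest = {}
    def lf(t):
        if t not in latest:
            latest[t] = min(
                (lf(s) - durations[s] for s in succ[t]), default=makespan)
        return latest[t]
    for t in durations:
        lf(t)

    critical = [t for t in durations if earliest[t] == latest[t]]
    return makespan, critical
```

On a toy project where task `c` depends on `a` and `b`, and `d` depends on `c`, the slack of each task is simply `latest[t] - earliest[t]`.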

An extension of the critical path method is the program evaluation and review technique (PERT). We still assume unlimited resources, but the processing time of each task is allowed to be a random variable instead of a fixed quantity. The algorithm is amended correspondingly to take into account pessimistic and optimistic estimates of each task’s duration.

These techniques have been widely employed in various industries (Wikipedia tells us that CPM and PERT were partly developed by the US Navy, and applied to several large-scale projects: skyscrapers, aerospace and military projects, the Manhattan Project, etc.), and show that the project scheduling problem without workforce constraints can be solved extremely efficiently. See Pinedo (2009) for more details on these algorithms and some examples.

With resource constraints, the problem becomes much harder to solve. It is not possible to formulate it as a linear program: workforce constraints are intrinsically combinatorial in nature, so the problem is formulated as an integer program. (The full integer program can be found in Pinedo (2009, sec. 4.6).)

The problem is modelled with 0-1 variables \(x_{jt}\) which take the value 1 if job \(j\) is completed exactly at time \(t\), and 0 otherwise. The objective is to minimize the makespan, i.e. the completion time of a dummy job that depends on all other jobs. There are three constraints:

- if job \(j\) is a dependency of job \(k\), the completion time of job \(k\) is larger than the completion time of job \(j\) plus the duration of job \(k\),
- at any given time, we do not exceed the total amount of resources available for each type of resources,
- all jobs are completed at the end of the project.
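Putting the pieces together, the time-indexed model has roughly the following shape (a hedged reconstruction in the spirit of Pinedo’s formulation, with \(p_j\) the duration of job \(j\), \(R_m\) the capacity of resource \(m\), \(r_{jm}\) the amount of resource \(m\) used by job \(j\), and \(n+1\) the dummy final job; details may differ from the exact program in the book):

\[
\begin{aligned}
\min\quad & \sum_t t \, x_{n+1,t} && \text{(completion time of the dummy job)} \\
\text{s.t.}\quad & \sum_t x_{jt} = 1 \quad \forall j && \text{(each job completes exactly once)} \\
& \sum_t t \, x_{kt} \ge \sum_t t \, x_{jt} + p_k \quad \forall j \to k && \text{(precedence)} \\
& \sum_j r_{jm} \sum_{s=t}^{t+p_j-1} x_{js} \le R_m \quad \forall m, t && \text{(resource capacity)} \\
& x_{jt} \in \{0, 1\}
\end{aligned}
\]

The inner sum in the capacity constraint counts job \(j\) as active at time \(t\) whenever its completion falls in \([t, t+p_j-1]\), i.e. while it is being processed.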

This problem quickly becomes challenging from a computational point of view as the number of tasks increases. Variations on the branch and bound method have been developed to solve the resource-constrained project scheduling problem efficiently, and in practice most applications rely on heuristics to approximate the full problem. However, even special cases may be extremely challenging to solve. The project scheduling problem is a generalization of the job shop scheduling problem, which is itself a generalization of the travelling salesman problem: all of these are therefore NP-hard.

See Brucker et al. (1999) for a short survey of algorithms and heuristics, and extensions to the harder problems (multi-mode case, time-cost trade-offs, other objective functions). Pinedo (2016) contains a much more extensive discussion of all kinds of scheduling problems, algorithms, and implementation considerations.

Brucker et al. (1999) is a great survey of the algorithms available for project scheduling. For longer books, Pinedo (2016), Brucker (2007), Conway, Maxwell, and Miller (2003), and Leung (2004) are good references for the theoretical aspects, and Pinedo (2009) and Błażewicz et al. (2001) for applications.

Atabakhsh (1991), Noronha and Sarma (1991), and Smith (1992) contain algorithms that use methods from artificial intelligence to complement the traditional operations research approach.

Let us review our workflow from the beginning. Even for the general case of project scheduling with workforce and temporal constraints, algorithms exist that can automate the entire scheduling problem (except maybe for the largest projects). Additional requirements can easily be encoded with these two types of constraints.

Most tools today seem to rely on a variant of CPM or PERT. (This seems to be the case for Microsoft Project at least. However, I should note that it is an *enormous* piece of software, and I have barely scratched the surface of its capabilities. In particular, it can do much more than project scheduling: there are options for resource levelling and budgeting, along with a lot of visualization and reporting features, such as Gantt charts.) As a result, you still have to allocate resources manually, which can be really time-consuming on large projects: ensuring that each resource is not over-allocated, and finding which task to reschedule while minimizing the impact on the overall project duration, is not obvious at all.
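To make the over-allocation check concrete, here is a minimal sketch in Python (a hypothetical helper, not taken from any tool discussed here): given assignments as `(resource, start, end, units)` tuples over integer time periods, with `end` exclusive, it reports the periods where a resource is booked beyond capacity.

```python
from collections import defaultdict

def overallocations(assignments, capacity=1.0):
    """Return (resource, period, load) triples where the total
    allocation of a resource exceeds its capacity."""
    load = defaultdict(float)
    for resource, start, end, units in assignments:
        for t in range(start, end):
            load[(resource, t)] += units
    # Small tolerance to avoid flagging exact 100% loads as overbooked.
    return sorted((res, t, u) for (res, t), u in load.items()
                  if u > capacity + 1e-9)
```

For instance, booking Bob at 50% on three overlapping tasks flags exactly the periods where the overlap exceeds full time.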

Consequently, a tool that let me choose the level of control I want over resource allocation would be ideal: I could explicitly set the resources used by some tasks, add global limits on which resources are available for the overall project, and the algorithm would do the rest.

We could then focus on automating further, allowing preemption of tasks, time-cost trade-offs, etc. Finding the right abstractions and selecting the best algorithm for each case would be a challenging project, but I think it would be extremely interesting!

Atabakhsh, H. 1991. “A Survey of Constraint Based Scheduling Systems Using an Artificial Intelligence Approach.” *Artificial Intelligence in Engineering* 6 (2): 58–73. https://doi.org/10.1016/0954-1810(91)90001-5.

Brucker, Peter. 2007. *Scheduling Algorithms*. 5th ed. Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-540-69516-5.

Brucker, Peter, Andreas Drexl, Rolf Möhring, Klaus Neumann, and Erwin Pesch. 1999. “Resource-Constrained Project Scheduling: Notation, Classification, Models, and Methods.” *European Journal of Operational Research* 112 (1): 3–41. https://doi.org/10.1016/s0377-2217(98)00204-5.

Błażewicz, Jacek, Klaus H. Ecker, Erwin Pesch, Günter Schmidt, and Jan Węglarz. 2001. *Scheduling Computer and Manufacturing Processes*. 2nd ed. Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-662-04363-9.

Conway, Richard, William L. Maxwell, and Louis W. Miller. 2003. *Theory of Scheduling*. Mineola, N.Y: Dover.

Leung, Joseph. 2004. *Handbook of Scheduling : Algorithms, Models, and Performance Analysis*. Boca Raton: Chapman & Hall/CRC.

Noronha, S. J., and V. V. S. Sarma. 1991. “Knowledge-Based Approaches for Scheduling Problems: A Survey.” *IEEE Transactions on Knowledge and Data Engineering* 3 (2): 160–71. https://doi.org/10.1109/69.87996.

Pinedo, Michael L. 2009. *Planning and Scheduling in Manufacturing and Services*. 2nd ed. Springer Series in Operations Research and Financial Engineering. New York: Springer. https://doi.org/10.1007/978-1-4419-0910-7.

———. 2016. *Scheduling: Theory, Algorithms, and Systems*. 5th ed. Springer International Publishing. https://doi.org/10.1007/978-3-319-26580-3.

Smith, Stephen F. 1992. “Knowledge-Based Production Management Approaches, Results and Prospects.” *Production Planning & Control* 3 (4): 350–80. https://doi.org/10.1080/09537289208919407.

Every month, IBM Research publishes an interesting puzzle on their Ponder This page. Last month’s puzzle was a nice optimization problem about a rover exploring the surface of Mars.

In this post, I will explore how to formulate the problem as a mixed-integer linear program (MILP), and how to solve it with Julia’s JuMP package. (See my previous post for additional background and references on operations research and optimization.)

The surface of Mars is represented as an \(N \times N\) grid, where each cell has a “score” (i.e. a reward for exploring the cell), and a constant exploration cost of 128. The goal is to find the set of cells which maximizes the total score. There is an additional constraint: each cell can only be explored if all its upper neighbors are also explored. The full problem statement is here, along with an example on a small grid.
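To make the neighbor constraint concrete, here is a small feasibility check written in Python (the post itself uses Julia; `feasible` is a hypothetical helper for illustration only):

```python
def feasible(explored, n):
    """Check the exploration rule on an n-by-n grid: a cell (i, j),
    0-based, may be explored only if its in-grid upper neighbors
    (i-1, j-1), (i-1, j), (i-1, j+1) are all explored too."""
    for i, j in explored:
        if i == 0:
            continue  # the first row has no upper neighbors
        for dj in (-1, 0, 1):
            if 0 <= j + dj < n and (i - 1, j + dj) not in explored:
                return False
    return True
```

Any feasible exploration region is therefore a kind of staircase hanging from the top row.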

This problem has a typical structure: we have to choose some variables to maximize a specific quantity, subject to some constraints. Here, JuMP will make it easy for us to formulate and solve this problem, with minimal code.

The grid scores are represented as a 20 × 20 array of hexadecimal numbers:

```
BC E6 56 29 99 95 AE 27 9F 89 88 8F BC B4 2A 71 44 7F AF 96
72 57 13 DD 08 44 9E A0 13 09 3F D5 AA 06 5E DB E1 EF 14 0B
42 B8 F3 8E 58 F0 FA 7F 7C BD FF AF DB D9 13 3E 5D D4 30 FB
60 CA B4 A1 73 E4 31 B5 B3 0C 85 DD 27 42 4F D0 11 09 28 39
1B 40 7C B1 01 79 52 53 65 65 BE 0F 4A 43 CD D7 A6 FE 7F 51
25 AB CC 20 F9 CC 7F 3B 4F 22 9C 72 F5 FE F9 BF A5 58 1F C7
EA B2 E4 F8 72 7B 80 A2 D7 C1 4F 46 D1 5E FA AB 12 40 82 7E
52 BF 4D 37 C6 5F 3D EF 56 11 D2 69 A4 02 0D 58 11 A7 9E 06
F6 B2 60 AF 83 08 4E 11 71 27 60 6F 9E 0A D3 19 20 F6 A3 40
B7 26 1B 3A 18 FE E3 3C FB DA 7E 78 CA 49 F3 FE 14 86 53 E9
1A 19 54 BD 1A 55 20 3B 59 42 8C 07 BA C5 27 A6 31 87 2A E2
36 82 E0 14 B6 09 C9 F5 57 5B 16 1A FA 1C 8A B2 DB F2 41 52
87 AC 9F CC 65 0A 4C 6F 87 FD 30 7D B4 FA CB 6D 03 64 CD 19
DC 22 FB B1 32 98 75 62 EF 1A 14 DC 5E 0A A2 ED 12 B5 CA C0
05 BE F3 1F CB B7 8A 8F 62 BA 11 12 A0 F6 79 FC 4D 97 74 4A
3C B9 0A 92 5E 8A DD A6 09 FF 68 82 F2 EE 9F 17 D2 D5 5C 72
76 CD 8D 05 61 BB 41 94 F9 FD 5C 72 71 21 54 3F 3B 32 E6 8F
45 3F 00 43 BB 07 1D 85 FC E2 24 CE 76 2C 96 40 10 FB 64 88
FB 89 D1 E3 81 0C E1 4C 37 B2 1D 60 40 D1 A5 2D 3B E4 85 87
E5 D7 05 D7 7D 9C C9 F5 70 0B 17 7B EF 18 83 46 79 0D 49 59
```

We can parse it easily with the DelimitedFiles module from Julia’s standard library.

```
using DelimitedFiles

function readgrid(filename)
    open(filename) do f
        parse.(Int, readdlm(f, String); base=16) .- 128
    end
end

grid = readgrid("grid.txt")
```

We now need to define the actual optimization problem. First, we load JuMP and a solver which supports MILP (for instance GLPK).

`using JuMP, GLPK`

Defining a model consists of three stages (check out the JuMP Quick Start Guide for more info):

- declare some variables, their types, and their bounds,
- add some constraints,
- specify an objective.

In our case, we have a single binary variable for each cell, which will be 1 if the cell is explored by the rover and 0 otherwise. After creating the model, we use the `@variable` macro to declare our variable `x` of size `(n, n)`.

```
n = size(grid, 1)
model = Model(GLPK.Optimizer)
@variable(model, x[1:n, 1:n], Bin)
```

The “upper neighbors” of a cell `(i, j)` are `[(i-1, j-1), (i-1, j), (i-1, j+1)]`. Ensuring that a cell is explored only if all of its upper neighbors are also explored means ensuring that `x[i, j]` is 1 only if `x` is also 1 for all the upper neighbors. We also have to check that these neighbors are not outside the grid.

```
for i = 2:n, j = 1:n
    if j > 1
        @constraint(model, x[i, j] <= x[i-1, j-1])
    end
    @constraint(model, x[i, j] <= x[i-1, j])
    if j < n
        @constraint(model, x[i, j] <= x[i-1, j+1])
    end
end
```

Finally, the objective is to maximize the total of all rewards on explored cells:

`@objective(model, Max, sum(grid[i, j] * x[i, j] for i = 1:n, j = 1:n))`

We can now send our model to the solver to be optimized. We retrieve the objective value and the values of our variable `x`, and do some additional processing to get the result in the expected format (0-based indices, while Julia uses 1-based indexing). In practice, you should also check, using `termination_status(model)`, that the solver actually found an optimal solution, didn’t find the model infeasible, and did not run into numerical issues.

```
optimize!(model)
obj = Int(objective_value(model))
indices = Tuple.(findall(value.(x) .> 0))
indices = sort([(a-1, b-1) for (a, b) = indices])
```

The resulting objective value is 1424, and the explored indices are

```
[(0, 0), (0, 1), (0, 2), (0, 3), (0, 4), (0, 5), (0, 6), (0, 7), (0, 8), (0, 9),
(0, 10), (0, 11), (0, 12), (0, 13), (0, 14), (0, 15), (0, 16), (0, 17), (0, 18),
(0, 19), (1, 0), (1, 1), (1, 2), (1, 3), (1, 4), (1, 5), (1, 6), (1, 7), (1, 8),
(1, 9), (1, 10), (1, 11), (1, 12), (1, 13), (1, 14), (1, 15), (1, 16), (1, 17),
(1, 18), (1, 19), (2, 0), (2, 1), (2, 2), (2, 3), (2, 4), (2, 5), (2, 6), (2, 7),
(2, 8), (2, 9), (2, 10), (2, 11), (2, 12), (2, 13), (2, 14), (2, 15), (2, 16),
(2, 17), (2, 18), (2, 19), (3, 0), (3, 1), (3, 2), (3, 3), (3, 4), (3, 5), (3, 6),
(3, 7), (3, 8), (3, 9), (3, 10), (3, 11), (3, 12), (3, 13), (3, 14), (3, 15),
(3, 16), (3, 17), (3, 18), (4, 0), (4, 1), (4, 2), (4, 3), (4, 4), (4, 5), (4, 6),
(4, 7), (4, 8), (4, 9), (4, 10), (4, 11), (4, 12), (4, 13), (4, 14), (4, 15),
(4, 16), (4, 17), (5, 0), (5, 1), (5, 2), (5, 3), (5, 4), (5, 5), (5, 6), (5, 7),
(5, 8), (5, 9), (5, 10), (5, 11), (5, 12), (5, 13), (5, 14), (5, 15), (5, 16),
(6, 0), (6, 1), (6, 2), (6, 3), (6, 4), (6, 5), (6, 6), (6, 7), (6, 8), (6, 9),
(6, 12), (6, 14), (6, 15), (7, 0), (7, 1), (7, 2), (7, 4), (7, 7), (8, 0), (8, 1), (9, 0)]
```

JuMP supports a wide variety of solvers, and this model is quite small so open-source solvers are more than sufficient. However, let’s see how to use the NEOS Server to give this problem to state-of-the-art solvers!

Depending on the solver you plan to use, you will have to submit the problem in a specific format. Looking at the solvers page, we can use the MPS or LP format to target CPLEX or Gurobi, for instance. Luckily, JuMP (or more accurately MathOptInterface) supports these formats (among others).

`write_to_file(model, "rover.lp")  # or "rover.mps"`

We can now upload this file to the NEOS Server, and sure enough, a few seconds later, we get Gurobi’s output:

```
Gurobi Optimizer version 9.1.1 build v9.1.1rc0 (linux64)
Thread count: 32 physical cores, 64 logical processors, using up to 4 threads
Optimize a model with 1102 rows, 400 columns and 2204 nonzeros
Model fingerprint: 0x69169161
Variable types: 0 continuous, 400 integer (400 binary)
Coefficient statistics:
Matrix range [1e+00, 1e+00]
Objective range [1e+00, 1e+02]
Bounds range [1e+00, 1e+00]
RHS range [0e+00, 0e+00]
Found heuristic solution: objective 625.0000000
Presolve removed 116 rows and 45 columns
Presolve time: 0.01s
Presolved: 986 rows, 355 columns, 1972 nonzeros
Variable types: 0 continuous, 355 integer (355 binary)
Root relaxation: objective 1.424000e+03, 123 iterations, 0.00 seconds
Nodes | Current Node | Objective Bounds | Work
Expl Unexpl | Obj Depth IntInf | Incumbent BestBd Gap | It/Node Time
* 0 0 0 1424.0000000 1424.00000 0.00% - 0s
Explored 0 nodes (123 simplex iterations) in 0.01 seconds
Thread count was 4 (of 64 available processors)
Solution count 2: 1424 625
Optimal solution found (tolerance 1.00e-04)
Best objective 1.424000000000e+03, best bound 1.424000000000e+03, gap 0.0000%
********** Begin .sol file *************
# Solution for model obj
# Objective value = 1424 [...]
```

We get the same solution!

My complete solution is available on GitHub.

This is an introduction to Git from a graph theory point of view. In my view, most introductions to Git focus on the actual commands or on Git internals. In my day-to-day work, I realized that I consistently rely on an internal model of the repository as a directed acyclic graph. I also tend to use this point of view when explaining my workflow to other people, with some success. This is definitely not original, many people have said the same thing, to the point that it is a running joke. However, I have not seen a comprehensive introduction to Git from this point of view, without clutter from the Git command line itself.

How to actually use the command line is not the topic of this article; you can refer to the man pages or the excellent *Pro Git* book (see “Further reading” below). I will reference the relevant Git commands as margin notes.

The basic object in Git is the *commit*. It consists of three things: a set of parent commits (at least one, except for the initial commit), a diff representing changes (some lines are removed, some are added), and a commit message. It also has a name, so that we can refer to it if needed. (Actually, each commit gets a SHA-1 hash that identifies it uniquely. The hash is computed from the parents, the message, and the diff.)

A *repository* is fundamentally just a directed acyclic graph (DAG), where nodes are commits and links are parent-child relationships. (You can visualize the graph of a repo, or just a subset of it, using `git log`.) A DAG means that two essential properties are verified at all times by the graph:

- it is *oriented*, and the direction always goes from parent to child,
- it is *acyclic*, otherwise a commit could end up being an ancestor of itself.

As you can see, these make perfect sense in the context of a version-tracking system.

Here is an example of a repo:

In this representation, each commit points to its children, and the commits are organized from left to right as in a timeline. (In the actual implementation, the edges are the other way around: each commit points to its parents. But I feel it is clearer to visualize the graph ordered by time.) The *initial commit* is the first one, the root of the graph, on the far left.

Note that a commit can have multiple children, and multiple parents (we’ll come back to these specific commits later).
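This parent-pointer view of the DAG is easy to play with directly. As a sketch in plain Python (with hypothetical commit names), reachability in the graph gives the common ancestors of two commits, which is essentially what Git needs to find a merge base:

```python
def common_ancestors(parents, a, b):
    """parents maps each commit to the list of its parent commits.
    Return the commits reachable from both a and b (a commit counts
    as its own ancestor)."""
    def ancestors(commit):
        seen, stack = set(), [commit]
        while stack:
            c = stack.pop()
            if c not in seen:
                seen.add(c)
                stack.extend(parents.get(c, []))
        return seen
    return ancestors(a) & ancestors(b)
```

On a small diamond-shaped history (two branches off `c1`, merged in `c4`), the common ancestor of the two branch tips is `c1`.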

The entirety of Git operations can be understood in terms of manipulations of the graph. In the following sections, we’ll list the different actions we can take to modify the graph.

Some commits can be annotated: they can have a named label attached to them that references a specific commit.

For instance, `HEAD` references the current commit: your current position in the graph. This is just a convenient name for the current commit, much like how `.` is a shorthand for the current directory when you’re navigating the filesystem. (To move around the graph, i.e. move the `HEAD` pointer, use `git checkout`. You can give it commit hashes, branch names, tag names, or relative positions like `HEAD~3` for the great-grandparent of the current commit.)

*Branches* are other labels like this. Each of them has a name and acts as a simple pointer to a commit. Once again, this is simply an alias, in order to have meaningful names when navigating the graph.

In this example, we have three branches: `master`, `feature`, and `bugfix`. (Do not name your real branches like this! Find a meaningful name describing what changes you are making.) Note that there is nothing special about the names: we can use any name we want, and the `master` branch is not special in any way.

*Tags* are another kind of label, once again pointing to a particular commit. (Create branches and tags with the appropriately named `git branch` and `git tag`.) The main difference with branches is that branches may move (you can change the commit they point to if you want), whereas tags are fixed forever.

When you make some changes in your files, you record them in the repo by committing them. (To the surprise of absolutely no one, this is done with `git commit`.) This action creates a new commit, whose parent will be the current commit. For instance, in the previous case where you were on `master`, the new repo after committing will be (the new commit is in green):

Two significant things happened here:

- Your position on the graph changed: `HEAD` points to the new commit you just created.
- More importantly, `master` moved as well. This is the main property of branches: instead of being “dumb” labels pointing to commits, they automatically move when you add new commits on top of them. (Note that this is not the case with tags, which always point to the same commit no matter what.)

If you can add commits, you can also remove them (if they don’t have any children, obviously). However, it is often better to add a commit that will *revert* the changes of another commit (i.e. apply the opposite changes). (Create a revert commit with `git revert`, and remove a commit with `git reset` **(destructive!)**.) This way, you keep track of what’s been done to the repository, and you do not lose the reverted changes (should you need to re-apply them in the future).

There is a special type of commit: *merge commits*, which have more than one parent (for example, the fifth commit from the left in the graph above). As can be expected, the command is `git merge`.

At this point, we need to talk about *conflicts*. See *Pro Git*’s chapter on merging and basic conflict resolution for the details on managing conflicts in practice.

Until now, every action was simple: we can move around, add names, and add some changes. But now we are trying to reconcile two different versions into a single one. These two versions can be incompatible, and in this case the merge commit will have to choose which lines of each version to keep. If, however, there is no conflict, the merge commit will be empty: it will have two parents, but will not contain any changes itself.

Until now, all the actions we’ve seen were append-only. We were only adding stuff, and it would be easy to just remove a node from the graph, and to move the various labels accordingly, to return to the previous state.

Sometimes we want to do a more complex manipulation of the graph: moving a commit and all its descendants to another location. This is called a *rebase*, and you can perform it with `git rebase` **(destructive!)**.

In this case, we moved the branch `feature` from its old position (in red) to a new one on top of `master` (in green).

When I say “move the branch `feature`”, I actually mean something slightly different than before. Here, we don’t just move the label `feature`, but also the entire chain of commits starting from the one pointed to by `feature`, up to the common ancestor of `feature` and its base branch (here `master`).

In practice, what we have done is deleted three commits, and added three brand new commits. Git actually helps us here by creating commits with the same changes. Sometimes, it is not possible to apply the same changes exactly, because the base is no longer the same. For instance, if one of the commits changed a line that no longer exists in the new base, there will be a conflict. When rebasing, you may have to resolve these conflicts manually, similarly to a merge.

It is often interesting to rebase before merging, because then we can avoid merge commits entirely. Since `feature` has been rebased on top of `master`, when merging `feature` into `master`, we can just *fast-forward* `master`, in effect just moving the `master` label to where `feature` is. (You can control whether or not `git merge` does a fast-forward with the `--ff-only` and `--no-ff` flags.)
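In graph terms, a fast-forward is possible exactly when the branch being moved points to an ancestor of the target commit. A minimal sketch in Python (hypothetical commit names; `parents` maps each commit to the list of its parents):

```python
def can_fast_forward(parents, base, target):
    """True if base is an ancestor of target in the commit DAG: the
    base label can then simply be moved to target, and no merge
    commit is needed."""
    stack, seen = [target], set()
    while stack:
        c = stack.pop()
        if c == base:
            return True
        if c not in seen:
            seen.add(c)
            stack.extend(parents.get(c, []))
    return False
```

On a linear history `c1 ← c2 ← c3`, moving `c1` forward to `c3` is a fast-forward, but the reverse direction is not.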

Another manipulation that we can do on the graph is *squashing*, i.e. lumping several commits together into a single one. (There is no dedicated `git squash` command: squash commits with an interactive rebase, `git rebase -i` **(destructive!)**.)

Here, the three commits of the `feature` branch have been condensed into a single one. No conflict can happen, but we lose the history of the changes. Squashing may be useful to clean up a complex history.

Squashing and rebasing, taken together, can be extremely powerful tools to entirely rewrite the history of a repo. With them, you can reorder commits, squash them together, move them elsewhere, and so on. However, these commands are also extremely dangerous: since you overwrite the history, there is a lot of potential for conflicts and general mistakes. By contrast, merges are completely safe: even if there are conflicts and you have messed them up, you can always remove the merge commit and go back to the previous state. But when you rebase a set of commits and mess up the conflict resolution, there is no going back: the history has been lost, and you generally cannot recover the original state of the repository.

You can use Git as a simple version tracking system for your own projects, on your own computer. But most of the time, Git is used to collaborate with other people. For this reason, Git has an elaborate system for sharing changes with others. The good news is: everything is still represented in the graph! There is nothing fundamentally different to understand.

When two different people work on the same project, each will have a version of the repository locally. Let’s say that Alice and Bob are both working on our project.

Alice has made a significant improvement to the project, and has created several commits, tracked in the `feature` branch she has created locally. The graph above (after rebasing) represents Alice’s repository. Bob, meanwhile, has the same repository but without the `feature` branch. How can they share their work? Alice can send Bob the commits on `feature` since the common ancestor of `master` and `feature`. Bob will see this branch as part of a *remote* graph, superimposed on his own graph. You can add, remove, rename, and generally manage remotes with `git remote`. To transfer data between you and a remote, use `git fetch`, `git pull` (which fetches and merges into your local branch automatically), and `git push`.
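A minimal end-to-end sketch of this exchange, using a local path as the remote (in real life it would be a URL):

```shell
# Two repositories side by side: Alice's and Bob's.
cd "$(mktemp -d)"
git init -q -b master alice
(cd alice && git config user.email a@example.com && git config user.name alice \
  && echo a > a.txt && git add . && git commit -qm "base" \
  && git checkout -qb feature \
  && echo f > f.txt && git add . && git commit -qm "feature work")

git init -q -b master bob && cd bob
git config user.email b@example.com && git config user.name bob
git remote add alice ../alice
git fetch -q alice                        # download Alice's commits and branches
git log --oneline alice/feature           # just ordinary commits
git checkout -q -b feature alice/feature  # build on top of them locally
```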

The branch name he just got from Alice is prefixed by the name of the remote, in this case `alice`. These are just ordinary commits, and an ordinary branch (i.e. just a label on a specific commit).

Now Bob can see Alice’s work, and has an idea to improve on it. So he wants to make a new commit on top of Alice’s changes. But the `alice/feature` branch is there to track the state of Alice’s repository, so he creates a new branch of his own, named `feature`, where he adds a commit:

Similarly, Alice can now retrieve Bob’s work, and will have a new branch `bob/feature` with the additional commit. If she wants, she can now incorporate the new commit into her own branch `feature`, making her branches `feature` and `bob/feature` identical:

As you can see, sharing work in Git is just a matter of having additional branches that represent the graph of other people. Some branches are shared among different people, and in this case you will have several branches, each prefixed with the name of the remote. Everything is still represented simply in a single graph.

Unfortunately, some things are not captured in the graph directly. Most notably, the staging area (used for selecting changes to commit), stashing, and submodules greatly extend the capabilities of Git beyond simple graph manipulations. You can read about all of these in *Pro Git*.

**Note:** This section is *not* needed to use Git every day, or even to understand the concepts behind it. However, it can quickly show you that the explanations above are not pure abstractions, and are actually represented directly this way.

Let’s dive a little bit into Git’s internal representations to better understand the concepts. The entire Git repository is contained in a `.git` folder.

Inside the `.git` folder, you will find a simple text file called `HEAD`, which contains a reference to a location in the graph. For instance, it could contain `ref: refs/heads/master`. As you can see, `HEAD` really is just a pointer to somewhere called `refs/heads/master`. Let’s look into the `refs` directory to investigate:

```
$ cat refs/heads/master
f19bdc9bf9668363a7be1bb63ff5b9d6bfa965dd
```

This is just a pointer to a specific commit! You can also see that all the other branches are represented the same way. You may have noticed that our graphs above were slightly misleading: `HEAD` does not point directly to a commit, but to a branch, which itself points to a commit. If you make `HEAD` point to a commit directly, this is called a “detached HEAD” state.

Remotes and tags are similar: they live in `refs/remotes` and `refs/tags`.

Commits are stored in the `objects` directory, in subfolders named after the first two characters of their hashes. So the commit above is located at `objects/f1/9bdc9bf9668363a7be1bb63ff5b9d6bfa965dd`. Objects are stored in a compressed binary format, and may later be grouped into *packfiles* for efficiency. But if you inspect a commit (with `git show`), you will see its entire contents (parents, message, diff).
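You can poke at all of this yourself. A short sketch in a throwaway repository (`git cat-file -p` pretty-prints any object given a hash or ref):

```shell
# Create a repo with a single commit, then inspect the plumbing.
cd "$(mktemp -d)" && git init -q -b master demo && cd demo
git config user.email you@example.com && git config user.name you
echo hello > a.txt && git add . && git commit -qm "first commit"

cat .git/HEAD                 # ref: refs/heads/master
cat .git/refs/heads/master    # the hash of the commit we just made
git cat-file -p HEAD          # the raw commit object: tree, author, message
```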

To know more about Git, specifically how to use it in practice, I recommend going through the excellent *Pro Git* book, which covers everything there is to know about the various Git commands and workflows.

The Git man pages (also available via `man` on your system) have a reputation of being hard to read, but once you have understood the concepts behind repos, commits, branches, and remotes, they are an invaluable resource for exploiting the full power of the command line interface and its various commands and options. Of course, you could also use the awesome Magit in Emacs, which greatly facilitates your interactions with Git, with the additional benefit of helping you discover Git’s capabilities.

Finally, if you are interested in the implementation details of Git, you can follow *Write yourself a Git* and implement Git yourself! (This is surprisingly straightforward, and you will end up with a much better understanding of what’s going on.) The chapter on Git in Brown and Wilson (2012) is also excellent.

Brown, Amy, and Greg Wilson. 2012. *The Architecture of Open Source Applications, Volume II*. Creative Commons. https://www.aosabook.org/en/index.html.

This is a short overview of the following paper by Fried et al. (2017):

Fried, Roland, Sermad Abbas, Matthias Borowski, and Michael Imhoff. 2017. “Online Analysis of Medical Time Series.” *Annual Review of Statistics and Its Application* 4 (1): 169–88. https://doi.org/10.1146/annurev-statistics-060116-054148.

Unfortunately, most of the papers from *Annual Reviews* are not open access. I hope the situation will improve in the future, but in the meantime there is Sci-Hub.

As the title suggests, it is a very complete review of statistical models for studying medical time series in an online setting. It appeared in *Annual Reviews*, which publishes very nice reviews of topics in a wide variety of fields.

Since I work on developing algorithms for a medical device, this is particularly relevant for my job!

The goal of online medical time series analysis is to detect relevant patterns, such as trends, trend changes, and abrupt jumps. This is to support online decision support systems.

The paper goes on (in section 5) to explain the motivation for developing robust methods of time series analysis for healthcare applications. (The section explaining the motivation behind the review is at the end of the paper. I find it strange to go straight to the detailed exposition of complex statistical methods without explaining the context, medical time series and devices, in more detail.)

An important issue in clinical applications is the false positive rate:

> Excessive rates of false positive alarms—in some studies more than 90% of all alarms—lead to alarm overload and eventually desensitization of caregivers, which may ultimately jeopardize patient safety.

There are two kinds of medical devices: clinical decision support and closed-loop controllers. *Decision support* aims to provide the physician with recommendations to provide the best care to the patient. The goal of the medical device and system is to go from raw, low-level measurements to “high-level qualitative principles”, on which medical reasoning is directly possible. This is the motivation behind a need for abstraction, compression of information, and interpretability.

The other kind of medical device is *physiologic closed-loop controllers* (PCLC). In this case, the patient is in the loop, and the device can take action directly based on the feedback from its measurements. Since there is no direct supervision by medical practitioners, a lot more caution has to be applied. Moreover, these devices generally work in hard real-time environments, making online functioning an absolute requirement.

The objective here is to recover the time-varying level underlying the data, which contains the true information about the patient’s state.

We assume that the time series \(y_1, \ldots, y_N\) is generated by an additive model

\[ y_t = \mu_t + \epsilon_t + \eta_t,\qquad t=1,\ldots,N, \]

where \(\mu_t\) represents the signal value, \(\epsilon_t\) is a noise variable, and \(\eta_t\) is an outlier variable, which is zero most of the time, but can take large absolute values at random times.

The paper reviews many methods for recovering the underlying signal via state estimation. Moving window techniques start from a simple running median and go through successive iterations to improve the properties of the estimator. Each time, we can estimate the mean of the signal and the variance.
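To make the simplest of these concrete (the notation is mine, following the additive model above), the running median over a centered window of width \(2k+1\) estimates the level as

\[ \hat{\mu}_t = \operatorname{med}\left(y_{t-k}, \ldots, y_{t+k}\right). \]

Up to \(k\) gross outliers in the window cannot drag the estimate arbitrarily far, which is exactly the robustness sought here; the price is a delay of \(k\) time steps, which matters in an online setting.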

Going further, regression-based filtering provides an interesting approach to estimate locally the slope and the level of the time series. Of these, repeated median (RM) regression offers a good compromise between robustness and efficiency under Gaussian noise.

Without using moving windows, Kalman filters can also reconstruct the signal by including in their state a steady state, a level shift, a slope change, and outliers. However, it is often difficult to specify the error structure. (I already talked about Kalman filters when I briefly mentioned applications in my post on quaternions.)

Instead of trying to recover the underlying signal, we can try to detect directly some events: level shifts, trend changes, volatility changes.

This is generally based on autoregressive modelling, which works better if we can use a small time delay for the detection.

All the techniques discussed above were designed with a single time series in mind. However, in most real-world applications, you measure several variables simultaneously. Applying the same analyses on multivariate time series can be challenging. Moreover, if the dimension is high enough, it becomes too difficult for a physician to understand it and make decisions. It is therefore very important to have methods to extract the most pertinent and important information from the time series.

The idea is to apply dimensionality reduction to the multivariate time series in order to extract meaningful information. Principal component analysis is too static, so dynamic versions are needed to exploit the temporal structure. This leads to optimal linear double-infinite filters that

> explore the dependencies between observations at different time lags and compress the information in a multivariate time series more efficiently than ordinary (static) principal component analysis.

Graphical models can also be combined with dimensionality reduction to ensure that the compressed variables contain information about the patient’s state that is understandable to physicians.

Finally, one can also use clustering to group time series according to their trend behaviour.

To summarize, here are the key points studied in the paper.

Context: We have continuous measurements of physiological or biochemical variables. These are acquired from medical devices interacting with the patient, and processed by our medical system. The system, in turn, should either help the physician in her decision-making, or directly take action (in the case of a closed-loop controller).

There are several issues with the basic approach:

- Measurements are noisy and contaminated by measurement artefacts that impact the ability to make decisions based on the measurements.
- We often measure a multitude of variables, which means a lot of complexity.

The article reviews methods to mitigate these issues: extracting the true signal, detecting significant events, and reducing complexity to extract clinically relevant information.

The final part of the conclusion is a very good summary of the challenges we face when working with medical devices and algorithms:

> Addressing the challenges of robust signal extraction and complexity reduction requires:
>
> - Deep understanding of the clinical problem to be solved,
> - Deep understanding of the statistical algorithms,
> - Clear identification of algorithmic problems and goals,
> - Capabilities and expertise to develop new algorithms,
> - Understanding of the respective medical device(s) and the development environment,
> - Acquisition of clinical data that is sufficient to support development and validation of new algorithms.
>
> The multitude of resulting requirements cannot be addressed by one profession alone. Rather, close cooperation between statisticians, engineers, and clinicians is essential for the successful development of medical devices embedding advanced statistical algorithms. Moreover, regulatory requirements have to be considered early on when developing algorithms and implementing them in medical devices. The overarching goal is to help make patient care more efficient and safer.

The complex interplay between mathematical, technical, clinical, and regulatory requirements, and the need to interact with experts in all these fields, are indeed what makes my job so interesting!

I didn’t include references to the methods I mention in this post, since the paper itself contains a lot of citations to the relevant literature.


The phrase “for fun and profit” seems to be a pretty old expression: according to the answers to this StackExchange question, it might date back to Horace’s *Ars Poetica* (“prodesse et delectare”). I like the idea that books (and ideas!) should be both instructive and enjoyable…

While exploring quaternions and the theory behind them, I noticed an interesting pattern: in the exposition of Solà (2017), quaternions and rotation matrices had exactly the same properties, and the derivation of these properties was rigorously identical (bar some minor notation changes).

This is expected because in this specific case, these are just two representations of the same underlying object: rotations. However, from a purely mathematical and abstract point of view, it cannot be a coincidence that you can imbue two different types of objects with exactly the same properties.

Indeed, this is not a coincidence: the important structure that is common to the set of rotation matrices and to the set of quaternions is that of a *Lie group*.

In this post, I want to explain why I find Lie theory interesting, both in its theoretical aspects (for fun) and in its potential for real-world application (for profit). I will also give a minimal set of references that I used to get started.

From a mathematical point of view, seeing a common structure in different objects, such as quaternions and rotation matrices, should raise alarm signals in our heads. Is there a deeper concept at play here? If we can find that two objects are two examples of the same abstract structure, maybe we’ll also be able to identify that structure elsewhere, maybe where it’s less obvious. And then, if we prove interesting theorems on the abstract structure, we’ll essentially get the same theorems on every example of this structure, and *for free!* (i.e. without any additional work!) When you push that idea to its extremes, you get category theory, which is just the study of (abstract) structure. This is a fun rabbit hole to get into, and if you’re interested, I recommend the amazing math3ma blog, or Riehl (2017) for a complete and approachable treatment.

We can think of it as a kind of factorization: instead of doing the same thing over and over, we can basically do it *once* and recall the general result whenever it is needed, as one would define a function and call it later in a piece of software.

In this case, Lie theory provides a general framework for manipulating objects that we want to *combine* and on which we’d like to compute *derivatives*. Differentiability is an essentially linear property, in the sense that it works best in vector spaces. Indeed, think of what you do with a derivative: you want to *add* it to other stuff to represent increase rates or uncertainties. (And of course, the differential operator itself is linear.)

Once you can differentiate, a whole new world opens: optimization becomes easier (because you can use gradient descent), you can have random variables, and so on. This is why a lot of programming languages now try to make differentiability a first-class concept. The ability to differentiate arbitrary programs is a huge bonus for all kinds of operations common in scientific computing. Pioneering advances were made in deep learning libraries, such as TensorFlow and PyTorch; but recent advances are even more exciting. JAX is basically a differentiable NumPy, and Julia has always made differentiable programming a priority, via projects such as JuliaDiff and Zygote.

In the case of quaternions, we can define explicitly a differentiation operator, and prove that it has all the nice properties that we come to expect from derivatives. Wouldn’t it be nice if we could have all of this automatically? Lie theory gives us the general framework in which we can imbue non-“linear” objects with differentiability.

Continuing on the example of rotations, what common properties can we identify?

- Quaternions and rotation matrices can be multiplied together (to compose rotations); this multiplication has an identity element, along with other nice properties.
- Quaternions and rotation matrices can be differentiated, and we can map them to and from usual vectors in \(\mathbb{R}^m\).

These two groups of properties actually correspond to common mathematical structures: a *group* and a *differentiable manifold*.

You’re probably already familiar with groups, but let’s recall the basic properties:

- It’s a set \(G\) equipped with a binary operation \(\cdot\).
- The group is closed under the operation: for any elements \(x, y\) in \(G\), \(x \cdot y\) is always in \(G\).
- The operation is associative: \(x \cdot (y \cdot z) = (x \cdot y) \cdot z\).
- There is a special element \(e\) of \(G\) (called the *identity element*), such that \(x \cdot e = e \cdot x = x\) for all \(x \in G\).
- For every element \(x\) of \(G\), there is a unique element of \(G\), denoted \(x^{-1}\), such that \(x \cdot x^{-1} = x^{-1} \cdot x = e\).

A differentiable manifold is a more complex beast. For a more complete introduction to differential geometry and differentiable manifolds, see Lafontaine (2015). It introduces manifolds, differential topology, Lie groups, and more advanced topics, all with few prerequisites (basics of differential calculus).

Although the definition is more complex, we can loosely imagine it as a surface (in higher dimension) on which we can compute derivatives at every point. This means that there is a tangent hyperplane at each point, which is a nice vector space where our derivatives will live.

You can think of the manifold as a tablecloth that has a weird shape, all kinds of curvatures, but no edges or spikes. The idea here is that we can define *charts*, i.e. local approximations of the manifold as a plane; a collection of charts covering the whole manifold is called an *atlas*. The names are telling: charts play the exact same role as geographical maps. The Earth is not flat, it is a sphere with all kinds of deformations (mountains, canyons, oceans), but we can have planar maps that represent a small area with very good precision. Similarly, charts provide the best linear approximation of a small region around a point on the manifold.

So we know what a group and a differentiable manifold are. As it turns out, that’s all we need to know! What we have defined so far is a *Lie group*, i.e. a group that is also a differentiable manifold. The tangent vector space at the identity element is called the *Lie algebra*. (Lie theory is named after Sophus Lie, a Norwegian mathematician; as such, “Lie” is pronounced *lee*. Lie was inspired by Galois’ work on algebraic equations, and wanted to establish a similar general theory for differential equations.)

To take the example of rotation matrices:

- We can combine them (i.e. by matrix multiplication): they form a group.
- If we have a function \(R : \mathbb{R} \rightarrow \mathrm{GL}_3(\mathbb{R})\) defining a trajectory (e.g. the successive attitudes of an object in space), we can find derivatives of this trajectory! They would represent instantaneous orientation changes, or angular velocities.

For a complete overview of Lie theory, there is a lot of interesting material that you can find online. There is also a chapter on Lie theory in the amazing *Princeton Companion to Mathematics* (Gowers, Barrow-Green, and Leader 2010, sec. II.48).

I especially recommend the tutorial by Solà, Deray, and Atchuthan (2018): just enough maths to understand what is going on, but without losing track of the applications. There is also a video tutorial made for the IROS 2020 conference (more specifically, for the workshop on Bringing geometric methods to robot learning, optimization and control). For a more complete treatment, Stillwell (2008) is great. (I really like John Stillwell as a textbook author. All his books are extremely clear and a pleasure to read.)

Because of the group structure, the manifold is similar at every point: in particular, all the tangent spaces look alike. This is why the *Lie algebra*, the tangent space at the identity, is so important. All tangent spaces are vector spaces isomorphic to the Lie algebra, therefore studying the Lie algebra is sufficient to derive all the interesting aspects of the Lie group.

Lie algebras are always vector spaces. Even though their elements may have a complex definition (e.g. skew-symmetric matrices in the case of the group of rotation matrices), we can always find an isomorphism of vector spaces between the Lie algebra and \(\mathbb{R}^m\) (in the case of finite-dimensional Lie groups). This is really nice for many applications: for instance, the usual probability distributions on \(\mathbb{R}^m\) translate directly to the Lie algebra.

Skew-symmetric matrices are matrices \(A\) such that \(A^\top = -A\); in dimension 3, they are of the form

\[ [\boldsymbol\omega]_\times = \begin{bmatrix}
0 & -\omega_z & \omega_y \\
\omega_z & 0 & -\omega_x \\
-\omega_y & \omega_x & 0
\end{bmatrix}. \]

The final aspect I’ll mention is the existence of *exponential maps*, allowing transferring elements of the Lie algebra to the Lie group. The operator \(\exp\) will wrap an element of the Lie algebra (i.e. a tangent vector) to its corresponding element of the Lie group by wrapping along a geodesic of the manifold. There is also a logarithmic map providing the inverse operation.
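For rotation matrices, the exponential map even has a closed form: the classical Rodrigues rotation formula (a standard result, stated here in the skew-symmetric notation \([\cdot]_\times\) used above). Writing the tangent vector as \(\boldsymbol\omega = \theta \mathbf{u}\) with \(\|\mathbf{u}\| = 1\),

\[ \exp([\boldsymbol\omega]_\times) = \mathbf{I} + \sin\theta \, [\mathbf{u}]_\times + (1 - \cos\theta) \, [\mathbf{u}]_\times^2, \]

which is precisely the rotation matrix of angle \(\theta\) around the axis \(\mathbf{u}\).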

The Lie group (in blue) with its associated Lie algebra (red). We can see how each element of the Lie algebra is wrapped on the manifold via the exponential map. Figure from Solà, Deray, and Atchuthan (2018).

If all this piqued your interest, you can read a very short (only 14 pages!) overview of Lie theory in Solà, Deray, and Atchuthan (2018). They also expand on applications to estimation and robotics (as the title suggests), so they focus on deriving Jacobians and other essential tools for any Lie group. They also give very detailed examples of common Lie groups (complex numbers, rotation matrices, quaternions, translations).

Lie theory is useful because it gives strong theoretical guarantees whenever we need to linearize something. If you have a system evolving on a complex geometric structure (for example, the space of rotations, which is definitely not linear), but you need to use a linear operation (if you need uncertainties, or you have differential equations), you have to approximate somehow. Using the Lie structure of the underlying space, you immediately get a principled way of defining derivatives, random variables, and so on.

Therefore, for estimation problems, Lie theory provides a strong backdrop to define state spaces, in which all the usual manipulations are possible. It has thus seen a spike of interest in the robotics literature, with applications to estimation, optimal control, general optimization, and many other fields.

I hope that this quick introduction has motivated you to learn more about Lie theory, as it is a fascinating topic with a lot of potential!

Gowers, Timothy, June Barrow-Green, and Imre Leader. 2010. *The Princeton Companion to Mathematics*. Princeton University Press.

Lafontaine, Jacques. 2015. *An Introduction to Differential Manifolds*. Springer International Publishing. https://doi.org/10.1007/978-3-319-20735-3.

Riehl, Emily. 2017. *Category Theory in Context*. Dover Publications.

Solà, Joan. 2017. “Quaternion Kinematics for the Error-State Kalman Filter.” *CoRR*. http://arxiv.org/abs/1711.02508v1.

Solà, Joan, Jeremie Deray, and Dinesh Atchuthan. 2018. “A Micro Lie Theory for State Estimation in Robotics.” *CoRR*. http://arxiv.org/abs/1812.01537v7.

Stillwell, John. 2008. *Naive Lie Theory*. Undergraduate Texts in Mathematics. Springer New York. https://doi.org/10.1007/978-0-387-78214-0.

Quaternions come from the quest to find more numbers. A bunch of mathematicians from the 19th century were so impressed by the unreasonable effectiveness of complex numbers that they wondered whether the trick could be extended to other structures, notably other \(n\)-tuples of real numbers.

As it turns out, this only works for some values of \(n\), namely 2, 4, and 8. There are profound reasons for this, with connections to various areas of mathematics. If you want to learn more about the historical background of hypercomplex numbers and their properties, I cannot recommend Stillwell’s *Mathematics and Its History* enough (Stillwell 2010).

Here is a quick recap, in case you’d like a quick overview of all the stuff that could reasonably be considered “numbers”.

*Complex numbers* are of the form \(z = a + ib\), with \(a\) and \(b\) real numbers, and \(i\) the *imaginary unit*, defined such that \(i^2 = -1\). They form a *field*, and have all the nice properties that we can expect of well-behaved numbers, such as:

- Associativity: \((xy)z = x(yz)\)
- Commutativity: \(xy = yx\)

for all complex numbers \(x\), \(y\), and \(z\).

*Quaternions* are an extension of complex numbers, but in *four* dimensions instead of just two. They are of the form \(\mathbf{q} = a + ib + jc + kd\), where \(i\), \(j\), and \(k\) are the imaginary units. They follow a bunch of rules: \(i^2 = j^2 = k^2 = ijk = -1\), and \[ ij = -ji = k,\quad jk = -kj = i,\quad \text{and } ki = -ik = j. \]

One thing we can notice straightaway: quaternions are *not commutative!* That is, in general, \(\mathbf{q_1} \mathbf{q_2} \neq
\mathbf{q_2} \mathbf{q_1}\). However, like the complex numbers, they are associative.

Finally, *octonions* are the last members of the club. They can be described with 8 real components. Their imaginary units behave similarly to the quaternion imaginary units, with more complicated rules. And furthermore, they are neither commutative nor even associative! (At this point, one wonders whether they deserve the title of number at all.)

And *that’s it!* There is something really strange happening here: we can define something looking like numbers only for dimensions 2, 4, and 8. Not for 3, not for 6, and not for 17. Moreover, we’re losing important properties along the way. Starting from the full, complete, *perfect* structure of the complex numbers, we gradually lose the things that make working with numbers easy and intuitive. (Yes, as many authors have pointed out, complex numbers are actually the most “complete” numbers. They have all the interesting properties, and fill in the gaps where so-called “real” numbers fail: any polynomial with complex coefficients always has the correct number of roots in the complex numbers.)

So we can build these kinds of 4- or 8-dimensional numbers. That is fascinating in a way, even if only to answer the more philosophical question of what numbers are, and what their properties mean. But as it turns out, quaternions also have direct applications, notably in Physics.

Indeed, quaternions are extremely useful for representing *rotations*. There are many different ways to represent rotations:

- Euler angles are arguably the most intuitive: it’s just three angles representing yaw, pitch, and roll.
- Rotation matrices are the most natural from a mathematical point of view. Rotations are invertible linear transformations in 3D space, after all, so they should be represented by matrices in \(\mathrm{GL}_3(\mathbb{R})\).

However, both of these representations suffer from serious drawbacks. Euler angles are nice for intuitively understanding what’s going on, and for visualisation. However, even simple operations like applying a rotation to a vector, or composing two rotations, quickly lead to a lot of messy and unnatural computations. As soon as rotations become larger than \(\pi\), you get positions that are ill-defined (and you have to implement a lot of wraparound in \([-\pi,\pi]\) or \([0,2\pi]\)).

Rotation matrices are more straightforward. Composing two rotations is a matrix-matrix multiplication, and rotating a vector a matrix-vector multiplication. However, a \(3\times 3\) matrix contains 9 scalar elements, just to represent an object that is intrinsically 3-dimensional. Because of this, computations can become costly, or even unstable in floating-point representations.

The quaternion is therefore an elegant compromise in space requirements (4 elements) and ease of computations. It is moreover easy and numerically stable to get rotation matrices from quaternions and vice-versa.

From a numerical stability point of view, it is easier to deal with computation errors in the case of quaternions: a renormalized quaternion is always a valid rotation, while it is difficult to correct a matrix that is not orthogonal any more due to rounding errors.

Rotations are represented by unit quaternions, i.e. quaternions of norm 1 (for a complete derivation of quaternion properties and rotation representation, see Solà (2017)):

\[ \mathbf{q} = \exp\left((u_x i + u_y j + u_z k) \frac{\theta}{2}\right). \]

This represents a rotation of angle \(\theta\) around the vector \(\mathbf{u} = [u_x\; u_y\; u_z]^T\). The unit quaternion \(\mathbf{q}\) can also be written as

\[ \mathbf{q} = \cos\left(\frac{\theta}{2}\right) + \mathbf{u}\sin\left(\frac{\theta}{2}\right). \]

Composition of rotations is simply quaternion multiplication. To orient a vector \(\mathbf{x}\) by applying a rotation, we conjugate it by our quaternion \(\mathbf{q}\) (with \(\mathbf{q}^*\) the conjugate of \(\mathbf{q}\), obtained by negating its imaginary part):

\[ \mathbf{x'} = \mathbf{q} \mathbf{x} \mathbf{q}^*, \]

where all products are quaternion multiplications, and the vectors \(\mathbf{x}\) and \(\mathbf{x'}\) are represented as *pure quaternions*, i.e. quaternions whose real part is zero:

\[ \mathbf{x} = x_1 i + x_2 j + x_3 k. \]
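As a worked example (mine, not taken from the references), consider a rotation of \(\theta = \pi/2\) around the \(z\)-axis, i.e. \(\mathbf{u} = [0\; 0\; 1]^T\), so that \(\mathbf{q} = \cos(\pi/4) + k \sin(\pi/4) = \frac{\sqrt{2}}{2}(1 + k)\), with conjugate \(\mathbf{q}^* = \frac{\sqrt{2}}{2}(1 - k)\). Rotating the unit vector along \(x\), represented by the pure quaternion \(\mathbf{x} = i\):

\[ \mathbf{q}\, i\, \mathbf{q}^* = \tfrac{1}{2}\, (1 + k)\, i\, (1 - k) = \tfrac{1}{2}\, (i + j)(1 - k) = \tfrac{1}{2}\, (i - ik + j - jk) = j, \]

using \(ki = j\), \(ik = -j\), and \(jk = i\). The \(x\)-axis is mapped to the \(y\)-axis: a quarter turn around \(z\), as expected.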

The fact that quaternions are not commutative also corresponds directly to the fact that rotations themselves are not commutative.

People use various quaternion conventions, depending on their choice of multiplication formula (\(ij = -k\) or \(ij = k\)) and on their choice of representation (real part first or real part last). The case \(ij = k\) with real part first, used in this article, is called the *Hamilton convention*, whereas the convention where \(ij = -k\) and the real part is last is called the *JPL convention*. (As the name suggests, the JPL convention is used mostly by NASA’s Jet Propulsion Laboratory, and by extension in aerospace applications. The Hamilton convention is more frequent in robotics and in the state estimation literature, such as Kalman filtering (Solà 2017).)

As always, it is important to clearly define ahead of time what convention is used in your projects, especially if you’re drawing inspiration from books and articles. Check that they all use the same conventions, or hard-to-debug issues may arise!

Wikipedia and especially Solà (2017) contain a very useful reference of the various possible conventions, and where they are used.

Quaternions are often the best choice whenever rotation or attitude representations are required. This includes robotics, aerospace engineering, 3D graphics, video games, and so on.

They are of particular use in optimal control or state estimation scenarios: they are often the representation of choice for the attitude of an object in a Kalman filter for instance. For a nice introduction to Kalman filters, see this blog post or the introductory article by Welch and Bishop (2006).

When working with quaternions, it may be tiresome to reimplement all the basic functions you might need (composition, conjugation, conversions to and from rotation matrices and Euler angles, and so on). For an overview of efficient floating-point algorithms for manipulating quaternions, see Joldeş and Muller (2020).

Thankfully, quaternions are a very standard part of engineers’ toolboxes, so many libraries have been written for a variety of scientific programming languages.

For Julia (easily the best programming language for this kind of application in my opinion):

- Quaternions.jl, for basic quaternion representation and manipulation,
- Rotations.jl, for representing rotations and operating on them more generally,
- CoordinateTransformations.jl, for a general framework not limited to rotations.

In Python:

- scipy.spatial.transform.Rotation, which has the benefit of being included in SciPy directly,
- numpy-quaternion, for a more feature-complete implementation.

And if you have to work in Matlab, unfortunately the functionality is locked away in the Robotics System Toolbox, or in the Aerospace Toolbox. Use open source software if you can!

As it turns out, quaternions are even more interesting than expected. Not satisfied with representing rotations efficiently, they also have the structure of a Lie group. A Lie group combines the properties of a group and of a differentiable manifold, making it possible to compute derivatives of quaternion-valued functions and to solve differential equations involving them. Rotation matrices also have a Lie group structure.

Update: I wrote a detailed post on Lie theory!

This is obviously extremely interesting when studying dynamical systems, as these are often modelled as systems of differential equations. Having a way to rigorously define derivatives and uncertainties on quaternions is a very significant result.

Joldeş, M., and J. -M. Muller. 2020. “Algorithms for Manipulating Quaternions in Floating-Point Arithmetic.” In *2020 IEEE 27th Symposium on Computer Arithmetic (ARITH)*, 48–55. https://doi.org/10.1109/ARITH48897.2020.00016.

Solà, Joan. 2017. “Quaternion Kinematics for the Error-State Kalman Filter.” *CoRR*. http://arxiv.org/abs/1711.02508v1.

Stillwell, John. 2010. *Mathematics and Its History*. Undergraduate Texts in Mathematics. Springer. https://doi.org/10.1007/978-1-4419-6053-5.

Welch, Greg, and Gary Bishop. 2006. “An Introduction to the Kalman Filter.” Technical Report TR 95-041, University of North Carolina at Chapel Hill.

After Phase I, here are my solutions to Phase II problems. The full code is included in the post, but everything is also available on GitHub.

A PDF of the problems descriptions is available on the competition website, or directly from my GitHub repo.

The submission guidelines gave a template where everything is defined in a `Contest2020.Problems` namespace. I kept the default values for `⎕IO` and `⎕ML` because the problems were not particularly easier with `⎕IO←0`.

```
:Namespace Contest2020
    :Namespace Problems
        (⎕IO ⎕ML ⎕WX)←1 1 3
```

```
∇ score←dd DiveScore scores
  :If 7=≢scores
      scores←scores[¯2↓2↓⍋scores]
  :ElseIf 5=≢scores
      scores←scores[¯1↓1↓⍋scores]
  :Else
      scores←scores
  :EndIf
  score←2(⍎⍕)dd×+/scores
∇
```

This is a very straightforward implementation of the algorithm described in the problem description. I decided to switch explicitly on the size of the input vector because I feel it is more natural. For the cases with 5 or 7 judges, we use Drop (`↓`) to remove the lowest and highest scores.

At the end, we sum up the scores with `+/` and multiply them by `dd`. The last operation, `2(⍎⍕)`, is a train using Format (Dyadic) to round to 2 decimal places, and Execute to get actual numbers and not strings.
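The same algorithm, sketched in Python for comparison (a hypothetical transcription, not part of the submission; `dd` is the degree of difficulty):

```python
def dive_score(dd, scores):
    # With 7 judges, drop the two lowest and two highest scores;
    # with 5 judges, drop the lowest and the highest.
    s = sorted(scores)
    if len(s) == 7:
        s = s[2:-2]
    elif len(s) == 5:
        s = s[1:-1]
    return round(dd * sum(s), 2)

print(dive_score(3.2, [7.5, 8.0, 8.0, 8.5, 7.0]))  # 75.2
```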

```
∇ steps←{p}Steps fromTo;segments;width
  width←|-/fromTo
  :If 0=⎕NC'p' ⍝ No left argument: same as Problem 5 of Phase I
      segments←0,⍳width
  :ElseIf p<0 ⍝ -⌊p is the number of equally-sized steps to take
      segments←(-⌊p){0,⍵×⍺÷⍨⍳⍺}width
  :ElseIf p>0 ⍝ p is the step size
      segments←p{⍵⌊⍺×0,⍳⌈⍵÷⍺}width
  :ElseIf p=0 ⍝ As if we took zero steps
      segments←0
  :EndIf
  ⍝ Take into account the start point and the direction.
  steps←fromTo{(⊃⍺)+(-×-/⍺)×⍵}segments
∇
```

This is an extension to Problem 5 of Phase I. In each case, we compute the “segments”, i.e., the steps starting from 0. In a last step, common to all cases, we add the correct starting point and correct the direction if need be.

To compute equally-sized steps, we first divide the segment \([0, 1]\) into `p` equal segments with `(⍳p)÷p`. This subdivision can then be multiplied by the width to obtain the required segments.

When `p` is the step size, we just divide the width by the step size (rounded to the next largest integer) to get the required number of segments. If the last segment is too large, we “crop” it to the width with Minimum (`⌊`).
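The same case analysis can be sketched in Python (a hypothetical transcription of the APL logic, not part of the submission):

```python
import math

def steps(from_to, p=None):
    start, end = from_to
    width = abs(end - start)
    if p is None:        # no left argument: unit steps
        segments = list(range(width + 1))
    elif p < 0:          # -floor(p) equally-sized steps
        n = -math.floor(p)
        segments = [width * i / n for i in range(n + 1)]
    elif p > 0:          # p is the step size, crop the last segment
        n = math.ceil(width / p)
        segments = [min(width, p * i) for i in range(n + 1)]
    else:                # p == 0: as if we took zero steps
        segments = [0]
    # Take into account the start point and the direction.
    sign = 1 if end >= start else -1
    return [start + sign * s for s in segments]

print(steps((1, 5)))      # [1, 2, 3, 4, 5]
print(steps((5, 1), -2))  # [5.0, 3.0, 1.0]
```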

```
∇ urls←PastTasks url;r;paths
  r←HttpCommand.Get url
  paths←('[a-zA-Z0-9_/]+\.pdf'⎕S'&')r.Data
  urls←('https://www.dyalog.com/'∘,)¨paths
∇
```

I decided to use `HttpCommand` for this task, since it is simply one `]load HttpCommand` away and should be platform-independent.

Parsing XML is not something I consider “fun” in the best of cases, and I feel like APL is not the best language for this kind of thing. Given how simple the task is, I just decided to find the relevant bits with a regular expression using Replace and Search (`⎕S`).

After finding all the strings vaguely resembling a PDF file name (only alphanumeric characters and underscores, with a `.pdf` extension), I just concatenate them to the base URL of the Dyalog domain.

The first task can be solved by decomposing it into several functions.

```
⍝ Test if a DNA string is a reverse palindrome.
isrevp←{⍵≡⌽'TAGC'['ATCG'⍳⍵]}
```

First, we compute the complement of a DNA string (using simple indexing) and test if its Reverse (`⌽`) is equal to the original string.

```
⍝ Generate all subarrays (position, length) pairs, for 4 ≤ length ≤ 12.
subarrays←{⊃,/(⍳⍵),¨¨3↓¨⍳¨12⌊1+⍵-⍳⍵}
```

We first compute all the possible lengths for each starting point. For instance, the last element cannot have any (position, length) pair associated to it, because there are not even three elements following it. So we crop the possible lengths to \([4, 12]\). For instance, for an array of size 10:

```
{3↓¨⍳¨12⌊1+⍵-⍳⍵}10
┌──────────────┬───────────┬─────────┬───────┬─────┬───┬─┬┬┬┐
│4 5 6 7 8 9 10│4 5 6 7 8 9│4 5 6 7 8│4 5 6 7│4 5 6│4 5│4││││
└──────────────┴───────────┴─────────┴───────┴─────┴───┴─┴┴┴┘
```

Then, we just add the corresponding starting position to each length (1 for the first block, 2 for the second, and so on). Finally, we flatten everything.

```
∇ r←revp dna;positions
  positions←subarrays⍴dna
  ⍝ Filter subarrays which are reverse palindromes.
  r←↑({isrevp dna[¯1+⍵[1]+⍳⍵[2]]}¨positions)/positions
∇
```

For each possible (position, length) pair, we get the corresponding DNA substring with `dna[¯1+⍵[1]+⍳⍵[2]]` (adding `¯1` is necessary because `⎕IO←1`). We test if this substring is a reverse palindrome using `isrevp` above. Replicate (`/`) then selects only the (position, length) pairs for which the substring is a reverse palindrome.
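A rough Python transcription of the whole pipeline (hypothetical names, assuming the same 1-indexed (position, length) output):

```python
# Complement table for DNA bases: a reverse palindrome is a string
# equal to the reverse of its complement.
COMPLEMENT = str.maketrans("ATCG", "TAGC")

def is_revp(dna):
    return dna == dna.translate(COMPLEMENT)[::-1]

def revp(dna, min_len=4, max_len=12):
    # All 1-indexed (position, length) pairs whose substring is a
    # reverse palindrome, as in the APL version.
    return [(i + 1, l)
            for i in range(len(dna))
            for l in range(min_len, min(max_len, len(dna) - i) + 1)
            if is_revp(dna[i:i + l])]

print(is_revp("GCATGC"))  # True
print(revp("GCATGC"))     # [(1, 6), (2, 4)]
```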

The second task is just about counting the number of subsets modulo 1,000,000. So we just need to compute \(2^n \mod 1000000\) for any positive integer \(n\leq1000\).

`sset←{((1E6|2∘×)⍣⍵)1}`

Since we cannot just compute \(2^n\) directly and take the remainder, we use modular arithmetic to stay mod 1,000,000 during the whole computation. The dfn `(1E6|2∘×)` doubles its argument mod 1,000,000. So we just apply this function \(n\) times using the Power operator (`⍣`), with an initial value of 1.
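The same repeated-doubling idea in Python (a sketch; Python’s built-in three-argument `pow` already performs modular exponentiation, which gives us an easy cross-check):

```python
def sset(n, mod=10**6):
    # Apply "double, mod 1,000,000" n times, starting from 1.
    x = 1
    for _ in range(n):
        x = 2 * x % mod
    return x

print(sset(10))                           # 1024
print(sset(1000) == pow(2, 1000, 10**6))  # True
```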

First solution: `((1+⊢)⊥⊣)` computes the total return for a vector of amounts `⍺` and a vector of rates `⍵`. It is applied to every prefix subarray of amounts and rates to get all intermediate values. However, this has quadratic complexity.

`rr←(,\⊣)((1+⊢)⊥⊣)¨(,\⊢)`

Second solution: We want to be able to use the recurrence relation (`recur`) and scan through the vectors of amounts and rates, accumulating the total value at every time step. However, APL evaluation is right-associative, so a simple Scan (`recur\amounts,¨rates`) would not give the correct result, since `recur` is not associative and we need to evaluate it left-to-right. (In any case, Scan would have quadratic complexity here, so it would not bring any benefit over the previous solution.) What we need is something akin to Haskell’s `scanl` function, which evaluates left to right in \(O(n)\) time. (There is an interesting StackOverflow answer explaining the behaviour of Scan and comparing it to Haskell’s `scanl`.) This is what we do here, accumulating values from left to right. (This is inspired by `dfns.ascan`, although heavily simplified.)

`rr←{recur←{⍵[1]+⍺×1+⍵[2]} ⋄ 1↓⌽⊃{(⊂(⊃⍵)recur⍺),⍵}/⌽⍺,¨⍵}`
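For comparison, Python’s `itertools.accumulate` is exactly this kind of left-to-right scan; here is a sketch of the recurrence (the seeding of the first period with a zero total is my assumption and may differ slightly from the APL version):

```python
from itertools import accumulate

def rate_of_return(amounts, rates):
    # total_k = amount_k + total_(k-1) * (1 + rate_k), evaluated
    # left to right, like Haskell's scanl.
    def recur(total, step):
        amount, rate = step
        return amount + total * (1 + rate)
    return list(accumulate(zip(amounts, rates), recur, initial=0.0))[1:]

print(rate_of_return([100, 100], [0.25, 0.25]))  # [100.0, 225.0]
```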

For the second task, there is an explicit formula for cashflow calculations, so we can just apply it.

`pv←{+/⍺÷×\1+⍵}`

```
∇ text←templateFile Merge jsonFile;template;ns
  template←⊃⎕NGET templateFile 1
  ns←⎕JSON⊃⎕NGET jsonFile
  ⍝ We use a simple regex search and replace on the template.
  text←↑('@[a-zA-Z]*@'⎕R{ns getval ¯1↓1↓⍵.Match})template
∇
```

We first read the template and the JSON values from their files. The `⎕NGET` function reads simple text files, and `⎕JSON` extracts the key-value pairs as a namespace.

Assuming all variable names contain only letters, the regex `@[a-zA-Z]*@` matches variable names enclosed between `@` symbols. The function `getval` then returns the appropriate value, and we can replace the variable name in the template.

```
∇ val←ns getval var
  :If ''≡var ⍝ literal '@'
      val←'@'
  :ElseIf (⊂var)∊ns.⎕NL ¯2
      val←⍕ns⍎var
  :Else
      val←'???'
  :EndIf
∇
```

This function takes the namespace matching the variable names to their respective values, and the name of the variable.

- If the variable name is empty, we matched the string `@@`, which corresponds to a literal `@`.
- If the variable name is present in the namespace, we query the namespace to get the required value.
- Otherwise, we have an unknown variable, so we replace it with `???`.

`CheckDigit←{10|-⍵+.×11⍴3 1}`

The check digit satisfies the equation \[ 3 x_{1}+x_{2}+3 x_{3}+x_{4}+3 x_{5}+x_{6}+3 x_{7}+x_{8}+3 x_{9}+x_{10}+3 x_{11}+x_{12} \equiv 0 \bmod 10, \] therefore, \[ x_{12} \equiv -(3 x_{1}+x_{2}+3 x_{3}+x_{4}+3 x_{5}+x_{6}+3 x_{7}+x_{8}+3 x_{9}+x_{10}+3 x_{11}) \bmod 10. \]

Translated to APL, we just take the dot product between the first 11 digits of the barcode and `11⍴3 1`, negate it, and take the remainder modulo 10.
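The same computation is easy to sketch in Python (the function name is mine); the code 036000291452 is a valid UPC-A, so its first 11 digits should yield a check digit of 2:

```python
def check_digit(digits):
    # digits: the first 11 digits of the barcode.
    weights = [3, 1] * 5 + [3]  # 3 1 3 1 ... 3 (11 weights)
    return -sum(w * d for w, d in zip(weights, digits)) % 10

print(check_digit([0, 3, 6, 0, 0, 0, 2, 9, 1, 4, 5]))  # 2
```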

```
⍝ Left and right representations of digits. Decoding
⍝ the binary representation from decimal is more
⍝ compact than writing everything explicitly.
lrepr←⍉(7⍴2)⊤13 25 19 61 35 49 47 59 55 11
rrepr←~¨lrepr
```

For the second task, the first thing we need to do is save the representation of digits. To save space, I did not encode the binary representation explicitly, instead using a decimal representation that I then decode in base 2. The right representation is just the bitwise negation.

```
∇ bits←WriteUPC digits;left;right
  :If (11=≢digits)∧∧/digits∊0,⍳9
      left←,lrepr[1+6↑digits;]
      right←,rrepr[1+6↓digits,CheckDigit digits;]
      bits←1 0 1,left,0 1 0 1 0,right,1 0 1
  :Else
      bits←¯1
  :EndIf
∇
```

First of all, if the vector `digits` does not have exactly 11 elements, all between 0 and 9, it is an error and we return `¯1`.

Then, we take the first 6 digits and encode them with `lrepr`, and the last 5 digits plus the check digit encoded with `rrepr`. In each case, adding 1 is necessary because `⎕IO←1`. We return the final bit array with the required beginning, middle, and end guard patterns.

```
∇ digits←ReadUPC bits
  :If 95≠⍴bits ⍝ incorrect number of bits
      digits←¯1
  :Else
      ⍝ Test if the barcode was scanned right-to-left.
      :If 0=2|+/bits[3+⍳7]
          bits←⌽bits
      :EndIf
      digits←({¯1+lrepr⍳⍵}¨(7/⍳6)⊆42↑3↓bits),{¯1+rrepr⍳⍵}¨(7/⍳6)⊆¯42↑¯3↓bits
      :If ~∧/digits∊0,⍳9 ⍝ incorrect parity
          digits←¯1
      :ElseIf (⊃⌽digits)≠CheckDigit ¯1↓digits ⍝ incorrect check digit
          digits←¯1
      :EndIf
  :EndIf
∇
```

- If we don’t have the correct number of bits, we return `¯1`.
- We test the first digit for its parity, to determine if it’s actually a left representation. If it’s not, we reverse the bit array.
- Then, we take the bit array representing the right digits (`¯42↑¯3↓bits`), separate the different digits using Partition (`⊆`), and look up each of them in the `rrepr` vector using Index Of (`⍳`). We do the same for the left digits.
- Final checks for the range of the digits (i.e., if the representations could not be found in the `lrepr` and `rrepr` vectors), and for the check digit.

```
∇ parts←Balance nums;subsets;partitions
  ⍝ This is a brute force solution, running in
  ⍝ exponential time. We generate all the possible
  ⍝ partitions, filter out those which are not
  ⍝ balanced, and return the first matching one. There
  ⍝ are more advanced approaches running in
  ⍝ pseudo-polynomial time (based on dynamic
  ⍝ programming, see the "Partition problem" Wikipedia
  ⍝ page), but they are not warranted here, as the
  ⍝ input size remains fairly small.
  ⍝ Generate all partitions of a vector of a given
  ⍝ size, as binary mask vectors.
  subsets←{1↓2⊥⍣¯1⍳2*⍵}
  ⍝ Keep only the subsets whose sum is exactly
  ⍝ (+/nums)÷2.
  partitions←nums{((2÷⍨+/⍺)=⍺+.×⍵)/⍵}subsets⍴nums
  :If 0=≢,partitions
      ⍝ If no partition satisfies the above
      ⍝ criterion, we return ⍬.
      parts←⍬
  :Else
      ⍝ Otherwise, we return the first possible
      ⍝ partition.
      parts←nums{((⊂,(⊂~))⊃↓⍉⍵)/¨2⍴⊂⍺}partitions
  :EndIf
∇
```

This is the only problem that I didn’t complete. It required parsing the files containing the graphical representations of the mobiles, which was needlessly complex, and quite frankly hard and boring with a language like APL.

However, the next part is interesting: once we have a matrix of coefficients representing the relationships between the weights, we can solve the system of equations. Matrix Divide (`⌹`) will find one solution to the system. Since the system is overdetermined, we fix `A=1` to find one possible solution. Since we want integer weights, the solution we find is smaller than the one we want, and may contain fractional weights. So we multiply everything by the Lowest Common Multiple (`∧`) to get the smallest integer weights.

```
∇ weights←Weights filename;mobile;branches;mat
  ⍝ Parse the mobile input file.
  mobile←↑⊃⎕NGET filename 1
  branches←⍸mobile∊'┌┴┐'
  ⍝ TODO: Build the matrix of coefficients mat.
  ⍝ Solve the system of equations (arbitrarily setting
  ⍝ the first variable at 1 because the system is
  ⍝ overdetermined), then multiply the coefficients by
  ⍝ their least common multiple to get the smallest
  ⍝ integer weights.
  weights←((1∘,)×(∧/÷))mat[;1]⌹1↓[2]mat
∇
```

```
    :EndNamespace
:EndNamespace
```

I’ve always been quite fond of APL and its “array-oriented” approach to programming. (See my previous post on simulating the Ising model with APL, which also contains more background on the language.) Every year, Dyalog (the company behind probably the most popular APL implementation) organises a competition with various challenges in APL.

The Dyalog APL Problem Solving Competition consists of two phases:

- Phase I consists of 10 short puzzles (similar to what one can find on Project Euler or similar sites), each solvable by a one-line APL function.
- Phase II is a collection of larger problems that may require longer solutions and more context (e.g. reading and writing files), often in a more applied setting. Problems are often inspired by existing domains, such as AI, bioinformatics, and so on.

In 2018, I participated in the competition, entering only Phase I (my solutions are on GitHub). Since I was a student at the time, I was eligible for a prize, and I won $100 for a 10-line submission, which is quite good! This year, I entered both phases. I explain my solutions to Phase I in this post. Another post will contain annotated solutions for the Phase II problems.

The full code for my submission is on GitHub at dlozeve/apl-competition-2020, but everything is reproduced in this post.

Write a function that, given a right argument `Y` which is a scalar or a non-empty vector and a left argument `X` which is a single non-zero integer so that its absolute value is less than or equal to `≢Y`, splits `Y` into a vector of two vectors according to `X`, as follows:

- If `X>0`, the first vector contains the first `X` elements of `Y` and the second vector contains the remaining elements.
- If `X<0`, the second vector contains the last `|X` elements of `Y` and the first vector contains the remaining elements.

**Solution:** `(0>⊣)⌽((⊂↑),(⊂↓))`

There are three nested trains here. (Trains are nice to read, even if they are easy to abuse, and generally make for shorter dfns, which is better for Phase I.) The first one, `((⊂↑),(⊂↓))`, uses the two functions Take (`↑`) and Drop (`↓`) to build a nested array consisting of the two outputs we need. (Take and Drop already have the behaviour needed regarding negative arguments.) However, if the left argument is negative, the two arrays will not be in the correct order. So we need a way to reverse them if `X<0`.

The second train `(0>⊣)` will return 1 if its left argument is negative. From this, we can use Rotate (`⌽`) to correctly order the nested array, in the outer train.

UTF-8 encodes Unicode characters using 1–4 integers for each character. Dyalog APL includes a system function, `⎕UCS`, that can convert characters into integers and integers into characters. The expression `'UTF-8'∘⎕UCS` converts between characters and UTF-8. Consider the following:

```
'UTF-8'∘⎕UCS 'D¥⍺⌊○9'
68 194 165 226 141 186 226 140 138 226 151 139 57
'UTF-8'∘⎕UCS 68 194 165 226 141 186 226 140 138 226 151 139 57
D¥⍺⌊○9
```

How many integers does each character use?

```
'UTF-8'∘⎕UCS¨ 'D¥⍺⌊○9' ⍝ using ]Boxing on
┌──┬───────┬───────────┬───────────┬───────────┬──┐
│68│194 165│226 141 186│226 140 138│226 151 139│57│
└──┴───────┴───────────┴───────────┴───────────┴──┘
```

The rule is that an integer in the range 128 to 191 (inclusive) continues the character of the previous integer (which may itself be a continuation). With that in mind, write a function that, given a right argument which is a simple integer vector representing valid UTF-8 text, encloses each sequence of integers that represent a single character, like the result of `'UTF-8'∘⎕UCS¨'UTF-8'∘⎕UCS`, but does not use any system functions (names beginning with `⎕`).

**Solution:** `{(~⍵∊127+⍳64)⊂⍵}`

First, we build a binary array from the string, encoding each continuation integer as 0 and all the others as 1. We can then use this binary array with Partitioned Enclose (`⊂`) to return the correct output.

A Microsoft Excel spreadsheet numbers its rows counting up from 1. However, Excel’s columns are labelled alphabetically — beginning with A–Z, then AA–AZ, BA–BZ, up to ZA–ZZ, then AAA–AAZ and so on.

Write a function that, given a right argument which is a character scalar or non-empty vector representing a valid character Excel column identifier between A and XFD, returns the corresponding column number.

**Solution:** `26⊥⎕A∘⍳`

We use the alphabet `⎕A` and Index Of (`⍳`) to compute the index in the alphabet of every character. As a train, this can be done with `(⎕A∘⍳)`. We then obtain an array of numbers, each representing a letter from 1 to 26. The Decode (`⊥`) function can then turn this base-26 number into the expected result.
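The same base-26 decoding, sketched in Python (a hypothetical transcription):

```python
def column_number(name):
    # Base-26 decode with digits A=1 .. Z=26, like 26⊥⎕A∘⍳.
    n = 0
    for c in name:
        n = 26 * n + ord(c) - ord('A') + 1
    return n

print(column_number("A"))    # 1
print(column_number("AA"))   # 27
print(column_number("XFD"))  # 16384
```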

Write a function that, given a right argument which is an integer array of year numbers greater than or equal to 1752 and less than 4000, returns a result of the same shape as the right argument where 1 indicates that the corresponding year is a leap year (0 otherwise).

A leap year algorithm can be found here.

**Solution:** `1 3∊⍨(0+.=400 100 4∘.|⊢)`

According to the algorithm, a year is a leap year in two situations:

- if it is divisible by 4, but not 100 (and therefore not 400),
- if it is divisible by 400 (and therefore 4 and 100 as well).

The train `(400 100 4∘.|⊢)` will test if each year in the right argument is divisible by 400, 100, and 4, using an Outer Product. We then use an Inner Product to count how many times each year is divisible by one of these numbers. If the count is 1 or 3, it is a leap year. Note that we use Commute (`⍨`) to keep the dfn as a train, and to preserve the natural right-to-left reading of the algorithm.

Write a function that, given a right argument of 2 integers, returns a vector of the integers from the first element of the right argument to the second, inclusively.

**Solution:** `{(⊃⍵)+(-×-/⍵)×0,⍳|-/⍵}`

First, we have to compute the range of the output, which is the absolute value of the difference between the two integers: `|-/⍵`. From this, we compute the actual sequence, including zero: `0,⍳|-/⍵`. (If we had `⎕IO←0`, we could have written `⍳1+|-/⍵`, but this is the same number of characters.)

This sequence will always be nondecreasing, but we have to make it decreasing if needed, so we multiply it by the opposite of the sign of `-/⍵`. Finally, we just have to start the sequence at the first element of `⍵`.

Write a function that, given a right argument which is an integer vector and a left argument which is an integer scalar, reorders the right argument so any elements equal to the left argument come first while all other elements keep their order.

**Solution:** `{⍵[⍋⍺≠⍵]}`

`⍺≠⍵` will return a binary vector marking as 0 all elements equal to the left argument. Using this index to sort in the usual way with Grade Up will return the expected result.

A common technique for encoding a set of on/off states is to use a value of \(2^n\) for the state in position \(n\) (origin 0), 1 if the state is “on” or 0 for “off” and then add the values. Dyalog APL’s component file permission codes are an example of this. For example, if you wanted to grant permissions for read (access code 1), append (access code 8) and rename (access code 128) then the resulting code would be 137 because that’s 1 + 8 + 128.

Write a function that, given a non-negative right argument which is an integer scalar representing the encoded state and a left argument which is an integer scalar representing the encoded state settings that you want to query, returns 1 if all of the codes in the left argument are found in the right argument (0 otherwise).

**Solution:** `{f←⍸∘⌽(2∘⊥⍣¯1)⋄∧/(f⍺)∊f⍵}`

The difficult part is to find the set of states for an integer. We need a function that will return `1 8 128` (or an equivalent representation) for an input of `137`. To do this, we need the base-2 representation of \(137 = 1 + 8 + 128 = 2^0 + 2^3 + 2^7 = 10001001_2\). The function `(2∘⊥⍣¯1)` will return the base-2 representation of its argument, and by reversing it and finding where the non-zero elements are, we find the correct exponents (`1 4 8` in this case, shifted by one because `⎕IO←1`). That is what the function `f` does.

Next, we just need to check that all elements of `f⍺` are also in `f⍵`.

A zigzag number is an integer in which the difference in magnitude of each pair of consecutive digits alternates from positive to negative or negative to positive.

Write a function that takes a single integer greater than or equal to 100 and less than \(10^{15}\) as its right argument and returns a 1 if the integer is a zigzag number, 0 otherwise.

**Solution:** `∧/2=∘|2-/∘×2-/(10∘⊥⍣¯1)`

First, we decompose a number into an array of digits, using `(10∘⊥⍣¯1)` (Decode (`⊥`) in base 10). Then, we Reduce N Wise to compute the difference between each pair of digits, take the sign, and ensure that the signs are indeed alternating.

Write a function that, given a right argument which is an integer scalar or vector, returns a 1 if the values of the right argument conform to the following pattern (0 otherwise):

- The elements increase or stay the same until the “apex” (the highest value) is reached
- After the apex, any remaining values decrease or remain the same

**Solution:** `{∧/(⍳∘≢≡⍋)¨(⊂((⊢⍳⌈/)↑⊢),⍵),⊂⌽((⊢⍳⌈/)↓⊢),⍵}`

How do we approach this? First we have to split the vector at the “apex”. The train `(⊢⍳⌈/)` will return the index of (`⍳`) the maximum element.

```
(⊢⍳⌈/)1 3 3 4 5 2 1
5
```

Combined with Take (`↑`) and Drop (`↓`), we build a two-element vector containing both parts, in ascending order (we Reverse (`⌽`) one of them). Note that we have to Ravel (`,`) the argument to avoid rank errors in Index Of.

```
{(⊂((⊢⍳⌈/)↑⊢),⍵),⊂⌽((⊢⍳⌈/)↓⊢),⍵}1 3 3 4 5 2 1
┌─────────┬───┐
│1 3 3 4 5│1 2│
└─────────┴───┘
```

Next, `(⍳∘≢≡⍋)` on each of the two vectors will test if they are non-decreasing (i.e. if the ranks of all the elements correspond to a simple range from 1 to the size of the vector).

Write a function that takes as its right argument a vector of simple arrays of rank 2 or less (scalar, vector, or matrix). Each simple array will consist of either non-negative integers or printable ASCII characters. The function must return a simple character array that displays identically to what `{⎕←⍵}¨` displays when applied to the right argument.

**Solution:** `{↑⊃,/↓¨⍕¨⍵}`

The first step is to Format (`⍕`) everything to get strings. A lot of trial-and-error is always necessary when dealing with nested arrays, and this being about formatting exacerbates the problem.

The next step would be to “stack everything vertically”, so we will need Mix (`↑`) at some point. However, if we do it immediately we don’t get the correct result:

```
{↑⍕¨⍵}(3 3⍴⍳9)(↑'Adam' 'Michael')
1 2 3
4 5 6
7 8 9
Adam Michael
```

Mix is padding with spaces both horizontally (necessary as we want the output to be a simple array of characters) and vertically (not what we want). We will have to decompose everything line by line, and then mix all the lines together. This is exactly what Split (`↓`), the dual of Mix, does:

```
{↓¨⍕¨⍵}(3 3⍴⍳9)(↑'Adam' 'Michael')(⍳10) '*'(5 5⍴⍳25)
┌───────────────────┬─────────────────┬──────────────────────┬─┬───────────────
│┌─────┬─────┬─────┐│┌───────┬───────┐│┌────────────────────┐│*│┌──────────────
││1 2 3│4 5 6│7 8 9│││Adam │Michael│││1 2 3 4 5 6 7 8 9 10││ ││ 1 2 3 4 5
│└─────┴─────┴─────┘│└───────┴───────┘│└────────────────────┘│ │└──────────────
└───────────────────┴─────────────────┴──────────────────────┴─┴───────────────
─────────────────────────────────────────────────────────────┐
┬──────────────┬──────────────┬──────────────┬──────────────┐│
│ 6 7 8 9 10│11 12 13 14 15│16 17 18 19 20│21 22 23 24 25││
┴──────────────┴──────────────┴──────────────┴──────────────┘│
─────────────────────────────────────────────────────────────┘
```

Next, we clean this up with Ravel (`,`) and we can Mix to obtain the final result.

Operations research (OR) is a vast area comprising a lot of theory, different branches of mathematics, and too many applications to count. In this post, I will try to explain why it can be a little disconcerting to explore at first, and how to start investigating the topic with a few references to get started.

Keep in mind that although I studied it during my graduate studies, this is not my primary area of expertise (I’m a data scientist by trade), and I definitely don’t pretend to know everything in OR. This is a field too vast for any single person to understand in its entirety, and I talk mostly from an “amateur mathematician and computer scientist” standpoint.

Operations research can be difficult to approach, since there are many references and subfields. Compared to machine learning, for instance, OR has a slightly longer history, going back to the 18th century, for example with Monge and the optimal transport problem. (For a very nice introduction, in French, to optimal transport, see the blog posts by Gabriel Peyré on the CNRS maths blog: Part 1 and Part 2. See also the resources on optimaltransport.github.io, in English.) This means that good textbooks and such have existed for a long time, but also that there is plenty of material to choose from.

Moreover, OR is very close to applications. The presentation of a method may vary a lot depending on whether it is applied to train tracks, sudoku, or travelling salesmen, and in practice the terminology and notations are not the same everywhere. This is disconcerting if you are used to “pure” mathematics, where notations have evolved over a long time and are pretty much standardised in many areas. In contrast, if you are used to the statistics literature, with its strange notations, you will find that OR is actually very well formalised.

There are many subfields of operations research, including all kinds of optimization (constrained and unconstrained), game theory, dynamic programming, stochastic processes, etc.

For an overall introduction, I recommend Wentzel (1988). It is an old book published by Mir Publications, a Soviet publisher which put out many excellent scientific textbooks. (Mir also published *Physics for Everyone* by Lev Landau and Alexander Kitaigorodsky, a three-volume introduction to physics that is really accessible. Together with Feynman’s famous lectures, I read them (in French) when I was a kid, and it was the best introduction I could possibly have to the subject.) The book is out of print, but it is available on Archive.org. It is quite old, but everything presented is still extremely relevant today. It requires absolutely no background, and covers everything: a general introduction to the field, linear programming, dynamic programming, Markov processes and queues, Monte Carlo methods, and game theory. Even if you already know some of these topics, the presentation is so clear that it is a pleasure to read! (In particular, it is one of the best presentations of dynamic programming that I have ever read. The explanation of the simplex algorithm is also excellent.)

If you are interested in optimization, the first thing you have to learn is modelling, i.e. transforming your problem (described in natural language, often from a particular industrial application) into a mathematical programme. The mathematical programme is the structure on which you will be able to apply an algorithm to find an optimal solution. Even if (like me) you are initially more interested in the algorithmic side of things, learning to create models will shed a lot of light on the overall process, and will give you more insight in general on the reasoning behind algorithms.

The best book I have read on the subject is Williams (2013). It contains a lot of step-by-step examples of concrete applications, in a multitude of domains, and remains very easy to read and to follow. It covers nearly every type of problem, so it is very useful as a reference. When you encounter a concrete problem in real life afterwards, you will know how to construct an appropriate model, and in the process you will often identify a common type of problem. The book then gives plenty of advice on how to approach each type of problem. Finally, it is also a great resource to build a “mental map” of the field, avoiding getting lost in the jungle of linear, stochastic, mixed integer, quadratic, and other network problems.

Another interesting resource is the freely available MOSEK Modeling Cookbook, covering many types of problems, with more mathematical details than in Williams (2013). It is built for people wanting to use the commercial MOSEK solver, so it could be useful if you plan to use a solver package like this one (more details on solvers below).

The basic algorithm of mathematical optimization is the simplex algorithm, developed by Dantzig in the 1940s to solve linear programming problems. It is one of the main building blocks of the field, used and referenced extensively in all kinds of approaches, so it is really important to understand it in detail. There are many books on the subject, but I especially liked Chvátal (1983) (out of print, but you can find cheap used copies on Amazon). It covers everything there is to know about the simplex algorithm (step-by-step explanations with simple examples, correctness and complexity analysis, computational and implementation considerations) along with many applications; I think it is overall the best introduction. Vanderbei (2014) follows a very similar outline but contains more recent computational considerations. (The author also has lecture slides.) For all the details about practical implementations of the simplex algorithm, Maros (2003) is dedicated to the computational aspects and contains everything you will need.
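To make the mechanics concrete, here is a minimal sketch of the tableau simplex method in Python, in the spirit of the step-by-step examples in Chvátal (1983). It only handles problems of the form maximize c·x subject to Ax ≤ b with x ≥ 0 and b ≥ 0 (so the slack variables form an initial basic feasible solution); the function name, tolerances, and pivoting details are my own choices, and a production implementation (the subject of Maros (2003)) looks very different:

```python
def simplex(c, A, b):
    """Maximize c·x subject to A x <= b and x >= 0, assuming b >= 0."""
    m, n = len(A), len(c)
    # Tableau rows: [A | I | b]; the slack variables are the initial basis.
    T = [list(map(float, row)) + [0.0] * m + [float(bi)]
         for row, bi in zip(A, b)]
    for i in range(m):
        T[i][n + i] = 1.0
    # Objective row: negated costs; its last entry tracks the objective value.
    z = [-float(ci) for ci in c] + [0.0] * (m + 1)
    basis = list(range(n, n + m))
    while True:
        # Entering variable: first column with negative reduced cost (Bland's rule).
        col = next((j for j in range(n + m) if z[j] < -1e-9), None)
        if col is None:
            break  # no improving direction: current solution is optimal
        # Leaving variable: minimum-ratio test over rows with a positive pivot.
        ratios = [(T[i][-1] / T[i][col], i) for i in range(m) if T[i][col] > 1e-9]
        if not ratios:
            raise ValueError("problem is unbounded")
        _, row = min(ratios)
        # Pivot: normalise the pivot row, then eliminate the column elsewhere.
        piv = T[row][col]
        T[row] = [v / piv for v in T[row]]
        for i in range(m):
            if i != row:
                f = T[i][col]
                T[i] = [u - f * v for u, v in zip(T[i], T[row])]
        f = z[col]
        z = [u - f * v for u, v in zip(z, T[row])]
        basis[row] = col
    x = [0.0] * n
    for i, j in enumerate(basis):
        if j < n:
            x[j] = T[i][-1]
    return x, z[-1]
```

On the classic textbook instance (maximize 3x₁ + 5x₂ subject to x₁ ≤ 4, 2x₂ ≤ 12, 3x₁ + 2x₂ ≤ 18), `simplex([3, 5], [[1, 0], [0, 2], [3, 2]], [4, 12, 18])` returns the optimum x = (2, 6) with value 36, which you can verify by hand in two or three pivots.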

For more books on linear programming, the two volumes Dantzig (1997) and Dantzig (2003) are very complete, if somewhat more mathematically advanced. Bertsimas and Tsitsiklis (1997) is also a great reference, if you can find it.

For all the other subfields, this great StackExchange answer contains a lot of useful references, including most of the above. Of particular note are Peyré and Cuturi (2019) for optimal transport, Boyd (2004) for convex optimization (freely available online), and Nocedal (2006) for numerical optimization. Kochenderfer (2019) is not in the list (because it is very recent) but is also excellent, with examples in Julia covering nearly every kind of optimization algorithm.

If you would like to watch video lectures, there are a few good options freely available online, in particular on MIT OpenCourseWare. The list of courses at MIT is available on their webpage. I haven’t actually looked at the course contents in detail, so I cannot vouch for them directly, but MIT courses are generally of excellent quality, and most of these are taught by Bertsimas and Bertsekas, who are very famous and have written many excellent books. (I am more comfortable reading books than watching lecture videos online. Although I liked attending classes during my studies, I do not have the same feeling in front of a video. When I read, I can re-read the same sentence three times, pause to look something up, or skim a few paragraphs. The inability to do that with a video greatly diminishes my ability to concentrate.)

Of particular note are:

- Introduction to Mathematical Programming,
- Nonlinear Optimization,
- Convex Analysis and Optimization,
- Algebraic Techniques and Semidefinite Optimization,
- Integer Programming and Combinatorial Optimization.

Another interesting course I found online is Deep Learning in Discrete Optimization, at Johns Hopkins. It is taught by William Cook, the author of *In Pursuit of the Traveling Salesman*, a nice, readable introduction to the TSP. The course gives an interesting overview of deep learning and integer programming, with a focus on the connections between the two, and on applications to recent research areas in ML (reinforcement learning, attention, etc.).

When you start reading about modelling and algorithms, I recommend trying to solve a few problems yourself, either by hand for small instances or using an existing solver. This will allow you to follow the examples in books while also practising your modelling skills, and you will get an intuition for what is difficult to model and to solve.
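As an example of checking a small instance solved by hand: a bounded, feasible two-variable linear program always attains its optimum at a vertex of the feasible polygon, so you can simply enumerate the intersections of pairs of constraint boundaries and keep the feasible ones. The sketch below is a hypothetical helper of my own (the function name and tolerances are not from any library), only sensible for tiny instances:

```python
from itertools import combinations

def lp2_brute_force(c, A, b):
    """Maximize c[0]*x + c[1]*y subject to A (x, y) <= b and x, y >= 0,
    by enumerating the vertices of the feasible polygon."""
    # Each constraint is stored as (p, q, r), meaning p*x + q*y <= r;
    # the non-negativity bounds become -x <= 0 and -y <= 0.
    cons = [(float(p), float(q), float(r)) for (p, q), r in zip(A, b)]
    cons += [(-1.0, 0.0, 0.0), (0.0, -1.0, 0.0)]
    best = None
    for (p1, q1, r1), (p2, q2, r2) in combinations(cons, 2):
        det = p1 * q2 - q1 * p2
        if abs(det) < 1e-12:
            continue  # parallel boundary lines: no intersection point
        # Intersection of the two boundary lines (Cramer's rule).
        x = (r1 * q2 - q1 * r2) / det
        y = (p1 * r2 - r1 * p2) / det
        # Keep the point only if it satisfies every constraint.
        if all(p * x + q * y <= r + 1e-9 for p, q, r in cons):
            val = c[0] * x + c[1] * y
            if best is None or val > best[0]:
                best = (val, (x, y))
    return best  # (objective value, (x, y)), or None if infeasible
```

On the small instance maximize 3x + 5y subject to x ≤ 4, 2y ≤ 12, 3x + 2y ≤ 18, it finds the optimum 36 at (2, 6), confirming a by-hand computation.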

There are many solvers available, both free and commercial, with various capabilities. I recommend the fantastic JuMP library for Julia, which exposes a domain-specific language for modelling, along with interfaces to nearly all major solver packages. (Even if you don’t know Julia, this is a great and easy way to start!) If you’d rather use Python, you can use Google’s OR-Tools or PuLP for linear programming.

Regarding solvers, JuMP’s documentation maintains a list of supported solvers, with their capabilities and licenses. Free solvers include GLPK (linear programming), Ipopt (non-linear programming), and SCIP (mixed-integer linear programming).

Commercial solvers often have better performance, and some of them offer a free academic license: MOSEK, Gurobi, and IBM CPLEX in particular all do, and work very well with JuMP.

Another awesome resource is the NEOS Server. It offers free computing resources for numerical optimization, including all major free and commercial solvers! You can submit jobs in a standard format, or interface your favourite programming language with it. The fact that such an amazing resource exists for free, for everyone, is extraordinary. They also have an accompanying book, the NEOS Guide, containing many case studies and descriptions of problem types. The taxonomy may be particularly useful.

Operations research is a fascinating topic, and it has an abundant literature that makes it very easy to dive into the subject. If you are interested in algorithms, modelling for practical applications, or just wish to understand more, I hope to have given you the first steps: start reading and experimenting!

Bertsimas, Dimitris, and John N. Tsitsiklis. 1997. *Introduction to Linear Optimization*. Belmont, Massachusetts: Athena Scientific. http://www.athenasc.com/linoptbook.html.

Boyd, Stephen, and Lieven Vandenberghe. 2004. *Convex Optimization*. Cambridge, UK; New York: Cambridge University Press.

Chvátal, Vašek. 1983. *Linear Programming*. New York: W.H. Freeman.

Dantzig, George. 1997. *Linear Programming 1: Introduction*. New York: Springer. https://www.springer.com/gp/book/9780387948331.

———. 2003. *Linear Programming 2: Theory and Extensions*. New York: Springer. https://www.springer.com/gp/book/9780387986135.

Kochenderfer, Mykel J., and Tim A. Wheeler. 2019. *Algorithms for Optimization*. Cambridge, Massachusetts: The MIT Press.

Maros, István. 2003. *Computational Techniques of the Simplex Method*. Boston: Kluwer Academic Publishers.

Nocedal, Jorge, and Stephen J. Wright. 2006. *Numerical Optimization*. New York: Springer. https://www.springer.com/gp/book/9780387303031.

Peyré, Gabriel, and Marco Cuturi. 2019. “Computational Optimal Transport.” *Foundations and Trends in Machine Learning* 11 (5-6): 355–607. https://doi.org/10.1561/2200000073.

Vanderbei, Robert. 2014. *Linear Programming: Foundations and Extensions*. New York: Springer.

Wentzel, Elena S. 1988. *Operations Research: A Methodological Approach*. Moscow: Mir publishers.

Williams, H. Paul. 2013. *Model Building in Mathematical Programming*. Chichester, West Sussex: Wiley. https://www.wiley.com/en-fr/Model+Building+in+Mathematical+Programming,+5th+Edition-p-9781118443330.

ICLR is one of the most important conferences in machine learning, and as such, I was very excited to have the opportunity to volunteer and attend the first fully-virtual edition of the event. The whole content of the conference has been made publicly available, only a few days after the end of the event!

I would like to thank the organizing committee for this incredible event, and for the possibility to volunteer and help other participants. (To better organize the event and help people navigate the various online tools, they brought in 500(!) volunteers, waived our registration fees, and asked us to do simple load-testing and tech support. This was a very generous offer, and felt very rewarding: we could attend the conference while giving a little back to the organization.)

The many volunteers, the online-only nature of the event, and the low registration fees also allowed for what felt like a very diverse, inclusive event. Many graduate students and researchers from industry (like me), who do not generally have the time or the resources to travel to conferences like this, were able to attend, making the exchanges richer.

In this post, I will try to give my impressions on the event, the speakers, and the workshops that I could attend. I will do a quick recap of the most interesting papers I saw in a future post.

As a result of global travel restrictions, the conference was made fully virtual. It was originally supposed to take place in Addis Ababa, Ethiopia, a location that would have been great for people who are often the target of the restrictive visa policies of North American countries.

The thing I appreciated most about the conference format was its emphasis on *asynchronous* communication. Given how little time they had to plan the conference, they could have run all poster presentations over video-conference and called it a day. Instead, each poster had to record a 5-minute video summarising the research. (The videos are streamed using SlidesLive, which is a great solution for synchronising videos and slides: it is very comfortable to navigate through the slides and jump the video to a given slide, and vice versa. As a result, SlidesLive also has a very nice library of talks from major conferences, which is much better than browsing YouTube randomly.) Alongside each presentation, there was a dedicated Rocket.Chat channel where anyone could ask the authors a question, or just show their appreciation for the work. (Rocket.Chat seems to be an open-source alternative to Slack. Overall, the experience was great, and I appreciate the organizers’ efforts to use open-source software instead of proprietary applications. I hope other conferences will do the same, and perhaps even avoid Zoom because of recent privacy concerns; Jitsi might be an alternative.) This was a fantastic idea, as it allowed any participant to interact with papers and authors at any time, which is especially important when people are spread all over the globe.

There were also Zoom sessions where authors were available for direct, face-to-face discussions, allowing for more traditional conversations. But asking questions in the channel also had the advantage of keeping track of all the questions other people had asked. I quickly acquired the habit of watching the video, reading the chat to catch up on previous discussions (even if they happened in the middle of the night in my timezone!), and then skimming the paper or asking questions myself.

All of these excellent ideas were brought together in an amazing website, collecting all papers in a searchable, easy-to-use interface, and even including a nice visualisation of papers as a point cloud!

Overall, there were 8 invited speakers (two for each day of the main conference). Each gave a 40-minute presentation, followed by a Q&A both via the chat and via Zoom. I only saw a few of them, but I expect I will watch the others in the near future.

This talk was fascinating. It is about robotics, and especially how to design the “software” of our robots. We want to program a robot so that it works as well as possible across all the domains it may encounter. I loved the discussion of how to describe the space of distributions over domains from the point of view of the robot factory:

- The domain could be very narrow (e.g. playing a specific Atari game) or very broad and complex (performing a complex task in an open world).
- The factory could know in advance in which domain the robot will evolve, or have a lot of uncertainty around it.

There are many ways to describe a policy (i.e. the software running in the robot’s head), and many ways to obtain them. If you are familiar with recent advances in reinforcement learning, this talk is a great occasion to take a step back, and review the relevant background ideas from engineering and control theory.

Finally, the most important take-away from this talk is the importance of *abstractions*. Whatever methods we use to program our robots, we still need a lot of human insight to give them good structural biases. There are many more insights on the cost of experience, (hierarchical) planning, learning constraints, etc., so I strongly encourage you to watch the talk!

This is a very clear presentation of an area of ML research I do not know very well. I really like the approach of teaching a set of methods from a “historical”, personal point of view. Laurent Dinh shows us how he arrived at this topic and what he finds interesting, in a very personal and relatable manner. This has the double advantage of introducing us to a topic he is passionate about, while also giving us a glimpse of a researcher’s process, not hiding the momentary disillusions and disappointments, but emphasising the great achievements. Normalizing flows are also very interesting because the field is grounded in strong theoretical results that bring together a lot of different methods.

This talk was very interesting, and yet felt very familiar, as if I had already seen a very similar one elsewhere. This is especially true of Yann LeCun, who clearly reuses the same slides for many presentations at various events. Both speakers came back to their favourite subjects: self-supervised learning for Yann LeCun, and system 1/system 2 for Yoshua Bengio. All in all, they are very good speakers, and their presentations are always insightful. Yann LeCun gives a lot of references on recent technical advances, which is great if you want to go deeper into the approaches he recommends. Yoshua Bengio is also very good at broadening the debate around deep learning and introducing very important concepts from cognitive science.

On Sunday, there were 15 different workshops. All of them were recorded, and are available on the website. As always, unfortunately, there are too many interesting things to watch everything, but I saw bits and pieces of different workshops.

A lot of pretty advanced talks about RL. The general theme was meta-learning, a.k.a. “learning to learn”. This is a very active area of research that goes way beyond classical RL theory and offers many interesting avenues into adjacent fields (both inside ML and outside, especially cognitive science). The first talk, by Martha White, on inductive biases, was a very interesting and approachable introduction to the problems and challenges of the field. There was also a panel with Jürgen Schmidhuber: we hear a lot about him from the various controversies, but it’s nice to see him talking about research and future developments in RL.

Ever since I read Judea Pearl’s *The Book of Why*, I have been interested in how we can incorporate causal reasoning into machine learning. This is a complex topic, and I’m not yet sure it is the complete revolution that Judea Pearl likes to portray it as, but it nevertheless introduces a lot of fascinating new ideas. Yoshua Bengio gave an interesting talk (even though it was very similar to his keynote) on causal priors for deep learning. (You can find it at 4:45:20 in the livestream of the workshop.)

Cognitive science is fascinating, and I believe that collaboration between ML practitioners and cognitive scientists will greatly help advance both fields. I only watched Leslie Kaelbling’s presentation, which echoes a lot of her talk at the main conference and complements it nicely, with more focus on intelligence, especially *embodied* intelligence. I think she has the right approach to the relationship between AI and natural science, explicitly listing the things from her work that would be helpful to natural scientists, and the things she wishes she knew about natural intelligences. It raises many fascinating questions about ourselves, what we build, and what we understand. I found it very motivating!

I didn’t attend this workshop, but I think I will watch the presentations if I can find the time. I have found the intersection of differential equations and ML very interesting ever since the famous NeurIPS best paper on Neural ODEs. I think such contributions to ML theory from other fields of mathematics would be extremely beneficial to a better understanding of the systems we build.
