When a given square sparse matrix can be permuted into a lower triangular form?

For some, the question in the title of this post could look trivial. After all, they have been doing LU decomposition in Matlab and using Matlab’s mldivide (or “backslash”) to solve systems with the factor matrices. These are usually done without worrying if the pivoting took place and permuted those factor matrices (Matlab used to call them psychologically triangular matrices. For “why?” and “who said that?”, see here). Because, the help page of mldivide says that it tests if a matrix “is square, is not diagonal, looks triangular, is actually triangular, there are zeros in the diagonal, is a permuted triangular”. So this must be taken care of…

Davis’s “Direct Methods for Sparse Linear Systems” book (a real gem) asks in its Exercise 3.7:

…Consider a matrix \mathbf{A} that may be a permuted upper or lower triangular matrix with both rows and columns permuted by unknown permutations \mathbf{P} and \mathbf{Q}. Write an algorithm that determines if the matrix is in this form and, if so, solves \mathbf{A}x = b….

Then the book hints an algorithm, which builds the permutation matrices essentially by picking a row with a single nonzero, ordering that row and the associated column, and then by deleting both from the matrix and so on. But this is ok, only if we can recover a full diagonal! What if, at some point, we have rows or columns with no entries at all? The “algorithm” does not handle this case.

It is perhaps ok in the context of LU decomposition, in which case one is warned of about the numerical rank deficiency during LU decomposition (before one attempts to solve the linear systems). But in general it is not ok; at Mathworks, folks are aware of this (see here). If the matrix is not coming from LU decomposition, then we may well be in trouble. We are back to the question: When a given square sparse matrix can be permuted into a lower triangular form?

Some other cases are also easy. If there is a perfect matching in the given matrix, then all standard matching algorithms find it quickly, and after permuting the matching entries to the diagonal, we can check if the matrix is reducible into 1\times 1 blocks (or the directed graph of \mathbf{B}=\mathbf{AQ} is acyclic, where \mathbf{Q} permutes the columns of \mathbf{A} to have a zero-free diagonal in \mathbf{B}). Every step of this is linear time—even computing the matching with the standard tricks). So this case is handled. The question is complicated when we do not have a perfect matching. It is in fact NP-complete to decide if a given square sparse matrix can be permuted into a lower/upper triangular form [1] with unsymmetric permutations.

So writing it once more to emphasize it:

It is easily decidable to test if \mathbf{A} can be symmetrically permuted to a lower/upper triangular form. The directed graph of \mathbf{A} should be acyclic (or \mathbf{PAP}^T is lower/upper triangular).

It is NP-complete to decide if a given square sparse matrix can be permuted to a lower/upper triangular form. Otherwise said, one can find two permutation matrices \mathbf{P} and \mathbf{Q} such that \mathbf{PAQ} is lower/upper triangular only with non-polynomial time algorithms, assuming P\neq NP.

The topic was discussed elsewhere before.


  1. Guillaume Fertin, Irena Rusu, and Stéphane Vialette:
    Obtaining a triangular matrix by independent row-column permutations. ISAAC 2015: 165–175 (doi)

Karp–Sipser heuristic:

On the degree-1 and degree-2 reduction rules

Karp–Sipser (KS) [5] heuristic is a well-known method to obtain high quality matchings in undirected graphs, bipartite graphs, and hypergraphs.  This heuristic is based on two reduction rules. The first rule says that if there is a vertex with degree one, then we can match that vertex with its unique neighbor and remove both from the graph without any loss.  The second rule says that if there is a vertex v with degree two, then its neighbors can be coalesced, and the vertex v can be discarded. A maximum matching on the reduced graph can be extended to a maximum matching on the original graph. If there is no degree-1 or degree-2 vertices, then a randomly selected edge is used to match two vertices, upon which both of them are removed.

The (full) heuristic with the degree-2 reduction rule turns out to be difficult to analyse. That is why most studies are conserved with the variant with the degree-1 reduction rule. This variant has excellent practical performance [6,7], good theoretical guarantees for certain classes of random graphs [2,5], and can be made to have an approximation ratio around 0.866 [3] for bipartite graphs and a close-by ratio on general graphs [4]. There are graphs for which KS with only the degree-1 reduction can obtain worse results. The following is an example and a table from [3].

The bipartite graph instances giving hard time to KS with the degree-1 rule correspond to the following matrices (rows form one part, and the columns form the other part) . The rows and columns are split into two sets in the middle, yielding a 2\times 2 structure. The (1,1)-block is full, the (2,2)-block is empty, and there is an identity matrix at the other two blocks. In order to hide the perfect matching composed of those two identity matrices, additional nonzeros are added to the last t rows in the first row block, and the last t columns in the first column block so that those rows and columns become dense. For n=3200, and with different thickness t\in \{2,4,8,16, 32\} in the full rows, KS with only the degree-1 rule obtains the performance shown in the table.


t Performance
2 0.782
4 0.704
8 0.707
16 0.685
32 0.670

We want to now turn attention to the full KS (that is with the degree-1 and degree-2 reduction rules). Anastos and Frieze [1] state in the abstract of a recent arXiv paper:

In a seminal paper on finding large matchings in sparse random graphs, Karp and Sipser (FOCS 1981) proposed two algorithms for this task. The second algorithm has been intensely studied, but due to technical difficulties, the first algorithm has received less attention. Empirical results in the orignal paper suggest that the first algorithm is superior. In this paper we show that this is indeed the case, at least for random cubic graphs. We show that w.h.p. the first algorithm will find a matching of size n/2-O(\log n) on a random cubic graph (indeed on a random graph with degrees in \{3, 4\}). We also show that the algorithm can be adapted to find a perfect matching w.h.p. in O(n) time, as opposed to O(n^{3/2}) time for the worst-case.

This wonderful result shows improvement with the degree-2 reductions (over using only degree-1 reductions): instead of leaving out O(n^{1/5}) vertices as KS with the degree-1 rule only, it leaves out only \log n vertices. We can also create instances for which KS with both reduction rules obtains perfect matchings, whereas that with the degree-1 reduction obtains inferior results. These instances are upper (or lower) Hessenberg matrices, or more sparser versions of it (for example if we discard all but the first and the last subdiagonal nonzeros). A figure is shown below with only two subdiagonal nonzeros. On these instances, an immediate application of the degree-2 reduction triggers a chain of degree-1 reductions, until we have a 2\times 2 matrix, whose perfect matching is obtained by a degree-2 and then a degree-1 reduction. On the other hand, KS with only the degree-1 rule obtains the performance shown in the table for different n (the performance reported in the table is average of 10 different random runs on the same instance).


n Performance
500 0.828
1000 0.790
5000 0.754

There is a catch though. While there is a linear time, straightforward implementation of the degree-1 rule, the straightforward implementation of the degree-2 rule can be quadratic in the worst case; consider repetitive merges with the same vertex which will happen with an Hessenberg matrix. We can perhaps assume that the worst cases are rare in practice, but nonetheless a worst-case quadratic run time complexity is a lot!


  1. Michael Anastos and Alan Frieze, Finding perfect matchings in random cubic graphs in linear time, arXiv preprint arXiv:1808.00825, Nov., 2018. (url)
  2. Jonathan Aronson, Alan Frieze, and Boris G. Pittel, Maximum matchings in sparse random graphs: Karp-Sipser revisited, Random Structures & Algorithms, 12 (1998), pp. 111-177.
  3. Fanny Dufossé, Kamer Kaya, and Bora Uçar, Two approximation algorithms for bipartite matching on multicore architectures, Journal of Parallel and Distributed Computing, 85 (2015), pp. 62–78. (doi)
  4. Fanny Dufossé, Kamer Kaya, Ioannis Panagiotas, and Bora Uçar, Approximation algorithms for maximum matchings in undirected graphs, Proceedings of the Seventh SIAM Workshop on Combinatorial Scientific Computing, Bergen, Norway, 2018, pp. 56–65. (doi)
  5. Richard M. Karp and Michael Sipser, Maximum matching in sparse random graphs, 22nd Annual IEEE Symposium on Foundations of Computer Science (FOCS), 1981, pp. 364–375. (doi)
  6. Johannes Langguth, Fredrik Manne, and Peter Sanders, Heuristic initialization for bipartite matching problems, Journal of Experimental Algorithmics, 15 (2010), pp. 1.1–1.22. (doi)
  7. Jakob Magun, Greedy matching algorithms, an experimental study, Journal of Experimental Algorithmics, 3 (1998). (doi)

9 November 2018

Today was a great day at the school. There were two remarkable events. The first one is the 30th anniversary of the lab LIP at ENS Lyon (here is the program). The second was that the ENS Lyon awarded Marc Snir with an honorary doctorate degree (Docteur Honoris Causa). The page announcing this event is in French and google translate does a good job in translating it to English. Yves Robert had slides for introducing Marc during the event.

Marc Snir
Marc Snir

Many of us know Marc (ever heard of the MPI standard and the book “MPI: The complete reference“). His work span complexity theory, to MPI standard, to parallel computing systems. And oh, he speaks French.

I was lucky to see his talk during the 30th anniversary of the LIP (but unlucky to miss the ceremony of Docteur Honoris Causa). He gave an overview of his involvement with building parallel machines: BlueGene, Blue Waters, SP/Vulcan, and others. His talk has many whimsical observations. Here are some:

  • A supercomputer research prototype is an oxymoron.
  • A supercomputer research design is either boring or unpractical.
  • The main contribution [of all the supercomputer design projects]: The projects educated a generation of researchers.
  • Theory informs practice, but should not be taken literally.

After stating

Often theory follows practice, rather than practice following theory

he discussed how his paper with Upfal and Felperin was motivated by Vulcan’s practically well behaving design of \log N + O(1) stages. Back then, the theory demonstrated 2\log N stages to avoid worst cases. The cited paper shows \log N + \log\log N, where the extra term is O(1) for practical purposes.

After the talk, I wondered which computer he helped to build was his favorite. He said, more or less,

I created them, so all are my favorite !


Eli Upfal, Sergio Felperin, and Marc Snir, Randomized routing with shorter paths, IEEE Transactions on Parallel and Distributed Systems, 7(4), pp. 356–362, 1996. (doi)

After CSC18

It has been long time we last blogged, due to Bora’s other professional engagements. There has been many things in between, including the 8th SIAM Workshop on CSC (CSC18), June 6–8, 2018, Bergen, Norway. The best paper of CSC18 by Kevin Deweese and John R. Gilbert, entitled Evolving Difficult Graphs for Laplacian Solvers is our subject.

Kevin is currently a PhD student at the University of California,  Santa Barbara, working on provably fast Laplacian solvers. See his web page for a few of his papers with experimental evaluation (most of the similar solvers are hard to implement).

Here is the abstract of the subject paper (Link to paper) by Kevin and John:

We seek to discover Laplacian linear systems that stress the ability of existing Laplacian solver packages to solve them efficiently. We employ a genetic algorithm to explore the problem space of graphs of fixed size and edge density. The goal is to measure the gap between theoretical and existing Laplacian solvers, by trying to find worst case example graphs for existing solvers. These problems may have little use inside any real world application, but they give great insight into solver behavior. We report performance results of our genetic algorithm, and explore the properties of the evolved graphs.

Kevin and John focus on the combinatorial solver by Kelner, Orecchia, Sidford, and Zhu (arXiv link) known as KOSZ on the provably fast solver side, and PCG with a Jacobi preconditioner on the traditional side. The genetic algorithms by Kevin and John create populations of graphs by starting with an initial population. Then, a Laplacian solver is run on all graphs and those which required the most work to solve are selected as parents. A random vertex is swapped between the selected parents to yield new individuals. Random edge mutations of the form edge removal and replacement are performed. The techniques are versatile: they are used for creating hard instances for KOSZ and PCG; they are also used to create instances in which the performance of KOSZ with respect to that of PCG varies. On a reported instance, KOSZ outperforms PCG by a factor of 2, and on another one PCG outperforms KOSZ by a factor of 140! In all experiments, the performance is measured in terms of the number of arithmetic operations. Future work includes combining different instances to yield larger problems which stress both solvers to understand our abilities (beware solvers!).

Recent news on the minimum fill-in problem

We have just seen a a paper in the Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2017. The proceedings is available online for the curious. A paper from the proceedings is the subject of this post.

Minimum fill-in: Inapproximability and almost tight lower bounds,
by, Yixin Cao and R. B. Sandeep (url).

The minimum fill-in problem is one of the core problems in sparse direct methods. This problem is NP-complete [6]. The NP-completeness of the problem was first conjectured to be true in 1976 by Rose, Tarjan, and Lueker [5] in terms of the elimination process on undirected graphs. Then Rose and Tarjan [4] proved in 1978 that finding an elimination ordering on a directed graph that gives minimum fill-in is NP-complete (there was apparently a glitch in the proof which was rectified by Gilbert [1] two years later). Finally, Yannakakis [6] proved the NP-completeness of the minimum fill-in problem on undirected graphs in 1981.

The sparse matrix community needs a heuristic to tackle the minimum fill-in problem, as direct methods are used in many applications. There are quite efficient and effective heuristics for this purpose, which are variants of the approximate minimum degree and incomplete nested dissection algorithms. These heuristics are efficiently implemented in programs that are wide spread. The mentioned heuristics do not have any performance guarantees (except for classes of graphs such as graphs with good separators) but they are very-well established. So there is an NP-complete problem, practitioners have some heuristics and are happy with what they have.

On the other hand, the minimum fill-in problem poses many challenges to the theoreticians. Natanzon, Shamir, and Sharan [2] obtained an algorithm that approximates the minimum fill-in (k) by bounding the fill by 8k^2 in O(k n^3) time, for a graph with n vertices. The paper by Cao and Sandeep shows that unless P=NP, there is no polynomial time approximation scheme (PTAS) for the minimum fill-in problem. A PTAS is an algorithm which takes an instance of an optimization problem and a parameter \epsilon > 0 and, in polynomial time, produces a solution that is within a factor 1 + \epsilon of the optimum. This is a bad news in a way: no simple heuristic with a performance guarantee for theoretically oriented practitioners is in view.

The paper by Cao and Sandeep contains two other theorems, ruling out algorithms with approximation 1+\epsilon and with a running time 2^{O(n^{1-\delta})}, and exact algorithms with running time in 2^{O(k^{1/2-\delta})}\cdot n^{O(1)}, assuming exponential time hypothesis (ETH). Here n is again the number of vertices, k is an integer parameterizing the fill-in, and \delta is a small positive constant. ETH posits that the satisfiability problem with at most three variables per clause cannot be solved in 2^{o(p+q)} time, where p and q denote the number of clauses and variables, respectively, in the problem.

If you are interested in these problems, then see the tree-width and minimum fill-in challenges. The minimum tree-width is related to the shortest elimination tree over all elimination orderings of an undirected graph, which one of us had proved to be NP-complete [3].


  1. John R. Gilbert, A note on the NP-completeness of vertex elimination on directed graphs, SIAM Journal on Algebraic and Discrete Methods, 1 (1980), pp. 292–294 (link).
  2. Assaf Natanzon, Ron Shamir, and Roded Sharan, A polynomial approximation for the minimum fill-in problem, SIAM Journal on Computing, 30, (2000), pp. 1067–1079 (link).
  3. Alex Pothen, The complexity of optimal elimination trees,
    Tech. Report, Department of Computer Science, Penn State University, 1988 (link).
  4. Donald J. Rose and Robert E. Tarjan, Algorithmic aspects of vertex elimination in directed graphs, SIAM Journal on Applied Mathematics, 34 (1978), pp. 176–197 (link).
  5. Donald J. Rose, Robert E. Tarjan, and George S. Lueker, Algorithmic aspects of vertex elimination on graphs, SIAM Journal on Computing, 5 (1976), pp. 266–283 (link).
  6. Mihalis Yannakakis, Computing the minimum fill-in is NP-complete, SIAM Journal on Algebraic and Discrete Methods, 2 (1981), pp. 77–79 (link).

After CSC’16 (cont’): The best paper

Now that the Proceedings of the CSC Workshop 2016 has appeared (link), we can look at the papers. In this post, we are going to look at the best paper by Manne, Naim, Lerring, and Halappanavar, “On stable marriages and greedy matchings” (doi).

As noted in the citation of the award, this paper brings together  recent half-approximation algorithms  for  weighted matchings  and the classical  Gale-Shapley and McVitie-Wilson algorithms for the Stable Marriage problem. This is helpful in two ways.

First, the authors show that a  recent half-approximation algorithm for computing greedy matchings, the Suitor algorithm and its variants,  is equivalent to  the classical algorithms for the Stable Marriage problem. This correspondence enables the authors to propose multi-core and many-core parallel algorithms for the Stable Marriage problem, based on the greedy matching algorithms. This way, if I am not mistaken, they develop the first GPU algorithm for the Stable Marriage problem.

Second, the extensive theoretical work on the Stable Marriage algorithms  explains the  behavior of the Suitor matching algorithm. The worst-case number of proposals in the Suitor algorithm can be O(n^2),  where n is the number of vertices  in the bipartite graph. However, if the weights are assigned randomly, the expected number of proposals is O(n \log n), which follows from a classical analysis of Donald Knuth [1, 2].

This work represents a nice marriage between theory and practice: the practical algorithms for the greedy matching help in designing parallel algorithms for the Stable Marriage problem, and the theoretical understanding of the Stable Marriage problem sheds light into the behavior of the greedy weighted matching algorithms.


Stable marriage problem: This is described on a full bipartite graph G=(L\cup R, L\times R).  Each vertex on the left has a total ranking of the vertices on the right; similarly each vertex on the right has a total ranking of all vertices on the left. The aim is to find a matching M such that no l\in L and r\in R  would obtain a higher ranked partner if they were to abandon their current partners in M and rematch with each other.

Greedy matching algorithm: Here we compute approximations to matchings of maximum  weight in weighted graphs. The Greedy algorithm  considers edges in a non-increasing order of weights and the heaviest remaining edge (u,v) is added to the matching, whereupon all edges incident on u and v are removed. The Greedy matching is a half-approximate matching.

Suitor algorithm: This is a proposal based algorithm [3]. Vertices can propose in any order, however, each vertex proposes to a neighbor with the  heaviest weight  that already does not have a better weight offer to match with it.  A vertex could annul the proposal received by a neighbor if it has a better weight edge to offer the neighbor.  When two vertices propose to each other, they are matched. The Suitor algorithm computes the same matching as the one obtained by the Greedy algorithm and the Locally Dominant edge algorithm described by Robert Preis [4]!


  1. Vicki Knoblauch, Marriage matching: A conjecture of Donald Knuth, Economics Working Papers. Paper 200715, http://digitalcommons.uconn.edu/econ_wpapers/200715, 2007.
  2. Donald E. Knuth, Stable Marriage and its Relation to Other Combinatorial Problems, CRM Proceedings and Lecture Notes, Vol. 10 American Mathematical Society, (1997).
  3. Fredrik Manne and Mahantesh Halappanavar, New effective multithreaded matching algorithms, in Proc. IPDPS 2014, IEEE 28th International Parallel and Distributed Processing Symposium, Phoenix, AZ, USA, pp. 519–528, 2014.
  4. Robert Preis, Linear time 1/2-approximation algorithm for maximum weighted matching in general graphs, in Proc. STACS 99, 16th Annual Symposium on Theoretical Aspects of Computer Science Trier, Germany, pp. 259–269, 1999.

After CSC16 (cont’)

Umit V. Çatalyürek

Umit V. Çatalyürek presented our work on the directed acyclic graph (DAG) partitioning problem. In this problem, we are given a DAG G=(V,E) and an integer k\geq 2. The aim is to partition the vertex set V into k parts V_1,\ldots, V_k in such a way that the parts have (almost) equal weight and the sum of the costs of all those arcs having their endpoint in different parts minimized. Vertices can have weights, and the edges can have costs. Up to now, all is standard. What is not standard is that the quotient graph of the parts should be acyclic. In other words, the directed graph G'=(V', E'), where V'=\{V_1,\ldots,V_k\} and V_i\rightarrow V_j\in E' iff v_i\rightarrow v_j\in E for some v_i\in V_i and v_j\in V_j, should be acyclic.

John R. Gilbert

John R. Gilbert wanted to understand the complexity of the problem, with respect to the undirected version. He is an expert on the subject matter (see, e.g., [2]). He asked what happens if we orient the edges of the n\times n model problem. If you are not familiar with this jargon, it is the n\times n mesh with each node being connected to its immediate neighbor in the four main directions, if those neighbors exist. See the small image for an 8\times 8 example.

8\times 8 model problem.

Partitioning these kind of meshes is a very hard problem. Gary Miller had mentioned their optimal partitioning in his invited talk (about something else). Rob Bisseling [Ch. 4, 1] has a great text about partitioning these meshes and their cousins in 3D. I had surveyed known methods in a paper with Anaël Grandjean [3]. In particular, Anaël found about discrete isoperimetric problems, showed that the shape of an optimal partition at a corner, or inside the mesh was discussed in the related literature. He also showed that the Cartesian partitioning is optimal for edge cut.  Anaël also proposed efficient heuristics which produced connected components. See the cited papers for nice images. Our were based on our earlier work with Umit [4].

Anyhow,  let’s return back to acyclic partitioning of DAGs, and John’s question. He suggested that we should look at the electrical spiral heater to get an orientation. This orientation results in a total order of the vertices. The figures below show the ordering of the 8\times 8 and the 16\times 16 meshes. Only some of the edges are shown; all edges of the mesh, including those that are not shown are from the lower numbered vertex to the higher numbered one.

As seen in the figures above, the spiral ordering is a total order and there is only one way to cut the meshes into two parts with the same number of vertices; blue and red show the two parts.

Theorem: Consider the n\times n mesh whose edges are oriented following the electrical spiral heater ordering. The unique acyclic cut with n^2/2 vertices in each side has n^2-4n+3 edges in cut, for n\geq 8.

The theorem can be proved by observing that the blue vertices in the border (excluding the corners) has one arc going to a red vertex; those in the interior, except the one labeled n^2/2 has 2 such arcs;  the vertex labeled n^2/2 has three such arcs. The condition n\geq 8 comes from the fact that we assumed that there are blue vertices in the interior of the mesh. This is a lot of edges to cut!


Later, John said that he visited the Maxwell Museum of Anthropology at UNM after the CSC16 workshop, and saw that similar designs by the original native New Mexicans.


  1. Rob H. Bisseling, Parallel Scientific Computation: A Structured Approach using BSP and MPI, 1st ed, Oxford University Press, 2004.
  2. John R. Gilbert, Some nested dissection order is nearly optimal. Inf. Process. Lett. 26(6): 325–328 (1988).
  3. Anaël Grandjean and Bora Uçar, On partitioning two dimensional finite difference meshes for distributed memory parallel computers. PDP 2014: 9–16.
  4. Bora Uçar and Umit V. Çatalyürek, On the scalability of hypergraph models for sparse matrix partitioning. PDP 2014: 593–600.
  5. Da-Lun Wang and Ping Wang, Discrete isoperimetric problems, SIAM Journal on Applied Mathematics, 32(4):860–870 (1977).