## CSC Mini-symposium at SIAM PP16

SIAM Conference on Parallel Processing for Scientific Computing (PP16) took place in Paris, April 12–15, 2016. This was the first SIAM PP outside the US.

As announced before, we had a mini-symposium on CSC in three sessions. SIAM keeps records of the program, the speakers and the abstracts (here). As the organizers, we thought that we should take this one step ahead and make the pdf’s of the talks available as well, wherever possible.

Here is the list of talks, in the order of speaker line-up.

#### 1. Computational surgery: Visualization with augmented matrices (Alex Pothen)

Authors: Alex Pothen (Purdue University, USA) and Mu Wang (Purdue University, USA)

Abstract. Not yet

Talk: No files yet.

Comments: Alex had a video showcasing the use of the solver, in real-time, updating the mesh and showing the result of the surgery. The video was prepared with the help of professionals. The audience was all captive and silent during the video!

#### 2. Parallel combinatorial algorithms in sparse matrix computation? (Esmond Ng)

Authors: Mathias Jacquelin (Lawrence Berkeley National Laboratory, USA), Esmond Ng (Lawrence Berkeley National Laboratory, USA), Barry Peyton (Dalton State College, USA), Yili Zheng (Lawrence Berkeley National Laboratory, USA), and Kathy Yelick, (Lawrence Berkeley National Laboratory, USA).

Abstract. Combinatorial techniques are used in several phases of sparse matrix computation. For large-scale problems, while numerical phases are often executed in parallel, most of these combinatorial techniques are serial and can become bottlenecks. We are investigating the extent to which some of the combinatorial techniques can be performed in parallel.

Talk: No files yet.

Comments: RCM is discussed as the showcase. I think Aydin and Ariful were also involved (second slide of the talk had this information). Given the group’s experience in distributed memory BFS, it is of surprise that the RCM is implemented based on this. The target was not small-world graphs, neither social network graphs; graphs with large diameters were at the focus. So the parallelization problem is somehow tough. Sorting (by the vertex degrees) is required for a formal RCM (I could not catch the details of which sorting algorithm was used). This step incurred cost and was detrimental to performance. Maybe, in an application one can skip sorting and obtain a variant of RCM (after all, RCM is a heuristic). Also, Esmond pointed that the motivation for this work is that the matrix/graph could be already distributed in another context. Instead of collecting the global data to a central processor, solving it there, and distributing the result back to everyone, one could possibly solve the problem in parallel. In RCM,  pseudo-peripheral nodes are used traditionally. They are again found by BFS. There are recent work for finding the diameters in graphs with a few rounds of BFS. Maybe review this.

#### 3. Parallel graph matching algorithms using matrix algebra (Ariful Azad)

Authors: Ariful Azad (Lawrence Berkeley National Laboratory, USA) and Aydin Buluç (Lawrence Berkeley National Laboratory, USA).

Abstract. We present distributed-memory parallel algorithms for computing matchings in bipartite graphs. We consider both exact and approximate algorithms for cardinality and weighted matching problems. We substitute the asynchronous data access patterns of traditional matching algorithms by a small subset of more structured, bulk-synchronous functions based on matrix algebra. Relying on communication-avoiding algorithms for the underlying matrix-algebric modules, different matching algorithms achieve good speedups on tens of thousands of cores on current supercomputers.

#### 4. On the Birkhoff–von Neumann decomposition (Bora Uçar)

Authors: Michele Benzi (Emory University, USA), Fanny Dufossé (Inria Lille-Nord Europe, France),  Kamer Kaya (Sabanci University, Turkey), Alex Pothen (Purdue University, USA), and Bora Uçar (CNRS and ENS Lyon, France).

Abstract. Not yet

Talk: (boraUcar-pp16) file.

#### 5. A Partitioning problem for load balancing and reducing communication from the field of quantum chemistry (Edmond Chow)

Authors: Edmond Chow (Georgia Institute of Technology, USA)

Abstract. We present a combinatorial problem and potential solutions arising in parallel computational chemistry. The Hartree-Fock (HF) method has a very complex data access pattern. Much research has been devoted over 20 years for parallelizing this important method, based primarily on intuition and experience. A formal approach for parallelizing HF while reducing communication may come from graph and hypergraph partitioning. Besides providing a potential solution, this approach may also shed light on the optimality of existing approaches.

Talk: (edmondChow-pp16) file.

#### 6. Community detection on GPU (Fredrik Manne)

Authors: Md Naim (University of Bergen, Norway) and Fredrik Manne (University of Bergen, Norway,)

Abstract. There has been considerable interest in community detection for finding the modularity structure in real world data. Such data sets can arise from social networks as well as various scientific domains. The Louvain method is one popular method for this problem as it is simple and fast. It can also be used to detect hierarchical structures in the data. However, its inherently sequential nature and cache unfriendly workloads makes it difficult to parallelize. This is particularly true for co-processor architectures. In this work we show how these obstacles can be overcome and present results from implementing the algorithm on a GPU.

Talk: (fredrikManne-pp16) file.

Comments: Md Naim could not attend to the conference (dommage), and Fredrik replaced him.

#### 7. Scalable parallel algorithms for de novo assembly of complex genomes (Evangelos Georganas)

Authors: Evangelos Georganas (University of California, Berkeley, USA)

Abstract. A critical problem for computational genomics is the problem of de novo genome assembly: the development of robust scalable methods for transforming short randomly sampled sequences into the contiguous and accurate reconstruction of complex genomes. While advanced methods exist for assembling the small and haploid genomes of prokaryotes, the genomes of eukaryotes are more complex. We address this challenge head on by developing HipMer, an end-to-end high performance de novo assembler designed to scale to massive concurrencies. HipMer employs an efficient Unified Parallel C (UPC) implementation and computes the assembly of the human genome in only 8.4 minutes using 15,360 cores of a Cray XC30 system.

Talk: (evangelosGeorganas-pp16) file.

#### 8. Faster and more scalable sparse matrix-matrix multiplication (Aydin Buluç)

Authors: Aydin Buluç (Lawrence Berkeley National Laboratory, USA)

Abstract. We present a faster and more scalable implementation of the sparse matrix-matrix multiplication (SpGEMM) kernel. The implementation exploits multiple levels of parallelism, using a scalable three-dimensional algorithm for inter-node parallelism and multithreaded subroutines for intra-node parallelism. The three-dimensional formalism has characteristics that are special for the sparse case, which we thoroughly explain. We then provide results on applications in Markov graph clustering and Algebraic Multigrid based graph coarsening.

Talk: (aydinBuluc-pp16) file.

#### 9. Directed graph partitioning (Umit V. Catalyurek)

Authors: Julien Herrmann (The Ohio State University, USA), Umit V. Catalyurek (The Ohio State University, USA), Kamer Kaya (Sabanci University, Turkey), and Bora Uçar (CNRS and ENS Lyon, France).

Abstract. In scientific computing directed graphs are commonly used for modeling dependencies among entities. However, while modeling some of the problems as graph partitioning problems, directionality is generally ignored. Accurate modeling of some of the problems necessitates to take the directionally into account, which adds additional constraints that cannot be easily addressed in the current state-of-the-art partitioning methods and tools. In this talk, we will discuss some example problems, models and potential solution approaches for them.

Talk: (pdf) file.

#### 10. Parallel approximation algorithms for b-Edge Covers and data privacy (Arif Khan)

Authors: Arif Khan (Purdue University, USA), and Alex Pothen (Purdue University, USA).

Abstract. We propose a new 3/2-approximation algorithm, called LSE for computing $b$-Edge Cover and its application to a data privacy problem called adaptive $latex k$-Anonymity. $b$-Edge Cover is a special case of the well-known  Set Multicover problem and also a generalization of Edge Cover problem in graphs. The objective is to choose a subset of $latex C$ edges in the graph with weights on the edges, such that at least a specified number $latex b(v)$ of edges in $latex C$ are incident on each vertex $latex v$ and the sum of edge weights is minimized. We implement the algorithm on serial and shared-memory parallel processors and compare its performance against a collection of inherently sequential approximation algorithms that have been proposed for the Set Multicover problem. With LSE, i) we propose the first shared-memory parallel algorithm for the adaptive $latex k$-Anonymity problem and ii) give new theoretical results regarding privacy guarantees which are significantly stronger than the best known previous results.

Talk: (arifKhan-pp16) file.

#### 11. Clustering sparse matrices with information from both numerical values and pattern (Daniel Ruiz)

Authors: Iain S. Duff (Science & Technology Facilities Council, United Kingdom and CERFACS, Toulouse, France), Philip Knight (University of Strathclyde, United Kingdom), Sandrine Mouysset (Université de Toulouse, France), Daniel Ruiz (ENSEEIHT, France), and Bora Uçar (CNRS and ENS Lyon, France).

Abstract. Considering any square fully indecomposable matrix $A$, we can apply a two-sided diagonal scaling to $latex |A|$ to render it into doubly stochastic form. The Perron-Frobenius theorem is a key tool to exploit and we aim to use spectral properties of doubly stochastic matrices to reveal hidden block structure in matrices. We also combine this with classical graph analysis techniques to design partitioning algorithms for large sparse matrices based on both numerical values and pattern information.

Talk: (danielRuiz-pp16) file.

#### 12. Parallel graph coloring on manycore architectures (Mehmet Deveci)

Authors: Mehmet Deveci (Sandia National Laboratories, USA), Erik Boman (Sandia National Laboratories, USA), and Siva Rajamanickam (Sandia National Laboratories, USA).

Abstract. In scientific computing, the problem of finding sets of independent tasks is usually addressed with graph coloring. We study performance portable graph coloring algorithms for many-core architectures. We propose a novel edge-based algorithm and enhancements of the speculative Gebremedhin-Manne algorithm that exploit architectures. We show superior quality and execution time of the proposed algorithms on GPUs and Xeon Phi compared to previous work. We present effects of coloring on applications such as Gauss-Seidel preconditioned solvers.

Talk: (mehmetDeveci-pp16) file.