Solve the clustering problem using QAOA

Author: Cooper Midroni cooper@entropicalabs.com

In this notebook we demonstrate a start-to-finish project workflow for using the Quantum Approximate Optimization Algorithm (QAOA) to cluster a simple dataset. Along the way, we explain the major concepts of QAOA and build intuition for how it can be used to solve clustering problems. This notebook steers away from heavy mathematical explanations in favor of a higher-level view of the algorithm's core components. It is mainly geared towards readers who come from computer science rather than physics.

Variational Hybrid Algorithms

We often take for granted the many decades of progress that led to today's widespread use of classical computers. As memory and compute power become ever cheaper thanks to Moore's Law, the pressure to find optimal resource allocations for algorithms shrinks away. Quantum computers, however, are still in their early stages and very much feel this daunting requirement. In response, a family of algorithms known as variational hybrid quantum-classical algorithms was created, built on the notion that quantum resources can be made more useful when partnered with classical routines. The Quantum Approximate Optimization Algorithm (QAOA) belongs to this family of variational hybrid algorithms.

We can infer a lot from merely unpacking this name. The presence of 'variational' tells us these algorithms follow an iterative approach, while 'hybrid' tells us they leverage both quantum and classical computers. In fact, this describes the main flow of the algorithm; all that remains to be answered is when the iteration stops and what information is passed between the two devices.

A visual representation of a generic variational hybrid quantum-classical algorithm.

To answer the question of what, we note that the main goal of QAOA is to optimize a set of parameters, which we denote as \(\vec{\gamma}\) and \(\vec{\beta}\). You'll notice that these symbols are vectors of some length \(n\); we discuss which aspects of our problem determine the value of \(n\) in the second notebook.

\(\vec{\gamma}\) and \(\vec{\beta}\) parameterize a cost function which is evaluated with our Quantum Circuit to produce a cost value. This output value is fed to the optimizer, and is used to determine whether the nudging of our parameters is in a direction of lower cost. We will sometimes call the cost value an expectation value, represented by \(\langle\psi|Cost|\psi\rangle\), which is the expected value of the cost function \(Cost\) over the wave function \(\psi\). If you were caught off guard by the term 'wave function', it is equally effective to think of \(\langle\psi|Cost|\psi\rangle\) as cost in the more traditional machine-learning sense. The Classical Optimizer then returns updated parameters to the quantum circuit for re-evaluation, and the cycle repeats.

When does this algorithm stop? Once a stopping criterion is met, of course. This criterion is often a pre-defined maximum number of iterations, or it is triggered when repeated evaluations land within the same convergence threshold (a tolerance on the cost value within which we consider values inside an \(\epsilon\)-window to be the same). Once this criterion is met, the optimized parameters are returned and used to define the solution.
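To make this loop concrete, here is a minimal sketch of the classical half of such an algorithm, with the quantum evaluation replaced by a toy stand-in (evaluate_cost_on_quantum_device is a hypothetical placeholder, not a real API); the real QAOA version of this loop appears later in the notebook.

import numpy as np
from scipy.optimize import minimize

def evaluate_cost_on_quantum_device(params):
    # Hypothetical placeholder: a real implementation would prepare the parameterized
    # circuit and return the measured expectation value <psi|Cost|psi>.
    gammas, betas = np.split(params, 2)
    return float(np.sum(np.cos(gammas) + np.sin(betas)))  # toy stand-in cost

initial_params = np.random.rand(6)  # initial guesses for (gamma, beta)
res = minimize(evaluate_cost_on_quantum_device, initial_params,
               method='Cobyla', tol=1e-3,        # convergence threshold
               options={"maxiter": 100})         # stopping criterion: iteration cap
print(res.x)  # the optimized parameters define the solution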

A visual representation of QAOA in the format of a variational hybrid algorithm.

The above description should leave you with many questions.

- How does the above process solve a clustering problem?
- How exactly do \(\vec{\gamma}\) and \(\vec{\beta}\) define the solution?
- How do we define a meaningful cost function for our problem?
- What in the world is a wave function?

We hope to answer these and more. For now, if you feel comfortable with the critical vocabulary of QAOA (the bolded words), then you'll be well prepared for the explanations below.

Data Preparation

Now let's get to the fun part! We will import our data and define the problem setting as a highly manicured example for this clustering demo.

The dataset we will be using is the Pokemon dataset, which can be found on Github. In our journey to Catch ’Em All, we will attempt to cluster Pokemon into Legendary and non-Legendary classes.

Import Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Import Data

df = pd.read_csv('./data/pokemon.csv')
df = df.set_index('#') #index pokemon by their ID number
df = df.rename_axis('ID') #rename axis to 'ID' instead of '#'
df = df.loc[~df.index.duplicated(keep='first')] #drop duplicates
df.head()
Name Type 1 Type 2 Total HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
ID
1 Bulbasaur Grass Poison 318 45 49 49 65 65 45 1 False
2 Ivysaur Grass Poison 405 60 62 63 80 80 60 1 False
3 Venusaur Grass Poison 525 80 82 83 100 100 80 1 False
4 Charmander Fire NaN 309 39 52 43 60 50 65 1 False
5 Charmeleon Fire NaN 405 58 64 58 80 65 80 1 False

To avoid the many bells and whistles of later iterations of Pokemon games, we’ll stick to our roots and only consider Pokemon from the first three generations.

df = df.loc[df['Generation']<=3]
df.sample(frac=1).head() #sample the whole dataset (frac=1) to shuffle the arrangement
Name Type 1 Type 2 Total HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
ID
266 Silcoon Bug NaN 205 50 35 55 25 25 15 3 False
185 Sudowoodo Rock NaN 410 70 100 115 30 65 30 2 False
5 Charmeleon Fire NaN 405 58 64 58 80 65 80 1 False
380 Latias Dragon Psychic 600 80 80 90 110 130 110 3 True
56 Mankey Fighting NaN 305 40 80 35 35 45 70 1 False
print('Fraction of Non-Legendary Pokemon: %.2f' %((df.Legendary.count()-df.Legendary.sum())/df.Legendary.count()))
print('Fraction of Legendary Pokemon: %.2f' %((df.Legendary.sum())/df.Legendary.count()))
Fraction of Non-Legendary Pokemon: 0.95
Fraction of Legendary Pokemon: 0.05

We can see that the classes are quite unevenly distributed. To remedy this, we will randomly select 5 Legendary and 5 Non-Legendary Pokemon to act as our samples to be clustered.

legendary = df.loc[df['Legendary'] == True].sample(5)
non_legendary = df.loc[df['Legendary'] == False].sample(5)
pokemon = pd.concat([legendary,non_legendary])

To further simplify the problem, and not worry about the encoding of categorical data, we will only consider numerical values in our clustering of the data.

numerical_columns = ['Total','HP','Attack','Defense','Sp. Atk','Sp. Def','Speed']
labels = pokemon['Legendary']
data = pokemon[numerical_columns].copy()
data.head()
Total HP Attack Defense Sp. Atk Sp. Def Speed
ID
379 580 80 75 150 75 150 50
386 600 50 150 50 150 50 150
243 580 90 85 75 115 100 115
381 600 80 90 80 130 110 110
378 580 80 50 100 100 200 50

We now have a dataset which is ready to be processed, but we may not be exactly clear on what to do with it. For that we must further understand how the QAOA process detailed above is actually used to solve a clustering problem.

The Maxcut Problem

As laid out by Rigetti’s paper on QAOA, there are a number of important steps that we must follow to map the problem of clustering into a format which QAOA can process. Broadly speaking, QAOA solves the MAXCUT problem, in which a graph of \(n\) vertices is separated into two complementary subsets, \(S\) and \(S^{c}\), such that the number of edges between \(S\) and \(S^{c}\) is as large as possible.

A depiction of the maxcut problem, displaying a cut which separates white and black vertices. Image credit: Wikipedia

This problem can be made more sophisticated by adding numerical values as weights to the edges, such that the best solution maximizes the sum of weights which separate \(S\) and \(S^{c}\). This is precisely the approach we take in using MAXCUT to cluster our data.

We let the weight associated with each edge be some notion of distance between the corresponding points. In this way, the sets dictated by our optimal cut, \(S\) and \(S^{c}\), separate the data into binary clusters which are maximally distant (and hence, maximally dissimilar) from one another.
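As a toy illustration of this idea (separate from the QAOA pipeline we build below), a brute-force search over all partitions of four points shows how the maximum-weight cut places distant points in different sets. The distance matrix here is made up for the example.

import itertools
import numpy as np

# A made-up symmetric distance matrix for four points
w = np.array([[0, 5, 1, 4],
              [5, 0, 6, 1],
              [1, 6, 0, 5],
              [4, 1, 5, 0]])

best_cut, best_assignment = -1, None
for bits in itertools.product([0, 1], repeat=4):
    # cut weight = sum of distances between points assigned to different sets
    cut = sum(w[i, j] for i in range(4) for j in range(i + 1, 4) if bits[i] != bits[j])
    if cut > best_cut:
        best_cut, best_assignment = cut, bits

print(best_assignment, best_cut)  # (0, 1, 0, 1) with cut weight 20: distant points end up in different sets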

From our current understanding, we can already begin to formulate some first steps in preparing our data to fit this framework.

We can use the distances_dataset function from entropica_qaoa.utilities to easily turn this set of points into the desired matrix of pairwise distances.

from entropica_qaoa.utilities import distances_dataset

dist = pd.DataFrame(distances_dataset(data.values),
                       index=data.index,columns=data.index)
dist.iloc[0:5, 0:5]
ID 379 386 243 381 378
ID
379 0.000000 206.276513 118.953773 117.260394 79.056942
386 206.276513 0.000000 108.627805 104.880885 220.907220
243 118.953773 108.627805 0.000000 30.000000 128.062485
381 117.260394 104.880885 30.000000 0.000000 122.474487
378 79.056942 220.907220 128.062485 122.474487 0.000000
df.loc[dist.index].head()
Name Type 1 Type 2 Total HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
ID
379 Registeel Steel NaN 580 80 75 150 75 150 50 3 True
386 DeoxysNormal Forme Psychic NaN 600 50 150 50 150 50 150 3 True
243 Raikou Electric NaN 580 90 85 75 115 100 115 2 True
381 Latios Dragon Psychic 600 80 90 80 130 110 110 3 True
378 Regice Ice NaN 580 80 50 100 100 200 50 3 True
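As an optional sanity check, the same matrix can be reproduced with scipy, assuming distances_dataset uses the plain Euclidean metric (which the entries above are consistent with):

from scipy.spatial import distance_matrix

# Cross-check: plain Euclidean pairwise distances computed with scipy
dist_check = pd.DataFrame(distance_matrix(data.values, data.values),
                          index=data.index, columns=data.index)
print(np.allclose(dist_check.values, dist.values))  # True if distances_dataset is Euclidean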

From Maxcut to QUBO

With an understanding of the Maxcut structure which produces our clustered output, we ask ourselves how we can turn what is effectively a graph problem into an optimization problem. The answer is to map our Maxcut interpretation onto a Quadratic Unconstrained Binary Optimization (QUBO) problem. QUBO problems attempt to minimize a quadratic polynomial in binary variables. Luckily, MAXCUT already has a well-known QUBO cost function. This cost function is sophisticated enough to allow our pairwise distances to be meaningfully included, as well as to allow for the inclusion of bias terms on individual samples.

\[Cost=-\sum_{\langle i j\rangle} J_{i j} \sigma_{i} \sigma_{j}-\mu \sum_{j} h_{j} \sigma_{j}\]

To explain the notation:

- \(\sigma_{i}\) is the cluster class (-1 or 1) of sample \(i\)
- \(J_{i j}\) is the distance between sample \(i\) and sample \(j\)
- \(h_{j}\) is a bias term on sample \(j\)
- \(\mu\) is a universal weight applied to all bias terms

By convention, a negative sign is applied to the cost function, as above. In quantum mechanics we would denote this function as \(H(\sigma)\). The symbol \(H\) stands for Hamiltonian, an operator which represents the total energy of the system. For the scope of this notebook, thinking of \(Cost\) as any traditional cost function which we want to minimize will serve us equally well.
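To make the formula concrete, here is a small sketch that evaluates this cost classically for a chosen spin assignment, using our distance matrix as \(J_{ij}\) and no bias terms (\(\mu=0\)). QAOA will search over such assignments for us; this is just for intuition.

def qubo_cost(spins, J, h=None, mu=0.0):
    # Evaluates Cost = -sum_{i<j} J_ij * s_i * s_j - mu * sum_j h_j * s_j
    # for a spin assignment with s_i in {-1, +1}.
    n = len(spins)
    cost = -sum(J[i][j] * spins[i] * spins[j] for i in range(n) for j in range(i + 1, n))
    if h is not None:
        cost -= mu * sum(h[j] * spins[j] for j in range(n))
    return cost

# Example: first five samples in one class, last five in the other
spins = [1] * 5 + [-1] * 5
print(qubo_cost(spins, dist.values))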

From QUBO to a Hamiltonian

Now we must use our data to create the cost function defined above. To make a Hamiltonian that is recognizable by pyQuil, we must use the pyQuil PauliTerm object.

from pyquil.api import WavefunctionSimulator
from pyquil.paulis import PauliSum, PauliTerm

A PauliTerm object can be quadratic or of order one. When quadratic, it represents the relationship between two samples of data. An order-one PauliTerm would be an implementation of a bias term - a cost constraint which only affects one variable. Below we show some basic functionality of the PauliTerm object.

#Constructing a quadratic PauliTerm
i = 3
j = 6
print('Distance between samples %d and %d: %.3f' %(i,j,dist.values[i][j]))
Distance between samples 3 and 6: 433.417

To create the quadratic term we multiply two Paulis together. Each PauliTerm has an accompanying coefficient which is also multiplied. For simplicity’s sake, we include the pairwise distance as a coefficient of one factor, and make the other ‘1.0’.

term1 = PauliTerm("Z",i,dist.values[i][j])
term2 = PauliTerm("Z",j,1.0)
term = term1*term2
print(term)
(433.4166586553867+0j)*Z3*Z6

Feel free to play with the coefficient of term2 to see how it affects the output of the cell.

For those new to quantum computing, you're likely wondering what the purpose of the letter 'Z' is. It indicates that this PauliTerm is a Pauli-Z operator, which takes the eigenvalue +1 or -1 depending on the state of the qubit it acts on.

You may also note that our sample numbers, \(i=3\) and \(j=6\), have found their way into the printed output. Including \(i\) and \(j\) in each PauliTerm tells pyQuil which samples or qubits the operation is applied to. That's right, in the QAOA setup we consider each datapoint to be mapped to a qubit. Thus, the above printed statement actually means "apply a penalty of \(Q\) should sample 3 and sample 6 be in the same class", where \(Q\) is the coefficient of the operator product Z3*Z6. Said in a more quantum-intuitive sense: "apply a penalty of \(Q\) should qubit 3 and qubit 6 both be found in the same spin state (spin up or spin down)".

Thus, as QAOA tries to minimize the cost function, samples 3 and 6 will only appear in the same class if that configuration is optimal. Choosing the weights to be the distances between samples implies that, in a "good" configuration, samples that lie far apart will end up in different classes.
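To see where this "same state implies a penalty" reading comes from, here is a quick numpy check of the eigenvalues of the Z*Z operator (purely illustrative, not part of the QAOA workflow):

import numpy as np

Z = np.diag([1, -1])   # Pauli-Z: eigenvalue +1 for |0>, -1 for |1>
ZZ = np.kron(Z, Z)     # the two-qubit operator Z_i * Z_j

# Diagonal entries are the eigenvalues for the basis states |00>, |01>, |10>, |11>
print(np.diag(ZZ))     # [ 1 -1 -1  1 ] -> +1 when the qubits agree, -1 when they differ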

We can see now that to make the Hamiltonian for our system we must iterate over each distance in our distance matrix, and assign it within a PauliTerm as the interaction strength between the appropriate qubits. We can readily achieve this using the utility function hamiltonian_from_distances.

from entropica_qaoa.utilities import hamiltonian_from_distances

hamiltonian = hamiltonian_from_distances(dist)
print(hamiltonian)
(206.27651344736267+0j)*Z0*Z1 + (118.95377253370319+0j)*Z0*Z2 + (117.26039399558574+0j)*Z0*Z3 + (79.05694150420949+0j)*Z0*Z4 + (429.1270208225066+0j)*Z0*Z5 + (414.1255848169731+0j)*Z0*Z6 + (329.6209944769902+0j)*Z0*Z7 + (365.8551625985343+0j)*Z0*Z8 + (95.06839643120105+0j)*Z0*Z9 + (108.62780491200216+0j)*Z1*Z2 + (104.88088481701516+0j)*Z1*Z3 + (220.90722034374522+0j)*Z1*Z4 + (467.9743582719036+0j)*Z1*Z5 + (451.3867521316947+0j)*Z1*Z6 + (364.0741682679506+0j)*Z1*Z7 + (415.45156155681974+0j)*Z1*Z8 + (159.80613254815975+0j)*Z1*Z9 + (30+0j)*Z2*Z3 + (128.06248474865697+0j)*Z2*Z4 + (425.20583250938597+0j)*Z2*Z5 + (410.91361622608713+0j)*Z2*Z6 + (325.1153641401772+0j)*Z2*Z7 + (368.0353243915589+0j)*Z2*Z8 + (77.5112895003044+0j)*Z2*Z9 + (122.47448713915891+0j)*Z3*Z4 + (447.88391353117385+0j)*Z3*Z5 + (433.4166586553867+0j)*Z3*Z6 + (346.04912945996557+0j)*Z3*Z7 + (389.1015291668744+0j)*Z3*Z8 + (96.42613753542138+0j)*Z3*Z9 + (436.3484845854286+0j)*Z4*Z5 + (425.73465914816+0j)*Z4*Z6 + (340.80786375903944+0j)*Z4*Z7 + (372.96112397943034+0j)*Z4*Z8 + (123.44229421069588+0j)*Z4*Z9 + (30.822070014844883+0j)*Z5*Z6 + (113.35784048754634+0j)*Z5*Z7 + (70+0j)*Z5*Z8 + (362.9572977638003+0j)*Z5*Z9 + (96.17692030835673+0j)*Z6*Z7 + (63.245553203367585+0j)*Z6*Z8 + (347.4881292936494+0j)*Z6*Z9 + (65.95452979136459+0j)*Z7*Z8 + (261.3197275369772+0j)*Z7*Z9 + (303.59183124715327+0j)*Z8*Z9
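For intuition, the utility is doing something along these lines - a sketch of the idea, not the library's actual implementation:

# Rough equivalent: one ZZ term per pair of samples, weighted by their distance
terms = []
n_samples = len(dist)
for i in range(n_samples):
    for j in range(i + 1, n_samples):
        terms.append(PauliTerm("Z", i, dist.values[i][j]) * PauliTerm("Z", j, 1.0))
manual_hamiltonian = PauliSum(terms)
print(manual_hamiltonian)  # should list the same terms as the Hamiltonian printed above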

The above exercise brings up an important limitation to our present QAOA approach. The number of datapoints we are able to use is limited by the number of qubits we have available.

Minimize the Hamiltonian with QAOA

Now that we have mapped the clustering problem to a Hamiltonian, it is time to find the spin class assignments (spin configuration) that minimize our cost function. We do this using the QAOA algorithm. First we need to import the necessary bits and pieces:

# import the necessary QAOA cost functions from entropica_qaoa
from entropica_qaoa.qaoa.cost_function import QAOACostFunctionOnQVM, QAOACostFunctionOnWFSim

# import QAOAParameters
from entropica_qaoa.qaoa.parameters import ExtendedParams

# import an optimizer
from scipy.optimize import minimize

#Some utilities for time tracking and measuring our outcomes.
import time
from math import log
from entropica_qaoa.utilities import cluster_accuracy, max_probability_bitstring

Now we can set up the hyperparameters (problem parameters that remain fixed for this problem instance):

timesteps = 3 # The QAOA p parameter
iters = 500 # Number of classical optimiser iterations
n_qubits = 10 # must equal the number of data points to be clustered
#The hamiltonian is also a hyperparameter

The parameters themselves also need to be chosen. In this QAOA run, we will use ExtendedParams. This parameter class provides the most degrees of freedom for our optimizer to explore the energy landscape. Conversely, it also has the most parameters to optimize, and will thus take longer to converge.

To instantiate this parameter class, we need to pass in three separate lists of angles:

- \(\vec{\beta}\): every timestep requires \(n_{qubits}\) beta rotations, so there are \(n_{qubits}\times timesteps\) beta values.
- \(\vec{\gamma}_{pairs}\): there is a gamma rotation for every two-qubit interaction at every timestep. A simple way to count these is to take the length of your Hamiltonian, minus the number of single-qubit bias terms, multiplied by the number of timesteps.
- \(\vec{\gamma}_{singles}\): there is a single-qubit gamma rotation for each bias term included in the Hamiltonian, again at every timestep.

We generate these lists randomly, as their initial starting values matter little - they will be optimized over hundreds of iterations!

betas = [round(val,1) for val in np.random.rand(timesteps*n_qubits)]
gammas_singles = [round(val,1) for val in np.random.rand(0)] #we don't want any bias terms
gammas_pairs = [round(val,1) for val in np.random.rand(timesteps*len(hamiltonian))]

hyperparameters = (hamiltonian, timesteps)
parameters = (betas, gammas_singles, gammas_pairs)

params = ExtendedParams(hyperparameters, parameters)
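As a quick optional check, the lengths of these lists line up with the counting rules described above:

# Sanity check on the parameter counts
print(len(betas) == timesteps * n_qubits)                 # one beta per qubit per timestep
print(len(gammas_pairs) == timesteps * len(hamiltonian))  # one gamma per pair term per timestep
print(len(gammas_singles) == 0)                           # our Hamiltonian has no bias terms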

Before starting the simulator, make sure you are running Rigetti's QVM and Quil Compiler by executing qvm -S and quilc -S in two separate terminal windows.

Let's begin by running QAOA with \(p=3\) timesteps and a maximum of 500 optimiser iterations.

# Set up the WavefunctionSimulator from pyQuil
sim = WavefunctionSimulator()
cost_function = QAOACostFunctionOnWFSim(hamiltonian,
                                        params=params,
                                        sim=sim,
                                        enable_logging=True)
t0 = time.time()
res = minimize(cost_function, params.raw(), tol=1e-3, method='Cobyla',
               options={"maxiter": iters})
print('Run complete!\n','Runtime:','{:.3f}'.format(time.time()-t0))
Run complete!
 Runtime: 56.446
wave_func = cost_function.get_wavefunction(params.raw())
lowest = max_probability_bitstring(wave_func.probabilities())
true_clusters = [1 if val else 0 for val in labels]
acc = cluster_accuracy(lowest,true_clusters)
True Labels of samples: [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
Lowest QAOA State: [0, 0, 1, 1, 1, 1, 0, 0, 1, 0]
Accuracy of Original State: 60.0 %
Accuracy of Complement State: 40.0 %
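Note that flipping every bit of a maxcut solution describes exactly the same cut, which is why the complement state's accuracy is also reported. A small sketch (using only the variables defined above) to extract the better of the two labellings:

# A state and its bit-flipped complement describe the same cut, so take the better match
matches = sum(a == b for a, b in zip(lowest, true_clusters))
best_accuracy = 100 * max(matches, len(true_clusters) - matches) / len(true_clusters)
print('Best accuracy over the two equivalent labellings: %.1f %%' % best_accuracy)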

We can analyze the optimizer to see whether or not our QAOA run converged. For the full message, run:

print(res)
print('Cost Function Value:', res.fun)
print('Converged?:',res.message)
Cost Function Value: -446.28427408573975
Converged?: Maximum number of function evaluations has been exceeded.

We can see we did not converge. Let's tighten things up by wrapping our QAOA runs in a function and increasing the number of allowed optimiser iterations.

def run_qaoa(hamiltonian, params, timesteps, max_iters, init_state=None):
    cost_function = QAOACostFunctionOnWFSim(hamiltonian,
                                            params=params,
                                            initial_state=init_state)
    res = minimize(cost_function, params.raw(), tol=1e-3, method='Cobyla',
                          options={"maxiter" : max_iters})

    return cost_function.get_wavefunction(params.raw()), res

The cell below will take around 2 to 3 minutes to run:

t0 = time.time()
wave_func, res = run_qaoa(hamiltonian, params, timesteps=3, max_iters=1500)
print('Run complete\n','Runtime:','{:.3f}'.format(time.time()-t0))
Run complete
 Runtime: 159.021
lowest = max_probability_bitstring(wave_func.probabilities())
true_clusters = [1 if val else 0 for val in labels]
acc = cluster_accuracy(lowest,true_clusters)

print('Cost Function Value:', res.fun)
print('Converged?:',res.message)
True Labels of samples: [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
Lowest QAOA State: [1, 0, 1, 0, 0, 0, 1, 1, 1, 1]
Accuracy of Original State: 30.0 %
Accuracy of Complement State: 70.0 %
Cost Function Value: -860.5133756234061
Converged?: Optimization terminated successfully.

You should typically find that increasing the number of allowed iterations gives a more accurate answer. The precise numbers will depend on which Pokemon are randomly selected at the beginning.