The Fractal Dimension of SAT Formulas

Modern SAT solvers have made remarkable progress in solving industrial instances. Most of their techniques were developed through intensive experimental testing. Recently, there have been some attempts to analyze the structure of these formulas in terms of complex networks, with the long-term aim of explaining the success of these SAT solving techniques, and possibly improving them. We study the fractal dimension of SAT formulas, and show that most industrial families of formulas are self-similar, with a small fractal dimension. We also show that this dimension is not affected by the addition of learnt clauses. We explore how the dimension of a formula, together with other graph properties, can be used to characterize SAT instances. Finally, we give empirical evidence that these graph properties can be used in state-of-the-art portfolios.


Introduction
The SAT community has been able to come up with successful SAT solvers for industrial applications. However, we can still hardly explain why these solvers are so efficient on industrial SAT instances with hundreds of thousands of variables and not on random instances with hundreds of variables. The common wisdom is that the success of modern SAT/CSP solvers is correlated with their ability to exploit the hidden structure of real-world instances [27]. Unfortunately, there is no precise definition of the notion of structure.
At the same time, the complex networks community has produced tools for describing and analyzing the structure of social, biological and communication networks [1], which can explain some interactions in the real world.
Representing SAT instances as graphs, we can use some of the techniques from complex networks to characterize the structure of SAT instances. Recently, some progress has been made in this direction. It is known that many industrial instances have the small-world property [26], exhibit high modularity [5], and have a scale-free structure [3]. In [15], the eigenvector centrality of variables in industrial instances is analyzed, and it is shown to be correlated with some aspects of SAT solvers. For instance, the decision variables selected by SAT solvers are usually the most central variables in the formula. However, how these analyses may help improve the performance of SAT solvers is not known at this stage. The fractal structure of search spaces, and its relation with the performance of randomized search methods, is studied in [13].
The first contribution of this paper is to analyze the existence of self-similarity in industrial SAT instances. The existence of a self-similar structure would mean that after rescaling (replacing groups of nodes at a given distance by a single node, for example), we would observe the same kind of structure. It would also mean that the diameter $d_{\max}$ of the graph grows as $d_{\max} \sim n^{1/d}$, where $d$ is the fractal dimension of the graph, and not as $d_{\max} \sim \log n$, as in random graphs or small-world graphs. Therefore, actions in some part of the graph (like variable instantiation) would not propagate to other parts as fast as in random graphs. Our analysis shows that most industrial formulas are self-similar. Also, the fractal dimension does not change much during the execution of modern SAT solvers.
Studying graph properties of formulas has several direct applications. One of them is the generation of industrial-like random SAT instances. Understanding the structure of industrial instances is a first step towards the development of random instance generators that reproduce the features of industrial instances. Related work in this direction can be found in [4].
Another potential application is to improve portfolio approaches [28,14], which are solutions to the algorithm selection problem [24]. State-of-the-art SAT portfolios compute a set of features of SAT instances in order to select, from a predefined set, the best solver to run on a particular SAT instance. It is reasonable to think that more informative structural features of SAT instances can help to improve portfolios.
The second contribution of this paper is an experimental study that shows how to use graph properties plus the clause/variable ratio in modern state-of-the-art portfolios. The graph properties we use are: the distribution of variable frequencies, the modularity and the fractal dimension of a SAT formula. We show that, using this reduced set of properties, we are able to classify instances into families slightly better than the portfolio SATzilla2012 [29], which currently uses a total of 138 features. We also show that these features could be used as the basis of a portfolio SAT solver, since they give a level of information similar to all SATzilla features together. Let us emphasize that the fractal dimension is crucial in obtaining these results.
The paper proceeds as follows. We introduce the fractal dimension of graphs in Section 2. Then, in Section 3, we analyze whether SAT instances represented as graphs do have a fractal dimension, and the effect of learnt clauses on it. In Section 4, we describe two additional previously studied graph features of SAT instances, the α exponent and the modularity. Section 5 briefly describes portfolio approaches and the set of features currently used. In Section 6 we present some experimental results on the feature-based classification of SAT instances, and we conclude in Section 7. We also include an appendix with the numeric values used in some of the figures.

Fractal Dimension of a Graph
We can define a notion of fractal dimension of a graph following the principle of self-similarity. We will use the definition of box covering by Hausdorff [17].
Definition 1. Given a graph $G$, a box $B$ of size $l$ is a subset of nodes such that the distance between any pair of them is smaller than $l$. We say that a set of boxes covers a graph if every node of the graph is in some box. Let $N(l)$ be the minimum number of boxes of size $l$ required to cover the graph. We say that a graph has the self-similarity property if the function $N(l)$ decreases polynomially, i.e. $N(l) \sim l^{-d}$, for some value $d$. In this case, we call $d$ the dimension of the graph.
Notice that $N(1)$ is equal to the number of nodes of $G$, and $N(d_{\max} + 1) = 1$, where $d_{\max}$ is the diameter of the graph.
Lemma 1. Computing the function $N(l)$ is NP-hard.
Proof: We can reduce the graph coloring problem to the computation of $N(2)$ as follows. Given a graph $G$, let $\overline{G}$, the complement of $G$, be a graph with the same nodes, where any pair of distinct nodes are connected in $\overline{G}$ iff they are not connected in $G$. Boxes of size 2 in $\overline{G}$ are cliques, thus they are sets of nodes of $G$ without an edge between them. Therefore, the minimal number of colors needed to color $G$ is equal to the minimal number of cliques needed to cover $\overline{G}$, i.e. $N(2)$.
There are several efficient algorithms that compute (approximate) upper bounds of $N(l)$ (see [25]). They are called burning algorithms. Following a greedy strategy, at every step they try to select the box that covers (burns) the maximal number of uncovered (unburned) nodes. Although they are polynomial algorithms, we still need some further approximations to make the algorithm of practical use.
First, instead of boxes, we will use circles.

Definition 2.
A circle of radius $r$ and center $c$ is a subset of nodes of $G$ such that the distance between any of them and the node $c$ is smaller than $r$.
Notice that any circle of radius $r$ is a box of size $2r - 1$ (the opposite is in general false) and any box of size $l$ is a circle of radius $l$ (it does not matter what node of the box we use as center). Notice also that every radius $r$ and center $c$ characterizes a unique circle. According to Hausdorff's dimension definition, $N(r) \sim r^{-d}$ also characterizes self-similar graphs of dimension $d$.
Consider now a graph $G$ and a radius $r$. At every step, for every possible node $c$, we could compute the number of unburned nodes covered by the circle of center $c$ and radius $r$, and select the node $c$ that maximizes this number, as proposed in [25]. However, this algorithm is still too costly for our purposes. Instead, we will apply the following strategy. We order the nodes according to their arity: $c_1, \ldots, c_n$ such that $arity(c_i) \geq arity(c_j)$ when $i > j$. Now, for $i = 1$ to $n$, if $c_i$ is still unburned and the circle of center $c_i$ and radius $r$ contains some unburned node, we select this circle. Then, we approximate $N(r)$ as the number of selected circles.
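The strategy just described can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in the experiments: the names `circle` and `approx_n`, and the adjacency-dictionary graph representation, are assumptions made here. We also assume that centers are visited from highest to lowest arity, in line with the greedy burning idea; reverse the sort if the opposite order is intended.

```python
from collections import deque

def circle(adj, center, radius):
    """Nodes whose distance to `center` is strictly smaller than `radius`."""
    dist = {center: 0}
    queue = deque([center])
    while queue:
        u = queue.popleft()
        if dist[u] + 1 >= radius:      # neighbours would fall outside the circle
            continue
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return set(dist)

def approx_n(adj, radius):
    """Greedy upper bound on N(radius): the number of selected circles.

    `adj` maps every node to the set of its neighbours; `radius` must be >= 1.
    """
    # Assumption: visit candidate centers from highest to lowest arity.
    order = sorted(adj, key=lambda v: len(adj[v]), reverse=True)
    burned = set()
    selected = 0
    for c in order:
        if c in burned:                # only unburned nodes become centers
            continue
        burned |= circle(adj, c, radius)
        selected += 1
    return selected

# Path graph 0 - 1 - 2 - 3 - 4 (diameter 4).
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
counts = {r: approx_n(path, r) for r in range(1, 6)}
```

On the path graph, `counts[1]` equals the number of nodes and `counts[5]` equals 1, matching the remarks $N(1) = n$ and $N(d_{\max} + 1) = 1$.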

The Fractal Dimension of SAT Instances
Given a SAT instance, we can build a graph from it. Here, we consider three models. Given a Boolean formula, the Clause-Variable Incidence Graph (CVIG) associated to it is a bipartite graph whose nodes are the set of variables and the set of clauses, with an edge connecting a variable and a clause whenever that variable occurs in that clause. In the Variable Incidence Graph model (VIG), nodes represent variables, and an edge between two nodes indicates the existence of a clause containing both variables. Finally, in the Clause Incidence Graph model (CIG), nodes represent clauses, and an edge between two clauses indicates that they share a negated literal. We can define the weighted version of all three models by assigning weights to the edges, such that the sum of the weights of all edges generated by a clause is equal to one. This way, we compensate for the effect of big clauses $C$, which generate $\binom{|C|}{2}$ edges in the VIG model, and $|C|$ edges in the CVIG model. In this paper we analyze the function $N(r)$ for the graphs obtained from a SAT instance following the VIG and CVIG models. These two functions are denoted $N$ and $N^b$, respectively, and they relate to each other as follows.
Lemma 2. If $N(r) \sim r^{-d}$, then $N^b(r) \sim r^{-d}$. If $N(r) \sim e^{-\beta r}$, then $N^b(r) \sim e^{-\frac{\beta}{2} r}$.
Proof: Notice that, for any formula, given a circle of radius $r$ in the VIG model, using the same center and radius $2r - 1$ we can cover the same variable nodes in the CVIG model. Conversely, given a circle with a clause node $c$ as center and radius $2r + 1$ in the CVIG model, using an adjacent variable node as center and radius $r + 2$ in the VIG model, we cover at least the same variable nodes. Therefore, we have $N^b(2r) \leq N(r) \leq N^b(2r - 2)$, and $N(r) \sim N^b(2r)$. From this asymptotic relation, we can derive the two implications stated in the lemma.
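As an illustration of the VIG and CVIG models, the following sketch builds both (unweighted) graphs from a list of clauses. It assumes clauses are given as lists of non-zero integers in the DIMACS style, where `-3` stands for the negation of variable 3; the function names and the node naming scheme `("v", x)` / `("c", i)` are hypothetical choices made here, and the weighted variants and the CIG model are omitted.

```python
from itertools import combinations

def build_vig(clauses):
    """Variable Incidence Graph: two variables are adjacent when some
    clause contains both of them."""
    adj = {}
    for clause in clauses:
        variables = {abs(lit) for lit in clause}
        for v in variables:
            adj.setdefault(v, set())
        for u, v in combinations(variables, 2):
            adj[u].add(v)
            adj[v].add(u)
    return adj

def build_cvig(clauses):
    """Clause-Variable Incidence Graph: bipartite graph with one node per
    variable and one node per clause, linked when the variable occurs
    in the clause."""
    adj = {}
    for index, clause in enumerate(clauses):
        c_node = ("c", index)
        adj[c_node] = set()
        for lit in clause:
            v_node = ("v", abs(lit))
            adj.setdefault(v_node, set()).add(c_node)
            adj[c_node].add(v_node)
    return adj

# (x1 or not x2 or x3) and (not x1 or x4)
formula = [[1, -2, 3], [-1, 4]]
vig = build_vig(formula)
cvig = build_cvig(formula)
```

In this small example a clause with three variables contributes three edges to each graph, while in general a clause $C$ contributes $\binom{|C|}{2}$ edges to the VIG and $|C|$ edges to the CVIG.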

Dimension versus Diameter
The function $N(r)$ determines the maximal radius $r_{\max}$ of a graph, defined as the minimum radius of a circle covering the whole graph. The maximal radius and the diameter $d_{\max}$ of a graph are also related. From these relations we can conclude the following.
Lemma 3. For self-similar graphs or SAT formulas (where $N(r) \sim r^{-d}$), the diameter grows polynomially, as $d_{\max} \sim n^{1/d}$. In random graphs or SAT formulas (where $N(r) \sim e^{-\beta r}$), the diameter grows logarithmically, as $d_{\max} \sim \frac{\log n}{\beta}$.
Proof: The diameter of a graph and the maximal radius are related as $r_{\max} \leq d_{\max} \leq 2\, r_{\max}$. Notice that $N(1) = n$ is the number of nodes, and $N(r_{\max}) = 1$. Hence, if $N(r) = C\, r^{-d}$, then $r_{\max} = n^{1/d}$, and if $N(r) = C\, e^{-\beta r}$, then $r_{\max} = \frac{\log n}{\beta} + 1$.
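The two values of $r_{\max}$ in the proof follow by fixing the constant $C$ from $N(1) = n$ and then solving $N(r_{\max}) = 1$. The steps can be written out as follows, assuming that $\log$ denotes the natural logarithm:

```latex
\begin{align*}
\text{Polynomial decay:}\quad
  & N(1) = C \cdot 1^{-d} = n \;\Longrightarrow\; C = n, \\
  & N(r_{\max}) = n\, r_{\max}^{-d} = 1 \;\Longrightarrow\; r_{\max} = n^{1/d}. \\[4pt]
\text{Exponential decay:}\quad
  & N(1) = C\, e^{-\beta} = n \;\Longrightarrow\; C = n\, e^{\beta}, \\
  & N(r_{\max}) = n\, e^{\beta (1 - r_{\max})} = 1
    \;\Longrightarrow\; r_{\max} = \frac{\log n}{\beta} + 1.
\end{align*}
```

Combined with $r_{\max} \leq d_{\max} \leq 2\, r_{\max}$, these give the polynomial and logarithmic growth of the diameter stated in Lemma 3.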
The diameter, as well as the typical distance $L$ between nodes, have been widely used in the characterization of graphs. For instance, small world graphs [26] are characterized as those graphs with a small typical distance $L \sim \log n$ and a large clustering coefficient. This definition of small world graphs is quite imprecise, because it is difficult to decide what is a "small" distance and a "large" coefficient. Moreover, the diameter (and the typical distance) of a graph are quite expensive to compute in practice, and very sensitive. For instance, given a graph with $d_{\max} \approx \log n$, if we add a chain of $n' \approx c\, n$ connected nodes (a sequence of implications, in the case of a SAT formula), representing a small fraction of the total number of nodes, the diameter of the graph grows to $c\, n$. However, a simple pre-processing, like unit propagation in the case of SAT formulas, may destroy this chain and make the diameter drop down again. The typical distance is a more stable measure; however, it depends on the size of the graph. This means that, to decide if a graph has high or low typical distance, we have to compare it with the typical distance in a random graph of the same size. In contrast, a quite good approximation of the fractal dimension can be quickly computed and, as it depends on the whole graph, it is quite stable under simple graph (formula) modifications. As we will show in this paper, the fractal dimension of a SAT formula remains quite stable during the solving process (which involves variable instantiation and the addition of learnt clauses). Moreover, the dimension is independent of the size of the graph. Therefore, we advocate the use of the fractal dimension instead of the diameter or the typical distance in the characterization of graphs, search problems or SAT instances.

Experimental Evaluation
We have conducted an exhaustive analysis of the industrial formulas of the SAT Race 2010 and 2012 Competitions, and some 3CNF random formulas.
For the random formulas, in the VIG model, we observe that the function, normalized as $N_{norm}(r) = N(r)/N(1)$, only depends on the clause/variable ratio $m/n$ (and not on the number of variables). Moreover, at the phase transition point $m/n = 4.25$, the function has the form $N_{norm}(r) = e^{-2.3\, r}$, i.e. it decays exponentially with $\beta = 2.3$ (see Figure 1). Hence, $r_{\max} = \frac{\log n}{2.3} + 1$. For instance, for $n = 10^6$ variables, random formulas have a radius $r_{\max} \approx 7$. For bigger values of $m/n$ the decay $\beta$ is bigger. For values $m/n < 4$ the formula usually forms a disconnected graph, and $N(r)$ cannot drop below the number of its connected components. In this case, $N(r)$ decreases smoothly, even though it does not seem to have a polynomial $N(r) \sim r^{-d}$ behavior.
In the CVIG model, we observe the same behavior. However, in this case, $N(r)$ decays exponentially with $\beta = 1.2 \approx 2.3/2$. Hence, the decay is just half of the decay of the VIG model, as we expected according to Lemma 2.
Analyzing industrial instances we observe that, in most cases, all instances of the same family have a very similar normalized function $N_{norm}(r)$. In Figure 1 [...] in self-similar graphs. We also observe that the functions $N_{norm}(r)$ for ACG-15-10p0 and ACG-20-10p1 (which have similar names) are closer to each other than to the other instances of the family. The same happens in other families, like bitverif. Here, three instances are self-similar, and two are not. This suggests that some families are too heterogeneous, and contain encodings of problems of different nature. In Figure 1, we also show the results for the velev family. In this case, the function $N(r)$ decreases very fast (even faster than for random formulas), following an exponential pattern.
We can conclude that, in the SAT Race 2010 Competition, velev, grieu, bioinf and some bitverif instances have an $N(r)$ function with exponential decay, i.e. they are not self-similar, whereas all the remaining instances are self-similar, with dimensions ranging between 2 and 3. In Figure 4 we show the dimension of all instances. Since not all formulas are self-similar, we assign them a pseudo-dimension computed as follows. If $N(r) \sim r^{-d}$, then $\log N(r) \sim -d \cdot \log r$, i.e. the dimension is (minus) the slope of a representation of $N(r)$ vs. $r$ using logarithmic axes. Even if $N(r)$ is not polynomial, we compute the pseudo-dimension as the interpolation, by linear regression, of $\log N(r)$ vs. $\log r$, using the values for $r = 1, \ldots, 5$.
In Table 2 and Figure 4, we present detailed results of the fractal dimensions, $d$ and $d^b$, and the exponential decays, $\beta$ and $\beta^b$, of the VIG and CVIG graphs respectively, on the SAT Race 2010 families and some random instances. These results are presented using the averages for each family and their standard deviations. The values we show are computed by linear regression as described above.
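The regression step can be sketched as follows. This is a minimal illustration using ordinary least squares; the function names and the input format (a dictionary from radius $r$ to $N(r)$) are assumptions made here, and the decay $\beta$ is estimated in the analogous way from $\log N(r)$ vs. $r$.

```python
import math

def _slope(xs, ys):
    """Ordinary least-squares slope of ys against xs."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

def pseudo_dimension(n_values):
    """Minus the slope of log N(r) vs. log r; `n_values` maps r to N(r)."""
    radii = sorted(n_values)
    return -_slope([math.log(r) for r in radii],
                   [math.log(n_values[r]) for r in radii])

def exponential_decay(n_values):
    """Minus the slope of log N(r) vs. r, i.e. an estimate of beta."""
    radii = sorted(n_values)
    return -_slope([float(r) for r in radii],
                   [math.log(n_values[r]) for r in radii])

# Synthetic data for r = 1, ..., 5 (not taken from the experiments).
polynomial = {r: 1000.0 * r ** -2.5 for r in range(1, 6)}
exponential = {r: 1000.0 * math.exp(-2.3 * r) for r in range(1, 6)}
d = pseudo_dimension(polynomial)
beta = exponential_decay(exponential)
```

On these synthetic curves the two estimates recover the exponents used to generate them (2.5 and 2.3) up to floating-point error.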

The Effect of Learning
State-of-the-art SAT solvers, which incorporate Conflict-Driven Clause Learning (CDCL), extend the input formula during their execution by adding clauses learnt from conflicts. Unitary learnt clauses can be propagated without deciding over any variable (i.e., at level 0 of the search tree), simplifying the original formula. Longer learnt clauses establish (explicitly) new relations between variables.
We have conducted some experiments to analyze how the fractal dimension evolves during the execution of the SAT solver. First, we have generated new formulas by adding to the original one the clauses learnt at different depths of the execution (in particular, after $10^2$, $10^3$, $10^4$ and $10^5$ decisions), and propagating the unitary clauses. Then, we have analyzed the fractal dimensions, $d$ and $d^b$, of these new formulas. In Tables 3 and 4 and Figures 2 and 3, we present the values obtained. Columns named $d_x$ and $d^b_x$ represent the fractal dimensions after $x$ decisions, for the VIG and CVIG respectively.
Two different phenomena can be observed. On the one hand, and only in some cases, after a small number of decisions, the fractal dimension slightly decreases (see $d_{10^2}$ and $d^b_{10^2}$ in families mizh, ibm, bioinf and nec). This is due to the learning and propagation of unitary clauses, which simplify the original formula. Notice that this does not happen in random 3CNF formulas, for which no unitary clauses are learnt. On the other hand, the fractal dimension increases as the execution progresses. This is expected, because learnt clauses represent conflicts between subsets of variables. So, a learnt clause establishes new connections between variables, directly (in the VIG) or indirectly through clause nodes (in the CVIG). Therefore, the number of circles needed to cover the whole graph decreases with the addition of new learnt clauses, and hence the fractal dimension becomes higher. In other words, new clauses make the typical distance decrease, and hence the fractal dimension increase. Empirical results support this hypothesis in all the formulas, including random 3CNF.
In a second experiment, we try to quantify this dimension increase. To do that, we have used the same formulas as before, but replacing the set of learnt clauses by the same number of random clauses of the same size. In Tables 3 and 4 and Figures 2 and 3, we present our results. Columns named $d_{x-r}$ and $d^b_{x-r}$ represent the fractal dimensions after $x$ decisions, replacing learnt clauses by random clauses. In the first steps of the execution, the fractal dimension obtained using random clauses is very close to the values obtained using the learnt clauses. However, in further steps of the execution, random clauses produce a significantly higher dimension increase than learnt clauses (except in the velev and grieu families, where it is very similar). This can be explained as follows: initially, the solver pre-processes the formula, quickly finding conflicts and generating short clauses. This is the case of learning and propagating unitary clauses in some instances. Then, it starts its execution choosing variables randomly, because the activity-based heuristic does not yet have enough information to work correctly. This causes the generation of clauses that connect variables randomly, and that have the same effect on the dimension as random clauses. Once the heuristic starts to work, the solver focuses on subsets of local variables. While the values of $d_x$ and $d_{x-r}$ are still very close in random, velev, and grieu instances (i.e. in the instances with higher dimension), $d_{x-r}$ is significantly higher than $d_x$ in the rest of the industrial instances (see $d_{10^4}$, for instance).
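The replacement of learnt clauses by random clauses of the same size can be sketched as below. The text does not specify how the random clauses are drawn, so this sketch assumes, purely for illustration, that variables are chosen uniformly without repetition and that each literal is negated with probability 1/2; the function name and the DIMACS-style integer encoding of literals are also assumptions.

```python
import random

def random_clauses_like(learnt, num_vars, seed=0):
    """For every learnt clause, draw one random clause of the same size
    over variables 1..num_vars (assumed: uniform variables, random signs)."""
    rng = random.Random(seed)
    result = []
    for clause in learnt:
        variables = rng.sample(range(1, num_vars + 1), len(clause))
        result.append([v if rng.random() < 0.5 else -v for v in variables])
    return result

# Two hypothetical learnt clauses over a formula with 10 variables.
learnt = [[1, -2], [3, 4, -5]]
randomized = random_clauses_like(learnt, num_vars=10, seed=1)
```

Appending `randomized` instead of `learnt` to the original formula, and recomputing the dimension of the resulting graph, gives the kind of comparison reported in the $d_{x-r}$ columns.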
We can conclude that CDCL solvers tend to work locally, because conflicts found by the solver concern variables that were already close in the graph. So these conflicts are useful to explicitly show some local restrictions, but they hardly ever connect distant parts of the formula. This strategy seems the most adequate when dealing with formulas with small dimension (big typical distance between variables), like most industrial SAT instances.

Additional Graph Properties
In this section we are going to review two other features of CNF formulas. These characteristics are also related to the corresponding graph features, and they are usually studied in the context of distinguishing random graphs from real networks.
The first feature is the distribution of arities of the nodes in a graph. In the classical random graph model [11], the probability that an edge is chosen is constant. Therefore, the node arities follow a binomial distribution, and most nodes have about the same number of edges. In scale-free graphs, node arities follow a power-law distribution $p(k) \sim k^{-\alpha}$, where usually $2 < \alpha < 3$. These distributions are characterized by a great variability. In recent years it has been observed that many real-world graphs, like some social and metabolic networks, also have a scale-free structure (see [1]).
Similarly, in the context of CNF formulas, instances where variables are selected with a uniform distribution are called random formulas. In them, the number of occurrences of a variable also follows a binomial distribution, and most variables occur about the same number of times. In [3], the distribution of occurrences of variables in industrial formulas (from the SAT competitions) was analyzed. For every instance, they compute the values $f_{real}(k)$, where $f_{real}(k)$ is the number of variables that have a number of occurrences equal to $k$.
They observe that in many industrial formulas $f_{real}(k)$ is close to a power-law distribution $k^{-\alpha}$, where the exponent $\alpha$ ranges between 2 and 3.
This value $\alpha$ can be approximated with the maximum-likelihood method. As the power-law distribution is intended to describe only the tail of the distribution, we can discard some values of $f_{real}(k)$, for small values of $k$. Experimentally, we observe that in most industrial formulas the arities of variables follow a power-law distribution with $\alpha$ ranging from 2 to 3 (see Figure 4). For the rest of the formulas, we compute a pseudo-exponent using the same approximate method, allowing up to 5 values of $f_{real}(k)$ to be discarded, and minimizing the error, measured as the maximal difference between the real and the approximated distributions. The computation of the $\alpha$ exponent is extremely fast.
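A rough sketch of this kind of estimate is given below. It computes $f_{real}(k)$ from a clause list and then applies a standard closed-form approximation of the maximum-likelihood exponent for discrete power laws, $\alpha \approx 1 + n / \sum_i \ln\frac{k_i}{k_{min} - 1/2}$, which is reasonably accurate only when the cutoff $k_{min}$ is not too small. This is a swapped-in textbook estimator, not necessarily the exact procedure used for the reported values; the error-minimizing choice of discarded values is not reproduced, and the function names are assumptions.

```python
import math
from collections import Counter

def f_real(clauses):
    """f_real(k): number of variables that occur exactly k times.
    Clauses are lists of non-zero integers (DIMACS-style literals)."""
    occurrences = Counter(abs(lit) for clause in clauses for lit in clause)
    return Counter(occurrences.values())

def estimate_alpha(frequencies, k_min=1):
    """Approximate maximum-likelihood exponent of the tail k >= k_min."""
    tail = {k: c for k, c in frequencies.items() if k >= k_min and c > 0}
    n = sum(tail.values())
    if n == 0:
        raise ValueError("no variable occurs k_min times or more")
    log_sum = sum(c * math.log(k / (k_min - 0.5)) for k, c in tail.items())
    return 1.0 + n / log_sum

# Tiny formula: x1 occurs three times, x2, x3 and x4 once each.
small = f_real([[1, 2], [1, -3], [1, 4]])

# Synthetic frequencies following k^-2.5 (not experimental data).
synthetic = {k: round(1e6 * k ** -2.5) for k in range(1, 400)}
alpha = estimate_alpha(synthetic, k_min=6)
```

Discarding the smallest values of $k$ corresponds here to raising `k_min`, which mirrors the remark that the power law is meant to describe only the tail of the distribution.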
The second feature to analyze the structure of a SAT instance is the notion of modularity introduced by [19] for detecting the community structure of a graph. This property is defined for a graph and a specific partition of its vertices into communities, and measures the adequacy of the partition in the sense that most of the edges are within a community and few of them connect vertices of distinct communities. The modularity of a graph is then the maximal modularity for all possible partitions of its vertices. Obviously, measured this way, the maximal modularity would be obtained putting all vertices in the same community.
To avoid this problem, [19] defines modularity as the fraction of edges connecting vertices of the same community minus the expected fraction of edges for a random graph with the same number of vertices and edges.
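For a fixed partition, this quantity can be computed directly. The sketch below uses the commonly used Newman-Girvan form $Q = \sum_c \left[ \frac{e_c}{m} - \left( \frac{\deg_c}{2m} \right)^2 \right]$, where $e_c$ is the number of edges inside community $c$, $\deg_c$ the sum of the arities of its nodes and $m$ the total number of edges; in this form the expected fraction is taken over random graphs that preserve node arities, which is an assumption made here about the exact null model. The function name and data layout are also illustrative, and edge weights are ignored.

```python
def modularity(adj, community):
    """Modularity Q of a partition of an undirected, unweighted graph.

    `adj` maps each node to the set of its neighbours;
    `community` maps each node to its community label.
    """
    two_m = sum(len(nbrs) for nbrs in adj.values())   # every edge counted twice
    inside = {}   # per community: twice the number of internal edges
    degree = {}   # per community: sum of node arities
    for u, nbrs in adj.items():
        cu = community[u]
        degree[cu] = degree.get(cu, 0) + len(nbrs)
        for v in nbrs:
            if community[v] == cu:
                inside[cu] = inside.get(cu, 0) + 1
    return sum(inside.get(c, 0) / two_m - (degree[c] / two_m) ** 2
               for c in degree)

# Two triangles joined by the single edge 2 - 3.
graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
         3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
by_triangle = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
all_together = {node: "a" for node in graph}
q_split = modularity(graph, by_triangle)      # 5/14
q_single = modularity(graph, all_together)    # 0
```

The second call illustrates why the correction term is needed: once the expected fraction is subtracted, putting all vertices in the same community no longer gives the maximal value.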
The problem of maximizing the modularity of a graph is NP-hard [8]. As a consequence, most of the modularity-based algorithms proposed in the literature return an approximate lower bound for the modularity (see a survey in [12]). However, the complexity of many of these algorithms makes them inadequate for large graphs (as is the case of industrial SAT instances, viewed as graphs). For this reason, there are algorithms specially designed to deal with large-scale networks, like the greedy algorithms for modularity optimization [18,9], the label propagation-based algorithm [23] and the method based on graph folding [6].
The community structure of SAT formulas was introduced in [5] using the weighted VIG model. Here, we reproduce the analysis for the SAT Race 2010 competition (see Figure 4). We use the folding algorithm [6], which, by relaxing the precision of the computed approximation, runs in a few seconds on most formulas, and in less than 1 minute on all of them.
We could conclude that the typical industrial SAT instance is a formula with a fractal dimension ranging from 2 to 3, where frequencies of variable occurrences follow a power-law distribution with an exponent also ranging from 2 to 3, and with a clear community structure with $Q \approx 0.8$. We think that most SAT solvers are optimized for dealing with this kind of formula.

Portfolio SAT Approaches
From the SAT competitions that have taken place every year since 2002, we have learnt that no solver dominates over all the instances. From a theoretical point of view, this makes sense, since the underlying proof system of SAT solvers is resolution, and it has been shown not to be automatizable [2] (under strong assumptions). A proof system is automatizable if there exists an algorithm that, given an unsatisfiable formula, produces a refutation in time polynomial in the size of the shortest refutation [7]. Therefore, it seems reasonable to have a pool of SAT solvers and, given a SAT instance, try to predict their expected running time in order to choose the best candidate. This is known as the algorithm selection problem, which consists of choosing the best algorithm from a predefined set to run on a problem instance [24]. Algorithm portfolios tackle this problem.
Portfolios have been shown to be very successful in Satisfiability [28,14], Constraint Programming [20], Quantified Boolean Formulas [21], etc. Modern portfolio solvers are an example of how machine learning can help Constraint Programming. Machine learning techniques are used to build the prediction model of the expected running times of a solver on a given instance.
The first successful algorithm portfolio for SAT was exploited by SATzilla 2007 [28]. In this algorithm a regression function is trained to predict the performance of every solver based on the features of an instance. For a new instance, the solver with the best predicted runtime is chosen.
The success of modern SAT/CSP solvers is correlated with their ability to exploit the hidden structure of real-world instances [27]. Therefore, a key element of SAT/CSP portfolios is to carefully select which features identify the underlying structure of the instance. These features correspond to the attributes the learning algorithm will use to build the classifier or predictor. The features must be related to the hardness of solving the instance, since our goal is to predict which solver will be the most efficient for the given instance. Also, their computation must be automatable and have a reasonable cost, since it would not make sense to spend more time computing the features than solving the instance. For example, in the SATzilla version for the SAT competition, the timeout for computing the features is around 90 seconds, while the timeout for solving an instance is around 900 seconds in the SAT Challenge 2012.
With respect to the features to be analyzed, SATzilla2012 identifies a total of 138. The first 90 features, introduced for the original SATzilla, can be categorized as follows: problem size features (1-7), graph based features (8-36), balance features (37-49), proximity to Horn formula features (50-55), DPLL probing features (56-62), LP-based features (63-68) and local search probing features (69-90). The features in the last three categories can be expensive to compute in large instances and therefore, in practice, we cannot use them. The rest of the categories correspond to: clause learning features (91-108), survey propagation (109-126) and timing (127-138).
As we just mentioned, SATzilla uses graph based features. These features are extracted from the CVIG, VIG and CIG representations of a SAT instance as a graph (see Section 3 for definitions of these representations). On these graphs, node degree statistics are computed. In the case of CVIG, variable and clause nodes are analyzed independently. Additionally, diameter statistics and clustering coefficient statistics are computed for the VIG and CIG graphs, respectively. The statistics involve the computation of the mean, variation coefficient, min, max and entropy.

Feature-Based SAT Instance Classification
In order to analyze how good a set of features is for characterizing SAT instances, we conduct an experimental investigation using supervised machine learning techniques. These techniques allow us to build an instance classifier $h$ that, given an instance $x = (x_1, x_2, \ldots, x_m)$ characterized by $m$ computable attributes (in our case the features of a SAT instance), and a finite set of class labels $L = \{\lambda_1, \lambda_2, \ldots, \lambda_k\}$, decides its label $\lambda \in L$. That is, $h(x) = \lambda$. In order to validate the classifier we use cross-validation. One round of cross-validation involves partitioning the set of instances into two complementary subsets, the training set and the test set. The classifier is built with the training set, while the validation is performed with the test set. In our experiments, each round has one instance as test set, and the rest as training set. We have as many rounds as instances.
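The validation scheme just described (one instance held out per round) can be sketched as follows. For brevity, the classifier plugged in here is a plain 1-nearest-neighbour rule on Euclidean distance, used only as a stand-in so that the example is self-contained; it is not the learning algorithm used in our experiments, and the function names and toy feature vectors are hypothetical.

```python
import math

def nearest_neighbour_label(train, x):
    """Label of the training instance closest to x (Euclidean distance).
    `train` is a list of (feature_vector, label) pairs."""
    return min(train, key=lambda item: math.dist(item[0], x))[1]

def leave_one_out_accuracy(instances, classify=nearest_neighbour_label):
    """Fraction of rounds in which the held-out instance gets its own label.
    Each round uses one instance as test set and all the others for training."""
    hits = 0
    for i, (x, label) in enumerate(instances):
        train = instances[:i] + instances[i + 1:]
        if classify(train, x) == label:
            hits += 1
    return hits / len(instances)

# Toy feature vectors (made up for illustration) with family labels.
toy = [((0.0, 0.0), "a"), ((0.1, 0.0), "a"),
       ((1.0, 1.0), "b"), ((1.1, 1.0), "b")]
accuracy = leave_one_out_accuracy(toy)
```

Any function with the same signature as `nearest_neighbour_label` can be passed as `classify`, so a decision-tree learner could be evaluated in exactly the same loop.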
Our set of instances comes from the industrial track of the SAT competitions. Within this track, instances are grouped into families, according to their industrial application area (e.g. hardware verification, cryptography, planning, scheduling, etc.).
We have used the 100 instances of the SAT Race 2010, which are grouped into 17 families. We also tested the 600 instances of the SAT Challenge 2012 (application track), which are grouped into 20 families. In our experiments, we had to face two problems. On the one hand, some families are too wide, in the sense that the family is not specific enough. For example, in the termination family from 2012, different termination problems are considered, and different encodings of the same termination problem appear. Notice that having a different encoding of a problem is enough to substantially alter the performance of a SAT solver. On the other hand, many formulas are so hard that the SATzilla feature-extraction tool crashes while computing the features of some of them. Thus, although our graph properties are computable on them, including these formulas would make the comparison unfair. Therefore, we decided to focus our experimentation on the instances from the 2010 competition, because the mentioned problems were fewer. Even in the SAT Race 2010 set, the problem of computing the SATzilla features arises, and we had to eliminate the two instances of the post family.
First we will present the problem of instance classification into families. In Figure 4, we show the four coordinates of the graph features: exponent $\alpha$, modularity $Q$ and fractal dimensions $d$ and $d^b$, for VIG and CVIG respectively. Instances of each family are plotted with a distinct mark. At first sight, we can see that instances belonging to the same family are usually closer to each other, except for the instances of the bitverif family. Thus, we could conclude that most of the instances are well classified into families by these graph features.
In the first experiment we have conducted, we try to validate the previous hypothesis, doing a cross-validation test on the classifiers of instances into families. For this purpose, we use the supervised learning C4.5 algorithm [22]. This is a classifying method based on decision trees. In Table 1, we present these results.
We built two classifiers: one with the 138 SATzilla features, and another with the α, Q, d and d b features plus the clause-variable ratio m/n. We included m/n into our set of features since this can be a natural indicator of the hardness of the instance. As we can see in Table 1, we obtain comparable results with respect to the SATzilla-based classifier, using only 3 or 4 features. We tested all the possible subsets, but we only present those with a success greater or equal to what we achieve with SATzilla 2012 using the 138 features. It is important to notice that the fractal dimension d b on CVIG appears in the highest ranked subsets, and seems to be better than using the d on VIG. Next, in our second experiment we want to check if our features set could be used as the basis of a portfolio SAT solver. Thus, we will use them to predict which is the best SAT solver for a SAT instance. One of the techniques used in supervised learning is the k-NN (k-nearest-neighbor) method. It consists on selecting for a test instance, the classification of the k nearest training instances. This is the approach used, for instance, in [16]. In our case, we modify this method as follows. Let t s i be the time needed by solver s on SAT instance i, and let d ij be the distance between test instance i and training instance j (computed using the euclidean distance, according to their (normalized) feature values). We can predict the time needed by solver s on an instance i aŝ  These results are still a bit far from the results of the virtual best solver, that would solve 78 instances. However, if we analyze the results in each incorrectly classified instance, we can see that there is not too much room for improvement. For example, one of the diagnosis instances (UTI-20-10p0) is only solved by CryptoMiniSat. However, this solver does not solve any other of the instances of the diagnosis family. 
In contrast, the lingeling solver (the one both meta-solvers choose) is the best at solving the remaining instances of this family, but does not solve UTI-20-10p0. Therefore, any reasonable learning method would fail to select the right solver for this instance, as both meta-solvers do.
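The solver-selection scheme described above can be sketched as follows. This is a minimal illustration, not the exact predictor used in the experiments: weighting neighbours by inverse distance, the value of k, the min-max normalization, and all names in the code (`normalize`, `predict_time`, `select_solver`, the solver labels) are assumptions made for the example.

```python
import math

def normalize(rows):
    """Rescale each feature column to [0, 1] (min-max normalization, an assumed choice)."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    # A constant column has zero range; use 1.0 to avoid dividing by zero.
    spans = [(max(c) - min(c)) or 1.0 for c in cols]
    return [[(v - lo) / sp for v, lo, sp in zip(row, lows, spans)] for row in rows]

def euclidean(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def predict_time(test_features, train_features, train_times, k=3, eps=1e-9):
    """Predict a runtime as the inverse-distance-weighted mean over the k
    nearest training instances (the weighting is an assumption)."""
    neighbours = sorted(
        (euclidean(test_features, f), t)
        for f, t in zip(train_features, train_times)
    )[:k]
    weights = [1.0 / (dist + eps) for dist, _ in neighbours]
    return sum(w * t for w, (_, t) in zip(weights, neighbours)) / sum(weights)

def select_solver(test_features, train_features, times_by_solver, k=3):
    """Return the solver with the smallest predicted runtime."""
    return min(
        times_by_solver,
        key=lambda s: predict_time(test_features, train_features, times_by_solver[s], k),
    )

# Hypothetical data: two features per instance, the last row is the test instance.
raw = [[2.1, 0.8], [2.3, 0.9], [4.0, 0.2], [2.2, 0.85]]
features = normalize(raw)
train, test = features[:3], features[3]
times = {"solver_a": [10.0, 12.0, 900.0], "solver_b": [300.0, 280.0, 5.0]}
print(select_solver(test, train, times, k=2))
```

Predicting a runtime per solver, rather than a class label, lets the portfolio compare solvers directly on the quantity it ultimately cares about.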
Our final experiment uses our features in a state-of-the-art portfolio. We reached out to the IBM team (winner of some tracks in the 2011 and 2012 SAT competitions). Their portfolio is based on hierarchical clustering, conceptually close to decision forests. They kindly ran their portfolio with our 5 features and with SATzilla's 138. Without counting feature computation time, our feature set solves 87.2% of the instances, whereas the 138-feature set solves only 82.7%. Taking feature computation time into account, our features solve 75.8% of the instances, whereas the 138-feature set solves only 42.85%.
We cannot yet explain why these features are so much more powerful for solver selection. However, any classifier is easier to dissect when it is based on 5 features rather than 138.

Conclusions
In this paper we have studied the existence of self-similarity in industrial SAT instances. We can conclude that, in the SAT Race 2010 Competition, the velev, grieu, bioinf and some bitverif instances are not self-similar, whereas all the remaining instances are self-similar, with fractal dimensions ranging between 2 and 3. These fractal dimensions are very small compared with those of random SAT formulas. Fractal dimension is related to typical distances and graph diameter (a small dimension implies large distances and a large diameter). Hence, industrial SAT instances have a large diameter (intuitively, quite long chains of implications are needed to propagate a variable instantiation to other variables).
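To illustrate the relation between dimension and diameter, the following sketch compares the two growth laws mentioned in the introduction: d max ∼ n^(1/d) for a self-similar graph, and d max ∼ log n for a random or small-world graph. The proportionality constants are unknown and are set to 1 here as an assumption, so only the orders of magnitude are meaningful; the function names are hypothetical.

```python
import math

def fractal_diameter_scale(n, d):
    """Order of magnitude of the diameter of a self-similar graph: n^(1/d).
    The proportionality constant is assumed to be 1."""
    return n ** (1.0 / d)

def small_world_diameter_scale(n):
    """Order of magnitude of the diameter of a random or small-world graph: log n.
    The proportionality constant is assumed to be 1."""
    return math.log(n)

# Compare both laws for a few sizes n and the dimensions reported above (2 and 3).
for n in (10**3, 10**5):
    for d in (2, 3):
        print(n, d,
              round(fractal_diameter_scale(n, d), 1),
              round(small_world_diameter_scale(n), 1))
```

For example, with n = 10^5 nodes and d = 2, the first law gives a scale of roughly 316, whereas log n is about 11.5.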
We have studied the evolution of the fractal dimension of SAT formulas during the execution of a solver. In general, the fractal dimension increases when new learnt clauses are added to the formula, except in the first steps of solving some industrial instances, where some unit clauses are learnt. Moreover, this increase is especially abrupt in instances that show exponential decays (for instance, in the grieu family or in random formulas). However, the increase is small compared with the effect of adding random clauses. Therefore, learning does not contribute much to connecting distant parts of the formula, as one might expect.
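The comparison with random clauses can be sketched as follows: for each learnt clause, a random clause of the same size is generated and added to the original formula. This is only an illustration under stated assumptions: clauses are lists of non-zero integers in DIMACS style, variables are drawn uniformly without repetition, and polarities are chosen uniformly. The sampling scheme used in the experiments may differ, and the function name `random_clauses_like` is hypothetical.

```python
import random

def random_clauses_like(learnt_clauses, num_vars, seed=None):
    """Return one random clause per learnt clause, matching each clause's size.

    Assumptions: variables are numbered 1..num_vars, sampled uniformly without
    repetition inside a clause, and each literal is negated with probability 1/2.
    """
    rng = random.Random(seed)
    result = []
    for clause in learnt_clauses:
        variables = rng.sample(range(1, num_vars + 1), len(clause))
        result.append([v if rng.random() < 0.5 else -v for v in variables])
    return result

# Hypothetical toy formula over 4 variables and two learnt clauses.
original = [[1, -2], [2, 3, -4], [-1, 4]]
learnt = [[-2, 3], [1, 3, 4]]

with_learnt = original + learnt
with_random = original + random_clauses_like(learnt, num_vars=4, seed=0)
print(len(with_learnt), len(with_random))
```

Computing the fractal dimension of both extended formulas then gives the learnt-clause value and its random baseline for the same number and size of added clauses.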
We have explored how these graph features, plus the clause-variable ratio, could be used within portfolios to characterize SAT instances. First, we observed that these five features can be used to classify SAT instances into families, comparing favorably with the results obtained with the 138 features from SATzilla2012. Second, we simulated how two hypothetical portfolios would have performed in the SAT Race 2010 Competition using the four features and the 138 features from SATzilla2012, respectively. We observed that they perform similarly. Third, we provided data from a real portfolio that shows the effectiveness of this approach.
As future work, we plan to investigate in more detail how to use structural graph features, such as the fractal dimension, the α exponent or the modularity, to design more efficient single SAT solvers.

Table 3. Evolution of the fractal dimension d of SAT Race 2010 and some random formulas using VIG. d orig stands for the fractal dimension of the original formula. d x stands for the fractal dimension of the new formula generated by adding to the original formula the clauses learnt after x decisions. d x−rand stands for the fractal dimension of a formula generated by adding to the original formula as many random clauses as learnt clauses, and of the same size. Numbers in brackets represent the number of instances that are not yet solved as UNSAT at that depth.

Table 4. Evolution of the fractal dimension d b of SAT Race 2010 and some random formulas using CVIG. d b orig stands for the fractal dimension of the original formula. d b x stands for the fractal dimension of the new formula generated by adding to the original formula the clauses learnt after x decisions. d b x−rand stands for the fractal dimension of a formula generated by adding to the original formula as many random clauses as learnt clauses, and of the same size. Numbers in brackets represent the number of instances that are not yet solved as UNSAT at that depth.