LIMITS WITH MODELING DATA AND MODELING DATA WITH LIMITS

The modeling of the solubility of amino acids and of purine and pyrimidine bases with a set of sixteen molecular descriptors has been thoroughly analyzed to detect and understand the reasons for anomalies in the description of this property for these two classes of compounds. Unsatisfactory modeling can be ascribed to incomplete collateral data, i.e., to the fact that insufficient data are known about the behavior of these compounds in solution. This is usually because intermolecular forces cannot be modeled. The anomalous modeling can be detected from the rather large values of the standard deviation of the estimates for the whole set of compounds, and from the unsatisfactory modeling of some subsets of these compounds. The detected abnormalities can thus be used (i) to get an idea about weak intermolecular interactions in solution, such as hydration, self-association, and hydrogen bonding, and (ii) to reshape the molecular descriptors by introducing parameters that allow better modeling. This last procedure should be used with care, bearing in mind that the solubility phenomenon is rather complex.

Throughout the present paper we will be concerned with one of the main problems in modeling, which can be phrased as the 'absent col-data' problem, i.e., the failure to model in a satisfactory way the properties or activities of a class of compounds whose collateral data are either missing or incomplete, while the main body of data actually seems complete. Unsatisfactory modeling can be detected at many statistical levels, but in some cases, as in the present one, only a critical analysis of the standard deviation of the estimate, s, reveals that something is 'faulty' with the modeling. This is, in fact, what happens with the modeling of the solubility of amino acids and of purine and pyrimidine bases. The basis descriptors used throughout this study belong to a medium-sized set made up of two subsets: eight molecular connectivity indices and eight recently defined molecular pseudoconnectivity I/E-State indices, which are discussed in detail in the next section. These graph-theoretical molecular indices (Kier & Hall, 1986; Kier & Hall, 1999; Pogliani, 2000, 2001), like many other indices of the same type, are rather insensitive to weak intermolecular interactions. Nevertheless, when the modeling of the solubility of these two classes of compounds is examined in detail with these molecular descriptors, it can help to detect at which level the modeling fails, how consistent the failure is, and what can be done to prevent it.
The solubility of solids is a rather complex process, which is influenced by the enthalpy change on fusion of the pure solute, $\Delta H_{fus}$, and by the melting point of the solute, $T_{fus}$, i.e., $-\ln x = (\Delta H_{fus}/R)(1/T - 1/T_{fus})$ (Atkins, 1990), where x is the mole fraction solubility at temperature T. Other factors, however, such as association or self-association phenomena in solution, which can give rise to supramolecular species, can also influence solubility. The importance of such phenomena can be seen from the hydration numbers, n, of some cations in aqueous solvent: $n(\mathrm{Cs}^{+}) = 6$, $n(\mathrm{K}^{+}) = 7$, $n(\mathrm{Na}^{+}) = 13$, $n(\mathrm{Li}^{+}) = 22$, $n(\mathrm{Cd}^{2+}) = 39$, and $n(\mathrm{Zn}^{2+}) = 44$ (Van der Sluys, 2001). Association and self-association are surely the main, even if not the only, phenomena that influence the solubility of amino acids and of purine and pyrimidine bases. Actually, self-association in solution has been clearly detected for only four purine and pyrimidine bases (Pogliani, 2000a; Pogliani, 1993; Agostini, Bonacchi, Dapporto, Paoli, Fedi & Manzini, 1990; Agostini, Bonacchi, Dapporto, Paoli, Pogliani & Toja, 1994; Nagashima & Suzuki, 1984; Guttman & Higuchi, 1957; Bolton, Guttman & Higuchi, 1957). For all other compounds, similar phenomena can only be indirectly inferred from the irregular characteristics of the modeling, which are useful as long as one remains aware of the pitfalls of circular reasoning. In practice, modeling the solubility of these two classes of compounds is hampered by missing data concerning association and self-association, and even by missing thermodynamic data. If this information were at hand, a full set of supramolecular or semiempirical descriptors could be introduced for the whole set of compounds and used to refine the modeling.
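The ideal-solubility relation quoted above can be evaluated directly. The following is a minimal Python sketch; the function name and the numerical values of the enthalpy of fusion and melting point are hypothetical and serve only to illustrate the formula, not any compound of this study.

```python
import math

R_GAS = 8.314462618  # molar gas constant, J/(mol K)

def ideal_mole_fraction_solubility(dh_fus, t_fus, t):
    """Solve -ln x = (dH_fus / R) * (1/T - 1/T_fus) for x.

    dh_fus: enthalpy of fusion in J/mol; t_fus and t in kelvin.
    """
    return math.exp(-(dh_fus / R_GAS) * (1.0 / t - 1.0 / t_fus))

# Hypothetical solute: dH_fus = 20 kJ/mol, T_fus = 400 K, solution at 298.15 K.
x = ideal_mole_fraction_solubility(20_000.0, 400.0, 298.15)
print(f"ideal mole fraction solubility: {x:.3f}")
```

The relation contains no term for association or self-association in solution, which is exactly the kind of collateral information whose absence is discussed in this paper.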
It cannot be excluded that a wider set of molecular descriptors could achieve better modeling, but the reader is reminded that graph-theoretical molecular descriptors are rather insensitive to weak non-covalent intermolecular interactions at long range and to van der Waals forces at close range. These types of interactions constitute a tremendous challenge not only for chemical graph theory but also for the whole of modern chemistry (Dykstra & Lisy, 2000).

METHOD
The Structure (S)-Property (P) relation is usually approximated by the linear equation $P = c_1 S + c_0 U_0$ (1), where P is the modeled property, $c_1$ and $c_0$ are the regression coefficients, $U_0 \equiv 1$ is the unitary index, and S is any structural descriptor, which can be a molecular connectivity (MC) term, $X = f(\chi)$, a molecular pseudoconnectivity term, $Y = f(\psi)$, or a mixed molecular connectivity-pseudoconnectivity higher-order term, $Z = f(X, Y)$ (Pogliani, 2000c, 2001, 2002). This last term can also have the form $Z = f(X, Y, \beta)$, where β is a basis MC index. The linear relation can also be written as a vector dot product, $P = \mathbf{C} \cdot \mathbf{S}$, where $\mathbf{C} = (c_1, c_0)$ and $\mathbf{S} = (S, U_0)$. To avoid negative calculated P values, which have no biological or physical meaning and can further reduce the quality of the modeling, it is better to use the modulus modeling equation $P = |c_1 S + c_0 U_0|$, where the bars stand for absolute value. This modeling equation normally enhances the description, provided that the experimental activities or properties are all positive. If some experimental activity, A, or property, P, values are negative, then the modulus bars should be omitted and the normal modeling equation should be used. Clearly, any molecular descriptor can be introduced and used for S, such as graph-theoretical descriptors, geometrical descriptors, quantum mechanical descriptors, thermodynamic descriptors, and even more 'ad hoc' descriptors (Kier & Hall, 1986). The basis descriptors of this study will be a set, $\{\beta\} = \{\{\chi\}, \{\psi\}\}$, of basis indices known as the molecular connectivity and pseudoconnectivity indices. With these basis indices more complex S descriptors will be derived. To avoid huge calculation problems the following medium-sized sets of molecular connectivity and pseudoconnectivity indices will be used:

$\{\chi\} = \{D, {}^0\chi, {}^1\chi, \chi_t, D^v, {}^0\chi^v, {}^1\chi^v, \chi_t^v\}$ (2)

$\{\psi\} = \{{}^S\psi_I, {}^0\psi_I, {}^1\psi_I, {}^T\psi_I, {}^S\psi_E, {}^0\psi_E, {}^1\psi_E, {}^T\psi_E\}$ (3)
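The dot-product form of the modeling equation, with and without the modulus, can be sketched in a few lines of Python. The function name `predict` is hypothetical; the coefficient vector is the one reported later in the text for the amino acids (Eq. 18), while the descriptor value 8.0 is an arbitrary illustration chosen so that the plain linear form would give a negative property value.

```python
def predict(coefficients, descriptors, modulus=True):
    """Evaluate P = C . S, optionally as the modulus equation P = |C . S|."""
    value = sum(c * s for c, s in zip(coefficients, descriptors))
    return abs(value) if modulus else value

U0 = 1.0                # unitary index
C = (38.66, -337.8)     # (c1, c0) as reported for the amino acid solubility model
S = (8.0, U0)           # (descriptor value, U0); 8.0 is a made-up descriptor value

print(predict(C, S, modulus=False))  # plain linear form, negative here
print(predict(C, S))                 # modulus form, always non-negative
```

The modulus only changes the result for compounds whose linear estimate is negative, which is why it is appropriate only when all experimental values are positive.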
Basis χ indices are directly based on the $\delta$ and $\delta^v$ connectivity numbers of a graph and a pseudograph, respectively (Kier & Hall, 1986; Pogliani, 2000). Basis ψ indices are, on the other hand, indirectly based on the $\delta$ and $\delta^v$ numbers through the I-State ($\psi_I$ subset) and S-State ($\psi_E$ subset) indices (Kier & Hall, 1999; Pogliani, 2000c, 2001), which are defined in Eqs. (4) and (5). Here, N is the principal quantum number, $\Delta I_{ij} = (I_i - I_j)/r_{ij}^2$, and $r_{ij}$ is the count of atoms in the minimum path separating two atoms i and j, which is equal to the usual graph distance $d_{ij}$ + 1. From the factor $\sum_j \Delta I_{ij}$ it is evident that S incorporates, at the atomic level, information about the influence of the remainder of the molecular environment, and that it can also be negative. These two atom-level indices encode simultaneously the graph and pseudograph representation of a molecule, as they are directly (I) and indirectly (S) based on the $\delta$ and $\delta^v$ numbers of a graph and a pseudograph, respectively. The indices of sets (2) and (3) and of their subsets are formally similar, as can be seen from the following definitions:

$\chi_t = (\delta_1 \cdot \delta_2 \cdot \delta_3 \cdots \delta_N)^{-0.5}$ (12)

${}^T\psi_I = (I_1 \cdot I_2 \cdot I_3 \cdots I_N)^{-0.5}$ (13)

Index $\chi_t$ (and $\chi_t^v$) is the total molecular connectivity index, and has as its ψ counterpart the total molecular pseudoconnectivity index, ${}^T\psi_I$ (and ${}^T\psi_E$). Sums in Eqs. (6)-(9), as well as products in Eqs. (12) and (13), are taken over all the N atoms (vertices in graph terminology) of a molecule. Sums in Eqs. (10) and (11) are over all edges (σ bonds in a molecule) of the chemical graph. By replacing $\delta$ with $\delta^v$ in Eqs. (6), (8), (10), and (12) the subset of valence χ indices $\{D^v, {}^0\chi^v, {}^1\chi^v, \chi_t^v\}$ is obtained. By replacing $I_i$ with $S_i$ in Eqs. (7), (9), (11), and (13) the pseudoconnectivity $\psi_E$ subset $\{{}^S\psi_E, {}^0\psi_E, {}^1\psi_E, {}^T\psi_E\}$ is obtained. Superscripts S and T in ψ indices stand for sum and total; the other superscripts follow the established denomination for χ indices (Kier & Hall, 1986).
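As an illustration of how the basis indices are obtained from a hydrogen-suppressed graph, the following Python sketch computes the sum, zero-order, first-order, and total indices from any set of per-atom values. It assumes that the sum, zero-order, and first-order indices take the standard Kier-Hall forms (sum of the values, sum of their -0.5 powers over atoms, and sum of the -0.5 powers of their products over edges), and that the intrinsic state has the usual form $I = [(2/N)^2 \delta^v + 1]/\delta$; only the total index, Eq. (12), is written out explicitly in the text above. Function names are hypothetical.

```python
import math

def intrinsic_state(n, delta, delta_v):
    """Assumed Kier-Hall intrinsic state: I = [(2/N)^2 * delta_v + 1] / delta."""
    return ((2.0 / n) ** 2 * delta_v + 1.0) / delta

def basis_indices(vertex_values, edges):
    """Sum, zero-order, first-order and total index for per-atom values.

    vertex_values maps atom label -> delta (chi indices), delta_v (valence chi
    indices) or I (psi_I indices); edges lists the sigma bonds as label pairs.
    """
    values = list(vertex_values.values())
    return {
        "sum": sum(values),
        "zero_order": sum(v ** -0.5 for v in values),
        "first_order": sum((vertex_values[i] * vertex_values[j]) ** -0.5
                           for i, j in edges),
        "total": math.prod(values) ** -0.5,
    }

# Isobutane skeleton: central carbon 0 bonded to three methyl carbons.
edges = [(0, 1), (0, 2), (0, 3)]
delta = {0: 3, 1: 1, 2: 1, 3: 1}
chi = basis_indices(delta, edges)

# For a hydrocarbon delta_v equals delta, so the I values follow directly.
i_state = {atom: intrinsic_state(2, d, d) for atom, d in delta.items()}
psi_i = basis_indices(i_state, edges)
print(chi)
print(psi_i)
```

The same function serves for the χ, valence χ, and ψ subsets because only the per-atom values change, which mirrors the formal similarity of the two sets of indices.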
One of the results of the I-S concept (Kier & Hall, 1999) states that $\sum_i S_i = \sum_i I_i$, with the consequence that ${}^S\psi_I = {}^S\psi_E$. In this case set (3) will consist of seven ψ indices only. Now, to avoid negative $S_i$ values for carbon atoms bonded to highly electronegative atoms, which could give rise to imaginary $\psi_E$ values, every $S_i$ value of a class of compounds whose carbon atoms show negative $S_i$ values has been rescaled to the S value of the carbon atom in CF$_4$ (S = -5.5). This is the lowest S value a carbon atom can assume. Inevitably, this rescaling invalidates the cited result of the I-S concept, with the consequence that ${}^S\psi_I \neq {}^S\psi_E$. This rescaling procedure is mandatory for amino acids and for purine and pyrimidine bases. For further information about the influence of the rescaling procedure on the quality of the modeling see Pogliani (2001).
The procedure used to construct the molecular connectivity terms, X = f(χ), and the molecular pseudoconnectivity terms, Y = f(ψ), is a trial-and-error procedure (Pogliani, 2000, 2001). This procedure, which optimizes not only the basis indices but also the optimization parameters, normally either converges quite rapidly or does not work at all. The general form of these terms looks like a rational function, Eq. (14). Here β is a basis index, S = X or Y for β = χ or β = ψ, respectively, and a to d, m to q, and r are optimization parameters that can be negative, zero, or one. In these last two cases the rational function can be condensed into a much simpler form. As can be seen from Eq. (14), the power of each basis index is again optimized, which means that the original power (-1/2, see Eqs. (8)-(13)) loses its restrictive meaning. The method of constructing terms could loosely be called Configuration Interaction of Graph-Type Basis Indices (CI-GTBI) because of its vague resemblance to the quantum method of Configuration Interaction of Molecular Orbitals made up of Gaussian-type basis functions.
Throughout the present study mixed connectivity-pseudoconnectivity terms, Z = f(X, Y), will be derived and used whenever possible. The construction of the higher-level mixed Z terms is performed with the aid of a search procedure, which consists of trying the different mathematical operations that can be used to combine X and Y. For the sake of brevity this search procedure will also be called a trial-and-error search.
The statistical performance of the graph-structural MC invariant, S, is controlled by a quality factor, $Q = r/s$, and by the Fisher ratio, $F = f r^2/[(1 - r^2)\nu]$, where r and s are the correlation coefficient and the standard deviation of the estimates, respectively, $f = n - (\nu + 1)$ is the number of degrees of freedom, ν is the number of variables, and n is the number of data. Parameter Q has no absolute meaning, as it is an 'intra' statistical parameter used to compare the descriptive power of different descriptors for the same property; this property, however, should always be given on the same scale. The F ratio, which has the character of an 'inter' statistical parameter, tells us, even if Q improves, which additional descriptor endangers the statistical quality of the combination. For every invariant S, β, and $U_0$, the fractional utility, $u_k = c_k/s_k$, where $s_k$ is the confidence interval of $c_k$, as well as the average fractional utility, $\langle u \rangle = \sum u_k/(\nu + 1)$, will be given. If the modeling relation is linear, with only one structural descriptor, S, and with $U_0$, then $\langle u \rangle = (u_1 + u_0)/2$. The utility statistic allows the detection of descriptors that give rise to unreliable coefficient values ($c_k$), i.e., coefficients with a high deviation interval ($s_k$). Thus, this statistic gives indirect information about the importance of a descriptor in the modeling equation. The reader should be aware that a specific modeling is always under the control of all of these statistical parameters, and that an improved Q alone is no guarantee of good modeling. To avoid citing the dimensions of the modeled properties every time, each property P should be read as P/P°, where P° is the unitary value of the property. This allows the property P to be read as a pure numerical value (Berberan-Santos & Pogliani, 1999).
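The three statistics defined above are simple enough to be written as a short Python sketch. The function names are hypothetical; the example values (r = 0.955, s = 25, n = 10, one descriptor, utilities 9.1 and 6.6) are those reported later in the text for one of the amino acid subsets, and the results come out close to the reported Q, F, and average utility, with small differences that presumably reflect the rounding of r and s.

```python
def quality_factor(r, s):
    """Q = r / s."""
    return r / s

def fisher_ratio(r, n, nu):
    """F = f * r^2 / [(1 - r^2) * nu], with f = n - (nu + 1) degrees of freedom."""
    f = n - (nu + 1)
    return f * r ** 2 / ((1.0 - r ** 2) * nu)

def fractional_utilities(coefficients, intervals):
    """u_k = c_k / s_k for each regression coefficient."""
    return [c / s_k for c, s_k in zip(coefficients, intervals)]

def average_utility(utilities):
    """<u> = sum(u_k) / (nu + 1); len(utilities) equals nu + 1."""
    return sum(utilities) / len(utilities)

q = quality_factor(0.955, 25.0)
f_ratio = fisher_ratio(0.955, n=10, nu=1)
u_avg = average_utility([9.1, 6.6])
print(round(q, 3), round(f_ratio), round(u_avg, 2))
```

Because Q depends on s, it is only comparable between descriptors of the same property expressed on the same scale, whereas F also penalizes each additional variable through ν.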

RESULTS AND DISCUSSION
Table 1 shows the experimental values of the modeled properties for the amino acids and for the purine and pyrimidine bases. Tables 2 through 5 show the connectivity and pseudoconnectivity values of amino acids and of purine and pyrimidine bases, respectively. Notice that the solubility values are given with the corresponding temperatures, which for the amino acids is 25°C. The temperatures for the purine and pyrimidine bases are given in parentheses beside each solubility value. The original sources for the experimental values are Weast (1984-1985), Lide (1991-1992), Guttman & Higuchi (1957), and Bolton et al. (1957). These sources make no direct mention of experimental errors, but from a comparison of different results for leucine, mentioned in Weast (1984-1985), a 7%-10% error in the solubility values can be assumed. The modeling power of a linear relation with connectivity or pseudoconnectivity terms depends strongly on the quality of the data used to derive the modeling equation. Now, there are cases where the data are not complete, in the sense that solubility values alone are not enough to give a full picture of the solubility problem. To solve the solubility problem of the amino acids and bases, information about their association in solution should be at hand. For some of these compounds (some bases) this information exists, and it uncovers and underlines the importance of this kind of information for all the remaining compounds. For most compounds the association phenomena in solution can only be guessed at from the unsatisfactory modeling, which can be detected at the level of the standard deviation of the estimates, s, which is degraded by the presence of strong outliers. While it makes things easier, throwing away outliers is scientifically unsatisfying, especially if they do not represent any form of experimental error. The solubility of amino acids and of purine and pyrimidine bases has another interesting aspect, in that it is a classic example of how it is possible to derive a modeling equation that works pretty well for the training set of compounds but does not work on chosen subsets of compounds, thus indicating a case of overfitting for the training set.

Solubility of Amino Acids
A detailed analysis of the modeling of the solubility, Sol, of 20 amino acids (no Cys, but with Hyp) at once uncovers four strong outliers: Arg, Ser, Hyp, and Pro. To take care of these outliers a weighting parameter, a, has to be introduced, which weights the reciprocal basis indices (Pogliani, 2000) that have to be used to model this property of amino acids (R = 1/β). Here, a(Pro) = 8, a(Ser, Hyp, Arg) = 2, and a(others) = 1. The rationale for such a choice will be elaborated further throughout this section. The resulting two subsets of suprareciprocal basis indices, $R_S(\chi)$ and $R_S(\psi)$ of Eqs. (14) and (15), represent the best basis descriptors detected up to now for this property. Note that the suprareciprocal descriptors of these two sets can be read as very simple forms of the molecular connectivity and pseudoconnectivity terms.
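A minimal Python sketch of the weighting scheme follows. It assumes that a suprareciprocal index is simply the reciprocal basis index multiplied by a, i.e., a/β, which is the form that appears in the mixed term $Z = [a/{}^0\chi^v + 0.06(a/{}^0\psi_I)]^{0.9}$ reported further on in this section; the total indices, which are divided by a, are not covered here. The function names and the numerical index values are hypothetical, while the a values are those given in the text.

```python
# Weighting (association-like) parameter a for the four outliers; 1 otherwise.
AMINO_ACID_WEIGHT = {"Pro": 8.0, "Ser": 2.0, "Hyp": 2.0, "Arg": 2.0}

def weight(name):
    return AMINO_ACID_WEIGHT.get(name, 1.0)

def suprareciprocal(name, beta):
    """Assumed form a / beta of a suprareciprocal basis index."""
    return weight(name) / beta

def mixed_term(name, chi0_v, psi0_i):
    """Z = [a / 0chi_v + 0.06 * (a / 0psi_I)] ** 0.9, as reported in the text."""
    a = weight(name)
    return (a / chi0_v + 0.06 * (a / psi0_i)) ** 0.9

# Made-up index values, identical for both compounds to isolate the effect of a.
z_gly = mixed_term("Gly", chi0_v=4.0, psi0_i=10.0)
z_pro = mixed_term("Pro", chi0_v=4.0, psi0_i=10.0)
print(z_gly, z_pro)
```

With identical basis indices the weighted term scales as a to the power 0.9, which shows how strongly the inferred a values act on the four outliers.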
Let us look closer at the character of parameter a. As already underlined in another paper (Pogliani, 2000a), the concept of an outlier has a meaning only in the context of a model, and the reasons that give rise to outliers should be determined. Alas, in many cases these reasons are unclear, as experimental data are lacking, and they can only be guessed at from a faulty modeling. Thus, parameter a could be seen as a weighting factor, loosely representing an association parameter. An improved modeling will tell us whether it is an appropriate choice. In practice, this parallels the method that subjectively gives outliers different weights, which asserts that the model is correct but the data need to be adjusted. The fact that the total connectivity indices, $\chi_t$ and $\chi_t^v$, and the total pseudoconnectivity indices, ${}^T\psi_I$ and ${}^T\psi_E$, have to be divided by a, instead of multiplied, resides in their definition: Eqs. (12) and (13) show that their values decrease with the increasing complexity of the chemical graph. The trial-and-error search for the best mixed higher-order connectivity-pseudoconnectivity term for the solubility of amino acids gives the following term and statistical parameters (C is the correlation vector, u is the utility vector of each parameter of the regression, and $Z = [a/{}^0\chi^v + 0.06(a/{}^0\psi_I)]^{0.9}$). From the solubility values of amino acids in Table 1, let us note that the value of the s statistic of our optimal term is not that small. To further enhance this modeling it is better to use the modulus Eq. (18); otherwise the amino acids Tyr and Trp will show negative calculated solubility values,

$Sol(AA) = |38.66 \cdot Z'_{Sol} - 337.8|$ (18)

Use of the modulus equation enhances the description: $Q(S_{calc}/S_{exp}) = 0.04$ and $F(S_{calc}/S_{exp}) = 3980$ become $Q(|S_{calc}|/S_{exp}) = 0.05$ and $F(|S_{calc}|/S_{exp}) = 5308$. With the modulus Eq. (18) the calculated values, $Sol_{calc}$, of Table 6 have been obtained. Table 6 also shows the leave-one-out values, $Sol_{loo}$. Between calculated ($Sol_{calc}$) and leave-one-out ($Sol_{loo}$) values no consistent disagreement can be detected, while a noticeable disagreement can be detected between calculated and experimental solubility values. The column of ratio values, $Ra = Sol_{calc}/Sol$, shows that only twelve solubility values are modeled in a satisfactory way, if an interval between 0.5 and 1.5 is allowed for this ratio.
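The two checks used here, leave-one-out estimates and the ratio of calculated to experimental values, can be sketched in Python as follows. This is only an illustrative reimplementation with an ordinary least-squares line and the modulus applied to each estimate; the function names and the toy data are hypothetical and are not the values of Table 6.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = c1 * x + c0."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    c1 = sxy / sxx
    return c1, mean_y - c1 * mean_x

def leave_one_out(x, y):
    """Estimate each point from a line fitted to all the other points."""
    estimates = []
    for k in range(len(x)):
        c1, c0 = fit_line(x[:k] + x[k + 1:], y[:k] + y[k + 1:])
        estimates.append(abs(c1 * x[k] + c0))  # modulus form
    return estimates

def well_modeled(calculated, experimental, low=0.5, high=1.5):
    """Flag values whose ratio Ra = calc / exp lies within [low, high]."""
    return [low <= c / e <= high for c, e in zip(calculated, experimental)]

# Toy data: descriptor values and 'experimental' property values.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [12.0, 19.0, 33.0, 38.0, 52.0]
c1, c0 = fit_line(x, y)
calc = [abs(c1 * xi + c0) for xi in x]
print(leave_one_out(x, y))
print(well_modeled(calc, y))
```

The ratio test compares each estimate with its own experimental value, so it penalizes errors on small values far more than the leave-one-out estimates or a global statistic do.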
Now, let us examine the influence of the association parameter a on the modeling. This parameter has been inferred to avoid throwing away the strong outliers Arg, Hyp, Pro, and Ser. If we choose a = 1 for Arg, Hyp, Pro, and Ser, both $Z'_{Sol}$ and ${}^0R^v$ become very poor descriptors of the solubility of the amino acids. In the following lines the utility values have not been reported, as the description is bad enough at the level of the remaining statistics. Here ${}^0R^v = 1/{}^0\chi^v$, as a = 1. The best descriptor for the sixteen amino acids with a = 1 is ${}^0R$. If we also exclude Asp, Gln, Lys, Met, Thr, and Tyr from the modeling, leaving a total of n = 10 amino acids (with a = 1), we obtain the following results, where, as for the n = 16 case, ${}^0R$ is the best descriptor:

{$Z'_{Sol}$}: Q = 0.034, F = 66, r = 0.944, s = 28, n = 10, <u> = 7.3, u = (8.1, 6.6)
{${}^0R$}: Q = 0.038, F = 82, r = 0.955, s = 25, n = 10, <u> = 7.8, u = (9.1, 6.6)
{1/M}: Q = 0.033, F = 63, r = 0.942, s = 28, n = 10, <u> = 6.7, u = (7.9, 5.5)

Even here the mixed term $Z'_{Sol}$ is (as for the n = 16 case) a fairly good descriptor, and this underlines the reliability of this term. Note that, up to now, s has always been rather 'unhealthy'. Let us see further how the modeling behaves for the ten amino acids left out, i.e., Tyr, Thr, Met, Lys, Gln, Asp, Ser, Pro, Hyp, and Arg, for which a = 1. Term $Z'_{Sol}$ is here the best descriptor, while 1/M is not a good descriptor, as its s value is unsatisfactory. If we enlarge the search to the modeling of a subclass (I) made up of six amino acids with very different solubility values from among the sixteen amino acids with a = 1, i.e., Gly, Ala, Thr, Asp, Lys, and Tyr, and to a subclass (II) made up of six amino acids with very similar solubility values, i.e., Leu, Asn, Phe, Ile, Gln, and His, we note (i) the good quality of the Z′ term, (ii) the good quality of ${}^0R$, and (iii) the poor results of a/M in modeling subclass (II).
The negative point in the simulation of the solubility of these different subclasses of amino acids arises from the standard deviation of the estimates, s. Not only is it too large for some amino acids with low solubility, but looking at the different subclasses we notice that it is satisfactory, i.e., s = 3-5, only for subclass (II), where the differences in solubility are not dramatic, i.e., Sol(Leu) = 23, Sol(Asn) = 25, Sol(Phe) = 29, Sol(Ile) = 34, Sol(Gln) = 42, Sol(His) = 43. The behavior of s could be explained by assuming the existence in solution of associative phenomena that are not duly taken into account by the weighting parameter a, which is used here for only four amino acids. Thus, our inferred a values are only partially useful. We could always infer a more precise set of a values valid for other amino acids, but the lack of experimental evidence renders such a choice highly questionable.
Before closing this section on the solubility of amino acids, let us note that an attempt to develop semiempirical terms with the $T_{fus}$ of amino acids (see the introduction), following the method outlined by Pogliani (2000a), gives only poor results. Semiempirical terms that include $\Delta H_{fus}$, or $\Delta H_{fus}$ plus $T_{fus}$, cannot be derived, as $\Delta H_{fus}$ is missing for the whole set of amino acids. Here we face a second case of incomplete information. The solubility of amino acids is mainly influenced by rapid association (with the solvent) or self-association phenomena in solution, which suggested the next section on the solubility of twenty-three bases.

Solubility of Purine and Pyrimidine Bases
Before getting into the details of this description it should be noted that some of the original experimental solubility values of these purine and pyrimidine bases are scattered throughout four different publications (Guttman & Higuchi, 1957; Bolton et al., 1957; Agostini et al., 1990, 1994) and in Pogliani (1995).
For this modeling the following suprasquared basis indices have to be introduced, where a(7PTp) = 4, a(1ETb, 7ETp, Cf) = 2, a(7ITp) = 1.5, and a(others) = 1 (Pogliani, 2000); the rationale for this choice is explained in the following lines. The fact that the total connectivity indices, $\chi_t$ and $\chi_t^v$, and the total pseudoindices, ${}^T\psi_I$ and ${}^T\psi_E$, have to be divided, instead of multiplied, by the association parameter a is again due to their definition: their values decrease with increasing complexity of the chemical graph. The presence of such strong outliers as 7PTp, 1ETb, 7ETp, Cf, and 7ITp obliges us to introduce the weighting parameter, a, which has already been introduced for the solubility of amino acids. But things are a little different with purine and pyrimidine bases. Actually, the weighting parameter for the cited outliers really represents an experimental association parameter (Pogliani, 1995; Guttman & Higuchi, 1957; Bolton et al., 1957). It is remarkable that (i) as for the amino acids, the type of descriptor found for the solubility of purine and pyrimidine bases (i.e., suprasquared indices) is similar for both χ and ψ indices, and that (ii) the optimal basis descriptors for the solubility of amino acids and for the solubility of purine and pyrimidine bases are completely different from each other (i.e., suprareciprocal and suprasquared indices and pseudoindices).
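The suprasquared descriptors can be sketched in Python as follows, assuming the forms that appear in the statistics of this section: $(a\beta)^2$ for ordinary indices and for the molar mass, and $(\beta/a)^2$ for the total indices. The function name is hypothetical; the a values are those given in the text, and the molar mass of caffeine (194.19 g/mol) is used only as a convenient illustration.

```python
# Association parameter a for the five bases with a != 1; 1 for all the others.
BASE_WEIGHT = {"7PTp": 4.0, "1ETb": 2.0, "7ETp": 2.0, "Cf": 2.0, "7ITp": 1.5}

def suprasquared(name, beta, total=False):
    """(a * beta)^2 for ordinary indices, (beta / a)^2 for total indices."""
    a = BASE_WEIGHT.get(name, 1.0)
    return (beta / a) ** 2 if total else (a * beta) ** 2

M_CAFFEINE = 194.19  # g/mol
print(suprasquared("Cf", M_CAFFEINE))              # (aM)^2 with a = 2
print(suprasquared("Cf", M_CAFFEINE, total=True))  # form used for total indices
print(suprasquared("Tb", M_CAFFEINE))              # a = 1: plain squared value
```

Squaring makes the descriptor depend on the square of a, so the five weighted compounds are separated from the others much more strongly than in the suprareciprocal case used for the amino acids.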
Even for these bases the following molar mass descriptor is a very good simulator of the solubility, nearly as good as the best suprasquared index, ${}^1S = (a \cdot {}^1\chi)^2$:

{$(aM)^2$}: Q = 0.170, F = 1455, r = 0.993, s = 5.8, n = 23, <u> = 21, u = (38, 4.8)
{${}^1S$}: Q = 0.176, F = 1553, r = 0.993, s = 5.7, n = 23, <u> = 22, u = (39, 4.9)

The statistics of the best molecular pseudoconnectivity suprasquared index, ${}^0S_I = (a \cdot {}^0\psi_I)^2$, are not far from these, while the best two-pseudoindex combination has the following statistical level, where ${}^TS_I = ({}^T\psi_I/a)^2$:

{${}^0S_I$, ${}^TS_I$}: Q = 0.232, F = 1352, r = 0.996, s = 4.3, n = 23, <u> = 21, u = (51, 4.3, 7.4)

While no improved combination is obtained with the empirical descriptor $(aM)^2$, the following homogeneous combination seems to be an optimal descriptor even at the level of the F statistic:

F = 1446, r = 0.997, s = 4.2, n = 23, <u> = 22, u = (53, 4.4, 8)

For sheer curiosity let us now see if we can improve the F value of the {${}^1S$, $S_t$} combination just by algebraically adding its two descriptors:

{$({}^1S + S_t)$}: Q = 0.176, F = 1553, r = 0.993, s = 5.7, n = 23, <u> = 22, u = (39, 4.9)

The artifice of merging two descriptors into one to enhance the F statistic has in fact worsened both r and s (i.e., Q) and brought no improvement in utility. This fact tells us that a CI-GTBI cannot be based on the simple algebraic sum of the best indices and/or pseudoindices, but that it must cover (i) a basis index optimization, (ii) an exponent optimization, and (iii) an optimization of the coefficient of the basis index. The best overall descriptor for the solubility of bases is the CI-GTBI term of Eq. (22). This relation has no absolute value bars, as every calculated solubility value of the bases is positive. Considering that some solubility values are very low, i.e., down to 0.02 for Sol(UA) (see Table 1), this seems to underline the good quality of the mixed higher-order term found, $Z'_{Sol}$, even if its s value (s = 3) is indeed too large in relation to the lowest solubility values. Table 6 shows the solubility values calculated with Eq. (22) and those calculated with the leave-one-out method. The similarity between these two sets of values underlines the low sensitivity of the leave-one-out method in detecting irregular behavior in the simulation of a property. Only a comparison between experimental and calculated values, as we did for amino acids, tells us that the modeling is anomalous. From the ratio of calculated to experimental solubility values, $Ra = Sol_{calc}/Sol$, we note that, if ΔRa = ±0.5 is accepted as a limit for a good simulation, then the solubilities of only thirteen purine and pyrimidine bases are fairly described. The standard deviation of the estimates, s, for the purine and pyrimidine bases is much lower than the s for the amino acids, as s(PP) = 3.0 and s(AA) = 25. Nevertheless, it should not be forgotten that the scale of the solubility values of amino acids goes up to 1600 for Pro (remember that we are dealing with dimensionless P/P° values), and that some solubility values of our bases are as low as 0.02. Thus, things are not at all rosy even for our bases, and even here we will need more data to tell us what is going on in solution for each compound. Owing to the low s value, the simulation of the solubility of purines and pyrimidines seems more homogeneous than the simulation of the solubility of amino acids. In fact, too high a solubility is predicted for seven amino acids, while too low a solubility is predicted for only one. For purines and pyrimidines the spectrum of solubility values is more symmetrical, as six solubility values are too high and four are too low.
Now, let us model some subclasses of these bases, and first of all let us see how the optimal term, Z′, models the entire class of bases when a = 1 for every compound. Let us also look for the best descriptor and for the quality of the molar mass descriptor, $(aM)^2$:

{$Z'_{Sol}$}: Q = 0.004, F = 1.0, r = 0.214, s = 48, n = 23

The very poor quality of these descriptors with a = 1 means that there is no description without supraindices. Let us now eliminate those compounds with a ≠ 1 from the description, i.e., 7PTp, 1ETb, 7ETp, 7ITp, and Cf, and model only those compounds with a = 1. The result is:

{$Z'_{Sol}$}: Q = 0.221, F = 11, r = 0.643, s = 2.9, n = 18
{$M^2$}: Q = 0.11, F = 2.9, r = 0.393, s = 3.5, n = 18

The description has improved compared to the preceding case, especially for the Z′ term, which is now the best descriptor.
Even if the improvement is noteworthy, the description remains unsatisfactory. Interestingly, note the low s value of these new descriptions compared with the preceding case. Let us determine which compounds endanger this last description when the five compounds with a ≠ 1 have been excluded. For the following nine compounds, 1PTb, 1BTb, OA, A, Hypo-X, X, Iso-G, G, and UA, the description improves and begins to be decent compared with the previous cases:

{$Z'_{Sol}$}: Q = 0.273, F = 12, r = 0.799, s = 2.9, n = 9, <u> = 2.9, u = (3.5, 2.3)
{${}^0S^v$}: Q = 0.295, F = 14, r = 0.820, s = 2.8, n = 9, <u> = 2.7, u = (3.8, 1.6)
{$M^2$}: Q = 0.272, F = 12, r = 0.797, s = 2.9, n = 9, <u> = 2.8, u = (3.5, 2.1)

If we delete UA, OA, and 1PTb from this description, it improves, showing that we have detected a further three 'bad' compounds:

{$Z'_{Sol}$}: Q = 2.015, F = 92, r = 0.979, s = 0.5, n = 6, <u> = 7.6, u = (9.6, 5.7)
{$M^2$}: Q = 1.333, F = 41, r = 0.954, s = 0.7, n = 6, <u> = 5.1, u = (6.4, 3.8)

The Z′ term is the best descriptor here, while the squared molar mass enhances its quality but also its gap from $Z'_{Sol}$. Let us see how much the description improves if to these six optimal compounds we now add the five compounds with a ≠ 1, i.e., a(7PTp) = 4, a(1ETb, 7ETp, Cf) = 2, a(7ITp) = 1.5:

{$Z'_{Sol}$}: Q = 0.454, F = 9355, r = 0.9995, s = 2.2, n = 11, <u> = 51, u = (97, 4.7)
{$(aM)^2$}: Q = 0.198, F = 1786, r = 0.997, s = 5.0, n = 11, <u> = 24, u = (42, 5.2)

There is a very interesting improvement in r, F, and the utility, while s and consequently Q worsen. Nevertheless, the modeling of these eleven compounds can be considered good, especially the one achieved with the Z′ term, which exceeds by far the modeling quality of the suprasquared molar mass. Now let us check whether the nine excluded compounds with a = 1 are really the poor ones, together with UA, OA, and 1PTb. The description for the following excluded compounds, Tp, C, 7I8MTp, 7B8MTp, 5MeC, T, 7BTp, U, and Tb, is, in fact, disappointing:

{$Z'_{Sol}$}: Q = 0.115, F = 0.58, r = 0.28, s = 2.4, n = 9
{$M^2$}: Q = 0.005, F = 0.001, r = 0.013, s = 2.5, n = 9

Even here Z′ is the most interesting descriptor, while the squared molar mass is a very bad descriptor. From all these models we can infer that we need further experimental data to achieve a satisfactory modeling for twelve compounds, and we especially need data that explain their behavior in solution. The 'poor behavior' of these twelve compounds disappears when they are combined with the other compounds to give a class of twenty-three compounds. In this case their 'poor behavior' is averaged out by the 'good behavior' of the others. Notice that for most of these descriptions the Z′ term is the optimal or nearly optimal descriptor, performing better than the squared or suprasquared molar mass. It is not even possible to develop a semiempirical term with $T_{fus}$ and $\Delta H_{fus}$ for purine and pyrimidine bases, as the complete set of these values is also missing for these bases.

CONCLUSION
The 'incomplete data' issue in the modeling of the solubility of amino acids and of purine and pyrimidine bases uncovers one of the main problems in QSAR/QSPR studies: the need for additional collateral data on 'nearby' properties to achieve an optimal modeling. An anomalous modeling can normally be uncovered by the large value of the standard deviation of the estimates, s, of the description, which can even be larger than many experimental values. Sometimes the underestimated statistic s is much more efficient than any other kind of statistic (including the leave-one-out method) for detecting 'anomalous' situations. Incomplete information can be of two types: information that is totally missing, as in the case of amino acids, and information that is partially missing, as in the case of the purine and pyrimidine bases. The only way to make up for the missing information is to introduce an undifferentiated weighting parameter. In the case of amino acid solubility this parameter can be freely interpreted as an association constant, on the basis of the experimental results available for a series of purine and pyrimidine bases that were also studied. Even after the introduction of the weighting parameter in the case of some amino acids, and after the introduction of the association constant in the case of some bases, a poor value of the standard deviation of the estimates, s, is detected, underlining the fact that a complete set of data about the behavior of the given compounds in solution is missing. Incomplete information includes not only data on the association phenomena in solution but also data on $\Delta H_{fus}$, the lack of which deprives us of the possibility of building semiempirical terms. Nevertheless, modeling the properties of compounds whose collateral experimental data are either totally or partially missing is always worthwhile. In fact, it not only offers interesting hints about the quality and quantity of the incomplete information, but also suggests the practical possibility of defining supramolecular basis descriptors that can take care of some non-covalent interactions. Clearly, there is here the risk of ending up with circular reasoning of the kind: the model does not work, a new parameter is introduced to make it work, and finally it works. To avoid this, the new parameter (i) should have a clear physical meaning, (ii) should have been detected in at least some cases, and (iii) should be used parsimoniously until further evidence, i.e., new experimental data, is at hand. This study on 'imperfect' information has also shown that the CI-GTBI method is not able to model everything, contrary to what has been suggested by the claim that its terms mimic, or can be mimicked by, random numbers. Apart from the fact that it is not possible to mimic any property whatsoever (Kier & Hall, 1986) with random numbers, such a possibility would deprive the random numbers of their random character, as they are either random or they show trends and therefore are no longer random. Let us end this paper with the wise words of E. T. Bell (Taine, 1964): "Things in the real universe don't all fit together like the pieces of a puzzle".

Table 1. Solubility of amino acids, Sol, in grams per kg of water (T = 25°C); solubility, Sol, of purine and pyrimidine bases in grams per 1000 ml of water at the given temperature (in parentheses).

Table 4. Molecular connectivity indices, χ, for 23 purine and pyrimidine bases*
* For an explanation of the names see the footnote of Table 1.

Data Science Journal, Volume 1, Issue 2, August 2002, 210

Table 6. The experimental (Sol) and calculated ($Sol_{calc}$) solubility values of amino acids, and their calculated solubility with the leave-one-out method ($Sol_{loo}$). The experimental (Sol) and calculated ($Sol_{calc}$) solubility values of bases, and their calculated solubility with the leave-one-out method ($Sol_{loo}$). In parentheses are the corresponding molar mass (M) values. Ra stands for the ratio $Sol_{calc}/Sol$. Also shown are the assumed association a values (see text).