Published as: G.J. Kleywegt and T.A. Jones, Model-building and refinement practice, Methods in Enzymology (R.M. Sweet and C.W. Carter Jr., eds.), 277, 208-230 (1997).

© Academic Press, 1997

Good Model-building and Refinement Practice

Gerard J. Kleywegt * and T. Alwyn Jones

Department of Molecular Biology,

Biomedical Centre,

Uppsala University,

Box 590,

S-751 24 Uppsala,

SWEDEN.

* Author to whom correspondence should be addressed.

Running title: Good Model-building and Refinement Practice




INTRODUCTION

An initial model built into an experimental map, or into a poorly phased molecular replacement map, will usually contain many errors. In order to produce an accurate model, it is necessary to carry out crystallographic refinement as well as rebuilding at the graphics display. These steps are carried out in a cyclic process of gradual improvement of the model. Depending on the size of the structure, the automatic (refinement) or the manual (rebuilding) part may be rate-limiting. Refinement programs change the model in order to improve the agreement between observed and calculated structure-factor amplitudes. Many different refinement programs exist (see Kleywegt & Jones [1] for some history), and most contemporary programs use reciprocal-space methods. Due to the limited resolution typically obtained in biomacromolecular crystallography, the relatively scarce experimental data is augmented by chemical information, for instance concerning bond lengths and angles. Rebuilding the model at an interactive graphics workstation is necessary to remove errors that cannot be remedied by the refinement program. Such errors often require a major re-interpretation of parts of the maps which, at present, can only be carried out by intelligent crystallographers at a 3D workstation.

Errors in structures come in different classes of gravity [1-3]. At its worst, all or part of a protein trace may be wrong. Methods have been developed to help identify such models; they make use of our knowledge of protein structure [4,5]. In less severe cases, the local main chain and side chains may be wrong. If we divide the regions making up the first model of a molecule into three classes, good, bad, and ugly, then the perfect refinement would result in a final model in which all atoms fall into a fourth category, excellent. Unfortunately, due primarily to lack of resolution, this situation rarely occurs and we must usually be satisfied with a model that lacks ugly regions, and contains a high percentage of excellent ones. In order to arrive at such a final model, both good refinement and good rebuilding practices are necessary.

In the past few years, software for refinement and rebuilding of crystallographically determined macromolecular structures has become ever more powerful and easy to use [6]. However, better software does not automatically lead to better models [1]. A powerful refinement program can be used to create a model which adequately explains the experimental observations, but it can just as easily be abused to create a model which contains errors and artefacts introduced by fitting the model to errors in the experimental data. The difference between a carefully refined model and an over-fitted model, as measured by the root-mean-square (RMS) distance between corresponding CA atoms, may well exceed 1 Å [1,7,8]. A good model is one which makes sense (e.g., with respect to stereo-chemistry, temperature factors, etc.), adequately explains the experimental data, and uses the smallest number of parameters to achieve this. Good model building and refinement practice aims to produce such a model. It requires the use of appropriate techniques and strategies during both the (re)building and the refinement stages.

Model refinement has been a personalised affair for which laboratories have their preferred strategies, programs, etc. This has resulted in models bearing distinctive features of both the groups concerned and the software used. In this paper we present our own views on how a macromolecule should be refined, and argue that present practices in the community are often far from optimal, especially in cases where only low-resolution data is available [1].

All refinement programs nowadays use empirical restraints or constraints to ensure that a reasonable structure ensues during the refinement steps. This can result in a model with good stereo-chemical properties, but also a model in which molecules related by non-crystallographic symmetry (NCS) are forced to have similar (restrained) or identical (constrained) conformations. Nevertheless, unless special precautions are taken, over-fitting the data (i.e., adjusting the model in a manner which is not warranted by the quantity and/or quality of the experimental data) is almost guaranteed to take place, resulting in a model with a low R-factor, but concomitantly low accuracy. Because of the limited resolution of the diffraction data in a typical macromolecular crystallographic study, the number of experimental observations is often similar to, or even smaller than, the number of parameters in the model, which renders the risk of over-fitting extremely high. "Popular" methods to push the conventional R-factor down include ignoring NCS (otherwise the single most powerful method to reduce the number of degrees of freedom in the model), refining individual temperature factors and modelling alternative conformations at resolutions where this is not warranted, removing data (using resolution and F/sigma(F) cut-offs) and including spurious entities (such as solvent molecules). These methods either reduce the number of experimental observations or increase the number of parameters in the model, and therefore invite the refinement program to fit error terms; sometimes this over-fitting may even mask gross errors in the model [7,8].

In rebuilding, the experimental map (if available) should always be kept, and at each stage one should try to re-interpret it in the light of the current model, and using the current 2Fo-Fc, Fo-Fc and other maps. One should keep in mind the kind of errors that might still be present in the model, and try to locate places in the map that could be the result of such errors. While rebuilding a model, the accumulated knowledge concerning macromolecular structures should be used to locate places in the current model that deviate from our expectations and previous experience (as pertaining to quality of the fit to the map, stereo-chemistry, preferred conformations and environments of residues); such deviations could indicate local errors.

The aim of model-building and refinement should be to construct a model which adequately explains the experimental observations, while making physical, chemical and biological sense. It is a fact of life that low-resolution data can only yield low-resolution models. The refinement process, in particular, should therefore always be tailored for each problem individually, keeping in mind the amount, resolution and quality of the data. This means that at low resolution individual temperature factors and occupancies should not be refined, and that NCS-related copies of the molecule(s) have to be assumed to be identical. Even though it is evident that NCS-related molecules will usually display small differences, one simply does not have the data to prove this at low resolution. A model which enforces strict NCS may be less precise (e.g., with respect to surface side chain details), but due to the much larger ratio of experimental observations to adjustable model parameters, it is likely to be a more accurate description. The distinction between precision and accuracy is an important one, but unfortunately the two concepts are often confused. Precision is related to level of detail, accuracy to how close to the "truth" something is. For instance, the number 4.987453637 is a very precise, but not very accurate, approximation of the number PI; the number 3.14, on the other hand, is a not very precise, but much more accurate estimate. In the case of multiple protein models derived by solution nuclear magnetic resonance (NMR) techniques, a tight clustering means that the structures are precise, but it says nothing about whether or not they are close to the real structure [9,10]. In the case of protein crystallography, a 3 Å structure with individual temperature factors, unrestrained NCS and hundreds of water molecules built in may seem very precise, but it is doubtful whether the atoms on average are even within 1 Å of their actual positions. Similarly, a hydrogen-bonding distance reported as 2.83 Å for a 3 Å structure is quite precise, but not necessarily accurate. Low-resolution data precludes the production of a precise model; it does not, however, prevent one from producing a model which adequately describes the data. In other words, even at low resolution one can build models that are as accurate as the data allows, but only high-resolution data may yield a model which is both accurate and precise. Even high-resolution data alone is no guarantee of an accurate model [1,8]; in addition one must use sensible refinement and rebuilding procedures, and monitor the quality of the model throughout.

Refinement should always start with a small number of degrees of freedom (the "null-hypothesis"). This means, for example, that if NCS is present, it should initially be constrained; the addition of water molecules should be postponed until the protein model is essentially complete and correct; the initial temperature-factor model should be conservative (e.g., by refining only one or two temperature factors per residue).

The model that results from a refinement round should be checked carefully against our expectations based on what we have learned about macromolecules ("quality control"). This process should not be carried out once, prior to writing the paper, but in every cycle. In that way, local errors can be detected more easily, and remedied as they occur. Standard checks should include the ideality of the geometry, the geometry of the main chain (Ramachandran plot), the adherence of the structure to database structures (e.g., the side chain conformations and the peptide oxygen orientations), and of course the fit of model and map (e.g., the component-based real-space R-factor). In addition, one should verify that the structure satisfies various common-sense rules of thumb (such as "NCS-related molecules are very similar", "bonded atoms and NCS-related atoms have similar temperature factors", "peptide bonds are planar", "arginine residues are involved in salt links or hydrogen bonds", etc.). Also, when one builds a new bit of structure, databases can and should be used to construct the main chain, and side chains should be added in one of their preferred rotamer conformations. While rebuilding, high-resolution data may and will reveal a few places in the structure where a side chain does not have a rotamer conformation, or where the peptide oxygen points in a different direction than would be expected on the basis of a comparison with a database. These may be regions of potential biological interest, but liberties should be taken with the model only if the crystallographic data permits it.

In the case of higher resolution data, the number of degrees of freedom can be increased gradually (and with restraint(s)) once the constrained refinement has converged. This means that one may use NCS restraints instead of constraints, add water molecules, refine individual (but restrained) temperature factors, etc. The most useful indicator available today of whether or not the inclusion of additional degrees of freedom actually leads to a better model (i.e., a better description of the experimental data) is Brünger's free R-factor [11-14], the value of which is highly correlated with the mean absolute phase error of the model.



Figure 1. Overview of the structure-building process for crystallographically determined macromolecular structures. The items inside the boxed area are collectively known as one macro-cycle.


Figure 1 shows a schematic view of the refinement and rebuilding process (with NCS present). The "loop" inside the boxed area we refer to as a macro-cycle (i.e., one cycle of map calculations, quality control, rebuilding and refinement). Ten years ago, one would typically go through dozens of such cycles to produce the "final" model. In our experience, good models can now be produced in a fraction of the time (typically, 5 to 10 macro-cycles). For small proteins (up to ~200 residues) at a typical resolution of 2-2.5 Å, one macro-cycle need not take more than one day if adequate computer facilities (access to a 3D graphics workstation and a fast number-crunching workstation) are available. This means that within one to two weeks after the initial model becomes available, either built in a multiple isomorphous replacement (MIR) map or as the result of solving a molecular replacement (MR) problem, the structure can be completely refined. Of course, in the case of larger structures or considerably higher resolution, more time will be required to arrive at the final model.

In the following discussion, use of X-PLOR [15] and O [16] for refinement and rebuilding, respectively, is assumed. Both programs provide powerful tools, available today, to help build good models. We shall discuss these in the sections that follow.


INITIAL MODEL AND DATA QUALITY

Elsewhere in this volume [3] the process of building an initial model into an MIR map using the tools available in O [16] is described.

In the case of an MR problem, one will often need to "mutate" some of the residues (with the Mutate commands in O), build loops with insertions or deletions (using the Baton and Lego commands), and select different side-chain conformations. Since this is mostly a rebuilding problem, most of the tools discussed in the section about this topic can also be used to generate an initial model for refinement.

The success of the refinement and, ultimately, the quality of the final model are critically dependent upon the quality of the crystallographic data, and a number of points should be kept in mind while processing, scaling and merging the data.

Unfortunately, there is no agreed standard in the literature for presenting the quality of the diffraction data and, hence, the effective resolution of the study [21]. Bart Hazes has suggested [personal communication, 1995] the use of an "effective resolution", defined as the resolution at which the actual number of observed, unique reflections would have constituted a 100 % complete dataset. In more than one case, we have found the calculation of this number a sobering experience.
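The arithmetic behind this number is simple. Below is a minimal sketch in Python (our own illustration, not part of any of the programs discussed here), assuming the standard approximation that the number of unique reflections out to resolution d scales with the unit-cell volume, and ignoring systematic absences; all function and variable names are ours:

    from math import pi

    def effective_resolution(n_obs_unique, cell_volume, n_sym):
        # Resolution (in Angstrom) at which n_obs_unique reflections would
        # constitute a 100 % complete dataset. Inverts the approximation
        #   N(d) = 4*pi*V / (3 * 2 * n_sym * d**3)
        # (the factor 2 accounts for Friedel pairs).
        return (2.0 * pi * cell_volume / (3.0 * n_sym * n_obs_unique)) ** (1.0 / 3.0)

    # Example: 20,000 unique reflections, a 300,000 A**3 cell and 4 symmetry
    # operators give an effective resolution of ~2.0 A.
    print(round(effective_resolution(20000, 3.0e5, 4), 2))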

To decide at which shell to cut off the resolution, we nowadays tend to use the following criteria for the highest shell: completeness > 80 %, multiplicity > 2, more than 60 % of the reflections with I > 3 sigma(I), and Rmerge < 40 %. In our opinion, it is better to have a good 1.8 Å structure than a poor 1.637 Å structure. Moreover, over-estimating the resolution of the data is bound to lead to problems with the refinement later (for an example of this, see the Rfree section). In the case of complexes which are isomorphous to a previously solved structure, proper data processing may make the difference between observing fragmented blobs of density or nice, well-connected density for the ligand, substrate or inhibitor. A recent data-processing error in our laboratory illustrates this. Cellobiohydrolase I [22] (CBHI) was crystallised in the presence of a beta-blocker in the hope of obtaining a complex between the two [23]. Crystals were isomorphous to those obtained earlier, and data were collected on our new in-house R-AXIS IIc image plate system, and processed with the R-AXIS software [24] to a resolution of 1.8 Å. It was noted that the data were rather incomplete in the medium resolution shells. Nevertheless, simulated-annealing (SA) refinement using Rfree was carried out and the refinement progressed well. Unfortunately, in the active site only isolated blobs of density were found, and this left at least four possibilities for fitting the ligand. SA refinement did not yield better density for any of these possibilities. The original image plate data were then more carefully and more critically re-processed with DENZO [19] and scaled and merged with programs in the CCP4 package [20]. Reprocessing increased the overall completeness from 78 % to 99 %, and the completeness in the shells between 7.5 and 2.5 Å from ~65-70 % to ~96-100 %. Most important, however, was the appearance of beautiful, well-connected density in the active site. Unfortunately, the density showed unambiguously that not the beta-blocker, but rather beta-octylglucoside (part of the crystallisation solution), had been bound by the protein. In the case of complexes, high-quality data is very important, since one basically wants to obtain structural information about a small molecule using protein-crystallographic techniques. Another example of this is discussed in the section about maps.
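Purely as an illustration of the cut-off criteria quoted at the start of the preceding paragraph, the following Python sketch applies them to per-shell statistics (the shell numbers are invented for the example; in practice they come from the scaling program):

    def shell_acceptable(completeness, multiplicity, frac_i_gt_3sigma, rmerge):
        # The highest-shell criteria quoted in the text.
        return (completeness > 0.80 and multiplicity > 2.0
                and frac_i_gt_3sigma > 0.60 and rmerge < 0.40)

    # (d_min, completeness, multiplicity, fraction I > 3 sigma(I), Rmerge)
    shells = [(2.2, 0.95, 3.5, 0.75, 0.08),
              (2.0, 0.88, 2.8, 0.66, 0.21),
              (1.8, 0.72, 1.9, 0.41, 0.45)]
    d_cut = min(d for d, c, m, f, r in shells if shell_acceptable(c, m, f, r))
    print("suggested high-resolution cut-off: %.1f A" % d_cut)  # -> 2.0 A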


REFINEMENT

In this discussion, we will focus on two aspects. The first is the use of the free R-factor to (a) monitor the success or failure of a refinement step, (b) test alternative hypotheses (e.g., when deciding on the most appropriate temperature-factor model), and (c) detect gross errors. The second aspect is the use of SA refinement, with which we have had very good experiences. Subsequently, we will briefly discuss force fields and dictionaries, and several other issues related to macromolecular structure refinement.

Rfree. Recently, Brünger introduced a cross-validation scheme based on the so-called free R-factor, or Rfree [11-14]. The idea is to set aside a small fraction of the data (the "test set") which is not used in the refinement, but for which an R-factor is nevertheless calculated all the time. Comparing the values of the conventional and free R-factors tells one something about the extent to which the data have been over-fitted, as well as about the quality of model and data. The refinement program will use any degree of freedom it is given to reduce the discrepancy between observed and calculated structure factors. However, since the data is afflicted by error, and since the model is not an exhaustive description of all scattering matter in the crystal (space- and time-averaged), this easily leads to a situation in which the errors are compensated by erroneous changes to the model. Even today, many people have a fixation on low R-factors; only slowly is it beginning to dawn that conventional R-factors can be made arbitrarily low by including more and more degrees of freedom in the model [1]. In the case of photo-active yellow protein, the grossly incorrect initial model of this protein [25] was said to result in part from the power of SA refinement in reducing the conventional R-factor without actually improving the model [26]. In fact, plenty of structures have been refined in the past with a model that contained more adjustable parameters than there were experimental observations [27]. If one uses Rfree to monitor the refinement, however, such over-fitting of the data can be detected: if a refinement step in which more parameters are adjusted than in the previous cycle does not lead to a significant drop in Rfree, one has over-modelled the data.
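For readers unfamiliar with the bookkeeping, a minimal Python sketch of the two statistics and of a random test-set selection follows (using the 5-10 %, at most ~2,000 reflections, partitioning recommended later in this chapter); the names are ours, not those used by X-PLOR:

    import random

    def r_factor(f_obs, f_calc, scale=1.0):
        # R = sum | |Fobs| - k*|Fcalc| | / sum |Fobs| over a set of
        # reflections: the work set gives R, the test set gives Rfree.
        return (sum(abs(fo - scale * fc) for fo, fc in zip(f_obs, f_calc))
                / sum(f_obs))

    def split_test_set(n_refl, fraction=0.05, max_test=2000, seed=42):
        # Set aside a random test set; these reflections are never used in
        # the refinement itself, only for calculating Rfree.
        random.seed(seed)
        n_test = min(int(fraction * n_refl), max_test)
        test = sorted(random.sample(range(n_refl), n_test))
        work = sorted(set(range(n_refl)) - set(test))
        return work, test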

A particularly striking demonstration of the poor performance of the conventional R-factor as an indicator of the correctness of a model (at least at low resolution) was recently given by intentionally tracing the structure of cellular retinoic-acid-binding protein (CRABP) type II (previously solved at 1.8 Å resolution [28]) backward, and refining this "model" using data to only 3 Å resolution [1,8]. Using an "established" refinement protocol, the conventional R-factor came down as low as 0.214, and this model had what is usually termed "excellent stereo-chemistry". The free R-factor, on the other hand, could not be fooled; it ended up at a value of 0.617, slightly worse than the value expected for a random set of scatterers. A consequence of the fact that the conventional R-factor is not correlated with the accuracy of a model (unless the data-to-parameter ratio is high) is that coordinate error estimates derived from conventional Luzzati plots [29] are meaningless. We therefore proposed that this quantity be estimated from an Rfree Luzzati plot [28] instead. In the case of the backward-traced CRABP model, the estimated coordinate error based on a Luzzati plot using the conventional R-factor is ~0.35 Å, whereas that based on the free R-factor is "infinite", which, at least in spirit, is more accurate. There are other indications that, at least at low resolution, an Rfree-based Luzzati plot gives a more accurate estimate of coordinate error than one based on conventional R-factors [11].

The free R-factor can be used to tune the refinement protocol for each individual case. For example, to find out if refinement of individual temperature factors is warranted, one can do the refinement both with grouped and with individual temperature factors. If the model with individual temperature factors does not have a considerably lower Rfree, one can conclude that with the current model and the present dataset, temperature factors are best modelled by group. We have done this experiment in the case of alpha-2u-globulin [30] (A2U), a structure with four-fold NCS for which we had collected a 2.5 Å dataset. Using a near-final model, grouped temperature-factor refinement (GTFR) yielded an Rfree of 0.272, and individual temperature-factor refinement an Rfree of 0.275. On the other hand, in the case of cellular retinol-binding protein [31], GTFR using 2.1 Å data yielded an Rfree of 0.256, and individual temperature-factor refinement an Rfree of 0.248. In fact, what we do here is use Rfree to test the validity of various alternative hypotheses regarding a model. Another example of this involves the structure of P2 myelin protein [31,32], which has three molecules in the asymmetric unit. The hypothesis was that one of the three molecules in the asymmetric unit (molecule "C") has a higher overall temperature factor than the other two. To test the hypothesis, the structure was re-refined with strict NCS and grouped temperature factors [T.A.J., unpublished results, 1995] and, subsequently, an overall temperature-factor shift was refined for each of the three molecules. The refinement of only three extra parameters resulted in a drop in both R (from 0.266 to 0.260) and Rfree (from 0.317 to 0.309; no ligand or water molecules were included at this stage). The temperature-factor shift was -2.2 Å2 for molecule "A", -4.2 Å2 for molecule "B" and +10.6 Å2 for molecule "C", supporting the hypothesis.

Analogously, one can check whether replacing NCS constraints by restraints yields a significantly better model for the data. This test we have also carried out with A2U; the results are shown in Table I. Clearly, Rfree indicates that the data at this resolution is best modelled by assuming the four monomers to be identical. Note that if no NCS restraints at all are used, the RMS difference between the monomers goes up to more than 1 Å, a value often seen for low-resolution structures which have been refined without the use of NCS or guidance by Rfree [1]. Also note that even this model has "excellent stereo-chemistry" as judged by the deviations from ideal bond lengths and angles (which are often the only "quality indicators" included in papers, in particular, though not exclusively, in the more prestigious journals). However, since the model contains four times as many adjustable parameters and still leads to an increase in Rfree, this is a clear-cut case of over-fitting: the unrestrained model is not a good description of the data.



Figure 2. Example of the behaviour of R and Rfree during an unsuccessful refinement cycle. The example is the result of one of many SA protocols tried out while refining the complex of human immunoglobulin IgG and the C2 domain of protein G [A.E. Eriksson, G.J. Kleywegt, M. Uhlén, and T.A. Jones, Structure 3, 265 (1995)]. The refinement included rigid-body refinement, energy minimisation, a slow cool from 4,000 K, and more energy minimisation. The solid line shows the behaviour of Rfree; the dotted line that of the conventional R-factor. The RMS difference between both R-factors was 0.063 and their correlation coefficient only 0.274. Refer to the text for details.


If only low-resolution data is available, SA refinement is often not successful. Use of Rfree has helped us in one case to probe the limits of SA refinement. When we refined the structure of the complex between the Fc fragment of human immunoglobulin IgG and the C2 domain of protein G [33], we only had a dataset available with an effective resolution of ~3.5 Å. Using Rfree as a guide, we found that none of the many SA protocols we tried yielded an improved model: Rfree remained constant or even increased, even though the conventional R-factor easily dropped by 0.1 (see Figure 2 for an example). There is no general rule as to which resolution is the limit for SA refinement; for every structure, viability of SA refinement should be investigated by inspection of the behaviour of Rfree. This point is driven home by the refinement of T. reesei endoglucanase I [34] (EGI), initially at 4.0 Å. This structure was solved with MR techniques. Not surprisingly, initial attempts to apply SA refinement to a rough homology model failed miserably (Rfree refused to drop below 0.50). Then a map was calculated using a poly-alanine model of one of the probe molecules. This map was poor, but after 15 cycles of two-fold NCS averaging a spectacularly improved map was obtained. Using this map, ~75 % of the sequence could be assigned to the model, yielding a starting value of ~0.45 for both R and Rfree. After a 4,000 K Cartesian slow cool, Rfree had dropped to 0.39 (R to 0.28), and in the resulting averaged map another ~15 % of the model (60 residues) could be traced and built, indicating that the SA refinement had genuinely improved the model.



Figure 3. Over-fitting of the data monitored by the behaviour of the conventional and free R-factors during refinement of the 2.9 Å structure of holo CRABP type I without the use of NCS and with individual isotropic temperature factors [G.J. Kleywegt, T. Bergfors, H. Senn, P. Le Motte, B. Gsell, K. Shudo, and T.A. Jones, Structure 2, 1241 (1994)]. Despite the large drop in the conventional R-factor, the resulting model is poorer and does not explain the data any better than the conservative model, since Rfree remains constant.


In yet another case, Rfree helped us to identify a problem with a dataset. While refining the structure of holo CRABP type I [28], we used a dataset that had been processed to 2.5 Å resolution. However, the refinement got stuck at an Rfree value of ~0.35, no matter what we tried. Since we trusted the model more than the data, we re-examined the original image plate data. It then turned out that the resolution limit of the data had been grossly over-estimated; more careful reprocessing yielded a dataset with a nominal resolution of only 2.9 Å, with relatively weak and incomplete data in the highest shells (effective resolution ~3.2 Å). However, with the use of the two-fold NCS and careful refinement, the model readily refined against the reprocessed dataset. When we submitted the paper, one of the referees wondered if a structure with an R-factor of ~0.25 constituted a refined model (referring to the "25 % R-factor threshold" suggested by Brändén and Jones [2]). The same referee also suggested that the two molecules in the asymmetric unit might be different. Our initial response was that the R-factor was a result of proper refinement, rather than over-refinement, and that releasing the NCS would definitely yield different molecules, but not a more accurate model. In order to test this, we subjected our final model to two subsequent high-temperature SA cycles. In the spirit of the referee's comments, we did not employ the NCS, refined individual temperature factors, used a more restricted low-resolution cut-off and used full weight for the crystallographic pseudo-energy term. The "progress" of the refinement is shown in Figure 3; the results, listed in Table II, are in complete agreement with our expectations. The conventional R-factor could easily be brought down to a more "traditionally observed" level, but Rfree did not decrease at all. This indicates that our conservative model adequately explains the data and that all additional model parameters introduce nothing but artefacts. Of course, in reality the two molecules will be slightly different, but at 2.9 Å resolution a structure which assumes them to be identical is superior to one which tries to model the differences. The "observed" differences between the two molecules at this resolution are a direct result of noise-fitting (i.e., "modelling" the errors). This example also stresses the fact that the "25 % R-factor threshold" [2] should be rephrased in terms of Rfree. However, this will have to wait until more (correct and wrong) structures are available which have been refined with Rfree (we estimate that the Rfree threshold will be ~0.35). Analysis of the Protein Data Bank [35] (PDB) in May 1995 showed that only 62 (out of almost 3,000) X-ray structures included a free R-factor. The conventional and free R-factors of these structures are shown as a function of resolution in Figure 4. In the resolution range between 1.5 and ~3 Å, the conventional R-factors are more or less identical for all structures (~0.20), whereas the free R-factors (and, hence, the difference between the two) increase almost linearly with resolution.



Figure 4. Plot of the distribution of conventional (a) and free R-factors (b) for the 62 X-ray structures in the May 1995 version of the PDB [F.C. Bernstein, T.F. Koetzle, G.J.B. Williams, E.F. Meyer, M.D. Brice, J.R. Rodgers, O. Kennard, T. Shimanouchi, and M. Tasumi, J. Mol. Biol. 112, 535 (1977)] for which both values were reported.


Rfree can also be of help in deciding on a proper weight to use for the crystallographic pseudo-energy term [13]. The value calculated by X-PLOR in a so-called CHECK run tends to be too high (i.e., it weighs the X-ray term too heavily, leading to over-fitting and poorer geometry). Again, there is no ideal value, but running identical SA jobs with different scales for this weight is a good method of finding a proper weight. We tend to run three slow cool SA calculations from 4,000 K for every new protein structure we start to refine, one with full weight, one with half weight and one with 1/3 weight. The scale factor which yields the lowest Rfree is then used in subsequent refinement rounds.
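In outline, the procedure is no more than the following sketch, where run_slow_cool() is a hypothetical helper standing in for whatever launches the actual X-PLOR job and returns its statistics:

    def pick_xray_weight(run_slow_cool, base_weight):
        # Run otherwise-identical slow cools with full, half and one-third
        # weight on the crystallographic pseudo-energy term, and keep the
        # scale factor that yields the lowest Rfree.
        results = {s: run_slow_cool(xray_weight=s * base_weight)
                   for s in (1.0, 0.5, 1.0 / 3.0)}
        best = min(results, key=lambda s: results[s]["r_free"])
        return best * base_weight, results[best]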

We have noticed that some people don't like using Rfree or even claim that "it doesn't work with 3 Å data". This is understandable, since people are used to getting cosmetically pleasing (but not necessarily meaningful) low R-factors, but it is also nonsense. It is exactly at low resolution that Rfree is most useful: at low resolution one has relatively little data (often also weak and incomplete), so that one is very close to a data-to-parameter ratio of one (and often below one). In these cases, the danger of over-fitting is obviously at its greatest, and this may even lead to masking of gross errors in the model [7]. We have also noticed that some people include Rfree calculations, but fail to "listen" to what Rfree is telling them. These are the cases where R and Rfree values are reported which differ by 0.1-0.15 (this has even led to the claim that "Rfree is not a sensitive indicator of model quality"). Since the reflections used for the calculation of Rfree are not used in the refinement, one always expects to get lower conventional than free R-factors. However, the hallmark of good diffraction data and careful refinement is a small difference between the two. If the data is of high quality, and if the structure adequately models the data, then the structure factors used in refinement must be highly correlated to those not used, i.e. R and Rfree must have similar values. In our experience, with very good datasets we are able to obtain differences between R and Rfree of ~0.02; if the data is of poorer quality or if the resolution is very low, the difference may be as high as 0.05-0.08. In general, large differences indicate over-fitting or poor quality of the data. Naturally, these two phenomena are somewhat correlated: the poorer the data, the larger the errors in the observed structure-factor amplitudes, and the more room there is for the refinement program to fit these errors.

Another way of putting this: the conventional R-factor is extremely sensitive to changes in the precision of the model (the level of detail in which the model is described, i.e. the number of parameters in the model). Rfree, on the other hand, is a good measure for the accuracy of the model, i.e. the extent to which the model adequately explains the experimental observations. Over-fitting, then, is extending the precision of the model without improving its accuracy ("modelling the noise"). The "best" model (most adequate explanation of the data) is the one with the lowest possible value of Rfree (highest attainable accuracy; smallest phase errors), and the smallest possible difference between R and Rfree (the level of precision is warranted by the accuracy).

From the previous discussion it will be clear that we do not recommend so-called a posteriori evaluation of Rfree (i.e., performing one slow cool SA calculation with the final model, just to obtain a value for Rfree). Although a 4,000 K SA run will yield a value of Rfree close to that obtained if one had used Rfree throughout the refinement (see below), the resulting value of Rfree tends to be used only to complete a table in a publication, and not to evaluate the refinement strategy, or to assess the accuracy and the degree of over-fitting. The free R-factor should be used every step of the way as the single most important indicator available today of model accuracy and model improvement.

There are still several "undecided" issues with respect to the free R-factor (see also the discussions in Brünger [11] and Dodson et al. [36]). It is clear that in the case of NCS most of the reflections in the test set will be related to some of the reflections in the work set (unless special care is taken in the selection of the test set, e.g. by selecting them in thin resolution shells [1], or as small spheres or cones related by the G-function [P. Metcalf, personal communication, 1995; R. Read, personal communication, 1995]). This means that, in the worst case, serious errors and over-fitting could go undetected. In the best case, NCS may cosmetically reduce the difference between R and Rfree (on the other hand, the effect is probably small, and it also occurs when no NCS is present, due to the fact that bulk solvent introduces relations between neighbouring reflections). For example, the structure of the MS2-RNA complex [37], with ten-fold NCS, has an R-factor of 0.192 and an Rfree value of 0.209. The fact that the difference between R and Rfree is influenced by the degree of over-fitting, the presence and extent of NCS, and the quality of the data makes it difficult to give precise guidelines for acceptable magnitudes of this difference, but if the difference is large one should be cautious. In order to investigate whether the relations between the reflections could be strong enough to make gross tracing errors undetectable, we have carried out some experiments with A2U, since this structure has four-fold NCS. Again, we intentionally traced the structure backwards and refined it against 6-3 Å data using different NCS models and different ways to select the test reflections; the results are shown in Table III. Although the free R-factor did not go up to the level it attained in the case of the backward-traced CRABP type II structure (0.62), it was never lower than 0.46 (with constrained NCS; the maximum value was 0.55 with unrestrained NCS). Moreover, in agreement with the discussion about the difference between R and Rfree above, if the NCS was constrained or restrained, even the conventional R-factor could not be reduced to a satisfactory level (~0.35), whereas if the NCS was not restrained the R-factor easily dropped to a near-respectable level of ~0.27. Probably, in the latter case, inclusion of water molecules and further refinement (as was done in the case of the backward-traced CRABP type II structure) could have brought the conventional R-factor below the 25 % threshold.
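A sketch of the thin-shell selection mentioned above (our own illustration; a real implementation would of course operate on the measured reflection list):

    def thin_shell_test_set(reflections, n_shells=50, every=10):
        # reflections: list of (hkl, d_spacing) tuples. Divide reciprocal
        # space into thin shells of equal volume (equal width in 1/d**3)
        # and assign every tenth shell (~10 % of the data) to the test set,
        # so that NCS-related reflections, which lie at essentially the
        # same resolution, end up in the same set.
        s3 = [1.0 / d ** 3 for _, d in reflections]
        lo, hi = min(s3), max(s3)
        width = (hi - lo) / n_shells or 1.0
        test = []
        for (hkl, _d), s in zip(reflections, s3):
            shell = min(int((s - lo) / width), n_shells - 1)
            if shell % every == 0:
                test.append(hkl)
        return test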

Spacegroup errors are unlikely to be detected by Rfree, since most of the test set reflections will have a symmetry-related reflection in the work set. Indeed, in the case of the original chloromuconate cycloisomerase structure (which was solved in spacegroup I4, but actually belongs to spacegroup I422 [7,38]), a posteriori Rfree calculations with different starting temperatures yield free R-factor values of 0.32-0.34, virtually independent of the starting temperature of the SA calculation.

On the other hand, the free R-factor does seem to be able to pick up largely or completely wrong structures [1]. For example, when the backwards-traced CRABP type II structure is refined with all data and then subjected to a posteriori Rfree calculations, even a starting temperature of 500 K yields an Rfree value of 0.45. However, one needs to start the SA calculation at 4,000 K in order to approach the "real" Rfree value of 0.62. Several other important issues with respect to the free R-factor also remain unresolved.

As an alternative to the use of Rfree, one could contemplate using the statistical significance tests on the weighted R-factor described by Hamilton [39]. His idea was to test whether a decrease of the R-factor is significant (often confused with "large"), given the number of observations and the increase in the number of adjustable parameters in the model. However, as far as we know, this test has been used only once in protein crystallography [V. Lamzin, personal communication, 1995]. One obvious difficulty lies in the fact that structures are sometimes refined with more parameters than reflections; another is the question as to how one should weight restraints relative to the diffraction data.
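For reference, our summary of the test in [39] is as follows: with b the number of restrictions imposed by the more constrained model, n the number of observations and m the number of parameters of the less constrained model, the constrained model is rejected at significance level \alpha if the ratio of weighted R-factors exceeds the critical value

    \mathcal{R} = \frac{R_w(\mathrm{constrained})}{R_w(\mathrm{unconstrained})} > \mathcal{R}_{b,\,n-m,\,\alpha} = \left( \frac{b}{n-m}\, F_{b,\,n-m;\,\alpha} + 1 \right)^{1/2}

where F denotes the upper critical value of the F-distribution.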

One has to realise that the free R-factor is a global statistic, i.e. it relates to the mean absolute phase error [13]. This means that local errors in a model may well go undetected if only Rfree is used to assess the correctness of a model. In particular, serious out-of-register errors will be hard to detect since, usually, only a fairly small number of scatterers is involved. In the process of refining the structure of EGI [34], we detected and corrected such an error, involving a stretch of 20 residues which were out-of-register with the density by 7 residues (an insertion had not been correctly accounted for). At that stage, Rfree was already at 0.302 (R 0.245), i.e. well below our usual skepticism threshold of 0.35. In general, out-of-register errors can only be detected by a combination of experience, (SA-omit) maps, common sense (regarding the environments of residues), temperature factors, databases (to detect unusual main-chain arrangements) and alignment with homologous structures (if any are available). One should always be on the look-out for such errors. When doubting the correctness of a part of the trace, suspect residues can be temporarily omitted from the model, or cut back to alanines.

SA. Simulated Annealing refinement [40-43] is a powerful method which we use in almost every refinement cycle. The major benefit of SA refinement lies in its large radius of convergence. If the model and the data are good, an SA calculation rarely does any harm to the structure. On the contrary, we find that if an SA calculation does lead to a poorer model (in terms of Rfree or map quality), this is a strong indication that something is wrong with the input model, or that there is a problem with the data.



Figure 5. Averaged difference density for Candida antarctica lipase B in complex with Tween-80 after refinement of the structure without SA [J. Uppenberg, N. Öhrner, M. Norin, K. Hult, G.J. Kleywegt, S. Patkar, V. Waagen, T. Anthonsen, and T.A. Jones, Biochemistry, in the press]. Note that the density for the Tween-80 molecule is virtually uninterpretable. Refer to the text for details. (Figure kindly provided by Dr. J. Uppenberg.)


A good example is the structure of a complex of Candida antarctica lipase B with a non-ionic detergent called Tween-80 [44] (a long, floppy and chemically ill-characterised compound). The protein structure had been solved at 1.55 Å resolution [45], but for the Tween-80 complex only 2.5 Å data was available. The protein structure in the complex was solved by molecular replacement and refined using strict two-fold NCS. SA had not been used, for fear of ruining a well-refined structure by exposing it to low-resolution data. This had yielded a final model with a very low value for Rfree of 0.209 (R 0.187). Unfortunately, there was relatively poor density in the presumed location of the Tween-80 molecule, even after electron-density averaging, which made it impossible to build a model for the Tween-80 molecule (see Figure 5). We then carried out a slow cool from 4,000 K, using weak harmonic restraints on all atomic positions (with a force constant of 20 kcal mol-1 Å-2 for CA atoms, and 10 kcal mol-1 Å-2 for all other atoms). The rationale for this was to force the protein back into a structure very similar to the highly refined model, while leaving some room for the structure to relax (i.e., to adjust its conformation to the new data), in the hope of getting better density for the Tween-80 molecule (which was, after all, the interesting part of the structure). The slow cool was very successful and converged back to a model very similar to the starting structure, but with slightly lower values of Rfree and R. Most important, however, was the appearance of well-connected density in the difference map, into which a crude partial model of the Tween-80 could easily be built. This model was then subjected to another slow cool from 4,000 K, again using weak harmonic restraints for all atoms except those of the Tween-80 molecule: a force constant of 10 kcal mol-1 Å-2 was used for all main-chain atoms, 5 kcal mol-1 Å-2 for other atoms, but only 1 kcal mol-1 Å-2 for all residues within a 5 Å radius of the Tween-80 molecule. Again, the refinement was very successful (see Figure 6), yielding an Rfree of 0.188 (R 0.165). Density for the Tween-80 molecule in the 2Fo-Fc map after the slow cool is shown in Figure 7. Note that this case is a splendid illustration of the fact that even low-resolution structures with conservative assumptions (strict NCS, grouped temperature factors) can yield low (free) R-factors.
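The harmonic restraints used here add, for every restrained atom i, a quadratic penalty of the standard form (a sketch of the functional form only; the kcal mol-1 Å-2 force constants quoted above are the k_i):

    E_{\mathrm{harm}} = \sum_i k_i \left| \mathbf{r}_i - \mathbf{r}_i^{\mathrm{ref}} \right|^2

where the reference positions r_i^ref are those of the highly refined starting model, so that large excursions from that model are penalised heavily while small local relaxations remain cheap.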



Figure 6. Example of satisfactory behaviour of R and Rfree during a refinement cycle. The example is taken from the refinement of Candida antarctica lipase B in complex with Tween-80 [J. Uppenberg, N. Öhrner, M. Norin, K. Hult, G.J. Kleywegt, S. Patkar, V. Waagen, T. Anthonsen, and T.A. Jones, Biochemistry, in the press]. The refinement included energy minimisation, a slow cool from 4,000 K, more energy minimisation and grouped temperature factor refinement. The solid line shows the behaviour of Rfree; the dotted line that of the conventional R-factor. The RMS difference between both R-factors was only 0.017 and their correlation coefficient 0.980. Refer to the text for details.



Figure 7. Density for a Tween-80 model after refinement with SA [J. Uppenberg, N. Öhrner, M. Norin, K. Hult, G.J. Kleywegt, S. Patkar, V. Waagen, T. Anthonsen, and T.A. Jones, Biochemistry, in the press]. One refinement cycle yielded interpretable difference density for a part of the Tween-80 molecule. A crude model was built and refined in a second SA-refinement cycle (see Figure 6). The density for the Tween-80 after this second cycle is shown (compare to Figure 5). Refer to the text for details.


If sufficient computer resources are available, we recommend experimenting with parallel and serial slow cools. We often use parallel slow cools to try different refinement protocols (e.g., different weights for the crystallographic term, different temperature-factor models, various uses of NCS), and select the one that yields the lowest value of Rfree (and, in case of a tie, the one with the smallest difference between R and Rfree). Serial slow cools often give better results than a single slow cool (an extra drop of 0.01-0.015 in Rfree) [42], provided that temperature factors were refined after the first round. As for which data to include, in our experience it is best to use data to the highest resolution limit from the start, rather than, for example, to start by refining against 3.2 Å data and gradually increase the resolution limit in subsequent cycles.

In every refinement round (slow cool from 4,000 K, energy minimisation, temperature factor refinement) we plot the behaviour of R and Rfree as a function of "progress of refinement". The two curves should be very similar in overall shape, e.g. a sharp drop in R during temperature factor refinement should be accompanied by a similar drop in Rfree. Figure 6 shows an example of such behaviour, whereas Figure 2 demonstrates the "progress" of a dramatically unsuccessful refinement.

From release 4.0 onward, X-PLOR will contain a facility to use torsion-angle molecular dynamics (MD) [40,46], and we expect significant benefits from this. Carrying out SA refinement in torsion-angle space rather than in Cartesian coordinate space reduces the number of degrees of freedom [46] from 3N (where N is the number of atoms) to something of the order of N/3. Initial results indicate that this, plus the fact that SA refinement can now be carried out at considerably higher temperatures (up to 10,000 K), increases the radius of convergence of SA from ~1.2 Å to ~1.7 Å [46]. In addition, one obtains near-ideal bond lengths and bond angles "for free".

Of course there are still limitations to what SA refinement can do. In Cartesian coordinate space, the RMS radius of convergence is of the order of ~1.2 Å [42], although local changes of up to ~8 Å are possible [42,47]. Torsion MD protocols may extend the radius of convergence to ~1.7 Å [46]. SA refinement will not make changes that require the breaking of covalent bonds (which is often done temporarily during manual rebuilding). Nor can SA refinement optimise the fit of individual residues. For example, the side-chain oxygen and nitrogen atoms of Asn and Gln residues, and the rings of His residues, often end up flipped by 180 degrees (a careful human would take the whole hydrogen-bonding network into account to decide on the proper orientation). Also, SA refinement does not always produce rotamer-like side-chain conformations (although these could be enforced by applying strong restraints on dihedral angles). Finally, gross errors (in tracing, connectivity and sequence) cannot be fixed by a refinement program. For these reasons, manual rebuilding of protein structures is still necessary at present.

Force fields and dictionaries. Every refinement program nowadays uses geometric and other restraints to augment the X-ray data. As for geometry, the best set of bond and angle parameters available today is that developed by Engh and Huber [48], based on an analysis of small-molecule crystal structures from the Cambridge Structural Database (CSD) [49], which, through the efforts of John Priestle, are now available for the most widely used refinement and rebuilding programs [50]. When used in combination with a reduced weight for the crystallographic pseudo-energy term, protein models with good stereo-chemistry are virtually guaranteed (but note that good stereo-chemistry says absolutely nothing about how well the structure models the data; see the discussion of Table I above). As more protein structures are solved at atomic resolution, better dictionaries can be expected in the next few years [51].

However, problems remain in deriving "ideal" parameters for non-protein entities. Writing an X-PLOR topology and parameter file for a ligand, for example, is a cumbersome, time-consuming and error-prone process. What is often not realised is that, in fact, one has to specify exactly what one would like the ligand to look like (with the exception of torsions around freely rotatable bonds) in the absence of any crystallographic information. This means that every sp1 and sp2 carbon gives rise to a "flatness" restraint, that the chirality of every chiral carbon must be restrained, etc. If one is lucky, the structure of a ligand has been solved separately and can be retrieved from a database such as the CSD. In that case, we suggest deriving all ideal values from this structure and using heavy weights for the restraints. In other cases, one may be able to find the structure of a common co-factor or ligand in another structure in the PDB (we have created a collection of several hundred such small molecules). If no structure is available, one will have to use "rule-of-thumb" values, or resort to quantum-chemical or molecular-mechanics calculations. In the case of low-resolution data, it may be best to almost constrain the ligand (by using very heavy weights for bond lengths and angles) so that it effectively has only a few degrees of freedom left (freely rotatable carbon-carbon bonds, for instance). Again we must emphasise the problem with low-resolution data: if the ligand dictionary allows ring puckering, for instance, the refinement program is invited to take liberties (in this fashion even aromatic rings can easily be "refined" into a non-planar conformation).
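As a simple illustration of deriving ideal values from a reference structure, the following sketch (under our own naming conventions; this is not the program described below) extracts target bond lengths that can then be restrained with heavy weights:

    import math

    def bond_length(a, b):
        # Distance (in Angstrom) between two atoms given as (x, y, z) tuples.
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

    def ideal_bond_lengths(coords, bonds):
        # coords: {atom_name: (x, y, z)} taken from e.g. a CSD reference
        # structure; bonds: list of (name1, name2) pairs. Returns target
        # lengths for use as heavily weighted restraints.
        return {(i, j): round(bond_length(coords[i], coords[j]), 3)
                for (i, j) in bonds}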

To remove much of the tedium and human error from the dictionary-generation process, we have written a small program [G.J.K., unpublished results, 1994] which, given the structure of a small molecule in PDB format, automatically generates the necessary restraint information (such as X-PLOR topology and parameter files).

Prior to including the ligand in crystallographic refinement, one should always energy-minimise the ligand on its own, without the X-ray term. The result shows the structure towards which the refinement program will try to pull the ligand once the complete model is refined with inclusion of the experimental data.

Other issues.

* Data. One popular way to get lower R-factors is to manipulate the low-resolution cut-off of the data. If one does not include an explicit bulk-solvent model, a cut-off of ~8 Å seems reasonable. As for the sigma cut-off, we tend to use all observed reflections nowadays. As for the Rfree partitioning, a fraction of 5 - 10 % of the data (with a maximum of ~2,000 reflections) is usually sufficient to make meaningful use of Rfree.

* Temperature factors. As is the case with geometry, temperature factors should be refined using constraints and restraints, guided by Rfree. This may seem self-evident, but there are dozens of structures [27] in the PDB which have RMS Delta-B values for bonded or NCS-related atoms that exceed 10 Å2. After a substantial rebuild, we tend to reset temperature factors, either by resetting all values to a single, average value, or by limiting the extremes (e.g., by resetting all B values lower than 10 Å2 to 10 Å2, and those exceeding 50 Å2 to 50 Å2), as in the trivial sketch below. In the case of strict NCS, we usually end up refining grouped temperature factors at any resolution. The reason is probably that grouped temperature factors are not subject to restraints on bonded atoms, which means that high temperature factors can be assigned to just one or two residues in a loop if these residues do not obey exact NCS. In the case of restrained individual temperature factors, such high values can only be assigned by the refinement program if they are propagated through a stretch of residues, which could give a false impression of the extent of the area in which the NCS breaks down. Very high temperature factors are usually caused either by a significant deviation from the assumption of strict NCS, or by non-existent density. This often pertains to solvent entities, but occasionally even the protein may suffer from this. For example, while refining the structure of human alpha class glutathione S-transferase [52] we observed strange behaviour for residue 103. The data extended to 2.6 Å, so the structure (a pair of dimers) was refined with strict four-fold NCS and grouped temperature factors, and electron-density averaging was used prior to rebuilding. Residue 103, situated in an alpha-helix, had been built as an aspartic acid. It was situated in a fairly internal position without a salt link. Even with averaging, the density never showed any branching from the main chain. After GTFR, the main-chain atoms obtained a B value of 2 Å2 (the lowest value allowed during refinement), whereas the side chain was assigned a B value of 65 Å2. We therefore re-checked the sequence and found that this residue was actually a glycine. This reinforced our belief in the quality of the model and the accuracy of the temperature factors (not necessarily their precision). A good way to learn how to detect local errors, in particular for inexperienced crystallographers, might actually be to include a number of them deliberately. For instance, one could introduce an out-of-register error, or change a number of residues (e.g., alanine to leucine, which would lead to "include maps" as opposed to "omit maps"), and check if they are clearly identifiable during the refinement and rebuilding process. By monitoring the behaviour of the model and the maps in such areas, one would also learn to recognise unintentional errors elsewhere in the model. (The cynic will note that this practice is already widespread when it comes to water molecules, but usually not for the present purpose.)
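The temperature-factor reset described above amounts to no more than the following (thresholds as quoted in the text; a trivial sketch):

    def reset_b_factors(b_values, b_min=10.0, b_max=50.0, average=None):
        # Either reset every B value to a single average value, or clamp
        # the extremes into the range [b_min, b_max].
        if average is not None:
            return [average] * len(b_values)
        return [min(max(b, b_min), b_max) for b in b_values]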

* Extended models. An average model of a protein structure, including ligands, co-factors, tightly bound waters, etc., does not give a complete description of the observed data. The data are time-averaged and space-averaged intensities, plus error terms and deviations due to crystal defects, disorder, mobility, absorption, decay, etc. Some of these effects can be modelled. Best known is the use of alternative conformations and refinement of occupancies. These should, however, only be used at high resolution, and with Rfree as a control to check if the model actually improves [53]. Bulk solvent can be modelled in myriad ways; however, it has been shown that a simple flat solvent model (with only two adjustable parameters: a bulk density and a bulk temperature factor) gives the best results in terms of Rfree [54]. Using a bulk-solvent model, one can include data to very low resolution (~30 Å), which may improve the density at the surface of the molecule [55].
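As far as we are aware, the flat solvent model referred to here has the standard mask-based functional form (our notation; see [54] for details), the two adjustable parameters being k_sol (the bulk density) and B_sol (the bulk temperature factor):

    F_{\mathrm{total}}(\mathbf{h}) = F_{\mathrm{model}}(\mathbf{h}) + k_{\mathrm{sol}}\, e^{-B_{\mathrm{sol}} s^{2}/4}\, F_{\mathrm{mask}}(\mathbf{h}), \qquad s = 1/d

where F_mask is the structure factor calculated from the solvent mask.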

For a short period of time, time-averaged MD looked set to become a popular way of modelling dynamic and space-averaged effects [56]. However, it has been shown recently [57] that ensemble-averaged MD with only a few separate copies of a structure gives results superior to time-averaged MD with hundreds or even thousands of structures. As with non-crystallographic symmetry, however, at low resolution the "null-hypothesis" has to be that all molecules are equal. Only when one has a sufficiently high data-to-parameter ratio (i.e., high-resolution data) can one even begin to contemplate such procedures.

What is desperately needed for a better understanding of the remaining "R-factor gap" between small-molecule and large-molecule structure determinations are studies at very high resolution (better than ~1.2 Å) [58] on medium-size and large proteins, as well as inter-laboratory validation experiments (as commonly applied to techniques in analytical chemistry). This will teach us how important anisotropic motion is, how different NCS-related protein molecules really are, how frequent conformational heterogeneity is, etc., and how much of this variation can be explained by variations in refinement and rebuilding practices and programs.


QUALITY CONTROL

Quality control is now an integral part of our model-building and refinement process, i.e. it is not something that is done only once, a posteriori, for the sole purpose of filling in some tables in the publication. Quality control entails the use of our knowledge with respect to the structure of macromolecules to find places in the model which need special attention and are possibly in error. Zou and Mowbray [59] have described the benefits that can be attained by the use of empirical knowledge (as embodied in databases) during protein rebuilding and refinement.

In judging the quality of a model produced by the refinement program, we use a battery of per-residue criteria, the most important of which are enumerated below in the discussion of OOPS.

Naturally, assessing all these criteria for each and every residue during each and every rebuilding session is cumbersome. Therefore we use a program called OOPS [64] to aid in this process. The program makes use of O's ability to generate and use so-called residue and atom properties (these are explained in more detail in the accompanying chapter [3]). The idea behind OOPS is to calculate the values of some of the quality indicators in O before starting the rebuild, and to generate others on the fly from the coordinate file of the present model. OOPS gathers and integrates all this information on a per-residue basis. Criteria that can be checked include pep-flip values, RSC values, real-space (RS) fit values, suspiciously low or high temperature factors and occupancies, phi,psi values, peptide planarity and CA chirality; there is also a provision to check up to ten user-defined criteria. In addition, the present model can be compared to a previous model, and all residues which have changed considerably during refinement (in terms of RMS distance, RMS Delta-B, RMS occupancy change, RMS Delta-phi, Delta-psi and/or RMS Delta-chi1, Delta-chi2) are flagged as worthy of closer scrutiny during the subsequent rebuilding session. If a residue scores poorly for any of the criteria that the user wishes to check, or differs considerably from the previous model, a small O macro is generated which takes the user to that residue and reports what may be wrong with it.

In this way the user is taken from one bad or suspect residue to the next. Especially in the later stages of refinement, when the protein model no longer changes very much, this can save enormous amounts of time. Instead of having to look at each and every residue in turn, draw the maps, etc., only to find that in nine out of ten cases the residue is fine, the user can focus all attention on the five or ten percent of residues which may actually need to be adjusted or rebuilt.

In addition to this, OOPS produces plots of various properties as a function of residue number, statistics for all properties, and a residue-by-residue "critique" of the structure. The plots can be used to reveal areas where the structure is particularly bad, the statistics are useful to judge the overall quality and to decide which cut-off values to use, and the residue-by-residue listing is saved in an electronic notebook file which can be edited during the rebuilding session.
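The underlying flagging logic is easily sketched. In the Python sketch below, the criteria names and cut-off values are invented for the purpose of the example; OOPS itself reads O property values and writes O macros:

    # Sketch of per-residue quality flagging in the spirit of OOPS
    # (the cut-off values here are invented for the example).

    def flag_residues(props, pep_flip_max=2.5, rsc_max=1.5, rs_cc_min=0.8):
        """props: {residue: (pep_flip, rsc_fit, rs_cc)} -> suspects + reasons."""
        suspects = {}
        for res, (pep, rsc, cc) in props.items():
            reasons = []
            if pep > pep_flip_max:
                reasons.append("pep-flip")
            if rsc > rsc_max:
                reasons.append("side-chain rotamer fit")
            if cc < rs_cc_min:
                reasons.append("real-space fit")
            if reasons:
                suspects[res] = reasons
        return suspects

    model = {"A23": (3.1, 0.4, 0.62), "A24": (0.8, 0.3, 0.95)}
    for res, reasons in flag_residues(model).items():
        print(res, "->", ", ".join(reasons))   # A23 -> pep-flip, real-space fit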

It should be pointed out that there are two types of residue which give rise to violations: those that are wrong (errors), and those that actually do have an unusual conformation ("outliers-for-a-reason"). The latter type is often found in the interesting places in a structure, for example in a ligand- or substrate-binding site [65]. They can be recognised as such by very convincing density, and a tendency to return to the same conformation, even after rebuilding and SA refinement. Residues that are in error, on the other hand, almost always have problematic (poorly fitting or absent) density and will either maintain their rebuilt conformation after SA refinement (indicating that the error has been fixed), or they will end up in yet another conformation (usually, this happens for surface and loop residues which are poorly defined by the data). As a rule-of-thumb, for a well-refined, high-resolution model, one would expect <2 % "outliers-for-a-reason" in the Ramachandran plot, ~1-2 % residues with unusual peptide orientations, and ~5-10 % residues with non-rotamer side-chain conformations [8,27].

In the early stages, while the model is still crude, we prefer to use very strict criteria and to check all residues (for example, a side chain with a reasonable RSC value can often nevertheless be replaced by a rotamer that fits the density equally well or even better; the sketch below illustrates the idea). In later rounds, the criteria can be relaxed and only suspicious-looking residues need to be checked.
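The idea of trying rotamers against the density can be sketched as follows; here `density_at` stands in for map interpolation, and the candidate coordinates would in practice come from a rotamer library:

    import math

    # Sketch: score candidate side-chain conformations by real-space fit
    # (here simply the mean density at the atom positions; illustrative only).

    def rs_score(atoms, density_at):
        return sum(density_at(xyz) for xyz in atoms) / len(atoms)

    def best_rotamer(candidates, density_at):
        """candidates: {name: [(x, y, z), ...]} -> name of best-fitting one."""
        return max(candidates,
                   key=lambda name: rs_score(candidates[name], density_at))

    density = lambda xyz: math.exp(-sum(c * c for c in xyz))  # toy density peak
    candidates = {"mt (-60, 180)": [(0.1, 0.0, 0.2)],
                  "tp (180, 60)":  [(1.5, 1.2, 0.0)]}
    print(best_rotamer(candidates, density))  # mt (-60, 180)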


MAPS

For rebuilding we tend to use 3Fo-2Fc, 2Fo-Fc and Fo-Fc maps. Fo-Fc difference maps should be contoured both at positive and negative levels. In trouble areas, SA-omit maps can be calculated [66]. Standard omit maps are not a good idea, since it may be impossible to tell whether the re-appearance of density is real or due to model-bias [66]. After an SA-omit run, we calculate both 2Fo-Fc and Fo-Fc omit maps. The density in the omitted area should be very similar for both maps. During difficult refinements, we occasionally use systematic SA-omit maps. In this procedure, all residues are omitted in turn (using stretches of 5-10 residues at a time) and they are rebuilt in the resulting SA-omit map. Naturally, if an experimentally phased map is available, it should be consulted during rebuilding as well.
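Generating the batches for such a systematic SA-omit run is trivial; a sketch (with a fixed window of eight residues, within the 5-10 residue range mentioned above):

    # Sketch: residue batches for systematic SA-omit maps (window of 8).

    def omit_batches(first, last, width=8):
        """Yield (start, end) residue ranges covering first..last inclusive."""
        start = first
        while start <= last:
            yield start, min(start + width - 1, last)
            start += width

    for lo, hi in omit_batches(1, 25):
        print(f"omit residues {lo}-{hi}, run SA, inspect the resulting omit map")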

In the case of NCS, we invariably use our real-space electron-density averaging programs [67,68]. Averaging is a well-established and very powerful method for map improvement. Not only does it often improve the density in areas where the unaveraged map has no visible density at all, it also helps in the identification of regions where the NCS breaks down. In our experience, deviations from NCS are often much smaller than one would expect on the basis of published structures which were refined without making use of the NCS [27]. For example, in (NCS) disordered loops, there are often only one or two residues for which even after averaging no density is visible.
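In essence, real-space averaging evaluates rho_avg(x) = (1/N) sum_i rho(R_i x + t_i) over the N NCS operators. The following is a bare-bones sketch on a periodic grid (illustrative only; real averaging programs [67,68] work within a molecular envelope and handle the crystal-to-grid coordinate bookkeeping):

    import numpy as np
    from scipy.ndimage import map_coordinates

    def ncs_average(rho, operators):
        """rho: 3D density grid; operators: list of (R, t) in grid units,
        including the identity. Trilinear interpolation, periodic map."""
        avg = np.zeros_like(rho)
        idx = np.indices(rho.shape).reshape(3, -1).astype(float)
        for R, t in operators:
            mapped = np.asarray(R, float) @ idx \
                     + np.asarray(t, float).reshape(3, 1)
            avg += map_coordinates(rho, mapped, order=1,
                                   mode="wrap").reshape(rho.shape)
        return avg / len(operators)

    rho = np.random.rand(8, 8, 8)
    two_fold = np.diag([-1.0, -1.0, 1.0])   # a two-fold along the z grid axis
    rho_avg = ncs_average(rho, [(np.eye(3), (0, 0, 0)), (two_fold, (0, 0, 0))])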

One sometimes has multiple, non-isomorphous crystal forms, in which case cross-crystal averaging can be used to obtain better maps. For instance, in the refinement of the complex between acetylcholinesterase and fasciculin II [69] we averaged maps of two crystal forms (one at 3.0 Å and the other at 3.2 Å resolution) in order to confirm the correctness of the trace and the side-chain orientations in a crucial region of the protein-protein interface, for which there was poor density in the 3.0 Å map.


Figure 8

Figure 8. Structure of TTNPB (a) and "Compound 19" (b) [G.J. Kleywegt, T. Bergfors, H. Senn, P. Le Motte, B. Gsell, K. Shudo, and T.A. Jones, Structure 2, 1241 (1994)].


If one carries out refinement carefully, one may also confidently "listen" to the density when it insists that something is wrong, and safely assume this really to be the case, rather than an artefact introduced by over-fitting. For example, we recently solved the structure of CRABP type II in complex with all-trans-retinoic acid at 1.8 Å resolution [28]. This structure was used to solve a complex of the same protein with another ligand at 2.2 Å resolution. Due to a communication problem with the collaborators who supplied the ligand, we thought that the ligand we had used was TTNPB (see Figure 8a), and initially built this into the density, even though the fit was not perfect. After more, seemingly successful, SA refinement, the density still failed to cover the whole ligand, and three strong peaks stubbornly showed up in the difference maps: a positive one at a distance of ~1.5 Å from C6 (suggesting a methyl-like substituent), and two negative ones at the positions of atoms C22 and C23 (suggesting that these two methyl groups were not really there). An SA-omit map (leaving out the ligand) was calculated, and on the basis of that density we were convinced that the ligand was not TTNPB. After talking to our collaborators, we found out that the actual ligand was a molecule called "Compound 19" (see Figure 8b). With only the covalent structure of this compound, we could easily build a model that fitted the density (see Figure 9) and complete the refinement of the structure. It is important to realise that in the case of complex structures, where one basically uses protein-crystallographic methods to determine a small-molecule structure, the density is the only guide one has to confirm the presence of the assumed compound or to detect that something is amiss (see also the example of the CBHI complex discussed earlier). No quality indicator exists that specifically tracks down errors at this level (unless the error affects nearby protein residues, e.g. by forcing them into very unusual conformations); since the number of scatterers is usually small, even Rfree may behave seemingly normally when something is wrong. High temperature factors sometimes indicate that something is wrong, but they may also occur for a variety of other reasons (e.g., mobility, disorder, or low occupancy). In order to be able to rely on the density, good data, careful refinement and a healthy dose of skepticism are a conditio sine qua non.


Figure 9

Figure 9. SA-omit Fo-Fc map with the manually built and energy-minimised model of Compound 19 overlaid [G.J. Kleywegt, T. Bergfors, H. Senn, P. Le Motte, B. Gsell, K. Shudo, and T.A. Jones, Structure 2, 1241 (1994)].


REBUILDING

Manual rebuilding of a structure is something of a "black art", best learned by practising it on many different structures. Nevertheless, there are a number of simple questions that one has to ask oneself all the time, for every residue in turn; the answers to these questions determine what action has to be taken.


Figure 10a

Figure 10b

Figure 10. Example of a case in which a non-rotamer side-chain conformation was built in a low-resolution (3.0 Å) map, but can easily be replaced by a rotamer conformation that fits the density equally well, if not better. (a) A non-rotamer conformation for a leucine residue. The RSC-fit value, as calculated by O, is 2.09 Å, with chi1=-70 degrees and chi2=-25 degrees. (b) A better-fitting rotamer for the same residue. The RSC-fit value is 0.63 Å, with chi1=-38 degrees and chi2=175 degrees (close to the values of the most common leucine rotamer, namely chi1=-60 degrees and chi2=180 degrees).


In the following, we shall discuss violations of specific criteria, their possible causes, and possible remedies (using the tools in O). Again, all rebuilding should be followed by regularisation to restore proper stereo-chemistry (with the Refi_zone command [70]).


FINAL MODEL

When refinement and rebuilding of the structure have converged, a final refinement round can be carried out using all diffraction data, employing only energy minimisation and temperature-factor refinement. After this, a final assessment of the quality of the structure has to be carried out. The factors to be taken into account are similar to those that should be checked in every macro-cycle. In addition, one may estimate the average coordinate error, for example from a Luzzati plot [29] using Rfree rather than the conventional R-factor [28]. If NCS is present and has not been constrained, differences between NCS-related molecules should be analysed skeptically [27]. A particularly sensitive way of analysing differences between NCS-related molecules is to compare the main-chain and side-chain torsion angles of corresponding residues [1,8,27,62]. In principle, one could assess the adequacy of a model by plotting the distribution of (|Fo|-|Fc|)/sigma, which should have a mean of zero and a standard deviation of one. Unfortunately, the fact that our models are incomplete (and, consequently, that the true scale factors for the structure-factor amplitudes are unknown) precludes such an analysis.
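Such a torsion-angle comparison requires nothing more than a proper circular difference of angles; a minimal sketch (angles in degrees):

    # Sketch: per-residue torsion-angle differences between NCS-related chains.

    def delta_angle(a, b):
        """Signed difference of two torsion angles, mapped into [-180, 180)."""
        return (a - b + 180.0) % 360.0 - 180.0

    def compare_torsions(mol_a, mol_b):
        """mol_a, mol_b: lists of (phi, psi) per residue -> (dphi, dpsi) lists."""
        return [(delta_angle(pa, pb), delta_angle(sa, sb))
                for (pa, sa), (pb, sb) in zip(mol_a, mol_b)]

    a = [(-60.0, -45.0), (-120.0, 130.0)]
    b = [(-58.0, -50.0), (-125.0, 145.0)]
    print(compare_torsions(a, b))  # [(-2.0, 5.0), (5.0, -15.0)]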

One potentially useful quality check that has not yet been explored in depth is that of a "real-space free R-factor". Since the number of reflections used for Rfree calculations is usually small, calculating maps with only these reflections is not very useful. However, one can omit each and every residue, water molecule, etc. in turn (in batches of 5-10 residues), calculate an SA-omit map, and assess how well each residue's density is predicted by the rest of the structure by evaluating the real-space fit between the calculated map and the omit map. Assuming that most of the model bias has been removed by the SA calculation, this would even make it possible to quantify the extent of model bias on a per-residue basis, by comparing the real-space fit obtained with an ordinary 2Fo-Fc map to that obtained with the SA-omit 2Fo-Fc map. In the case of CRABP type II [28], the backward-traced structure at 3.0 Å has an average RS-fit correlation coefficient of 0.65 and an average RS R-factor of 0.36. Using the protocol outlined above (omitting 5 residues at a time), the average "free" RS-fit correlation coefficient goes down to 0.47, whereas the average "free" RS R-factor increases to 0.45.
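The real-space fit statistics themselves are simple. The sketch below uses one common definition of the real-space R-factor, RSR = sum|rho_o - rho_c| / sum|rho_o + rho_c|, together with the ordinary linear correlation coefficient, evaluated over the grid points covering a residue (it is not implied that this is exactly the definition used by any particular program):

    import math

    def rs_r_factor(rho_o, rho_c):
        """Real-space R-factor over paired grid values of two maps."""
        num = sum(abs(o - c) for o, c in zip(rho_o, rho_c))
        den = sum(abs(o + c) for o, c in zip(rho_o, rho_c))
        return num / den

    def rs_correlation(rho_o, rho_c):
        """Linear correlation coefficient between the two sets of grid values."""
        n = len(rho_o)
        mo, mc = sum(rho_o) / n, sum(rho_c) / n
        cov = sum((o - mo) * (c - mc) for o, c in zip(rho_o, rho_c))
        var_o = sum((o - mo) ** 2 for o in rho_o)
        var_c = sum((c - mc) ** 2 for c in rho_c)
        return cov / math.sqrt(var_o * var_c)

Comparing these statistics for an ordinary 2Fo-Fc map and for the SA-omit map of the same residue gives the per-residue model-bias estimate described above.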

How does one assess the quality of published structures? If no coordinates (and structure factors) are available, one is dependent on what is written in the publication [8]. The first thing to check is the quality of the data: what are the multiplicity, Rmerge, completeness and I/sigma(I) ratio, both overall and in the highest-resolution shells? If complete tables of these quantities are included, one may roughly assess the effective resolution of the data (as opposed to the Bragg spacing of the single highest-resolution reflection). The second most important thing is to assess the quality and strategy (if any) of the refinement. Even in the absence of details one can usually guess how the refinement was carried out. If Rfree is not mentioned, it was probably not used; if it is mentioned, check whether it was used throughout, and whether the difference between R and Rfree is small. If no mention is made of NCS constraints or restraints, one may safely assume that the NCS was not employed during refinement. Temperature factors will have been refined for individual atoms, unless specifically stated otherwise. If the Ramachandran plot is not shown or mentioned, it may have been very poor (or never produced).

If coordinates are available, many of the quality criteria can be easily checked [8]. However, for a full evaluation of the structure, observed structure-factor amplitudes are badly needed. Only with those can one check if the structure is an adequate model for the data or not, if necessary by re-doing the refinement. We therefore strongly recommend that structure factors be deposited together with coordinates.

Finally, when "validating" a model it is important to realise that any property which has been constrained or heavily restrained during refinement, and any property which has been closely monitored during rebuilding, cannot be used as an independent criterion to assess (or "prove") the quality of the model. For instance, most refinement programs operate by minimising the difference between observed and calculated structure-factor amplitudes; therefore, the value of the conventional R-factor is hardly an independent quality criterion. Similarly, most refinement programs tightly restrain bond lengths, bond angles and certain (improper) torsion angles; therefore, low RMS deviations from ideal geometry cannot be waved around as proof of the quality of the structure. Also, if side-chain conformations are monitored, and rotamers are used in the rebuilding, a low fraction of residues with non-rotamer conformations is not necessarily a hallmark of a correctly traced structure. With the widespread use of the program ProCheck and its pretty output, a standard phrase has begun to creep into papers describing protein structures: "the model has a quality better than expected for structures at this resolution". Again, this is a rather meaningless statement if the structure was refined using the Engh and Huber parameter set, since (with the exception of the Ramachandran plot) almost all criteria which ProCheck assesses have been restrained during the refinement. In fact, apart from the Ramachandran plot, both the backward-traced structure of CRABP type II [1] and that of A2U are of "better than average" quality according to ProCheck. For this reason, it is important to have one or two independent quality checks which are not applied in the refinement and rebuilding process, but only to assess the final model.


ACKNOWLEDGMENTS

This work was supported by the Swedish Natural Science Research Council and Uppsala University. The many fruitful (and sometimes heated) discussions with other crystallographers from Uppsala, with Dr. Eleanor Dodson (York) and the other members of the ESF-funded Biotech group, with the participants in the York meeting on statistical validators in protein crystallography [36], and in particular with Dr. Axel Brünger (Yale, New Haven) are gratefully acknowledged. We are also grateful to Dr. Randy Read (Edmonton) for his suggestion to investigate the effect of relationships between reflections in the case of NCS by refining a backward-traced structure with NCS. We would further like to thank Dr. Christina Divne (Uppsala) for allowing us to report her experiences in the refinement of cellobiohydrolase I complexes, and Dr. Jonas Uppenberg (Uppsala/Montpellier) for providing us with his lipase B/Tween-80 data.


REFERENCES

1. G.J. Kleywegt and T.A. Jones, Structure 3, 535 (1995).

2. C.I. Brändén and T.A. Jones, Nature 343, 687 (1990).

3. T.A. Jones and M. Kjeldgaard, this volume.

4. R. Lüthy, J.U. Bowie, and D. Eisenberg, Nature 356, 83 (1992).

5. G. Vriend and C. Sander, J. Appl. Cryst. 26, 47 (1993).

6. E.E. Lattman, Proteins Struct. Funct. Genet. 18, 103 (1994).

7. G.J. Kleywegt, H. Hoier and T.A. Jones, Acta Cryst. D, in the press.

8. G.J. Kleywegt and T.A. Jones, in "Making the Most of Your Model" (W.N. Hunter, J.M. Thornton, and S. Bailey, Eds.), p. 11, SERC Daresbury Laboratory, Daresbury, UK, 1995.

9. Y. Liu, D. Zhao, R. Altman, and O. Jardetzky, J. Biomol. NMR 2, 373 (1992).

10. D. Zhao and O. Jardetzky, J. Mol. Biol. 239, 601 (1994).

11. A.T. Brünger, this volume.

12. A.T. Brünger, Nature 355, 472 (1992).

13. A.T. Brünger, Acta Cryst. D49, 24 (1993).

14. A.T. Brünger, G.M. Clore, A.M. Gronenborn, R. Saffrich, and M. Nilges, Science 261, 328 (1993).

15. A.T. Brünger, "X-PLOR: a system for crystallography and NMR", Yale University, New Haven, CT (1990).

16. T.A. Jones, J.Y. Zou, S.W. Cowan, and M. Kjeldgaard, Acta Cryst. A47, 110 (1991).

17. T.A. Jones and M. Kjeldgaard, in "From First Map to Final Model" (S. Bailey, R. Hubbard, and D.A. Waller, Eds.), p. 1, SERC Daresbury Laboratory, Daresbury, UK, 1994.

18. T.A. Jones and S. Thirup, EMBO J. 5, 819 (1986).

19. Z. Otwinowski, DENZO and SCALEPACK, unpublished programs.

20. Collaborative Computational Project Number 4, Acta Cryst. D50, 760 (1994).

21. V. Luzzati and D. Taupin, J. Appl. Cryst. 17, 273 (1984).

22. C. Divne, J. Ståhlberg, T. Reinikainen, L. Ruohonen, G. Pettersson, J.K.C. Knowles, T.T. Teeri, and T.A. Jones, Science 265, 524 (1994).

23. C. Divne, J. Ståhlberg, and T.A. Jones, to be published.

24. M. Sato, M. Yamamoto, K. Imada, Y. Katsube, N. Tanaka, and T. Higashi, J. Appl. Cryst. 25, 348 (1992).

25. D.E. McRee, J.A. Tainer, T.E. Meyer, J. van Beeumen, M.A. Cusanovich, and E.D. Getzoff, Proc. Natl. Acad. Sci. USA 86, 6533 (1989).

26. G.E.O. Borgstahl, D.R. Williams, and E.D. Getzoff, Biochemistry 34, 6278 (1995).

27. G.J. Kleywegt, Acta Cryst. D, in the press.

28. G.J. Kleywegt, T. Bergfors, H. Senn, P. Le Motte, B. Gsell, K. Shudo, and T.A. Jones, Structure 2, 1241 (1994).

29. V. Luzzati, Acta Cryst. 5, 802 (1952).

30. G.J. Kleywegt, J. Björklund, J. Uppenberg, D. Ogg, L.D. Lehman-McKeeman, J.D. Oliver, and T.A. Jones, to be published.

31. S.W. Cowan, M.E. Newcomer, and T.A. Jones, J. Mol. Biol. 230, 1225 (1993).

32. T.A. Jones, T. Bergfors, J. Sedzik, and T. Unge, EMBO J. 7, 1597 (1988).

33. A.E. Eriksson, G.J. Kleywegt, M. Uhlén, and T.A. Jones, Structure 3, 265 (1995).

34. G.J. Kleywegt, J.Y. Zou, C. Divne, I. Sinning, J. Ståhlberg, T.T. Teeri, G. Davies, and T.A. Jones, to be published.

35. F.C. Bernstein, T.F. Koetzle, G.J.B. Williams, E.F. Meyer, M.D. Brice, J.R. Rodgers, O. Kennard, T. Shimanouchi, and M. Tasumi, J. Mol. Biol. 112, 535 (1977).

36. E.J. Dodson, G.J. Kleywegt, and K.S. Wilson, Acta Cryst. D, in the press.

37. K. Valegård, J.B. Murray, P.G. Stockley, N.J. Stonehouse, and L. Liljas, Nature 371, 623 (1994).

38. H. Hoier, M. Schlömann, A. Hammer, J.P. Glusker, H.L. Carrell, A. Goldman, J.J. Stezowski, and U. Heinemann, Acta Cryst. D50, 75 (1994).

39. W.C. Hamilton, Acta Cryst. 18, 502 (1965).

40. A.T. Brünger and L.M. Rice, this volume.

41. A.T. Brünger, J. Kuriyan, and M. Karplus, Science 235, 458 (1987).

42. A.T. Brünger and A. Krukowski, Acta Cryst. A46, 585 (1990).

43. A.T. Brünger, Annu. Rev. Phys. Chem. 42, 197 (1991).

44. J. Uppenberg, N. Öhrner, M. Norin, K. Hult, G.J. Kleywegt, S. Patkar, V. Waagen, T. Anthonsen, and T.A. Jones, Biochemistry, in the press.

45. J. Uppenberg, M. Trier Hansen, S. Patkar, and T.A. Jones, Structure 2, 293 (1994).

46. L.M. Rice and A.T. Brünger, Proteins Struct. Funct. Genet. 19, 277 (1994).

47. P. Gros, M. Fujinaga, B.W. Dijkstra, K.H. Kalk, and W.G.J. Hol, Acta Cryst. B45, 488 (1989).

48. R.A. Engh and R. Huber, Acta Cryst. A47, 392 (1991).

49. F.H. Allen, O. Kennard, and R. Taylor, Acc. Chem. Res. 16, 146 (1983).

50. J.P. Priestle, Structure 2, 911 (1994).

51. V.S. Lamzin, Z. Dauter, and K.S. Wilson, J. Appl. Cryst. 28, 338 (1995).

52. I. Sinning, G.J. Kleywegt, S.W. Cowan, P. Reinemer, H.W. Dirr, R. Huber, G.L. Gilliland, R.N. Armstrong, X. Ji, P.G. Board, B. Olin, B. Mannervik, and T.A. Jones, J. Mol. Biol. 232, 192 (1993).

53. G. Sheldrick and T. Schneider, this volume.

54. J.S. Jiang and A.T. Brünger, J. Mol. Biol. 243, 100 (1994).

55. D. Tronrud, this volume.

56. J.B. Clarage and G.N. Phillips, Acta Cryst. D50, 24 (1994).

57. F.T. Burling and A.T. Brünger, Israel J. Chem. 34, 165 (1994).

58. K.S. Wilson, in "From First Map to Final Model" (S. Bailey, R. Hubbard, and D.A. Waller, Eds.), p. 141, SERC Daresbury Laboratory, Daresbury, UK, 1994.

59. J.Y. Zou and S.L. Mowbray, Acta Cryst. D50, 237 (1994).

60. R.A. Laskowski, M.W. MacArthur, D.S. Moss, and J.M. Thornton, J. Appl. Cryst. 26, 283 (1993).

61. R.A. Laskowski, M.W. MacArthur, and J.M. Thornton, in "From First Map to Final Model" (S. Bailey, R. Hubbard, and D.A. Waller, Eds.), p. 149, SERC Daresbury Laboratory, Daresbury, UK, 1994.

62. A.P. Korn and D.R. Rose, Prot. Engin. 7, 961 (1994).

63. C. Ramakrishnan and G.N. Ramachandran, Biophys. J. 5, 909 (1965).

64. G.J. Kleywegt and T.A. Jones, Acta Cryst. D, in the press.

65. O. Herzberg and J. Moult, Proteins Struct. Funct. Genet. 11, 223 (1991).

66. A. Hodel, S.H. Kim, and A.T. Brünger, Acta Cryst. A48, 851 (1992).

67. T.A. Jones, in "Molecular Replacement" (E.J. Dodson, S. Glover, and W. Wolf, Eds.), p. 91, SERC Daresbury Laboratory, Daresbury, UK, 1992.

68. G.J. Kleywegt and T.A. Jones, in "From First Map to Final Model" (S. Bailey, R. Hubbard, and D.A. Waller, Eds.), p. 59, SERC Daresbury Laboratory, Daresbury, UK, 1994.

69. M. Harel, G.J. Kleywegt, R. Ravelli, I. Silman, and J. Sussman, to be published.

70. J. Hermans and J.E. McQueen, Acta Cryst. A30, 730 (1974).


TABLE I. Tests of various NCS models at low resolution. [a]


Run   Force constant   Temp. (K)   Final R   Final Rfree   RMSD (Å) [b]   RMSB (Å) [c]   RMSA (degrees) [d]

1     "infinity"       3,000       0.243     0.267         0.0            0.010          1.099
2     300              3,000       0.251     0.270         0.015          0.004          0.680
3     200              3,000       0.250     0.269         0.020          0.004          0.675
4     100              3,000       0.248     0.269         0.035          0.005          0.681
5     75               2,000       0.247     0.272         0.045          0.004          0.673
6     50               2,000       0.245     0.272         0.060          0.004          0.675
7     25               2,000       0.241     0.271         0.095          0.005          0.700
8     20               2,000       0.239     0.274         0.11           0.004          0.685
9     15               2,000       0.237     0.271         0.14           0.004          0.707
10    10               1,500       0.235     0.273         0.17           0.004          0.710
11    5                2,000       0.231     0.274         0.24           0.004          0.717
12    2                2,000       0.227     0.276         0.40           0.004          0.718
13    "0"              3,000       0.225     0.285         1.13           0.005          0.762

[a] Results of using NCS constraints, restraints or no restraints with slow-cool SA protocols (temperature steps of -50 K; followed by energy minimisation) at 2.5 Å resolution. An intermediate model of A2U [e], after energy minimisation, was used as input (initial R 0.248, Rfree 0.269). Some of the calculations crashed when started at 3,000 K and were therefore run from lower initial temperatures. The run with an NCS force constant "infinity" used strict NCS and yielded the best model (for our data, at 2.5 Å resolution); the one with a constant of "zero" used no restraints at all and yielded the poorest model.

[b] Average RMS distance between main-chain atoms in molecule A and each of the other three molecules.

[c] RMS deviations from ideality of the bond lengths. [f]

[d] RMS deviations from ideality of the bond angles. [f]

[e] G.J. Kleywegt, J. Björklund, J. Uppenberg, D. Ogg, L.D. Lehman-McKeeman, J.D. Oliver, and T.A. Jones, to be published.

[f] R.A. Engh and R. Huber, Acta Cryst. A47, 392 (1991).


TABLE II. Results of using a "traditional" refinement protocol with low-resolution data. [a]


                                    Conservative model    Over-fitted model

NCS                                 strict 2-fold         not used
Temperature-factor model            grouped               individual
Scale for X-ray term                0.5                   1
Resolution range (Å)                8.0 - 2.9             6.0 - 2.9
R / Rfree                           0.251 / 0.320         0.169 / 0.323

Number of non-H atoms               1123                  2246
Reflections with F > 2 sigma(F)     6743                  6316
Refined parameters                  3657                  8984
Data-to-parameter ratio             1.8                   0.7

Average B, protein (Å2)             49.4                  43.7
RMS Delta-B, bonded atoms (Å2)      n/a                   6.9

RMSD NCS-related CA atoms (Å)       0.0                   0.50
RMSD NCS-related all atoms (Å)      0.0                   0.99
RMS Delta-B NCS CA atoms (Å2)       0.0                   10.5
RMS Delta-B all NCS atoms (Å2)      0.0                   12.0

RMS dev. bonds (Å)                  0.009                 0.010
RMS dev. angles (degrees)           1.56                  1.63
RMS dev. dihedrals (degrees)        26.9                  27.0
RMS dev. impropers (degrees)        1.25                  1.41

Bad contacts                        0                     3
Pep-flip outliers                   3*2                   7
RSC outliers                        10*2                  32
Ramachandran outliers               1*2                   2
% Ramachandran most favoured        82                    78
Overall G-factor [b]                +0.13                 +0.06

[a] The structure of holo CRABP type I [c] was refined at 2.9 Å resolution with strict two-fold NCS and grouped temperature factors. The final model was then subjected to two 4,000 K SA calculations without any NCS constraints or restraints and with full weight for the crystallographic pseudo-energy term, and individual isotropic temperature factors were refined. The resulting structure is clearly inferior to the more conservative model.

[b] R.A. Laskowski, M.W. MacArthur, and J.M. Thornton, in "From First Map to Final Model" (S. Bailey, R. Hubbard, and D.A. Waller, Eds.), p. 149, SERC Daresbury Laboratory, Daresbury, UK, 1994.

[c] G.J. Kleywegt, T. Bergfors, H. Senn, P. Le Motte, B. Gsell, K. Shudo, and T.A. Jones, Structure 2, 1241 (1994).


TABLE III. Effect of NCS on Rfree for incorrect models. [a]


Test set        Constrained NCS    Restrained NCS    Unrestrained NCS

Random          0.365 / 0.465      0.347 / 0.484     0.268 / 0.522
Thin shells     0.360 / 0.477      0.348 / 0.485     0.266 / 0.552
Thick shells    0.361 / 0.470      0.351 / 0.531     0.267 / 0.531

[a] Results of experiments with an intentionally backward-traced model of A2U [b] to assess the effect of relationships between reflections in the case of NCS on the value of the free R-factor for grossly incorrect models. The structure was subjected to SA refinement using data with F>2sigma(F) between 6.0 and 3.0 Å. Three different methods of selecting 10 % test reflections were tried: random, in (15) thin resolution shells and in (5) thick shells. For each set of reflections, three different NCS models were tested: constrained (data-to-parameter ratio ~2.5), restrained and unrestrained (data-to-parameter ratio ~0.6). In all cases, "R-factor reducing tricks" were used (limited data, full weight of the crystallographic term and isotropic temperature-factor refinement). In all calculations, the initial R-factor was ~0.54 and the initial free R-factor ~0.55. The table shows the values of the conventional and free R-factor, respectively, for each refinement.

It is clear that no amount of over-fitting brings the free R-factor down to a "respectable" level. Also, with careful refinement (constrained or restrained NCS), not even the conventional R-factor can be "fooled". If the NCS is unrestrained, on the other hand, the conventional R-factor approaches the realm of respectability. Finally, the use of thin resolution shells of test reflections does appear to "uncouple" the conventional and free R-factor somewhat. Thick resolution shells should probably not be used, since they introduce sizeable resolution ranges which are systematically missing from the work set.

[b] G.J. Kleywegt, J. Björklund, J. Uppenberg, D. Ogg, L.D. Lehman-McKeeman, J.D. Oliver, and T.A. Jones, to be published.
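The test-set selection protocols compared above are easy to sketch. Below, the thin shells are taken as equal-width slices in 1/d, which is an assumption of the example rather than a description of the protocol actually used; the function and parameter names are likewise illustrative:

    import random

    def random_test_set(hkl, fraction=0.10, seed=0):
        """Select ~`fraction` of the reflections at random."""
        rng = random.Random(seed)
        return [h for h in hkl if rng.random() < fraction]

    def thin_shell_test_set(hkl, inv_d, n_shells=15, fraction=0.10):
        """Select n_shells thin slices of 1/d which together hold ~`fraction`
        of the data: cut the 1/d range into n_shells/fraction slices and
        keep one slice in every 1/fraction."""
        n_slices = int(round(n_shells / fraction))   # e.g. 150 slices
        step = int(round(1.0 / fraction))            # keep every 10th slice
        lo, hi = min(inv_d), max(inv_d)
        width = (hi - lo) / n_slices or 1.0
        test = []
        for h, s in zip(hkl, inv_d):
            slice_no = min(int((s - lo) / width), n_slices - 1)
            if slice_no % step == step // 2:
                test.append(h)
        return test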

