As noted on the main index page, this information is very important, particularly if you wish to collaborate with someone else on a project – and even if not, for your own sake and soundness of your mind, do yourself the favor of taking (some of) the advice offered on this page to heart…
While working on an “old dataset” (doesn't everybody have those shelved treasures hidden somewhere in the attic, which pop up every now and then when a secondary, tertiary or even more remote re-analysis of some old data has to be done?), I once again was faced with a situation I have come to dread:
The person who initially collected the data couldn't make up their mind about how to enter (code) the data in a systematic way, which makes batched (scripted/automated) analysis difficult (and sometimes impossible), and also greatly increases the chance of making an error (for instance if data is coded arbitrarily or, even worse, ambiguously!).
It might take you a while to get accustomed to a more rigorous way of coding data, but I promise that you will have the following benefits (if done properly at least):
And while the last reason is of course the least important (when it comes to producing good results, at least), it might be the strongest motivator after all! ;)
Take a moment (or two) and think about (and decide on) the following items:
There is no golden rule, but I would advise storing all data that is collected (rather than dropping data prematurely, only to discover later that it is needed). With hard disk space being abundantly available these days, there is really no excuse for removing potentially interesting data from a dataset simply to “save space” (especially since Excel has this feature of hiding columns, which makes data much easier to handle for later analyses).
It is (hopefully) obvious that NONE of your general data files should contain clearly identifiable data (names, addresses, email addresses). But even so, sometimes you even have to be careful with what else needs to be coded. For instance, if you are running a study with HIV-positive smokers, you might not want to say in your data what the HIV status is directly. Instead you could enter a serial number and store that information separately so that you don't need to remove/mask this portion when you share your data with someone else.
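The serial-number idea can be sketched in a few lines of Python. This is only a minimal illustration under assumptions of my own: the field names (`hiv_status`, `cigarettes_per_day`, `serial`) and the records are made up, and the helper `split_sensitive` is hypothetical, not part of any library.

```python
def split_sensitive(records, sensitive_fields, id_field="serial"):
    """Split records into a shareable table and a separate key table.

    Both tables carry the same serial number, so the sensitive values can
    be linked back later by whoever holds the key table.
    """
    shareable, key_table = [], []
    for serial, record in enumerate(records, start=1):
        public = {id_field: serial}
        private = {id_field: serial}
        for name, value in record.items():
            # Sensitive fields go to the key table, everything else is shared.
            target = private if name in sensitive_fields else public
            target[name] = value
        shareable.append(public)
        key_table.append(private)
    return shareable, key_table


# Made-up example records (assumption: one dict per participant).
records = [
    {"hiv_status": "positive", "cigarettes_per_day": 12},
    {"hiv_status": "negative", "cigarettes_per_day": 20},
]

shareable, key_table = split_sensitive(records, {"hiv_status"})
print(shareable)   # safe to pass on to a collaborator
print(key_table)   # stored separately, never shared
```

The point of the design is that the shareable table never contained the sensitive column in the first place, so nothing has to be removed or masked before sharing.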
Having received training in how to create databases, I see three major approaches:
If all your data comes from one “source” (e.g. is entered manually into Excel), this is of less importance. But if you are using several different sources (e.g. self-report ratings from a logfile written by a stimulus presentation program, such as EPrime, combined with per-trial physiological measures, such as GSR peak-to-peak response), you should think about how you are going to combine this data in such a way that no errors occur during the procedure!
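One way to make such a combination safe is to join the sources on an explicit key and to refuse to continue when the keys do not line up. The sketch below is in Python and rests on assumptions: both sources are taken to be already parsed into one dict per trial, the key columns `subject` and `trial` and the measure names are invented for illustration, and `merge_by_key` is a hypothetical helper.

```python
def merge_by_key(ratings, physio, key=("subject", "trial")):
    """Join two lists of records on a shared key, failing loudly on mismatch."""

    def index(rows, label):
        table = {}
        for row in rows:
            k = tuple(row[name] for name in key)
            if k in table:
                raise ValueError(f"duplicate key {k} in {label}")
            table[k] = row
        return table

    left = index(ratings, "ratings")
    right = index(physio, "physio")
    if left.keys() != right.keys():
        # Keys present in only one of the two sources.
        unmatched = sorted(left.keys() ^ right.keys())
        raise ValueError(f"records without a partner: {unmatched}")
    return [{**left[k], **right[k]} for k in sorted(left)]


# Made-up data: note that the two sources list the trials in different order.
ratings = [
    {"subject": 1, "trial": 1, "rating": 4},
    {"subject": 1, "trial": 2, "rating": 2},
]
physio = [
    {"subject": 1, "trial": 2, "gsr_peak_to_peak": 0.8},
    {"subject": 1, "trial": 1, "gsr_peak_to_peak": 0.35},
]

merged = merge_by_key(ratings, physio)
print(merged)
```

Joining on the key rather than on row position means that a reordered or incomplete logfile produces an error instead of silently pairing the wrong trials.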
Sometimes you wish to store only parts of the data (simpler processing due to reduced amount of data, or for sharing with a collaborator). In either case, you should make sure that it is easy to separate required from non-required information by separately coding different properties in each record.
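To illustrate what separately coded properties buy you, here is a small Python sketch that splits a combined text identifier into individual fields and then selects a subset of records. The identifiers follow the pattern `hm_12_left_nofix`; the meaning I attach to each part (the names in `FIELDS`) is purely an assumption, and `split_identifier` is a hypothetical helper.

```python
# Hypothetical names for the four underscore-separated parts.
FIELDS = ("group", "number", "side", "fixation")


def split_identifier(identifier):
    """Turn e.g. 'hm_12_left_nofix' into one value per property."""
    parts = identifier.split("_")
    if len(parts) != len(FIELDS):
        raise ValueError(f"unexpected identifier format: {identifier!r}")
    record = dict(zip(FIELDS, parts))
    record["number"] = int(record["number"])  # numbers are easier to compare
    return record


records = [split_identifier(s) for s in ("hm_12_left_nofix", "fm_17_right_fix")]

# With one property per field, selecting a subset is a one-liner.
left_only = [r for r in records if r["side"] == "left"]
print(left_only)
```

Once every property lives in its own field (or column), dropping or keeping parts of the data no longer requires parsing strings at analysis time.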
- Code conditions as numbers, e.g. `1`, `2`, or `3`; naturally you must keep a record of which number refers to which condition! The advantage is that languages such as Matlab have more powerful operators available for numbers (a simple `==` is enough instead of using `strcmpi`, for instance); if you do use text tokens (e.g. `male` and `female`), make sure that the spelling is consistent
- If one record is identified as '`hm_12_left_nofix`' and another is identified as '`fm_17_right_fix`', separate along the underscores into different columns and, if useful for further analysis, replace unique text identifiers with numbers (see above)
- Do not name files `pt57_react.xls`, `pt91_react.xls`, and `pt142_react.xls`, but instead rename the first two to `pt057_react.xls` and `pt091_react.xls` respectively

Type of use | The Good, | the Bad, | and the Ugly… |
---|---|---|---|
folder name | 'HandGestures' | either of 'hg' (too short) or 'hand gestures with moco s91 (2005-1114)' (too long, spaces, what the f#%& is moco? and s91? is that a date?) | 'hgmcs91-05-11-14' (you WILL need some other record later, which a good name would have prevented) |