
Organizing Information in Spreadsheets for the Humanities

Lesson Level: Beginner

Last Updated: 2021-08-13

Created By: Theo Acker

Spreadsheets are a wonderful tool for gathering, viewing, sharing, and working with information. As with any tool, learning to use them effectively can make a big difference. In this tutorial we will discuss the structure of an effective spreadsheet, how to deal with certain tricky values you may work with, and some other tips and features for working with spreadsheets.

Learning Objectives

  • Describe the basic structure of a spreadsheet.
  • Determine how to organize a complicated dataset.
  • Avoid common pitfalls when creating tables or spreadsheets.
  • Handle tricky information, like dates and null or unknown values.

Spreadsheet Structure

We will look at several sample spreadsheets to see how we can organize different types of information. Before we look at anything too complex, however, we need to define the basic "data structure" of a spreadsheet.

The Book Sample spreadsheet is a very basic example of a standard table listing books.

A quick note on terminology: I am using spreadsheet and table interchangeably here, since a well-organized spreadsheet is a table.

Book Sample
Title | Author | Date | Publisher
The Visual Display of Quantitative Information | Edward R. Tufte | 1983 | Graphics Press
Why We Sleep | Matthew Walker | 2017 | Scribner
The Design of Everyday Things | Don Norman | 2013 | Basic Books
The Angel and the Assassin | Donna Jackson Nakazawa | 2020 | Ballantine Books
New Digital Worlds | Roopika Risam | 2018 | Northwestern University Press
The Eye of the World | Robert Jordan | 1990 | Tor Books
Weapons of Math Destruction | Cathy O’Neil | 2016 | Crown

There are three components of table structure that are important to note: Columns, rows, and cells.

Book Sample table with the Author column, Design of Everyday Things row, and Don Norman cell highlighted

A column lists some variable, or type of information. In this sample, that information includes title, author, date, and publisher. We could create more columns if we wanted to collect other information about the books - number of pages, ISBN, subject, and so forth - anything you might want to record.

A row lists one observation, or one "item" that we are observing. That "item" could be anything from an observation from hands-on research to information about a concept or physical object. Often, as is the case here, the first row provides column headers instead. This is important for making it clear what each column contains. In this sample, the items we are listing are books, so each row will contain one (and only one) book. This is one of the essential rules for organized spreadsheets: Each row contains information for only one object.

Each intersection of a row and column is a cell. A cell contains the value of the variable specified by the column for the item specified by the row. For example, the cell highlighted here contains the information specified by the Author column for the third item - the book The Design of Everyday Things. In other words, this cell states that the author of The Design of Everyday Things is Don Norman.
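
If you ever load a table like this into a programming language, the same structure carries over directly. As a quick illustration (my own addition, not something this tutorial requires), here is a minimal sketch in Python using the pandas library, assuming the Book Sample has been saved as a hypothetical books.csv file:

```python
import pandas as pd

# Each CSV row becomes a row of the table; each header becomes a column.
books = pd.read_csv("books.csv")  # hypothetical file with Title, Author, Date, Publisher columns

# A cell is the intersection of one row and one column:
# the Author value for the row where Title is "The Design of Everyday Things".
author = books.loc[books["Title"] == "The Design of Everyday Things", "Author"].iloc[0]
print(author)  # Don Norman
```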

The standard rule here is that there should only ever be one value per cell. While this sample conforms to the rule, there is a potential problem area in the author column. You are probably familiar with lots of books and articles that have more than one author. If I want to add a book with multiple authors to this spreadsheet, I have to choose between ignoring all authors other than the first, having multiple authors (values) in a single cell, or restructuring the sheet. As we will see, the last option is typically the best choice, although it will likely take more work. That is why it is important to plan out what information you have and how you will organize it early!

Artifacts with Tags

Another fairly common attribute that shares the same difficulty as author is the tag attribute. Tweets, photographs, artifacts... all kinds of things can have any number of tags.

The Skyrim Museum Sample is a shortened and simplified version of some information that could be uploaded to the content management system Omeka. Omeka allows you to build a basic website with a searchable database containing images or other objects and their metadata (the information about those objects). It is commonly used for archives, and it works well for a variety of digital projects.

This is a screenshot from the Omeka site showing two of the items it contains. If you are interested, you can explore the Skyrim Museum site to get an idea of what Omeka can do.

Omeka website "Browse Items" page showing 2 of 12 items: Crown of Barenziah and Wuuthrad

This site features a collection of items from the video game Skyrim that would be classified as cultural heritage artifacts, at least if they existed in the real world. Our sample dataset contains a few of these items.

Skyrim Museum Sample
Name | Creator | Date | Original Format | Tags
Rahgot | Unknown dragons | Merethic Era | Game Object | armor,dragons,enchanted,Nordic
Golden Claw | Unknown | Unknown | Game Object | dragons,Nordic,puzzles
Mehrunes' Razor | Mehrunes Dagon | Unknown | Game Object | Daedric,dagger,enchanted,one-handed,weapon
Auriel's Bow | Anuiel | Dawn Era | Game Object | bow,enchanted,prophecy,two-handed,weapon
Wuuthrad | Yngol | Merethic Era | Game Object | axe,Nordic,two-handed,weapon

In this dataset we have name, creator, date, original format, and tags. You may have noticed that the Tags column is breaking the rule of one value per cell. This illustrates one of the rare cases when it is good to break the rule. When uploading this data to Omeka, the Omeka importer will recognize the Tags column and separate the list into individual tags. This means that the end result will not have all the tags mashed together like we see here.

That is great for importing data into Omeka. It is not so great for other potential uses. What if we want to analyze the data outside of Omeka using some other type of software? For a simple example, maybe we want to count how many objects have the "enchanted" tag. It is simple enough to count by hand, since this is such a tiny dataset. In a larger dataset, however, I might have closer to five hundred items than five. With such a dataset, there would be no easy way to count the number of objects with any given tag.

Other potential avenues for analysis include:

  • How many tags does each item have?
  • What are the most common tags?
  • Which tags tend to show up together?

With the current structure, we have no reasonable way to get this information.

So what is the solution?

Variable List of Generic Columns

The first option we will consider is adding a list of however many generic columns we end up needing, as shown below.

Name | Creator | Date | Original Format | Tag 1 | Tag 2 | Tag 3 | Tag 4 | Tag 5
Rahgot | Unknown dragons | Merethic Era | Game Object | armor | dragons | enchanted | Nordic |
Golden Claw | Unknown | Unknown | Game Object | dragons | Nordic | puzzles | |
Mehrunes' Razor | Mehrunes Dagon | Unknown | Game Object | Daedric | dagger | enchanted | one-handed | weapon
Auriel's Bow | Anuiel | Dawn Era | Game Object | bow | enchanted | prophecy | two-handed | weapon
Wuuthrad | Yngol | Merethic Era | Game Object | axe | Nordic | two-handed | weapon |

Here we have created five columns so that each tag is in its own cell and column. There are several major issues with this. First, the number of tags a given object has is variable. At the moment we only need five columns, since the most tags any of these items have is five, but we cannot count on that. We might end up with an item that has ten or even twenty tags. In addition, this structure does not help us at all in analyzing the data. Consider the "simple" question of how many objects have the "enchanted" tag. Again, it is easy enough to count by hand with a dataset of only five objects. If I had a larger dataset and wanted to use some software to find this number, however, it would be rather complicated to explain what I needed to the software. I would have to list out each tag column (Tag 1, Tag 2, Tag 3, and so on) and tell the program to check each of them, as "enchanted" could show up in any of them.

This might seem like a minor hassle with the data we have. Listing the five columns is only slightly annoying, but imagine if there were twenty columns instead, or if we wanted to make a chart showing how many times each tag shows up. While these are not the only issues, they are enough to indicate that this is not the solution.
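
To make the hassle concrete, here is a rough sketch of that check in Python with pandas (an assumption on my part - any software would face the same problem), with the generic-column version saved as a hypothetical museum_wide.csv:

```python
import pandas as pd

wide = pd.read_csv("museum_wide.csv")  # hypothetical export of the table above

# "enchanted" could appear in any of the generic tag columns,
# so every one of them has to be listed and checked explicitly.
tag_columns = ["Tag 1", "Tag 2", "Tag 3", "Tag 4", "Tag 5"]
has_enchanted = wide[tag_columns].eq("enchanted").any(axis=1)
print(has_enchanted.sum())  # number of objects with the "enchanted" tag

# Add a sixth tag to any object and this code silently misses it
# until someone remembers to extend tag_columns.
```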

Long List of Specific Columns

Our second option is a slightly different take on the first. We still create extra columns, but this time we create a specific column for each potential value. In this case, we need a column for each different tag used in the dataset.

Name | Creator | Date | Original Format | armor | axe | bow | Daedric | dagger | dragons | enchanted | Nordic | one-handed | prophecy | puzzles | two-handed | weapon
Rahgot | Unknown dragons | Merethic Era | Game Object | yes | no | no | no | no | yes | yes | yes | no | no | no | no | no
Golden Claw | Unknown | Unknown | Game Object | no | no | no | no | no | yes | no | yes | no | no | yes | no | no
Mehrunes' Razor | Mehrunes Dagon | Unknown | Game Object | no | no | no | yes | yes | no | yes | no | yes | no | no | no | yes
Auriel's Bow | Anuiel | Dawn Era | Game Object | no | no | yes | no | no | no | yes | no | no | yes | no | yes | yes
Wuuthrad | Yngol | Merethic Era | Game Object | no | yes | no | no | no | no | no | yes | no | no | no | yes | yes

This structure improves our ability to perform some types of analysis, though it causes issues with others. More importantly, however, this is simply not feasible. We have to add one column for each tag. With only five items, each of which has at most only five tags, we already have thirteen different possible tags. With a full dataset, we would have a lot more. There are some cases where adding specific columns like this is a good solution (we will see one in a different example), but this is clearly not one of those.

Multiple Spreadsheets

Finally, we have our only real option: Splitting the data into two different spreadsheets/tables.

The first table, Objects, contains all of the artifact information except for tags. In fact, it is literally the same table we started with, but with the Tags column removed.

Objects
Name | Creator | Date | Original Format
Rahgot | Unknown dragons | Merethic Era | Game Object
Golden Claw | Unknown | Unknown | Game Object
Mehrunes' Razor | Mehrunes Dagon | Unknown | Game Object
Auriel's Bow | Anuiel | Dawn Era | Game Object
Wuuthrad | Yngol | Merethic Era | Game Object

We can then deal with the Tags column in a second spreadsheet. Here, each row pairs one object and one tag. Every object gets one row for each of its tags. For example, Auriel's Bow has five tags, so it gets five rows in this new table.

Tags
Name | Tag
Rahgot | armor
Rahgot | dragons
Rahgot | enchanted
Rahgot | Nordic
Golden Claw | dragons
Golden Claw | Nordic
Golden Claw | puzzles
Mehrunes' Razor | Daedric
Mehrunes' Razor | dagger
Mehrunes' Razor | enchanted
Mehrunes' Razor | one-handed
Mehrunes' Razor | weapon
Auriel's Bow | bow
Auriel's Bow | enchanted
Auriel's Bow | prophecy
Auriel's Bow | two-handed
Auriel's Bow | weapon
Wuuthrad | axe
Wuuthrad | Nordic
Wuuthrad | two-handed
Wuuthrad | weapon

The two spreadsheets connect using the artifact name. In the fifth row of the Tags table, we pair the Golden Claw with the dragons tag. Then we pair it with Nordic, and then with puzzles. Because the "Name" variable is unique to each of our objects - for example, we only have one object with the name Golden Claw - we can use it to clearly identify which object a given tag belongs to. There is no question that the three tags just mentioned (dragons, Nordic, puzzles) all belong to the object called Golden Claw.

Working with multiple related sheets can take some getting used to, but it opens up a world of possibility. We can count how many times any tag shows up in this column. We could load this data into a visualization program like Tableau and make a bar chart to see which tags are the most or least common. The list goes on.
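
As a rough illustration, here is a sketch of those analyses in Python with pandas (my assumption; any tool that can group and count rows would work just as well), with the Tags table saved as a hypothetical tags.csv:

```python
import pandas as pd

tags = pd.read_csv("tags.csv")  # hypothetical file with Name and Tag columns

# How many objects have the "enchanted" tag? One comparison, no column list needed.
print((tags["Tag"] == "enchanted").sum())

# How many tags does each item have?
print(tags.groupby("Name").size())

# What are the most common tags?
print(tags["Tag"].value_counts())
```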

Survey Data

Our next example is a spreadsheet based on some actual survey data I worked with. I am providing two tables here - the short version, so you can view it more easily, and the long version.

Survey Data - "Short" Version
HHID | Year | Month | Day | RegionCode | Livestock | Members | ID_01 | ID_02 | ID_03 | ID_04 | ID_05 | ID_06 | ID_07 | ID_08 | ID_09 | ID_10 | ID_11 | ID_12 | Age_01
1 | 2019 | 3 | 25 | 15874 | 1 poultry | 4 | 1 | 2 | 3 | 4 | | | | | | | | | 34
2 | 2019 | 3 | 25 | 15874 | 2 cows, 1 ox | 4 | 1 | 2 | 3 | 4 | | | | | | | | | 20
3 | 2019 | 3 | 27 | 15874 | 5 poultry, 2 cows, 2 goats, 1 ox | 4 | 1 | 2 | 3 | 4 | | | | | | | | | 42
4 | 2019 | 3 | 28 | 10553 | 3 cows, 1 ox | 5 | 1 | 2 | 3 | 4 | 5 | | | | | | | | 40
5 | 2019 | 3 | 30 | 10553 | 1 goat | 2 | 1 | 2 | | | | | | | | | | | 27
6 | 2019 | 4 | 2 | 10553 | 1 ox | 6 | 1 | 2 | 3 | 4 | 5 | 6 | | | | | | | 30
7 | 2019 | 4 | 2 | 10553 | 1 goat, 1 ox | 4 | 1 | 2 | 3 | 4 | | | | | | | | | 57
8 | 2019 | 4 | 2 | 10553 | 2 goats, 1 ox | 12 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 67
9 | 2019 | 4 | 3 | 10553 | 2 oxen | 1 | 1 | | | | | | | | | | | | 65
10 | 2019 | 4 | 3 | 14521 | 1 cow, 2 goats | 3 | 1 | 2 | 3 | | | | | | | | | | 33

The first column is the HHID, the ID for each household. We also have columns for the year, month, and day the data was collected, a region code, a list of livestock owned by the household, and the number of members within the household. It is after this Members column that the data starts to get... interesting. Starting with the ID columns, information has been collected about each member of the household. The data in these columns relates not to the household in general, but only to one of the members within the household. ID_01 is the ID for the first member, ID_02 is the ID for the second member, and so on.

Survey Sample
HHID | Year | Month | Day | RegionCode | Livestock | Members | ID_01 | ID_02 | ID_03 | ID_04 | ID_05 | ID_06 | ID_07 | ID_08 | ID_09 | ID_10 | ID_11 | ID_12 | Age_01 | Age_02 | Age_03 | Age_04 | Age_05 | Age_06 | Age_07 | Age_08 | Age_09 | Age_10 | Age_11 | Age_12 | Sex_01 | Sex_02 | Sex_03 | Sex_04 | Sex_05 | Sex_06 | Sex_07 | Sex_08 | Sex_09 | Sex_10 | Sex_11 | Sex_12 | Mar_01 | Mar_02 | Mar_03 | Mar_04 | Mar_05 | Mar_06 | Mar_07 | Mar_08 | Mar_09 | Mar_10 | Mar_11 | Mar_12 | EduLvl_01 | EduLvl_02 | EduLvl_03 | EduLvl_04 | EduLvl_05 | EduLvl_06 | EduLvl_07 | EduLvl_08 | EduLvl_09 | EduLvl_10 | EduLvl_11 | EduLvl_12 | EduYr_01 | EduYr_02 | EduYr_03 | EduYr_04 | EduYr_05 | EduYr_06 | EduYr_07 | EduYr_08 | EduYr_09 | EduYr_10 | EduYr_11 | EduYr_12
1 | 2019 | 3 | 25 | 15874 | 1 poultry | 4 | 1 | 2 | 3 | 4 | | | | | | | | | 34 | 24 | 3 | 1 | | | | | | | | | male | female | female | male | | | | | | | | | married | married | | | | | | | | | | | primary | none | none | none | | | | | | | | | 2 | 0 | 0 | 0 | | | | | | | | |
2 | 2019 | 3 | 25 | 15874 | 2 cows, 1 ox | 4 | 1 | 2 | 3 | 4 | | | | | | | | | 20 | 22 | 2 | 0 | | | | | | | | | male | female | female | female | | | | | | | | | married | married | | | | | | | | | | | secondary | secondary | none | none | | | | | | | | | 8 | 12 | 0 | 0 | | | | | | | | |
3 | 2019 | 3 | 27 | 15874 | 5 poultry, 2 cows, 2 goats, 1 ox | 4 | 1 | 2 | 3 | 4 | | | | | | | | | 42 | 36 | 11 | 4 | | | | | | | | | male | female | male | male | | | | | | | | | married | married | | | | | | | | | | | secondary | secondary | secondary | none | | | | | | | | | 14 | 12 | 6 | 0 | | | | | | | | |
4 | 2019 | 3 | 28 | 10553 | 3 cows, 1 ox | 5 | 1 | 2 | 3 | 4 | 5 | | | | | | | | 40 | 39 | 11 | 6 | 4 | | | | | | | | male | female | male | female | male | | | | | | | | married | married | | | | | | | | | | | secondary | none | secondary | primary | none | | | | | | | | 6 | 0 | 6 | 1 | 0 | | | | | | |
5 | 2019 | 3 | 30 | 10553 | 1 goat | 2 | 1 | 2 | | | | | | | | | | | 27 | 16 | | | | | | | | | | | male | female | | | | | | | | | | | married | married | | | | | | | | | | | secondary | secondary | | | | | | | | | | | 12 | 9 | | | | | | | | | |
6 | 2019 | 4 | 2 | 10553 | 1 ox | 6 | 1 | 2 | 3 | 4 | 5 | 6 | | | | | | | 30 | 24 | 1 | 1 | 59 | 25 | | | | | | | male | female | female | male | male | male | | | | | | | married | married | | | married | never | | | | | | | primary | primary | none | none | primary | primary | | | | | | | 5 | 6 | 0 | 0 | 4 | 6 | | | | | |
7 | 2019 | 4 | 2 | 10553 | 1 goat, 1 ox | 4 | 1 | 2 | 3 | 4 | | | | | | | | | 57 | 26 | 24 | 18 | | | | | | | | | male | female | female | female | | | | | | | | | widowed | never | never | never | | | | | | | | | secondary | secondary | higher | secondary | | | | | | | | | 13 | 14 | 16 | 10 | | | | | | | | |
8 | 2019 | 4 | 2 | 10553 | 2 goats, 1 ox | 12 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 67 | 63 | 35 | 38 | 4 | 2 | 19 | 15 | 30 | 3 | 29 | 3 | male | female | female | male | male | male | male | female | female | male | female | female | married | married | married | married | | | never | never | married | | married | | primary | primary | higher | higher | none | none | higher | secondary | higher | none | higher | none | 1 | 5 | 18 | 18 | 0 | 0 | 14 | 7 | 18 | 0 | 18 | 0
9 | 2019 | 4 | 3 | 10553 | 2 oxen | 1 | 1 | | | | | | | | | | | | 65 | | | | | | | | | | | | female | | | | | | | | | | | | widowed | | | | | | | | | | | | primary | | | | | | | | | | | | 2 | | | | | | | | | | |
10 | 2019 | 4 | 3 | 14521 | 1 cow, 2 goats | 3 | 1 | 2 | 3 | | | | | | | | | | 33 | 24 | 2 | | | | | | | | | | male | female | male | | | | | | | | | | married | married | | | | | | | | | | | primary | secondary | none | | | | | | | | | | 4 | 14 | 0 | | | | | | | | |

One of the basic rules of an organized spreadsheet is that each row contains information about a single observation. This example is breaking that rule, since each row contains information about multiple observations (one household and multiple members). That is a serious issue.

This is also an excellent example of why the basic rule that each row contains information about a single observation is so important. Not only does this structure make the member-level data extremely difficult to analyze, it is also highly impractical to work with. In this minimized dataset, each household may have up to twelve members, so each piece of information collected about the members requires twelve columns. Keep in mind that this is a greatly reduced sample. The datasets I worked with had room for around 30 to 40 household members, with a lot more pieces of information collected about each. That is a lot of scrolling!

In the previous example, we looked at artifacts with variable numbers of tags. This example is similar, as it involves households with variable numbers of members. The big difference is that the tags involved only one word or phrase, while multiple pieces of information have been collected about each household member. Fortunately, the solution works in exactly the same way.

First, we make one sheet with all the household member information removed.

Households
HHID | Year | Month | Day | RegionCode | Livestock | Members
1 | 2019 | 3 | 25 | 15874 | 1 poultry | 4
2 | 2019 | 3 | 25 | 15874 | 2 cows, 1 ox | 4
3 | 2019 | 3 | 27 | 15874 | 5 poultry, 2 cows, 2 goats, 1 ox | 4
4 | 2019 | 3 | 28 | 10553 | 3 cows, 1 ox | 5
5 | 2019 | 3 | 30 | 10553 | 1 goat | 2
6 | 2019 | 4 | 2 | 10553 | 1 ox | 6
7 | 2019 | 4 | 2 | 10553 | 1 goat, 1 ox | 4
8 | 2019 | 4 | 2 | 10553 | 2 goats, 1 ox | 12
9 | 2019 | 4 | 3 | 10553 | 2 oxen | 1
10 | 2019 | 4 | 3 | 14521 | 1 cow, 2 goats | 3

Then we make another sheet that connects each set of member information to a household, using the household ID as the identifier. We start with the HHID column, and then we have ID, Age, etc. Each set of twelve columns becomes a single column.

Members
HHID | ID | Age | Sex | Married | EduLvl | EduYear
1 | 1 | 34 | male | married | primary | 2
1 | 2 | 24 | female | married | none | 0
1 | 3 | 3 | female | | none | 0
1 | 4 | 1 | male | | none | 0
2 | 1 | 30 | male | married | secondary | 8
2 | 2 | 22 | female | married | secondary | 12
2 | 3 | 2 | female | | none | 0
2 | 4 | 0 | female | | none | 0
3 | 1 | 42 | male | married | secondary | 14
3 | 2 | 36 | female | married | secondary | 12
3 | 3 | 11 | male | | secondary | 6
3 | 4 | 4 | male | | none | 0
4 | 1 | 40 | male | married | secondary | 6
4 | 2 | 39 | female | married | none | 0
4 | 3 | 11 | male | | secondary | 6
4 | 4 | 6 | female | | primary | 1
4 | 5 | 4 | male | | none | 0
5 | 1 | 27 | male | married | secondary | 12
5 | 2 | 16 | female | married | secondary | 9

It is worth noting that the original spreadsheet had to provide space for twelve members in every household. For households with fewer than twelve, that means leaving a lot of blank space. If a household had more than twelve, we would have to add another column for each type of information we wanted to collect. With this spreadsheet, there is no wasted space. Each household gets however many rows it needs. The first household only has four members, so it only has four rows. If a household had twenty members, there would be no issue giving it twenty rows.

This new format also means that we now have options for analyzing and otherwise working with this information.
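
For example, here is a sketch of how the two sheets could be combined again for analysis in Python with pandas (an assumed tool; the column names follow the tables above, and the file names are hypothetical):

```python
import pandas as pd

households = pd.read_csv("households.csv")  # HHID, Year, Month, Day, RegionCode, Livestock, Members
members = pd.read_csv("members.csv")        # HHID, ID, Age, Sex, Married, EduLvl, EduYear

# Member-level questions are now easy, e.g. average age within each household:
print(members.groupby("HHID")["Age"].mean())

# Household-level information can be attached to each member with a join on HHID:
combined = members.merge(households, on="HHID")
print(combined.groupby("RegionCode")["Age"].mean())  # average member age per region
```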

Splitting Columns

Going back to the households data: when we looked at the option of creating a column for each specific tag in the Skyrim Museum sample, I mentioned that we would see an example where it does make sense to split one column into several, rather than adding another sheet or table.

Households
HHID | Livestock
1 | 1 poultry
2 | 2 cows, 1 ox
3 | 5 poultry, 2 cows, 2 goats, 1 ox
4 | 3 cows, 1 ox
5 | 1 goat
6 | 1 ox
7 | 1 goat, 1 ox
8 | 2 goats, 1 ox
9 | 2 oxen
10 | 1 cow, 2 goats

The livestock column in the households table above is clearly breaking our rule of one value per cell. If we followed the same solution we did for tags with the Skyrim data and household members for the survey data, we would need to create a new sheet called Livestock. In this sheet, we would record household ID, type of livestock, and number. While this would certainly work, we are only collecting data for four types of livestock: Poultry, cows, goats, and oxen. It is therefore entirely reasonable to split the livestock column into four.

Better Households
HHID | Poultry | Cows | Goats | Oxen
1 | 1 | 0 | 0 | 0
2 | 0 | 2 | 0 | 1
3 | 5 | 2 | 2 | 1
4 | 0 | 3 | 0 | 1
5 | 0 | 0 | 1 | 0
6 | 0 | 0 | 0 | 1
7 | 0 | 0 | 1 | 1
8 | 0 | 0 | 2 | 1
9 | 0 | 0 | 0 | 2
10 | 0 | 1 | 2 | 0

Now we have columns for each type of livestock, with each cell containing the number of that type of livestock owned by that household.

Important Note: This solution assumes that we have collected data about all four types of livestock for each household, and simply did not record any zeros. That is, when a cell said "2 cows, 1 ox" it meant "0 poultry, 2 cows, 0 goats, 1 ox". If we are working with a dataset that already exists, this is probably the case, but we cannot assume that it is. Maybe in one household it was not possible to record the number of poultry. Assuming such unknowns (or "null" values) are zeros can cause problems for our analysis, as we will discuss shortly.

This example also illustrates one reason to never have more than one value per cell: with four separate columns, it is very natural to record the zeros, whereas it would have felt odd to specifically add "0 poultry" to the single column we had before.
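
If you ever need to do this kind of split programmatically rather than by hand, here is a rough sketch in Python with pandas, assuming the Livestock values always follow the "number animal" pattern shown above and that missing animals really are zeros (the caveat from the note above); the file name is hypothetical:

```python
import pandas as pd

# Map every spelling that appears in the data onto a standard column name.
NAMES = {"poultry": "Poultry", "cow": "Cows", "cows": "Cows",
         "goat": "Goats", "goats": "Goats", "ox": "Oxen", "oxen": "Oxen"}

def split_livestock(text):
    """Turn '2 cows, 1 ox' into {'Poultry': 0, 'Cows': 2, 'Goats': 0, 'Oxen': 1}."""
    counts = {"Poultry": 0, "Cows": 0, "Goats": 0, "Oxen": 0}
    for part in text.split(","):
        number, animal = part.strip().split(" ", 1)
        counts[NAMES[animal.strip().lower()]] = int(number)  # assumes unlisted animals are true zeros
    return counts

households = pd.read_csv("households.csv")  # hypothetical file with HHID and Livestock columns
better = households[["HHID"]].join(households["Livestock"].apply(split_livestock).apply(pd.Series))
print(better)
```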

Tricky Values: Dates and Nulls

We have now looked at three different datasets to examine the basic structure we want to give our information when organizing it in spreadsheets. There are, however, other issues that may come up when creating or working with spreadsheets. We will visit a few of the most important ones, starting with a brief discussion of how to deal with two types of tricky values: Dates and nulls.

Dates

One reason dates can be tricky is that they can be represented in a variety of ways. A few examples:

  • March 12
  • March 12, 2007
  • 03/12/07
  • 2007-03-12

When working with dates, we must therefore be as specific, clear, and consistent as possible.

Spreadsheet software, like Excel, can further complicate matters. Excel is very opinionated in how it deals with dates.

  • Excel removes leading zeros from day and month. This isn't a big deal, though I find it makes the dates a bit harder to read. Example: 01/04/2021 becomes 1/4/2021
  • Excel may show the date in a different format, visually, depending on its settings. No data is lost, but it may look confusing. Example: 01/04/2021 becomes 1/4/2021 but is displayed as 1/4/21, Jan-21, or in some other format
  • Excel automatically adds the current year to input it perceives as a date without a year. Obviously if the date did not occur in the same year it was entered into Excel, this would be a problem. Example: 01/04 becomes 1/4/2021 (in the year 2021)
  • Excel automatically adds a day to input it perceives as a date without a day. By default, the day chosen is the first. Example: 01/2021 becomes 1/1/2021

While the first two may be minor annoyances or may not even bother you at all, the last two are highly problematic. If your input is not a date, but just resembles one, the solution (in Excel, at least) is to put a single quotation mark (an apostrophe) in front of the value, or to specifically change the Number Format for the cell(s) to Text. The latter is generally the better option since you don't enter extra characters, but make sure to change the Number Format before you enter data. For example, if I enter 08/07/2019 and then change the Number Format to Text, the cell value becomes 43684.

Number Format dropdown in Excel. Text is highlighted.

If your input is a date, the safest option is to have three separate columns, one each for year, month, and day. Whether or not you also store the full date in another column, this approach eliminates ambiguity and makes the data easier to analyze. After all, a date could be considered its own entity, or it could be considered three values, and we don't want to store multiple values per cell. When possible, I recommend having a date column plus the three separate columns, since some analysis software can work directly with dates, but you don't want to have to rely on the date column.

Dates
Date | Year | Month | Day
8/7/19 | 2019 | 8 | 7
8/12/19 | 2019 | 8 | 12
11/2/19 | 2019 | 11 | 2
1/1/20 | 2020 | 1 | 1
1/14/20 | 2020 | 1 | 14
2/5/20 | 2020 | 2 | 5
4/3/20 | 2020 | 4 | 3
4/10/20 | 2020 | 4 | 10
4/25/20 | 2020 | 4 | 25

Some of the dates in the date column shown here could be ambiguous (especially out of context), but since there are also columns for the year, month, and day, that is not a problem. Providing three columns becomes especially important if you are working with dates that you only know part of. If you just know the month and the year, you cannot store it in a date column that uses a format with the day, because you do not know the day. If you are using three separate columns, you can add the year and month to the appropriate columns and treat the day as a null value.
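
If you end up with a single, consistently formatted date column and want to derive the three separate columns programmatically, here is a sketch in Python with pandas (my assumption; the file name is hypothetical, and the format string matches the month/day/year style shown above):

```python
import pandas as pd

# Parse the Date column with an explicit format so nothing is guessed,
# then derive the unambiguous Year/Month/Day columns from it.
dates = pd.read_csv("dates.csv")                           # hypothetical file with a Date column
parsed = pd.to_datetime(dates["Date"], format="%m/%d/%y")  # e.g. 8/7/19 -> 2019-08-07
dates["Year"] = parsed.dt.year
dates["Month"] = parsed.dt.month
dates["Day"] = parsed.dt.day
print(dates)
```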

Null Values

Sometimes you will have missing data. Maybe something wasn't collected, or wasn't applicable, or it had to be thrown out for some reason. There isn't normally any need to specify the reason - whether the interviewer forgot to ask the question, the interviewee was unable to answer it, or something else happened entirely, it doesn't make much (or any) difference. The data is still missing.

What is the difference between zero, unknown, and null?

Zero is a known quantity. It is just as valid a piece of information as any other number.

Unknown and null are a bit more similar. In many cases they will amount to the same thing. Occasionally, however, you may find a situation where the fact that a value is unknown is meaningful. For example, consider the Skyrim Museum sample we looked at earlier.

Skyrim Museum Sample
Name | Creator | Date | Original Format | Tags
Rahgot | Unknown dragons | Merethic Era | Game Object | armor,dragons,enchanted,Nordic
Golden Claw | Unknown | Unknown | Game Object | dragons,Nordic,puzzles
Mehrunes' Razor | Mehrunes Dagon | Unknown | Game Object | Daedric,dagger,enchanted,one-handed,weapon
Auriel's Bow | Anuiel | Dawn Era | Game Object | bow,enchanted,prophecy,two-handed,weapon
Wuuthrad | Yngol | Merethic Era | Game Object | axe,Nordic,two-handed,weapon

Some of these artifacts have no known date of creation or creator. It is relevant that the date and creator have been researched and are truly unknown. Thus, these have been included as data. If, instead, I had not yet managed to research all of the dates or creators, the missing values would be recorded as nulls. They might also be unknown, or they might have a value that I simply had not recorded.

As for how to deal with null values, I have included a slightly modified chart I found in one of the Data Carpentry lessons (Data Organization in Spreadsheets for Social Scientists: Formatting Problems) that discusses a number of possible values.

Dealing with Null Values
Null Value | Problems | Recommendation
0 | Indistinguishable from a true zero | Never use
Blank | Hard to distinguish values that are missing from those overlooked on entry. Hard to distinguish blanks from spaces, which behave differently. | Best option
-999 or 999 | Not recognized as null by many programs without user input. Can be inadvertently entered into calculations. | Avoid
NA or na | Can also be an abbreviation (e.g. North America); can cause problems with data type (turn a numerical column into a text column). NA is more commonly recognized than na. | Good option
N/A | An alternate form of NA, but often not compatible with software | Avoid
NULL | Can cause problems with data type | Good option
None | Uncommon. Can cause problems with data type | Avoid
No data | Uncommon. Can cause problems with data type; contains a space | Avoid
Missing | Uncommon. Can cause problems with data type | Avoid
- or + or . | Uncommon. Can cause problems with data type | Avoid

The first one is very important. Never use zero as a null value. Ever. Zero is data. For example, if I am recording the number of years of education a person has received, zero is a legitimate answer. If I also use zero to record any missing values, I am no longer indicating that those values are missing.

I'll skip down to the other numeric option: -999 or 999. This is also not a good idea. It may not be as confusing as zero - I am definitely not going to believe that someone had -999 or 999 years of education - but it is not recognized as null by most software and may mess up any calculations or analysis you attempt.

The second option here is the one you generally want to go with. For clarification, this option means leaving the cell entirely blank, rather than writing the word "Blank" in the cell. This is a safe option, as empty cells are treated as missing by most analysis software and are widely recognized as null. It also makes sense: the value is missing, so nothing is recorded.

If there is some reason that you can't leave a cell blank, for example, if you are using software that doesn't recognize blanks as nulls, NA or NULL (all caps for both) are the other decent options. However, leaving the cell blank is normally best.
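
To see why the choice matters, here is a small sketch in Python with pandas (my own example, not part of the original lesson) showing how a true blank and a -999 "null" behave differently in a simple calculation:

```python
import pandas as pd
import numpy as np

# Years of education for five people; the third value is genuinely missing.
good = pd.Series([2, 12, np.nan, 0, 8])  # blank cell -> NaN, recognized as missing
bad = pd.Series([2, 12, -999, 0, 8])     # -999 used as a "null" code

print(good.mean())  # 5.5   -> the missing value is simply left out
print(bad.mean())   # -195.4 -> the fake null silently poisons the calculation
```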

Consistency

Dates and nulls are some of the trickiest values you are likely to regularly encounter, but all values can become tricky without consistency. When you are organizing data, it can be useful to define exactly what information will be accepted in each column. For certain types of text data, you may want to create a standardized vocabulary.

A standardized vocabulary is a pre-defined list of words or phrases that are acceptable inputs for a given column. If you have heard the term "controlled vocabulary" before, this is a very similar concept.

Let's consider an example. I want to know the mythological creature preferences of Texans and Oklahomans, so I have people fill out a survey indicating their state and their favorite mythological creature. After collecting the results, I analyze them to find a favorite creature for Texas and for Oklahoma. I find that Texans prefer unicorns, while Oklahomans prefer dragons. Here is a section of the dataset (the full dataset can be found on the OSF page).

Mythical Creature Survey
State | FavoriteCreature
Oklahoma | dragon
OK | unicorn
TX | dragon
Texas | unicorn
OK | dragon
OK | dragon
Oklahoma | dragon
Texas | unicorn
Texas | dragon
OK | unicorn
Oklahoma | dragon
Texas | dragon
Oklahoma | dragon
OK | unicorn

The results for the full survey are as follows:

Myth Survey Results
State | Dragon | Unicorn
Oklahoma | 7 | 4
Texas | 4 | 6

Unfortunately, there is a problem with this dataset: I forgot to standardize my states. In the state column, sometimes the state is spelled out, and sometimes the state code is used. My analysis, however, assumed that the states would be spelled out, so I missed all the results for OK and TX. Let’s see how that changes my conclusions.

Myth Survey - Full Results
State | Dragon | Unicorn
Oklahoma | 7 | 4
Texas | 4 | 6
OK | 4 | 5
TX | 4 | 1
Oklahoma + OK | 11 (7 + 4) | 9 (4 + 5)
Texas + TX | 8 (4 + 4) | 7 (6 + 1)

For the spelled out states, more Oklahomans prefer dragons, and more Texans prefer unicorns. For the state codes, the reverse is true. If I combine all Oklahoman responses and all Texan responses, I find that a preference for dragons knows no state boundaries.

This is a silly example, but hopefully it illustrates the perils of not standardizing what values a given column will accept.
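
As a sketch of how a fix might look after the fact, here is the clean-up step in Python with pandas (my assumption; the file name is hypothetical, and the mapping covers only the spellings that appear in this survey):

```python
import pandas as pd

survey = pd.read_csv("creature_survey.csv")  # hypothetical file with State and FavoriteCreature columns

# Without standardization, "OK" and "Oklahoma" are counted as different states.
print(survey["State"].value_counts())

# Map every accepted spelling onto one standard value, then recount.
standard = {"OK": "Oklahoma", "Oklahoma": "Oklahoma", "TX": "Texas", "Texas": "Texas"}
survey["State"] = survey["State"].map(standard)
print(pd.crosstab(survey["State"], survey["FavoriteCreature"]))
```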

It is useful to define the accepted values for a given column somewhere in the notes you keep with your data, but you can go a step further. Both Excel and Google Sheets (and presumably other spreadsheet programs) provide ways to enforce these values, which can help prevent mistakes and typos. This feature is called data validation.

Data Validation

Data validation is an extremely useful feature. You can use it to specify a set of accepted text values, as mentioned above, but it can do a lot more, too. For example, you could use data validation to ensure that all entries are positive numbers less than 100. Since the focus of this tutorial is not how to use features in specific spreadsheet programs, I will not go into the details here, but I have included links with more information below.
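
The same idea can also be applied after the fact to data you receive from someone else. Here is a minimal sketch in Python with pandas (my own example, not the Excel or Google Sheets feature itself; the file name is hypothetical) that flags values falling outside an accepted list:

```python
import pandas as pd

survey = pd.read_csv("creature_survey.csv")  # hypothetical file with a State column

# The "validation rule": only these two spellings are acceptable.
ALLOWED_STATES = {"Oklahoma", "Texas"}

# Flag every row whose State value falls outside the accepted list.
invalid = survey[~survey["State"].isin(ALLOWED_STATES)]
if not invalid.empty:
    print("Rows with unexpected State values:")
    print(invalid)
```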

Microsoft support has information on using data validation in Excel for Windows, Mac, or Web.

Data validation in Google Sheets will be reasonably similar. The linked instructions are for creating an in-cell dropdown with data validation. For other types of data validation, follow steps 1 and 2, but choose a different option instead of "List from a range."

This is a great tool, and I recommend playing around with it on your preferred spreadsheet editor.

Final Tips

Finally, here are a few miscellaneous tips that may make your life easier.

  1. Keep a separate copy of your raw data. If you need to make modifications or apply fancy formatting, do so using a copy. This ensures you will always have a clean source file to work with, not to mention that you will have a backup in case something goes wrong.
  2. Store data as CSVs. This isn't essential, but if you are keeping a separate copy of your raw data, a CSV format is a good way to store it. It is an especially good format for sharing data.

    CSV stands for comma separated values. While your computer will automatically try to open a CSV file using spreadsheet software, which is very convenient for viewing it, ultimately a CSV file is a text file. Please note: If you have multiple sheets in a spreadsheet workbook, you will need to store each one as a separate CSV file (one way to automate that conversion is sketched after this list).

  3. In general, be wary of what formatting you apply. While some formatting can be very useful, such as changing font size to make the text easier to read, there are two things to keep in mind.
    1. CSVs do not store formatting, so make sure you are not using formatting to convey information.
    2. Avoid destructive formatting, like merging cells. This kind of formatting can cause problems saving and working with the spreadsheet in the future.
  4. When naming files, columns within a spreadsheet, or anything else, consider using a naming convention system. There isn't one right way to do this - the primary goal is clarity and consistency. If you don't have one, I recommend considering this naming convention system.
  5. Freeze the top row of your spreadsheet so you don't have to scroll up and down between the column headers and the rest of the data.
    1. Freeze rows in Excel
    2. Freeze rows in Google Sheets
  6. When possible, learn the keyboard shortcuts for operations you tend to use a lot. There are a ton of shortcuts (I've provided links to lists for Excel and Google Sheets), so it isn't practical to learn all of them, but learning the ones you will make the most use of can save you time and energy. I have definitely made good use out of keyboard shortcuts for selecting an entire row or column, deleting a row or column, and switching between spreadsheet tabs.
    1. Keyboard shortcuts in Excel
    2. Keyboard shortcuts in Google Sheets
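
Returning to tip 2: if you have a workbook with several sheets and want to save each one as its own CSV, here is a sketch in Python with pandas (my assumption; the workbook name is hypothetical, and reading .xlsx files also requires the openpyxl package):

```python
import pandas as pd

# Read every sheet in the workbook at once (sheet_name=None returns a dict of DataFrames),
# then write each sheet out as its own CSV file named after the sheet.
sheets = pd.read_excel("survey.xlsx", sheet_name=None)
for name, sheet in sheets.items():
    sheet.to_csv(f"{name}.csv", index=False)
```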

You can find all of the curriculum materials for this tutorial (and associated workshop) on the Spreadsheets for Humanities OSF page.

If you are interested in learning more about organizing and working with data, here are a few recommendations:
