Learning Objectives
Following this assignment students should be able to:
- implement quality control for data entry in spreadsheets
- import and create data in an SQL database
- understand the basic query structure of SQL
- execute SQL commands to select, sort, group, and aggregate data
Reading
-
Introduction
-
Basic Queries
Exercises
-- Data entry validation in Excel --
You’re starting a new study of small mammals at the NEON site at Ordway-Swisher. Create a spreadsheet in Excel for data entry. It should have four columns: Year, Site, Species, and Mass.
Set the following data validation criteria to prevent any obviously wrong data from getting entered:
- Year must be an integer between 2015 and 2025.
- Site should be one of the following
A1
,A2
,B1
,B2
. - Species should be one of the following
Dipodomys spectabilis
,Dipodomys ordii
,Dipodomys merriami
. - Mass should be a decimal greater than or equal to zero but less than or equal to 500 since mass is measured in grams in this study and nothing bigger than half a kilogram will possibly fit into your Sherman traps. Change the error message on this validation criteria to explain why data is invalid and what the valid values are.
Save this file as
yourname_ordway_mammal_data.xlsx
.-- Importing Data --
Hand entering data is great if you’re collecting your own data and need to enter it yourself, but it’s a pretty terrible way to use already available data, especially if it’s more than a few dozen lines long. This example will walk you through how to get data that already exists into SQLite.
- Download the main table for the Portal LTREB mammal survey database. It’s kind of large so it might take a few seconds. This database is published as a Data Paper on Ecological Archives, which is generally a great place to look for ecology data.
- Create a new database by clicking on
New Database
in theDatabase
drop down menu. Select a file name, likeportal_mammals.sqlite
, and location. - Click on the
Import
icon. - Click on
Select File
and navigate to where you saved the data file and select it. - Select
CSV
. You’ll notice that you can also import from other SQL or modify theFields separated
orenclosed by
. You’ll want to make sure to selectFirst row contains column names
. - Click
OK
when it asks if you want to modify the data. - Name the table that you are importing into
Surveys
. - Identify the type for each field, using the
Data Type
drop-down menus. If it is not obvious if the data type is anINTEGER
orVARCHAR
for each variable, check the metadata. Important: if you specify the wrong data type it can cause some data to not be imported and/or prevent you from doing some kinds of data manipulations. - Select
recordID
as thePrimary Key
and clickOK
. - Click
OK
when it asks if you are sure you want to import the data. - Now import the plots, and species tables.
-- Creating Tables --
Create a new table in the Portal Mammals database called
FieldNotes
. The table should include information on the date of sampling that the notes apply to, designed in such a way as to allow it to be linked to the main table, and a large text field for entering notes.I would definitely recommend using VARCHAR for the notes field because you have no idea how extensive future notes might be and the notes will probably vary wildly in length.
[click here for output]-- Adding Records --
Add the following note to the
FieldNotes
table in the Portal database for the date04
/01
/1963
:"Just completed the April 1963 census of the site. The region is teeming with Dipodomys spectabilis. Using the time machine to conduct trapping prior to the start of the study is working out great!"
What’s a Dipodomys spectabilis?
[click here for output]-- Updating Records --
Add the following note to the
FieldNotes
table in the Portal database for the date10
/1
/2015
:"Vegetation seems to have returned to normal for this time of year. The landscape isn't exactly green, but there is a decent amount of plant activity and there should be enough food for the rodents to the winter"
As soon as you’ve added it you realized that this was the update for October 2014, not October 2015. Update the record so that it contains the appropriate value.
[click here for output]-- SELECT --
For this and many of the following problems you will create queries that retrieve the relevant information from the Portal small mammal survey database. As you begin to familiarize yourself with the database you will need to know some details regarding what is in this database in order to answer the questions. For example, you may need to know what species is associated with the two character species ID or you may need to know the units for the individual’s weight. This type of information associated with data is called metadata and the metadata for this dataset is available online at Ecological Archives.
- Write a query that displays all of the records for all of the fields (
*
) in the main table. Save it as a view namedall_survey_data
. - We want to generate data for an analysis of body size differences
between males and females of each species. We have decided that we
can ignore the information related to when and where the individuals
were trapped. Create a query that returns all of the necessary
information, but nothing else. Save this as
size_differences_among_sexes_data
.
- Write a query that displays all of the records for all of the fields (
-- WHERE --
A population biologist (Dr. Undomiel) who studies the population dynamics of Dipodomys spectabilis would like to use some data from Portal, but she doesn’t know how to work with large datasets. Being the kind and benevolent person that you are, write a query to extract the data that she needs. She wants only the data for her species of interest, when each individual was trapped, and what sex it was. She doesn’t care about where it was trapped within the site because she is going to analyze the entire site as a whole and she doesn’t care about the size of the individuals. She doesn’t need the species codes because you’re only providing her with the data for one species, and since she isn’t looking at the database itself the two character abbreviation would probably be confusing. Save this query as a view with the name
spectabilis_population_data
.Scroll through the results of your query. Do you notice anything that might be an issue for the scientist to whom you are providing this data? You should! Think about what you should do in this situation…
You decide that to avoid invoking her wrath, you’ll send her a short e-mail* requesting clarification regarding what she would like you to do regarding this complexity. Dr. Undomiel e-mails you back and asks that you create two additional queries so that she can decided what to do about this issue later. She would like you to add a query to the same data as above, but only for cases where the sex is known to be male, and an additional query with the same data, but only where the sex is known to be female. Save these as views with the names
spectabilis_population_data_males
andspectabilis_population_data_females
.*Short for elven-mail
[click here for output] [click here for output] [click here for output]-- ORDER BY --
The graduate students that work at the Portal site are hanging out late one evening drinking… soda pop… and they decide it would be an epically awesome idea to put together a list of the 100 largest rodents ever sampled at the site. Since you’re the resident computer genius they text you, and since you’re up late working and this sounds like a lot more fun than the homework you’re working on (which isn’t really saying much, if you know what I’m saying) you decide you’ll make the list for them.
The rules that the Portal students have come up with (and they did spend a sort of disturbingly long time coming up with these rules; I guess you just had to be there) are:
- The data should include the
species_ID
,year
, and theweight
. These columns should be presented in this order. - Individuals should be sorted in descending order with respect to mass.
- Since individuals often have the same mass, ties should be settled by
sorting next by
hind foot length
and finally by theyear of capture
.
Since you need to limit this list to the top 100 largest rodents, you’ll need to add the SQL command
[click here for output]LIMIT 100
to the end of the query. Save the final query as100_largest_individuals
.- The data should include the
-- DISTINCT --
Write a query that returns a list of the dates that mammal surveys took place at Portal with no duplicates. Save it as
[click here for output]dates_sampled
.-- Missing Data --
Write a query that returns the
[click here for output]year
,month
,day
,species_id
, andmass
for every record were there is no missing data in any of these fields. Save it asno_missing_data
.-- GROUP BY --
Using GROUP BY, write a query that returns a list of dates on which individuals of the species Dipodomys spectabilis (indicated by the
[click here for output]DS
species code) were trapped (with no duplicates). Sort the list in chronological order (from oldest to newest). Save it asdates_with_dipodomys_spectabilis
.-- COUNT --
Write a query that returns the number of individuals of all species combined (
[click here for output]total_abundance
) in each year, sorted chronologically. Include the year in the output. Save it astotal_abundance_by_year
.-- SUM --
Write a query that returns the number of individuals of each species captured in each year (
[click here for output]total_abundance
) and thetotal_biomass
of those individuals (the sum of the weight column). The units for biomass should be in kilograms. Include theyear
andspecies_id
in the output. Sort the result chronologically by year and then alphabetically by species. Save asmass_abundance_data
.