# Who was the string puller? - Data quality and aggregation functions

A week later the local police found the thieves and had them arrested. The picture was already sold on the black market two days after the coup, but for a reasonable chance for mitigation of punishment they disclosed the buyers and the picture could be brought back to the museum where it belongs.

## Data quality

But for the court trial, further evidence needs to be collected, and the question who actually initiated the fraud is open. For this, one may conclude, that the original initiator kept more money than the henchmen. So, have a look at the bank data of our thieves.

29.) Which tables are containing bank data?

30.) How are these tables connected with each other and with the names table? Which primary/foreign keys are there? Hint: Often it is a good idea to use pen and paper to draw a diagram denoting the connections. One possible way to to this is to represent tables by a list of their columns and to draw a line for every primary key-foreign key relationship. Ask a coach for details.

31.) Which columns can you use to get the connection between names and the account table? Hint: Again, using pen and paper might be very helpful.

32.) Get the account numbers of our thieves.

Depending on the actually query, you might be surprised: Philipp does not have a bank account? But you are almost sure he has! This is a very common situation, that you do not get the results you expect. There are some possibilities:

1. The SQL-query was not formulated in a suitable way
2. The data does not exists
3. The data is corrupt

Our experience with large databases is that the most common reason is the first one, followed by the third reason. So unclean data is a common issue. What can be done in this situation? Try not to compare the exact names with `=`, but try the `like`-comparator.

33.) Get a list of account details taking into account the different ways to spell Sarah or Philipp.

To deal with unclean data one sometimes has to be creative and to try different things to find the data one is looking for. Once again: If there is a primary key/foreign key connection, use it! These columns have extra checks by the database, so the probability that these are corrupt is much, much lower than for other columns. If there are no such primary-key columns, it is usually a bad database design which should be a quite rare exception.

## Who got the money? - Aggregate Functions

If you want to calculate a quantity from several values from a column you use aggregate functions, e.g. average values or sums. One aggregate function we have already learned about: The `count` function. Important other functions are:

• `avg()` Returns the average value
• `count()` Returns the number of rows
• `first()` Returns the first value
• `last()` Returns the last value
• `max()` Returns the largest value
• `min()` Returns the smallest value
• `sum()` Returns the sum

Even if it does not make much sense, we compute the average of all phone numbers:

`select avg(phone_number) from phone_contract;`

34.) What is the sum of all transactions of Sarah?

35.) What is the sum of all transactions of Sarah between October 23 and October 25?

We could do now the same for Philipp. But we want to learn a few more useful commands first. SQL has a nice command for cases like the one, where you want to use an aggregate function for subsets according to values in another column:

```select column_name, aggregate_function(column_name)
from table_name
where column_name operator value
group by column_name;
```

For instance, if you want to know the sum of all transaction per account number, you can use

```select account_number_fk, sum(amount)
from bank_transaction
where date > "2014-10-22" and date < "2014-10-26"
group by account_number_fk;
```

Note that `group by` can also be used for more than one column.

These information would have provided an alternative way to identify suspected persons: Anyone who has gotten more than, for instance 20.000€, during the time period in question in their bank account would have been a suspect. In this case, we want to formulate a condition on the result of an aggregated function. Here, we cannot use a `where` clause, instead we use the `having` clause:

```select column_name, aggregate_function(column_name)
from table_name
where column_name operator value
group by column_name
having aggregate_function(column_name) operator value;
```

36.) Get a list of account numbers and sums of transactions for all accounts where the amount is greater than 10.000€ in the time in question. Who is probably the initiator of the crime?

Have you found the criminal mastermind? If yes, you might want to learn about some advanced topics.

Solutions

Back to start