EU Data Retention: A really big pile of logs

Legalize it: Keeping logs of all voice and digital communications has been mandatory in the EU under Directive 2006/24/EC, which forces member states to enact domestic laws requiring local service providers to comply. Whether this is good or bad for us is another discussion, but it sure is not good for carriers and ISPs.

The data to be retained are primarily call records from phone switches and MSCs (whom you called, when, from which location, for how long, your mobile phone ID and so on), emails sent and received (headers only, no content) and, in some cases, visited URLs, plus related data such as the personal information needed to bind your phone number or IP address to your real name and home address. This is the data law enforcement authorities need to track you down if you do really bad, bad things, Winston Smith.

The trouble is that such data are produced in massive quantities. Each phone call generates one or two CDRs (call detail records), each a few hundred bytes long. Each email leaves a few lines in a log file, and so on. Multiply these by the number of subscribers of a carrier or ISP and you get figures on the order of a few gigabytes per day. All these data must be stored in a safe place so that when the Law knocks on your door and requests the whereabouts of a Bad Guy, the service provider can deliver all relevant information within a few days. Now, try and run something like:

$ gzcat logs/from_everywhere/*.gz | grep -F "$BAD_GUY_PHONE" | awk -v OFS=, '{print $7, $23, $12}' > /tmp/calls.csv
$ gzcat hlr_logs/from_everywhere/*.gz | grep -F "$BAD_GUY_IMSI" | awk -v OFS=, '{print $1, $4, $32}' > /tmp/loc.csv
$ gzcat crm_export/*.gz | grep -F "$BAD_GUY_NAME" | awk -v OFS=, '{print $3, $4, $8, $23, $7}' > /tmp/info.csv

on gigabytes of (compressed) data, then import the CSV files into Excel to try to correlate them and produce some meaningful information for the authorities… Excel will probably explode before your brain does.
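For a sense of scale, here is a rough back-of-the-envelope estimate of the volumes involved. The subscriber count and calls per day are made-up assumptions for illustration only; the CDR figures follow the "one or two CDRs, a few hundred bytes" rule of thumb.

```python
# Back-of-the-envelope estimate of raw CDR volume.
# All inputs are illustrative assumptions, not measurements from a real carrier.

def daily_cdr_bytes(subscribers, calls_per_subscriber, cdrs_per_call, bytes_per_cdr):
    """Return the raw (uncompressed) CDR volume generated per day, in bytes."""
    return subscribers * calls_per_subscriber * cdrs_per_call * bytes_per_cdr

daily = daily_cdr_bytes(
    subscribers=5_000_000,    # assumed subscriber base
    calls_per_subscriber=4,   # assumed calls per subscriber per day
    cdrs_per_call=1.5,        # "one or two CDRs" per call
    bytes_per_cdr=300,        # "a few hundred bytes" per CDR
)
yearly = daily * 365          # assumed one-year retention period

print(f"per day:  {daily / 1e9:.1f} GB")
print(f"per year: {yearly / 1e12:.1f} TB")
```

And that is phone calls only, before email logs, location updates and subscriber data are added on top.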

The question is, is there any cool software out there that can automate this process? Let’s do a 3-minute analysis.

The lifecycle of retained call data looks like this: First, data are collected from all sources, sanitized and ingested into a safe data repository. After a predefined expiration period (say one year), information should be automatically expunged from the database (minus the records that are under investigation). At any time, the system should produce the information required by law enforcement authorities in a timely and accurate manner, without direct manual intervention on the data. Data should be archived, protected (encrypted) and immune to alteration of any kind.
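The ingest / hold / expunge part of this lifecycle can be sketched in a few lines. This is only a toy model using Python's built-in `sqlite3`; the table layout, the `legal_hold` flag and the function names are all hypothetical, and a real system would add the encryption and tamper-proofing that this sketch leaves out.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=365)  # assumed retention period ("say one year")

def open_store():
    """Create a toy in-memory store (hypothetical schema, no encryption)."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE cdr ("
        " id INTEGER PRIMARY KEY,"
        " caller TEXT NOT NULL,"
        " callee TEXT NOT NULL,"
        " started_at INTEGER NOT NULL,"          # Unix epoch seconds, UTC
        " legal_hold INTEGER NOT NULL DEFAULT 0)"
    )
    return db

def ingest(db, caller, callee, started_at):
    """Store one (already sanitized) call record."""
    db.execute(
        "INSERT INTO cdr (caller, callee, started_at) VALUES (?, ?, ?)",
        (caller, callee, int(started_at.timestamp())),
    )

def place_hold(db, number):
    """Flag every record involving `number` so that purging skips it."""
    db.execute(
        "UPDATE cdr SET legal_hold = 1 WHERE caller = ? OR callee = ?",
        (number, number),
    )

def purge_expired(db, now):
    """Delete records older than RETENTION unless they are under investigation."""
    cutoff = int((now - RETENTION).timestamp())
    cur = db.execute(
        "DELETE FROM cdr WHERE started_at < ? AND legal_hold = 0", (cutoff,)
    )
    return cur.rowcount

now = datetime(2010, 1, 1, tzinfo=timezone.utc)
db = open_store()
ingest(db, "100", "200", now - timedelta(days=400))  # expired
ingest(db, "300", "400", now - timedelta(days=400))  # expired, but held below
ingest(db, "100", "500", now - timedelta(days=10))   # still within retention
place_hold(db, "300")

deleted = purge_expired(db, now)
remaining = [row[0] for row in db.execute("SELECT caller FROM cdr ORDER BY id")]
print(deleted, remaining)
```

The point of the hold flag is that expiry can run unattended every night without anyone having to remember which records an investigation still needs.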

What kind of software would do the job? Certainly not conventional relational databases. Importing a few gigabytes every day into your Oracle database will drive your DBA insane and leave the database doing nothing more than updating indexes and taking up disk space, let alone the fact that you need an epic disk array to handle the load. What about using your Security Information Management application? Well, SIM can do a good job of finding suspicious security events in real time from your antivirus, IPS and firewall logs, but it cannot handle the massive daily data volume and the accumulated information. A distributed cloud database? Maybe, if you are Google or Amazon…

Actually, there is software that is built for this job. It all starts with the database. What we need here is a database that can support complex queries involving joins across a number of tables, that is very efficient with read-only transactions, can talk plain old SQL and can ingest tons of data in a flash. On top of this database, you need an application that can mediate and sanitize data, implement a query and data retrieval interface that leaves out human intervention, and produce reports tailored to the needs of state authorities. The end result is a compact system that utilizes low-cost commodity storage (SATA drives) and a 2- or 4-way x86 server for data ingestion and retrieval, that is rated at ingesting ~30GB of data per day and at the same time satisfies all requirements for archiving, data encryption, compression and retrieval.
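To illustrate what "complex queries involving joins" buys you over grep, awk and Excel, here is a minimal sketch that correlates subscriber info, call records and location updates in a single SQL query. It uses Python's built-in `sqlite3` purely as a stand-in for the kind of database described above; the schema, the column names and the sample rows are invented for the example.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE subscribers (msisdn TEXT, imsi TEXT, name TEXT);
    CREATE TABLE calls (caller TEXT, callee TEXT, started_at INTEGER, duration_s INTEGER);
    CREATE TABLE locations (imsi TEXT, cell_id TEXT, seen_at INTEGER);
""")

# Invented sample data standing in for CRM exports, CDRs and HLR logs.
db.executemany("INSERT INTO subscribers VALUES (?, ?, ?)", [
    ("555-0100", "imsi-1", "Winston Smith"),
    ("555-0200", "imsi-2", "Julia"),
])
db.executemany("INSERT INTO calls VALUES (?, ?, ?, ?)", [
    ("555-0100", "555-0200", 1000, 60),
    ("555-0100", "555-0300", 2000, 30),
    ("555-0200", "555-0100", 1500, 10),
])
db.executemany("INSERT INTO locations VALUES (?, ?, ?)", [
    ("imsi-1", "cell-A", 900),
    ("imsi-1", "cell-B", 1800),
    ("imsi-2", "cell-C", 1400),
])

# One query instead of three CSV files: every call made by the named
# subscriber, with the last cell the handset was seen in before the call.
REPORT_SQL = """
    SELECT s.name, c.callee, c.started_at, c.duration_s,
           (SELECT l.cell_id
              FROM locations l
             WHERE l.imsi = s.imsi AND l.seen_at <= c.started_at
             ORDER BY l.seen_at DESC
             LIMIT 1) AS cell_id
      FROM subscribers s
      JOIN calls c ON c.caller = s.msisdn
     WHERE s.name = ?
     ORDER BY c.started_at
"""

def calls_report(db, name):
    """Return (name, callee, started_at, duration_s, cell_id) rows for one subscriber."""
    return db.execute(REPORT_SQL, (name,)).fetchall()

report = calls_report(db, "Winston Smith")
for row in report:
    print(row)
```

On three toy tables this is trivial; the hard part, and the reason for a purpose-built database, is getting the same query to return in reasonable time over a year of accumulated records.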