Querying Data
Description
This guide provides example queries to analyze network flow data collected by Galileo. It assumes you have already set up Galileo and have access to the DuckDB command line interface. It covers various aspects of network traffic, including top talkers, protocols, applications, source and destination IPs, ports, TCP flags, flow duration, size, and count analysis.
Getting Started
To get started with querying the data collected by Galileo, you need to have the DuckDB command line tool installed. You can find the installation instructions for your operating system in the DuckDB documentation. Once you have DuckDB installed, you can start the command line interface by running the following command in your terminal:
$ duckdb
You can then run the example queries provided below to analyze the network flow data stored in the Parquet files generated by Galileo.
When executing queries on a local store, you can filter the scope of the queries using two mechanisms:
- Specify year, day, hour intervals by changing the path in the read_parquet() function of FROM clause.
To narrow the scope of your query replace
year
withmonth
,day
, orhour
as needed. For example, to limit the scope of a query to a specific month:
SELECT * FROM read_parquet('/var/flow/year=2025/month=6/*/*/*.parquet');
or day
SELECT * FROM read_parquet('/var/flow/year=2025/month=6/day=23/*/*.parquet');
- Specify the scope by changing the criteria in the WHERE clause.
To narrow the scope of your query replace
year
withmonth
,day
, orhour
as needed.
To query the entire dataset, you can use:
SELECT * FROM read_parquet('/var/flow/*/*/*/*/*.parquet') WHERE stime >= timezone('UTC', NOW()) - INTERVAL 1 MINUTE;"
or a combination of both methods, for example, to query the last 24 hours of data:
SELECT * FROM read_parquet('/var/flow/year=2025/month=6/*/*/*.parquet') WHERE stime >= timezone('UTC', NOW()) - INTERVAL 24 HOUR;
Example Queries
The examples queries below are designed to work with a local storage instance. These queries cover various aspects of network traffic, including top talkers, protocols, applications, etc.
If you are using cloud storage, i.e. MotherDuck or S3 compatible bucket, you can replace the read_parquet
function with the appropriate path to your cloud storage.
For example, if you are working S3 storage, you can use the following syntax:
SELECT * FROM read_parquet('s3://your-bucket-name/path/to/your/parquet/files/*/*/*.parquet')
Or, if you are working with MotherDuck:
ATTACH 'md:';
USE md_database_name;
SELECT
time_bucket(INTERVAL '1 day', stime) AS time_day,
COUNT(*) AS flows_per_day
FROM flow
GROUP BY time_day
ORDER BY time_day
LIMIT 10;
Queries
Flows Per Day
SELECT
time_bucket(INTERVAL '1 day', stime) AS time_day,
COUNT(*) AS flows_per_day
FROM read_parquet('/var/flow/year=2025/*/*/*/*.parquet')
GROUP BY time_day
ORDER BY time_day
LIMIT 10;
Top Talkers
SELECT
saddr AS source_address,
SUM(sbytes + dbytes) AS total_bytes,
SUM(sbytes) AS bytes_sent,
SUM(dbytes) AS bytes_received,
COUNT(*) AS flow_count,
AVG(sbytes + dbytes) AS avg_bytes_per_flow
FROM read_parquet('/var/flow/year=2025/*/*/*/*.parquet')
GROUP BY saddr
ORDER BY total_bytes DESC
LIMIT 10;
Top Protocols
SELECT
proto AS protocol,
COUNT(*) AS flow_count,
SUM(sbytes + dbytes) AS total_bytes,
AVG(sbytes + dbytes) AS avg_bytes_per_flow
FROM read_parquet('/var/flow/year=2025/*/*/*/*.parquet')
GROUP BY proto
ORDER BY total_bytes DESC
LIMIT 10;
Top Applications
SELECT
ndpi_appid AS application,
COUNT(*) AS flow_count,
SUM(sbytes + dbytes) AS total_bytes,
AVG(sbytes + dbytes) AS avg_bytes_per_flow
FROM read_parquet('/var/flow/year=2025/*/*/*/*.parquet')
GROUP BY ndpi_appid
ORDER BY total_bytes DESC
LIMIT 10;
Top Source IPs
SELECT
saddr AS source_ip,
COUNT(*) AS flow_count,
SUM(sbytes + dbytes) AS total_bytes,
AVG(sbytes + dbytes) AS avg_bytes_per_flow
FROM read_parquet('/var/flow/year=2025/*/*/*/*.parquet')
GROUP BY saddr
ORDER BY total_bytes DESC
LIMIT 10;
Top Destination IPs
SELECT
daddr AS destination_ip,
COUNT(*) AS flow_count,
SUM(sbytes + dbytes) AS total_bytes,
AVG(sbytes + dbytes) AS avg_bytes_per_flow
FROM read_parquet('/var/flow/year=2025/*/*/*/*.parquet')
GROUP BY daddr
ORDER BY total_bytes DESC
LIMIT 10;
Top Source Ports
SELECT
sport AS source_port,
COUNT(*) AS flow_count,
SUM(sbytes + dbytes) AS total_bytes,
AVG(sbytes + dbytes) AS avg_bytes_per_flow
FROM read_parquet('/var/flow/year=2025/*/*/*/*.parquet')
GROUP BY sport
ORDER BY total_bytes DESC
LIMIT 10;
Top Destination Ports
SELECT
dport AS destination_port,
COUNT(*) AS flow_count,
SUM(sbytes + dbytes) AS total_bytes,
AVG(sbytes + dbytes) AS avg_bytes_per_flow
FROM read_parquet('/var/flow/year=2025/*/*/*/*.parquet')
GROUP BY dport
ORDER BY total_bytes DESC
LIMIT 10;
TCP Flags Analysis
SELECT
saddr AS source_address,
daddr AS destination_address,
uflags[5] AS sreset,
uflags[6] AS dreset,
COUNT(*) AS flow_count,
FROM read_parquet('/var/flow/year=2025/*/*/*/*.parquet')
WHERE proto='tcp' and uflags[5]='R' or uflags[6]='r'
GROUP BY ALL
ORDER BY flow_count DESC
LIMIT 10;
Flow Duration Analysis
SELECT
saddr AS source_address,
daddr AS destination_address,
dur as duration,
AVG(dur) AS avg_flow_duration,
COUNT(*) AS flow_count
FROM read_parquet('/var/flow/year=2025/*/*/*/*.parquet')
GROUP BY ALL
ORDER BY avg_flow_duration DESC
LIMIT 10;
Flow Size Analysis
SELECT
saddr AS source_address,
daddr AS destination_address,
SUM(sbytes + dbytes) AS total_bytes,
AVG(sbytes + dbytes) AS avg_bytes_per_flow,
COUNT(*) AS flow_count
FROM read_parquet('/var/flow/year=2025/*/*/*/*.parquet')'
GROUP BY ALl
ORDER BY total_bytes DESC
LIMIT 10;
Flow Count Analysis
SELECT
saddr AS source_address,
daddr AS destination_address,
COUNT(*) AS flow_count,
AVG(sbytes + dbytes) AS avg_bytes_per_flow
FROM read_parquet('/var/flow/year=2025/*/*/*/*.parquet')
GROUP BY ALL
ORDER BY flow_count DESC
LIMIT 10;
Flow Count by Application
SELECT
ndpi_appid AS application,
COUNT(*) AS flow_count,
SUM(sbytes + dbytes) AS total_bytes,
AVG(sbytes + dbytes) AS avg_bytes_per_flow
FROM read_parquet('/var/flow/year=2025/*/*/*/*.parquet')
GROUP BY ndpi_appid
ORDER BY flow_count DESC
LIMIT 10;
Flow Count by Protocol
SELECT
proto AS protocol,
COUNT(*) AS flow_count,
SUM(sbytes + dbytes) AS total_bytes,
AVG(sbytes + dbytes) AS avg_bytes_per_flow
FROM read_parquet('/var/flow/year=2025/*/*/*/*.parquet')
GROUP BY proto
ORDER BY flow_count DESC
LIMIT 10;