
How can I tell if all data has been ingested

Question

How can I tell if all data has been ingested?

Summary

This "How To" article provides the steps to verify whether your data has been fully ingested. A high-level overview of the steps is listed below:

  • Step 1: Verify CSV has been read by Flume (OPTIONAL)
  • Step 2: Verify count of raw events in Elasticsearch via Kibana
  • Step 3: Verify events consumed by Kafka
  • Step 4: Verify relations generated in HBase

NOTE: It is only possible to confirm that data for a historical (or static) dataset has been fully ingested. For streaming data, ingestion is continual and, as a result, does not have an end. You must also know the data type being ingested, in addition to the number of events in the dataset being ingested.

The following nodes will need to be accessed (via web or SSH to the respective nodes):

  • REPORTING (web)
  • STREAM (SSH)
  • ANALYTICS (SSH) 

Steps

NOTE: This information is only useful when ingesting CSV data using Flume

Step 1: Verify the CSV has been read by Flume (OPTIONAL)

If you are ingesting from CSV, Flume will mark files as read. Please note that this preliminary check only indicates that the CSV file has been read by Flume. Below are the steps to validate that a CSV file has been read by Flume:

  1. List the contents of the directory where your data is stored (e.g., /data/auth) using the following command:
    • EXAMPLE: ls /data/auth/
  2. Verify that “.COMPLETED” has been appended to the CSV filename:
    • EXAMPLE: auth_export_1515773640.csv.COMPLETED
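The check above can be sketched as a small shell snippet. It assumes Flume's behavior of appending ".COMPLETED" to consumed files as described above; /data/auth is the example path from this article, so substitute your own data directory.

```shell
# Report any CSV files that Flume has not yet marked as read.
# Flume appends ".COMPLETED" once it has consumed a file, so any file
# still ending in plain ".csv" is pending.
check_pending_csvs() {
  dir="$1"
  pending=$(find "$dir" -maxdepth 1 -name '*.csv' 2>/dev/null)
  if [ -z "$pending" ]; then
    echo "all CSV files read"
  else
    echo "pending:"
    echo "$pending"
  fi
}

check_pending_csvs /data/auth
```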

In Step 1, we gain an understanding of whether Flume has read a CSV file. However, this is only useful if you are performing a CSV data ingest.

Step 2: Verify count of raw events in Elasticsearch via Kibana

To begin our verification of whether ingest has completed, we will check the number of raw events ingested into Elasticsearch using Kibana. In Kibana, you can search, view, and interact with data stored in Elasticsearch indices.

  1. In a web browser, navigate to Reporting UI URL:
    • http(s)://<Reporting_Node_FQDN>/search
  2. Log in with the credentials for your tenant. The default Reporting Admin username/password is:
    • username: admin
    • password: password
  3. In Discover, click the ▼ to select the appropriate index name/pattern. Depending on the data type being ingested, the index name/pattern will differ.
    • EXAMPLE: If ingesting web proxy data, the index pattern will look like:
      • interset_webproxy_rawdata_<tid>
        • NOTE: If no index name/pattern is configured, Kibana will prompt to add one. Please reference Interset documentation for details on how this can be configured.
  4. In the top-right corner, where the time-range filter is defined, adjust the filter to represent the time-range reflected in your dataset.
  5. Verify that the count of raw events matches the number of events in your dataset.
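As a quick alternative to reading the count in Kibana, the raw-event count can also be fetched from the Elasticsearch _count API. This is a minimal sketch; it assumes Elasticsearch listens on its default HTTP port 9200 on the reporting node, which may differ in your deployment, and uses the web proxy index pattern from the example above with a placeholder tenant ID of 0.

```shell
# Build the Elasticsearch _count URL for a raw-data index.
# $1 = elasticsearch host:port, $2 = index name
build_count_url() {
  echo "http://$1/$2/_count"
}

# Example index for web proxy data, tenant 0 (substitute your own <tid>):
url=$(build_count_url "localhost:9200" "interset_webproxy_rawdata_0")
echo "$url"

# On the reporting node, fetch the count over the network:
# curl -s "$url"    # returns JSON containing the event count
```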

In Step 2, we gain an understanding of how many events have been ingested into Elasticsearch. 

Step 3: Verify events consumed by Kafka

By default, each Kafka topic is partitioned, and each partition contains an ordered, immutable sequence of events that is continually appended to. Each event in a partition is assigned a sequential ID number called the offset, which uniquely identifies it within the partition. Below are steps that check the offsets to see how many events have been consumed through the primary event topic and the consumer groups for Elasticsearch and HBase.

  1. SSH to the STREAM NODE(s) as the Interset User
    • ssh interset@<STREAM_NODE_FQDN>
  2. Type in the following command to check how many events have passed through the primary event topic:
    • /usr/hdp/current/kafka-broker/bin/kafka-run-class.sh kafka.tools.GetOffsetShell --time -1 --topic interset_<ds>_events_<did>_<tid> --broker-list <stream_node_FQDN>:6667
    • The response will return the number of events across the partitions. Below is an example:
      • {metadata.broker.list=<stream_node_FQDN>:6667,request.timeout.ms=1000,client.id=GetOffsetShell, security.protocol=PLAINTEXT}
        • interset_netflow_events_0_0:2:762309
        • interset_netflow_events_0_0:5:762309
        • interset_netflow_events_0_0:4:762309
        • interset_netflow_events_0_0:7:762309
        • interset_netflow_events_0_0:1:762309
        • interset_netflow_events_0_0:3:762309
        • interset_netflow_events_0_0:6:762309
        • interset_netflow_events_0_0:0:762309
    • Please make note of the sum of the partition offsets; in this example, the sum is 762,309 × 8 = 6,098,472.
  3. Type in the following command to check how many events have passed through the consumer group for Elasticsearch:
    • /usr/hdp/current/kafka-broker/bin/kafka-consumer-groups.sh --describe --zookeeper <master_node_FQDN>:2181 --group interset_<ds>_events_<did>_<tid>_es_group
    • The response will return the number of events across the partitions. Below is an example:
      • GROUP TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG OWNER
        • interset_netflow_events_0_0_es_group interset_netflow_events_0_0 0 762309 762309 0 interset_netflow_events_0_0_es_group_<master_node_FQDN>-1508775115951-a27e16ad-0
        • interset_netflow_events_0_0_es_group interset_netflow_events_0_0 1 762309 762309 0 interset_netflow_events_0_0_es_group_<master_node_FQDN>-1508775115951-a27e16ad-0
        • interset_netflow_events_0_0_es_group interset_netflow_events_0_0 2 762309 762309 0 interset_netflow_events_0_0_es_group_<master_node_FQDN>-1508775115951-a27e16ad-0
        • interset_netflow_events_0_0_es_group interset_netflow_events_0_0 3 762309 762309 0 interset_netflow_events_0_0_es_group_<master_node_FQDN>-1508775115951-a27e16ad-0
        • interset_netflow_events_0_0_es_group interset_netflow_events_0_0 4 762309 762309 0 interset_netflow_events_0_0_es_group_<master_node_FQDN>-1508775115951-a27e16ad-0
        • interset_netflow_events_0_0_es_group interset_netflow_events_0_0 5 762309 762309 0 interset_netflow_events_0_0_es_group_<master_node_FQDN>-1508775115951-a27e16ad-0
        • interset_netflow_events_0_0_es_group interset_netflow_events_0_0 6 762309 762309 0 interset_netflow_events_0_0_es_group_<master_node_FQDN>-1508775115951-a27e16ad-0
        • interset_netflow_events_0_0_es_group interset_netflow_events_0_0 7 762309 762309 0 interset_netflow_events_0_0_es_group_<master_node_FQDN>-1508775115951-a27e16ad-0
      • The sum of the current-offsets should equal the sum from step 2 (e.g., 6,098,472).
  4. Type in the following command to check how many events have passed through the consumer group for HBase:
    • /usr/hdp/current/kafka-broker/bin/kafka-consumer-groups.sh --describe --zookeeper <master_node_FQDN>:2181 --group interset_<ds>_events_<did>_<tid>_hbase_group
    • The response will return the number of events across the partitions. Below is an example:
      • GROUP TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG OWNER
        • interset_netflow_events_0_0_hbase_group interset_netflow_events_0_0 0 762309 762309 0 interset_netflow_events_0_0_hbase_group_<master_node_FQDN>-1508775138887-1bf158ab-0
        • interset_netflow_events_0_0_hbase_group interset_netflow_events_0_0 1 762309 762309 0 interset_netflow_events_0_0_hbase_group_<master_node_FQDN>-1508775138887-1bf158ab-0
        • interset_netflow_events_0_0_hbase_group interset_netflow_events_0_0 2 762309 762309 0 interset_netflow_events_0_0_hbase_group_<master_node_FQDN>-1508775138887-1bf158ab-0
        • interset_netflow_events_0_0_hbase_group interset_netflow_events_0_0 3 762309 762309 0 interset_netflow_events_0_0_hbase_group_<master_node_FQDN>-1508775138887-1bf158ab-0
        • interset_netflow_events_0_0_hbase_group interset_netflow_events_0_0 4 762309 762309 0 interset_netflow_events_0_0_hbase_group_<master_node_FQDN>-1508775138887-1bf158ab-0
        • interset_netflow_events_0_0_hbase_group interset_netflow_events_0_0 5 762309 762309 0 interset_netflow_events_0_0_hbase_group_<master_node_FQDN>-1508775138887-1bf158ab-0
        • interset_netflow_events_0_0_hbase_group interset_netflow_events_0_0 6 762309 762309 0 interset_netflow_events_0_0_hbase_group_<master_node_FQDN>-1508775138887-1bf158ab-0
        • interset_netflow_events_0_0_hbase_group interset_netflow_events_0_0 7 762309 762309 0 interset_netflow_events_0_0_hbase_group_<master_node_FQDN>-1508775138887-1bf158ab-0
      • The sum of the current-offsets should equal the sum from step 2 (e.g., 6,098,472). If the values do not match, you will need to wait until ingest completes and re-check.
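The per-partition sums above can be computed with a small awk helper rather than by hand. Each GetOffsetShell output line has the form topic:partition:offset, so the offset is the third colon-separated field; the sample output from step 2 is used here as input, and in practice you would pipe the GetOffsetShell command directly into the helper.

```shell
# Sum the per-partition offsets from GetOffsetShell output lines
# of the form topic:partition:offset.
sum_offsets() {
  awk -F: '{ total += $3 } END { print total }'
}

# Using the sample output shown above:
printf '%s\n' \
  interset_netflow_events_0_0:0:762309 \
  interset_netflow_events_0_0:1:762309 \
  interset_netflow_events_0_0:2:762309 \
  interset_netflow_events_0_0:3:762309 \
  interset_netflow_events_0_0:4:762309 \
  interset_netflow_events_0_0:5:762309 \
  interset_netflow_events_0_0:6:762309 \
  interset_netflow_events_0_0:7:762309 | sum_offsets
# prints 6098472
```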

In Step 3, we gain an understanding of how many events have passed through Kafka.

Step 4: Verify relations generated in HBase

After we have confirmed that the same number of events have passed through Kafka, we can query HBase to verify the relations generated for the ingested dataset. To understand how Interset uses ingested data, it is important to explain the concept of Observed Entity Relation Minutely Counts (OERMC) and how it is used. At a high level, an OERMC is the number (count) of interactions (relation) between two entities at a given time. For example, the user joshua takes from project Y X number of times.

For a known dataset, the number of relations generated should always be the same. Please note that there is a 1:n mapping between events and relations; that is, a single event may generate many relations. Therefore, querying HBase for the number of relations is only helpful when ingesting the same dataset more than once, for example, when ingesting a given dataset across multiple tenants or systems. Below are the steps to verify the relations generated for the ingested dataset:

  1. SSH to the ANALYTICS NODE as the Interset User
    • ssh interset@<ANALYTICS_NODE_FQDN>
  2. Type in the following command to navigate to the /opt/interset/analytics/bin directory:
    • cd /opt/interset/analytics/bin
  3. Type in the following command to load the Phoenix Console:
    • ./sql.sh --action console --dbServer <server>
  4. Once the Phoenix console has loaded, run the following query:
    • SELECT COUNT(*) FROM OBSERVED_ENTITY_RELATION_MINUTELY_COUNTS where TID ='<tid>';
  5. The response will return the number of relations observed, example below:
    COUNT(1)
    42208690
     1 row selected (3.061 seconds)

In the example above, Interset has observed 42,208,690 relations. If we ingest the same dataset elsewhere, this should be the number of relations observed as well.
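When validating the same dataset on two systems, the comparison described above can be scripted. This is a simple sketch; the values passed in are the example count from this article, and in practice you would substitute the COUNT(1) results returned by the Phoenix query on each system.

```shell
# Compare the relation counts from two ingests of the same dataset.
# $1 = COUNT(1) from the first system, $2 = COUNT(1) from the second
compare_relation_counts() {
  if [ "$1" -eq "$2" ]; then
    echo "relation counts match: $1"
  else
    echo "relation counts differ: $1 vs $2"
  fi
}

# Example values from this article:
compare_relation_counts 42208690 42208690
# prints: relation counts match: 42208690
```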

Applies To

  • Interset 5.4.x or higher