Data Export
Overview
This document describes the files shared through Raw Data Export, which exports messages published
by users on Viafoura widgets, along with other information such as users and likes, on a daily
basis. The files are generated at approximately 6am UTC every day.
There are three main files shared by default:
- comments content
- user_information
- container_ids
There are a few optional files that we can provide upon request:
- moderation_assessments
- user_assessments
- poll_actions
- engage_time
This document is intended to give more context about the data contained in each column of
those files, and to provide information on how to access the data and connect information
across files.
Getting Access
We offer two standard options to enable access to raw data export in our S3 bucket:
- S3 access via client IAM Role: If the client could provide us with an AWS IAM role, we
can allow access to client specific S3 bucket in our data export account. Specifically, the
role would have permissions to list objects in the bucket ("s3:ListBucket") and download
each object ("s3:GetObject").
- SFTP: The client could access the bucket via SFTP, in which case we would require a
public key from the machine(s) that will access via SFTP.
More information would be shared with the client once everything has been set up on our end.
In most cases, daily files would be stored in the "RawData" folder and historical data would be
shared in the "HistoricalData" folder.
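As a rough illustration of the S3 option, the sketch below lists and downloads export files using the two permissions mentioned above. It assumes the client object is a boto3 S3 client created with the credentials of the IAM role that was granted access; the bucket name, the "RawData/" key prefix, and the helper names are hypothetical and would need to match what is shared during setup.

```python
import os


def list_export_keys(s3_client, bucket, prefix="RawData/"):
    """Return object keys under `prefix` (needs s3:ListBucket)."""
    keys = []
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent from a page when nothing matches.
        for obj in page.get("Contents", []):
            if not obj["Key"].endswith("/"):  # skip folder placeholder keys
                keys.append(obj["Key"])
    return keys


def download_exports(s3_client, bucket, keys, dest_dir="."):
    """Download each key into dest_dir (needs s3:GetObject)."""
    paths = []
    for key in keys:
        local_path = os.path.join(dest_dir, os.path.basename(key))
        s3_client.download_file(bucket, key, local_path)
        paths.append(local_path)
    return paths


# Hypothetical usage (bucket name is a placeholder):
#   import boto3
#   s3 = boto3.client("s3")
#   keys = list_export_keys(s3, "example-export-bucket")
#   download_exports(s3, "example-export-bucket", keys, dest_dir="exports")
```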
File Specifications
Daily files are generated based on raw data events collected on the previous day without
advanced processing. It is recommended that the client should create tables to store historical
data and update records as new data comes in as 1) some information could change over time,
such as username, and 2) some identifiers could appear on different days in different files,
making it hard to connect data across different files using just one day’s data.
In most of the files, each line is an action, uniquely identified by the event_uuid field. Actions capture interactions performed by actors, and those actors might be users, moderators, applications and APIs (among others). Users refer to registered users unless stated otherwise.
All timestamps are in UTC.
Note that not all data events are captured so not all information would be available in the files.
For example, an updated username would only show up in the user_information file if the user’s logged in status changes.
Files are plain text extractions with the following characteristics:
● Format: CSV
● Delimiter: ","
● Text qualifier: """
● Contains header: True
● Encryption: None
● File name: Each file name would have a prefix based on file creation time in UTC in the
format of "YYYYMMDD_HHMMSS_" like "20231005_060126_"
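The file name prefix can be used to order files chronologically before loading them. A minimal sketch, assuming the prefix is always exactly "YYYYMMDD_HHMMSS_" and that whatever follows it identifies the file (the helper name and the sample file name are illustrative only):

```python
from datetime import datetime, timezone

PREFIX_LEN = len("YYYYMMDD_HHMMSS_")  # 16 characters, including the trailing "_"


def parse_export_prefix(file_name):
    """Split 'YYYYMMDD_HHMMSS_<rest>' into (creation time in UTC, rest)."""
    stamp = file_name[:PREFIX_LEN - 1]   # e.g. "20231005_060126"
    rest = file_name[PREFIX_LEN:]        # everything after the prefix
    created = datetime.strptime(stamp, "%Y%m%d_%H%M%S").replace(tzinfo=timezone.utc)
    return created, rest


created, rest = parse_export_prefix("20231005_060126_user_information.csv")
print(created.isoformat(), rest)
```

Sorting file names by the parsed creation time gives the order in which daily files should be applied to historical tables.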
Comments Content
This file contains all messages published by users, whether they are the main thread comment
(the message that started the sequence of replies) or a reply (a message linked to a main
comment), as well as other actions performed on the comments.
Fields information
Name | Description | Format |
---|---|---|
content_uuid | Id of the content item that was interacted on | UUID (Universally unique identifier) ex. 00000000-0000-4000-8000-0368f9a80495 |
container_uuid | Id of the widget where the action took place | UUID (Universally unique identifier) ex. 00000000-0000-4000-8000-0368f9a80495 |
payload_parent_uuid | The UUID of the post to which this is a reply, or the container_uuid if it is top level | UUID (Universally unique identifier) ex. 00000000-0000-4000-8000-0368f9a80495 |
actor_uuid | ID of the user or actor who caused the interaction. This could be a User, a Moderator or an internal ID | UUID (Universally unique identifier) ex. 00000000-0000-4000-8000-0368f9a80495 |
event_uuid | Unique identifier of the event | UUID (Universally unique identifier) ex. 00000000-0000-4000-8000-0368f9a80495 |
site_name | Name of the site (or property) | String ex. https://www.somepage.com |
message_type | Defines the type of message where the action took place. | String Main posts have these values: livecomment_post chat_message liveblog_post livereview_post Reply posts have these values: reply_to_livecomment_post reply_to_chat_message reply_to_liveblog_post reply_to_livereview_post |
action | Action that generated the entry. | String There are several possible values and they are displayed in the Action table below. |
payload_metadata_origin_url | Display the entire URL | String ex. https://www.somepage.com |
timestamp | Timestamp of the event that results in an entry | Timestamp '2020-10-13 13:28:06.340' |
payload_content | The text published by the user | String ex. 'Best wishes.' |
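Since payload_parent_uuid equals container_uuid for top-level posts, threads can be rebuilt from the rows of this file. The following is a small sketch, assuming each row has already been read into a dict keyed by the column names above (the helper names are illustrative):

```python
from collections import defaultdict


def is_top_level(row):
    """A post is top level when its parent is the container itself."""
    return row["payload_parent_uuid"] == row["container_uuid"]


def group_replies(rows):
    """Map each parent post UUID to the content_uuids of its direct replies."""
    replies = defaultdict(list)
    for row in rows:
        # Only "created" rows introduce a new post; other actions repeat the same content_uuid.
        if row["action"] == "created" and not is_top_level(row):
            replies[row["payload_parent_uuid"]].append(row["content_uuid"])
    return dict(replies)
```

Because a reply and its parent may have been created on different days, this grouping is more complete when run over historical rows rather than a single daily file.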
List of actions available (not all may be in use)
Action | Meaning |
---|---|
pinned | when a post is pinned to the top of the thread |
updated | when a post is edited |
disliked | when a post is disliked or down voted by a user |
created | when a post is created |
picked | when a post has been selected, as an editor's pick |
unliked | when a like to a post is reverted |
disabled | when a message is removed from view and general public cannot see it anymore |
unpicked | when a post has been removed, as an editor's pick |
deleted | when a user has voluntarily deleted their own post |
liked | when a post is liked |
undisliked | when a dislike to a post is reverted |
flags_cleared | when a moderator clears flags of a post |
flagged | when a user flags a post |
spammed | when a post is marked as spam |
visible | when a post is visible to the general public (barring mutes, ghost bans) |
unpinned | when a post is unpinned from top of a thread |
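One possible way to use these actions is to tally them per content item. The sketch below counts actions per content_uuid and derives a net like count as liked minus unliked; that formula is an assumption made for illustration, not a documented Viafoura metric, and since not all events are captured the result should be treated as approximate.

```python
from collections import Counter


def action_counts(rows):
    """Count how often each action occurred per content_uuid."""
    counts = {}
    for row in rows:
        counts.setdefault(row["content_uuid"], Counter())[row["action"]] += 1
    return counts


def net_likes(counter):
    """Likes minus reverted likes (a Counter returns 0 for missing actions)."""
    return counter["liked"] - counter["unliked"]
```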
User Information
This file provides additional information about the users that interacted with Viafoura widgets,
including username and third party id if available.
Fields information
Name | Description | Format |
---|---|---|
user_id | Id of the user that made the action | Big Int ex. 8081300019356 |
actor_uuid | ID of the user or actor who caused the interaction. This could be a User, a Moderator or an internal ID | UUID (Universally unique identifier) ex. 00000000-0000-4000-8000-0368f9a80495 |
third_party_id | This is the actor ID when user creates an account using third party services, such as Google, Facebook, LoginRadius | UUID (Universally unique identifier) ex. 00000000-0000-4000-8000-0368f9a80495 |
username | Displays the username chosen by the user | String 'Some Name' |
Container IDs
This file lists containers created or updated on a given day. A container is where comments
could be posted.
Fields information
Name | Description | Format |
---|---|---|
container_uuid | ID of the widget | UUID (Universally unique identifier) ex. 00000000-0000-4000-8000-0368f9a80495 |
site_name | Name of the site (or property) | String ex. https://www.somepage.com |
payload_container_id | Id of the widget in a different format | Big Int ex. 8081300019356 |
Moderation Assessments
This file provides information about moderation assessments on comments.
Fields information
Name | Description | Format |
---|---|---|
site_name | Name of the site (or property) | String ex. https://www.somepage.com |
assessment | Outcome of assessment. | String Possible Values: approved deferred rejected |
assessment_type | Type of assessment. | String Possible Values: content_moderation spam_detection flag_moderation |
content_container_uuid | UUID of container the content was in (livechat id, page id etc) | UUID (Universally unique identifier) ex. 00000000-0000-4000-8000-0368f9a80495 |
content_source_type | Type of content being moderated (see message_type in Comments Content) | String |
entity_uuid | UUID of content being moderated | UUID (Universally unique identifier) ex. 00000000-0000-4000-8000-0368f9a80495 |
provider | The entity performing this assessment. | String Possible Values: Human moderations have these values: console human Auto moderations have these values: automod_service automod (deprecated) keepcon (deprecated) Auto spam detection has this value: spam_service |
provider_decision | The decision made by the assessment service before settings were applied to it | String ex. 'approved' |
section_uuid | Unique site identifier | UUID (Universally unique identifier) ex. 00000000-0000-4000-8000-0368f9a80495 |
tags | Tags attached to the assessment | String '[{"confidence":0.000000000000000e+00, "status":"confirmed", "tag":"personal_attack"}]' |
timestamp | Timestamp of the event that results in an entry | Timestamp '2020-10-13 13:28:06.340' |
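The tags column looks like a JSON array serialized as a string. A minimal sketch for extracting tag names, assuming the value is valid JSON once the CSV parser has removed the text qualifiers, and that each element carries "tag" and "status" keys as in the example above:

```python
import json


def tag_names(tags_field, status=None):
    """Return tag names from the tags column, optionally filtered by status."""
    if not tags_field:
        return []  # empty cell: no tags attached
    tags = json.loads(tags_field)
    return [t["tag"] for t in tags if status is None or t.get("status") == status]


sample = '[{"confidence":0.000000000000000e+00, "status":"confirmed", "tag":"personal_attack"}]'
print(tag_names(sample, status="confirmed"))
```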
User Moderations
This file provides information about user moderations including user bans, avatar moderations
and username moderations.
Fields information
Name | Description | Format |
---|---|---|
site_name | Name of the site (or property) | String ex. https://www.somepage.com |
timestamp | Timestamp of the event that results in an entry | Timestamp '2020-10-13 13:28:06.340' |
event_type | Event type. | String Possible Values: ban.user (user bans) user.moderate (avatar and username moderations) |
content_type | The type of content to which interaction is related. | String Possible Values: username avatar |
actor_uuid | ID of the user or actor who caused the interaction. This could be a Moderator or an internal ID | UUID (Universally unique identifier) ex. 00000000-0000-4000-8000-0368f9a80495 |
event_uuid | Unique identifier of the event | UUID (Universally unique identifier) ex. 00000000-0000-4000-8000-0368f9a80495 |
interaction_status | Status of username/avatar | String Possible Values: approved rejected |
Poll Actions
This file provides information about management of polls and engagements in polls.
Fields information
Name | Description | Format |
---|---|---|
timestamp | Timestamp of the event that results in an entry | Timestamp '2020-10-13 13:28:06.340' |
site_name | Name of the site (or property) | String ex. https://www.somepage.com |
container_uuid | UUID of container the content was in (livechat id, page id etc) | UUID (Universally unique identifier) ex. 00000000-0000-4000-8000-0368f9a80495 |
poll_uuid | Unique poll identifier | UUID (Universally unique identifier) ex. 00000000-0000-4000-8000-0368f9a80495 |
poll_title | Poll title, only available with management event_type | String |
user_id | ID of the user or actor who caused the interaction. This could be a registered or anonymous User, a Moderator or an internal ID | UUID (Universally unique identifier) ex. 00000000-0000-4000-8000-0368f9a80495 |
event_uuid | Unique identifier of the event | UUID (Universally unique identifier) ex. 00000000-0000-4000-8000-0368f9a80495 |
event_type | Event type. | String Possible Values: engagement (votes) management |
event_action | Event action. | String Possible values when event_type is 'engagement': vote Possible values when event_type is 'management': publish close delete |
voter_picked_option_uuid | Poll option picked by voter, only available when event_type is 'engagement' | UUID (Universally unique identifier) ex. 00000000-0000-4000-8000-0368f9a80495 |
poll_options | Array containing poll options (option order, text, uuid and votes received), only available when event_type is 'management' | Array [ { "option": "Monday", "optionUuid": "00000000-0000-4000-8000-0368f9a80495", "order": 1, "votes": 0 }, { "option": "Tuesday", "optionUuid": "00000000-0000-4000-8000-0368f9a80496", "order": 2, "votes": 0 } ] |
poll_close_timestamp | Timestamp of when poll should end (if applicable), only available when event_type is 'management' | Timestamp '2020-10-13 13:28:06' |
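For management events, poll_options can be turned into a simple results summary. This sketch assumes the column holds a JSON array whose elements have "option", "order" and "votes" keys, as in the example above; the helper name is illustrative.

```python
import json


def poll_results(poll_options_field):
    """Return (option text, votes, share of total votes) in display order."""
    options = sorted(json.loads(poll_options_field), key=lambda o: o["order"])
    total = sum(o["votes"] for o in options)
    return [
        (o["option"], o["votes"], o["votes"] / total if total else 0.0)
        for o in options
    ]
```

The share is reported as 0.0 when no votes have been recorded, which avoids a division by zero for newly published polls.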
Engage Time
This file provides information about users' time on site and time in comments.
Fields information
Name | Description | Format |
---|---|---|
day | Date of event | Timestamp '2020-10-13' |
site_name | Name of the site (or property) | String ex. https://www.somepage.com |
actor_uuid | ID of the user or actor who caused the interaction. This could be a User, a Moderator or an internal ID | UUID (Universally unique identifier) ex. 00000000-0000-4000-8000-0368f9a80495 |
view_uniqueid | First party tracking cookie ID, unique identifier of the anonymous user who caused the interaction | UUID (Universally unique identifier) ex. 00000000-0000-4000-8000-0368f9a80495 |
time_on_page | Time user spent on site (in milliseconds) | Big Int ex. 141192 |
time_in_comments | Time user spent in commenting widgets (in milliseconds) | Big Int ex. 141192 |
Entity Relationship Diagram
The Entity Relationship Diagram (ERD) shows how the files relate to each other.
[Figure: Entity Relationship Diagram of the data export files]
Frequently Asked Questions
How can we resolve user id to email addresses?
Both user_id and actor_uuid are internal to Viafoura. To map to users in your system, you can use username or third_party_id in user information files if available. We can also add a column for email address if required (sharing sensitive PII data should be avoided in most cases).
How do I generate a public key for data access through SFTP?
Please follow instructions in
https://docs.aws.amazon.com/transfer/latest/userguide/key-management.html#sshkeygen.
You need to share the public key with us and use the private key for data access.
How do I read multi-line comments?
The files we share are .csv files, so the file should be opened/read as .csv in order to be displayed or processed correctly. For example, you could open the file with Microsoft Excel, or programmatically read the file using CSV parsers.
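As an illustration of reading the file programmatically, the sketch below uses Python's standard csv module with the delimiter and text qualifier listed under File Specifications. A quoted payload_content value that spans several lines is returned as a single field. The UTF-8 encoding and the file name in the usage comment are assumptions.

```python
import csv
import io


def read_export(text_stream):
    """Parse an export file into a list of dicts keyed by the header row."""
    return list(csv.DictReader(text_stream, delimiter=",", quotechar='"'))


# In-memory sample with a comment that contains a line break.
sample = io.StringIO(
    'content_uuid,payload_content\r\n'
    '00000000-0000-4000-8000-0368f9a80495,"Best wishes.\nSee you soon."\r\n'
)
rows = read_export(sample)
print(len(rows), repr(rows[0]["payload_content"]))

# For a file on disk, open it with newline="" so the csv module handles line breaks itself:
#   with open("20231005_060126_comments_content.csv", newline="", encoding="utf-8") as f:
#       rows = read_export(f)
```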
Why do old articles show up in recent container_ids files?
Clients need to add comments/conversations code on a page for it to appear (see
https://documentation.viafoura.com/docs/new-conversations#step-2-add-the-conversations-code-to-your-page). Comment containers could be missing if the code has not been deployed on a page. Container creation for comments is purely controlled by the client and we capture container creation events instantly.
Why do some actor_uuid appear in comments content file but not in user_information file on the same day?
The files are created based on different events as not all data fields are available in all the events, and they serve different purposes. The user_information file is only intended to provide additional user information such as username. If desired, one can create a table to store the latest user information per actor_uuid from all historical user_information files and use the table to look up user information.
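The lookup table described above could be maintained along these lines. This is a sketch that keeps the table as an in-memory dict; in practice it would more likely be a database table. It assumes daily files are applied in chronological order and that an empty value in a later file should not erase an earlier one, which is a design choice rather than documented behavior.

```python
USER_FIELDS = ("user_id", "third_party_id", "username")


def update_user_lookup(lookup, rows):
    """Apply one day's user_information rows to the per-actor lookup table."""
    for row in rows:
        record = lookup.setdefault(row["actor_uuid"], {})
        for field in USER_FIELDS:
            value = row.get(field)
            if value:  # keep the earlier value when a later file leaves the field empty
                record[field] = value
    return lookup
```

Calling update_user_lookup once per daily file, oldest first, leaves the most recent known username for each actor_uuid in the table.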
Why do some container_uuid appear in comments_content file but not in container_ids file on the same day?
The files are created based on different events. Each day’s container_ids file only contains information on newly created or updated containers. One should be able to find the container_uuid if looking at all historical container_ids files as a whole. For example, if a user created a comment on 11 October in a comment container created on 10 October, the container’s id would appear in the comments content file for 11 October and container_ids file for 10 October.
Why is there a difference between the number of comments from the file and the number on the website?
This could be attributed to a few different factors. First of all, it depends on how the number of comments is extracted from the files. Secondly, a comment could go through a few different moderation processes and its visibility could change depending on both content status (created and awaiting moderation/visible/disabled/spammed/deleted) and user status (deleted/banned). User status could affect the visibility of both the user’s comments and replies to these comments. User status has never been considered in reporting the number of comments for analytics purposes, as it could change with time, and counting the number of comments created is enough as a measure of user engagement.
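One way to extract the number of comments created from the comments content file is sketched below. It counts distinct content_uuid values with the "created" action and separates main posts from replies using the "reply_to_" prefix of message_type. Treating this as the comment count is one possible method, not necessarily the one used for the number shown on the website.

```python
def count_created_comments(rows):
    """Count distinct created posts, split into main posts and replies."""
    main, replies = set(), set()
    for row in rows:
        if row["action"] != "created":
            continue  # likes, flags, moderation actions etc. are not new comments
        if row["message_type"].startswith("reply_to_"):
            replies.add(row["content_uuid"])
        else:
            main.add(row["content_uuid"])
    return {"main": len(main), "reply": len(replies)}
```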