DataStori is a SaaS product which automates the ingestion of data from cloud applications (ERP, CRM, HR, finance and others) to a central data store. DataStori ensures the availability of regular, clean and reliable data to businesses, which is the first step in data-driven decision-making.
DataStori is used by large and mid-market enterprises worldwide to understand their data and make the best use of it to achieve their business goals.
DataStori offers major benefits in functionality, security and pricing over other data connectors.
- DataStori creates data pipelines in real time by reading source API documentation. Other tools maintain a library of pre-built connectors, with processes and lead times to add new ones. This makes it possible for customers to access thousands of applications that other connectors do not serve.
- DataStori executes data pipelines in the customer's cloud. Data source and destination are both in the customer's cloud, and data never leaves their environment. This ensures that data handled by DataStori is always in compliance with the customer's data security and privacy policies.
- DataStori has transparent and cost-effective pricing. Because it runs serverless, DataStori spins up and shuts down infrastructure on demand, keeping costs low. Unlike other connectors, DataStori executes pipelines in the customer's cloud and so eliminates the extra data hop through its own cloud. This is a significant benefit for both cost and data security.
A data pipeline is a component that copies specified data from a source to a destination using an integration. For example, a data pipeline can be built to copy the General Ledger table from NetSuite (source) to Azure SQL (destination). The integration specifies the data copy and automation parameters, including data deduplication, pipeline run schedule, source columns, data backload and many others.
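For illustration, the integration for such a pipeline could be captured in a definition like the sketch below. The field names (source, destination, dedupe_keys, schedule, backload_from) are hypothetical, not DataStori's actual configuration schema.

```python
# Hypothetical pipeline definition, for illustration only: the field names
# below are not DataStori's actual configuration schema.
pipeline = {
    "name": "netsuite_general_ledger_to_azure_sql",
    "source": {"application": "NetSuite", "object": "GeneralLedger"},
    "destination": {"warehouse": "Azure SQL", "table": "general_ledger"},
    "integration": {
        "columns": ["transaction_id", "account", "amount", "posting_date"],
        "dedupe_keys": ["transaction_id"],  # rows with the same key are deduplicated
        "schedule": "0 2 * * *",            # run daily at 02:00 (cron syntax)
        "backload_from": "2021-01-01",      # how far back to backload historical data
    },
}
```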
DataStori orchestrates data pipelines from its cloud (AWS US East-1 region) but executes them in the customer's cloud. Data source and destination are both in the customer's cloud, and data never leaves their environment. This ensures that customer data handled by DataStori is always in compliance with the customer's data security and privacy policies.
DataStori creates a Lakehouse in the customer's cloud and follows the Medallion architecture for data management. Files are written in the delta format and pushed to a data warehouse of the customer's choice, e.g., Azure SQL, Snowflake, PostgreSQL or any SQLAlchemy-supported database.
Users can consume the ingested data from the Lakehouse or from the data warehouse. In addition to delta, DataStori can store data in Iceberg, Parquet or CSV formats in the Lakehouse.
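As a rough sketch of how a delta table in the Lakehouse can be written and read back, here is an example using the open-source deltalake Python package; the paths are placeholders and this is not DataStori's internal code.

```python
import pandas as pd
from deltalake import DeltaTable, write_deltalake

# Placeholder path in the customer's Lakehouse; layer names follow the
# Medallion convention (bronze = raw, silver = cleaned). Cloud credentials
# are taken from the environment.
table_path = "s3://customer-lakehouse/silver/general_ledger"

# Append a batch of ingested rows to a delta table.
batch = pd.DataFrame({"transaction_id": [101, 102], "amount": [250.0, 99.5]})
write_deltalake(table_path, batch, mode="append")

# Consumers, or the push to the data warehouse, can read the same table back.
df = DeltaTable(table_path).to_pandas()
```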
DataStori performs a limited set of data transformations. It dedupes and flattens incoming data and encrypts user-specified columns.
This makes the data ready for enrichment, AI-based querying, custom analytics, reporting and any other end use the customer needs it for. These business processes are all downstream of DataStori and not part of the product.
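A minimal sketch of the flattening and deduplication steps, using pandas with illustrative records and column names:

```python
import pandas as pd

# Nested records as they might arrive from a source API (illustrative data).
records = [
    {"id": 1, "customer": {"name": "Acme", "country": "US"}, "amount": 120.0},
    {"id": 1, "customer": {"name": "Acme", "country": "US"}, "amount": 120.0},  # duplicate
    {"id": 2, "customer": {"name": "Globex", "country": "DE"}, "amount": 75.0},
]

# Flatten nested JSON into columns, then drop duplicate rows.
flat = pd.json_normalize(records)   # columns: id, amount, customer.name, customer.country
deduped = flat.drop_duplicates(subset=["id"])

print(deduped)
```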
No, customers do not need to buy any IT infrastructure upfront to run DataStori. They need to set up a cloud account with AWS, Microsoft Azure or GCP to provision servers, storage, security and other services required to comply with their data policies. All these components are directly licensed by the customer from the cloud services provider.
DataStori is built using serverless architecture. It spins up servers and other components in the customer's cloud to run pipelines and shuts them down after execution. This ensures that the provisioning matches demand, with minimal fixed cost when pipelines aren't running.
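As an illustration of this on-demand pattern (not DataStori's actual orchestration code), a one-off pipeline run can be launched as an ephemeral AWS Fargate task that stops, and stops billing, as soon as the run finishes:

```python
import boto3

ecs = boto3.client("ecs", region_name="us-east-1")

# Launch a one-off Fargate task for a single pipeline run; the cluster,
# task definition and subnet names are placeholders. The task terminates
# when the pipeline process exits, so nothing stays running between runs.
ecs.run_task(
    cluster="customer-datastori-cluster",
    taskDefinition="pipeline-runner:1",
    launchType="FARGATE",
    count=1,
    networkConfiguration={
        "awsvpcConfiguration": {
            "subnets": ["subnet-0123456789abcdef0"],
            "assignPublicIp": "DISABLED",
        }
    },
)
```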
No, DataStori is designed to scale server and storage infrastructure on demand. It has run production data pipelines on tables as large as 100 GB and 30 million rows. Pipelines in DataStori can be configured and scheduled to backload multi-year data.
The only constraints on data load throughput are the rate limits imposed by source APIs or database connections. Breaking large loads into smaller datasets resolves this.
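A hedged sketch of that chunking pattern, assuming a hypothetical paginated source API with a requests-per-minute limit:

```python
import time
import requests

BASE_URL = "https://api.example-erp.com/v1/general_ledger"  # hypothetical source API
PAGE_SIZE = 10_000           # pull the table in smaller chunks
REQUESTS_PER_MINUTE = 30     # stay under the source's rate limit

def fetch_in_chunks(session: requests.Session):
    """Yield the table in pages, throttled to respect the API's rate limit."""
    offset = 0
    while True:
        resp = session.get(BASE_URL, params={"limit": PAGE_SIZE, "offset": offset})
        resp.raise_for_status()
        rows = resp.json()["items"]
        if not rows:
            break
        yield rows
        offset += PAGE_SIZE
        time.sleep(60 / REQUESTS_PER_MINUTE)
```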
Users can set up and run as many data pipelines as they want. The number of data pipelines is only constrained by the parameters defined by the user's cloud services provider.
DataStori charges based on the number of application instances connected. This fee has two components: a one-time onboarding fee and a monthly license fee.
DataStori does not charge users on the volume of data ingested or the number of pipelines created and executed - these are part of the infrastructure costs that customers directly pay their cloud services provider.
At all times, customer data resides and moves within their environment - source applications, SharePoint, email, SFTP folders, and destination storage.
DataStori orchestrates data pipelines from its cloud (AWS US East-1 region) but executes them in the customer's cloud. Data source and destination are both in the customer's cloud, and data never leaves their environment. A further level of security is that DataStori can encrypt user-specified columns from a data source or drop them from the final output.
All this ensures that customer data handled by DataStori is always in compliance with the customer's data security and privacy policies.
Data sources include application APIs and databases, emailed CSV files and SFTP folders.
Yes, users need to allow DataStori access to their:
- Cloud infrastructure, to spin up servers and other components
- Source application APIs or databases, from which data is to be ingested
All credentials are secure with DataStori. Application API tokens are encrypted using AES-256 and stored in the application database; they cannot be read by the DataStori admin or anyone else.
DataStori does not need any credentials for the destination storage, because the required permissions are assigned to the servers spun up for pipeline execution.
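For illustration only, AES-256 encryption of an API token can be sketched with the Python cryptography package; key management and storage are assumptions, not DataStori's internal implementation.

```python
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

# Illustrative only: how an API token can be AES-256 encrypted before being
# written to an application database. Key storage (e.g. a KMS or secrets
# manager) is outside this sketch.
key = AESGCM.generate_key(bit_length=256)   # 256-bit key
aesgcm = AESGCM(key)

token = b"source-application-api-token"
nonce = os.urandom(12)                      # unique nonce per encryption
ciphertext = nonce + aesgcm.encrypt(nonce, token, None)

# Only a holder of the key can recover the token.
recovered = aesgcm.decrypt(ciphertext[:12], ciphertext[12:], None)
assert recovered == token
```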
Security elements in DataStori include data encryption, virtual network, multi-factor authentication, detailed alerts and logging.
No, DataStori cannot view business data. While DataStori orchestrates data pipelines from its cloud, the data movement from source application to storage destination is entirely in the customer's cloud. DataStori can only create and access the metadata on pipeline setup and execution.
DataStori runs the following checks on all ingested data for every pipeline execution:
- Data freshness test, to check when the data was last refreshed
- Primary key not-null test, to ensure the primary key has no missing values
- Primary key uniqueness test, to ensure there are no duplicates in the primary key
In addition, DataStori has automated retries, logging and alerts to make data pipelines more robust.
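Those three checks amount to simple assertions on each ingested table. A minimal sketch in pandas, with illustrative column names:

```python
import pandas as pd

def run_quality_checks(df: pd.DataFrame, primary_key: str,
                       loaded_at_column: str, max_staleness: pd.Timedelta) -> None:
    # Data freshness: the most recent load timestamp must be within the allowed window.
    staleness = pd.Timestamp.now(tz="UTC") - df[loaded_at_column].max()
    assert staleness <= max_staleness, f"data is stale by {staleness}"

    # Primary key not null.
    assert df[primary_key].notna().all(), "null values found in primary key"

    # Primary key uniqueness.
    assert df[primary_key].is_unique, "duplicate values found in primary key"

# Illustrative usage on a tiny table.
run_quality_checks(
    pd.DataFrame({"transaction_id": [1, 2, 3],
                  "loaded_at": pd.Timestamp.now(tz="UTC")}),
    primary_key="transaction_id",
    loaded_at_column="loaded_at",
    max_staleness=pd.Timedelta(hours=24),
)
```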
By default, all data pipelines in DataStori have a concurrency of 1, i.e., only one instance of a given pipeline can run at a time. All other triggered instances of that pipeline are queued. In addition, the output data is saved in the delta format, which provides ACID guarantees.
In DataStori, schema evolution and tracking are automated. In addition, data and schema changes can be rolled back to a defined restore point if required.
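Because the output is stored in the delta format, every write is a versioned commit and earlier versions of a table remain readable. The sketch below uses the open-source deltalake package with a placeholder table path; it illustrates the rollback idea rather than DataStori's actual restore mechanism.

```python
from deltalake import DeltaTable

table_path = "s3://customer-lakehouse/silver/general_ledger"  # placeholder path

# Every write is a versioned commit; the history shows what changed and when.
current = DeltaTable(table_path)
for commit in current.history():
    print(commit)

# Reading the table as of an earlier version gives the data (and schema) at
# that restore point, which can then be re-written as the current state.
restore_point = DeltaTable(table_path, version=5)
df_at_restore_point = restore_point.to_pandas()
```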
DataStori supports the following API authentication mechanisms:
1. API Key
2. Basic Authentication
3. OAuth2 - Client Credentials and Authorization Grant flow
In addition, DataStori can be extended to support custom authentication flows that a source application may have implemented.
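As an example, the OAuth2 client credentials flow exchanges a client ID and secret for a short-lived bearer token, which is then used to call the source API. The sketch below uses the requests library with placeholder endpoints and credentials:

```python
import requests

TOKEN_URL = "https://auth.example-app.com/oauth2/token"  # placeholder endpoint

# OAuth2 client credentials flow: exchange the client ID and secret for an
# access token, then call the source API with the bearer token.
resp = requests.post(
    TOKEN_URL,
    data={
        "grant_type": "client_credentials",
        "client_id": "my-client-id",
        "client_secret": "my-client-secret",
    },
    timeout=30,
)
resp.raise_for_status()
access_token = resp.json()["access_token"]

api_resp = requests.get(
    "https://api.example-app.com/v1/invoices",   # placeholder source endpoint
    headers={"Authorization": f"Bearer {access_token}"},
    timeout=30,
)
```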