Best Practices for Different Data Ingestion Options in Snowflake
Data ingestion, the process of obtaining and importing data for immediate storage or use in a database, usually comes in two flavours: batch ingestion and streaming. Batch ingestion imports data in discrete chunks at pre-determined intervals, while real-time streaming arrives continuously on an unpredictable schedule.
Snowflake’s Data Cloud solves many of the data ingestion problems that companies face with the following options: COPY for batch bulk loading, Snowpipe for continuous micro-batch loading, and Snowpipe Streaming* for row-level streaming ingestion.
*Snowpipe Streaming is currently a preview feature open to all accounts.
Let’s look at the benefits and best practices of each -
Benefits of COPY
~ Large scale bulk data ingestion
~ Scalable compute warehouse
~ Resilient error handling and retry
~ Deduplication support
~ Full support for semi-structured data on load
~ Pre-defined data format options
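To make these benefits concrete, here is a minimal sketch of a bulk load of semi-structured JSON with COPY, assuming a hypothetical stage @raw_stage and target table orders_raw:

-- Hypothetical target table with a VARIANT column for semi-structured JSON
CREATE TABLE IF NOT EXISTS orders_raw (record VARIANT);

-- Bulk load every JSON file under a stage path in a single COPY command
COPY INTO orders_raw
  FROM @raw_stage/orders/
  FILE_FORMAT = (TYPE = JSON STRIP_OUTER_ARRAY = TRUE);

A warehouse sized for the number of files (not their size) does the work, and COPY’s load metadata prevents the same file from being ingested twice.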
COPY Best Practices
Do
=> Leverage object path partitioning for effective listing and loading
=> Use the FILES or PATTERN option to specify the exact files when loading ad-hoc
=> Utilize schema detection for simple table creation and schema evolution for continuous data loading
=> Take advantage of file format and COPY options such as ON_ERROR (see the example after this section)
Don’t
<> Don’t upsize your warehouse to load a single file faster; a larger warehouse only helps when many files are loaded in parallel
<> Don’t over-complicate your large data loads; let Snowflake handle the job management
<> Don’t split or merge files unless your data source supports it natively; it’s not worth building a separate tool or process
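The sketch below illustrates the Do items above: schema detection for table creation, targeting specific files with FILES (or PATTERN), and the ON_ERROR copy option. The stage, table and file format names are hypothetical.

-- A named file format is required for schema detection (name is hypothetical)
CREATE FILE FORMAT IF NOT EXISTS parquet_fmt TYPE = PARQUET;

-- Schema detection: create the table from the columns inferred in the staged files
CREATE TABLE IF NOT EXISTS events
  USING TEMPLATE (
    SELECT ARRAY_AGG(OBJECT_CONSTRUCT(*))
    FROM TABLE(INFER_SCHEMA(
      LOCATION => '@raw_stage/events/',
      FILE_FORMAT => 'parquet_fmt')));

-- Ad-hoc load of specific files, matching columns by name and skipping bad files
COPY INTO events
  FROM @raw_stage/events/
  FILES = ('events_2023_01.parquet', 'events_2023_02.parquet')
  -- or: PATTERN = '.*events_2023_0[12][.]parquet'
  FILE_FORMAT = (FORMAT_NAME = 'parquet_fmt')
  MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
  ON_ERROR = 'SKIP_FILE';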
Benefits of Snowpipe
~ Continuously generated data is available for analysis in under 1 minute
~ You only pay for the compute time you use to load data
~ Avoid repeated manual COPY commands
~ Zero management. No indexing, tuning, partitioning or vacuuming on load
~ Error notifications for failure alerting
~ Serverless: No server to manage or concurrency to worry about.
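As a sketch of how this looks in practice, the pipe below (names hypothetical) wraps the earlier COPY statement so Snowflake loads new files automatically as they land in the stage. Auto-ingest relies on cloud storage event notifications, and on Azure/GCP a notification integration as well.

-- Hypothetical pipe that continuously loads new files from an external stage
CREATE PIPE IF NOT EXISTS orders_pipe
  AUTO_INGEST = TRUE
  AS
  COPY INTO orders_raw
    FROM @raw_stage/orders/
    FILE_FORMAT = (TYPE = JSON);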
Snowpipe Best Practices
Do
=> Utilize native cloud provider event filtering
=> Ensure optimal file sizing (100–250 MB compressed)
=> Utilize the same notification integration for multiple Azure/GCP pipes
=> Utilize error notifications for failure alerting
=> Check pipe status for debugging and health (see the example after this section)
Don’t
<> Ingest tiny files (e.g., 1 KB); use Snowpipe Streaming where possible to bypass files entirely
<> Build your own Frankenstein auto-ingest solution or custom replication
<> Drop stages, tables, or integrations that are used by running pipes
<> Ignore COPY best practices
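A sketch of the monitoring-related Do items, assuming the stage and table from the earlier examples and an existing error notification integration named my_error_int (hypothetical):

-- Define the pipe with an error notification integration so load failures raise alerts
CREATE PIPE IF NOT EXISTS orders_alerted_pipe
  AUTO_INGEST = TRUE
  ERROR_INTEGRATION = my_error_int
  AS
  COPY INTO orders_raw FROM @raw_stage/orders/ FILE_FORMAT = (TYPE = JSON);

-- Check pipe health and the size of its pending file queue
SELECT SYSTEM$PIPE_STATUS('orders_alerted_pipe');

-- Review recent load outcomes and errors for the target table
SELECT *
FROM TABLE(INFORMATION_SCHEMA.COPY_HISTORY(
  TABLE_NAME => 'ORDERS_RAW',
  START_TIME => DATEADD(hour, -24, CURRENT_TIMESTAMP())));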
Benefits of Snowpipe Streaming
~ Data is queryable in < 5 seconds after Snowflake acknowledges the write
~ Low cost for both steady, high-volume streams and trickle feeds
~ Direct Data Streaming: Ingest rows into tables over HTTPS; no files or pipes to manage
~ Exactly-Once and Per-Channel Ordering: Data lands exactly once and is processed in order within each channel
~ High Throughput: GB/sec ingestion rates supported.
~ Low overhead: Minimal configuration and setup needed.
Snowpipe Streaming Best Practices
Do
=> Consider your business requirements to determine the best cost/latency trade-off for ingestion
=> Leverage Snowpipe Streaming with Snowflake’s Kafka Connector via a simple configuration upgrade (see the sketch after this section)
Don’t
<> Build a streaming application just to read a large file or batch load existing data
<> Replace all batch ingestion; augment it with streaming instead
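If you already run Snowflake’s Kafka Connector with the default file-based Snowpipe method, the upgrade mentioned in the Do list is largely a configuration change. A minimal sketch of the relevant connector properties (the role name is hypothetical; check the connector documentation for the full required set):

# Switch the connector's ingestion method from file-based Snowpipe to Snowpipe Streaming
snowflake.ingestion.method=SNOWPIPE_STREAMING
# Snowpipe Streaming requires a role with privileges on the target database/schema
snowflake.role.name=KAFKA_CONNECTOR_ROLE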
References:
https://docs.snowflake.com/en/user-guide/data-load-snowpipe-streaming-overview
https://docs.snowflake.com/en/user-guide/data-load-considerations
Follow and Clap if you like the content, and feel free to ask any questions in the comments. I will be more than happy to assist and guide you.