weatherhoogl.blogg.se - Redshift data types

This is a guest blog from Sameer Wadkar, Big Data Architect/Data Scientist at Axiomine. The Spark SQL Data Sources API was introduced in Apache Spark 1.2 to provide a pluggable mechanism for integration with structured data sources of all kinds. Spark users can read data from a variety of sources such as Hive tables, JSON files, columnar Parquet tables, and many others. Third-party data sources are also available.

This post discusses a new Spark data source for accessing the Amazon Redshift Service. Redshift Data Source for Spark is a package maintained by Databricks, with community contributions from SwiftKey and other companies. Prior to the introduction of Redshift Data Source for Spark, Spark’s JDBC data source was the only way for Spark users to read data from Redshift. While this method is adequate for queries returning a small number of rows (on the order of hundreds), it is too slow for large-scale data. The reason is that JDBC provides a ResultSet-based approach, where rows are retrieved in a single thread in small batches. Furthermore, using JDBC to store large datasets in Redshift is only practical when data needs to be moved between tables inside a Redshift database. JDBC-based INSERT/UPDATE queries are only practical for small updates to Redshift tables.
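
To make the ResultSet-style access pattern concrete, here is a minimal sketch of single-threaded, batched row retrieval. It is an illustration only: an in-memory SQLite database stands in for Redshift, Python's `sqlite3` cursor stands in for a JDBC ResultSet, and the `events` table, the `fetch_in_batches` helper, and the batch size of 100 are all hypothetical choices made for this example.

```python
import sqlite3


def fetch_in_batches(cursor, batch_size):
    """Yield rows from a cursor one small batch at a time, on the calling thread."""
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            break
        yield batch


# An in-memory SQLite database stands in for the remote warehouse (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO events (id, payload) VALUES (?, ?)",
    [(i, f"row-{i}") for i in range(1000)],
)

cursor = conn.execute("SELECT id, payload FROM events ORDER BY id")

batches_fetched = 0
rows_read = 0
# Each loop iteration corresponds to one fetch of a small batch; nothing runs in parallel.
for batch in fetch_in_batches(cursor, batch_size=100):
    batches_fetched += 1
    rows_read += len(batch)

print(f"{rows_read} rows in {batches_fetched} batches")  # 1000 rows in 10 batches
conn.close()
```

The number of fetches grows linearly with the row count, and a single reader handles all of them. Against a real remote database each fetch would also involve network latency, which is why this pattern works for a few hundred rows but becomes a bottleneck for large tables.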