How Dow Jones Uses Amazon Redshift


Dow Jones & Company is a publishing and financial information company whose products and services help companies participate better in the market. Examples include Barron’s, Factiva, and The Wall Street Journal. Dow Jones serves both businesses and consumers.

Colleen Camuccio is Vice President of Program Management at Dow Jones. In her presentation at AWS re:Invent, she talks about Dow Jones’ use of AWS and Amazon Redshift. Amazon Redshift sits at the center of their stack, helping convert their data systems from a cost center into a revenue-generating center.

In this post, we provide an overview of how Dow Jones implemented its new data platform with Amazon Redshift at the heart of it.

A fight for data
Large companies don’t usually start new projects from scratch, so why did Dow Jones decide to build a new data platform from the ground up?

At Dow Jones, data users faced five problems when working with data.

Multiple versions of the truth
Limited performance visibility
Lost time spent hunting for data
Lack of information affects decision making
Inability to segment
Pre-cloud data challenges at Dow Jones
In short, users could not get their hands on the data they needed. With these issues in mind, Colleen and her team saw an opportunity: use the cloud to turn data from a cost center into a revenue-generating center by building a new, world-class data platform.

Design of the new data platform
To plan the architecture and choose the tools involved in creating their data platform, the team created a council of cloud technologists. The council includes Dow Jones experts, industry specialists, and AWS staff who help design the architecture for the new platform.

Dow Jones Data Platform reference architecture
There are five core AWS technologies that underpin the architecture:

S3 as the data lake
EC2 to extract data into S3
EMR and Spark to process data
AWS Glue for organizing and partitioning data
Amazon Redshift as an analytics platform
These five technologies form the backbone of the Dow Jones data pipeline.

S3 as the data lake

S3 is the staging area for acquiring, standardizing, and cataloging data. The goal is to collect, clean, and key each relevant customer event for later use. Data in S3 is transformed into Parquet and normalized for consumption by self-service tools and analytics use cases.
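
As a rough illustration of this staging step, the sketch below uses Python with pandas and hypothetical bucket names, columns, and file layout (not Dow Jones’s actual pipeline): it reads raw JSON events, applies light cleaning, and writes normalized Parquet back to S3.

```python
# Minimal sketch of the staging step: raw events in, normalized Parquet out.
# All paths and column names are hypothetical; requires pandas, pyarrow, s3fs.
import pandas as pd

RAW = "s3://example-raw-events/2020/01/15/events.jsonl"               # hypothetical
CURATED = "s3://example-curated/events/dt=2020-01-15/events.parquet"  # hypothetical

# Read raw customer events, one JSON record per line.
events = pd.read_json(RAW, lines=True)

# Light cleaning: drop events without a customer key, normalize column names.
events = events.dropna(subset=["customer_id"])
events.columns = [c.strip().lower() for c in events.columns]

# Write the cleaned, keyed events back to S3 as Parquet for downstream
# self-service tools and analytics use cases.
events.to_parquet(CURATED, index=False)
```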

EC2 to extract data into S3

Not every system that Dow Jones works with can deliver data to the platform directly, for example through out-of-the-box ETL tools. To solve this data delivery problem, EC2 instances pull data from third-party servers, APIs, and other sources.
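
A minimal sketch of that pull-based ingestion, assuming a hypothetical partner API endpoint and bucket name, might look like this (run on an EC2 instance, for example on a schedule):

```python
# Pull data from a third-party API and land the raw payload in the S3 data lake.
# The endpoint, bucket, and key layout are placeholders, not real integrations.
import datetime
import json

import boto3
import requests

API_URL = "https://api.example-partner.com/v1/events"  # hypothetical endpoint
BUCKET = "example-raw-events"                           # hypothetical bucket


def pull_and_land() -> None:
    # Fetch the latest batch of records from the partner API.
    resp = requests.get(API_URL, params={"since": "2020-01-15"}, timeout=30)
    resp.raise_for_status()

    # Land the raw response in S3, keyed by source and date, for later staging.
    key = f"partner-x/dt={datetime.date.today():%Y-%m-%d}/events.json"
    boto3.client("s3").put_object(
        Bucket=BUCKET,
        Key=key,
        Body=json.dumps(resp.json()).encode("utf-8"),
    )


if __name__ == "__main__":
    pull_and_land()
```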

EMR and Spark to process data

Amazon EMR is the AWS framework for processing big data workloads. EMR lets you keep data in S3 and run the compute as a separate process, and it provides native support for Apache Spark. Whether to use Spark or Redshift for a given data processing job depends on the use case.

Dow Jones uses EMR to process, massage, and transform data, with separate S3 buckets for the individual steps and stages.

Data lake areas and S3 buckets
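
The PySpark sketch below shows the shape of such an EMR step, with illustrative bucket names and an example aggregation rather than Dow Jones’s actual jobs: read one stage’s bucket, transform, and write the result to the next stage’s bucket.

```python
# One EMR/Spark stage: read standardized events, aggregate, write to the next
# stage's bucket. Bucket, prefix, and column names are illustrative only.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("process-events").getOrCreate()

events = spark.read.parquet("s3://example-standardized/events/")  # hypothetical

# Example transformation: daily page views per customer.
daily = (
    events
    .withColumn("event_date", F.to_date("event_timestamp"))
    .groupBy("customer_id", "event_date")
    .agg(F.count("*").alias("page_views"))
)

# Write the processed output to the next stage, partitioned by date.
daily.write.mode("overwrite").partitionBy("event_date").parquet(
    "s3://example-processed/daily-page-views/"  # hypothetical
)
```
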
AWS Glue for organizing and partitioning data

End users access “data marts”: aggregated data with business rules applied. An example is a “demographics data mart,” where Dow Jones summarizes and exposes the profile of a single user (e.g. cleaned up to reconcile different job titles for the same customer).

To tag, organize, and partition data for intuitive top-down access from S3, Dow Jones uses AWS Glue.
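
A minimal boto3 sketch of that step, with placeholder names and a placeholder IAM role: point a Glue crawler at a processed S3 prefix so its tables and partitions are registered in the Glue Data Catalog.

```python
# Register a partitioned S3 prefix in the Glue Data Catalog via a crawler.
# Crawler name, role ARN, database, and path are placeholders.
import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="daily-page-views-crawler",
    Role="arn:aws:iam::123456789012:role/ExampleGlueCrawlerRole",
    DatabaseName="analytics",
    Targets={"S3Targets": [{"Path": "s3://example-processed/daily-page-views/"}]},
)

# Discovered tables and partitions (e.g. event_date=...) become queryable
# through the catalog, including from Redshift Spectrum.
glue.start_crawler(Name="daily-page-views-crawler")
```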

Amazon Redshift as an analytics platform

At the beginning of architecture planning, the decision came down to choosing between Amazon Athena and Amazon Redshift for the analytics layer. They chose Amazon Redshift for three reasons:

Permissions. A key issue Dow Jones had to address was restricting access to confidential customer information and PII. Redshift allows permissions to be set on data by schema, by table, and even down to individual fields, for example by using IAM roles.
Cost / performance. With clean, normalized data already in S3, Amazon Redshift offers a choice between cost and performance. To optimize for cost, keep the data in S3, expose it as an external table, and use Redshift Spectrum to query it. To optimize for performance, create a physical table in Redshift and use the COPY command to load the data into the cluster. (A sketch of both options follows below.)
Analytics tools. To build a BI layer, Redshift gives analysts a single place to point their tools at, with access to data in S3 as well as the data marts. Custom dashboards can join different datasets (e.g. customer data, clickstream data, third-party data), and users can access the cluster with any tool of their choice.

Amazon Athena comparison with Amazon Redshift and Redshift Spectrum
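
To make that cost/performance trade-off concrete, here is a rough sketch using psycopg2 with placeholder cluster, role, schema, and table names (none of them Dow Jones’s): option 1 queries the data in place through Redshift Spectrum, option 2 loads it into the cluster with COPY.

```python
# Two ways to expose the same S3 data to analysts, sketched with psycopg2.
# Cluster endpoint, IAM roles, schemas, and tables are all placeholders.
import psycopg2

conn = psycopg2.connect(
    host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",  # placeholder
    port=5439,
    dbname="analytics",
    user="analyst",
    password="...",
)
conn.autocommit = True  # keep these statements outside an explicit transaction
cur = conn.cursor()

# Option 1 (optimize cost): leave the data in S3 and query it in place through
# Redshift Spectrum, via an external schema backed by the Glue Data Catalog.
cur.execute("""
    CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum
    FROM DATA CATALOG DATABASE 'analytics'
    IAM_ROLE 'arn:aws:iam::123456789012:role/ExampleSpectrumRole'
""")
cur.execute("SELECT count(*) FROM spectrum.daily_page_views")

# Option 2 (optimize performance): load the same data into a physical table in
# the cluster with COPY (assumes the target table has already been created).
cur.execute("""
    COPY local_marts.daily_page_views
    FROM 's3://example-processed/daily-page-views/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/ExampleCopyRole'
    FORMAT AS PARQUET
""")
```
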
Best practices for querying data
Amazon Redshift is excellent at aggregating large datasets. But with open-ended access to a Redshift cluster, for example for custom dashboards or reports, you still have to expect that some users will end up writing poor SQL.

Consider that Redshift is an analytical (OLAP) database and, unlike transactional (OLTP) databases, does not use indexes. SQL statements that include a “SELECT *” can therefore hurt both query and overall cluster performance. Instead, users should select only the specific columns they need.

SQL query best practices in Amazon Redshift
The data team has addressed this issue by recommending best practices to its users, such as querying smaller datasets.
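
As a tiny illustration of that guidance, with placeholder table and column names:

```python
# Avoid: on a columnar store this reads every column of the table.
BAD_QUERY = "SELECT * FROM clickstream_events"

# Prefer: name only the columns you need and restrict the rows being scanned.
GOOD_QUERY = """
    SELECT customer_id, page_url, event_timestamp
    FROM clickstream_events
    WHERE event_timestamp >= '2020-01-01'
"""
```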

But users don’t always pay attention to these best practices, which is where our automated individual query recommendations come to the rescue. With individual query optimization recommendations, you can empower users to fine-tune their SQL queries.

New use cases for data
With the new platform up and running, Dow Jones is empowering the business with new data use cases. Here are three examples.

Consumer publications dashboard

A custom dashboard that links clickstream, subscription, membership, and demographics for Dow Jones consumer publications. With this dashboard, users can segment, filter, sort, and see who is reading what.

Advertising performance dashboard

This dashboard provides analytics and insight into how ads perform and how users interact with them. It unifies datasets from eleven different sources, each with its own conventions, into a standard format.

Data visualization with B2B data

A 360 view of Dow Jones customers in the B2B space, combining clickstream behavior data with individual customer data.

To power those dashboards, the Redshift cluster hosts more than 118 TB of data. More than 100 users access and query data in Redshift in a self-service model.

With different competing workloads and hundreds of users writing queries, it’s crucial to set up workload management in Redshift.
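
One way to do that, sketched below with boto3 and illustrative queue definitions (not Dow Jones’s actual configuration), is to set the wlm_json_configuration parameter on the cluster’s parameter group so dashboards, ETL, and everything else each get their own queue.

```python
# Define manual WLM queues and apply them to a Redshift parameter group.
# Queue groups, concurrency, memory splits, and the group name are illustrative.
import json

import boto3

wlm_config = [
    {"query_group": ["dashboards"], "query_concurrency": 5, "memory_percent_to_use": 40},
    {"query_group": ["etl"], "query_concurrency": 2, "memory_percent_to_use": 40},
    {"query_concurrency": 3, "memory_percent_to_use": 20},  # default queue for everything else
]

boto3.client("redshift").modify_cluster_parameter_group(
    ParameterGroupName="example-parameter-group",  # placeholder
    Parameters=[
        {
            "ParameterName": "wlm_json_configuration",
            "ParameterValue": json.dumps(wlm_config),
        }
    ],
)
```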

The future of the Dow Jones Data Platform
All the work Dow Jones had put into creating its new data platform was done with the future in mind.

Beyond reporting, artificial intelligence and predictive analytics are the future of business. Dow Jones, as an industry leader, has to be at the forefront of this shift. That is one of the main reasons why they have prepared this data platform.

When designing the architecture, a key goal was to make the data ready for AI. Cleaning and preparing data is one of the most challenging and time-consuming aspects of data science.

By creating a system that has data cleanup and preparation as part of the process, they have allowed their data scientists to focus on the work that generates results. The work of model building, model training, and model evaluation is where data scientists make a living, and that’s where Dow Jones wants its data scientists to devote their efforts. A key factor here is fast and efficient queries, as that reduces cycle times and increases the volume of iterations for training models.

AI, machine learning, and predictive analytics are what Dow Jones wants its data platform to enable. With Redshift as the aggregation layer, they are using Amazon SageMaker to build and train models for predictive analytics.
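
A minimal sketch of that hand-off, with placeholder table, bucket, and role names: aggregate features in Redshift, UNLOAD them to S3, and point a SageMaker training job or notebook at the resulting prefix.

```python
# Export an aggregated training dataset from Redshift to S3 for SageMaker.
# Cluster endpoint, feature table, bucket, and IAM role are placeholders.
import psycopg2

conn = psycopg2.connect(
    host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",  # placeholder
    port=5439,
    dbname="analytics",
    user="ml_pipeline",
    password="...",
)
conn.autocommit = True
cur = conn.cursor()

# Unload per-subscriber features as Parquet to S3.
cur.execute("""
    UNLOAD ('SELECT customer_id, page_views_30d, articles_read_30d, churned
             FROM ml.subscriber_features')
    TO 's3://example-ml-data/churn/train/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/ExampleUnloadRole'
    FORMAT AS PARQUET
""")

# A SageMaker training job (or notebook) can then use
# s3://example-ml-data/churn/train/ as its input data channel.
```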

With the new data platform in place, Dow Jones is now ready for the future of data. Through its use of AWS and Redshift, Dow Jones has successfully turned the flood of data from many different sources from a cost center into a revenue generator.

Its mass of data from many different sources now provides value for its business and its customers. And for the future, Dow Jones has a system in place to organize and prepare its data for predictive analytics and machine learning.

Why Redshift, why create a cloud data warehouse?
When Dow Jones took the first steps toward creating this data platform, they chose Amazon Redshift as their technology base. Some of the key benefits of a cloud data warehouse like Redshift include:

Fast performance at low cost.
Open and flexible with existing business intelligence tools.
1/10 the cost of traditional data warehouses.
These advantages make Amazon Redshift an easy choice for teams building new data platforms in the cloud. And guided by our query recommendations, you can ensure that your SQL is always tuned to your data architecture.

 

