Innovation continues at a rapid pace in SQL on Hadoop solutions, with each vendor trying to outsmart the competition. In this second part of our interview with Actian's Emma McGrattan, we try to understand the architecture of Actian Vortex's SQL in Hadoop offering, with a particular focus on its database/SQL layer, named Vector. Emma is the Senior Vice President for Engineering at Actian and described the "Marchitecture" (as she likes to term it) in a conversation with Sachin Ghai. As per Emma, the Actian Vortex product suite is among the fastest and most mature SQL 'in' Hadoop offerings.
Actian Engineering has clearly put a lot of thought and innovation into the Vortex architecture. It is one of those products where the engineering team knew exactly the nuts and bolts of Hadoop as well as the cranks and shafts of the database. It is rare today to find a SQL offering that relies on HDFS for storage yet still delivers the enterprise-grade qualities expected of the database category. Utilizing YARN more as an enabler and replication as an added blessing, Actian Vortex has allowed its off-Hadoop Vector database to transition seamlessly to Hadoop. With more production implementations in the coming months, we can expect hardware disk requirements to be further optimized and in-memory features to take center stage.
Read more to understand its technical and functional architecture.
You mentioned HDFS being one of the bottlenecks for performance, but what about YARN? Is YARN a bottleneck or an enabler for SQL on Hadoop solutions?
It's funny, because it's both an enabler and can actually be a real challenge. Where it becomes a real challenge is the fact that YARN is not designed for long-running tasks - it's designed for MapReduce jobs, which are typically very short in duration. For a database, you typically expect to start it and have it just keep running until you need to stop it. It's a long-running task in the sense that one would expect it to run for a full quarter without ever coming down for maintenance, and that's a challenge for YARN. We do some things within our products so that when we're using YARN, we can allow for the fact that the database is a long-running task and be able to say I want to allocate more or less resources to the database. Maybe you are doing month-end or quarter-end processing and you want to give the database some additional cores from the environment. You want to add in additional nodes that will speed up processing, and then you want to remove those additional resources once the quarter or the month is over. That's a challenge for YARN - so we needed to overcome that challenge, because what we see traditionally when you're dealing with an EDW is that you have periods of huge activity. What you want to do then is to give the database as many resources as you can spare and then take them back afterwards.
We had to figure out how to do that with YARN, and the other thing we needed to figure out was how we drop the priority of the database. Maybe you've got an ETL task that you want to give priority to - you may want the ETL processing to get most of the resources, but once the ETL processing is done, you want the database to get back to full strength. That was another thing that we figured out with YARN. We were able to use YARN to get dynamic resource scheduling, in terms of adding more resources and taking them away in a dynamic or elastic fashion. Typically when you're dealing with a database, you are given a certain amount of memory that you're going to allow it to use over its lifetime, and very rarely do you want to shrink it. We were able to use YARN to figure out how to do that.
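Actian has not published how Vortex implements this internally, so the sketch below only illustrates the general YARN pattern a long-running service can use to grow and shrink: ask the ResourceManager for extra worker containers during a peak period, then release them afterwards. The class name, container sizes and heartbeat handling are invented for illustration and are not Actian's code.

```java
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

/**
 * Sketch of a long-running application master growing and shrinking its
 * footprint via YARN. Illustrative only; not Actian's actual implementation.
 */
public class ElasticWorkerPool {

    private final AMRMClient<ContainerRequest> rmClient;

    public ElasticWorkerPool(Configuration conf) {
        rmClient = AMRMClient.createAMRMClient();
        rmClient.init(conf);
        rmClient.start();
        // registerApplicationMaster(host, port, trackingUrl) would follow here.
    }

    /** Ask the ResourceManager for extra worker containers, e.g. for month-end load. */
    public void grow(int containers, int memoryMb, int vcores) {
        Resource capability = Resource.newInstance(memoryMb, vcores);
        Priority priority = Priority.newInstance(0);
        for (int i = 0; i < containers; i++) {
            rmClient.addContainerRequest(
                new ContainerRequest(capability, null, null, priority));
        }
    }

    /** Hand containers back once the peak period is over. */
    public void shrink(List<Container> idleWorkers) {
        for (Container c : idleWorkers) {
            rmClient.releaseAssignedContainer(c.getId());
        }
    }

    /** Heartbeat; newly granted containers come back in the allocate response. */
    public List<Container> heartbeat() throws Exception {
        return rmClient.allocate(0.1f).getAllocatedContainers();
    }
}
```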
We also got the benefit of using YARN for things like intelligent block replica placement. By default, the blocks that you write to HDFS get replicated, and that's fantastic for a database because you have that built-in failover capability: if you lose a disk, there's another copy of the data that you can recover from. But what we were able to do was to tell YARN, when we're writing block replicas, where we want them placed. If you're joining two tables and the tables are partitioned in such a way that the rows you're joining aren't going to be co-located, we say that we want to locate one of those replicas with the table that we are going to be joining it to. That way we are able to avoid any inter-node communication when doing complex query joins. YARN was hugely beneficial for doing something like that. So yeah, it's a double edged sword - it presents challenges for an enterprise data warehouse, but it also provides some capabilities that would be difficult if not impossible to implement otherwise, and we get those for free.
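In stock Hadoop, the hook for this kind of placement hint is exposed through HDFS rather than YARN itself: DistributedFileSystem has a create() overload that accepts a list of "favored nodes" for a new file's replicas. Actian has not published how Vortex does this internally, so treat the following as a minimal sketch of the general technique, with made-up hostnames, port and path.

```java
import java.net.InetSocketAddress;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;
import org.apache.hadoop.hdfs.DistributedFileSystem;

/**
 * Sketch: asking HDFS to place the replicas of a newly written file on
 * specific data nodes, so that a later join finds a local copy of the
 * partition it needs. Hostnames, port and path are illustrative only.
 */
public class CoLocatedWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumes fs.defaultFS points at an HDFS cluster.
        DistributedFileSystem dfs = (DistributedFileSystem) FileSystem.get(conf);

        // Favor the data nodes that already hold the matching partition of
        // the table we expect to join against.
        InetSocketAddress[] favoredNodes = {
            new InetSocketAddress("datanode1.example.com", 9866),
            new InetSocketAddress("datanode2.example.com", 9866)
        };

        try (FSDataOutputStream out = dfs.create(
                new Path("/warehouse/tableB/part-0"),
                FsPermission.getFileDefault(),
                true,                    // overwrite if the file exists
                4096,                    // buffer size
                (short) 3,               // replication factor
                128L * 1024 * 1024,      // block size
                null,                    // no progress callback
                favoredNodes)) {         // replica placement hint
            out.writeBytes("partition data goes here");
        }
    }
}
```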
Taking the example of a join operation, can you give us insights into the Vortex architecture?
As part of the platform we have built, we also have a visual workbench for building the data flows that take data into the Hadoop environment. It also gives a lot of capabilities around analytics. It is a simple point-and-click workbench where you drag some nodes onto the screen and can build rich data science applications. Box number 1 in the figure is the visual workbench. The first thing we are doing here is reading data from Hadoop - maybe it's log file or smart meter data or something like that. If we follow this read, it connects to our high-performance ingest, prep and load engine that's running on the name node in the cluster. That parallelizes the read of the source data - the 4 green cylinders on the data nodes are just a data file that we're reading using DataFlow; we read all the parts in parallel and bring them up into the name node in the cluster. At this point we may want to do some analytics and data science. We may want to do some ingest, prep and load as well, and that's the scenario we're going to talk about now, because that's where our SQL on Hadoop capability comes in again.
So we've read the data in parallel using our DataFlow product and now decide that we want to load it into the SQL in Hadoop (Vector) technology. If we follow the number 2, we're connected to the database master node. We have a master-worker architecture: the master node is responsible for preparing the query execution plan and then dividing up the workload across all of the data nodes that are participating in solving the query. The database that's running on the master node generates the query execution plan, and if you follow the blue line on the diagram, that then connects to what we call the X100 server. We call it X100 because we believe it to be a hundred times faster than anything else out there, and it is the heart of the SQL on Hadoop solution that runs on the data nodes. This X100 server writes data to disk - so if we look at the blue cylinders here, we are writing to the blue blocks on disk. We can run that X100 server on a subset of the nodes, and when you install the product you decide which data nodes are going to participate in your worker set. I mentioned already that you can grow and shrink this depending upon your business requirements.
In this case we're writing the data out to 4 nodes, and so we have 4 blue cylinders that illustrate the 4 partitions we are writing to. Written underneath them are the block replicas - these grey cylinders represent the fact that each of these four blocks is going to be replicated three times. For simplicity's sake, say we want to join all of the data that is in Data Vector A with Data Vector B. Rather than having to pull all of the data from Data Vector B over to the node where Data Vector A lives, we actually have a replica of that block that is local to us. That replica of B is right there and we are able to join with the data in it. HDFS has already guaranteed that it's an exact replica of the data held in Vector B, so we are able to join with it and remove the need to communicate between the nodes.
We are talking about big data, and each of these tables could represent billions or hundreds of billions of rows. We want to avoid having to move data around the cluster if at all possible. Using the block replicas to perform these joins is something that really yields great performance results. When we showed you earlier that slide where we compared performance with Impala, that's one of the reasons. If you look again at the slide, for the queries where we are over 30 times faster, Impala spends a lot of time moving data around between nodes. We can out-perform them because of other architectural benefits that we have, but not having to move data around for complex joins is something that brings real benefits in complex queries.
When you create the tables and you define your primary key / foreign key relationships between them, we use that as a hint for partitioning the data and for working out the location of the block replicas - for figuring out where we want to locate those replicas so that we benefit from a performance perspective. The other thing that influences placement of the block replicas is making sure that, should we lose a node in the environment, our performance doesn't drop off a cliff. The loss of one node will impact performance because you're dealing with 3 nodes instead of 4, and you expect performance to be impacted in that scenario by about 25 percent. But you don't want all of the joins you do subsequently to become remote joins, and that also comes into play in determining how we place replicas. That's how we perform the joins.
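To make the hint concrete, here is a minimal sketch of the kind of PK/FK declarations a client might issue over JDBC. The table names, columns and JDBC URL are invented for illustration; Vector's exact connection string and DDL options may differ, so check Actian's documentation.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

/**
 * Illustrative only: declared PK/FK relationships give the engine a hint
 * about which rows will be joined, and therefore which partitions and
 * block replicas are worth co-locating. Names and URL are made up.
 */
public class CreateJoinedTables {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:ingres://masternode:VW7/meterdb", "dbuser", "secret");
             Statement stmt = conn.createStatement()) {

            // Dimension table: one row per smart meter.
            stmt.executeUpdate(
                "CREATE TABLE meter (" +
                "  meter_id BIGINT NOT NULL PRIMARY KEY," +
                "  region   VARCHAR(32))");

            // Fact table: the foreign key back to meter identifies the join
            // relationship the engine can use when placing replicas.
            stmt.executeUpdate(
                "CREATE TABLE reading (" +
                "  meter_id BIGINT NOT NULL REFERENCES meter(meter_id)," +
                "  read_at  TIMESTAMP NOT NULL," +
                "  kwh      DECIMAL(10,3))");
        }
    }
}
```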
If we look at the architecture diagram, the 4th number, in black, represents standard SQL applications or BI tools that can just connect directly to the master node; you don't need to make any changes to those applications. Without changing a line of code, we can provide a scalable solution. We have customers that were previously on our Vector solution and have now moved to Vector in Hadoop, and they are seeing better than linear scalability. For our customers, not having to make any changes to the application and still getting that scalability is very exciting.
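The point about unchanged applications is simply that the master node looks like an ordinary SQL endpoint. A hedged sketch of what such a client could look like follows; the JDBC URL format is assumed and the query reuses the invented schema from the previous sketch, so nothing here should be read as Actian's documented connection details.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Timestamp;

/**
 * Illustrative only: an ordinary JDBC client querying the master node.
 * Nothing Hadoop-specific appears in the application code; the URL and
 * query are invented for this sketch.
 */
public class ReportQuery {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:ingres://masternode:VW7/meterdb", "report", "secret");
             PreparedStatement ps = conn.prepareStatement(
                     "SELECT region, SUM(kwh) AS total_kwh " +
                     "FROM reading r JOIN meter m ON r.meter_id = m.meter_id " +
                     "WHERE r.read_at >= ? GROUP BY region")) {
            ps.setTimestamp(1, Timestamp.valueOf("2015-01-01 00:00:00"));
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getString("region") + "\t"
                            + rs.getBigDecimal("total_kwh"));
                }
            }
        }
    }
}
```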
Is Vortex metadata available to agents outside Vortex?
Today, you can access the metadata through JDBC/ODBC and some of the APIs, but it is not available as part of HCatalog. In the release coming in the first quarter of 2016, we will actually be using HCatalog for storing our metadata, so it will be externally accessible through more than just the standard database APIs. As part of that release, we will also be adding external table support, which will enable you to take any other data source and just register it as a table within the database. We will be able to join data that's held in the Actian SQL in Hadoop solution with any other data in the environment. Today, if you want to join our data with Parquet data, you need to actually load the Parquet data into the environment using the DataFlow solution. But in the release we have coming out in the first quarter of next year, you can leave that data in Parquet. We prefer that you import the data into our solution because performance is a lot greater, but we do recognize that a lot of people have standardized on Parquet for data storage. (So we will provide) the ability to register those Parquet files as tables within the database and then just treat them as Vector tables.
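Actian had not published the external-table syntax for that release at the time of this interview, but the idea is familiar from other SQL on Hadoop engines. As a purely illustrative sketch, here is Hive-style DDL (not Actian's syntax) for registering a Parquet directory as a table and joining it with a native table; every name and path below is made up.

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;

/**
 * Illustrative only: registering an external Parquet data set as a table and
 * joining it with a native table. The DDL uses Hive-style syntax to show the
 * concept; Actian's actual external-table syntax may differ.
 */
public class ExternalParquetJoin {
    static void registerAndJoin(Connection conn) throws Exception {
        try (Statement stmt = conn.createStatement()) {
            // Register the Parquet files in place, without importing them.
            stmt.executeUpdate(
                "CREATE EXTERNAL TABLE clickstream (" +
                "  meter_id BIGINT, event_time TIMESTAMP, url VARCHAR(2048)) " +
                "STORED AS PARQUET " +
                "LOCATION '/data/clickstream'");

            // Join the external data with a native table.
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT m.region, COUNT(*) AS hits " +
                    "FROM clickstream c JOIN meter m ON c.meter_id = m.meter_id " +
                    "GROUP BY m.region")) {
                while (rs.next()) {
                    System.out.println(rs.getString("region") + "\t" + rs.getLong("hits"));
                }
            }
        }
    }
}
```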
How does Vortex support window functions and subqueries?
We have full subquery and window function support - we have CUBE, ROLLUP, GROUPING SETS - all of the advanced analytics capabilities that are called out in the SQL language spec have been incorporated within the product. They have been there for a number of years.
The Vector on Hadoop technology began life as Vector, an off-Hadoop product, almost a decade ago, so it has been around for some time. It is an analytics engine, and people take off-the-shelf tools like Cognos, MicroStrategy, Informatica, Tableau and so on and run them against the database. They have requirements like window framing, multiple window functions, grouping sets and so on - all of those are built into the product. The SQL language committee does not call out a specific analytic language set per se, so it is not like you can say "Oh look, I'm SQL 2015 analytics compliant" - that's not the way the language is defined. But we do have all of the analytics capabilities that are described by both the ANSI and the ISO SQL committees.
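These features are standard ANSI SQL, so they can be exercised from any JDBC client. Below is a small sketch combining a grouping set and a framed window function; the table and columns come from the earlier invented meter/reading schema, not from the interview.

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;

/**
 * Illustrative only: standard ANSI SQL analytics - GROUPING SETS plus a
 * framed window function - run against the assumed meter/reading schema
 * used in the earlier sketches.
 */
public class AnalyticsExamples {
    static void run(Connection conn) throws Exception {
        try (Statement stmt = conn.createStatement()) {
            // Sub-totals by region, by meter, and a grand total in one pass.
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT m.region, r.meter_id, SUM(r.kwh) AS kwh " +
                    "FROM reading r JOIN meter m ON r.meter_id = m.meter_id " +
                    "GROUP BY GROUPING SETS ((m.region), (r.meter_id), ())")) {
                while (rs.next()) { /* consume sub-totals */ }
            }

            // 7-row moving average of consumption per meter.
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT meter_id, read_at, " +
                    "       AVG(kwh) OVER (PARTITION BY meter_id ORDER BY read_at " +
                    "           ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS avg_kwh " +
                    "FROM reading")) {
                while (rs.next()) { /* consume moving averages */ }
            }
        }
    }
}
```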
Is Vortex truly open source or is it just the API that is open?
Vortex is not open source, but we are in the process of publishing a set of libraries that will enable other products to access data that's been written to disk by Vector, without needing to go through the Vector engine. Projects that could benefit from storing data in the Vector format can use that, and products that want to read the data directly without having to go through Vector can do so as well. This is something that was raised by a couple of our early prospects who didn't want to be tied into a proprietary solution on Hadoop. We did not want to open source Vector on Hadoop because what we do is quite unique; there's a lot of IP in there that we don't want to disclose at this point in time. But we do want people to be able to store data in our format and read and write data to our file format, so we're publishing a set of APIs that will enable people to do that without buying or using the product.