Impala File Formats
After you start Impala, use the impala-shell or a JDBC or ODBC client to query data. You can query data stored in files, as well as data stored in HBase and MapR-DB tables. Impala depends on the Hive metastore to track table metadata. MapR-FS tracks the metadata of other files. Impala supports Text and Parquet file formats. If you want to query data using SequenceFile, RCFile, and Avro file formats, use Hive to load the data. Impala supports Snappy, GZIP, Deflate, and BZIP compression codecs.
The following table summarizes the supported Impala text formats:
|File Type||Format||Compression Codecs||Can Impala Create?||Can Impala INSERT?|
|Parquet||Structured||Snappy (default), GZIP||Yes||Yes. CREATE TABLE, INSERT, and query.|
|Text||Unstructured||Snappy, GZIP, BZIP||Yes, for CREATE TABLE with no STORED AS clause;|
default file format is uncompressed text with values separated
by ASCII 0x01 characters, typically represented a Ctrl-A
|Yes. CREATE TABLE, INSERT, and query.|
|Avro||Structured||Snappy, GZIP, deflate, BZIP2||No, create using Hive.||No. Query only. Load data using Hive.|
|RCFile||Structured||Snappy, GZIP, deflate, BZIP2||Yes||No. Query only. Load data using Hive.|
|SequenceFile||Structured||Snappy, GZIP, deflate, BZIP2||Yes||No. Query only. Load data using Hive.|
Impala SQL Dialect
Impala uses the SQL query language and is compatible with the Hive query language, HiveQL. You can use other languages, such as Java, to interact with Impala through ODBC and JDBC interfaces. The Impala SQL dialect supports a subset of SQL and HiveQL functions, statements, datatypes, operators, and built-in functions.
The Impala SQL dialect supports DML statements similar to the DML component of HiveQL. The Impala SQL dialect does not support UPDATE and DELETE statements, and it does not support the INSERT…VALUES syntax to insert a single row.
Refer to Supported and Unsupported SQL/HiveQL Language Features for a list of supported and unsupported functions, statements, datatypes, operators, and features.
Example: Running an Impala SQL Query
In this example scenario, download a customer CSV file and use the Hive shell to create a table and import customer data into the table and then run an Impala query on the table.
- Download the following CSV file to
Issue the following command from the hive-shell to import the CSV file and create a table:
Issue the following command in the hive-shell to load the customer data into the customers table:
- Issue the following command to start the Impala shell:
To connect to an instance of Impala, issue the following
impalad-hostwith the host name you have configured on a node running Impala:
- Issue the following command to query the data to find the total number of customers:
select count(*) from customers
The query returns the following result:
| count(*) |
| 501 |
Query MapR-DB and HBase Tables with Impala
You can use Impala to query data in MapR-DB and HBase tables. To query data in a MapR-DB or HBase table, create an external table in the Hive shell and then map the Hive table to the corresponding MapR-DB or HBase table. You can map a MapR-DB or HBase table to a Hive table with or without string row keys. When you create an external table in Hive, use the HBaseStorageHandler clause in the Hive CREATE TABLE statement to allow Hive to access data stored in the MapR-DB and HBase table. The HBaseStorageHandler has two important properties:
hbase.columns.mapping– this property specifies the Hive column to column family mapping
hbase.table.name– this can be the absolute path of the table or just the table name, depending on whether the table path is mapped
For more information about mapping processes, refer to Mapping Table Namespace Between Apache HBase Tables and MapR-DB Tables and Hive/HBaseIntegration.
Once you have mapped the MapR-DB or HBase table to Hive, you can query or insert into the table from Impala because Hive and Impala share the metastore database. Impala nodes cache table metadata if a table contains a large amount of data or has many partitions. Caching the metadata reduces runtime for future queries on the table.
If you insert new data into a Hive, MapR-DB, or HBase table, use the Impala shell to issue the REFRESH statement to refresh the data location cache. The REFRESH command only applies to the node that the Impala shell is connected to. If you route all SQL statements to the same node, you do not have to issue regular REFRESH statements when table data is updated on other nodes.
If you create, drop, or alter any external tables or databases, use the Impala shell to issue the INVALIDATE METADATA statement to refresh table structure metadata.
An Impala user must have read/write privileges for an HBase table. To grant permission to a user, issue the GRANT command to the user from the HBase shell. Refer to http://hbase.apache.org/book/hbase.accesscontrol.configuration.html for more information.
Example: Running an Impala Query on HBase/MapR-DB Tables
In this example scenario, a professor wants to know how many times a student clicks on Google from his webpage. He wants to use Impala to query the data in MapR-DB. One of his students offered to load the data into MapR-DB so he can access it. In order to complete the professor’s request, the student must use the HBase shell to create two MapR-DB tables that contain the following schema and then put data in the tables:
- account – id, name
- address – street, zipcode, state
- clickinfo – clickid, studentid, url, time
- iteminfo – itemid, quantity
Each bullet corresponds to a column family with a list of columns. In order to access the tables using Impala, the student must create external tables in Hive with mapped columns that match the MapR-DB columns.
If you would like to be the student, you can perform the following steps to help the professor:
Use the HBase shell to create two tables in MapR-DB: “student” and “clicks”. To create the tables, issue the following commands:
hadoop fs –lscommand on the table location to verify that the tables exist.
- Create external tables in Hive with the appropriate column mapping for the “student” and “clicks” tables using a string row key. Remember the two important properties for HBaseStorageHandler:
hbase.columns.mapping, which specifies Hive column to column family mapping.
hbase.table.name, which can be the absolute path of the table or just the table name, depending on whether the table path is mapped.
To create the external tables in Hive, run the following commands:
testdata.txtfile with the following content to add data into the “students” and “clicks” MapR-DB tables.
Press Control + Z to finish editing the file and pipe these commands to the HBase shell to insert the test data:
Scan the tables to verify that the data was inserted correctly. Run the following commands to perform the scan:
Use Hive to verify that the data was inserted into the tables. Issue the
SELECTstatement against students and clicks to verify the count in each table.
Since the Impala shell was running when you inserted the data, verify that the metadata is refreshed to make sure that Impala is aware of the new tables created.
From the Impala shell , issue the
INVALIDATE METADATAstatement to refresh the metadata.
To query the tables and to find out which students clicked on
google.com, run the following command from the Impala shell: