Channel: Active questions tagged consumer - Stack Overflow

Dataproc Serverless slow to consume Kafka topic


I use Dataproc Serverless with the Java API to read a Kafka topic. The topic has only 2 partitions and receives about 80 msg/sec. After reading the messages I repartition to 10, transform the data, then write to BigQuery. Because of the repartition I see two stages: one with 2 tasks that reads the data, and a second one with 10 tasks that transforms and writes it. With maxOffsetsPerTrigger=500 and minOffsetsPerTrigger=100, the read tasks vary a lot in duration, between 7 sec and 1 min 50 sec, while the transform-and-write tasks take around 10 sec. Any idea why reading the data takes so long, and how to optimize the code?
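As a sanity check, here is a back-of-envelope calculation (assuming the ~80 msg/sec rate stated above) of how long the topic needs to accumulate enough records for one micro-batch under these trigger settings. Note the fill times are far below the observed 7 sec to 1 min 50 read-task durations, which suggests the time is not spent waiting for data alone:

```java
// Back-of-envelope micro-batch sizing, using the rates and settings from the question.
public class TriggerSizing {
    public static void main(String[] args) {
        double msgPerSec = 80.0;              // assumed ingest rate from the question
        int maxOffsetsPerTrigger = 500;       // upper bound on records per micro-batch
        int minOffsetsPerTrigger = 100;       // batch waits until this many records exist

        // Time for the topic to accumulate the min / max number of records:
        double secToReachMin = minOffsetsPerTrigger / msgPerSec;
        double secToReachMax = maxOffsetsPerTrigger / msgPerSec;

        System.out.printf("time to fill min offsets: %.2f s%n", secToReachMin);
        System.out.printf("time to fill max offsets: %.2f s%n", secToReachMax);

        // With only 2 Kafka partitions, at most 2 read tasks share each batch:
        System.out.println("records per read task (max): " + (maxOffsetsPerTrigger / 2));
    }
}
```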

Dataset<Row> dfr = spark
        .readStream()
        .format("org.apache.spark.sql.kafka010.KafkaSourceProvider")
        .option("kafka.bootstrap.servers", kafkaServers)
        .option("kafka.sasl.kerberos.service.name", "kafka")
        .option("kafka.sasl.mechanism", "GSSAPI")
        .option("kafka.security.protocol", "SASL_SSL")
        .option("kafka.ssl.truststore.location", trustStoreName)
        .option("kafka.ssl.truststore.password", truststorePassword)
        .option("kafka.ssl.truststore.type", "JKS")
        .option("startingOffsets", "latest")
        .option("kafka.max.partition.fetch.bytes", "209715200")  // 200MB per partition
        .option("kafka.fetch.max.bytes", "1048576000")           // 1000MB total
        .option("subscribe", kafkaTopic)
        .option("maxOffsetsPerTrigger", maxOffsets)
        .option("minOffsetsPerTrigger", minOffsets)
        .option("failOnDataLoss", "false")
        .option("kafka.request.timeout.ms", 300000)
        .option("kafka.session.timeout.ms", 60000)
        .load();

Dataset<Row> dfr2 = dfr.selectExpr(
                "CAST(topic as STRING) as topic",
                "CAST(key AS STRING) AS key",
                "CAST(value AS STRING) AS xml",
                "timestamp", "partition", "offset")
        .repartition(10);

StructType outSchema = new StructType()
        .add("key", DataTypes.StringType)
        .add("topic", DataTypes.StringType)
        .add("partition", DataTypes.IntegerType)
        .add("offset", DataTypes.LongType)
        .add("JSON_COL", DataTypes.StringType)
        .add("DAT_MAJ_DWH", DataTypes.StringType);

// Create proper encoder - cast the result to Encoder<Row>
Encoder<Row> encoder = Encoders.row(outSchema);

Dataset<Row> jsonified = dfr2.mapPartitions(
        (MapPartitionsFunction<Row, Row>) (Iterator<Row> it) -> {
            List<Row> out = new ArrayList<>();
            DocumentBuilder builder = XML_BUILDER.get();
            while (it.hasNext()) {
                Row r = it.next();
                String topic = r.getString(0);
                String key = r.getString(1);
                String xml = r.getString(2);
                int part = r.getInt(4);
                long offset = r.getLong(5);
                String json = null;
                try {
                    builder.reset();
                    // Pass the XML string, not the Document
                    json = BusMessageXmlJson.toJson(xml);
                } catch (Exception ex) {
                    System.err.println("XML parse error partition=" + part
                            + " offset=" + offset + " msg=" + ex.getMessage());
                }
                String ts = r.getTimestamp(3).toInstant().toString();
                out.add(RowFactory.create(key, topic, part, offset, json, ts));
            }
            return out.iterator();
        },
        encoder
).withColumn("DAT_MAJ_DWH",
        date_format(to_timestamp(col("DAT_MAJ_DWH")), "yyyy-MM-dd'T'HH:mm:ss.SSSSSS")
).select("key", "topic", "partition", "offset", "JSON_COL", "DAT_MAJ_DWH");

StreamingQuery query = jsonified
        .writeStream()
        .queryName("spark-sdh-ndc-streaming-query")
        .foreachBatch((batchDF, batchId) -> {
            // Write this batch to BigQuery using batch API
            batchDF.select("JSON_COL", "DAT_MAJ_DWH").write()
                    .format("bigquery")
                    .option("temporaryGcsBucket", tempBucket)
                    .option("table", bigQueryTable)
                    .option("createDisposition", "CREATE_IF_NEEDED")
                    .option("intermediateFormat", "avro")
                    .option("writeMethod", "indirect")
                    .option("allowFieldAddition", "true")
                    .option("allowFieldRelaxation", "true")
                    .mode(SaveMode.Append)
                    .save();
            // 2. Commit offsets to Kafka AFTER successful write
            commitOffsetsToKafka(batchDF, kafkaServers, trustStoreName,
                    truststorePassword, consumerGroup);
            System.out.println("Batch " + batchId + " written successfully");
        })
        .option("checkpointLocation", checkpointPath)
        .trigger(Trigger.ProcessingTime(triggerInterval))
        .start();

System.out.println("Streaming query started successfully!");
System.out.println("Query ID: " + query.id());
System.out.println("Waiting for termination... (Ctrl+C to stop)");
query.awaitTermination();


